<< Chapter < Page Chapter >> Page >

Gprof’s flat profile

As we mentioned previously, gprof also produces a timing profile (also called a “flat” profile, just to confuse things) similar to the one produced by prof . A few of the fields are different from prof , and there is some extra information, so it will help if we explain it briefly. The following example shows the first few lines from a gprof flat profile for stuff . You will recognize the top three routines from the original program. The others are library functions included at link-time.


% cumulative self self total time seconds seconds calls ms/call ms/call name39.9 3.23 3.23 2 1615.07 1615.07 _c_ [6] 39.9 6.46 3.23 1 3230.14 4845.20 _b_ [4]20.0 8.08 1.62 1 1620.07 3235.14 _a_ [5] 0.1 8.09 0.01 3 3.33 3.33 _ioctl [9]0.0 8.09 0.00 64 0.00 0.00 .rem [12] 0.0 8.09 0.00 64 0.00 0.00 _f_clos [177]0.0 8.09 0.00 20 0.00 0.00 _sigblock [178] ... .... .... . . . ......

Here’s what each column means:

  • %time Again, we see a field that describes the runtime for each routine as a percent- age of the overall time taken by the program. As you might expect, all the entries in this column should total 100% (nearly).
  • cumulative seconds For any given routine, the column called “cumulative seconds” tallies a running sum of the time taken by all the preceding routines plus its own time. As you scan towards the bottom, the numbers asymptotically approach the total runtime for the program.
  • self seconds Each routine’s individual contribution to the runtime.
  • calls The number of times this particular routine was called.
  • self ms/call Seconds spent inside the routine, divided by the number of calls. This gives the average length of time taken by each invocation of the routine. The figure is presented in milliseconds.
  • total ms/call Seconds spent inside the routine plus its descendants, divided by the number of calls.
  • name The name of the routine. Notice that the cross-reference number appears here too.

Accumulating the results of several gprof runs

It is possible to accumulate statistics from multiple runs so that you can get a picture of how a program is doing with a variety of data sets. For instance, say that you wanted to profile an application — call it bar — with three different sets of input data. You could perform the runs separately, saving the gmon.out files as you go, and then combine the results into a single profile at the end:


% f77 -pg bar.f -o bar % bar<data1.input % mv gmon.out gmon.1% bar<data2.input % mv gmon.out gmon.2% bar<data3.input % gprof bar -s gmon.1 gmon.2 gmon.out>gprof.summary.out

In the example profile, each run along the way creates a new gmon.out file that we renamed to make room for the next one. At the end, gprof combines the infor- mation from each of the data files to produce a summary profile of bar in the file gprof.summary.out . Additionally (you don’t see it here), gprof creates a file named gmon.sum that contains the merged data from the original three data files. gmon.sum has the same format as gmon.out , so you can use it as input for other merged profiles down the road.

In form, the output from a merged profile looks exactly the same as for an individual run. There are a couple of interesting things you will note, however. For one thing, the main routine appears to have been invoked more than once — one time for each run, in fact. Furthermore, depending on the application, multiple runs tend to either smooth the contour of the profile or exaggerate its features. You can imagine how this might happen. If a single routine is consistently called while others come and go as the input data changes, it takes on increasing importance in your tuning efforts.

A few words about accuracy

For processors running at 600 MHz and more, the time between 60 Hz and 100 Hz samples is a veritable eternity. Furthermore, you can experience quantization errors when the sampling frequency is fixed, as is true of steady 1/100th or 1/60th of a second samples. To take an exaggerated example, assume that the timeline in [link] shows alternating calls to two subroutines, BAR and FOO . The tick marks represent the sample points for profiling.

Quantization errors in profiling

This figure is a line of alternating grey and dark grey sections. The grey sections are wider, and are labeled ..bar.., and the dark grey sections are labeled .foo.

BAR and FOO take turns running. In fact, BAR takes more time than FOO . But because the sampling interval closely matches the frequency at which the two subroutines alternate, we get a quantizing error: most of the samples happen to be taken while FOO is running. Therefore, the profile tells us that FOO took more CPU time than BAR .

We have described the tried and true UNIX subroutine profilers that have been available for years. In many cases, vendors have much better tools available for the asking or for a fee. If you are doing some serious tuning, ask your vendor representative to look into other tools for you.

Get Jobilize Job Search Mobile App in your pocket Now!

Get it on Google Play Download on the App Store Now




Source:  OpenStax, High performance computing. OpenStax CNX. Aug 25, 2010 Download for free at http://cnx.org/content/col11136/1.5
Google Play and the Google Play logo are trademarks of Google Inc.

Notification Switch

Would you like to follow the 'High performance computing' conversation and receive update notifications?

Ask