As we mentioned previously, gprof also produces a timing profile (also called a “flat” profile, just to confuse things) similar to the one produced by prof . A few of the fields are different from prof , and there is some extra information, so it will help if we explain it briefly. The following example shows the first few lines from a gprof flat profile for stuff . You will recognize the top three routines from the original program. The others are library functions included at link-time.
  %   cumulative    self              self     total
 time    seconds   seconds   calls  ms/call  ms/call  name
 39.9       3.23      3.23       2  1615.07  1615.07  _c_ [6]
 39.9       6.46      3.23       1  3230.14  4845.20  _b_ [4]
 20.0       8.08      1.62       1  1620.07  3235.14  _a_ [5]
  0.1       8.09      0.01       3     3.33     3.33  _ioctl [9]
  0.0       8.09      0.00      64     0.00     0.00  .rem [12]
  0.0       8.09      0.00      64     0.00     0.00  _f_clos [177]
  0.0       8.09      0.00      20     0.00     0.00  _sigblock [178]
  ...       ....      ....     ...      ...      ...  ...
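Before walking through the columns, here is roughly how a profile like the one above is produced: compile with profiling enabled (-pg), run the program, and hand the executable and the resulting gmon.out file to gprof. The sketch below is a loose C analogue, not the original stuff program (which is not reproduced here); the function names, loop bounds, and timings are placeholders chosen only to mimic the call pattern suggested by the calls and total ms/call figures in the table.

/* profile_sketch.c -- a rough stand-in for a program like "stuff".
 * Build and profile (assuming a gprof-capable toolchain):
 *
 *     % cc -pg profile_sketch.c -o stuff
 *     % stuff
 *     % gprof stuff gmon.out > profile.out
 *
 * Call pattern: main calls a and b once each, and each of them calls c
 * once, so c is charged two calls, as in the flat profile above.
 */
#include <stdio.h>

#define N 200000000L   /* placeholder work size; adjust for your machine */

static double c(void)            /* called twice: once from a, once from b */
{
    double sum = 0.0;
    for (long i = 0; i < N; i++)
        sum += (double)i * 0.5;
    return sum;
}

static double a(void)            /* does some work of its own, then calls c */
{
    double sum = 0.0;
    for (long i = 0; i < N; i++)
        sum += (double)i * 0.25;
    return sum + c();
}

static double b(void)            /* about twice a's own work, plus one call to c */
{
    double sum = 0.0;
    for (long i = 0; i < 2 * N; i++)
        sum += (double)i * 0.25;
    return sum + c();
}

int main(void)
{
    printf("%g\n", a() + b());   /* print the result so the work isn't optimized away */
    return 0;
}

Note that -pg has to be present when you link as well as when you compile; if you build in separate steps, pass it to both commands.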
Here’s what each column means:
%time
Again, we see a field that describes the runtime for each routine as a percentage of the overall time taken by the program. As you might expect, all the entries in this column should total 100% (nearly).

cumulative seconds
For any given routine, the column called “cumulative seconds” tallies a running sum of the time taken by all the preceding routines plus its own time. As you scan towards the bottom, the numbers asymptotically approach the total runtime for the program.

self seconds
Each routine’s individual contribution to the runtime.

calls
The number of times this particular routine was called.

self ms/call
Seconds spent inside the routine, divided by the number of calls. This gives the average length of time taken by each invocation of the routine. The figure is presented in milliseconds.

total ms/call
Seconds spent inside the routine plus its descendants, divided by the number of calls.

name
The name of the routine. Notice that the cross-reference number appears here too.

It is possible to accumulate statistics from multiple runs so that you can get a picture of how a program is doing with a variety of data sets. For instance, say that you wanted to profile an application, call it bar, with three different sets of input data. You could perform the runs separately, saving the gmon.out files as you go, and then combine the results into a single profile at the end:
% f77 -pg bar.f -o bar
% bar<data1.input
% mv gmon.out gmon.1
% bar<data2.input
% mv gmon.out gmon.2
% bar<data3.input
% gprof bar -s gmon.1 gmon.2 gmon.out>gprof.summary.out
In the example profile, each run along the way creates a new gmon.out file that we renamed to make room for the next one. At the end, gprof combines the information from each of the data files to produce a summary profile of bar in the file gprof.summary.out. Additionally (you don’t see it here), gprof creates a file named gmon.sum that contains the merged data from the original three data files. gmon.sum has the same format as gmon.out, so you can use it as input for other merged profiles down the road.
In form, the output from a merged profile looks exactly the same as for an individual run. There are a couple of interesting things you will note, however. For one thing, the main routine appears to have been invoked more than once: one time for each run, in fact. Furthermore, depending on the application, multiple runs tend to either smooth the contour of the profile or exaggerate its features. You can imagine how this might happen. If a single routine is consistently called while others come and go as the input data changes, it takes on increasing importance in your tuning efforts.
For processors running at 600 MHz and more, the time between 60 Hz and 100 Hz samples is a veritable eternity. Furthermore, you can experience quantization errors when the sampling frequency is fixed, as is true of steady 1/100th or 1/60th of a second samples. To take an exaggerated example, assume that the timeline in [link] shows alternating calls to two subroutines, BAR and FOO. The tick marks represent the sample points for profiling. BAR and FOO take turns running. In fact, BAR takes more time than FOO. But because the sampling interval closely matches the frequency at which the two subroutines alternate, we get a quantizing error: most of the samples happen to be taken while FOO is running. Therefore, the profile tells us that FOO took more CPU time than BAR.
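You can reproduce this kind of quantization error without a profiler at all. The following sketch is a toy model, not gprof itself; the 12 ms/8 ms durations, the 20 ms sampling interval, and the names are all invented for illustration. It lays two alternating routines out on a timeline and counts which one a fixed-interval sampler happens to land on.

/* sampling_alias.c -- toy model of profiler quantization error.
 * Two "routines" alternate on a timeline: BAR runs for 12 ms, then FOO
 * for 8 ms, over and over.  A sampler fires every 20 ms, which happens
 * to equal the BAR+FOO period, so every tick lands at the same phase.
 */
#include <stdio.h>

#define BAR_MS        12.0     /* BAR's share of each cycle          */
#define FOO_MS         8.0     /* FOO's share of each cycle          */
#define SAMPLE_MS     20.0     /* fixed sampling interval            */
#define FIRST_SAMPLE  14.0     /* phase of the first tick (in FOO)   */
#define RUN_MS     10000.0     /* total simulated runtime            */

int main(void)
{
    double cycle = BAR_MS + FOO_MS;
    long bar_hits = 0, foo_hits = 0;

    /* walk the sampler down the timeline and charge each tick to
       whichever routine is running at that instant */
    for (double t = FIRST_SAMPLE; t < RUN_MS; t += SAMPLE_MS) {
        double phase = t - cycle * (long)(t / cycle);   /* t mod cycle */
        if (phase < BAR_MS)
            bar_hits++;
        else
            foo_hits++;
    }

    printf("actual CPU share:  BAR %.0f%%  FOO %.0f%%\n",
           100.0 * BAR_MS / cycle, 100.0 * FOO_MS / cycle);
    printf("sampled profile:   BAR %ld hits  FOO %ld hits\n",
           bar_hits, foo_hits);
    return 0;
}

With the interval exactly equal to the BAR+FOO period, every tick lands at the same phase, so FOO collects every hit even though BAR does 60 percent of the work. Change SAMPLE_MS to something that doesn’t divide the 20 ms cycle evenly (19, say) and the hit counts drift back toward the true 60/40 split.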
We have described the tried and true UNIX subroutine profilers that have been available for years. In many cases, vendors have much better tools available for the asking or for a fee. If you are doing some serious tuning, ask your vendor representative to look into other tools for you.