



ggpprrooff ---- AA CCaallll GGrraapphh EExxeeccuuttiioonn PPrrooffiilleerr            PPSSDD::1188--11


ggpprrooff:: aa CCaallll GGrraapphh EExxeeccuuttiioonn PPrrooffiilleerr1

by _S_u_s_a_n _L_. _G_r_a_h_a_m _P_e_t_e_r _B_. _K_e_s_s_l_e_r _M_a_r_s_h_a_l_l _K_. _M_c_K_u_s_i_c_k

Computer Science Division
Electrical Engineering and Computer Science Department
University of California, Berkeley
Berkeley, California 94720





AAbbssttrraacctt

     Large complex programs are composed of many small  rou-
tines that implement abstractions for the routines that call
them.  To be useful, an execution  profiler  must  attribute
execution  time in a way that is significant for the logical
structure of a program as well as for its textual decomposi-
tion.   This  data  must  then be displayed to the user in a
convenient and informative way.  The ggpprrooff profiler accounts
for  the running time of called routines in the running time
of the routines that call them.  The design and use of  this
profiler is described.

11..  PPrrooggrraammss ttoo bbee PPrrooffiilleedd

     Software  research  environments  normally include many
large programs both for  production use and for experimental
investigation.   These  programs  are  typically modular, in
accordance with generally accepted principles of  good  pro-
gram  design.  Often they consist of numerous small routines
that implement various abstractions.  Sometimes  such  large
programs  are  written  by one programmer who has understood
the requirements for these abstractions, and has  programmed
them  appropriately.   More  frequently  the program has had
multiple authors and has evolved  over  time,  changing  the
demands  placed  on  the  implementation of the abstractions
without changing the implementation  itself.   Finally,  the
program  may  be  assembled  from  a  library of abstraction
implementations unexamined by the programmer.

     Once a large program is executable, it is often  desir-
able  to increase its speed, especially if small portions of
the program are found to dominate its execution  time.   The
purpose  of  the  ggpprrooff  profiling  tool is to help the user
evaluate alternative implementations  of  abstractions.   We
developed  this tool in response to our efforts to improve a
code generator we were writing [Graham82].

____________________
[1] This work was supported by grant MCS80-05144 from the
National Science Foundation.

     The ggpprrooff design takes advantage of the fact  that  the
programs to be measured are large, structured and hierarchi-
cal.  We provide a profile in which the execution time for a
set  of  routines that implement an abstraction is collected
and charged to that abstraction.  The profile can be used to
compare and assess the costs of various implementations.

     The profiler can be linked into a program without
special planning by the programmer.  The overhead for using
gprof is low, both in added execution time and in the volume
of profiling information recorded.

22..  TTyyppeess ooff PPrrooffiilliinngg

     There are several different uses for program  profiles,
and  each  may  require  different information from the pro-
files, or different presentation  of  the  information.   We
distinguish  two  broad  categories  of profiles: those that
present counts of  statement  or  routine  invocations,  and
those  that  display  timing information about statements or
routines.  Counts are typically presented in  tabular  form,
often in parallel with a listing of the source code.  Timing
information could be similarly presented, but more than one
measure  of  time might be associated with each statement or
routine.  For example, in the framework used by  ggpprrooff  each
profiled  segment  would display two times: one for the time
used by the segment itself, and another for the time  inher-
ited from code segments it invokes.

     Execution  counts  are used in many different contexts.
The exact number of times a routine or  statement  is  acti-
vated can be used to determine if an algorithm is performing
as expected.  Cursory inspection of such counters  may  show
algorithms whose complexity is unsuited to the task at hand.
Careful  interpretation  of  counters  can   often   suggest
improvements  to acceptable algorithms.  Precise examination
can uncover subtle errors in an algorithm.  At  this  level,
profiling counters are similar to debugging statements whose
purpose is to show the number of times a piece  of  code  is
executed.   Another view of such counters is as boolean val-
ues.  One may be interested that a portion of code has  exe-
cuted  at  all, for exhaustive testing, or to check that one
implementation of an abstraction completely replaces a  pre-
vious one.

     Execution  counts  are  not necessarily proportional to
the amount of time required to execute the routine or state-
ment.   Further, the execution time of a routine will not be
the same for all calls on the  routine.   The  criteria  for
establishing  execution  time must be decided.  If a routine
implements an abstraction by invoking other abstractions,
the time spent in the routine will not accurately reflect
the time required by the abstraction it  implements.   Simi-
larly,  if an abstraction is implemented by several routines
the time required by the  abstraction  will  be  distributed
across those routines.

     Given the execution time of individual routines, gprof
accounts to each routine the time spent for it by the
routines it invokes.  This accounting is done by assembling
a call graph with nodes that are the routines of the program
and directed arcs that represent calls from call sites to
routines.  We distinguish among three different call graphs
for a program.  The complete call graph incorporates all
routines and all potential arcs, including arcs that
represent calls to functional parameters or functional
variables.  This graph contains the other two graphs as
subgraphs.  The static call graph includes all routines and
all possible arcs that are not calls to functional
parameters or variables.  The dynamic call graph includes
only those routines and arcs traversed by the profiled
execution of the program.  This graph need not include all
routines, nor need it include all potential arcs between the
routines it covers.  It may, however, include arcs to
functional parameters or variables that the static call
graph may omit.  The static call graph can be determined
from the (static) program text.  The dynamic call graph is
determined only by profiling an execution of the program.
The complete call graph for a monolithic program could be
determined by data flow analysis techniques.  The complete
call graph for programs that change during execution, by
modifying themselves or dynamically loading or overlaying
code, may never be determinable.  Both the static call graph
and the dynamic call graph are used by gprof, but it does
not search for the complete call graph.

33..  GGaatthheerriinngg PPrrooffiillee DDaattaa

     Routine calls or statement executions can  be  measured
by  having  a compiler augment the code at strategic points.
The additions can be inline increments to counters [Knuth71]
[Satterthwaite72]  [Joy79]  or  calls to monitoring routines
[Unix].  The counter increment overhead is low, and is suit-
able  for  profiling  statements.   A call of the monitoring
routine has an overhead comparable with a call of a  regular
routine, and is therefore only suited to profiling on a rou-
tine by routine  basis.   However,  the  monitoring  routine
solution  has  certain  advantages.   Whatever  counters are
needed by the monitoring routine can be managed by the moni-
toring  routine itself, rather than being distributed around
the code.  In particular, a monitoring routine can easily be
called from separately compiled programs.  In addition, dif-
ferent monitoring routines can be linked  into  the  program
being measured to assemble different profiling data without
having to change the compiler or recompile the program.  We
have exploited this approach; our compilers for C,
Fortran77, and Pascal can insert calls to a monitoring
routine in the prologue for each routine.  Use of the
monitoring routine requires no planning on the part of a
programmer other than to request that augmented routine
prologues be produced during compilation.
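A minimal sketch of this arrangement, with hypothetical names (a real
compiler emits the monitor call in the generated prologue rather than
in the source text):

```c
/* Sketch of an augmented routine prologue.  The names monitor_enter
 * and square are hypothetical; a real monitoring routine would record
 * the caller/callee arc, while this sketch only counts entries. */

long monitor_entries = 0;

void monitor_enter(const char *routine)
{
    (void)routine;          /* a real monitor would note the arc here */
    monitor_entries++;
}

int square(int x)
{
    monitor_enter("square");    /* stands in for the inserted prologue */
    return x * x;
}
```

Because the call is confined to the prologue, unprofiled routines pay
no cost at all, matching the behavior described above.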

     We are interested in gathering three pieces of informa-
tion  during  program  execution:  call counts and execution
times for each profiled routine, and the arcs of the dynamic
call  graph  traversed by this execution of the program.  By
post-processing of this data we can build the  dynamic  call
graph  for this execution of the program and propagate times
along the edges of this graph to attribute  times  for  rou-
tines to the routines that invoke them.

     Gathering  of  the  profiling  information  should  not
greatly interfere with the running of  the  program.   Thus,
the  monitoring  routine  must not produce trace output each
time it is invoked.  The volume of data thus produced  would
be  unmanageably  large,  and the time required to record it
would overwhelm the running time of  most  programs.   Simi-
larly, the monitoring routine can not do the analysis of the
profiling data (e.g. assembling the call graph,  propagating
times  around  it,  discovering cycles, etc.) during program
execution.  Our solution is to gather profiling data in mem-
ory during program execution and to condense it to a file as
the profiled program exits.  This file is then processed  by
a  separate  program  to  produce the listing of the profile
data.  An advantage of this approach  is  that  the  profile
data  for several executions of a program can be combined by
the post-processing to provide a profile of many executions.

     The  execution time monitoring consists of three parts.
The first part allocates and initializes the  runtime  moni-
toring  data structures before the program begins execution.
The second part is the monitoring routine invoked  from  the
prologue of each profiled routine.  The third part condenses
the data structures and writes them to a file as the program
terminates.   The  monitoring routine is discussed in detail
in the following sections.

33..11..  EExxeeccuuttiioonn CCoouunnttss

     The ggpprrooff monitoring routine counts the number of times
each  profiled  routine  is  called.  The monitoring routine
also records the arc in the call graph  that  activated  the
profiled  routine.   The count is associated with the arc in
the call graph rather than with the  routine.   Call  counts
for routines can then be determined by summing the counts on
arcs directed into that  routine.   In  a  machine-dependent
fashion, the monitoring routine notes its own return
address.  This address is in the prologue of some profiled
routine  that  is  the  destination of an arc in the dynamic
call graph.   The  monitoring  routine  also  discovers  the
return  address  for that routine, thus identifying the call
site, or source of the arc.  The source of the arc is in the
_c_a_l_l_e_r,  and the destination is in the _c_a_l_l_e_e.  For example,
if a routine A calls a routine B, A is the caller, and B  is
the  callee.   The  prologue of B will include a call to the
monitoring routine that will note the arc from A  to  B  and
either initialize or increment a counter for that arc.

     One  can not afford to have the monitoring routine out-
put tracing information as each arc is  identified.   There-
fore,  the  monitoring  routine maintains a table of all the
arcs discovered, with counts of the numbers of times each is
traversed during execution.  This table is accessed once per
routine call.  Access to it must be as fast as  possible  so
as  not  to  overwhelm the time required to execute the pro-
gram.

     Our solution is to access the table through a hash  ta-
ble.   We  use  the  call  site  as the primary key with the
callee address being the secondary  key.   Since  each  call
site typically calls only one callee, we can reduce (usually
to one) the number of minor lookups  based  on  the  callee.
Another  alternative would use the callee as the primary key
and the call site as the secondary key.  Such  an  organiza-
tion  has the advantage of associating callers with callees,
at the expense of longer lookups in the monitoring  routine.
We  are fortunate to be running in a virtual memory environ-
ment, and (for the sake of  speed)  were  able  to  allocate
enough  space  for the primary hash table to allow a one-to-
one mapping from call site addresses to the primary hash ta-
ble.   Thus  our  hash  function is trivial to calculate and
collisions occur only for call sites that call multiple des-
tinations  (e.g.  functional parameters and functional vari-
ables).  A one level hash function using both call site  and
callee  would  result  in  an unreasonably large hash table.
Further, the number of dynamic call sites and callees is not
known during execution of the profiled program.
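The table organization described above can be sketched as follows,
under simplifying assumptions: call sites are small integers standing
in for return addresses, so the primary lookup is a direct index, and
each site keeps a short chain for the rare case of multiple callees.
All names and sizes here are hypothetical:

```c
/* Arc-count table sketch: head[site] indexes a chain of (callee,
 * count) records, so a site that calls one callee resolves in a
 * single probe, as described in the text. */

#define MAX_SITES 256
#define MAX_ARCS  1024

struct arc { int callee; long count; int next; };

struct arc arcs[MAX_ARCS];
int narcs;
int head[MAX_SITES];            /* head[site] = arc index, or -1 */

void arcs_init(void)
{
    int i;
    narcs = 0;
    for (i = 0; i < MAX_SITES; i++)
        head[i] = -1;
}

/* Record one traversal of the arc (site -> callee). */
void count_arc(int site, int callee)
{
    int i;
    for (i = head[site]; i != -1; i = arcs[i].next)
        if (arcs[i].callee == callee) { arcs[i].count++; return; }
    if (narcs >= MAX_ARCS)
        return;                 /* table full; a real monitor grows it */
    arcs[narcs].callee = callee;
    arcs[narcs].count = 1;
    arcs[narcs].next = head[site];
    head[site] = narcs++;
}

long arc_count(int site, int callee)
{
    int i;
    for (i = head[site]; i != -1; i = arcs[i].next)
        if (arcs[i].callee == callee) return arcs[i].count;
    return 0;
}
```

Collisions arise only when one site calls several destinations, e.g.
through a functional parameter, just as in the scheme described above.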

     Not  all  callers  and callees can be identified by the
monitoring routine.  Routines that were compiled without the
profiling augmentations will not call the monitoring routine
as part of their prologue, and thus no arcs will be recorded
whose destinations are in these routines.  One need not pro-
file all the routines in a program.  Routines that  are  not
profiled  run  at  full  speed.   Certain  routines, notably
exception handlers,  are  invoked  by  non-standard  calling
sequences.   Thus the monitoring routine may know the desti-
nation of an arc (the callee),  but  find  it  difficult  or
impossible  to determine the source of the arc (the caller).
Often in these cases the apparent source of the arc is not
a call site at all.  Such anomalous invocations are declared
``spontaneous''.

33..22..  EExxeeccuuttiioonn TTiimmeess

     The execution times for routines can be gathered in  at
least two ways.  One method measures the execution time of a
routine by measuring the elapsed time from routine entry  to
routine  exit.   Unfortunately,  time measurement is compli-
cated on time-sharing systems by  the  time-slicing  of  the
program.   A  second method samples the value of the program
counter at some interval, and infers execution time from the
distribution  of the samples within the program.  This tech-
nique is particularly suited to time-sharing systems,  where
the  time-slicing  can  serve  as the basis for sampling the
program counter.  Notice  that,  whereas  the  first  method
could provide exact timings, the second is inherently a sta-
tistical approximation.

     The sampling method need not require support  from  the
operating  system:  all that is needed is the ability to set
and respond to ``alarm clock'' interrupts that run  relative
to  program  time.   It  is imperative that the intervals be
uniform since the sampling of  the  program  counter  rather
than  the  duration of the interval is the basis of the dis-
tribution.  If sampling is done too often, the interruptions
to  sample the program counter will overwhelm the running of
the profiled program.  On the other hand, the  program  must
run  for  enough  sampled intervals that the distribution of
the samples accurately represents the distribution  of  time
for  the  execution  of  the  program.  As with routine call
tracing, the monitoring routine can  not  afford  to  output
information for each program counter sample.  In our comput-
ing environment, the operating system  can  provide  a  his-
togram  of the location of the program counter at the end of
each clock tick (1/60th of a  second)  in  which  a  program
runs.   The  histogram is assembled in memory as the program
runs.  This facility is enabled by our  monitoring  routine.
We  have  adjusted  the granularity of the histogram so that
program counter values map one-to-one  onto  the  histogram.
We  make the simplifying assumption that all calls to a spe-
cific routine require the same amount of  time  to  execute.
This assumption may disguise that some calls (or worse, some
call sites) always invoke a routine such that its  execution
is  faster  (or  slower) than the average time for that rou-
tine.
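A sketch of the histogram mechanism, with hypothetical text-segment
bounds and granularity: each sampled program counter value bumps one
counter, and a counter's accumulated time is its tick count divided
by the 60 Hz clock rate.

```c
/* Program-counter histogram sketch.  TEXT_BASE, NBUCKETS, and GRANULE
 * are hypothetical; in gprof the granularity is chosen so that
 * program counter values map one-to-one onto the histogram. */

#define TEXT_BASE 0x1000u
#define NBUCKETS  64
#define GRANULE   16u

long hist[NBUCKETS];

/* Called once per clock tick with the sampled program counter. */
void sample_pc(unsigned pc)
{
    unsigned b = (pc - TEXT_BASE) / GRANULE;
    if (b < NBUCKETS)           /* samples outside the text are dropped */
        hist[b]++;
}

/* Seconds attributed to a bucket, at 60 ticks per second. */
double bucket_seconds(unsigned b)
{
    return hist[b] / 60.0;
}
```

The statistical nature of the profile is visible here: a routine's
time estimate is only as good as the number of ticks that land in its
buckets.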

     When the profiled program terminates, the arc table
and the histogram of program counter samples are written to
a file.  The arc table is condensed to consist of the source
and  destination  addresses  of the arc and the count of the
number of times the arc was traversed by this  execution  of
the program.  The recorded histogram consists of counters of
the number of times the program counter was found to be in
each  of  the  ranges  covered by the histogram.  The ranges
themselves are summarized as a lower and upper bound  and  a
step size.

44..  PPoosstt PPrroocceessssiinngg

     Having  gathered  the arcs of the call graph and timing
information for an execution of the program, we  are  inter-
ested  in  attributing the time for each routine to the rou-
tines that call it.  We build a dynamic call graph with arcs
from  caller  to callee, and propagate time from descendants
to ancestors by topologically sorting the call graph.   Time
propagation  is  performed from the leaves of the call graph
toward the roots, according to the order assigned by a topo-
logical  numbering  algorithm.   The  topological  numbering
ensures that all edges in the graph go from higher  numbered
nodes  to lower numbered nodes.  An example is given in Fig-
ure 1.  If  we  propagate  time  from  nodes  in  the  order
assigned  by the algorithm, execution time can be propagated
from descendants to ancestors after a  single  traversal  of
each arc in the call graph.  Each parent receives some frac-
tion of a child's time.  Thus time is charged to the  caller
in addition to being charged to the callee.

     Let  _C_e  be the number of calls to some routine, _e, and
_C_r_e be the number of calls from a caller _r  to  a  callee  _e.
Since we are assuming each call to a routine takes the aver-
age amount of time for all calls to that routine, the caller
is  accountable  for  _C_r_e/_C_e of the time spent by the callee.
Let the _S_e be the _s_e_l_f_t_i_m_e of a routine, _e.  The selftime of
a  routine  can  be  determined  from the timing information
gathered during profiled program execution.  The total time,
_T_r,  we wish to account to a routine _r, is then given by the


[Figure 1.  Topological ordering: each node of the call
graph is numbered so that every arc runs from a
higher-numbered node to a lower-numbered one; the drawing's
arcs were lost in this rendering.]

recurrence equation:

          T_r = S_r + SUM[r CALLS e] T_e * C_re/C_e

where r CALLS e is a relation showing all routines e called
by a routine r.  This relation is easily available from the
call graph.
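The propagation implied by this recurrence can be sketched as
follows, assuming the average-cost model above and a hypothetical
encoding in which routines are already numbered in topological order
(callees numbered lower than their callers):

```c
/* Time-propagation sketch.  arcs[] lists (caller, callee, count);
 * calls_total[e] is C_e, the total calls to e; selftime[e] is S_e.
 * Visiting callees first, each caller r receives C_re/C_e of the
 * callee's accumulated total, so each arc is traversed once. */

struct parc { int caller, callee; long count; };

void propagate(int n, double *total, const struct parc *arcs, int narcs,
               const double *selftime, const long *calls_total)
{
    int e, i;
    for (e = 0; e < n; e++)
        total[e] = selftime[e];
    for (e = 0; e < n; e++)              /* callees before callers */
        for (i = 0; i < narcs; i++)
            if (arcs[i].callee == e)
                total[arcs[i].caller] +=
                    total[e] * (double)arcs[i].count
                             / (double)calls_total[e];
}
```

With routines numbered as in Figure 1, a single pass in increasing
node order suffices, since every arc points from a higher number to a
lower one.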

     However, if the execution contains recursive calls, the
call  graph  has cycles that cannot be topologically sorted.
In these cases, we discover strongly-connected components in
the  call graph, treat each such component as a single node,
and then sort the resulting graph.  We use  a  variation  of
Tarjan's  strongly-connected  components algorithm that dis-
covers strongly-connected  components  as  it  is  assigning
topological order numbers [Tarjan72].

     Time  propagation  within strongly connected components
is a problem.  For  example,  a  self-recursive  routine  (a
trivial  cycle in the call graph) is accountable for all the
time it uses in all its recursive  instantiations.   In  our
scheme, this time should be shared among its call graph par-
ents.  The arcs from a routine to itself  are  of  interest,
but do not participate in time propagation.  Thus the simple
equation for time propagation does not work within  strongly
connected  components.  Time is not propagated from one mem-
ber of a  cycle  to  another,  since,  by  definition,  this
involves  propagating  time  from  a  routine to itself.  In
addition, children of one member of a cycle must be  consid-
ered  children of all members of the cycle.  Similarly, par-
ents of one member of the cycle must inherit all members  of
the  cycle  as descendants.  It is for these reasons that we
collapse connected components.  Our  solution  collects  all
members  of  a  cycle  together,  summing  the time and call
counts for all members.  All calls into the cycle  are  made
to share the total time of the cycle, and all descendants of
the cycle propagate time into the cycle as a  whole.   Calls
among  the  members  of the cycle do not propagate any time,
though they are listed in the call graph profile.
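The collapsing step can be sketched as follows.  The member[] marking
of the cycle is assumed to have been produced beforehand (e.g. by the
strongly-connected components pass), and, as described above, only
arcs entering the cycle from outside contribute to its call count:

```c
/* Cycle-collapsing sketch: the cycle's selftime is the sum over its
 * members, and its call count sums only the arcs whose source lies
 * outside the cycle; calls among members are excluded. */

struct carc { int caller, callee; long count; };

void collapse_cycle(const int *member, int n,
                    const double *selftime, double *cycle_self,
                    const struct carc *arcs, int narcs,
                    long *cycle_calls)
{
    int i;
    *cycle_self = 0.0;
    *cycle_calls = 0;
    for (i = 0; i < n; i++)
        if (member[i])
            *cycle_self += selftime[i];
    for (i = 0; i < narcs; i++)
        if (member[arcs[i].callee] && !member[arcs[i].caller])
            *cycle_calls += arcs[i].count;  /* calls into the cycle */
}
```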

     Figure 2 shows a modified version of the call graph  of
Figure  1,  in  which the nodes labelled 3 and 7 in Figure 1
are mutually  recursive.   The  topologically  sorted  graph
after the cycle is collapsed is given in Figure 3.

     Since  the  technique described above only collects the
dynamic call graph, and the program typically does not  call
every  routine  on  each execution, different executions can
introduce different cycles in the dynamic call graph.  Since
cycles  often have a significant effect on time propagation,
it is desirable to incorporate the static call graph so that
cycles will have the same members regardless of how the pro-
gram runs.

[Figure 2.  Cycle to be collapsed: the call graph of Figure
1 with the nodes labelled 3 and 7 mutually recursive; the
drawing's arcs were lost in this rendering.]


[Figure 3.  Topological numbering after cycle collapsing:
the collapsed cycle becomes a single node, and every arc
again runs from a higher number to a lower one; the
drawing's arcs were lost in this rendering.]

     The static call  graph  can  be  constructed  from  the
source text of the program.  However, discovering the static
call graph from the source text would require two moderately
difficult  steps:  finding  the  source text for the program
(which may not be available), and scanning and parsing  that
text, which may be in any one of several languages.

     In  our programming system, the static calling informa-
tion is also contained in the executable version of the pro-
gram,  which we already have available, and which is in lan-
guage-independent form.  One can examine the instructions in
the  object program, looking for calls to routines, and note
which routines can be called.  This technique allows  us  to
add  arcs  to those already in the dynamic call graph.  If a
statically discovered arc already exists in the dynamic call
graph,  no  action  is required.  Statically discovered arcs
that do not exist in the dynamic call graph are added to the
graph  with  a traversal count of zero.  Thus they are never
responsible for any time  propagation.   However,  they  may
affect the structure of the graph.  Since they may complete
strongly connected components, the static call graph
construction is done before topological ordering.
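A sketch of this scan over a toy, hypothetical instruction encoding
(a real implementation decodes the target machine's actual call
instructions and relocation information):

```c
/* Static-arc discovery sketch.  Toy encoding: opcode 0xC8 is "call",
 * followed by a one-byte absolute target; every other opcode is one
 * byte.  Each call found yields a (site, target) static arc, to be
 * merged into the dynamic call graph with a traversal count of zero. */

#define MAX_STATIC 64

struct sarc { unsigned site, target; };

int scan_calls(const unsigned char *text, int len, struct sarc *out)
{
    int pc = 0, n = 0;
    while (pc < len) {
        if (text[pc] == 0xC8 && pc + 1 < len && n < MAX_STATIC) {
            out[n].site = (unsigned)pc;
            out[n].target = text[pc + 1];
            n++;
            pc += 2;            /* skip opcode and target byte */
        } else {
            pc += 1;
        }
    }
    return n;                   /* number of statically found arcs */
}
```

Working from the executable rather than the source keeps the scan
language-independent, as the text above notes.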

55..  DDaattaa PPrreesseennttaattiioonn

     The data is presented to the user in two different for-
mats.  The first  presentation  simply  lists  the  routines
without  regard to the amount of time their descendants use.
The second presentation incorporates the call graph  of  the
program.

55..11..  TThhee FFllaatt PPrrooffiillee

     The flat profile consists of a list of all the routines
that are called during execution of the  program,  with  the
count  of the number of times they are called and the number
of seconds of execution time for which they  are  themselves
accountable.  The routines are listed in decreasing order of
execution time.  A list  of  the  routines  that  are  never
called  during execution of the program is also available to
verify that nothing important is omitted by this  execution.
The flat profile gives a quick overview of the routines that
are used, and shows the routines that are themselves respon-
sible  for  large fractions of the execution time.  In prac-
tice, this profile usually shows that no single function  is
overwhelmingly  responsible  for  the total time of the pro-
gram.  Notice that for this profile,  the  individual  times
sum to the total execution time.
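The flat profile's ordering can be sketched as a simple sort on
selftime; the record layout here is hypothetical:

```c
#include <stdlib.h>

/* Flat-profile ordering sketch: routines sorted by decreasing
 * selftime, so the heaviest routines appear first, and the seconds
 * column sums to the total run time. */

struct flat { const char *name; long calls; double seconds; };

int by_selftime(const void *a, const void *b)
{
    double d = ((const struct flat *)b)->seconds
             - ((const struct flat *)a)->seconds;
    return (d > 0) - (d < 0);
}

void flat_sort(struct flat *t, int n)
{
    qsort(t, n, sizeof t[0], by_selftime);
}
```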

55..22..  TThhee CCaallll GGrraapphh PPrrooffiillee

     Ideally,  we  would like to print the call graph of the
program, but we are limited by the two-dimensional nature of
our  output  devices.  We cannot assume that a call graph is


+----------------------------------------------------------------------------+
|                                     called/total     parents               |
|index   %time   self   descendants   called+self    name              index |
|                                     called/total     children              |
+----------------------------------------------------------------------------+
|                0.20      1.20           4/10         CALLER1         [7]   |
|                0.30      1.80           6/10         CALLER2         [1]   |
|[2]     41.5    0.50      3.00           10+4       EXAMPLE           [2]   |
|                1.50      1.00          20/40         SUB1 <cycle1>   [4]   |
|                0.00      0.50           1/5          SUB2            [9]   |
|                0.00      0.00           0/5          SUB3            [11]  |
+----------------------------------------------------------------------------+
                 Profile entry for EXAMPLE.
                         Figure 4.

planar, and even if it is, that we can print a planar
version of it.  Instead, we choose to list each routine,
together with information about the routines  that  are  its
direct parents and children.  This listing presents a window
into the call graph.  Based on our experience,  both  parent
information  and  child information is important, and should
be available without searching through the output.

     The major entries of the call  graph  profile  are  the
entries  from the flat profile, augmented by the time propa-
gated to each routine from its descendants.  This profile is
sorted  by  the  sum of the time for the routine itself plus
the time inherited from its descendants.  The profile  shows
which  of  the higher level routines spend large portions of
the total execution time in the  routines  that  they  call.
For  each routine, we show the amount of time passed by each
child to the routine, which  includes  time  for  the  child
itself  and  for  the descendants of the child (and thus the
descendants of the routine).  We also  show  the  percentage
these  times  represent  of  the total time accounted to the
child.  Similarly, the parents of each routine  are  listed,
along with time, and percentage of total routine time, prop-
agated to each one.

     Cycles are handled as single entities.  The cycle as  a
whole  is  shown  as though it were a single routine, except
that members of the cycle are listed in place of  the  chil-
dren.   Although  the  number  of  calls of each member from
within the cycle are shown, they do not affect time propaga-
tion.   When  a child is a member of a cycle, the time shown
is the appropriate fraction of the time for the whole cycle.
Self-recursive  routines  have  their calls broken down into
calls from the outside and self-recursive calls.   Only  the
outside calls affect the propagation of time.

     The  following  example is a typical fragment of a call
graph.



[Call graph fragment: CALLER1 and CALLER2 both call EXAMPLE;
EXAMPLE's children are SUB1, SUB2, and SUB3.  The arcs of
the drawing were lost in this rendering.]




The entry in the call graph profile listing for this example
is shown in Figure 4.

     The  entry is for routine EXAMPLE, which has the Caller
routines as its parents, and the Sub routines as  its  chil-
dren.   The  reader should keep in mind that all information
is given _w_i_t_h _r_e_s_p_e_c_t _t_o _E_X_A_M_P_L_E.  The index  in  the  first
column shows that EXAMPLE is the second entry in the profile
listing.  The EXAMPLE routine  is  called  ten  times,  four
times  by  CALLER1,  and six times by CALLER2.  Consequently
40% of EXAMPLE's time is propagated to CALLER1, and  60%  of
EXAMPLE's  time  is  propagated  to  CALLER2.   The self and
descendant fields of the parents show the amount of self and
descendant time EXAMPLE propagates to them (but not the time
used by the parents  directly).   Note  that  EXAMPLE  calls
itself  recursively  four  times.  The routine EXAMPLE calls
routine SUB1 twenty times, SUB2 once, and never calls  SUB3.
Since  SUB2 is called a total of five times, 20% of its self
and descendant time is propagated  to  EXAMPLE's  descendant
time field.  Because SUB1 is a member of cycle 1, the self
and descendant times and call count fraction are  those  for
the  cycle  as  a whole.  Since cycle 1 is called a total of
forty times (not counting calls among members of the cycle),
it propagates 50% of the cycle's self and descendant time to
EXAMPLE's descendant time field.  Finally each name is  fol-
lowed  by  an  index that shows where on the listing to find
the entry for that routine.
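The fractions in this entry follow directly from the C_re/C_e rule;
a one-line check of Figure 4's numbers (EXAMPLE's self plus
descendant time is 0.50 + 3.00 = 3.50 seconds, over 10 outside
calls):

```c
/* A parent receives C_re/C_e of the callee's self-plus-descendant
 * time; the test below reproduces the rows of Figure 4. */

double share(double callee_total, long arc_calls, long total_calls)
{
    return callee_total * (double)arc_calls / (double)total_calls;
}
```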

66..  UUssiinngg tthhee PPrrooffiilleess

     The profiler is a useful tool for improving  a  set  of
routines  that  implement an abstraction.  It can be helpful
in identifying poorly coded routines, and in evaluating  the
new  algorithms  and  code  that  replace them.  Taking full
advantage of the profiler requires a careful examination  of
the  call  graph  profile,  and  a thorough knowledge of the
abstractions underlying the program.

     The easiest optimization that can  be  performed  is  a
small  change  to a control construct or data structure that
improves the running time of the program.  An obvious start-
ing point is a routine that is called many times.  For exam-
ple, suppose an output routine is the only parent of a  rou-
tine  that  formats  the  data.   If  this format routine is
expanded inline in the output routine,  the  overhead  of  a
function  call  and  return can be saved for each datum that
needs to be formatted.
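A hypothetical before-and-after sketch of this inline expansion (the routine
names are invented for illustration, not taken from any real program):

```c
#include <assert.h>
#include <stdio.h>

/* Before: output_before() calls a tiny format routine once per
 * datum, paying a function call and return for every datum. */
static void format_datum(char *buf, size_t n, int datum) {
    snprintf(buf, n, "%8d\n", datum);
}

static void output_before(FILE *fp, const int *data, int count) {
    char buf[16];
    for (int i = 0; i < count; i++) {
        format_datum(buf, sizeof buf, data[i]);
        fputs(buf, fp);
    }
}

/* After: the formatting body is expanded inline in the loop; the
 * per-datum call overhead disappears, at the cost of making the
 * formatting abstraction invisible to the profiler. */
static void output_after(FILE *fp, const int *data, int count) {
    char buf[16];
    for (int i = 0; i < count; i++) {
        snprintf(buf, sizeof buf, "%8d\n", data[i]);  /* inlined */
        fputs(buf, fp);
    }
}
```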

     The drawback to  inline  expansion  is  that  the  data
abstractions  in  the program may become less parameterized,
hence less clearly defined.  The profiling will also  become
less  useful since the loss of routines will make its output
more granular.  For example, if the symbol  table  functions
``lookup'', ``insert'', and ``delete'' are all merged into a
single parameterized  routine,  it  will  be  impossible  to
determine the costs of any one of these individual functions
from the profile.
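To make the granularity point concrete, a merged routine might look like the
hypothetical sketch below; once the three operations share one body, the
profiler can attribute time only to `symtab_op` as a whole.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical merged symbol table routine: lookup, insert, and
 * delete folded into one parameterized entry point.  Names are
 * assumed shorter than 32 characters.  The profiler now sees a
 * single routine, so the cost of lookups alone can no longer be
 * read from the profile. */
enum symtab_op_kind { ST_LOOKUP, ST_INSERT, ST_DELETE };

#define ST_MAX 16

static char table[ST_MAX][32];
static int nentries;

static int symtab_op(enum symtab_op_kind op, const char *name) {
    int i;

    for (i = 0; i < nentries; i++)
        if (strcmp(table[i], name) == 0)
            break;
    switch (op) {
    case ST_LOOKUP:
        return i < nentries ? i : -1;   /* index, or "not found" */
    case ST_INSERT:
        if (i == nentries && nentries < ST_MAX)
            strcpy(table[nentries++], name);
        return i;
    case ST_DELETE:
        if (i < nentries) {
            --nentries;
            if (i != nentries)          /* move last entry down */
                strcpy(table[i], table[nentries]);
        }
        return i;
    }
    return -1;
}
```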

     Further potential for optimization lies in routines that
implement data abstractions whose total execution time is
long.  For example, a lookup routine might be called only a
few times, yet use an inefficient linear search algorithm that
could be replaced with a binary search.  Alternatively, the
discovery that a rehashing function is being called
excessively can lead to a different hash function or a larger
hash table.  If the data abstraction function cannot easily be
sped up, it may be advantageous to cache its results and
eliminate the need to rerun it for identical inputs.  These
and other ideas for program improvement are discussed in
[Bentley81].
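The caching idea can be sketched as follows.  This is a minimal one-entry
cache; `expensive_hash` is a hypothetical stand-in for whatever costly
abstraction function is involved.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for a costly data abstraction function. */
static unsigned expensive_hash(const char *s) {
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h;
}

/* Remember the most recent input and result so that identical
 * consecutive inputs need not rerun the computation. */
static unsigned cached_hash(const char *s) {
    static char last_key[64];
    static unsigned last_val;
    static int valid;

    if (valid && strcmp(s, last_key) == 0)
        return last_val;                  /* cache hit: skip the work */
    last_val = expensive_hash(s);
    strncpy(last_key, s, sizeof last_key - 1);
    last_key[sizeof last_key - 1] = '\0'; /* truncation only forces a miss */
    valid = 1;
    return last_val;
}
```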

     This tool is best used in an iterative approach: profile
the program, eliminate one bottleneck, then find some other
part of the program that begins to dominate execution time.
For instance, we have used gprof on itself, eliminating,
rewriting, and inline expanding routines until reading data
files (hardly a target for optimization!) came to dominate its
execution time.

     Certain types of programs are not easily analyzed by
gprof.  They are typified by programs that exhibit a large
degree of recursion, such as recursive descent compilers.  The
problem is that most of the major routines are grouped into a
single monolithic cycle.  As with the symbol table abstraction
merged into a single routine, it is impossible to distinguish
which members of the cycle are responsible for the execution
time.  Unfortunately there are no easy modifications to such
programs that make them amenable to analysis.

     A completely different use of the profiler is  to  ana-
lyze  the  control  flow  of  an unfamiliar program.  If you
receive a program from another user that you need to  modify
in  some  small  way,  it is often unclear where the changes
need to be made.  By running the program on an  example  and
then using ggpprrooff, you can get a view of the structure of the
program.

     Consider an example in which you  need  to  change  the
output  format of the program.  For purposes of this example
suppose that the call graph of the  output  portion  of  the
program has the following structure:


                 CALC1   CALC2   CALC3
                   |    /     \    |
                 FORMAT1       FORMAT2
                       \       /
                        "WRITE"


Initially you look through the gprof output for the system
call ``WRITE''.  The format routine you will need to change is
probably among the parents of the ``WRITE'' procedure.  The
next step is to look at the profile entry for each of the
parents of ``WRITE'', in this example either ``FORMAT1'' or
``FORMAT2'', to determine which one to change.  Each format
routine will have one or more parents, in this example
``CALC1'', ``CALC2'', and ``CALC3''.  By inspecting the source
code for each of these routines you can determine which format
routine generates the output that you wish to modify.  Since
the gprof entry shows all the potential calls to the format
routine you intend to change, you can determine whether your
modifications will affect output that should be left alone.
If you desire to change the output of ``CALC2'', but not
``CALC3'', then the formatting routine ``FORMAT2'' needs to be
split into two separate routines, one of which implements the
new format.  You can then retarget just the call by ``CALC2''
that needs the new format.  It should be noted that the static
call information is particularly useful here, since the test
case you run probably will not exercise the entire program.
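In outline, the split might look like the following sketch.  The routine
names follow the example call graph above, but the bodies and the formats
themselves are invented for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* FORMAT2 originally served both CALC2 and CALC3.  To change the
 * output of CALC2 alone, it is split into two routines and only
 * the call from CALC2 is retargeted. */
static void format2_old(char *buf, size_t n, double v) {
    snprintf(buf, n, "%f", v);    /* original format, kept for CALC3 */
}

static void format2_new(char *buf, size_t n, double v) {
    snprintf(buf, n, "%.2e", v);  /* new format, used only by CALC2 */
}

static void calc2(char *out, size_t n) {
    format2_new(out, n, 3.14159); /* retargeted call */
}

static void calc3(char *out, size_t n) {
    format2_old(out, n, 2.71828); /* unchanged */
}
```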

77..  CCoonncclluussiioonnss

     We  have created a profiler that aids in the evaluation
of modular programs.  For each routine in the  program,  the
profile shows the extent to which that routine helps support
various  abstractions,  and  how  that  routine  uses  other
abstractions.   The  profile accurately assesses the cost of
routines at all levels of the  program  decomposition.   The
profiler  is  easily used, and can be compiled into the pro-
gram without any prior planning by the programmer.  It  adds
only  five  to thirty percent execution overhead to the pro-
gram being profiled, produces  no  additional  output  until
after  the  program  finishes,  and allows the program to be
measured in its actual environment.  Finally,  the  profiler
runs on a time-sharing system using only the normal services
provided by the operating system and compilers.

8.  References

[Bentley81]
     Bentley, J. L., ``Writing Efficient Code'', Department of
     Computer Science, Carnegie-Mellon University, Pittsburgh,
     Pennsylvania, CMU-CS-81-116, 1981.

[Graham82]
     Graham, S. L., Henry, R. R., Schulman, R. A., ``An
     Experiment in Table Driven Code Generation'', SIGPLAN '82
     Symposium on Compiler Construction, June 1982.

[Joy79]
     Joy, W. N., Graham, S. L., Haley, C. B., ``Berkeley
     Pascal User's Manual'', Version 1.1, Computer Science
     Division, University of California, Berkeley, CA, April
     1979.

[Knuth71]
     Knuth, D. E., ``An Empirical Study of FORTRAN Programs'',
     Software - Practice and Experience, 1, 105-133, 1971.

[Satterthwaite72]
     Satterthwaite, E., ``Debugging Tools for High Level
     Languages'', Software - Practice and Experience, 2,
     197-217, 1972.

[Tarjan72]
     Tarjan, R. E., ``Depth First Search and Linear Graph
     Algorithms'', _S_I_A_M _J_. _C_o_m_p_u_t_i_n_g 1:2, 146-160, 1972.

[Unix]
     Unix Programmer's Manual, ``prof command'', section 1,
     Bell Laboratories, Murray Hill, NJ, January 1979.