prof_intro(1)							prof_intro(1)



NAME

  prof_intro - Introduction to application profilers, profiling,
  optimization, and performance analysis

DESCRIPTION

  Tru64	UNIX supports four approaches to performance improvement:

    +  Automatic and profile-directed optimizations. For example:
	    pixie -update a.out	data/*
	    cc -non_shared -O3 -spike -feedback	a.out *.c

    +  Manual design and code optimizations. For example:
	    hiprof -all	-display program data/*	| more
	    uprofile -heavy program data/* | more

    +  Minimizing system-resource usage. For example:
	    third -display program data/* | more

    +  Verifying significance of test cases. For example:
	    pixie -testcoverage	program	data/* | more



  One approach might be enough, but combining approaches can be beneficial
  when no single one addresses all aspects of a program's performance. The
  following sections describe each approach and the tools that Tru64 UNIX
  provides to support them.

AUTOMATIC AND PROFILE-DIRECTED OPTIMIZATIONS

  Techniques


  Automatic and	profile-directed optimizations are the simplest	approaches to
  improving application	performance.

  Some degree of automatic optimization	can be achieved	by using the
  compiler's and linker's optimization options.	These can help in the genera-
  tion of minimal instruction sequences	that make best use of the CPU archi-
  tecture and cache memory.

  However, the compiler	and linker can improve their optimizations if they
  are given information	on which instructions are executed most	often when
  the program is run with its normal input data	and environment. While the
  default optimizations	give improved performance for most common situations,
  the optimizers can do	even better if they can	tune the program in favor of
  the heavily used instruction sequences as determined from a sample run.

  Tru64	UNIX helps you provide the optimizers with this	information on pro-
  cessing hot-spots by allowing	a profiler's results to	be fed back into a
  recompilation. This customized, profile-directed optimization	can be used
  in conjunction with automatic	optimization.

  Tools	and Examples


  The cc compiler command's automatic optimization options are selected	with
  -O, -fast, -inline, -spike, and other	related	options. See cc(1) for
  details and Chapter 10 of the	Programmer's Guide for more information	on
  the many options and tradeoffs available.

  For example, this command selects a high degree of optimization in both the
  compiler and the linker:

       cc -non_shared -O3 -spike *.c

  The pixie profiler provides profile information that the cc command's
  -feedback and	-spike options can use to tune the generated instruction
  sequences to the demands placed on the program by particular sets of input
  data.

  The steps, shown in the following example, consist of	(1) preparing the
  program for profile-directed optimization, (2) creating an instrumented
  version of the program and running it	to collect profiling statistics, and
  (3) feeding that information back to the compiler and	linker to help them
  optimize the executable code:

       rm -f program
       cc -non_shared -feedback	program	-o program -O3 *.c
       pixie -update program
       cc -non_shared -feedback	program	-o program -O3 -spike *.c


  To apply profile-directed optimizations to shared libraries, generate	pro-
  file data with an exerciser program, and store it in the shared library
  prior	to recompiling with that feedback. For example:

       rm -f libexample.so
       cc -feedback libexample.so -o libexample.so -shared -O3 lib*.c
       cc -o exerciser exerciser.c -L. -lexample
       pixie -L. -incobj libexample.so -run exerciser
       prof -pixie -update libexample.so exerciser.Counts
       cc -spike -feedback libexample.so -o libexample.so -shared -O3 lib*.c

MANUAL DESIGN AND CODE OPTIMIZATIONS

  Techniques


  The effectiveness of the automatic optimizations described above is limited
  by the efficiency of the algorithms that the program uses. A program's per-
  formance can be further improved by manually optimizing its algorithms and
  data structures. Such	optimizations may include reducing complexity from
  N-squared to log-N, avoiding copying of data,	and reducing the amount	of
  data used. It	may also extend	to tuning the algorithm	to the architecture
  of the particular machine it will be run on -	for example, processing	large
  arrays in small blocks such that each	block remains in the data cache	for
  all processing, instead of the whole array being read	into the cache for
  each processing phase.

  Tru64 UNIX supports manual optimization with its profiling tools, which
  identify the parts of the application that use the most CPU resources -
  CPU cycles, cache misses, and so on. By evaluating different profiles of a
  program, you can identify its most demanding parts and then redesign or
  recode the algorithms in those parts to use fewer resources. The profiles
  also make this exercise more cost-effective by helping you to focus on the
  most demanding code rather than on the least demanding.

  Tools	and Examples


  (a) CPU-Time Profiling with a Call Graph


  A call-graph profile shows how much CPU time is used by each procedure, and
  how much is used by all the other procedures that it calls. This can show
  which	phases or subsystems in	a program spend	most of	the total CPU time,
  which	can help in gaining a general understanding of the program's perfor-
  mance.

  The hiprof profiler instruments the program and records a call graph while
  the instrumented program executes. The hiprof profiler does not require
  that the program be compiled in any particular way, but the names of local
  (for example, static) procedures will be hidden if the cc command's default
  -g0 option was used, and inlined procedures will also be hidden. For
  example:

       cc -g1 -O2 -o program *.c
       hiprof -all -display program data/* | more

  By default, hiprof uses a low-frequency sampling technique and estimates
  the cost of procedure	calls. It can profile all the code executed by the
  program, including all selected libraries, though its	call graph excludes
  procedures in	threads-related	system libraries. It can also provide
  detailed profiles at the level of source lines or machine instructions.

  For non-threaded programs, hiprof can	alternatively count the	number of
  machine cycles used or page faults suffered by the program. The cost of
  each procedure call is individually measured,	and the	CPU time or page-
  fault	count reported for the instrumented routines includes that for the
  uninstrumented routines that they call. This can summarize the costs and
  reduce the run-time overhead,	but note that the machine-cycle	counter	wraps
  if no	instrumented procedure is called at least every	few seconds.
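
  As a sketch only (the -cycles option name is an assumption here; see
  hiprof(1) for the exact spelling of the cycle-counting and page-fault
  options), a machine-cycle profile of a non-threaded program might be
  requested like this:

       hiprof -cycles program data/* | more    # -cycles assumed; see hiprof(1)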

  The cc compiler's -pg option uses the same sampling technique as hiprof,
  but the program needs to be instrumented by compiling with the -pg option.
  Only the executable is profiled (not shared libraries), and few system
  libraries are instrumented to generate a call-graph profile, so hiprof may
  be preferred. However, the cc command's -pg option and gprof are supported
  in a very similar way on different vendors' UNIX systems, so their
  portability may be an advantage. For example:

       cc -g1 -O2 -pg -o program *.c
       ./program data/*
       gprof program gmon.out |	more

  The optional dxprof command provides a graphical display of various call-
  graph	profiles.

  (b) CPU-Time/Event Profiles for Source Lines/Instructions


  A good performance-improvement strategy may start with a procedure-level
  profile of the whole program (perhaps	with a call graph too, to give the
  big picture),	but it will often progress to detailed profiling of indivi-
  dual source-lines and	instructions.

  The uprofile profiler uses a sampling technique to generate a profile of
  CPU time, or of events such as cache misses, for each procedure, source
  line, or instruction. The sampling frequency depends on the processor type
  and the statistic being sampled, but for CPU time it is on the order of a
  millisecond. The profiler achieves this without modifying the
  target program at all, by using hardware counters that are built into	the
  Alpha	CPU.  Running the uprofile command with	no arguments yields a list of
  all the kinds	of events that a particular machine can	profile, depending on
  the nature of	its architecture. The default is to profile machine cycles,
  resulting in a CPU-time profile. The following example shows how to display
  a profile of the source-lines	that suffered the top 90% of data cache
  misses on an EV56 Alpha:

       cc -g1 -O2 -o program *.c
       uprofile	-h -q 90cum% dcacheldmisses program data/* | more
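
  To list the events that the current machine can profile, as mentioned
  above, run uprofile with no arguments:

       uprofile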

  This technique has the advantage of very low run-time	overhead. Also,	the
  detailed information it can provide on the costs of executing	individual
  instructions or source-lines is essential in identifying exactly which
  operation in a procedure is slowing the program down.

  The disadvantages of uprofile are that only executables can be profiled,
  that the results can be skewed unless all processors have the same cycle
  speed, that only one program can be profiled with the hardware counters at
  a time, that threads cannot be profiled individually, and that the Alpha
  EV6 architecture's out-of-order execution of instructions can
  significantly reduce the accuracy of fine-grained profiles.

  If hiprof's call counting is not too intrusive, its default sampling
  technique can provide the same fine-grained profiles as uprofile (CPU time
  only); it is accurate even with mixed processor cycle speeds, and it can
  profile all the shared libraries of a program as well as individual
  threads. For example:

       hiprof -h -all program data/* | more

  The cc compiler's -p option uses the same low-frequency sampling technique
  as hiprof. It	is common to many UNIX systems,	and (on	Tru64 UNIX) it is
  able to profile all the shared libraries used	by a program. The program
  needs	to be relinked with the	-p option, but it does not need	to be recom-
  piled	from source, so	long as	the original compilation used an acceptable
  debug	level, such as the -g1 compiler	option.	For example, to	profile	indi-
  vidual instructions of a program:

       cc -p -o	program	*.o
       setenv PROFFLAGS	'-all -stride 1'
       ./program data/*
       prof -all -asm -quit 5% program mon.out | more
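
  The setenv line above uses csh/tcsh syntax; under a Bourne-style shell the
  equivalent would be:

       PROFFLAGS='-all -stride 1'; export PROFFLAGS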

  The pixie tool can also profile source-lines and instructions	(including
  shared libraries), but note that when	it displays counts of "Cycles",	it is
  actually reporting counts of instructions executed, not machine cycles. For
  example:

       cc -g1 -O2 -o program *.c
       pixie -all -lines -quit 20 program data/* | more

  The optional dxprof command provides a graphical display of profiles col-
  lected by either pixie or the	cc command's -p	option.

MINIMIZING SYSTEM RESOURCE USAGE

  Techniques


  The above techniques can improve an application's use	of just	the CPU.
  Further performance improvements can be made by improving the	efficiency
  with which the application uses the other components of the computer sys-
  tem: heap memory, disk files,	network	connections, etc.

  As with CPU profiling, the first phase of a resource usage improvement pro-
  cess is to monitor how much memory, data I/O and disk	space, elapsed time,
  and so on, is	used. Then the throughput of the computer can be increased or
  tuned	in ways	that help the program, or the program's	design can be tuned
  to make better use of	the computer resources that are	available. For exam-
  ple:

    +  Reduce the size of the data files that the program reads	and writes.

    +  Use memory-mapped files instead of regular I/O.

    +  Allocate	memory incrementally on	demand instead of allocating at
       start-up	the maximum that could be required.

    +  Fix heap leaks, and do not leave allocated memory unused.

  See the System Configuration and Tuning manual for a broader discussion of
  analyzing and tuning a Tru64 UNIX system.

  Tools	and Examples


  (a) System Monitors


  The Tru64 UNIX base system commands ps u, swapon -s, and vmstat 3 can	show
  the currently	active processes' usage	of system resources such as CPU-time,
  physical and virtual memory, swap space, page	faults,	and so on.
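
  For example, to check per-process CPU and memory usage, swap-space
  allocation, and virtual-memory statistics sampled every three seconds:

       ps u
       swapon -s
       vmstat 3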

  The optional pview command provides a	graphical display of similar informa-
  tion for the processes that comprise an application.

  The time commands provided by the Tru64 UNIX system and the command shells
  offer an easy way to measure the total elapsed and CPU times for a program
  and its descendants.
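
  For example, using the shell's time command:

       time ./program data/*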

  The collect tool is an optional, low-overhead system performance monitor.

  Many other related commands are described in the System Configuration	and
  Tuning manual.

  (b) Heap Memory Analyzers


  The third command reports heap memory	leaks in a program, by instrumenting
  it with the Third Degree memory-usage	checker, running it, and displaying a
  log of leaks detected	at program exit. For example:

       third -display program data/* | more

  If you are interested	only in	leaks occurring	during the normal operation
  of the program, not during startup or	shutdown, you can specify additional
  places to check for previously unreported leaks. For example,	the pre-
  shutdown leak	report will give this information:

       third -display -after startup -before shutdown program data/* | more


  Third	Degree can also	detect various kinds of	bugs that may be affecting
  the correctness or performance of a program. See the Programmer's Guide for
  further details on debugging and leak-detection.

  The optional dxheap command provides a graphical display of Third Degree's
  heap and bug reports.

  The optional mview command provides a graphical analysis of heap usage over
  time. This view of a program's heap can clearly show the presence (if not
  the cause) of significant leaks or other undesirable trends such as wasted
  memory.

VERIFYING SIGNIFICANCE OF TEST CASES

  Techniques


  Most of the above profiling techniques are effective only if you profile
  and optimize or tune the parts of the	program	that are executed in the
  scenarios whose performance is important. Careful selection of the data
  used for the profiled	test-runs is often sufficient, but you may want	a
  quantitative analysis	of which code was and was not executed in a given set
  of tests.

  Tools	and Examples


  The pixie command's -t[estcoverage] option reports lines of code that	were
  not executed in a given test run. For	example:

       pixie -t	program	data/* | more

  Conversely, pixie's -p[rocedure], -h[eavy], and -a[sm] options show which
  procedures, source lines, and	instructions were executed.
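
  For example, to see the most heavily executed source lines from a test run
  (using the -h[eavy] option in the same way as -t above):

       pixie -h program data/* | more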

  If multiple test runs	are needed to build up a typical scenario, the prof
  command can be run separately	on a set of profile data files:

       pixie -pids program
       ./program.pixie data1/*
       ./program.pixie data2/*
       prof -pixie -t program program.Counts.*

SEE ALSO

  Optimizing:	cc(1), spike(1)

  Profiling:   hiprof(1), pixie(1), third(1), uprofile(1)

  System Monitoring:   collect(8), ps(1), swapon(1), vmstat(1)

  Graphical tools, available from the Graphical	Program	Analysis subset	of
  the Tru64 UNIX Associated Products installation media, or as part of
  Compaq's Enterprise Toolkit for Windows/NT desktops with Microsoft's Visual
  Studio 97: dxheap(1),	dxprof(1), mview(1), pview(1)

  Programmer's Guide

  System Configuration and Tuning