Chapter 4. Setting Up Performance Analysis Experiments

In performance analysis, you set up the experiment, run the executable file, and analyze the results. To make the setup easier, the Performance Analyzer provides predefined tasks that help you establish an objective and ensure that the appropriate performance data will be collected.

This chapter tells you how to conduct performance analysis experiments and what to look for in the results. It covers these topics:

  • “Experiment Setup”

  • “Selecting a Performance Task”

  • “Setting Sample Traps”

  • “Displaying Data from the Parallel Analyzer”

Experiment Setup

Performance tuning typically consists of examining machine resource usage, breaking down the process into phases, identifying the resource bottleneck within each phase, and correcting the cause. Generally, you run the first experiment to break your program down into phases and run subsequent experiments to examine each phase individually. After you have solved a problem in a phase, you then reexamine machine resource usage to see if there is a further opportunity for performance improvement.

Each experiment has these steps:

  1. Specify the performance task (see “Selecting a Performance Task” for complete details).

    The Performance Analyzer provides predefined tasks for conducting experiments. When you select a task, the Performance Analyzer automatically enables the appropriate performance data items for collection. See “Understanding Predefined Tasks”, for more details about the predefined tasks.

    The predefined tasks ensure that only the appropriate data collection is enabled. Selecting too much data can bog down the experiment and skew the data for collection. If you need a mix of performance data not available in the predefined tasks, you can select Custom from the Select Task submenu. It lets you enable combinations of the data collection options.

  2. Specify where to capture the data.

    If you want to gather information for the complete program, this step is not needed. If you want data at specific points in the process, you need to set sample traps (see “Setting Sample Traps”, for a description of traps or see the ProDev WorkShop: Debugger User's Guide for an in-depth discussion).

    The Performance Analyzer records samples at the beginning and end of the process automatically. If you want to analyze data within phases, set sample traps at the beginning of each phase and at intermediate points.

  3. Specify the experiment configuration parameters.

    This step is not necessary if you use the defaults; if you want to make configuration changes, select Configs from the Perf menu. The dialog box lets you specify a number of configuration options, many of which depend on the experiment you plan to run. The dialog box shown in Figure 5-1 presents the runtime configuration choices; the options are described in “Configuring the Experiment” in Chapter 5.

  4. Run the program to collect the data.

    You run the experiment from the WorkShop Debugger window. If you are running a small experiment to capture resource usage, you may be able to watch the experiment in real time in the Process Meter. The results are stored in the designated experiment subdirectory.

  5. Analyze the results.

    After the experiment completes, you can look at the results in the Performance Analyzer window and its associated views. Use the calipers to get information for phases separately.

Selecting a Performance Task

To set up a Performance Analyzer experiment, choose a task from the Select Task submenu in the Perf menu in the WorkShop Debugger window. The Performance Analyzer will then automatically enable data collection of the pertinent performance data items.

Selecting a task enables data collection. The mode indicator in the upper right corner of the WorkShop Debugger window changes from Debug Only to Performance.

Understanding Predefined Tasks

If you are unfamiliar with performance analysis, it is very easy to request more data collection than you actually need. Doing this can slow down the Performance Analyzer and skew results. To help you record the data that is appropriate to your current objective, WorkShop provides predefined combinations of tasks, which are available in the Select Task submenu in the Perf menu. These tasks are described in the following sections. When you select a task, the required data collection is automatically enabled.

Profiling/PC Sampling

Use the Profiling/PC Sampling task selection when you are identifying which parts of your program are using the most CPU time. PC profiling results in a statistical histogram of the program counter. The exclusive CPU time is presented as follows:

  • By function in the function list

  • By source line in Source View

  • By instruction in Disassembly View

In addition, machine resource usage data is captured at 1-second intervals and at sample points.

This task gathers data by sampling the program counter (PC) value every 10 milliseconds (ms).
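
If you are curious about the mechanism, PC sampling can be pictured as an interval timer driving a histogram. The following C program is a hypothetical, simplified sketch of the idea, not the Performance Analyzer's actual implementation; extracting the interrupted PC from the signal context is platform-specific, so that one step is left as a placeholder.

#include <signal.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <sys/time.h>

#define NBUCKETS 4096
static unsigned long hist[NBUCKETS];    /* one counter per code region */

static void on_tick(int sig, siginfo_t *si, void *uc)
{
    (void)sig; (void)si; (void)uc;
    /* Platform-specific step, elided: recover the interrupted PC from
       the ucontext_t that uc points to.  A real profiler buckets that
       address; this placeholder just counts the tick.                 */
    uintptr_t pc = 0;
    hist[(pc / 16) % NBUCKETS]++;        /* 16-byte bucket granularity */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_tick;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGPROF, &sa, NULL);       /* SIGPROF fires on each tick */

    struct itimerval it = { { 0, 10000 }, { 0, 10000 } };  /* 10 ms    */
    setitimer(ITIMER_PROF, &it, NULL);   /* ticks in process CPU time  */

    volatile double x = 0.0;             /* CPU-bound busy loop        */
    for (long i = 0; i < 100000000; i++)
        x += i * 0.5;

    unsigned long n = 0;
    for (int b = 0; b < NBUCKETS; b++)
        n += hist[b];
    printf("%lu samples = about %lu ms of CPU time\n", n, n * 10);
    return 0;
}

A real profiler maps each bucket back to a function, source line, or instruction, which is exactly how the exclusive CPU time lands in the three views listed above.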

User Time/Callstack Sampling

Use the User Time/Callstack Sampling task selection to tune a CPU-bound phase or program. It enables you to display the time spent in the CPU by function, source line, and instruction. This task records the following:

  • The call stack every 30 milliseconds (ms)

  • Machine resource usage data at 1-second intervals and at sample points

Data is measured by periodically sampling the call stack. The program's call stack data is used to do the following:

  • Attribute exclusive user time to the function at the bottom of each call stack (that is, the function being executed at the time of the sample).

  • Attribute inclusive user time to all the functions above the one currently being executed.

The time spent in a procedure is determined by multiplying the number of times an instruction for that procedure appears in the stack by the average time interval between call stacks. Call stacks are gathered whether the program was running or blocked; hence, the time computed represents the total time, both within and outside the CPU. If the target process was blocked for a long time as a result of an instruction, that instruction will show up as having a high time.
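For example, if an experiment gathers 2,000 call stacks at 30 ms intervals (60 seconds of run time), and a given function sits at the bottom of 400 of those stacks, it is charged 400 × 30 ms = 12 seconds of exclusive user time; if it appears anywhere in 700 of the stacks, it is charged 700 × 30 ms = 21 seconds of inclusive user time. (The counts here are invented for illustration.)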

User time runs should incur a program execution slowdown of no more than 15%. Data from a user time experiment is statistical in nature and shows some variance from run to run.

Ideal Time/Pixie

Use the Ideal Time/Pixie task selection to tune a CPU-bound phase. The name ideal time is historical: it dates from a time when processors executed instructions in a more linear fashion than modern processors do. An ideal time experiment represents the best possible performance for your program.

The analysis determines the cost on a per-basic-block basis; it does not deal with data dependencies between basic blocks. A basic block is a set of instructions with a single entry point, a single exit point, and no branches into or out of the instructions. This task is most useful in conjunction with the Profiling/PC Sampling task: comparing the two lets you examine actual versus ideal time. The difference is the time spent as a result of the following:

  • Performing load operations, which take a minimum of two cycles if the data is available in the primary cache, and much longer if the data has to be accessed from the secondary cache, main memory, or the swap area.

  • Performing store operations, which cause the CPU to stall if the write buffer in the CPU gets filled.

  • Waiting for a CPU stalled as a result of data dependencies between basic blocks.

This task records the following:

  • Basic block counts

  • Counts of branches taken

  • Machine resource usage data at 1-second intervals and at sample points

  • Function pointer traces with counts

The following results can be displayed in the function list, the Source View, and the Disassembly View:

  • Execution counts.

  • Resulting machine instructions.

  • A count of resulting loads, stores, and floating-point instructions.

  • An approximation of the time spent with the CPU stalling because of data and functional unit interlocks. Interlocks are situations caused when resources, such as data, are not available.

The task requires instrumentation of the target executable. Counter code is inserted at the beginning of each basic block.

After the instrumented executable runs, the Performance Analyzer multiplies the number of times a basic block was executed by the number of instructions in it. This yields the total number of instructions executed as a result of that basic block (and similarly for other specific kinds of instructions, like loads or stores).
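As a back-of-the-envelope illustration of that arithmetic, the C sketch below sums instruction and load counts over a small block table. The table entries are invented sample data, not output from a real experiment.

#include <stdio.h>

/* One entry per basic block: how often it ran and what it contains. */
struct bblock {
    const char   *func;     /* enclosing function   */
    unsigned long count;    /* times the block ran  */
    unsigned int  ninsns;   /* instructions per run */
    unsigned int  nloads;   /* loads per run        */
};

int main(void)
{
    /* Invented counts, for illustration only. */
    struct bblock blocks[] = {
        { "matmul", 1000000,  8, 3 },   /* hot inner-loop block */
        { "matmul",    1000,  5, 1 },   /* loop setup block     */
        { "main",         1, 12, 4 },   /* run-once prologue    */
    };
    int nblocks = sizeof blocks / sizeof blocks[0];
    unsigned long insns = 0, loads = 0;

    for (int i = 0; i < nblocks; i++) {
        insns += blocks[i].count * blocks[i].ninsns;
        loads += blocks[i].count * blocks[i].nloads;
    }
    printf("total instructions: %lu, total loads: %lu\n", insns, loads);
    return 0;
}

Dividing such totals by the machine's instruction issue rate is what yields the ideal, stall-free time estimate.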

While ideal time builds a complete call graph and points out where the program would spend the most time if the processor executed instructions in a strictly linear fashion, it is best also to run a PC sampling experiment and compare the results.

The following is a typical example of why ideal time alone is not enough to understand application performance. Because ideal time knows only which instructions were executed and how many times each one was executed, the loop

DO i=1,size
   DO j=1,size
      DO k=1,size
         u(i,j) = u(i,j) + v(i,k)*w(k,j)
      END DO
   END DO
END DO

would be the same for ideal as

DO j=1,size
   DO k=1,size
      DO i=1,size
         u(i,j) = u(i,j) + v(i,k)*w(k,j)
      END DO
   END DO
END DO

and the same as

DO k=1,size
   DO j=1,size
      DO i=1,size
         u(i,j) = u(i,j) + v(i,k)*w(k,j)
      END DO
   END DO
END DO

But if you actually run and time these loops, even with a simple tool such as time(1) or timex(1), you will see big differences.

Remember that ideal time knows nothing about software pipelining, memory behavior, page faults, caches, and so on; it assumes plain instructions executed one after another, and it simply counts the number of times each instruction is executed.

So, accessing u(i,j) or u(j,i) is the same to ideal time, but it can make a big difference to the application's actual runtime performance.

Floating-Point Exception Trace

Use the Floating-Point Exception Trace task selection when you suspect that large, unaccounted-for periods of time are being spent in floating-point exception handlers. The task records the call stack at each floating-point exception. The number of floating-point exceptions is presented as follows:

  • By function in the function list

  • By source line in the Source View

  • By instruction in Disassembly View

To observe the pattern of floating-point exceptions over time, look in the floating-point exceptions event chart in the Usage View (Graphical) window.
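
The following small C99 program is a generic illustration, not tied to the Performance Analyzer, of the kind of code this task is aimed at: a loop that quietly raises floating-point exceptions. Here the standard fetestexcept() routine merely inspects the exception flags after the fact; in a traced run, each exception that reaches a handler would be recorded along with its call stack.

#include <stdio.h>
#include <fenv.h>

int main(void)
{
    double x = 1.0e-300, sum = 0.0;

    feclearexcept(FE_ALL_EXCEPT);
    for (int i = 0; i < 1000; i++) {
        x *= 0.5;           /* shrinks into denormal range, then to 0 */
        sum += 1.0 / x;     /* once x reaches 0, divides by zero      */
    }
    if (fetestexcept(FE_UNDERFLOW))  puts("underflow was raised");
    if (fetestexcept(FE_DIVBYZERO))  puts("divide-by-zero was raised");
    printf("sum = %g\n", sum);
    return 0;
}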

I/O Trace

Use the I/O Trace task selection when your program is being slowed down by I/O calls, and you want to find the responsible code. This task records call stacks at every read(2), write(2), readv(2), writev(2), open(2), close(2), pipe(2), dup(2), and creat(2) system call. It also records file descriptor information and the number of bytes read or written.

The number of bytes read and written is presented as follows:

  • By function in the function list

  • By source line in the Source View

  • By instruction in the Disassembly View
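
For instance, a program such as the following hypothetical sketch would stand out in this view: it issues one million write(2) calls of one byte each, and the trace would attribute every call, and every byte written, to the function and source line that issued it.

#include <fcntl.h>
#include <unistd.h>

/* Writes n bytes one write(2) call at a time: a classic I/O hot spot.
   The I/O trace would charge all of the calls to copy_slow().        */
static void copy_slow(int fd, const char *buf, long n)
{
    for (long i = 0; i < n; i++)
        write(fd, &buf[i], 1);
}

int main(void)
{
    static char buf[1000000];                /* 1,000,000 zero bytes */
    int fd = open("/dev/null", O_WRONLY);

    if (fd < 0)
        return 1;
    copy_slow(fd, buf, (long)sizeof buf);
    close(fd);
    return 0;
}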

Memory Leak Trace

Use the Memory Leak Trace task selection to determine where memory leaks and bad calls to free may occur in a process. The task records the call stack, the address, and the number of bytes at every malloc, realloc, and free call. The bytes currently allocated by malloc (which might represent leaks) and the list of double calls to free are presented in Malloc Error View and the other memory analysis views. The number of bytes allocated by malloc is presented as follows:

  • By function in the function list

  • By source line in the Source View

  • By instruction in the Disassembly View
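
A deliberately buggy toy program (invented for illustration) shows the two kinds of errors this task reports:

#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *leak = malloc(4096);   /* never freed: reported as bytes
                                    still allocated at process exit */
    if (leak != NULL)
        memset(leak, 1, 4096);

    char *twice = malloc(64);
    free(twice);
    free(twice);                 /* double free: the kind of bad call
                                    listed in Malloc Error View; note
                                    it may abort outside the tool    */
    return 0;
}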

R10000 and R12000 Hardware Counters

If you are running your application on a system with either the R10000 or the R12000 series CPU, you can use the R10k/R12k Hardware Counters task selection from the WorkShop Debugger window once you have narrowed down the source of your problem. This task gives low-level, detailed information about hardware events. It counts the following events:

  • Graduated instructions. The graduated instruction counter is incremented by the number of instructions that were graduated on the previous cycle.

  • Machine cycles. The counter is incremented on each clock cycle.

  • Primary instruction cache misses. This counter is incremented one cycle after an instruction fetch request is entered into the miss handling table.

  • Secondary instruction cache misses. This counter is incremented after the last 16-byte block of a 64-byte primary instruction cache line is written into the instruction cache.

  • Primary data cache misses. This counter is incremented on the cycle after a primary cache data refill is begun.

  • Secondary data cache misses. This counter is incremented on the cycle after the second 16-byte block of a primary data cache line is written into the data cache.

  • TLB (translation lookaside buffer) misses. This counter is incremented on the cycle after the TLB miss handler is invoked.

  • Graduated floating-point instructions. This counter is incremented by the number of floating-point instructions that graduated on the previous cycle.

  • Failed store conditionals.

You can also choose hardware counter profiling based on either PC sampling or call stack sampling.

You can generate other hardware counter experiments by using the ssrun command. See the ssrun(1) man page or the SpeedShop User's Guide for more information.
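
For example, experiments can be launched directly from the command line. The experiment names below are typical of SpeedShop releases but are given here only as an assumed illustration; confirm them against the ssrun(1) man page on your system:

% ssrun -pcsamp a.out
% ssrun -usertime a.out
% ssrun -gi_hwc a.out

The first two correspond roughly to the Profiling/PC Sampling and User Time/Callstack Sampling tasks; the third profiles on overflows of the graduated instruction counter.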

Custom

Use the Custom task selection when you need to collect a combination of performance data that is not available through the predefined tasks. Selecting Custom brings up the same tab panel screen displayed by the Configs... selection (see Figure 5-1).

The Custom task lets you select and tune the following:

  • Sampling data. This includes profiling intervals, counter size, and whether rld(1) will be involved in data collection.

  • Tracing data. This includes malloc and free trace, I/O system call trace, and floating-point exception trace.

  • Recording intervals. This includes the frequency of data recording for usage data or usage or call stack data at caliper points. You can also specify this with marching orders. For more information on marching orders, see the ssrun(1) man page.

  • Call stack. This includes sampling intervals and the type of timing.

  • Ideal experiments. This specifies whether or not the basic block count data is collected. It also builds a complete call graph. See “Ideal Time/Pixie” for more information.

  • Hardware counter specification. This specifies the hardware event you want to count, the counter overflow value, and the profiling style (PC or call stack). Hardware counter experiments are possible only on R10000 and R12000 systems.

  • Runtime. This specifies the same as those listed for the Configs menu selection. See “Configuring the Experiment” in Chapter 5.

Remember the basic warnings in this chapter about collecting data:

  • Too much data can slow down the experiment.

  • Call stack profiling is not compatible with count operations or PC profiling.

  • If you combine count operations with PC profiling, the results will be skewed due to the amount of instrumented code that will be profiled.

Setting Sample Traps

Sample traps allow you to record data when a specified condition occurs. You set traps from the WorkShop Debugger window: choose either the Trap Manager or the Source View from the Views menu. For a complete discussion of setting traps, see ProDev WorkShop: Debugger User's Guide.


Note: In order for trap-based caliper points to work, you must activate the Attach Debugger toggle on the Runtime tab window. That window is available from the Configs... menu item on the Perf menu of the WorkShop Debugger window.

You can define sample traps:

  • At function entry or exit points

  • At source lines

  • For events

  • Conditionally

  • Manually during an experiment

Sample traps at function entry and exit points are preferable to source line traps because they are more likely to be preserved as your program evolves. That stability also makes it practical to save a set of traps from the Trap Manager to a file for later reuse.

Manual sample traps are triggered when you click the Sample button in the WorkShop Debugger window. They are particularly useful for applications with graphical user interfaces. If you have a suspect operation in an experiment, a good technique is to take a manual sample before and after you perform the operation. You can then examine the data for that operation.

Displaying Data from the Parallel Analyzer

The Performance Analyzer can also display data that has been parallelized for execution on a multiprocessor system. It supports Fortran 77, Fortran 90, C, and C++ with either of the following parallelizing models:

  • The automatic parallelization performed by the compilers. This is enabled by including the -apo option on the compiler command line. For more information on automatic parallelization, see the programmer's guide for your compiling system.

  • OpenMP, a set of compiler pragmas or directives, library routines, and environment variables that help you distribute loop iterations and data among multiple processors. For details about OpenMP, see the programmer's guide for your compiling system.
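
As a minimal illustration of the OpenMP model (a generic C sketch, not an example from this manual), the directive below asks the compiler to split the loop's iterations among the processors available at run time:

#include <stdio.h>

int main(void)
{
    static double a[1000000];

    /* The iterations of this loop are distributed across threads. */
    #pragma omp parallel for
    for (int i = 0; i < 1000000; i++)
        a[i] = i * 0.5;

    printf("a[999999] = %g\n", a[999999]);
    return 0;
}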

ProDev WorkShop ProMP is a companion product to the WorkShop suite of tools. It specifically analyzes a program that has been parallelized. It is integrated with WorkShop to let you examine a program's loops in conjunction with a performance experiment on either a single processor or multiprocessor run. For more information, see the ProDev WorkShop: ProMP User's Guide.

The cvpav(1) command reads and displays analysis files generated by the MIPSpro compilers. When you plan to view one of these files in the Performance Analyzer, use the -e option to cvpav, and specify the program executable as the argument, as follows:

% cvpav -e a.out

From the Parallel Analyzer user interface, choose the Admin -> Launch Tool -> Performance Analyzer menu item. Once the new window comes up, choose Excl. Percentage from the Sort... window under the Config menu. Doing so will list the loops in order, with the most expensive at the top, allowing you to concentrate your attention on the most compute-intensive loops.