Chapter 5. Performance Analyzer Reference

This chapter provides detailed descriptions of the Performance Analyzer toolset.

Selecting Performance Tasks

You choose performance tasks from the Select Task submenu of the Perf menu in the WorkShop Debugger window. You should have an objective in mind before you start an experiment. The tasks ensure that only the appropriate data collection is enabled. Collecting too much data can slow down the experiment and skew the results.

The tasks are summarized in Table 5-1. The Task column identifies the task as it appears in the Select Task submenu of the WorkShop Debugger's Perf menu. The Clues column provides an indication of symptoms and situations appropriate for the task. The Data Collected column indicates the performance data collection enabled by the task. Note that call stacks are collected automatically at sample points, poll points, and process events. The Description column describes the technique used.

Table 5-1. Summary of Performance Analyzer Tasks

Profiling/PC Sampling
  Clues: CPU-bound.
  Data collected: PC profile counts; fine-grained usage (1 sec.); call stacks.
  Description: Tracks CPU time spent in functions, source code lines, and instructions. Useful for CPU-bound conditions. CPU time metrics help you separate CPU-bound from non-CPU-bound instructions.

User Time/Callstack Sampling
  Clues: Not CPU-bound.
  Data collected: fine-grained usage (1 sec.); call stack profiling (30 ms); call stacks.
  Description: Tracks the user time spent by function, source code line, and instruction.

Ideal Time/Pixie
  Clues: CPU-bound.
  Data collected: basic block counts; fine-grained usage (1 sec.); call stacks.
  Description: Calculates the ideal time, that is, the time spent in each basic block under the assumption of one instruction per machine cycle. Useful for CPU-bound conditions. Ideal time metrics also give counts of total machine instructions and of load, store, and floating-point instructions. It is useful to compare ideal time with the CPU time in an experiment that identifies high CPU time.

Floating Point Exception Trace
  Clues: High system time in usage charts; presence of floating-point operations; NaNs.
  Data collected: FPE exception trace; fine-grained usage (1 sec.); call stacks.
  Description: Useful when you suspect that time is being wasted in floating-point exception handlers. Captures the call stack at each floating-point exception. Lists floating-point exceptions by function, source code line, and instruction.

I/O Trace
  Clues: Process blocking due to I/O.
  Data collected: I/O system call trace; fine-grained usage (1 sec.); call stacks.
  Description: Captures call stacks at every I/O-oriented system call. The file descriptor and number of bytes are available in I/O View.

Memory Leak Trace
  Clues: Swelling in process size.
  Data collected: malloc/free trace; fine-grained usage (1 sec.); call stacks.
  Description: Detects memory leaks by capturing the call stack, address, and size at all malloc, realloc, and free routines and displaying them in a memory map. Also indicates double free routines.

R10k/R12k Hardware Counters...
  Clues: Need more detailed information.
  Data collected: a wide range of hardware-level counts.
  Description: On R10000 and R12000 systems only, returns low-level information by counting hardware events in special registers. An overflow value is assigned to the relevant counter, and the number of overflows is returned.

Custom...
  Data collected: call stacks; user's choice.
  Description: Lets you select the performance data to be collected. Remember that too much data can skew results.


Specifying a Custom Task

When you choose Custom... from the Select Task submenu in the Perf menu in the Main View, a dialog box appears. This section provides an explanation of most of the windows involved in setting up a custom task.

The Custom... Runtime and HWC Spec (hardware counter) windows are identical to the Configs... Runtime and HWC Spec windows. For an illustration of Runtime, see Figure 5-1. For information on HWC Spec, see “R10000 and R12000 Hardware Counters” in Chapter 4.

Specifying Data to be Collected

Data is collected and recorded at every sample point. The following data collection methods are available:

Call Stack Profiling

The Performance Analyzer performs call stack data collection automatically. There is no instrumentation involved. This corresponds to the SpeedShop usertime experiment.

The CallStack window lets you choose from real time, virtual time, and profiling time and specify the sampling interval.

Real time is also known as wall-clock time and total time. It is the total time a program takes to execute, including the time it takes waiting for a CPU.

Virtual time is also called process virtual time. It is the time spent when a program is actually running, as opposed to when it is swapped out and waiting for a CPU or when the operating system is in control, such as performing I/O for the program.

Profiling time is the time the process has actually been running on the CPU, whether in user or system mode. It is the default for the usertime experiment. It is also called CPU time or user time.

For the sampling interval, you can select one of the following intervals:

  • Standard (every 30 milliseconds)

  • Fast (every 20 milliseconds)

  • Custom (enter your own interval)


Note: The experiment may run slowly in programs with very deep call stacks and many DSOs. In such cases, increasing the sampling interval will help.


Basic Block Count Sampling

Basic block counts are translated to ideal CPU time (as shown in the SpeedShop ideal experiment) and are displayed at the function, source line, and machine line levels. In calculating ideal CPU time, the experiment uses the number of cycles taken by each instruction and other resources of the processor type used for the experiment. Actual time usage will differ.

See “Ideal Time/Pixie” in Chapter 4 for more information.

Memory loads and stores are assumed to take constant time, so if the program has a large number of cache misses, the actual execution time will be longer than that calculated by the ideal experiment.

The end result might be better described as ideal user CPU time.
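
For example, if the ideal task reports 10 seconds of ideal CPU time for a function while PC sampling reports 25 seconds of actual CPU time, the 15-second difference points at effects that ideal time does not model, such as cache misses.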

The Ideal window lets you select the counter size, either 16 or 32 bits, and the option to use rld(1) profiling.

The data is gathered by first instrumenting the target executable. This involves dividing the executable into basic blocks, each a set of machine instructions with no branches into or out of it. A few instructions are inserted for every basic block to increment a counter each time the block is executed. When the instrumented target executable runs, the basic block data is generated and written out to disk whenever a sample trap fires. Instrumenting an executable increases its size by a factor of three and greatly modifies its performance behavior.
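
To make the notion of a basic block concrete, the following hand-annotated C fragment (an illustration, not the output of any WorkShop tool) marks roughly where the block boundaries of a small function fall; the instrumentation adds a counter increment to each such block:

int clamp_sum(const int *a, int n, int limit)
{
    int i;
    int sum = 0;               /* entry block: runs once per call */

    for (i = 0; i < n; i++) {  /* loop test: its own block, executed n+1 times */
        sum += a[i];           /* loop body block: the branch below ends it */
        if (sum > limit)
            return limit;      /* early-exit block */
    }
    return sum;                /* normal-exit block */
}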


Caution: The instrumented executable runs more slowly than the original. Because instrumentation changes the executable's use of crucial resources, the instrumented executable might appear to be CPU-bound during analysis, whereas the original executable is I/O-bound.


PC Profile Counts

Enabling PC profile counts causes the Program Counter (PC) of the target executable to be sampled every 10 milliseconds when it is in the CPU. PC profiling is a lightweight, high-speed operation done with kernel support. Every 10 milliseconds, the kernel stops the process if it is in the CPU, increments a counter for the current value of the PC, and resumes the process. It corresponds to the SpeedShop pcsamp experiment.

PC profile counts are translated to the actual CPU time displayed at the function, source line, and machine line levels. The actual CPU time is calculated by multiplying the PC hit count by 10 milliseconds.
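
For example, if a function's PC value appears 1,500 times in the profile, the function is charged 1,500 × 10 milliseconds = 15 seconds of actual CPU time.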

A major discrepancy between actual CPU time and ideal CPU time indicates one or more of the following:

  • Cache misses in a single process application.

  • Secondary cache invalidations in a multiprocess application run on a multiprocessor.


Note: This comparison is inaccurate over a single run if you collect both basic block and PC profile counts simultaneously. In this situation, the ideal CPU time will factor out the interference caused by instrumenting; the actual CPU time will not.

A comparison between basic block counts and PC profile counts is shown in Table 5-2.

Table 5-2. Basic Block Counts and PC Profile Counts Compared

Basic Block Counts                   PC Profile Counts
Used to compute ideal CPU time       Used to estimate actual CPU time
Data collection by instrumenting     Data collection done with the kernel
Slows program down                   Has minimal impact on program speed
Generates an exact count             Approximates counts


Specifying Tracing Data

Tracing data records the time at which an event of the selected type occurred. The following types of tracing data are available:


Note: These features should be used with care; enabling tracing data adds substantial overhead to the target execution and consumes a great deal of disk space.


malloc and free Heap Analysis

Tracing malloc and free allows you to study your program's use of dynamic storage and to quickly detect memory leaks (malloc routines without corresponding free routines) and bad free routines (freeing a previously freed pointer). This data can be analyzed in the Malloc Error View, Leak View, Malloc View, and Heap View (see “Analyzing Memory Problems”).
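
As a concrete illustration, the following small C program (a hypothetical example, not part of the WorkShop distribution) contains one of each kind of problem these views report:

#include <stdlib.h>
#include <string.h>

/* Allocates a copy of a string; each malloc call is recorded with its call stack. */
static char *make_label(const char *s)
{
    char *p = malloc(strlen(s) + 1);
    if (p != NULL)
        strcpy(p, s);
    return p;
}

int main(void)
{
    char *a = make_label("kept");
    char *b = make_label("lost");

    b = make_label("replaced");  /* the "lost" block is now unreachable: a memory leak */
    free(a);
    free(a);                     /* freeing a previously freed pointer: a bad free */
    free(b);
    return 0;
}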

I/O Operations

I/O tracing records every I/O-related system call that is made during the experiment. It traces read(2), write(2), readv(2), writev(2), open(2), close(2), dup(2), pipe(2), and creat(2), along with the call stack at the time, and the number of bytes read or written. This is useful for I/O-bound processes.
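
For instance, every call in the following C sketch (the file name is hypothetical) would be recorded, along with the call stack and the byte counts of the reads and writes:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    ssize_t n;
    int fd = open("data.in", O_RDONLY);               /* open(2) is traced */

    if (fd >= 0) {
        while ((n = read(fd, buf, sizeof buf)) > 0)   /* read(2): byte count recorded */
            write(STDOUT_FILENO, buf, (size_t)n);     /* write(2): byte count recorded */
        close(fd);                                    /* close(2) is traced */
    }
    return 0;
}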

Floating-Point Exceptions

Floating-point exception tracing records every instance of a floating-point exception. This includes problems like underflow and NaN (not a number) values. If your program has a substantial number of floating-point exceptions, you may be able to speed it up by correcting the algorithms.

The floating-point exceptions are as follows:

  • Overflow

  • Underflow

  • Divide-by-zero

  • Inexact result

  • Invalid operand (for example, infinity)
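
The following minimal C sketch shows operations that raise these exceptions (illustrative only; whether each exception traps or is silently flagged depends on the floating-point environment):

#include <stdio.h>

int main(void)
{
    volatile double huge = 1e308, tiny = 1e-308, zero = 0.0;

    double overflow  = huge * huge;   /* overflow: result exceeds the double range */
    double underflow = tiny * tiny;   /* underflow: result too small to represent */
    double divzero   = 1.0 / zero;    /* divide-by-zero: produces infinity */
    double invalid   = zero / zero;   /* invalid operand: produces a NaN */
    double inexact   = 1.0 / 3.0;     /* inexact result: cannot be represented exactly */

    printf("%g %g %g %g %g\n", overflow, underflow, divzero, invalid, inexact);
    return 0;
}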

MPI Stats Trace

MPI tracing lets you track message-passing activity in any process of a multiprocessing job. You can view the results in the Performance Analyzer window with either the MPI Stats View (Graphs) or MPI Stats View (Numerical) selections from the Views menu. For examples, see “MPI Stats View (Graphs)” in Chapter 2 and “MPI Stats View (Numerical)” in Chapter 2.

Unlike other performance tasks, this one cannot be initiated from the Debugger View; use the SpeedShop ssrun(1) command in combination with the mpirun(1) command. First, set the MPI_RLD_HACK_OFF environment variable for safety reasons and then compile the application with the MPI library:

setenv MPI_RLD_HACK_OFF 1
f90 -o comm comm.f -lmpi

Next, run ssrun as part of the mpirun command:

mpirun -np 4 ssrun -mpi comm

For this 4-processor application, five experiment files will be generated: one for each processor (their IDs begin with f) and one for the master process (its ID begins with m).

comm.mpi.f3221936
comm.mpi.f3224241
comm.mpi.f3225085
comm.mpi.f3227246
comm.mpi.m3226551

You can view any of the files with cvperf:

cvperf comm.mpi.f3225085

Specifying Polling Data

The following categories of polling data are available through caliper points. Entering a positive, nonzero value in the corresponding field turns a category on and sets the time interval at which it records data.

Pollpoint Sampling

Setting pollpoint sampling on the Runtime tab window sets caliper points that specify a regular time interval for capturing performance data, including resource usage and any enabled sampling or tracing functions. Since pollpoint sampling occurs frequently, it is best used with call stack data only, rather than other profiling data. Its primary use is to enable you to set boundary points for phases. In subsequent runs, you can set sample points to collect the profiling data at the phase boundaries.

Call Stack Profiling

Enabling call stack profiling in the CallStack tab window causes the call stack of the target executable to be sampled at the specified time interval (a minimum of 10 milliseconds) and saved. The call stack continues to be sampled when the program is not running: that is, while it is internally or externally blocked. Call stack profiling is used in the User Time/Callstack Sampling task to calculate total times.

You can choose the type of time you want to eventually display: real time, virtual time, or profiling time. See the glossary for definitions.

By setting the sampling interval to a lower number, you can sample more often and obtain finer-grained results.

Call stack profiling is performed by the Performance Analyzer process, not by the kernel. As a result, it is less accurate than PC profiling. Collecting call stack profiling data is far more intrusive than collecting PC profile data.


Caution: Collecting basic block data causes the text of the executable to be modified. Therefore, if call stack profiling data is collected along with basic block counts, the cumulative total time displayed in Usage View (Graphs) is potentially erroneous.

Table 5-3 compares call stack profiling and PC profiling.

Table 5-3. Call Stack Profiling and PC Profiling Compared

PC Profiling                 Call Stack Profiling
Done by kernel               Done by Performance Analyzer process
Accurate, nonintrusive       Less accurate, more intrusive
Used to compute CPU time     Used to compute total time


Configuring the Experiment

To specify the experiment configuration, choose Configs... from the Perf menu. See Figure 5-1 for an illustration of the resulting window. While you can access other tabs, the only ones that are active are the Runtime and General tabs.

Figure 5-1. Runtime Configuration Dialog Box

Specifying the Experiment Directory

The Experiment Directory field lets you specify the directory where you want the data to be stored. The Performance Analyzer provides a default directory named test0000 for your first experiment. If you use the default or any other name that ends in four digits, the four digits are used as a counter and will be incremented automatically for each subsequent session. Note that the Performance Analyzer does not remove (or overwrite) experiment directories. You need to remove directories yourself.
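
For example, accepting the default directory for three successive sessions produces test0000, test0001, and test0002.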

Other Options

The following configuration options are available on the Runtime display:

  • File Basename: specifies the base name of the experiment file (if blank, it is the name of the executable).

  • You can specify whether you want the Performance Analyzer to gather performance data for any processes launched by one or more of the following:

    • exec()

    • fork()

    • sproc()

    • system()

    • Follow fork() to exec() processes

  • The center column lets you choose the following options:

    • Verbose output yields more explanatory information in the Execution View.

    • Reuse File Descriptors opens and closes the file descriptors for the output files every time performance data is to be written. If the target program is using chdir(), the _SPEEDSHOP_REUSE_FILE_DESCRIPTORS environment variable is set to the value selected by this configuration option.

    • Compress Experiment Data saves disk space.

    • Disable Stack Unwind suppresses the stack unwind as is done in the SpeedShop usertime, totaltime, and other call stack-based experiments.

    • Disable Signal Handlers disables the normal setting of signal handlers for all fatal and exit signals.

    • Attach Debugger lets you debug the running program.

    • Generate Callgraph displays which functions called, and were called by, other functions.

  • CaliperPoint Signal sets the value of the signal sent by the sample button to cause the process to write out a caliper point. The default value is 40.

  • PollPoint Caliper Interval (seconds) specifies the interval at which pollpoint caliper points are taken.

  • AutoLaunch Analyzer launches the Performance Analyzer automatically when the experiment finishes.

The Performance Analyzer Main Window

The Performance Analyzer main window is used for analysis after the performance data has been captured. It contains a time line area indicating when events took place over the span of the experiment, a list of functions with their performance data, and a resource usage chart. The following sections describe the window:

The Performance Analyzer main window can be invoked from the Launch Tool submenu in the Debugger Admin menu or from the command line, by typing one of the following:

cvperf [-exp] directory

cvperf speedshop_exp_files

cvperf [-pixie] pixie.counts_files

The arguments to these commands are as follows:

  • directory: a directory containing data from old WorkShop performance experiments.

  • speedshop_exp_files: one or more experiment files generated either by the ssrun(1) command or by using the Select Task ... submenu of the Perf menu on the WorkShop Debugger window.

  • pixie.counts_files: an output file from pixie(1) measuring code execution frequency. The ideal task generates a pixie.counts_file.

Task Field

The Task field identifies the task for the current experiment and is read-only. See “Selecting Performance Tasks” for a summary of the performance tasks.

Function List Display and Controls

The function list area displays the program's functions with the associated performance metrics. It also provides buttons for displaying function performance data in other views. See Figure 5-2.

Figure 5-2. Typical Function List Area

The main features of the function list are:

  • Function list display area: shows all functions in the program annotated with their associated performance data. The column headings identify the metrics.

    You select the performance data to display from the Preferences... selection in the Config menu. The order of ranking is set by the Sort... selection in the Config menu. The default order of sorting (depending on availability) is:

    1. Inclusive time

    2. Exclusive time

    3. Counts

  • Search field: lets you look for a function in the list and in any active views.

  • Hide 0 Functions toggle button: lets you filter functions with 0 time from the list.

  • Show Node button: displays the specified node in the Call Graph View.

  • Source button: displays the Source View window corresponding to the selected function. The Source View window displays performance metrics in the annotation column. Source View can also be displayed by double-clicking a function in the function list or a node or arc (lines between nodes) in the call graph.

  • Disassembled Source button: displays the Disassembly View window corresponding to the selected function. The Disassembly View is annotated with the performance metrics.

Usage Chart Area

The usage chart area in the Performance Analyzer main window displays the stripchart most relevant to the current task. The upper subwindow displays the legend for the stripchart, and the lower subwindow displays the stripchart itself. This gives you some useful information without having to open the Usage View (Graphs) window. Table 5-4 shows the data displayed in the usage chart area for each task.

Table 5-4. Task Display in Usage Chart Area

Task                                  Data in Usage Chart Area
User Time/Callstack Sampling          User versus system time
Profiling/PC Sampling                 User versus system time
Ideal Time/Pixie                      User versus system time
Floating Point Exception Trace        Floating-point exception event chart
I/O Trace                             read() and write() system calls
Memory Leak Trace                     Process size stripchart
R10000 or R12000 Hardware Counters    Depends on experiment
Custom task                           User versus system time, unless one of the tracing tasks from this list has been selected

You can expand either subwindow to show more information by dragging the boxes at the right of the subwindow.

Time Line Area and Controls

The time line shows when each sample event in the experiment occurred. Figure 2-2 shows the time line portion of the Performance Analyzer window with typical results.

The Time Line Calipers

The time line calipers let you define an interval for performance analysis. You can set the calipers in the time line to any two sample event points using the caliper controls or by dragging them. The calipers appear solid for the current interval. If you drag them with the mouse (left or middle button), they appear dashed to give you visual feedback. When you stop dragging a caliper, it appears in outlined form denoting a tentative and as yet unconfirmed selection.

The following steps show how to set the calipers:

  1. Set the left caliper to the sample event at the beginning of the interval.

    You can drag the left caliper with the left or middle mouse button, or move it by using the left caliper control buttons in the control area. Note that calipers always snap to sample events. (It does not matter whether you start with the left or right caliper.)

  2. Set the right caliper to the sample event at the end of the interval. This is similar to setting the left caliper.

  3. Confirm the change by clicking the OK button in the control area.

    After you confirm the new position, the solid calipers move to the current position of the outlined calipers and change the data in all views to reflect the new interval.

    Clicking Cancel or clicking with the right mouse button before the change is confirmed restores the outlined calipers to the solid calipers.

Current Event Selection

If you want to get more information on an event in the time line or in the charts in the Usage View (Graphs), you can click an event with the left button. The Event field displays the following:

  • Event number

  • Description of the trap that triggered the event

In addition, the Call Stack View window updates to the appropriate times, stack frames, and event type for the selected event. A black diamond-shaped icon appears in the time line and charts to indicate the selected event. You can also select an event using the event controls below the caliper controls; they work in similar fashion to the caliper controls.

Time Line Scale Menu

The time line scale menu lets you change the number of seconds of the experiment displayed in the time line area. The Full Scale selection displays the entire experiment on the time line. The other selections are time values; for example, if you select 1 min, the length of the time line displayed will span 1 minute.

Admin Menu

The Admin menu has selections common to the other WorkShop tools. The following selections are different in the Performance Analyzer:

  • Experiment...: lets you change the experiment directory and displays a dialog box (see Figure 5-3).

  • Save As Text...: saves the performance information selected in the view to a text file and displays a dialog box. You can use the default file name or replace it with another name in the Selection dialog box that displays. You can specify the number of lines to be saved. The data can be saved as a new file or appended to an existing one.

Figure 5-3. Experiment Window

Config Menu

The main purpose of the Config menu in the Performance Analyzer main window is to let you select the performance metrics for display and for ranking the functions in the function list. However, your selections also apply elsewhere, such as the Call Graph View window.

The selections in the Config menu are as follows:

  • Preferences...: brings up the Data Display Options window, which lets you select which metrics display and whether they appear as absolute times and counts or percentages. Remember, you can only select the types of metrics that were collected in the experiment. You can also specify how C++ file names (if appropriate) are to display:

    • Demangled shows the function and its argument types.

    • As Is uses the translator-generated C-style name.

    • Function shows the function name only.

    • Class Function shows the class and function.

  • Sort...: brings up the Sort Options window, which lets you establish the order in which the functions appear; this helps you find questionable functions. The default order of sorting (depending on availability) is:

    1. Inclusive time or counts

    2. Exclusive time or counts

    3. Counts

The selections in the Data Display Options window and the Sort Options window are similar. The difference between the inclusive (Incl.) and exclusive (Excl.) metrics is that inclusive data includes data from other functions called by the function, and exclusive data comes only from the function.

The toggle buttons in both the Data Display Options and Sort Options windows are as follows:

  • Incl. Percentage, Excl. Percentage: percentage of the total time spent inside and outside of the CPU (by a function, source line, or instruction).

  • Incl. Total Time, Excl. Total Time: time spent inside and outside of the CPU (by a function, source line, or instruction). It is calculated by multiplying the number of times the PC appears in any call stack by the average time interval between call stacks.

  • Incl. CPU Time, Excl. CPU Time: time spent inside the CPU (by a function, source line, or instruction). It is calculated by multiplying the number of times a PC value appears in the profile by 10 ms.

  • Incl. Ideal Time, Excl. Ideal Time: theoretical time spent by a function, source line, or instruction under the assumption of one machine cycle per instruction. It is useful to compare ideal time with actual time.

  • Incl. HWC Data, Excl. HWC Data: number of events measured.

  • Incl. Cycles, Excl. Cycles: number of machine cycles.

  • Incl. Instr'ns, Excl. Instr'ns: number of instructions.

  • Incl. FP operations, Excl. FP operations: number of floating-point operations.

  • Incl. Load counts, Excl. Load counts: number of load operations.

  • Incl. Store counts, Excl. Store counts: number of store operations.

  • Incl. System calls, Excl. System calls: number of system calls.

  • Incl. Bytes Read, Excl. Bytes Read: number of bytes in a read operation.

  • Incl. Bytes Written, Excl. Bytes Written: number of bytes in a write operation.

  • Incl. FP Exceptions, Excl. FP Exceptions: number of floating-point exceptions.

  • Incl. Page faults, Excl. Page faults: number of page faults.

  • Incl. bytes leaked, Excl. bytes leaked: number of bytes leaked as a result of calls to malloc that were not followed by calls to free.

  • Incl. bytes malloc'd, Excl. bytes malloc'd: number of bytes allocated in malloc operations.

  • Incl. bytes MPI/Sent, Excl. bytes MPI/Sent: number of bytes of data sent by an MPI routine.

  • Incl. bytes MPI/Recv, Excl. bytes MPI/Recv: number of bytes of data received by an MPI routine.

  • Incl. MPI Send-Ops, Excl. MPI Send-Ops: number of times an MPI send routine was executed.

  • Incl. MPI Recv-Ops, Excl. MPI Recv-Ops: number of times an MPI receive routine was executed.

  • Incl. MPI Barriers, Excl. MPI Barriers: number of times an MPI_Barrier routine was executed.

  • Address: address of the function.

  • Instr'n Coverage: the percentage of instructions (in the line or function) that were executed at least once.

  • Calls: number of times a function is called.

  • Pixstats/Cycles-per instr'n: shows how efficiently the code is written to avoid stalls and to take advantage of superscalar operation. A cycles-per-instruction count of 1.0 means that an instruction is executed every cycle. A count greater than 1.0 means that some instructions took more than one cycle. A count less than 1.0 means that more than one instruction was sometimes executed in a given cycle. The R10000 and R12000 processors can potentially execute up to four instructions on every cycle.

    In the disassembly view, this metric turns into pixstats, which displays basic block boundaries and the cycle counts distribution for each instruction in the basic block.
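
As a worked example of the cycles-per-instruction metric: a function that executes 4,000,000 instructions in 6,000,000 machine cycles has a count of 6,000,000 / 4,000,000 = 1.5, which suggests stalls; a count of 0.5 would mean that, on average, two instructions completed in each cycle.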

The following options are available on the Data Display Options window only:

  • Display Data As: Times/Counts or Percentages: lets you choose whether you want to display your performance metrics as times and counts (for instance, the time a function required to execute) or as percentages (the percentage of the program's time a function used). The default is Times/Counts.

  • Hide 0 Functions in Function List and Hide 0 Functions in Graph: lets you filter functions with 0 counts from the list or graph.

  • Incl. Percentage: show inclusive percentages on the Call Graph View window.

  • Incl. Total Time: show inclusive total time on the Call Graph View window.

  • Incl. CPU Time: show inclusive CPU time on the Call Graph View window.

  • Incl. Ideal Time: show inclusive ideal time on the Call Graph View window.

  • Incl. HWC Data: show inclusive hardware counter data on the Call Graph View window.

  • Incl. System calls: show inclusive system calls on the Call Graph View window.

  • Incl. Bytes Read: show inclusive bytes read on the Call Graph View window.

  • Incl. Bytes Written: show inclusive bytes written on the Call Graph View window.

  • Incl. FP Exceptions: show inclusive floating-point exceptions on the Call Graph View window.

  • Incl. Page faults: show inclusive page faults on the Call Graph View window.

  • Incl. bytes leaked: show inclusive bytes leaked as a result of malloc operations not followed by matching free operations on the Call Graph View window.

  • Incl. bytes malloc'd: show inclusive bytes allocated with a malloc operation on the Call Graph View window.

  • Calls: show the number of calls to that function on the Call Graph View window.

The following option is available on the Sort Options window only:

  • Alphabetic: sort alphabetically by function name.

Views Menu

The Views menu in the Performance Analyzer provides the following selections for viewing the performance data from an experiment. Each view displays the data for the time interval bracketed by the calipers in the time line.

Executable Menu

If you enabled Track Exec'd Processes for the current experiment, the Executable menu will be enabled and will contain selections for any exec'd processes. (The Track Exec'd Processes selection is in the Performance panel of the Executable menu.) These selections let you see the performance results for the other executables.


Note: The Executable menu is not enabled by an experiment generated by the Select Task submenu in the Perf menu of the WorkShop Debugger window, the ssrun(1) command, or any other method using SpeedShop functionality. It can only be enabled by experiments generated in older versions of WorkShop.


Thread Menu

If your process forked any processes, the Thread menu is activated and contains selections corresponding to the different threads. Selecting a thread displays its performance results.


Note: The Thread menu is not enabled by an experiment generated by the Select Task submenu in the Perf menu of the WorkShop Debugger window, the ssrun(1) command, or any other method using SpeedShop functionality. It can only be enabled by experiments generated in older versions of WorkShop.


Usage View (Graphs)

The Usage View (Graphs) window displays resource usage and event charts containing the performance data from the experiment. These charts show resource usage over time and indicate where sample events took place. Sample events are shown as vertical lines. Figure 5-4 shows the Usage View (Graphs) window.

Figure 5-4. Usage View (Graphs) Window

Charts in the Usage View (Graphs) Window

The available charts in the Usage View (Graphs) Window are as follows:

  • User versus system time: shows CPU use. Whenever the system clock ticks, the process occupying the CPU is charged for the entire ten-millisecond interval. The time is charged either as user or system time, depending on whether the process is executing in user mode or system mode. The graph provides these annotations to show how time is spent during an experiment's process: Running (user mode), Running (system mode), Running (graphics mode), Waiting (for block I/O), Waiting (raw I/O, paging), Waiting (for memory), Waiting (in select), Waiting in CPU queue, Sleep (for resource), Sleep (for stream monitor), and Stopped (job control).

  • Page Faults: shows the number of page faults that occur within a process. Major faults are those that require a physical read operation to satisfy; minor faults are those where the necessary page is already in memory but not mapped into the process address space.

    Each major fault in a process takes approximately 10 to 50 milliseconds. A high page-fault rate is an indication of a memory-bound situation.

  • Context Switch: shows the number of voluntary and involuntary context switches in the life of the process.

    Voluntary context switches are attributable to an operation caused by the process itself, such as a disk access or waiting for user input. These occur when the process can no longer use the CPU. A high number of voluntary context switches indicates that the process is spending a lot of time waiting for a resource other than the CPU.

    Involuntary context switches happen when the system scheduler gives the CPU to another process, even if the target process is able to use it. A high number of involuntary context switches indicates a CPU contention problem.

  • KBytes Read and KBytes Written: shows the number of bytes transferred between the process and the operating system buffers, network connections, or physical devices. KBytes Read are transferred into the process address space; KBytes Written are transferred out of the process address space. A high byte-transfer rate indicates an I/O-bound process.

  • read() calls and write() calls: shows the number of read and write system calls made by the process.

  • poll() calls and ioctl() calls: shows the combined number of poll or select system calls (used in I/O multiplexing) and the number of I/O control system calls made by the process.

  • System Calls: shows the total number of system calls made by the process. This includes the counts for the calls shown on the other charts.

  • Signals: shows the total number of signals received by the process.

  • Total Size and Resident Size: shows the total size of the process in pages and the number of pages resident in memory at the end of the time interval when the data is read. It is different from the other charts in that it shows the absolute size measured at the end of the interval and not an incremental count for that interval.

    If you see the process total size increasing over time when your program should be in a steady state, the process most likely has leaks and you should analyze it using Leak View and Malloc View.

Getting Event Information from the Usage View (Graphs) Window

The charts only indicate trends. To get detailed data, click the relevant area on the chart; the data displays at the top of the window. The left mouse button displays event data; the right mouse button displays interval data.

When you click the left mouse button on a sample event in a chart, the following actions take place:

  • The point becomes selected, as indicated by the diamond marker above it. The marker appears in the time line, resource usage chart, and Usage View (Graphs) charts if the window is open.

  • The current event line at the top of the window identifies the event and displays its time.

  • The call stack that corresponds to this sample point is displayed in the Call Stack window (see “The Call Stack Window”).

Clicking a graph with the right mouse button displays the values for the interval if a collection is specified. If a collection is not specified, clicking a graph with the right mouse button displays the interval bracketed by the nearest sample events.

The Process Meter Window

The process meter lets you observe resource usage for a running process without conducting an experiment. To call the process meter, select Process Meter from the Views menu in the WorkShop Debugger window.

A Process Meter window with data and its menus displayed appears in Figure 5-5. The Process Meter window uses the same Admin menu as the WorkShop Debugger tools.

The Charts menu options display the selected stripcharts in the Process Meter window.

The Scale menu adjusts the time scale in the stripchart display area such that the time selected becomes the end value.

You can select which usage charts and event charts display. You can also display sample point information in the Status field by clicking within the charts.

Figure 5-5. The Process Meter Window with Major Menus Displayed

Usage View (Numerical) Window

The Usage View (Numerical) window shows detailed, process-specific resource usage information in a textual format for a specified interval. The interval is defined by the calipers in the time line area of the Performance Analyzer main window. To display the Usage View (Numerical) window, select Usage View (Numerical) from the Views menu.

The top of the window identifies the beginning and ending events for the interval. The middle portion of the window shows resource usage for the target executable. The bottom panel shows resource usage on a system-wide basis. Data is shown both as total values and as per-second rates.

The I/O View Window

The I/O View window helps you determine the problems in an I/O-bound process. It produces graphs of all I/O system calls for up to 10 files involved in I/O. Clicking an I/O event with the left mouse button displays information about it in the event identification field at the top of the I/O View window.

For a list of the system calls traced, see “I/O Trace” in Chapter 4.

The MPI Stats View (Graphs) Window

The MPI Stats View (Graphs) window displays information on as many as 32 aspects of an MPI program in graph format. For an illustration of the window, see Figure 2-6.

If a graph contains nothing but zeros, it is not displayed.

In the following list of information that may be displayed in the graphs, shared memory refers to memory in a multiprocessor system that can be accessed by any processor. The High Performance Parallel Interface (HIPPI) is a network link, often used to connect computers; it is slower than shared memory transfers but faster than TCP/IP transfers. TCP/IP is a networking protocol that moves data between two systems on the Internet.

Collective calls are those that move a message from one processor to multiple processors or from multiple processors to one processor. MPI_Bcast(3) is a collective call. A point-to-point call, such as MPI_Send(3) or MPI_Ssend(3), moves a message from one processor to one processor.
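
The distinction matters when reading the graphs, since collective and point-to-point traffic are reported separately. The following minimal C sketch (a hypothetical example; it assumes at least two MPI processes and omits error handling) shows one call of each kind:

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, value = 42;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Collective: rank 0 broadcasts the value to every process. */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Point-to-point: rank 0 sends the value to rank 1 only. */
    if (rank == 0)
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);

    MPI_Finalize();
    return 0;
}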


Note: The MPI tracing experiment does not track communicators, and it does not trace all collective operations.


The following information can be displayed in the MPI Stats View (Graphs) window.

  • Retries in allocating MPI headers per procedure for collective calls

  • Retries in allocating MPI headers per host for collective calls

  • Retries in allocating MPI headers per procedure for point-to-point calls

  • Retries in allocating MPI headers per host for point-to-point calls

  • Retries in allocating MPI buffers per procedure for collective calls

  • Retries in allocating MPI buffers per host for collective calls

  • Retries in allocating MPI buffers per procedure for point-to-point calls

  • Retries in allocating MPI buffers per host for point-to-point calls

  • The number of send requests using shared memory for collective calls

  • The number of send requests using shared memory for point-to-point calls

  • The number of send requests using a HIPPI bypass for collective calls

  • The number of send requests using a HIPPI bypass for point-to-point calls

  • The number of send requests using TCP/IP for collective calls

  • The number of send requests using TCP/IP for point-to-point calls

  • The number of data buffers sent using shared memory for point-to-point calls

  • The number of data buffers sent using shared memory for collective calls

  • The number of data buffers sent using a HIPPI bypass for point-to-point calls

  • The number of data buffers sent using a HIPPI bypass for collective calls

  • The number of data buffers sent using TCP/IP for point-to-point calls

  • The number of data buffers sent using TCP/IP for collective calls

  • The number of message headers sent using shared memory for point-to-point calls

  • The number of message headers sent using shared memory for collective calls

  • The number of message headers sent using a HIPPI bypass for point-to-point calls

  • The number of message headers sent using a HIPPI bypass for collective calls

  • The number of message headers sent using TCP/IP for point-to-point calls

  • The number of message headers sent using TCP/IP for collective calls

  • The total number of bytes sent using shared memory for point-to-point calls

  • The total number of bytes sent using shared memory for collective calls

  • The total number of bytes sent using a HIPPI bypass for point-to-point calls

  • The total number of bytes sent using a HIPPI bypass for collective calls

  • The total number of bytes sent using TCP/IP for point-to-point calls

  • The total number of bytes sent using TCP/IP for collective calls

The MPI Stats View (Numerical) Window

The MPI Stats View (Numerical) window displays the same information as the MPI Stats View (Graphs) window (see the preceding section), but it presents it in text form.

Unlike the MPI Stats View (Graphs) window, this window includes all of the data, whether or not it is zero.

The Parallel Overhead View Window

The Parallel Overhead View window shows the overhead incurred by a parallel program. MPI, OpenMP, and pthread parallel programming models are supported. The following figure illustrates the overhead for the total.f Fortran program, located in the /usr/demos/WorkShop/mp directory.

Figure 5-6. Overhead View

The following list describes each of the data items for this OpenMP demo. Other programming models generate slightly different data.

  • Parallelization Overhead: The percentage of the total overhead time spent making the code parallel. In the example, this time is negligible.

  • Load Imbalance: The percentage of the overhead time caused by load imbalance. Load imbalance means the parallel work is not evenly distributed among the processors, causing some processors to wait while the others finish their tasks.

  • Insufficient Parallelism: The percentage of the overhead time spent in regions of the code that are not parallel.

  • Barrier Loss: The percentage of overhead time consumed by the barrier mechanism. This is not the time spent waiting at a barrier.

  • Synchronization Loss: The percentage of the overhead time consumed by synchronization mechanisms other than barriers.

  • Other Model-specific Overhead: The percentage of the overhead time due to other operations of the parallel programming model, in this case OpenMP.

Overhead data is collected automatically when you create an experiment file. To see the total picture, aggregate the experiment files from each processor into a single file, as follows:

% ssaggregate -e total.usertime* -o userout

Then view the single output file, userout, through cvperf.

The Call Graph View Window

The Call Graph View window displays the functions as nodes, annotated with performance metrics, and their calls as connecting arcs (see Figure 5-7). Bring up the Call Graph View window by selecting Call Graph View from the Views menu.

Figure 5-7. Call Graph View with Display Controls

Since a call graph can get quite complicated, the Performance Analyzer provides various controls for changing the graph display. The Preferences selection in the Config menu lets you specify which performance metrics display and also lets you filter out unused functions and arcs. There are two node menus in the display area; these let you filter nodes individually or as a selected group. The top row of display controls is common to all MIPSpro WorkShop graph displays. It lets you change scale, alignment, and orientation. The bottom row of controls lets you define the form of the graph. You can view the call graph as a butterfly graph, showing the functions that call and are called by a single function, or as a chain graph between two functions.

Special Node Icons

Although rare, nodes can be annotated with two types of graphic symbols:

  • A right-pointing arrow in a node indicates an indirect call site. It represents a call through a function pointer. In such a case, the called function cannot be determined by the current methods.

  • A circle in a node indicates a call to a shared library with a data-space jump table. The node name is the name of the routine called, but the actual target in the shared library cannot be identified. The table might be switched at run time, directing calls to different routines.

Annotating Nodes and Arcs

You can specify which performance metrics appear in the call graph, as described in the following list:

  • Node Annotations: to specify the performance metrics that display inside a node, use the Preferences dialog box in the Config menu from the Performance Analyzer main view.

  • Arc Annotations: arc annotations are specified by selecting Preferences... from the Config menu in the Call Graph View window. You can display the counts on the arcs (the lines between the functions). You can also display the percentage of calls to a function broken down by incoming arc. For an explanation of the performance metric items, see “Config Menu”.

Filtering Nodes and Arcs

You can specify which nodes and arcs appear in the call graph as described in the following list:

  • Call Graph Preferences Filtering Options: the Preferences selection in the Call Graph View Config menu also lets you hide functions and arcs that have 0 calls.

  • Node Menu: there are two node menus for filtering nodes in the graph: the Node menu and the Selected Nodes menu. Both menus are shown in Figure 5-8.

    The Node menu lets you filter a single node. It is displayed by holding down the right mouse button while the cursor is over the node. The name of the selected node appears at the top of the menu.

    Figure 5-8. Node Menus

    The following list describes the Node menu selections:

    • Hide Node: removes the selected node from the call graph display

    • Collapse Subgraph: removes the nodes called by the selected node (and subsequently called nodes) from the call graph display

    • Show Immediate Children: displays the functions called by the selected node

    • Show Parents: displays all the functions that call the selected node

    • Show All Children: displays all the functions and the descendants called by the selected node

  • Selected Nodes Menu: the Selected Nodes menu lets you filter multiple nodes. You can select multiple nodes by dragging a selection rectangle around them. You can also Shift-click a node, and it will be selected along with all the nodes that it calls. Holding down the right mouse button anywhere in the graph, except over a node, displays the Selected Nodes menu. The following list describes the menu selections:

    • Hide: removes the selected nodes from the call graph display

    • Collapse: removes the nodes called by the selected nodes (and descendant nodes) from the call graph display

    • Expand: displays all the functions (descendants) called by the selected nodes

Filtering Nodes through the Display Controls

The lower row of controls in the Call Graph View panel helps you reduce the complexity of a busy call graph.

You can perform these display operations:

  • Butterfly: presents the call graph from the perspective of a single node (the target node), showing only those nodes that call it or are called by it. Functions that call it are displayed to the left, and functions it calls are on the right. Selecting any node and clicking Butterfly redraws the display with the selected node in the center. The selected node is displayed and highlighted in the function list.

  • Chain: lets you display all paths between a given source node and target node. The Chain dialog box is shown in Figure 5-9. You designate the source function by selecting it or entering it in the Source Node field and clicking the Make Source button. Similarly, the target function is selected or entered and then established by clicking the Make Target button. If you want to filter out paths that go through nodes and arcs with zero counts, click the toggle. After these selections are made, click OK.

    Figure 5-9. Chain Dialog Box

  • Prune Chains: displays a dialog box that provides two selections for filtering paths from the call graph (see Figure 5-10).

    The Prune Chains button is only activated when a chain mode operation has been performed. The dialog box selections are:

    • The Hide Paths Through toggle removes from view all paths that go through the specified node. You must have a current node specified. Note that this operation is irreversible; you will not be able to redisplay the hidden paths unless you perform the Chain operation again.

    • The Hide Paths Not Through toggle removes from view all paths except the ones that go through the specified node. This operation is irreversible.

    Figure 5-10. Prune Chains Dialog Box

  • Important Children: lets you focus on a function and its descendants and set thresholds to filter the descendants. You can filter the descendants either by percentage of the caller's time or by percentage of the total time. The Threshold key field identifies the type of performance time data used as the threshold. See Figure 5-11.

    Figure 5-11. Show Important Children Dialog Box

  • Important Parents: lets you focus on the parents of a function, that is, the functions that call it. You can set thresholds to filter only those parents making a significant number of calls, by percentage of the caller's time, or by percentage of the total time. The Threshold key field identifies the type of performance time data used as the threshold.

  • Clear Graph: removes all nodes and arcs from the call graph.

Other Manipulation of the Call Graph

The Call Graph View window provides facilities for changing the display of the call graph without changing the data content.

Geometric Manipulation through the Control Panel

The controls for changing the display of the call graph are in the upper row of the control panel (see Figure 5-12).

Figure 5-12. Call Graph View Controls for Geometric Manipulation

These controls are:

  • Zoom menu button: shows the current scale of the graph. If you click this button, a pop-up menu appears displaying other available scales. The scaling range is between 15% and 200% of the normal (100%) size.

  • Zoom out button: resets the scale of the graph to the next (available) smaller size in the range.

  • Zoom in button: resets the scale of the graph to the next (available) larger size in the range.

  • Overview button: invokes an overview pop-up display that shows a scaled down representation of the graph. The nodes appear in the analogous places on the overview pop-up, and a white outline can be used to position the main graph relative to the pop-up. Alternatively, the main graph may be repositioned by using its scroll bars.

  • Realign button: redraws the graph, restoring the original positions of any nodes that were moved.

  • Rotate button: flips the orientation of the graph between horizontal (calling nodes at the left) and vertical (calling nodes at the top).

For more information on the graphical controls, see the ProDev WorkShop: Overview manual.

Using the Mouse in the Call Graph View

You can move an individual node by dragging it using the middle mouse button. This helps reveal obscured arc annotations.

You can select multiple nodes by dragging a selection rectangle around them. Shift-clicking a node selects the node along with all the nodes that it calls.

Selecting Nodes from the Function List

You can select functions from the function list of the Performance Analyzer window to be highlighted in the call graph. Select a node from the list and then click the Show Node button in the Function List window. The node will be highlighted in the graph.

Butterfly View

The Butterfly View shows a selected function, the functions that called it (the Immediate Parents), and the functions it calls (the Immediate Children). For an illustration, see Figure 2-9.

You can change the selected function by clicking on a new one in the function list area of the main Performance Analyzer window.

The Attrib.% column shows the percentage of the sort key (inclusive time, in the illustration) attributed to each caller or callee. The sort key varies according to the view; on an I/O View, for instance, it is by default inclusive bytes read. You can change the criteria for what is displayed in the columns and how the list is ordered by using the Preferences... and Sort... options, both of which are accessed through the Config menu on the main Performance Analyzer menu.

If you want to save the data as text, select Save As Text... from the Admin menu.

Analyzing Memory Problems

The Performance Analyzer provides four tools for analyzing memory problems: Malloc Error View, Leak View, Malloc View, and Heap View. Setting up and running a memory analysis experiment is the same for all four tools. After you have conducted the experiment, you can apply any of these tools.

A memory leak occurs when memory that is allocated in the program is not freed later. As a result, the size of the program grows unnecessarily.

Using Malloc Error View, Leak View, and Malloc View

After you have run a memory experiment using the Performance Analyzer, you can analyze the results using Malloc Error View, Leak View, or Malloc View (see Figure 5-13). Malloc View is the most general, showing all memory allocation operations. Malloc Error View shows only those memory operations that caused problems, identifying the cause of the problem and how many times it occurred. Leak View displays each memory leak that occurs in your executable, its size, the number of times the leak occurred at that location during the experiment, and the corresponding call stack (when you select the leak).

Each of these views has three major areas:

  • Identification area: this indicates which operation has been selected from the list. Malloc View identifies malloc routines, indicating the number of malloc locations and the size of all malloc operations in bytes. Malloc Error View identifies leaks and bad free routines, indicating the number of error locations and how many errors occurred in total. Leak View identifies leaks, indicating the number of leak locations and the total number of bytes leaked.

  • List area: this is a list of the appropriate types of memory operations according to the type of view. Clicking an item in the list identifies it at the top of the window and displays its call stack at the bottom of the list. The list displays in order of size.

  • Call stack area: this displays the contents of the call stack when the selected memory operation occurred. Figure 5-14 shows a typical Source View window with leak annotations. (You can change the annotations by using the Preferences... selection in the Performance Analyzer Config menu.) Colored boxes draw attention to high counts.


Note: As an alternative to viewing leaks in Leak View, you can save one or more memory operations as a text file. Choose Save As Text... from the Admin menu, select one or more entries, and view them separately in a text file along with their call stacks. Multiple items are selected by clicking the first and then either dragging the cursor over the others or shift-clicking the last in the group to be selected.


Figure 5-13. Malloc View Window with Admin Menu

Figure 5-14. Source View Window with Memory Analysis Annotations

Analyzing the Memory Map with Heap View

The Heap View window lets you analyze data from experiments based on the Memory Leak Trace task. The Heap View window provides a memory map that shows memory problems occurring in the time interval defined by the calipers in the Performance Analyzer window. The map indicates the following memory block conditions:

  • malloc: reserved memory space

  • realloc: reallocated space

  • free: open space

  • error: bad free space

  • unused space

In addition to the Heap View memory map, you can analyze memory leak data using these other tools:

  • If you select a memory problem in the map and bring up the Call Stack window, it will show you where the selected problem took place and the state of the call stack at that time.

  • The Source View window shows, by source line, exclusive and inclusive malloc routines, leaks, and the number of bytes used.

Heap View Window

A typical Heap View window with its parts labeled appears in Figure 5-15.

Figure 5-15. Heap View Window


The following list describes the major features of a Heap View window:

  • Map key: appears at the top of the heap map area to identify blocks by color. The actual colors depend on your color scheme.

  • Heap map: shows heap memory as a continuous, wrapping, horizontal rectangle. The memory addresses begin at the upper left corner and progress from left to right, row by row. The rectangle is broken up into color-coded segments according to memory use status. Clicking a highlighted area in the heap map identifies the type of problem, the memory address where it occurred, its size in the event list area, and the associated call stack in the call stack display area.

    Note in Figure 5-15 that there are only a few problems in the memory at the lower addresses and many more at the higher addresses.

  • Memory event indicators: the events appear color-coded in the scroll bar. Clicking an indicator with the middle button scrolls the display to the selected problem.

  • Search field: provides two functions:

    • If you enter a memory address in the field, the corresponding position is highlighted in the heap map. If there is a problem at that location, it is identified in the event list area. If there is no problem, the event list area displays the address at the beginning of the memory block and its size.

    • If you hold down the left mouse button and position the cursor in the heap map, the corresponding address is displayed in the Search field.

  • Event list area: displays the events occurring in the selected block. If only one event was received at the given address, it is shown by default. If more than one event is listed, double-clicking an event displays its corresponding call stack.

  • Call stack area: displays the call stack corresponding to the event highlighted in the event list area.

  • Malloc Errors button: causes malloc errors and their addresses to be displayed in the event list area. You can then enter the address of a malloc error in the Search field and press the Enter key to see the error's malloc information and its associated call stack.

  • Zoom in button: an upward-pointing arrow, it redisplays the heap area at twice the current size of the display (to a limit of one pixel per byte). If you reach the limit, an error message is displayed.

  • Zoom out button: a downward-pointing arrow, it redisplays the heap area at half the current size of the display. If you reach the limit, an error message is displayed.

Source View malloc Annotations

As in Malloc View, double-clicking a line in the call stack area of the Heap View window causes the Source View window to display the portion of code containing the corresponding line. The line is highlighted and indicated by a caret (^), with the number of bytes used by malloc shown in the annotation column. See Figure 5-14.

Saving Heap View Data as Text

Selecting Save As Text... from the Admin menu in Heap View lets you save the heap information or the event list in a text file. When you first select Save As Text..., a dialog box appears, asking you to specify heap information or the event list. After you make your selection, the Save Text dialog box appears (see Figure 5-16), letting you select the file name in which to save the Heap View data. The default file name is experiment-filename.out. When you click OK, the data for the current caliper setting and the list of unmatched free routines, if any, are appended to the specified file.


Note: The Save As Text... selection in the File menu for the Source View saves the current file. No file name default is provided, and the file that you name will be overwritten.


Figure 5-16. Heap View Save Text Dialog Boxes


The Call Stack Window

The Call Stack window, which is accessed from the Performance Analyzer Views menu, lets you get call stack information for a sample event selected from one of the Performance Analyzer views. See Figure 5-17.

Figure 5-17. Performance Analyzer Call Stack Window


There are three main areas in the Call Stack window:

  • Event identification area: displays the number of the event, its time stamp, and the time within the experiment. If you have a multiprocessor experiment, the thread will be indicated here.

  • Call stack area: displays the contents of the call stack when the sample event took place.

  • Event type area: highlights the type of event and shows the thread in which it was defined. It indicates, in parentheses, whether the sample was taken in all threads or the indicated thread only.

Analyzing Working Sets

If you suspect a problem with frequent page faults or instruction cache misses, conduct a working set analysis to determine if rearranging the order of your functions will improve performance.

The term working set refers to those executable pages, functions, and instructions that are actually brought into memory during a phase or operation of the executable. If more pages are required than can fit in memory at the same time, page thrashing (that is, the constant swapping of pages in and out of memory) may result, slowing down your program. Strategically choosing the pages on which functions appear can dramatically improve performance in such cases.
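As a hypothetical illustration of the potential gain: suppose a phase executes 300 KB of function code, but those functions are scattered across 60 pages on a system with 16 KB pages, keeping 960 KB resident. Packed contiguously, the same functions would fit on about 300/16, or 19, pages, shrinking that phase's working set by roughly two-thirds. (The sizes and page count here are assumed for illustration only.)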

You control this placement by creating a cord mapping file: a file listing functions together with their sizes and addresses. The functions should be ordered so as to optimize page swapping efficiency. This file is then fed to the cord utility, which rearranges the functions in the executable according to the order given in the cord mapping file. See the cord(1) man page for more information.

Working set analysis is appropriate for:

  • Programs that run for a long time

  • Programs whose operation comes in distinct phases

  • Dynamic shared objects (DSOs) that are shared among several programs

Working Set Analysis Overview

WorkShop provides two tools to help you conduct working set analysis:

  • Working Set View is part of the Performance Analyzer. It displays the working set of pages for each DSO that you select and indicates the degree to which the pages are used.

  • The cord analyzer, sscord(1), is separate from the Performance Analyzer and is invoked by typing sscord at the command line. It displays a list of the working sets that make up a cord mapping file, shows their utilization efficiency, and, most importantly, computes an optimized ordering to reduce working sets.

Figure 5-18 presents an overview of the process of conducting working set analysis.

Figure 5-18. Working Set Analysis Process


First, conduct one or more Performance Analyzer experiments using the Ideal Time/Pixie task. Set sample traps at the beginning and end of each operation or phase that represents a distinct task. You can run additional experiments on the same executable to collect data for other scenarios in which it is used.

After you have collected the data for the experiments, run the Performance Analyzer and select Working Set View. Save the working set for each phase or operation that you want to improve. Do this by setting the calipers to bracket each phase and selecting Save Working Set from the Admin menu.

Select Save Cord Map File to save the cord mapping file (for all runs and caliper settings). This need only be done once.

The next step is to create the working set list file, which names all of the working sets you want to analyze with the cord analyzer. Create the working set list file in a text editor, specifying one line for each working set, in reverse order of priority; that is, the most important working set comes last.
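For example, a working set list file covering three phases might look like the following (the path names are hypothetical):

/usr/people/me/expt/startup.ws
/usr/people/me/expt/report.ws
/usr/people/me/expt/mainloop.ws

Because the entries are in reverse order of priority, mainloop.ws is treated as the most important working set.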

The working set list and the cord mapping file serve as input to the cord analyzer. The working set list provides the cord analyzer with working sets to be improved. The cord mapping file provides a list of all the functions in the executable. The cord analyzer displays the list of working sets and their utilization efficiency. It lets you do the following:

  • Construct gray-code cording feedback (the preferred method).

  • Examine the page layout and the efficiency of each working set with respect to the original ordering of the executable.

  • Construct union and intersection sets as desired.

  • View the efficiency of a different ordering.

  • Construct a new cord mapping file as input to the cord utility.

If you have a new order that you would like to try out, edit your working set list file in the desired order, submit it to the cord analyzer, and save a new cord mapping file for input to cord.

Working Set View

The Working Set View measures the coverage of the dynamic shared objects (DSOs) that make up your executable (see Figure 5-19). It indicates instructions, functions, and pages that were not used when the experiment was run. It shows the coverage results for each DSO in the DSO list area. Clicking a DSO in the list displays its pages with color-coding to indicate the coverage of the page.

Figure 5-19. Working Set View


DSO List Area

The DSO list area displays coverage information for each DSO used by the executable. It has the following columns:

  • Text or DSO Region Name: identifies the DSO.

  • Ideal Time: lists the percentage of ideal time for the caliper setting attributed to the DSO.

  • Counts of: Instrs.: lists the number of instructions contained in the DSO.

  • Counts of: Funcs.: lists the number of functions contained in the DSO.

  • Counts of: Pages: lists the number of pages occupied by the DSO.

  • % Coverage of: Instrs.: lists the percentage obtained by dividing the number of instructions used by the total number of instructions in the DSO.

  • % Coverage of: Funcs.: lists the percentage obtained by dividing the number of functions used by the total number of functions in the DSO.

  • % Coverage of: Pages: lists the coverage obtained by dividing the number of pages touched by the total number of pages in the DSO.

  • Avg. Covg. of Touched: Pages: lists the coverage obtained by dividing the number of instructions executed by the total number of instructions on the pages that were touched in the DSO.

  • Avg. Covg. of Touched: Funcs: lists the average percentage use of instructions within used functions.

The Search field lets you perform incremental searches to find DSOs in the DSO list. (An incremental search goes to the immediately matching target as you enter each character.)

DSO Identification Area

The DSO identification area shows the address, size, and page information for the selected DSO. It also displays the address, number of instructions, and coverage for the page selected in the page display area.

Page Display Area

The page display area at the bottom of the Working Set View window shows all the pages in the DSO and indicates untouched pages, unused functions, executed instructions, unused instructions, and table data (related to rld(1)). It also includes a color legend at the top to indicate how pages are used.

Clicking a page displays its address, number of instructions, and coverage data in the identification area. Clicking a function in the function list of the main Performance Analyzer window highlights (using a solid rectangle) the page on which the function begins. Clicking the left mouse button on a page indicates the first function on the page by highlighting it in the function list area of the Performance Analyzer window. Similarly, clicking the middle button on a page highlights the function at the middle of the page, and clicking the right button highlights the function at the end of the page. For all three button clicks, the page containing the beginning of the function becomes highlighted. Note that left clicks typically highlight the page before the one clicked, because the function containing the page's first instruction usually starts on the previous page.

Admin Menu

The Admin menu of the Working Set View window has the following menu selections:

  • Save Working Set: saves the working set for the selected DSO. You can incorporate this file into a working set list file to be used as input to the Cord Analyzer.

  • Save Cord Map File: saves all of the functions in the DSOs in a cord mapping file for input to the Cord Analyzer. This file corresponds to the feedback file discussed on the cord(1) man page.

  • Save Summary Data as Text: saves a text file containing the coverage statistics in the DSO list area.

  • Save Page Data as Text: saves a text file containing the coverage statistics for each page in the DSO.

  • Save All Data as Text: saves a text file containing the coverage statistics in the DSO list area and for each page in the selected DSO.

  • Close: closes the Working Set View window.

Cord Analyzer

The cord analyzer is not actually part of the Performance Analyzer; it is discussed in this part of the manual because it works in conjunction with the Working Set View. The cord analyzer lets you explore the working set behavior of an executable or shared library (DSO). With it you can construct a feedback file for input to the cord(1) utility to generate an executable with improved working set behavior.

Invoke the cord analyzer at the command line using the following syntax:

sscord -fb fb_file -wsl ws_list_file -ws ws_file -v|-V executable

The sscord command accepts the following arguments:

  • -fb fb_file: specifies a single text file to use as a feedback file for the executable. It should have been generated either from a Performance Analyzer experiment on the executable or DSO, or from the cord analyzer. If no -fb argument is given, the feedback file name will be generated as executable.fb.

  • -wsl ws_list_file: specifies a single text file name as input; the working set list consists of the working set files whose names appear in the input file. Each file name should be on a separate line.

  • -ws ws_file: specifies a single working set file name.

  • -v|-V: verbose output. If specified, mismatches between working sets and the executable or DSO are noted.

  • executable: specifies a single executable file name as input.
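For example, the following command line (the file names are hypothetical) runs the cord analyzer on the executable myprog, reads the working sets listed in the file myprog.wsl, and reports any mismatches between the working sets and the executable:

sscord -wsl myprog.wsl -v myprog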

The Cord Analyzer window is shown in Figure 5-20, with its major areas and menus labeled.

Working Set Display Area

The working set display area of the Cord Analyzer window shows all of the working sets included in the working set list file. It has the following columns:

  • Working-set pgs. (util. %): lists the number of pages in the working set and the percentage of page space that is utilized.

  • cord'd set pgs: specifies the minimum number of pages for this set, that is, the number of pages the working set would occupy if the program or DSO were reordered optimally for that specific working set.

  • Working-set Name: identifies the path for the working set.

Note that when the function list is displayed, double-clicking a function displays a plus sign (+) in the working set display area to the left of any working sets that contain the function.

Working Set Identification Area

The working set identification area shows the name of the selected working set. It also shows the number of pages in the working set list, in the selected working set, and in the corded working set, and the number of pages used as tables. It also provides the address for the selected page, its size, and its coverage as a percentage.

Figure 5-20. The Cord Analyzer Window


Page Display Area

The page display area at the bottom of the window shows the starting address for the DSO and its pages, and their use in terms of untouched pages, unused functions, executed instructions, unused instructions, and table data (related to rld(1)). It includes a color legend at the top to indicate how pages are used.

Function List

The Function List window displays all the functions in the selected working set. It contains the following columns:

  • Use: count of the working sets containing the function.

  • Address: starting address for the function.

  • Insts.: number of instructions in the function.

  • Function (File): name of the function and the file in which it occurs.

When the Function List window is displayed, clicking a working set in the working set display area displays a plus sign (+) in the function list to the left of any functions that the working set contains. Similarly, double-clicking a function displays a plus sign in the working set display area to the left of any working sets that contain the function.

The Search field lets you do incremental searches for a function in the Function List window.

Admin Menu

The Admin menu contains the standard Admin menu commands in WorkShop views, plus one command specific to the cord analyzer: Save Working Set List, which saves a new working set list incorporating any changes you made during the session.

File Menu

The File menu contains the following selections:

  • Delete All Working Sets: removes all the working sets from the working set list. It does not delete any files.

  • Delete Selected Working Set: removes the selected working set from the working set list.

  • Add Working Set: includes a new working set in the working set list.

  • Add Working Set List from File: adds the working sets from the specified list file to the current working set list.

  • Construct Gray-code Cording Feedback: generates an ordering to minimize the working sets, placing the highest-priority set first. It compacts each set and orders it to minimize the transitions between each set and the one that follows. Gray-code ordering is believed to be superior to weighted ordering, but you may want to experiment with both.

  • Construct Weighted Cording Feedback: finds as many distinct affinity sets as it can and orders them to minimize the working sets for their operations in a weighted priority order.

  • Construct Union of Selected Sets: displays a new working set built as a union of working sets. This is the same as an OR of the working sets.

  • Construct Intersection of Selected Sets: displays a new working set built from the intersection of the specified working sets. This is the same as an AND of the working sets.

  • Read Feedback File: loads a new cord mapping file into the Cord Analyzer.