Open|SpeedShop is an advanced, open and free profiling tool which allows you to profile codes that are serial, thread-parallel or MPI-parallel. It can profile codes that have been compiled with debugging symbols (-g) as well as those without; the difference appears in the level of detail available. It's a modular profiler whose capabilities adapt to the tools available. For example, if hardware counters are accessible via PAPI (Performance Application Programming Interface), then Open|SpeedShop can take advantage of them to inform you about cache misses. Open|SpeedShop has a lightweight graphical interface that can be exported over X11, as well as a command line interface that can automatically gather statistics.
Profiling with Open|SpeedShop is based on the idea of experiments. An experiment is a program execution with a certain number of active metrics. The metrics are varied and include, for example, the time spent in each function, the number of calls to a function, the number of cache misses at each cache level, MPI calls, and filesystem reads and writes. The available metrics depend on the architecture used for the experiment.
Profiling MPI Applications
MPI applications can be profiled the same way as serial or thread-parallel applications (see below); you simply replace the name of the executable by mpirun binary_file .... Some examples are shown below. To be able to assemble the results from all the nodes, Open|SpeedShop needs to write its raw data to a shared filesystem. Since the default location is the /tmp directory, which normally isn't shared, you need to set the environment variable OPENSS_RAWDATA_DIR to a location shared by all the nodes. We recommend that you add the line (for the Bash shell)
[name@server $] export OPENSS_RAWDATA_DIR=$HOME
to your job submission script.
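For example, assuming a cluster scheduled with Slurm, a minimal submission script could look like the following sketch; the resource values and the executable name binary_file are placeholders:

```bash
#!/bin/bash
#SBATCH --ntasks=4        # number of MPI processes (placeholder value)
#SBATCH --time=00:30:00   # maximum walltime (placeholder value)

# Write the raw profiling data to a location shared by all the nodes,
# so that Open|SpeedShop can assemble the results.
export OPENSS_RAWDATA_DIR=$HOME

# Profile the MPI application with the usertime experiment.
ossusertime "mpirun ./binary_file"
```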
The main experiments allow you to time each function, to count events using hardware counters, to profile MPI function calls and to profile input/output operations on the filesystem.
Time Metric: ossusertime
The experiment ossusertime uses a sampling technique to determine the inclusive and exclusive real time spent in each function call. It also records the call stack, allowing you to determine the most costly paths. To carry out this experiment, use the command ossusertime:
[name@server $] ossusertime executable
When the experiment is executed, it creates a file named executable-usertime.openss. To display the results, execute the command:
[name@server $] openss -f executable-usertime.openss
You will get a window which looks like this one:
It shows the executable name and the number of threads used, as well as a list of the functions with the exclusive time, the inclusive time and the percentage of processor utilization for each function.
If the application has been compiled with debugging symbols, double-clicking on a function makes Open|SpeedShop display the source code, with indications of the time used by the different sections of the code, as you can see in this image:
You can also identify which execution flow paths are the costliest by clicking on the HC icon. You will obtain a window that looks like this:
If the application has been parallelized using threads, for example using OpenMP, you can also get information about the load balancing among the threads by clicking on the LB icon. You should see a window that looks like this:
You can thus determine whether a significant part of the code is being executed by just one thread. If this is the case, you will find that the maximum, minimum and mean exclusive execution times are all the same and correspond to the same thread number. By comparing the maximum, minimum and mean execution times, you can gauge how well your computation has been distributed across the thread pool.
Hardware Event Counters: osshwcsamp
The experiment osshwcsamp uses a statistical sampling technique along with the hardware counters that are available via PAPI in order to inform you of costly events like cache misses. A cache miss happens when the data needed by the processor aren't available in the CPU cache. The processor must therefore get these data from the system memory, a very lengthy process. In general, we can think of each level of cache (modern CPUs have three) as being roughly two times slower than the preceding one. So, the L3 cache is two times slower than the L2 cache, itself two times slower than the L1 cache. The system memory is around ten times slower than the L3 cache while disk access is roughly a million times slower than accessing the system memory.
| Storage Unit  | Access Time                      |
|---------------|----------------------------------|
| L1 Cache      | A few nanoseconds                |
| L2 Cache      | A few nanoseconds longer than L1 |
| L3 Cache      | A few nanoseconds longer than L2 |
| System Memory | ~10-100 nanoseconds              |
| Hard Disk     | ~10-100 milliseconds             |
Since cache size decreases as cache speed increases, it is generally not worthwhile to try to minimize the number of L1 cache misses: the L1 cache is barely 30 KB, divided between instructions and data. The best performance gains come from minimizing, in this order:
- Disk accesses
- Memory accesses (L3 cache misses)
- L3 cache accesses (L2 cache misses)
The experiment osshwcsamp allows you to profile the second and third points above by providing you with the L2 and L3 cache miss data. To carry out the experiment, use the command:
[name@server $] osshwcsamp <executable> <counters>
The list of available counters can be found with the command
[name@server $] papi_avail | grep Yes
Note that you should execute this command on a compute node (via an interactive job or inside a job submission script), because the available counters may differ from those on the head node.
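For example, assuming a cluster scheduled with Slurm, you could request an interactive job and list the counters there; the resource values are placeholders:
[name@server $] salloc --ntasks=1 --time=00:10:00
[name@node $] papi_avail | grep Yes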
The following experiment
[name@server $] osshwcsamp executable PAPI_L2_DCM,PAPI_L2_DCH
will, for example, record the misses and the successful accesses (hits) for the L2 data cache, while the experiment
[name@server $] osshwcsamp executable PAPI_L3_DCM,PAPI_L3_DCH
will record similar data but for the L3 cache.
When the experiment is complete, Open|SpeedShop will create a file named executable-hwcsamp.openss. If you open this file with the command
[name@server $] openss -f executable-hwcsamp.openss
you will see a window like the following:
MPI Function Calls: ossmpi and ossmpit
The experiments ossmpi and ossmpit allow you to analyze MPI function calls. To use them, run one of the following commands:
[name@server $] ossmpi "mpirun executable"
[name@server $] ossmpit "mpirun executable"
The difference between ossmpi and ossmpit is that ossmpit keeps a trace of the function calls, which allows you to determine the path that led to the MPI functions being called. Note: for Open|SpeedShop to be able to collect the data from several nodes, you must define the environment variable OPENSS_RAWDATA_DIR so that it points to a directory on a shared filesystem that all the nodes have access to. The default value of /tmp is normally local to each node and therefore unsuitable. For example, in Bash you could type
[name@server $] export OPENSS_RAWDATA_DIR=$HOME
When you open the output file executable-mpi-openmpi.openss in the graphical interface, the display will look like this:
As with thread-parallel applications, you can also get information on the load balancing with the LB icon:
Reading and Writing the Filesystem: ossio and ossiot
These two experiments analyze the I/O function calls of your program. You can run them with the commands
[name@server $] ossio executable
[name@server $] ossiot executable
or, in the case of an MPI application,
[name@server $] ossio "mpirun executable"
[name@server $] ossiot "mpirun executable"
As with the ossmpi and ossmpit experiments, the ossiot experiment adds information about the call stack, which allows you to trace the path of I/O calls in the application. Opening the output file in openss will display
with (in the case of MPI) information about the load-balancing,
Extracting Data from the Command Line
Open|SpeedShop provides a command line interface which sometimes allows you to extract information more easily than with the graphical interface. You can use the command line interface by adding the option -cli.
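For example, to examine a usertime experiment interactively, you can open its database with the -cli option and display the current view with the expview command; the session should look something like this:
[name@server $] openss -cli -f executable-usertime.openss
openss>>expview
openss>>exit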
It's also possible to automate the analysis of batch results with a script, using the option -batch. The following script is particularly useful for extracting the data of an experiment for each thread:
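A minimal sketch of such a script, assuming that batch mode (-batch) reads CLI commands from standard input when given an existing database with -f, and that the output of list -v threads can be reduced to one thread number per line (the grep filter below is an assumption about that format):

```bash
#!/bin/bash
# analyze.sh: extract per-thread data from an Open|SpeedShop database.
# Usage: ./analyze.sh <output_directory> <file.openss>

outdir="$1"
dbfile="$2"

# Create the output directory if it doesn't already exist.
mkdir -p "$outdir"

# Load-balancing data across functions (same information as the LB icon).
echo "expview -m loadbalance -v functions" | \
    openss -batch -f "$dbfile" > "$outdir/loadbalance"

# Get the list of thread numbers, then extract the data for each thread
# into a file named after the thread number.
for t in $(echo "list -v threads" | openss -batch -f "$dbfile" | grep -E '^[0-9]+$'); do
    echo "expview -t $t" | openss -batch -f "$dbfile" > "$outdir/$t"
done
```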
To summarize, this script gets the list of thread numbers from the Open|SpeedShop command list -v threads, then extracts the data for each thread via the command expview -t thread_number, as well as the load-balancing data using the command expview -m loadbalance -v functions. The script takes two arguments: the directory in which to write the data (which will be created if it doesn't already exist) and the name of the .openss file to analyze. For example,
[name@server $] ./analyze.sh results my_application-usertime.openss
The file results/loadbalance will contain the same information as the LB option using the graphical interface, e.g.
while the files with a numerical name will contain data specific to each of the threads, for example:
An equivalent script can be used for an MPI job, in order to extract the data for each of the MPI processes:
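Under the same assumptions, a sketch of the MPI version, using list -v ranks and the rank selector expview -r instead of the thread commands (the file name analyze_mpi.sh is hypothetical):

```bash
#!/bin/bash
# analyze_mpi.sh: extract per-rank data from an Open|SpeedShop database
# produced by an MPI experiment.
# Usage: ./analyze_mpi.sh <output_directory> <file.openss>

outdir="$1"
dbfile="$2"

mkdir -p "$outdir"

# Load-balancing data, reported here for each MPI rank.
echo "expview -m loadbalance -v functions" | \
    openss -batch -f "$dbfile" > "$outdir/loadbalance"

# Extract the data for each MPI rank into a file named after the rank.
for r in $(echo "list -v ranks" | openss -batch -f "$dbfile" | grep -E '^[0-9]+$'); do
    echo "expview -r $r" | openss -batch -f "$dbfile" > "$outdir/$r"
done
```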
In this case, the data in the loadbalance file refer to the MPI rank rather than the thread number: