Open SpeedShop


Description

Open|SpeedShop is an advanced, free and open profiling tool which allows you to profile serial, thread-parallel or MPI-parallel codes. It can profile codes that have been compiled with debugging symbols (-g) as well as those without; the difference appears in the level of detail available. It is a modular profiler whose capabilities adapt to the tools available. For example, if hardware counters are accessible via PAPI (Performance Application Programming Interface), then Open|SpeedShop can take advantage of them to report cache misses. Open|SpeedShop has a lightweight graphical interface that can be exported over X11, as well as a command line interface that can gather statistics automatically.
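For example, to obtain source-level detail in the reports described below, you can compile with debugging symbols before profiling. This is a minimal sketch; the compiler, source file and executable names are placeholders for your own build command:

[name@server $] gcc -g -O2 -o executable source.c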


Operating Principle

Profiling with Open|SpeedShop is based on the notion of an experiment. An experiment is a program execution with a certain number of active metrics. The metrics are varied and include, for example, the time spent in each function, the number of times each function is called, the number of cache misses at each cache level, MPI calls, reads and writes to the filesystem, and so on. The available metrics depend on the architecture used for the experiment.

Profiling MPI Applications

MPI applications can be profiled in the same way as serial or thread-parallel applications (see below); you simply replace the name of the executable with mpirun binary_file .... Some examples are shown below. To assemble the results from all the nodes, Open|SpeedShop needs to write its raw data to a shared filesystem. Since the default location is the /tmp directory, which normally isn't shared, you need to use the environment variable OPENSS_RAWDATA_DIR to specify a location shared by all the nodes. We recommend that you add the line (for the Bash shell)

[name@server $] export OPENSS_RAWDATA_DIR=$HOME


to your job submission script.
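For example, the relevant portion of a Bash submission script could look like the following sketch, here using the ossusertime experiment described below; the scheduler directives are omitted and depend on your cluster:

#!/bin/bash
# (scheduler directives for your cluster would go here)

# write the raw profiling data to a filesystem shared by all nodes
export OPENSS_RAWDATA_DIR=$HOME

# run the MPI application under the chosen experiment
ossusertime "mpirun executable"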

Main Experiments

The main experiments allow you to time each function, to count events using hardware counters, to profile MPI function calls and to profile input/output operations on the filesystem.

Time Metric: ossusertime

The experiment ossusertime uses a sampling technique to determine the inclusive and exclusive real time spent in each function call. It also records the call stack, allowing you to determine the most costly paths. To carry out this experiment, use the command ossusertime:

[name@server $] ossusertime executable


When the experiment is executed, it creates a file named executable-usertime.openss. To display the results, execute the command:

[name@server $] openss -f executable-usertime.openss


You will get a window which looks like this one:

Openspeedshop-usertime.png

It shows the executable name and the number of threads used, as well as a list of the functions with the exclusive and inclusive times and the percentage of processor utilization for each function.

If the application was compiled with debugging symbols, double-clicking on a function makes Open|SpeedShop display the source code, annotated with the time spent in each section, as you can see in this image:

Openspeedshop-source.png

You can also identify which execution flow paths are the costliest by clicking on the HC icon. You will obtain a window that looks like this:

Openspeedshop-hotcallpaths.png

If the application has been parallelized using threads, for example with OpenMP, you can also get information about the load balancing among the threads by clicking on the LB icon. You should see a window that looks like this:

Openspeedshop-loadbalance.png

You can thus determine whether a significant part of the code is being executed by just one thread. If this is the case, the maximum, minimum and mean exclusive execution times for the functions concerned will all be identical and will be attributed to the same thread number, since only that thread executed them. More generally, comparing the maximum, minimum and mean execution times lets you gauge how well the computation is distributed across the thread pool.

Hardware Event Counters: osshwcsamp

The experiment osshwcsamp uses a statistical sampling technique along with the hardware counters available via PAPI in order to report costly events such as cache misses. A cache miss happens when the data needed by the processor aren't available in the CPU cache; the processor must then fetch the data from system memory, which takes much longer. As a rough rule, each level of cache (modern CPUs have three) can be thought of as being about twice as slow as the preceding one: the L3 cache is about twice as slow as the L2 cache, which is itself about twice as slow as the L1 cache. System memory is around ten times slower than the L3 cache, while disk access is roughly a million times slower than system memory.

Typical access times for data storage units

Storage unit     Access time
L1 cache         a few nanoseconds
L2 cache         a few nanoseconds longer than L1
L3 cache         a few nanoseconds longer than L2
System memory    ~10-100 nanoseconds
Hard disk        ~10-100 milliseconds

Since the size of a cache decreases as its speed increases, it is generally not worthwhile to try to minimize the number of L1 cache misses: the L1 cache is scarcely 30 KB, divided between instructions and data. The best performance gain comes from minimizing, in this order:

  1. Disk accesses
  2. Memory accesses (L3 cache misses)
  3. L3 cache accesses (L2 cache misses)

The experiment osshwcsamp allows you to profile the second and third points above by providing you with the L2 and L3 cache miss data. To carry out the experiment, use the command:

[name@server $] osshwcsamp <executable> <counters>


The list of available counters can be found with the command

[name@server $] papi_avail | grep Yes


Note that you should execute this command on a compute node (by means of an interactive job or inside a job submission script), because the available counters may not be the same as on the head node.
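For instance, a job script could simply record the counter list from a compute node before running the experiment; this is a sketch, and the output file name counters.txt is arbitrary:

# inside a job submission script, running on a compute node
papi_avail | grep Yes > counters.txt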

The following experiment

[name@server $] osshwcsamp executable PAPI_L2_DCM,PAPI_L2_DCH


will record for example misses and successful accesses for the L2 data cache, while the experiment,

[name@server $] osshwcsamp executable PAPI_L3_DCM,PAPI_L3_DCH


will record similar data but for the L3 cache.

When the experiment is complete, Open|SpeedShop will create a file named executable-hwcsamp.openss. If you open this file with the command

[name@server $] openss -f executable-hwcsamp.openss


you will see a window like the following:

Openspeedshop-hwc.png

MPI Function Calls: ossmpi and ossmpit

The experiments ossmpi and ossmpit allow you to analyze MPI function calls. To use them, run one of the following commands:

[name@server $] ossmpi "mpirun executable"


or

[name@server $] ossmpit "mpirun executable"


The difference between ossmpi and ossmpit is that ossmpit also records the call stack, which lets you determine the path that led to each MPI function call. Note: for Open|SpeedShop to collect the data from several nodes, you must set the environment variable OPENSS_RAWDATA_DIR so that it points to a directory on a shared filesystem that all the nodes can access. The default value of /tmp is normally local to each node and therefore unsuitable. For example, in Bash you could type

[name@server $] export OPENSS_RAWDATA_DIR=$HOME
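
Putting these pieces together, the relevant lines of a Bash job script for an MPI trace experiment could look like the following sketch; the process count (-np 16) is a placeholder:

export OPENSS_RAWDATA_DIR=$HOME      # shared filesystem, visible from every node
ossmpit "mpirun -np 16 executable"   # the resulting .openss file is then opened with openss -f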


Opening the output file executable-mpi-openmpi.openss in the graphical interface gives a display that looks like this:

Openspeedshop-ossmpi.png

As with thread-parallel applications, you can also get information on the load balancing with the LB icon:

Openspeedshop-ossmpi-loadbalance.png

Reading and Writing the Filesystem: ossio and ossiot

These two experiments analyze the I/O function calls of your program. You can run them with the commands

 
[name@server $] ossio executable
[name@server $] ossiot executable
 


or

 
[name@server $] ossio "mpirun executable"
[name@server $] ossiot "mpirun executable"
 


in the case of an MPI application. As with the ossmpi/ossmpit pair, the ossiot experiment adds call-stack information, which allows you to trace the path of the I/O calls in the application. Opening the output file in openss will display

Openspeedshop-ossio.png

with (in the case of MPI) information about the load-balancing,

Openspeedshop-ossio-loadbalance.png

Extracting Data from the Command Line

Open|SpeedShop provides a command line interface which sometimes makes it easier to extract information than the graphical interface does. You can access it by adding the option -cli.
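For example, you can open an experiment database interactively and type the same commands that appear in the scripts below; this is a sketch, and the prompt shown is indicative:

[name@server $] openss -cli -f executable-usertime.openss
openss>>list -v threads
openss>>expview
openss>>exit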

It is also possible to automate the analysis of the results with a script, using the option -batch. The following script is particularly useful for extracting the data of each thread from an experiment:

File : analyse-fils.sh
#! /bin/bash
 
outputdir=$1
file=$2
cache=tmp.$$
 
rm -r $outputdir
mkdir -p $outputdir
 
echo "Getting thread IDs"
openss -batch -f $file > $cache << EOF
list -v threads
EOF
 
for threadid in $(cat $cache | grep "^[0-9]"); do
	echo "Getting data for thread $threadid"
	openss -batch -f $file > $outputdir/$threadid << EOF
expview -t $threadid
EOF
done
 
echo "Getting load balance info"
openss -batch -f $file > $outputdir/loadbalance << EOF
expview -m loadbalance -v functions
EOF
 
rm $cache


To summarize, this script gets the list of thread numbers with the Open|SpeedShop command list -v threads, then extracts the data for each thread with the command expview -t thread_number, as well as the load-balancing data with the command expview -m loadbalance -v functions. The script takes two arguments: the directory in which to write the data (which will be created if it doesn't already exist) and the name of the .openss file to analyze. For example,

[name@server $] ./analyse-fils.sh results my_application-usertime.openss


The file results/loadbalance will contain the same information as the LB option of the graphical interface, e.g.

File : loadbalance
[openss]: The restored experiment identifier is:  -x 1
 
        Max   Posix ThreadId          Min   Posix ThreadId      Average  Function (defining location)
  Exclusive           of Max    Exclusive           of Min    Exclusive
Time Across                   Time Across                   Time Across
      Posix                         Posix                         Posix
ThreadIds(s)                   ThreadIds(s)                   ThreadIds(s)
 
 223.057138  140169854267312   127.942855       1144858944   145.674997  gomp_iter_static_next (libgomp.so.1.0.0: iter.c,39)
 211.628567       1128073536   153.999997  140169854267312   193.735710  computeRhod<double> (exe_MEevolution_optimized_omp_v2.out: TMasterEquation.hpp,821)
 187.914282  140169854267312   137.485712       1119680832   162.339282  gomp_iter_ull_static_next (libgomp.so.1.0.0: iter_ull.c,40)
 167.371425       1092548928   155.742854       1111288128   161.321425  Tsdh_c_Tzduh<std::complex<double>, double> (exe_MEevolution_optimized_omp_v2.out: home_blas_template.hpp,1741)
 101.857141       1092548928    90.371427       1128073536    94.496427  Tsdg_cc_Tduh<double, std::complex<double> > (exe_MEevolution_optimized_omp_v2.out: home_blas_template.hpp,2337)
  84.057141       1136466240    26.571428  140169854267312    68.949999  do_spin (libgomp.so.1.0.0: wait.h,48)
  71.085713  140169854267312    43.371428       1144858944    58.339285  Tmdg_c_Tzduh<double, double> (exe_MEevolution_optimized_omp_v2.out: home_blas_template.hpp,1458)
  67.028570       1119680832    22.628571  140169854267312    55.074999  cpu_relax (libgomp.so.1.0.0: futex.h,145)
  56.114285       1128073536    54.314285       1100941632    55.117856  rkf45_apply._omp_fn.5 (libgsl.so.0.16.0: rkf45.c,322)
  54.599999       1119680832    42.028571       1128073536    47.082142  rkf45_apply._omp_fn.4 (libgsl.so.0.16.0: rkf45.c,306)
  48.485713       1119680832    38.285714  140169854267312    43.110713  rkf45_apply._omp_fn.3 (libgsl.so.0.16.0: rkf45.c,290)
  44.028571       1119680832    41.114285       1111288128    42.160713  rkf45_apply._omp_fn.6 (libgsl.so.0.16.0: rkf45.c,345)


while the files with a numerical name will contain data specific to each of the threads, for example:

File : 1092548928
[openss]: The restored experiment identifier is:  -x 1
 
 Exclusive    Inclusive       % of  Function (defining location)
  CPU time  CPU time in      Total
        in     seconds.  Exclusive
  seconds.                CPU Time
188.285711   342.885707  14.064367  computeRhod<double> (exe_MEevolution_optimized_omp_v2.out: TMasterEquation.hpp,821)
167.371425   167.971425  12.502134  Tsdh_c_Tzduh<std::complex<double>, double> (exe_MEevolution_optimized_omp_v2.out: home_blas_template.hpp,1741)
140.828569   142.999997  10.519464  gomp_iter_ull_static_next (libgomp.so.1.0.0: iter_ull.c,40)
139.542854   140.971426  10.423425  gomp_iter_static_next (libgomp.so.1.0.0: iter.c,39)
101.857141   102.428569   7.608417  Tsdg_cc_Tduh<double, std::complex<double> > (exe_MEevolution_optimized_omp_v2.out: home_blas_template.hpp,2337)
 69.971427    69.971427   5.226652  Tmdg_c_Tzduh<double, double> (exe_MEevolution_optimized_omp_v2.out: home_blas_template.hpp,1458)
 68.342856   123.514283   5.105003  do_spin (libgomp.so.1.0.0: wait.h,48)
 55.942856    69.799999   4.178760  rkf45_apply._omp_fn.5 (libgsl.so.0.16.0: rkf45.c,322)
 55.171427    55.171427   4.121137  cpu_relax (libgomp.so.1.0.0: futex.h,145)
 52.457142    71.285713   3.918388  rkf45_apply._omp_fn.4 (libgsl.so.0.16.0: rkf45.c,306)
 45.514285    65.799999   3.399778  rkf45_apply._omp_fn.3 (libgsl.so.0.16.0: rkf45.c,290)
 41.457142    68.971427   3.096722  rkf45_apply._omp_fn.6 (libgsl.so.0.16.0: rkf45.c,345)
 37.942856    38.257142   2.834215  Tmdg_cc_Tduh<double, std::complex<double> > (exe_MEevolution_optimized_omp_v2.out: home_blas_template.hpp,2121)
 37.171428    57.857142   2.776592  rkf45_apply._omp_fn.2 (libgsl.so.0.16.0: rkf45.c,276)


An equivalent script can be used for an MPI job, in order to extract the data for each of the MPI processes:

File : analyse-rangs.sh
#! /bin/bash
 
outputdir=$1
file=$2
cache=tmp.$$
 
rm -r $outputdir
mkdir -p $outputdir
 
echo "Getting rank IDs"
openss -batch -f $file > $cache << EOF
list -v ranks
EOF
 
for rankid in $(cat $cache | grep "^[0-9]"); do
	echo "Getting data for rank $rankid"
	openss -batch -f $file > $outputdir/$rankid << EOF
expview -r $rankid
EOF
done
 
echo "Getting load balance info"
openss -batch -f $file > $outputdir/loadbalance << EOF
expview -m loadbalance -v functions
EOF
 
rm $cache


In this case, the data in the loadbalance file refer to the MPI rank rather than the thread number:

File : exemple_loadbalance
[openss]: The restored experiment identifier is:  -x 1
 
        Max  Rank        Min  Rank     Average  Function (defining location)
  Exclusive    of  Exclusive    of   Exclusive
   I/O call   Max   I/O call   Min    I/O call
    time in          time in           time in
   seconds.         seconds.          seconds.
     Across           Across            Across
  Ranks(ms)        Ranks(ms)         Ranks(ms)
2018.246179     6   0.009193    13  979.827349  __write (libpthread-2.5.so)
  27.855668    14   0.011322     9    4.411268  read (libpthread-2.5.so)
  15.325733     0   0.513742     6    5.176433  open64 (libpthread-2.5.so)
   0.517390     0   0.025608     9    0.157019  close (libpthread-2.5.so)
   0.376123     0   0.007962     7    0.036721  __lseek64 (libpthread-2.5.so)


Server-Specific Information

To use Open|SpeedShop on Colosse you should load the module apps/openspeedshop/2.1:
[user@colosse $] module load apps/openspeedshop/2.1


The hardware counters available on the compute nodes are the following:

[user@colosse $] papi_avail | grep Yes
PAPI_L1_DCM  0x80000000  Yes   No   Level 1 data cache misses
PAPI_L1_ICM  0x80000001  Yes   No   Level 1 instruction cache misses
PAPI_L2_DCM  0x80000002  Yes   Yes  Level 2 data cache misses
PAPI_L2_ICM  0x80000003  Yes   No   Level 2 instruction cache misses
PAPI_L1_TCM  0x80000006  Yes   Yes  Level 1 cache misses
PAPI_L2_TCM  0x80000007  Yes   No   Level 2 cache misses
PAPI_L3_TCM  0x80000008  Yes   No   Level 3 cache misses
PAPI_L3_LDM  0x8000000e  Yes   No   Level 3 load misses
PAPI_TLB_DM  0x80000014  Yes   No   Data translation lookaside buffer misses
PAPI_TLB_IM  0x80000015  Yes   No   Instruction translation lookaside buffer misses
PAPI_TLB_TL  0x80000016  Yes   Yes  Total translation lookaside buffer misses
PAPI_L1_LDM  0x80000017  Yes   No   Level 1 load misses
PAPI_L1_STM  0x80000018  Yes   No   Level 1 store misses
PAPI_L2_LDM  0x80000019  Yes   No   Level 2 load misses
PAPI_L2_STM  0x8000001a  Yes   No   Level 2 store misses
PAPI_BR_UCN  0x8000002a  Yes   No   Unconditional branch instructions
PAPI_BR_CN   0x8000002b  Yes   No   Conditional branch instructions
PAPI_BR_TKN  0x8000002c  Yes   No   Conditional branch instructions taken
PAPI_BR_NTK  0x8000002d  Yes   Yes  Conditional branch instructions not taken
PAPI_BR_MSP  0x8000002e  Yes   No   Conditional branch instructions mispredicted
PAPI_BR_PRC  0x8000002f  Yes   Yes  Conditional branch instructions correctly predicted
PAPI_TOT_IIS 0x80000031  Yes   No   Instructions issued
PAPI_TOT_INS 0x80000032  Yes   No   Instructions completed
PAPI_FP_INS  0x80000034  Yes   No   Floating point instructions
PAPI_LD_INS  0x80000035  Yes   No   Load instructions
PAPI_SR_INS  0x80000036  Yes   No   Store instructions
PAPI_BR_INS  0x80000037  Yes   No   Branch instructions
PAPI_RES_STL 0x80000039  Yes   No   Cycles stalled on any resource
PAPI_TOT_CYC 0x8000003b  Yes   No   Total cycles
PAPI_LST_INS 0x8000003c  Yes   Yes  Load/store instructions completed
PAPI_L1_DCH  0x8000003e  Yes   Yes  Level 1 data cache hits
PAPI_L2_DCH  0x8000003f  Yes   Yes  Level 2 data cache hits
PAPI_L1_DCA  0x80000040  Yes   No   Level 1 data cache accesses
PAPI_L2_DCA  0x80000041  Yes   No   Level 2 data cache accesses
PAPI_L3_DCA  0x80000042  Yes   Yes  Level 3 data cache accesses
PAPI_L1_DCR  0x80000043  Yes   No   Level 1 data cache reads
PAPI_L2_DCR  0x80000044  Yes   No   Level 2 data cache reads
PAPI_L3_DCR  0x80000045  Yes   No   Level 3 data cache reads
PAPI_L1_DCW  0x80000046  Yes   No   Level 1 data cache writes
PAPI_L2_DCW  0x80000047  Yes   No   Level 2 data cache writes
PAPI_L3_DCW  0x80000048  Yes   No   Level 3 data cache writes
PAPI_L1_ICH  0x80000049  Yes   No   Level 1 instruction cache hits
PAPI_L2_ICH  0x8000004a  Yes   No   Level 2 instruction cache hits
PAPI_L1_ICA  0x8000004c  Yes   No   Level 1 instruction cache accesses
PAPI_L2_ICA  0x8000004d  Yes   No   Level 2 instruction cache accesses
PAPI_L3_ICA  0x8000004e  Yes   No   Level 3 instruction cache accesses
PAPI_L1_ICR  0x8000004f  Yes   No   Level 1 instruction cache reads
PAPI_L2_ICR  0x80000050  Yes   No   Level 2 instruction cache reads
PAPI_L3_ICR  0x80000051  Yes   No   Level 3 instruction cache reads
PAPI_L2_TCH  0x80000056  Yes   Yes  Level 2 total cache hits
PAPI_L1_TCA  0x80000058  Yes   Yes  Level 1 total cache accesses
PAPI_L2_TCA  0x80000059  Yes   No   Level 2 total cache accesses
PAPI_L3_TCA  0x8000005a  Yes   No   Level 3 total cache accesses
PAPI_L1_TCR  0x8000005b  Yes   Yes  Level 1 total cache reads
PAPI_L2_TCR  0x8000005c  Yes   Yes  Level 2 total cache reads
PAPI_L3_TCR  0x8000005d  Yes   Yes  Level 3 total cache reads
PAPI_L2_TCW  0x8000005f  Yes   No   Level 2 total cache writes
PAPI_L3_TCW  0x80000060  Yes   No   Level 3 total cache writes
PAPI_FP_OPS  0x80000066  Yes   Yes  Floating point operations
PAPI_SP_OPS  0x80000067  Yes   Yes  Floating point operations; optimized to count scaled single precision vector operations
PAPI_DP_OPS  0x80000068  Yes   Yes  Floating point operations; optimized to count scaled double precision vector operations
PAPI_VEC_SP  0x80000069  Yes   No   Single precision vector/SIMD instructions
PAPI_VEC_DP  0x8000006a  Yes   No   Double precision vector/SIMD instructions
PAPI_REF_CYC 0x8000006b  Yes   No   Reference clock cycles
To use Open|SpeedShop on Mp2 you should load the modules openss/2.0.2 and papi/5.1.1:
[user@ip01-mp2 $] module load openss/2.0.2 papi/5.1.1


The hardware counters available on the compute nodes are the following:

[user@ip01-mp2 $] papi_avail | grep Yes
PAPI_L1_DCM  0x80000000  Yes   No   Level 1 data cache misses
PAPI_L1_ICM  0x80000001  Yes   No   Level 1 instruction cache misses
PAPI_L2_DCM  0x80000002  Yes   No   Level 2 data cache misses
PAPI_L2_ICM  0x80000003  Yes   No   Level 2 instruction cache misses
PAPI_L1_TCM  0x80000006  Yes   Yes  Level 1 cache misses
PAPI_L2_TCM  0x80000007  Yes   No   Level 2 cache misses
PAPI_FPU_IDL 0x80000012  Yes   No   Cycles floating point units are idle
PAPI_TLB_DM  0x80000014  Yes   No   Data translation lookaside buffer misses
PAPI_TLB_IM  0x80000015  Yes   No   Instruction translation lookaside buffer misses
PAPI_TLB_TL  0x80000016  Yes   Yes  Total translation lookaside buffer misses
PAPI_STL_ICY 0x80000025  Yes   No   Cycles with no instruction issue
PAPI_HW_INT  0x80000029  Yes   No   Hardware interrupts
PAPI_BR_TKN  0x8000002c  Yes   No   Conditional branch instructions taken
PAPI_BR_MSP  0x8000002e  Yes   No   Conditional branch instructions mispredicted
PAPI_TOT_INS 0x80000032  Yes   No   Instructions completed
PAPI_FP_INS  0x80000034  Yes   No   Floating point instructions
PAPI_BR_INS  0x80000037  Yes   No   Branch instructions
PAPI_VEC_INS 0x80000038  Yes   No   Vector/SIMD instructions (could include integer)
PAPI_RES_STL 0x80000039  Yes   No   Cycles stalled on any resource
PAPI_TOT_CYC 0x8000003b  Yes   No   Total cycles
PAPI_L1_DCH  0x8000003e  Yes   Yes  Level 1 data cache hits
PAPI_L2_DCH  0x8000003f  Yes   Yes  Level 2 data cache hits
PAPI_L1_DCA  0x80000040  Yes   No   Level 1 data cache accesses
PAPI_L2_DCA  0x80000041  Yes   No   Level 2 data cache accesses
PAPI_L1_ICH  0x80000049  Yes   Yes  Level 1 instruction cache hits
PAPI_L2_ICH  0x8000004a  Yes   No   Level 2 instruction cache hits
PAPI_L1_ICA  0x8000004c  Yes   No   Level 1 instruction cache accesses
PAPI_L2_ICA  0x8000004d  Yes   No   Level 2 instruction cache accesses
PAPI_L1_ICR  0x8000004f  Yes   No   Level 1 instruction cache reads
PAPI_L1_TCH  0x80000055  Yes   Yes  Level 1 total cache hits
PAPI_L2_TCH  0x80000056  Yes   Yes  Level 2 total cache hits
PAPI_L1_TCA  0x80000058  Yes   Yes  Level 1 total cache accesses
PAPI_L2_TCA  0x80000059  Yes   No   Level 2 total cache accesses
PAPI_FML_INS 0x80000061  Yes   No   Floating point multiply instructions
PAPI_FAD_INS 0x80000062  Yes   No   Floating point add instructions (Also includes subtract instructions)
PAPI_FDV_INS 0x80000063  Yes   No   Floating point divide instructions (Counts both divide and square root instructions)
PAPI_FSQ_INS 0x80000064  Yes   No   Floating point square root instructions (Counts both divide and square root instructions)
PAPI_FP_OPS  0x80000066  Yes   No   Floating point operations
PAPI_SP_OPS  0x80000067  Yes   No   Floating point operations; optimized to count scaled single precision vector operations
PAPI_DP_OPS  0x80000068  Yes   No   Floating point operations; optimized to count scaled double precision vector operations
To use Open|SpeedShop on Guillimin you should load the modules iomkl/2015b and OpenSpeedShop/2.2:
[user@guillimin $] module load iomkl/2015b OpenSpeedShop/2.2


The hardware counters available on the compute nodes are the following:

[user@guillimin $] papi_avail | grep Yes
PAPI_L1_DCM  0x80000000  Yes   No   Level 1 data cache misses
PAPI_L1_ICM  0x80000001  Yes   No   Level 1 instruction cache misses
PAPI_L2_DCM  0x80000002  Yes   Yes  Level 2 data cache misses
PAPI_L2_ICM  0x80000003  Yes   No   Level 2 instruction cache misses
PAPI_L1_TCM  0x80000006  Yes   Yes  Level 1 cache misses
PAPI_L2_TCM  0x80000007  Yes   No   Level 2 cache misses
PAPI_L3_TCM  0x80000008  Yes   No   Level 3 cache misses
PAPI_L3_LDM  0x8000000e  Yes   No   Level 3 load misses (Westmere)
PAPI_TLB_DM  0x80000014  Yes No/Yes Data translation lookaside buffer misses (Westmere/Sandy Bridge)
PAPI_TLB_IM  0x80000015  Yes   No   Instruction translation lookaside buffer misses
PAPI_TLB_TL  0x80000016  Yes   Yes  Total translation lookaside buffer misses
PAPI_L1_LDM  0x80000017  Yes   No   Level 1 load misses
PAPI_L1_STM  0x80000018  Yes   No   Level 1 store misses
PAPI_L2_LDM  0x80000019  Yes   No   Level 2 load misses (Westmere)
PAPI_L2_STM  0x8000001a  Yes   No   Level 2 store misses
PAPI_STL_ICY 0x80000025  Yes   No   Cycles with no instruction issue (Sandy Bridge)
PAPI_BR_UCN  0x8000002a  Yes No/Yes Unconditional branch instructions
PAPI_BR_CN   0x8000002b  Yes   No   Conditional branch instructions
PAPI_BR_TKN  0x8000002c  Yes No/Yes Conditional branch instructions taken
PAPI_BR_NTK  0x8000002d  Yes Yes/No Conditional branch instructions not taken
PAPI_BR_MSP  0x8000002e  Yes   No   Conditional branch instructions mispredicted
PAPI_BR_PRC  0x8000002f  Yes   Yes  Conditional branch instructions correctly predicted
PAPI_TOT_IIS 0x80000031  Yes   No   Instructions issued (Westmere)
PAPI_TOT_INS 0x80000032  Yes   No   Instructions completed
PAPI_FP_INS  0x80000034  Yes No/Yes Floating point instructions
PAPI_LD_INS  0x80000035  Yes   No   Load instructions
PAPI_SR_INS  0x80000036  Yes   No   Store instructions
PAPI_BR_INS  0x80000037  Yes   No   Branch instructions
PAPI_RES_STL 0x80000039  Yes   No   Cycles stalled on any resource (Westmere)
PAPI_TOT_CYC 0x8000003b  Yes   No   Total cycles
PAPI_LST_INS 0x8000003c  Yes   Yes  Load/store instructions completed (Westmere)
PAPI_L2_DCH  0x8000003f  Yes   Yes  Level 2 data cache hits
PAPI_L2_DCA  0x80000041  Yes   No   Level 2 data cache accesses
PAPI_L3_DCA  0x80000042  Yes   Yes  Level 3 data cache accesses
PAPI_L2_DCR  0x80000044  Yes   No   Level 2 data cache reads
PAPI_L3_DCR  0x80000045  Yes   No   Level 3 data cache reads
PAPI_L2_DCW  0x80000047  Yes   No   Level 2 data cache writes
PAPI_L3_DCW  0x80000048  Yes   No   Level 3 data cache writes
PAPI_L1_ICH  0x80000049  Yes   No   Level 1 instruction cache hits (Westmere)
PAPI_L2_ICH  0x8000004a  Yes   No   Level 2 instruction cache hits
PAPI_L1_ICA  0x8000004c  Yes   No   Level 1 instruction cache accesses (Westmere)
PAPI_L2_ICA  0x8000004d  Yes   No   Level 2 instruction cache accesses
PAPI_L3_ICA  0x8000004e  Yes   No   Level 3 instruction cache accesses
PAPI_L1_ICR  0x8000004f  Yes   No   Level 1 instruction cache reads (Westmere)
PAPI_L2_ICR  0x80000050  Yes   No   Level 2 instruction cache reads
PAPI_L3_ICR  0x80000051  Yes   No   Level 3 instruction cache reads
PAPI_L2_TCH  0x80000056  Yes   Yes  Level 2 total cache hits (Westmere)
PAPI_L2_TCA  0x80000059  Yes No/Yes Level 2 total cache accesses
PAPI_L3_TCA  0x8000005a  Yes   No   Level 3 total cache accesses
PAPI_L2_TCR  0x8000005c  Yes   Yes  Level 2 total cache reads
PAPI_L3_TCR  0x8000005d  Yes   Yes  Level 3 total cache reads
PAPI_L2_TCW  0x8000005f  Yes   No   Level 2 total cache writes
PAPI_L3_TCW  0x80000060  Yes   No   Level 3 total cache writes
PAPI_FDV_INS 0x80000063  Yes   No   Floating point divide instructions (Sandy Bridge)
PAPI_FP_OPS  0x80000066  Yes   Yes  Floating point operations
PAPI_SP_OPS  0x80000067  Yes   Yes  Floating point operations; optimized to count scaled single precision vector operations
PAPI_DP_OPS  0x80000068  Yes   Yes  Floating point operations; optimized to count scaled double precision vector operations
PAPI_VEC_SP  0x80000069  Yes No/Yes Single precision vector/SIMD instructions
PAPI_VEC_DP  0x8000006a  Yes No/Yes Double precision vector/SIMD instructions
PAPI_REF_CYC 0x8000006b  Yes   No   Reference clock cycles

