Tau
Introduction
Tau is a toolkit for the profiling and tracing of parallel programs written in Fortran, C/C++, Java and Python.
Thanks to Tau, you can critically analyze the performance of your application, identifying the bottlenecks and bugs that slow it down so that you can improve your code. Tau tracks the execution of your parallel program and provides information such as the amount of time spent in each function (or even in a given loop) or the number of times a function was called. This information is available not only for the objects defined in your program (functions, classes, templates, etc.) but also for some external libraries such as MPI.
Using Tau on Colosse
Tau is available on Colosse for the Intel and Gnu compilers in conjunction with OpenMPI. To add Tau 2.21.1 to your environment, use one of the two following examples:
[name@server $] module load compilers/gcc/4.4.2 mpi/openmpi/1.4.3_gcc tools/tau/2.21.1_gcc
[name@server $] module load compilers/intel/11.1.059 mpi/openmpi/1.4.3_intel tools/tau/2.21.1_intel
Tau is compiled with support for profiling, tracing (internal Tau library) and for interacting with the following third-party software:
- Programming Database Toolkit (PDT)
- Binary File Descriptor (BFD)
- OpenMPI
- OpenMP and OPARI2
- Performance API (PAPI)
- Dyninst (for the gcc compiler only)
- Dwarf
- VTF (tracing format)
- OTF (tracing format)
These programs are automatically added to your environment, so no additional command is required to use them. Likewise, the multiple parallel installations of Tau let you use the tracing functions, profiling, etc. from a single module; Tau automatically chooses the appropriate version for the desired task. If you want more control over Tau's behaviour or wish to select a particular version of the library, you can use environment variables and certain program options, as described in the Tau manual.
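For instance, the behaviour of an instrumented run can be adjusted at execution time through environment variables. The variable names below come from the Tau manual; the values and directories are illustrative choices, not Colosse defaults:

```shell
# Enable profiling and callpath recording (two levels deep) for the next run
export TAU_PROFILE=1
export TAU_CALLPATH=1
export TAU_CALLPATH_DEPTH=2

# Write profile files to a dedicated directory instead of the working directory
export PROFILEDIR=$HOME/tau-profiles

# Or switch to trace generation instead of profiling
export TAU_TRACE=1
export TRACEDIR=$HOME/tau-traces
```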
If you need a particular Tau feature which isn't available in the current installation on Colosse, don't hesitate to contact us; we will be happy to add it.
Documentation
The Tau documentation is available from the project's website. We recommend that you read the user guide, especially the first section, which describes the instrumentation process, as well as the section Some Common Application Scenarios.
Tutorial
The rest of this page is dedicated to a short tutorial on Tau. We will use a trivial parallel program to illustrate the basic functionality of Tau: the computation of π using a Monte Carlo algorithm. All of the code examples and commands have been tested on Colosse and seek to familiarize you with the use of Tau on this supercomputer.
Introduction
The number π is the constant of proportionality between a circle's diameter and its circumference. It is a transcendental number, that is, it is not the root of any non-zero polynomial with rational coefficients. Several approaches nonetheless exist to approximate π to arbitrary precision. We will use one such approach to write a program that calculates an approximation to π in parallel.
Our approach (a Monte Carlo method) uses the ratio between the area of a square and that of an inscribed circle (see the accompanying figure). If we randomly place a large number of points inside this square, the ratio between the number of points inside the circle and the total number of points will be equal to the ratio between the area of the circle and that of the square: p / n = πr^{2} / (2r)^{2} = π/4, where p is the number of points inside the circle and n the total number of points. We can thus derive the relation π = 4p/n.
Our program will randomly place a great number of points in the square (by choosing random numbers between 0.0 and 1.0 for the x and y components of each point). The number of points to test will be specified by a command line argument, for example (for 8,000,000 points):
[name@server $] mpirun -np 8 ./pi-mc 8000000
An additional constraint on our program is that it should print an estimate of π to standard output at regular intervals. Beginning with a naive implementation, we will improve our program using the data gathered by Tau until we obtain optimal performance.
Initial Version (Naive Implementation)
The first version of our program is especially simple. Each process tests a point and transmits the result to the master process, which then prints an estimate of π. These instructions are repeated until the desired number of points have been tested:
#include <math.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#include "mpi.h"

/*
 * Generate a random point in the square, then test and return whether it is
 * inside the circle.
 */
bool generate_point()
{
    double x;  // X coordinate of point inside the square
    double y;  // Y coordinate of point inside the square

    x = (double) rand() / RAND_MAX;
    y = (double) rand() / RAND_MAX;

    return (sqrt(x * x + y * y) <= 1.0);
}

/*
 * Estimate Pi over MPI using a Monte Carlo algorithm, printing an estimate
 * after each iteration
 *
 * n:    Number of random points to test (the higher, the more precise the
 *       estimate will be)
 * comm: MPI communicator
 */
void pi_monte_carlo(int n, MPI_Comm comm)
{
    int rank;   // Rank inside the MPI communicator
    int size;   // Size of the MPI communicator
    int pw;     // 0 or 1, was the point inside the circle for this rank and
                // this iteration?
    int pc;     // Number of points inside the circle, summed for all ranks
                // for this cycle of iteration
    int p;      // Total number of points inside the circle (pi = 4*p/n)
    int m;      // Number of points tested up to now
    double pi;  // Pi estimate

    /* Initialize variables and check that the number of points is valid.
     * We simplify the algorithm by forcing n to be a multiple of the number
     * of processes. */
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    p = 0;
    if (n % size != 0) {
        if (rank == 0) {
            printf("Invalid number of points: %d\n", n);
            printf("Must be a multiple of: %d\n", size);
        }
        return;
    }

    // Compute pi
    for (m = size; m <= n; m += size) {
        pw = (int) generate_point();
        MPI_Reduce(&pw, &pc, 1, MPI_INT, MPI_SUM, 0, comm);

        // Rank 0 prints the estimate
        if (rank == 0) {
            p += pc;
            pi = (double) 4 * p / m;
            printf("%d %.20f\n", m, pi);
        }
    }
}

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    pi_monte_carlo(atoi(argv[1]), MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
You can compile and then test this program using the commands:
[name@server $] module load compilers/gcc/4.4.2 mpi/openmpi/1.4.3_gcc
[name@server $] mpicc -o pi-mc-1 pi-mc-1.c
[name@server $] mpirun -np 8 ./pi-mc-1 8000000 > pi-mc-1.out
[name@server $] tail pi-mc-1.out
Dynamic Instrumentation
At this point, it's already possible to use Tau with our program thanks to dynamic instrumentation, a Tau feature which lets you profile an application without modifying or recompiling it. The information obtained this way is less detailed than with other methods, but it nonetheless allows you to judge certain aspects of the performance of a parallel code. Dynamic instrumentation is not only simple; it is also the only option when you don't have access to the source code, as is frequently the case with commercial software. It lets you study such issues as memory consumption, I/O and GPU utilization, among others.
The instrumentation is carried out during program execution by the tau_exec utility. Let's execute once again our program, this time by means of tau_exec in order to study the input/output (I/O) done by our program:
[name@server $] module load tools/tau/2.21.1_gcc
[name@server $] tau_exec -io -- mpirun -np 8 ./pi-mc-1 8000000 > pi-mc-1.out
After the program has finished running, the profiling information is stored in a file named profile.0.0.0. This file can be analyzed by the pprof utility, whose output begins with a summary of the functions executed in our program:
[name@server $] pprof > pprof.out
Reading Profile files in profile.*

NODE 0;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
              msec   total msec                          usec/call
---------------------------------------------------------------------------------------
100.0       35,200       43,990           1      748909   43990242 .TAU application
 14.0        6,145        6,145      484557           0         13 write()
  6.0        2,638        2,638      263887           0         10 read() [THROTTLED]
  0.0        0.916        0.916          13           0         70 fopen()
  0.0        0.667        0.667          54           0         12 writev()
  0.0        0.596        0.596         116           0          5 readv()
  0.0        0.412        0.463         113           6          4 close()
  0.0         0.46         0.46          20           0         23 send()
  0.0        0.397        0.397          15           0         26 fclose()
  0.0        0.382        0.382          34           0         11 pipe()
  0.0        0.286        0.286          35           0          8 open()
  0.0        0.161        0.161          16           0         10 recv()
  0.0        0.102        0.102          16           0          6 accept()
  0.0        0.094        0.094           4           0         24 recvfrom()
  0.0        0.092        0.092           8           0         12 socket()
  0.0        0.086        0.086           2           0         43 read()
  0.0        0.056        0.056           5           0         11 fopen64()
  0.0         0.04         0.04           5           0          8 fscanf()
  0.0        0.025        0.025           5           0          5 connect()
  0.0        0.022        0.022           2           0         11 fprintf()
  0.0        0.019        0.019           1           0         19 socketpair()
  0.0        0.008        0.008           3           0          3 lseek()
  0.0        0.008        0.008           2           0          4 rewind()
  0.0        0.006        0.006           2           0          3 bind()
---------------------------------------------------------------------------------------
[...]
Even without information on MPI calls or the internal program flow, we can see already that our program has a performance problem. The write() function, used to print the estimate of π, represents 14% of the execution time. Ideally, I/O should be a negligible part of the execution time, which we would like to see entirely dedicated to the calculation of π. One solution could be to print the estimate less frequently.
Compile-Time Instrumentation
Before modifying our code, let's study our first version using compile-time instrumentation. This will allow us to obtain much more detailed information, not only about the internal calls of our code but also concerning the amount of time spent in MPI functions. We can instrument the code at compile-time using the taucc, taucxx and tauf90 utility programs and then profile the code as it executes. In our case:
[name@server $] taucc -o pi-mc-1 pi-mc-1.c
[name@server $] mpirun -np 8 ./pi-mc-1 8000000 > pi-mc-1.out
[name@server $] pprof > pprof.out
This time we obtain eight profile.*.0.0 files, one per MPI process. The pprof utility collates these data and, at the end of its output, presents an average over all the nodes:
Reading Profile files in profile.*
[...]

FUNCTION SUMMARY (mean):
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
              msec   total msec                          usec/call
---------------------------------------------------------------------------------------
100.0           16       31,472           1           1   31472907 .TAU application
 99.9       0.0725       31,456           1           3   31456223 main
 96.1        4,140       30,241           1      987503   30241926 pi_monte_carlo
 82.7       26,027       26,027      875000           0         30 MPI_Reduce()
  3.8        1,200        1,200           1           0    1200720 MPI_Init()
  0.2           54           54     12500.1           0          4 MPI_Reduce() [THROTTLED]
  0.1           19           19      100001           0          0 generate_point [THROTTLED]
  0.0           13           13           1           0      13504 MPI_Finalize()
  0.0      0.00025      0.00025           1           0          0 MPI_Comm_rank()
  0.0      0.00025      0.00025           1           0          0 MPI_Comm_size()
We note that MPI communications (calls to MPI_Reduce) represent practically all of the program's execution time, which is obviously not what we want. We could reduce the amount of inter-process communication by having each process calculate a large number of points (rather than a single one) before sending this information to the master node which then prints out the estimate. This solution would also reduce the number of I/O operations, eliminating at a single stroke both of the problems we've identified so far.
Second Version (Reduced Number of MPI Calls)
The second version of our program uses each process to calculate 10,000 points in the square before transmitting the result to the master process. These instructions are repeated until the desired number of points has been tested:
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#include "mpi.h"

// Number of points to test at a time
#define np 10000

/*
 * Generate points inside the square and test if they are also inside the
 * circle. Then, return the number of points inside the circle.
 *
 * n: Number of points to test
 */
int generate_points(int n)
{
    int i;
    int p = 0;
    double x;
    double y;

    for (i = 0; i < n; ++i) {
        x = (double) rand() / RAND_MAX;
        y = (double) rand() / RAND_MAX;
        if (sqrt(x * x + y * y) <= 1.0) {
            p++;
        }
    }

    return p;
}

/*
 * Estimate Pi over MPI using a Monte Carlo algorithm, printing an estimate
 * at regular intervals
 *
 * n:    Number of random points to test (the higher, the more precise the
 *       estimate will be)
 * comm: MPI communicator
 */
void pi_monte_carlo(long int n, MPI_Comm comm)
{
    int rank;    // Rank inside the MPI communicator
    int size;    // Size of the MPI communicator
    int pw;      // Number of points inside the circle for this rank and this
                 // estimate cycle
    int p;       // Total number of points inside the circle (pi = 4*p/n),
                 // summed for all ranks and estimate cycles
    int pc;      // Number of points inside the circle, summed for all ranks
                 // for this cycle of iteration
    long int m;  // Number of points tested up to now
    double pi;   // Pi estimate

    /* Initialize variables and check that the number of points is valid.
     * We simplify the algorithm by forcing n to be a multiple of the number
     * of points per estimate cycle. */
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    p = 0;
    if (n % (size * np) != 0) {
        if (rank == 0) {
            printf("Invalid number of points: %ld\n", n);
            printf("Must be a multiple of: %d\n", size * np);
        }
        return;
    }

    // Compute pi
    for (m = size * np; m <= n; m += size * np) {
        pw = generate_points(np);
        MPI_Reduce(&pw, &pc, 1, MPI_INT, MPI_SUM, 0, comm);

        // Rank 0 prints the estimate
        if (rank == 0) {
            p += pc;
            pi = (double) 4 * p / m;
            printf("%ld %.20f\n", m, pi);
        }
    }
}

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    pi_monte_carlo(atol(argv[1]), MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
Let's instrument and profile this code:
[name@server $] taucc -o pi-mc-2 pi-mc-2.c
[name@server $] mpirun -np 8 ./pi-mc-2 8000000 > pi-mc-2.out
[name@server $] pprof > pprof.out
Reading Profile files in profile.*
[...]

FUNCTION SUMMARY (mean):
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
              msec   total msec                          usec/call
---------------------------------------------------------------------------------------
100.0           19        1,367           1           1    1367477 .TAU application
 98.6       0.0709        1,348           1           3    1348449 main
 89.0        1,216        1,216           1           0    1216556 MPI_Init()
  9.3            7          127           1        2002     127602 pi_monte_carlo
  5.4           73           73        1000           0         74 MPI_Reduce()
  3.3           45           45        1000           0         46 generate_points
  0.3            4            4           1           0       4220 MPI_Finalize()
  0.0     0.000375     0.000375           1           0          0 MPI_Comm_rank()
  0.0     0.000375     0.000375           1           0          0 MPI_Comm_size()
We immediately notice that the program runs much faster. Nevertheless, the Tau analysis shows that the program now spends essentially all of its time in the MPI_Init function. How is this possible, given that our MPI environment hasn't changed? The answer is that the program is now fast enough that the initialization time is no longer negligible compared to the time spent calculating π. To obtain a precise performance measurement, we need to increase the number of points to calculate so that the proportion of time spent initializing MPI becomes negligible:
[name@server $] taucc -o pi-mc-2 pi-mc-2.c
[name@server $] mpirun -np 8 ./pi-mc-2 8000000000 > pi-mc-2.out
[name@server $] pprof > pprof.out
Reading Profile files in profile.*
[...]

FUNCTION SUMMARY (mean):
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
              msec   total msec                          usec/call
---------------------------------------------------------------------------------------
100.0           18       51,823           1           1   51823376 .TAU application
100.0        0.068       51,804           1           3   51804915 main
 97.6          501       50,575           1      200002   50575714 pi_monte_carlo
 88.4       45,787       45,787      100000           0        458 generate_points
  8.3        4,287        4,287      100000           0         43 MPI_Reduce()
  2.4        1,225        1,225           1           0    1225538 MPI_Init()
  0.0            3            3           1           0       3595 MPI_Finalize()
  0.0     0.000375     0.000375           1           0          0 MPI_Comm_rank()
  0.0            0            0           1           0          0 MPI_Comm_size()
The time spent initializing MPI is now small enough for us to evaluate the situation. We note that the proportion of time spent in MPI communications has diminished significantly compared to the first version of the program. Nevertheless, it is still excessive, given that our problem is trivially parallel. We might be tempted to increase the value of np so that more points are calculated per cycle, but raising it to 100,000 has only a minimal effect.
Third Version (Use of Non-Blocking MPI Functions)
The problem we observe here is due to the use of blocking MPI calls. When the MPI_Reduce function is called, it has to wait until all of the processes have sent their results before continuing. In turn, the processes wait until their data has been received before continuing. This causes delays whenever the processes do not finish at the same time, which is inevitable on a multi-tasking operating system with a scheduler.
The solution is to use non-blocking MPI communications. In the third version of our program, each process calculates 10,000 points at a time and then sends the results to the master process. It then continues to work without waiting and periodically verifies that the master process has completed the reception of the results. The master process for its part collates the results of the processes as they are received:
#include <math.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#include "mpi.h"

// Rank in charge of coordinating all processes
#define manager_rank 0

// Tags used in MPI calls to segregate messages relating to results from the
// stop signal sent to workers when enough points have been accumulated.
#define tag_results 0
#define tag_stop 1

// Number of points to generate per iteration
#define np 10000

// Encapsulate an MPI message, including the data buffer and objects required
// for non-blocking messages.
typedef struct {
    int data[2];
    MPI_Request request;
    MPI_Status status;
} Message;

int generate_points(int n);

// Test if a previously sent message has completed
bool test(Message* message)
{
    int flag;
    MPI_Test(&message->request, &flag, &message->status);
    return (bool) flag;
}

// Cancel a previously sent message
void cancel(Message* message)
{
    MPI_Cancel(&message->request);
}

// Non-blocking send of results from a worker to the manager rank
void isend_results(Message* message, int p, int n, MPI_Comm comm)
{
    message->data[0] = p;
    message->data[1] = n;
    MPI_Isend(&message->data, 2, MPI_INT, manager_rank, tag_results, comm,
              &message->request);
}

// Non-blocking receive of a rank's results by the manager rank
void irecv_results(Message* message, int rank, MPI_Comm comm)
{
    MPI_Irecv(&message->data, 2, MPI_INT, rank, tag_results, comm,
              &message->request);
}

// Blocking send of the stop signal to all ranks
void send_stop(Message* message, MPI_Comm comm)
{
    int size;
    int i;

    MPI_Comm_size(comm, &size);
    message->data[0] = 0;
    for (i = 0; i < size; ++i) {
        MPI_Send(&message->data, 1, MPI_INT, i, tag_stop, comm);
    }
}

// Non-blocking receive of the stop signal by a worker
void irecv_stop(Message* message, MPI_Comm comm)
{
    MPI_Irecv(&message->data, 1, MPI_INT, manager_rank, tag_stop, comm,
              &message->request);
}

/*
 * Estimate Pi over MPI using a Monte Carlo algorithm, printing an estimate
 * at regular intervals
 *
 * n:    Number of random points to test (the higher, the more precise the
 *       estimate will be)
 * comm: MPI communicator
 */
void pi_monte_carlo(long int n, MPI_Comm comm)
{
    // Variables used by all ranks
    bool finished;                // Has this rank finished all of its work?
    bool working;                 // Is this rank currently computing points
                                  // to test?
    bool managing;                // Is this rank currently managing the
                                  // results?
    int rank;                     // Rank inside the MPI communicator
    int pw;                       // Number of points inside the circle on
                                  // this rank since the last message sent to
                                  // the manager
    int nw;                       // Number of points tested on this rank
                                  // since the last message sent to the
                                  // manager
    Message results_out_message;  // Message for sending results to manager
    Message stop_in_message;      // Message to receive stop signal from
                                  // manager

    // Variables used only by the manager rank
    int p;                        // Total number of points inside the circle
    long int m;                   // Total number of points tested up to now
    long int mp;                  // The number of points accumulated for the
                                  // last printed estimate
    int size;                     // Size of the MPI communicator
    int i;                        // Loop counter
    Message* results_in_message;  // Array of messages for receiving results
    Message stop_out_message;     // Message to send stop signal

    // Variable and communication initialization
    finished = false;
    working = true;
    MPI_Comm_rank(comm, &rank);
    managing = (rank == manager_rank);

    // All ranks prepare to receive the stop signal
    irecv_stop(&stop_in_message, comm);

    // All ranks compute a first set of points and send it to the manager
    isend_results(&results_out_message, generate_points(np), np, comm);
    pw = 0;
    nw = 0;

    if (managing) {
        p = 0;
        m = 0;
        mp = 0;
        MPI_Comm_size(comm, &size);
        results_in_message = malloc(size * sizeof(Message));
        for (i = 0; i < size; ++i) {
            // Manager prepares to receive results from all ranks
            irecv_results(&results_in_message[i], i, comm);
        }
    }

    // Main loop
    while (! finished) {
        if (working) {
            // If the stop signal has been received, we are done computing
            // points. We cancel the pending results message.
            if (test(&stop_in_message)) {
                working = false;
                cancel(&results_out_message);
            }
            // We generate another round of points. Then, we check if our
            // previous results have been sent to the manager. If that is
            // the case, we begin sending another batch of results.
            else {
                pw += generate_points(np);
                nw += np;
                if (test(&results_out_message)) {
                    isend_results(&results_out_message, pw, nw, comm);
                    pw = 0;
                    nw = 0;
                }
            }
        }

        if (managing) {
            // We send the stop signal to all ranks as soon as we have tested
            // enough points. We cancel pending results reception messages.
            if (m >= n) {
                send_stop(&stop_out_message, comm);
                for (i = 0; i < size; ++i) {
                    cancel(&results_in_message[i]);
                }
                managing = false;
            }

            // Collate results from all workers. If we have received new
            // results, we add them to our current count. We then prepare to
            // receive more.
            for (i = 0; i < size; ++i) {
                if (test(&results_in_message[i])) {
                    p += results_in_message[i].data[0];
                    m += results_in_message[i].data[1];
                    irecv_results(&results_in_message[i], i, comm);
                }
            }

            // We print a new pi estimate every time we have received new
            // results.
            if (m != mp) {
                printf("Pi is estimated as %.14f after testing %ld points\n",
                       (double) 4 * p / m, m);
                mp = m;
            }
        }

        if (! (working || managing)) {
            finished = true;
        }
    }
}

int generate_points(int n)
{
    int i;
    int p = 0;
    double x;
    double y;

    for (i = 0; i < n; ++i) {
        x = (double) rand() / RAND_MAX;
        y = (double) rand() / RAND_MAX;
        if (sqrt(x * x + y * y) <= 1.0) {
            p++;
        }
    }

    return p;
}

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    pi_monte_carlo(atol(argv[1]), MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
If we instrument and profile this new version in the same fashion as the earlier versions we obtain:
Reading Profile files in profile.*
[...]

FUNCTION SUMMARY (mean):
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
              msec   total msec                          usec/call
---------------------------------------------------------------------------------------
100.0           17       47,908           1           1   47908900 .TAU application
100.0       0.0666       47,891           1           3   47891230 main
 97.4          649       46,678           1      274391   46678917 pi_monte_carlo
 95.8       45,887       45,887      100682           0        456 generate_points
  2.5        1,208        1,208           1           0    1208418 MPI_Init()
  0.2           25           77      100001      100001          1 test [THROTTLED]
  0.1           18           57     61202.9     61202.9          1 isend_results
  0.1           52           52      100001           0          1 MPI_Test() [THROTTLED]
  0.1           38           38     61202.9           0          1 MPI_Isend()
  0.0            3            7     12500.1       12500          1 irecv_results [THROTTLED]
  0.0            3            3           1           0       3829 MPI_Finalize()
  0.0            3            3     12500.1           0          0 MPI_Irecv() [THROTTLED]
  0.0        0.011       0.0286           1           1         29 irecv_stop
  0.0       0.0155       0.0155       0.875           0         18 MPI_Irecv()
  0.0        0.003        0.015           2           2          8 cancel
  0.0        0.012        0.012           2           0          6 MPI_Cancel()
  0.0     0.000625      0.00725       0.125       1.125         58 send_stop
  0.0      0.00662      0.00662           1           0          7 MPI_Send()
  0.0            0            0           1           0          0 MPI_Comm_rank()
  0.0            0            0        0.25           0          0 MPI_Comm_size()
The exchange of results by MPI now occupies a negligible portion of the execution time, which is what we would expect for a well-optimized embarrassingly parallel program. The time that had previously been lost to the latency involved in synchronizing the processes is now dedicated to calculating π, thanks to the use of non-blocking communications.
Algorithmic Considerations
We have seen that Tau allows you to identify performance problems and helps correct them. However, it's important to remember that the choice of an appropriate algorithm and approach for a given problem is much more crucial than optimizing a given program. For example, even if our program for calculating π is now optimized, our Monte Carlo algorithm is still several orders of magnitude slower than using an approach based on a Taylor series expansion...