Open MPI

From the Calcul Québec wiki

General description

Open MPI is a library implementing the MPI standard; it allows you to write programs that perform distributed calculations.

Pros and cons

Open MPI is very modular, and easy to use and install. Its modular architecture allows you to fine-tune your computation.

Wrapper scripts for compilers

Like most MPI distributions, Open MPI offers wrapper scripts for compilers. These are scripts that call an underlying compiler (GCC, for example) with all the options necessary to compile MPI code. The wrapper's name depends on the programming language: call mpicc for C, mpiCC, mpicxx or mpic++ for C++, and mpif77 or mpif90 for Fortran 77 or Fortran 90.

The -showme option displays the underlying compiler command, along with the options the wrapper adds to it:

[name@server $] mpicxx -showme
 g++ -I/usr/include/openmpi-x86_64 -pthread -m64 -L/usr/lib64/openmpi/lib -lmpi_cxx -lmpi
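To check that a wrapper works, you can compile a minimal MPI program with it. The following is a sketch (the file name hello_mpi.c is arbitrary) that prints one line per rank:

```c
/* hello_mpi.c -- minimal MPI example; compile with: mpicc hello_mpi.c -o hello_mpi */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of ranks */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```

You can then launch it with, for example, mpiexec -n 4 ./hello_mpi and expect one "Hello" line per rank.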


You can run a program compiled with Open MPI by starting it via the mpiexec command. For example:

[name@server $] mpiexec ./my_application application options

mpiexec accepts a large number of options. Basic arguments control the total number of processes, the number of processes per resource (e.g., node), process distribution and process pinning, etc. These tunings are particularly important if you would like to optimize your processes' memory affinity.

Number of processes

The total number of MPI processes is controlled by the -c, -np, --np, -n and --n options, which are all synonyms in Open MPI. For example, to run 16 processes, you use

[name@server $] mpiexec -n 16 ./my_application

On most Calcul Québec supercomputers (exception: Mp2), you do not need to specify this parameter if you are executing the same number of processes as your job reserved through PBS.
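As a sketch, a PBS submission script for such a job might look like the following (the node and core counts are hypothetical; adapt them to the machine you use):

```shell
#!/bin/bash
#PBS -l nodes=2:ppn=8      # reserve 2 nodes with 8 cores each (16 processes)
#PBS -l walltime=1:00:00
cd $PBS_O_WORKDIR
# No -n needed here: mpiexec starts as many processes as the job reserved.
mpiexec ./my_application
```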

Furthermore, you can set the number of processes used per resource (e.g., node) using the following options:

Parameter Description
-npernode, --npernode Number of processes per compute node
-npersocket, --npersocket Number of processes per socket

It is rarely necessary to set those numbers. One case where it is useful is a hybrid MPI/OpenMP job. For example, if each node has 8 cores, you could set the environment variable OMP_NUM_THREADS to 8 and pass --npernode 1, as follows:

 [name@server $] export OMP_NUM_THREADS=8
 [name@server $] mpiexec --npernode 1 ./my_application

Note that it is quite possible that your application is faster with more MPI ranks and fewer OpenMP threads. You could then use the following instead:

 [name@server $] export OMP_NUM_THREADS=4
 [name@server $] mpiexec --npernode 2 ./my_application

or

 [name@server $] export OMP_NUM_THREADS=2
 [name@server $] mpiexec --npernode 4 ./my_application

Which of these is best depends heavily on the application, its algorithms and their implementation. The only way to be sure is to test your application with different parameters.
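The hybrid scheme above can be sketched in C: each MPI rank spawns OMP_NUM_THREADS OpenMP threads for the compute part. This is a minimal sketch (compile with mpicc -fopenmp), not a full application:

```c
/* hybrid.c -- minimal hybrid MPI/OpenMP sketch; compile with: mpicc -fopenmp hybrid.c -o hybrid */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Request MPI_THREAD_FUNNELED: only the main thread makes MPI calls. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each MPI rank spawns OMP_NUM_THREADS OpenMP threads. */
    #pragma omp parallel
    {
        printf("rank %d: thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```

Running, say, mpiexec --npernode 2 ./hybrid with OMP_NUM_THREADS=4 then gives two ranks per node with four threads each.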

Process distribution

You can also specify the way processes are distributed: by core, by socket, or by node. Open MPI distributes processes sequentially over all available cores; these parameters modify the order in which the cores are filled. Take, for example, a job that runs 16 MPI processes on 2 nodes, each with 2 processors (sockets) of 4 cores (16 cores in total). Here is how the ranks will be placed depending on the chosen option.

Parameter Node 0 Node 1
Processor 0 Processor 1 Processor 0 Processor 1
-bycore,--bycore 0, 1, 2, 3 4, 5, 6, 7 8, 9, 10, 11 12, 13, 14, 15
-bysocket,--bysocket 0, 2, 4, 6 1, 3, 5, 7 8, 10, 12, 14 9, 11, 13, 15
-bynode,--bynode 0, 2, 4, 6 8, 10, 12, 14 1, 3, 5, 7 9, 11, 13, 15

Setting this parameter can be important if memory use is not uniformly distributed among MPI ranks. If, for example, the first 8 ranks of your application use more memory than the last 8, you may want to balance the load by specifying the bynode option. If your application communicates mostly between neighbouring ranks, you should probably specify bycore to minimize communication between nodes and between sockets.

Another example is a hybrid MPI/OpenMP application. If, for example, your application uses 4 MPI ranks with 4 threads each, using

 [name@server $] export OMP_NUM_THREADS=4
 [name@server $] mpiexec --np 4 ./my_application

here is the distribution you obtain with the following options:

Parameter Node 0 Node 1
Processor 0 Processor 1 Processor 0 Processor 1
-bycore,--bycore 0, 1, 2, 3 - - -
-bysocket,--bysocket 0, 2 1, 3 - -
-bynode,--bynode 0, 2 - 1, 3 -
--npernode 2 --bysocket 0 1 2 3

You can see here that to distribute the load over all 4 sockets, you should add --npernode 2 and request per-socket distribution, that is, run:

 [name@server $] export OMP_NUM_THREADS=4
 [name@server $] mpiexec --np 4 --npernode 2 --bysocket ./my_application

Process binding

Generally, on a dedicated computer, processes rarely jump from core to core. Nevertheless, it can happen: the operating system must run some background tasks and uses some cores from time to time. When that happens, MPI processes are suspended and later resumed by the OS, with no guarantee that they are resumed on the same core where they started. This can lead to a situation where a process accesses physical memory with poor affinity (for example, memory attached to another socket). To ensure that processes stay bound to their initial cores or sockets, Open MPI offers the following options:

Process binding parameters

Parameter Description
-bind-to-core,--bind-to-core Bind processes to their initial cores
-bind-to-socket,--bind-to-socket Bind processes to their initial sockets
-bind-to-none,--bind-to-none No process binding

If your application's performance is heavily dependent on memory bandwidth, it is recommended to bind processes to their initial sockets or cores.

Note: in all these cases, MPI processes remain bound to their initial compute nodes.

Default parameters

Open MPI's default parameters are --bind-to-none --bycore, which means that processes are placed sequentially on the cores of the same socket, then on the same node, before moving to the next node. The OS may move processes from core to core at will (though always within the same node).

Displaying bindings

Open MPI offers the --report-bindings parameter, which shows how MPI ranks are bound. The result, written to your job's error file, looks like this:

File : report-bindings.txt
[r103-n2:05877] MCW rank 8 bound to socket 1[core 0]: [. . . .][B . . .]
[r103-n2:05877] MCW rank 10 bound to socket 1[core 1]: [. . . .][. B . .]
[r103-n2:05877] MCW rank 12 bound to socket 1[core 2]: [. . . .][. . B .]
[r103-n2:05877] MCW rank 14 bound to socket 1[core 3]: [. . . .][. . . B]
[r103-n2:05877] MCW rank 0 bound to socket 0[core 0]: [B . . .][. . . .]
[r103-n2:05877] MCW rank 2 bound to socket 0[core 1]: [. B . .][. . . .]
[r103-n2:05877] MCW rank 4 bound to socket 0[core 2]: [. . B .][. . . .]
[r103-n2:05877] MCW rank 6 bound to socket 0[core 3]: [. . . B][. . . .]
[r109-n77:05754] MCW rank 9 bound to socket 1[core 0]: [. . . .][B . . .]
[r109-n77:05754] MCW rank 11 bound to socket 1[core 1]: [. . . .][. B . .]
[r109-n77:05754] MCW rank 13 bound to socket 1[core 2]: [. . . .][. . B .]
[r109-n77:05754] MCW rank 15 bound to socket 1[core 3]: [. . . .][. . . B]
[r109-n77:05754] MCW rank 1 bound to socket 0[core 0]: [B . . .][. . . .]
[r109-n77:05754] MCW rank 3 bound to socket 0[core 1]: [. B . .][. . . .]
[r109-n77:05754] MCW rank 5 bound to socket 0[core 2]: [. . B .][. . . .]
[r109-n77:05754] MCW rank 7 bound to socket 0[core 3]: [. . . B][. . . .]

Modular Component Architecture

The Open MPI library is very modular. It is constructed using the Modular Component Architecture (MCA). It is possible to display all available MCA parameters using the following command:

[name@server $] ompi_info -a | less

Software components for message passing

The first layer is the Point-to-point Messaging Layer (PML). Two components implement this interface, called ob1 (for Obi-Wan Kenobi) and cm (for Connor MacLeod, of the film Highlander)[1].

The PML component ob1

The Byte Transfer Layer (BTL) is used by the ob1 PML to transfer bytes. Because the Force is strong with Obi-Wan Kenobi, it can use multiple types of BTL. The following BTLs are available:

BTL component of the PML ob1
Name Meaning Description
self self virtual memory copy
sm shared memory Bytes are passed through shared memory.
tcp Transmission Control Protocol Bytes are transferred using the TCP protocol.
openib OpenFabrics Everything that is compatible with OpenFabrics. InfiniBand belongs to this category.

It is possible to deactivate BTL components at run time using the MCA BTL:

[name@server $] mpiexec --mca btl ^sm -n 64 ./my_application   # this deactivates the sm BTL

The ob1 PML is used by Open MPI on Colosse and on Mammouth Parallèle II.

The PML component cm

The name cm comes from Connor MacLeod, a fictional immortal from the movie Highlander who competes with other immortals until, at the end, only one remains. For the same reason, the cm PML can only use one MTL (Matching Transport Layer) component at a time. For example, Guillimin uses the MTL component named psm (for Performance Scaled Messaging).

