Pros and cons
Open MPI is easy to use and install, and its modular architecture allows you to fine-tune your computation.
Wrapper scripts for compilers
Like most MPI distributions, Open MPI offers wrapper scripts for compilers. These scripts call a backend compiler (GCC, for example) with all the options required to compile MPI code. The wrapper's name depends on the programming language: call mpicc for C; mpiCC, mpicxx or mpic++ for C++; and mpif77 or mpif90 for Fortran 77 or 90.
The -showme option displays the backend compiler that the wrapper calls, along with the options it passes to it.
[name@server $] mpicxx -showme
g++ -I/usr/include/openmpi-x86_64 -pthread -m64 -L/usr/lib64/openmpi/lib -lmpi_cxx -lmpi
You can run a program compiled with Open MPI by starting it via the mpiexec command. For example:
[name@server $] mpiexec ./my_application application options
mpiexec accepts a large number of options. Basic arguments control the total number of processes, the number of processes per resource (e.g., node), process distribution and process pinning, etc. These tunings are particularly important if you would like to optimize your processes' memory affinity.
Number of processes
The total number of MPI processes is controlled by the -c, -np, --np, -n and --n options, which are all synonymous in Open MPI. For example, to run 16 processes, use
[name@server $] mpiexec -n 16 ./my_application
On most Calcul Québec supercomputers (exception: Mp2), you do not need to specify this parameter if you are executing the same number of processes as are reserved by your job using PBS.
Furthermore, you can give the number of processes used per resource (e.g., node) using the following options
|Option||Description|
|-npernode, --npernode||Number of processes per compute node|
|-npersocket, --npersocket||Number of processes per socket|
It is rarely necessary to set these numbers, but it can be useful for a hybrid MPI/OpenMP job. In that case, if each node has 8 cores, you could set the environment variable OMP_NUM_THREADS to 8 and request one process per node with --npernode 1, as follows:
[name@server $] export OMP_NUM_THREADS=8
[name@server $] mpiexec --npernode 1 ./my_application
Note that your application may well run faster with more MPI ranks and fewer OpenMP threads. You could then use one of the following instead:
[name@server $] export OMP_NUM_THREADS=4
[name@server $] mpiexec --npernode 2 ./my_application
[name@server $] export OMP_NUM_THREADS=2
[name@server $] mpiexec --npernode 4 ./my_application
Which of these is best depends heavily on the application, its algorithms and their implementation. The only way to be sure is to test your application with different parameters.
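The trade-off above comes down to simple arithmetic: the product of ranks per node and threads per rank should match the number of cores per node. A minimal sketch (the 8-core node is an assumption for illustration, and this ignores hyper-threading):

```python
# Sketch: on a node with a fixed number of cores, (--npernode ranks) x
# (OMP_NUM_THREADS threads per rank) should equal the core count, so that
# every core runs exactly one thread.
CORES_PER_NODE = 8  # assumption for this example

def hybrid_combinations(cores):
    """All (ranks_per_node, threads_per_rank) pairs that exactly fill the node."""
    return [(r, cores // r) for r in range(1, cores + 1) if cores % r == 0]

for ranks, threads in hybrid_combinations(CORES_PER_NODE):
    print(f"OMP_NUM_THREADS={threads} mpiexec --npernode {ranks} ./my_application")
```

Each printed line is one candidate configuration to benchmark.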
You can also specify how processes are distributed: by core, by socket or by node. Open MPI places processes sequentially on all available cores; these options change the order in which the cores are filled. Take, for example, a job that runs 16 MPI processes on 2 nodes, each with 2 processors of 4 cores (16 cores in total). Here is how the processes are placed depending on the chosen option.
|Parameter||Node 0||Node 1|
|Processor 0||Processor 1||Processor 0||Processor 1|
|-bycore,--bycore||0, 1, 2, 3||4, 5, 6, 7||8, 9, 10, 11||12, 13, 14, 15|
|-bysocket,--bysocket||0, 2, 4, 6||1, 3, 5, 7||8, 10, 12, 14||9, 11, 13, 15|
|-bynode,--bynode||0, 2, 4, 6||8, 10, 12, 14||1, 3, 5, 7||9, 11, 13, 15|
Setting this parameter can be important if memory use is not uniform across MPI ranks. If, for example, the first 8 ranks of your application use more memory than the last 8, you may want to balance the load by specifying the bynode option. Conversely, if your application communicates mostly between neighbouring ranks, you should probably specify bycore to minimize communication between nodes and between sockets.
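The placement rules can be mimicked with a toy model. This is a simplification for illustration, not Open MPI's actual implementation; it reproduces the 2-node, 2-socket, 4-core table above:

```python
# Toy model: place 16 ranks on 2 nodes x 2 sockets x 4 cores, following the
# -bycore, -bysocket and -bynode policies described in the text.
NODES, SOCKETS, CORES = 2, 2, 4

def place(policy, nranks=NODES * SOCKETS * CORES):
    """Return {(node, socket): [ranks]} for 'bycore', 'bysocket' or 'bynode'."""
    layout = {(n, s): [] for n in range(NODES) for s in range(SOCKETS)}
    per_node = SOCKETS * CORES
    for rank in range(nranks):
        if policy == "bycore":      # fill cores, then sockets, then nodes
            node, slot = rank // per_node, rank % per_node
            socket = slot // CORES
        elif policy == "bysocket":  # round-robin over sockets within a node
            node, slot = rank // per_node, rank % per_node
            socket = slot % SOCKETS
        else:                       # "bynode": round-robin over nodes
            node, slot = rank % NODES, rank // NODES
            socket = slot // CORES
        layout[(node, socket)].append(rank)
    return layout

print(place("bynode")[(0, 1)])  # ranks on node 0, socket 1 → [8, 10, 12, 14]
```

Comparing the output of the three policies against the table is a quick way to convince yourself of how each option interleaves ranks.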
Another example is a hybrid MPI/OpenMP application. If, for example, your application uses 4 MPI ranks with 4 threads each, using
[name@server $] export OMP_NUM_THREADS=4
[name@server $] mpiexec --np 4 ./my_application
here is the distribution you obtain with the following options:
|Parameter||Node 0||Node 1|
|Processor 0||Processor 1||Processor 0||Processor 1|
|-bycore,--bycore||0, 1, 2, 3||-||-||-|
|-bysocket,--bysocket||0, 2||1, 3||-||-|
|-bynode,--bynode||0, 2||-||1, 3||-|
|--npernode 2 --bysocket||0||1||2||3|
You can see that to spread the load over the 4 sockets, you should add --npernode 2 and request distribution by socket, that is, run:
[name@server $] export OMP_NUM_THREADS=4
[name@server $] mpiexec --np 4 --npernode 2 --bysocket ./my_application
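The same kind of toy model (a sketch for illustration, not Open MPI's real algorithm) shows why this combination spreads 4 ranks over all 4 sockets:

```python
# Sketch: with --npernode 2, each node receives 2 consecutive ranks; with
# --bysocket, those ranks are spread round-robin over the node's sockets.
def place_npernode_bysocket(nranks=4, npernode=2, sockets=2):
    """Map rank -> (node, socket) for the hypothetical 4-rank hybrid job."""
    return {rank: (rank // npernode, (rank % npernode) % sockets)
            for rank in range(nranks)}

print(place_npernode_bysocket())  # → {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1)}
```

Each rank then spawns its OpenMP threads on its own socket's cores, so no socket is left idle.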
Generally, on a dedicated computer, processes rarely jump from core to core. Nevertheless, it can happen: the operating system must run background tasks and occasionally needs some cores for them. When that happens, MPI processes are suspended and later resumed by the OS, with no guarantee that they resume on the core they started on. A process can then end up accessing memory that is physically attached to another socket, resulting in poor memory affinity. To ensure that processes stay bound to their initial cores or sockets, Open MPI offers the following options:
Process binding
|Option||Description|
|-bind-to-core, --bind-to-core||Bind each process to its initial core|
|-bind-to-socket, --bind-to-socket||Bind each process to its initial socket|
|-bind-to-none, --bind-to-none||No process binding|
If your application's performance is heavily dependent on memory bandwidth, it is recommended to bind processes to their initial sockets or cores.
Note: in all these cases, MPI processes remain bound to their initial compute nodes.
Open MPI's default parameters are --bind-to-none --bycore, which means that processes are placed sequentially on the cores of one socket, then on the next socket of the same node, before moving on to the next node; the OS may move processes from core to core at will (though always within the same node).
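On Linux, you can inspect what binding means in practice: each process carries a CPU affinity mask listing the cores it is allowed to run on. A minimal sketch using Python's standard library (Linux-only):

```python
import os

# Every Linux process has a CPU affinity mask. Without binding, it contains
# all cores of the node; with --bind-to-core it shrinks to a single core.
allowed_cores = sorted(os.sched_getaffinity(0))
print(f"this process may run on cores: {allowed_cores}")
```

Running such a check inside each MPI rank is a simple way to verify that a binding option took effect.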
Open MPI also offers the --report-bindings parameter, which reports how MPI ranks are bound; the report appears in your job's error file.
Modular Component Architecture
The Open MPI library is very modular. It is constructed using the Modular Component Architecture (MCA). It is possible to display all available MCA parameters using the following command:
[name@server $] ompi_info -a | less
Software components for message passing
The first layer is the Point-to-point Messaging Layer (PML). Two components implement this interface: ob1 (for Obi-Wan Kenobi) and cm (for Connor MacLeod, of the film Highlander).
The PML component ob1
The Byte Transfer Layer (BTL) is what the ob1 PML uses to transfer bytes. Since the Force is strong with Obi-Wan Kenobi, ob1 can use several types of BTL at once. The following BTLs are available:
|Component||Name||Description|
|self||self||Bytes sent by a process to itself are copied directly in memory.|
|sm||shared memory||Bytes are passed through shared memory.|
|tcp||Transmission Control Protocol||Bytes are transferred using the TCP protocol.|
|openib||OpenFabrics||Everything that is compatible with OpenFabrics; InfiniBand belongs to this category.|
It is possible to deactivate BTL components at run time using the btl MCA parameter:
[name@server $] mpiexec --mca btl ^sm -n 64 ./my_application # deactivates the sm BTL
The PML component cm
The name cm comes from Connor MacLeod, a fictional immortal from the movie Highlander who competes with other immortals until only one remains. For this reason, the cm PML can use only one MTL (Matching Transport Layer) component at a time. For example, Guillimin uses the MTL component named PSM (Performance Scaled Messaging).