NAMD is a molecular dynamics program designed for the simulation of large bio-molecular systems. NAMD can use several different kinds of parallelism:
- several cores of a single node using shared memory (threads)
- several nodes of a cluster
- one or more graphics cards (GPGPU)
or a combination of all three techniques. This software has been used on machines with more than 200,000 cores.
Use of NAMD on Helios
Two versions of NAMD are available on Helios. The first version, loaded via the module apps/namd-multicore, corresponds to the GPU-enabled binaries published on the NAMD website. This is the version we recommend if you want to run your job on a single node.
NAMD's performance scales rather well with the available resources. We ran the ApoA1 benchmark (92,224 atoms) on Helios using the module apps/namd-multicore via the command
[name@server $] namd2 +pN apoa1.namd
where N is the number of threads. The table below summarizes the performance.
|Number of threads||Without GPU||With GPU||With twoawayx yes option||With binding||Efficiency|
|1 (1 GPU)||2.5||0.21||0.21||0.11||x|
|5 (2 GPUs)||0.48||0.047||0.041||0.026||100%|
|10 (4 GPUs)||0.26||0.023||0.020||0.013||100%|
|20 (8 GPUs)||0.064||0.010||0.0075||0.0075||87%|
In this table, the "Without GPU" column corresponds to the use of NAMD without any GPU support. The "With GPU" column adds GPU support, increasing the number of GPUs in proportion to the number of threads. For the "With twoawayx yes option" column, the twoawayx yes option is added to the file apoa1.namd, which artificially increases the number of cells to maximize GPU utilization. The "With binding" column pins the threads to specific CPU cores on Helios (CPU binding). Finally, the last column shows the efficiency of the NAMD job as a function of the resources allocated, using two GPUs and five threads as the reference point.
The rows for 5, 10 and 20 threads correspond respectively to a quarter, a half and a full Helios node. Note that if a single NAMD run uses all of the resources allocated to the job by Moab, the affinity between threads and CPU cores is taken care of by the system. If you instead run several NAMD calculations inside a single job, you should manage the affinity yourself with the tool numactl. For example, running the following commands
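The efficiency figures can be reproduced from the "With binding" column. The sketch below assumes that the table entries are times per step (lower is better) and that ideal scaling divides the reference time by the ratio of thread counts; the function name efficiency is hypothetical and only serves this illustration.

```shell
# efficiency REF_THREADS REF_TIME THREADS TIME
# Prints the parallel efficiency in percent, relative to the reference run.
# Assumption: the timings are times per step, so lower is better.
efficiency() {
    awk -v rn="$1" -v rt="$2" -v n="$3" -v t="$4" \
        'BEGIN { printf "%.0f%%\n", 100 * (rt * rn) / (t * n) }'
}

# Reference point: 5 threads and 2 GPUs at 0.026 ("With binding" column).
efficiency 5 0.026 10 0.013    # half node
efficiency 5 0.026 20 0.0075   # full node
```

With the table values, the half node comes out at 100% and the full node at 87%, matching the last column.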
[name@server $] export CUDA_VISIBLE_DEVICES=0,1; numactl --physcpubind=0-4 namd2 +p5 +idlepoll apoa1.namd &
[name@server $] export CUDA_VISIBLE_DEVICES=2,3; numactl --physcpubind=5-9 namd2 +p5 +idlepoll apoa1.namd &
[name@server $] export CUDA_VISIBLE_DEVICES=4,5; numactl --physcpubind=10-14 namd2 +p5 +idlepoll apoa1.namd &
[name@server $] export CUDA_VISIBLE_DEVICES=6,7; numactl --physcpubind=15-19 namd2 +p5 +idlepoll apoa1.namd &
[name@server $] wait
in a job that asks for eight GPUs will result in much better performance than running the following commands
[name@server $] export CUDA_VISIBLE_DEVICES=0,1; namd2 +p5 +idlepoll apoa1.namd &
[name@server $] export CUDA_VISIBLE_DEVICES=2,3; namd2 +p5 +idlepoll apoa1.namd &
[name@server $] export CUDA_VISIBLE_DEVICES=4,5; namd2 +p5 +idlepoll apoa1.namd &
[name@server $] export CUDA_VISIBLE_DEVICES=6,7; namd2 +p5 +idlepoll apoa1.namd &
[name@server $] wait
By contrast, running four separate jobs with two GPUs each, using the command
[name@server $] namd2 +p5 +idlepoll apoa1.namd
will give the same performance as the first case.
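The four pinned launches shown earlier follow a regular pattern: run i (counting from 0) uses GPUs 2i and 2i+1 and cores 5i to 5i+4. Under that assumption they can be generated in a loop. In this sketch the function name pinned_cmd is hypothetical, the environment variable is set as a per-command prefix instead of with export, and the command lines are only printed so they can be checked before use.

```shell
# pinned_cmd I: print the launch line for the I-th quarter of a node
# (I = 0..3), following the pattern above: two GPUs and five cores each.
pinned_cmd() {
    i=$1
    gpus="$((2 * i)),$((2 * i + 1))"
    cores="$((5 * i))-$((5 * i + 4))"
    echo "CUDA_VISIBLE_DEVICES=$gpus numactl --physcpubind=$cores namd2 +p5 +idlepoll apoa1.namd"
}

# Dry run: show the four command lines without starting NAMD.
for i in 0 1 2 3; do
    pinned_cmd "$i"
done
```

In a real job, each printed line would be run in the background with a trailing & and the script would end with a single wait, as in the pinned commands above.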
If you want to carry out a computation using more than one node, you can use the second version, apps/namd-mpi. It offers slightly lower performance than the multicore version on a single node but lets you use several nodes to get a better speed-up. The MPI version is available for versions 1.6.x and 1.8.x of OpenMPI. In our tests, we obtained 0.0085 s/step on one node and 0.006 s/step on two nodes for the ApoA1 benchmark. These results were obtained by starting NAMD with the following parameters, depending on the OpenMPI version:
[name@server $] mpiexec -np X --npernode 1 --report-bindings namd2 +ppn19 +idlepoll apoa1.namd
for OpenMPI 1.6.5 and
[name@server $] mpiexec -np X --bind-to none --map-by ppr:1:node --report-bindings namd2 +ppn19 +idlepoll apoa1.namd
for OpenMPI 1.8.1. Note that the option --bind-to none is very important with OpenMPI 1.8 for all hybrid jobs because the default binding parameters have changed since OpenMPI 1.7.4.
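A job script that has to work with either OpenMPI module could select the options by version. The following is only a sketch: the function name mpiexec_flags and the variable nodes are hypothetical, the options are exactly those of the two commands above, and any other OpenMPI version is rejected because it was not covered by these tests. The command line is printed, not executed.

```shell
# mpiexec_flags VERSION: print the mpiexec options used in this section
# for the given OpenMPI version string (for example 1.6.5 or 1.8.1).
mpiexec_flags() {
    case "$1" in
        1.6.*) echo "--npernode 1 --report-bindings" ;;
        1.8.*) echo "--bind-to none --map-by ppr:1:node --report-bindings" ;;
        *)     echo "untested OpenMPI version: $1" >&2; return 1 ;;
    esac
}

# Hypothetical node count: with one process per node, this is the X above.
nodes=2
echo "mpiexec -np $nodes $(mpiexec_flags 1.8.1) namd2 +ppn19 +idlepoll apoa1.namd"
```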
Note that the MPI version is currently experiencing stability problems and you may get a segmentation fault error when using several nodes.
The MPI version of NAMD can also be useful on a single node if your problem requires the use of a single GPU per process. In that case, the multicore version stops with the following error: FATAL ERROR: PME offload requires exactly one CUDA device per process.
Several Processes per GPU
If you use the MPI version of NAMD, you may want to have more processes than GPUs. For example, you could create 20 MPI processes even though the nodes have 8 GPUs. Our experience has shown that, for ApoA1, this gives poorer performance than running NAMD as indicated in the preceding section. If you nonetheless wish to try it with your problem, note that you must ask for GPUs in shared mode when submitting the job to Moab. To do this, change the line #PBS -l nodes=1:gpus=8 to #PBS -l nodes=1:gpus=8:shared. Note as well that you can only use this option if you ask for an entire node.