Running jobs


General guidelines

How to use jobs

On personal computers we are most often familiar with graphical user interfaces (GUIs): there are windows, menus and buttons; we click here and there and the system responds immediately. Calcul Québec's computers are different. To start with, the environment uses a command line interface. Furthermore, the jobs you would like to run are not executed immediately, but placed in a waiting list, or queue. A job only runs once the necessary resources are available; otherwise jobs would step on each other's toes. Hence you write a file (a submission script) that describes the job to run and the resources it needs, put the job into the queue, and come back later once the job has finished. This means there is no interaction between the user and the program during the job's execution.

For job submissions, the system acts like the conductor of an orchestra. It has multiple responsibilities: it must maintain a database of all jobs that have been submitted until they finish, respect certain conditions (limits, priorities), make sure each resource is assigned to only one job at a time, decide which jobs to run and on which compute nodes, launch them on those nodes, and clean up after them once they are finished.

On Calcul Québec's computers, these responsibilities are handled by the software Torque, combined with a scheduler that decides which jobs to run. We use the schedulers Maui and Moab, depending on the server.

Resource Allocation

On a personal computer there is usually only one user at a time, whereas hundreds of users can be logged in at the same time on a cluster (the type of computer offered by Calcul Québec). A cluster typically consists of hundreds of nodes that contain between 8 and 24 cores each. Every user must explicitly ask for the resources they need. This request is mainly guided by two parameters: the time needed to complete the job (wall time) and the number of processors (number of nodes and cores) used by the job. In certain cases, one must also specify the amount of memory needed, to allow the scheduler to choose the nodes that are most appropriate for the job.

It is important to specify these parameters well. If they are too high, the job may wait longer than necessary in the queue and may block resources that other users need. If they are too low, the job may run out of time or memory before it finishes. These parameters allow the scheduler to choose which of the queued jobs to run.

A sample submission file

In this section we show a minimal example of a submission file. The exact contents vary from server to server; please consult the section "Additional server-specific documentation" below for more details.


File : minimal_example.sh
#!/bin/bash
#PBS -A abc-123-aa
#PBS -l walltime=30:00:00
#PBS -l nodes=2:ppn=8
#PBS -q queue
#PBS -r n
 
module load compilers/intel/12.0.4
module load mpi/openmpi/1.4.5_intel
 
cd $SCRATCH/workdir
mpiexec /path/to/my/mpi_program


In this file, lines starting with #PBS are options that are passed to the task manager. Lines starting with # (but not #PBS or #!) are comments and are ignored. All other lines form the script that will run on a compute node. A detailed explanation follows:

  • #!/bin/bash : this line, which must be the first line of the file, indicates that the script is interpreted by /bin/bash. Other interpreters are possible (/bin/tcsh, /bin/sh or even /bin/env perl).
  • #PBS -A abc-123-aa : this defines the project identifier (RAP Id) to use. It is compulsory for all jobs on Colosse and Helios, and optional on Briarée, Cottos, Guillimin and Hadès when a default project is defined. It determines which resources can be allocated to the job and accounts the job's resource usage to the correct project. For the moment, it does not apply to Mp2, Ms2 and Psi. On Colosse and Helios you can find your RAP Id using the command serveur-info.
  • #PBS -l walltime=30:00:00 : this is the wall time reserved for the job, in hours, minutes and seconds. After this time, Torque kills the job, whether it has finished or not.
  • #PBS -l nodes=2:ppn=8 : this is the number of nodes and the number of cores ("processors per node", ppn) that you need.
  • #PBS -q queue : this is the queue you want to use. All servers define a default queue, so this line may not be necessary.
  • #PBS -r n : indicates that the job must not be restarted. Some servers restart jobs in case of error, which means that Torque may start them again from the beginning; the user must then ensure that the job can be restarted without undesirable side effects.
  • module load compilers/intel/12.0.4 : we load the Intel compiler. The module name varies from machine to machine.
  • module load mpi/openmpi/1.4.5_intel : we load the MPI library that we would like to use (the same one that was used when the program was compiled).
  • cd $SCRATCH/workdir : when the script starts, the initial directory is the user's home directory ($HOME). Usually one wants to work in another directory, typically within a file system that is appropriate for large files (such as the one $SCRATCH points to).
  • mpiexec /path/to/my/mpi_program : here you call the program. For an MPI program, the mpiexec command launches the processes on all available processors. Here we have not specified the number of processes to start; note that the way to start an MPI program may vary from server to server.

Interaction with the job submission system

Job submissions

After writing the script, you submit it to Torque. The command to do this is qsub. On Colosse, you should instead use the Moab command msub. When you run this command, the job ID is printed once your job has been successfully placed in the queue. In addition to the options included in the script with the #PBS syntax, you can pass options to qsub and msub on the command line. Command-line options take precedence over those specified in the script.
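For example, assuming the script above is saved as minimal_example.sh, the wall time could be overridden directly on the command line (the value shown is arbitrary):

[name@server $] qsub -l walltime=10:00:00 minimal_example.sh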

Showing the list of jobs

The qstat -a -u $USER command shows the list of jobs for a user. With Maui and Moab, you can also run showq -u $USER. In both cases, remove -u $USER to see the jobs of all users.

Removing or killing a job

With qdel Job ID (Torque) or mjobctl -c Job ID (Moab) you can delete or kill a job. The job ID is the one that was displayed when you submitted the job; you can also see it using qstat or showq.
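For example, to kill the job whose ID is 99485 (a hypothetical ID, also used in the output-file example below):

[name@server $] qdel 99485

or, with Moab,

[name@server $] mjobctl -c 99485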

Obtaining standard output and standard error

Messages that would normally be displayed on the terminal when you run a program interactively are instead written to files by Torque, but only at the end of the job's execution. Look for files with the extensions .oJob ID (for standard output) and .eJob ID (for standard error), for example pbs_job.o99485 and pbs_job.e99485. There you can find the results of your simulations, and error messages if everything did not go as intended.

If the job writes a large amount of data to standard output, it is better to redirect it to a file:

mpiexec /path/to/my/mpi_program > output_file

Priority

Job priority is calculated using the fair-share algorithm. Jobs are not necessarily run in the order in which they were submitted; the scheduler sorts the jobs while keeping track of the share of the server that should be attributed to each group (based on an annual allocation) and of the server's use by that group during the previous month. The scheduler also takes into account the available resources and the time a job has been waiting in the queue, so that all jobs that respect the limits are eventually run.

Since the number of hours allocated to a group only determines execution priority, it is always possible to submit jobs that exceed this limit. We reserve the right to decrease the priority of a group's computations if that group consumes more than double its fair share during one month. We normally apply this measure only in the event of exceptionally heavy use of the systems.

Other references

Example script documenting qsub's options

Example scripts for multiple serial simulations, OpenMP jobs, and hybrid jobs

Running an interactive job

In case of problems

If your job does not run, please send us your script, the server name, and all information needed to reproduce your problem. We will also need your job ID if you obtained one.

Additional server-specific documentation

Briarée

The job submission system on Briarée consists of the Torque task manager together with the Maui scheduler. Jobs are submitted using the command "qsub":

[name@server $] qsub [options] script.pbs


The "-A" option used to specify the project (RAP Id) is optional if you only have one active project or only one active project with an allocation. If you have several active projects, you can define a default project if you write it in the file "$HOME/.projet". If the project that you request is invalid, your active projects will be displayed.

Choice of computing nodes

Available memory varies from node to node on Briarée. All nodes have 12 cores, but there are 316 nodes with 24 GB (2 GB per core), 314 nodes with 48 GB (4 GB per core), and 42 nodes with 96 GB (8 GB per core). You can ask for nodes with 48 GB by adding the property "m48G" in the following way: -l nodes=100:m48G:ppn=12. Similarly you can specify the m96G property to obtain nodes with 96 GB, and m24G to get nodes with 24 GB. It is however generally not necessary to ask for m24G because Maui checks for availability of nodes in increasing memory order.
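The same property can also be given on the qsub command line or as a #PBS directive; a minimal sketch, assuming a job that needs two 48-GB nodes:

[name@server $] qsub -l nodes=2:m48G:ppn=12 script.pbs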

You should not ask for nodes with 48 GB or 96 GB if less memory suffices. If you do, you risk preventing other, more memory-hungry jobs from occupying those nodes, and you may increase the waiting time of the job that you submitted.

When a job needs 96-GB nodes, you have to request the property m96G to get a higher priority for this job on these nodes.

To see available resources on nodes with 24 GB, use the following command:

[name@server $] pbs_free :m24G


and similarly for nodes with 48 GB and 96 GB, using the m48G and m96G properties, respectively.

Queues

It is not necessary to specify a queue at submission time. The default queue is a routing queue that steers the job into the correct execution queue according to the number of nodes and the time requested. The limits for the different queues are shown in the table below.

Queue Maximum execution time Minimum number of nodes per job Maximum number of nodes per job Maximum number of cores per user Maximum number of jobs per user Maximum number of cores for all jobs
test 1 h 1416
courte 48 h 4 72
normale 168 h (7 days) 4 1416 36
longue 336 h (14 days) 4 180 24 720
hpcourte 48 h 5 171
hp 168 h (7 days) 5 171 2052 8

A group cannot run more than 1416 jobs and cannot use more than 2520 cores at the same time.

Jobs that ask for more than 4 nodes (48 cores) are put into the "hp" and "hpcourte" queues. We limit access to those queues to users who demonstrate that their model uses resources efficiently. Please contact us for more details. Similarly, the "longue" queue is limited to those who ask for access to it.

We have reserved 4 nodes with 48 GB exclusively for the "test" queue, to which a job is steered if it asks for one hour or less of wall time. If those 4 nodes are occupied, "test" queue jobs can also be routed to other nodes, according to their availability.

To view the available resources on the "test" queue, use the following command:

[name@server $] pbs_free :test


To view the available resources on all queues other than "test", please type:

[name@server $] pbs_free :normal


Multiple simulations per node

If a user launches 12 serial jobs on Briarée, Maui will try to group them onto the same node, and similarly for parallel jobs that use fewer than 12 cores per node. If this is not the behaviour you want, you should ask for complete nodes (ppn=12) or specify a sufficiently large memory requirement so that Maui does not place more than one job on each node.

Only jobs from the same user can share nodes.

Using MPI

All MPI libraries that are available on Briarée use the "mpiexec" command to run MPI programs. If you would like to run a process on each core obtained via Torque, you don't need to specify the number of processes or the list of nodes to use because "mpiexec" will get that information directly from Torque.


File : openmpi_briaree.sh
#!/bin/bash
#PBS -l walltime=30:00:00
#PBS -l nodes=2:ppn=12
#PBS -r n
 
 
module load intel-compilers/12.0.4.191
module load MPI/Intel/openmpi/1.6.2
 
 
cd $SCRATCH/workdir
mpiexec /path/to/my/mpi_program

Colosse

The job submission system on Colosse consists of the Torque task manager together with the Moab scheduler. Jobs are submitted using the command "msub":

[name@server $] msub [options] script.pbs


The "-A" option is necessary to specify your project (RAP Id), as well as a specification of your wall time (-l walltime) and computational (-l nodes=1:ppn=8) resources.

Choice of computing nodes

Available memory varies from node to node on Colosse. All nodes have 8 cores, but there are 936 nodes with 24 GB (3 GB per core) and 24 nodes with 48 GB (6 GB per core). You can ask for nodes with 48 GB by adding the "-l feature='48g'" option to the "msub" command.

You should not ask for nodes with 48 GB if less memory suffices. If you do, you risk preventing other, more memory-hungry jobs from occupying those nodes. Moreover, the waiting time of your job will be longer on those nodes because there are so few of them.

Job size and duration limits

The maximum duration of a job on Colosse is 48 hours. The maximum job size varies depending on its duration.

  • For a job of up to 24 hours, up to 32 nodes can be used.
  • For a job between 24 and 48 hours, up to 16 nodes can be used.

Queues

It is not necessary to specify a queue at submission time. The default queue is a routing queue that steers the job into the correct execution queue according to the number of nodes and the time requested.

Using MPI

You should load the module corresponding to the desired MPI library version. The list of versions available on Colosse can be obtained using

[name@server $] module avail mpi


With OpenMPI, the example submission script given in the general section above works very well. If you would like to run a process on each core obtained via Torque, you don't need to specify the number of processes or the list of nodes to use because "mpiexec" will get that information directly from Torque.


Reserving memory

Compute nodes are equipped with 24 GB of RAM for 8 cores, so you have 3 GB available per process. If, for a particular application, a process needs more than 3 GB, you should use only a fraction of the available cores by specifying the number of processes per node to MPI. For example, for a job that needs 6 GB per process, you could use the following submission script:


File : openmpi_colosse_4cores.sh
#!/bin/bash
#PBS -A abc-123-aa
#PBS -l walltime=30:00:00
#PBS -l nodes=2:ppn=8
 
module load compilers/intel/12.0.4
module load mpi/openmpi/1.4.5_intel
 
mpiexec -npernode 4 /path/to/my/mpi_program


Multiple simulations per node

If a user starts 8 serial jobs on Colosse, they will run on different nodes, because every job always obtains complete nodes. To avoid wasting resources, you should group multiple simulations into the same job, if memory allows.
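A minimal sketch of such a grouping, assuming 8 independent serial runs; the program name, input files and working directory are hypothetical:

File : serial_group_colosse.sh
#!/bin/bash
#PBS -A abc-123-aa
#PBS -l walltime=30:00:00
#PBS -l nodes=1:ppn=8
 
cd $SCRATCH/workdir
# Launch one serial simulation per core, in the background
for i in $(seq 1 8); do
  ./my_serial_program input_$i.txt > output_$i.txt &
done
# Wait until all 8 simulations have finished before the job ends
wait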

Cottos

The job submission system on Cottos consists of the Torque task manager together with the Maui scheduler. Jobs are submitted using the command "qsub":

[name@server $] qsub [options] script.pbs


The "-A" option used to specify the project (RAP Id) is optional if you only have one active project or only one active project with an allocation. If you have several active projects, you can define a default project if you write it in the file "$HOME/.projet". If the project that you request is invalid, your active projects will be displayed.

Choice of computing nodes

All computing nodes on Cottos are identical. They all have 8 cores and 16 GB of RAM (2 GB per core). So you do not need to choose your nodes.

Queues

It is not necessary to specify a queue at submission time. The default queue is a routing queue that steers the job into the correct execution queue according to the number of nodes and the time requested. You can obtain the limits for the different queues using the "qstat -q" command, run on that machine.

Jobs that ask for more than 4 nodes (32 cores) are put into the "hp" and "hpcourte" queues. We limit access to those queues to users who demonstrate that their model uses resources efficiently. Please contact us for more details.

We have reserved 2 nodes exclusively for the "test" queue, to which a job is steered if it asks for one hour or less of wall time. If those 2 nodes are occupied, "test" queue jobs can also be routed to other nodes, according to their availability.

To view the available resources on the "test" queue, use the following command:

[name@server $] pbs_free :test


To view the available resources on all queues other than "test", please type:

[name@server $] pbs_free :normal


Multiple simulations per node

If a user launches 8 serial jobs on Cottos, Maui will try to group them onto the same node, and similarly for parallel jobs that use fewer than 8 cores per node. If this is not the behaviour you want, you should ask for complete nodes (ppn=8) or specify a sufficiently large memory requirement so that Maui does not place more than one job on each node.

Only jobs from the same user can share nodes.

Using MPI

All MPI libraries that are available on Cottos use the "mpiexec" command to run MPI programs. If you would like to run a process on each core obtained via Torque, you don't need to specify the number of processes or the list of nodes to use because "mpiexec" will get that information directly from Torque.


File : openmpi_cottos.sh
#!/bin/bash
#PBS -l walltime=30:00:00
#PBS -l nodes=2:ppn=8
#PBS -r n
 
module load intel-compilers/11.0.083
module load openmpi_intel64/1.4.1
 
cd $SCRATCH/workdir
mpiexec /path/to/my/mpi_program

Guillimin

The job submission system on Guillimin consists of the Torque task manager together with the Moab scheduler. Jobs are submitted using the Torque command "qsub":

[name@server $] qsub [options] script.pbs


The project is specified using the "-A abc-123-xy" option, where "abc-123-xy" is replaced by the RAP Id. If this option is absent and the user only has access to one default project, the job uses the default project.

Queues

Computing nodes of the Guillimin cluster are grouped into 7 partitions:

  • Serial Workload (SW) - for serial jobs and "light" parallel jobs; memory: 3 GB/core
  • High Bandwidth (HB) - for massively parallel jobs; memory: 2 GB/core
  • Large Memory (LM) - for jobs requiring a large memory footprint; memory: 6 GB/core
  • Serial Workload (SW2) - for serial jobs and "light" parallel jobs; memory: 4 GB/core
  • Large Memory (LM2) - for jobs requiring a large memory footprint; memory: 8 GB/core
  • Extra Large Memory (XLM2) - for jobs requiring a very large memory footprint; memory: 12, 16 or 32 GB/core
  • Accelerated Workload (AW2) - for jobs that use GPUs or Intel Xeon Phi accelerators (4 or 8 GB/core)

Queues on Guillimin are organized in a flexible fashion around the default queue (metaq). Depending on the type of job, you submit it by specifying the number of cores (nodes and ppn, or procs) and the minimum required memory per core (pmem).
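A minimal sketch of such a submission, with arbitrary values for the core count, memory per core and wall time:

[name@server $] qsub -l nodes=1:ppn=4 -l pmem=2700m -l walltime=3:00:00 script.pbs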

The "debug" queue is a special one. It was created specifically to allow you to test your code before submitting it for a long run. Jobs submitted to the "debug" queue should normally start almost immediately, so you can quickly see whether your program behaves as you expect. There are, however, strict resource and time limits for this queue: the default running time is only 30 minutes (the maximum is 2 hours), and a job is allowed to use at most 4 CPU cores. If the parameters in your submission file exceed these limits, your job will be rejected!
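For example, a short test could be submitted to the "debug" queue as follows (4 cores and 30 minutes stay within the stated limits):

[name@server $] qsub -q debug -l nodes=1:ppn=4 -l walltime=00:30:00 script.pbs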

The "metaq" queue is the default queue.


IMPORTANT:

  • The above-mentioned compute nodes have different amounts of memory! Jobs are put on nodes according to the value of the pmem parameter (memory per core), which is by default 1700m for jobs using multiple nodes with ppn=12, 2700m for jobs using ppn<12, and 3700m for jobs using ppn=16. Be aware that if your job exceeds these limits, it will be killed automatically.
  • Please note that the default "metaq" queue can also handle parallel jobs. If your code does not require a lot of inter-processor communication, you will probably not notice any performance issues while using this queue. However, if your program performs large data exchanges between nodes (for example, a 3D FFT parallelized with MPI), then you need to use the "hb" or "lm" queue instead.
  • If you are using thread-based parallelization (like OpenMP), you should specify "nodes=1:ppn=m", because your program does not perform any inter-node data exchange; see the sketch after this list.
  • It is ALWAYS a good idea to submit a short test job to the "debug" queue before a long run. That way you will immediately know whether your program works as expected. However, please do NOT use the "debug" queue for real production runs! Remember that your job will be automatically killed after 30 minutes!
  • The DEFAULT walltime for all queues except "debug" is 3 hours, and the MAXIMUM allowed walltime is 30 days.
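A minimal sketch of such an OpenMP-style request on a single node, assuming a hypothetical threaded program ./my_openmp_program and arbitrary wall time and memory values:

File : openmp_guillimin_sketch.sh
#!/bin/bash
#PBS -A abc-123-aa
#PBS -l nodes=1:ppn=8
#PBS -l pmem=2700m
#PBS -l walltime=3:00:00
 
cd $PBS_O_WORKDIR
# Use as many OpenMP threads as cores reserved on the node
export OMP_NUM_THREADS=8
./my_openmp_program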

Multiple simulations per node

Multiple jobs can run on the same node if you specify nodes=1:ppn=m where m<12, whether they belong to the same user or to different users. If this is not what you want, you should ask for complete nodes (ppn=12 or ppn=16) or specify a sufficiently large memory requirement so that Moab does not put more than one job on each node.


Using MPI

Here is an example submission script for an MPI job that runs your program on 36 CPU cores:


File : openmpi_guillimin.sh
#!/bin/bash
#PBS -A abc-123-aa
#PBS -l walltime=30:00:00
#PBS -l nodes=3:ppn=12
#PBS -l pmem=5700m
 
module load ifort_icc/14.0.4
module load openmpi/1.6.3-intel
 
cd $PBS_O_WORKDIR
mpiexec -n 36 ./code


The particular features of this submission script are as follows:

  • The line "#PBS -l nodes=3:ppn=12" asks the scheduler to reserve 36 CPU cores on 3 nodes.
  • The line "#PBS -l pmem=5700m" asks the scheduler to reserve 5.7G per core, so that the job runs on an LM node.
  • The "module load ..." commands load the modules that correspond to the compiler and MPI library that are used for compiling the program. On Guillimin, you have access to different versions of OpenMPI and MVAPICH2. The "module avail" command shows the available versions. These packages are specially built using InfiniBand libraries and ensure that MPI traffic of your parallel application will go through the InfiniBand network. We do not recommend installing your own MPI library, for it will probably be unaware of the InfiniBand network.
  • The line "mpiexec -n 36 ./code" starts the program "code", compiled with MPI, in parallel on 36 cores. The program "mpiexec" is a mentioned above launcher which "organizes" all communications between the MPI processes. Parameter "-n" should never be larger than the number of cores (nodes*ppn) in "#PBS -l nodes=...:ppn=12" line. If you do not specify the number of processes, the "mpiexec" command will run one per core.


Using ScaleMP

One of the resources available on Guillimin is a collection of nodes that behaves like a single computer with a large shared memory. This system, called ScaleMP, consists of 11 nodes with 12 cores per node and 8 GB of memory per core. Thanks to software running on these nodes, ScaleMP can be used like a single node with 132 cores and 1000 GB of memory. This system is therefore particularly useful for researchers who require access to large amounts of memory. More information about ScaleMP can be found here.

All Guillimin users have access to the ScaleMP machine through the "scalemp" queue. Here is an example submission script, followed by the corresponding "qsub" command:


File : smpsubmit.bat
#!/bin/bash
#PBS -l nodes=1:ppn=16
#PBS -l walltime=0:10:00
# To associate OpenMP threads to consecutive cores.
export KMP_AFFINITY=compact,verbose,0,0
export MKL_VSMP=1
cd $HOME/smpdir
export OMP_NUM_THREADS=16
 
./openmp-mm


[name@server $] qsub -q scalemp ./smpsubmit.bat


Several examples of programs designed specifically for the ScaleMP system are stored in the directory "/opt/ScaleMP/examples". It is a good idea to make a copy of these programs in your home directory so that you may alter, compile, and run them:


 
 [name@server $] mkdir ScaleMPexamples
 [name@server $] cp -R /software/ScaleMP/examples/* ScaleMPexamples/.
 


This example folder contains examples of many different types of jobs (e.g. Serial, MPI, OpenMP, MKL, Pthread, etc). Each type of job contains an example run-script and useful information. Please identify the type of job that best resembles your application, carefully follow the relevant guidelines, and modify a copy of the example run-script as appropriate for your application.

Hadès

The job submission system on the Hadès cluster consists of the Torque task manager together with the Maui scheduler. Jobs are submitted using the command "qsub", either from briaree1, Briarée's login node, or from hades, Hadès' login node. The command

[name@server $] qsub -q @hades -lnodes=1:ppn=3 -lwalltime=48:00:00 script.pbs


gives you three GPUs and three cores. You can use 8 cores on one node only if you ask for the complete node using "-lnodes=1:ppn=7", since Torque believes there are only 7 cores per node.
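For example, a request for a complete node could look like this (the wall time is arbitrary):

[name@server $] qsub -q @hades -lnodes=1:ppn=7 -lwalltime=48:00:00 script.pbs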

The "-A" option used to specify the project (RAP Id) is optional if you only have one active project or only one active project with an allocation. If you have several active projects, you can define a default project if you write it in the file "$HOME/.projet". If the project that you request is invalid, your active projects will be displayed.

You can also run the following commands to query Torque on Hadès:

 
 [name@server $] qstat -a @hades
 [name@server $] pbsnodes -s hades
 [name@server $] pbs_free hades
 


Before compiling (on briaree1 or the login node hades) and running (in your submission script), you should load the CUDA module:


[name@server $] module add CUDA



Choice of computing nodes

All computing nodes on Hadès are identical. They all have 8 cores and 24 GB of RAM (3 GB per core). So you do not need to choose your nodes.

Queues

It is not necessary to specify a queue at submission time. The default queue is a routing queue that steers the job into the correct execution queue according to the number of nodes and the time requested. You can obtain the limits for the different queues using the command


[name@server $] qstat -q @hades


run on hades or briaree1.

Multiple simulations per node

Multiple jobs can run on the same node on Hadès, whether they belong to the same user or to different users. If this is not what you want, you should ask for complete nodes (ppn=7) or specify a sufficiently large memory requirement so that Maui does not put more than one job on each node. Because jobs share nodes, it is essential to use only the resources you asked for, no more; otherwise, jobs will step on each other's toes.

Using MPI

Hadès' software environment is the same as Briarée's, so you can submit MPI jobs in the same fashion. For MPI jobs that use graphics processors (GPUs), the only differences are that you need to load the CUDA module and that the program must be written to use the GPUs.
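A minimal sketch of such a submission script, reusing the Briarée module names shown earlier (since the environment is the same) and a hypothetical GPU-enabled MPI program; the path and duration are illustrative:

File : mpi_cuda_hades.sh
#!/bin/bash
#PBS -l walltime=48:00:00
#PBS -l nodes=1:ppn=3
#PBS -r n
 
# Same software environment as on Briarée, plus CUDA for the GPUs
module load intel-compilers/12.0.4.191
module load MPI/Intel/openmpi/1.6.2
module add CUDA
 
cd $SCRATCH/workdir
mpiexec /path/to/my/gpu_mpi_program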

Helios

Compulsory Parameters

The job submission system on Helios is Torque in conjunction with the scheduler Moab. You submit a job using the Moab command msub:

[name@server $] msub [options] script.pbs


The -A option specifies the project (RAP Id) and is compulsory, as are the job's duration (-l walltime=) and the compute resources needed (-l nodes=X:gpus=Z). In particular, note that in order to avoid having two users share the same PCI bus, you must request an even number of GPUs whenever you request more than one. Also keep in mind that the option ppn=z is forbidden; you automatically obtain five CPU cores (on K20 nodes) or three CPU cores (on K80 nodes) for each pair of GPUs requested.

File : script_soumission.sh
#!/bin/bash
#PBS -N MyJob
#PBS -A abc-123-aa
#PBS -l walltime=300
#PBS -l nodes=1:gpus=2
cd "${PBS_O_WORKDIR}"


The command cd ${PBS_O_WORKDIR} is needed to ensure that the job executes in the directory from which the script was submitted.

Asking for GPUs

The number of GPUs per compute node is specified using the option -l nodes=x:gpus=y. For example, the request -l nodes=1:gpus=4 gives you one node with 4 GPUs and 10 CPU cores on that node. Each Helios K20 node has 8 GPUs and 2 processors (sockets) with 10 cores each, while each K80 node has 16 GPUs and 2 processors (sockets) with 12 cores each. You can therefore ask for at most 8 GPUs per node on K20 nodes, and 16 GPUs on K80 nodes.

Maximum job duration

The maximum job duration is 12 hours.

Default output and error files

On Helios, the job's default output and error files are ${MOAB_JOBID}.out and ${MOAB_JOBID}.err, created in the job's execution directory. You can change these names using the usual -o and -e options.

Submission queue

It is not necessary to specify a submission queue on Helios. The submission queue is automatically determined as a function of the walltime and the number of GPUs.

K20 vs K80 nodes

You may specify whether you want K20 or K80 nodes using the option -l feature=k20 or -l feature=k80. If neither option is specified, the job will be routed to K20 nodes if it requests more than 1 GPU, and may run on either K20 or K80 nodes if it requests a single GPU.
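For example, to explicitly target K80 nodes (reusing the hypothetical RAP Id and the 300-second duration from the example above):

[name@server $] msub -A abc-123-aa -l feature=k80 -l nodes=1:gpus=2 -l walltime=300 script.pbs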

helios-info

On Helios, the command helios-info provides information about your group's use of the cluster as well as the general use of Helios. For example,

[name@server $] helios-info 
 Your Rap IDs are: corem colosse-users exx-883-ac bwg-974-aa exx-883-ab exx-883-aa six-213-ad
Total number of jobs currently running: 28
Total number of k20 in use: 116/120 (96.00%).
Total number of k80 in use: 33/96 (34.00%).
You are currently using 0 k20 and 0 k80 for 0 job(s).
RAPI six-213-ad: 0 used GPUs / 0 allocated GPUs (recent history)

Mammouth parallèle II

The job submission system on Mp2 consists of the Torque task manager together with the Maui scheduler. Jobs are submitted using the command "qsub":

[name@server $] qsub [options] script.pbs


You can also submit jobs using the bqTools software. This tool is useful for submitting a very large number of jobs; it is more complicated than PBS job arrays, but offers more features.

Choice of computing nodes

On Mp2, compute nodes differ in their number of cores (24 or 48), their RAM (32, 256 or 512 GB) and the topology of their connection to the InfiniBand network. The choice of nodes is made through the submission queue.

List of queues

Queue | Minimum number of nodes | Available number of nodes | Memory per node | Cores per node | Maximum run time | InfiniBand setting
qwork | 1 (24 cores) | 1588 | 32 GB | 24 | 120 h | 7:2
qfbb | 12 (288 cores) | 216 | 32 GB | 24 | 120 h | 1:1
qfat256 | 1 (48 cores) | 20 | 256 GB | 48 | 120 h | 1:1
qfat512 | 1 (48 cores) | 2 | 512 GB | 48 | 48 h | 1:1

The default queue is "qwork".

You can run the "bqmon" command to get to know how many nodes are free in each queue.

Submission examples

Because Mp2 has a very large number of cores, Torque is configured to see only one core per node even though there are more, so jobs always obtain complete nodes. Consequently, only the value "ppn=1" is valid for "qsub". The following examples show which resources you obtain for given parameters passed to "qsub".

The command

[name@server $] qsub -q qwork -l walltime=1:00:00 -l nodes=1:ppn=1 myscript.sh


gives one node (24 cores) to the job, whereas the command

[name@server $] qsub -q qfbb -l walltime=1:00:00 -l nodes=12:ppn=1 myscript.sh


gives 12 nodes (288 cores).

For compatibility reasons, you can also use equivalent bqTools commands with "ppn=24". This command gives 24 cores:

[name@server $] bqsub -q qwork -l walltime=1:00:00 -l nodes=1:ppn=24 myscript.sh


and this command 288 cores:

[name@server $] bqsub -q qfbb -l walltime=1:00:00 -l nodes=12:ppn=24 myscript.sh


Multiple simulations per node

You should not use "qsub" directly to launch multiple serial jobs on Mp2 in different jobs. One would then only use one core per node which is a waste of resources. You should group multiple serial computations together into one single job. You can do that yourself by writing the submission script in that way. Or you could use the "bqsub" or "bqsub_accumulator" commands that will automatically accumulate the computations:


[name@server $] bqsub -q qwork -l walltime=1:00:00 myserialcomputation.sh



The "bqsubmit" command accumulates serial computations using one computation per core, by default. Two new options were added to "bqsubmit" for Mp2 (do not use those for Ms2):

Option Description Default value
runJobsPerNode Number of calculations run concurrently. This is useful if you run OpenMP jobs or if you must use fewer instances per node. Example: "runJobsPerNode=12". Number of cores for the node
accJobsPerNode Number of calculations to accumulate per node. This is useful if you run a large number of calculations that each take a short time. Example: "accJobsPerNode=1000". Number of cores for the node

Using MPI

We recommend the "_ofed" MPI modules. For example:


[name@server $] module add openmpi_pathscale64/1.4.3_ofed



[name@server $] module add mvapich2_intel64/1.6_ofed


Since Torque sees only one core per node whereas there are really 24 or 48, it is essential to keep this in mind when you submit jobs. For example, if you would like to use 4 MPI processes per node, you can run a job that uses OpenMPI in the following way:


File : openmpi_mp2.sh
#!/bin/bash
#PBS -N testOpenmpi
#PBS -l nodes=2
#PBS -l walltime=0:02:00
#PBS -q qwork@mp2
 
cd $PBS_O_WORKDIR
 
# the number of MPI processes per node *** there are 24 cores per node
export ppn=4
export OMP_NUM_THREADS=$[24/ppn]
 
# the executable file
myExe=./a.out
 
# starting the program
mpiexec -n $[PBS_NUM_NODES*ppn] -npernode $ppn $myExe >> stdout


You can find other usage examples in the directory "/opt/examples/mp2" on Mp2. The sub-directory named "mpi" contains example submission scripts for programs that use MPI with the various libraries available on Mp2. The sub-directory "hybrid" shows examples for hybrid MPI-OpenMP programs with the same libraries. Given that each node has 24 cores but only one InfiniBand card, hybrid calculations are often more efficient on Mp2 than pure MPI calculations. Please contact an analyst for more details.

Mammouth série II

The job submission system on Ms2 consists of the Torque task manager together with the Maui scheduler. Jobs are submitted using the command "qsub":

[name@server $] qsub [options] script.pbs


You can also submit jobs using the bqTools software, which is particularly useful for exploring a range of parameters.

Choice of computing nodes

Most nodes on the Ms2 cluster have 16 GB of RAM, with the exception of 44 nodes that have 32 GB. All nodes have 8 cores. To select a node with 32 GB you should add the "m32G" property to your submission. For example:


[name@server $] qsub -q work -l nodes=1:m32G,walltime=00:05:00 pbs1cores.sh


The fast InfiniBand network's topology is optimized within blocks of only 22 nodes. For a computation that needs a lot of inter-node communication, it is best to ensure that it is confined to a single block. To do this, choose one of the 14 blocks of Ms2, named t1s, t2s, ..., t14s. For example, to use 2 nodes in the "t3s" block, specify "-l nodes=2:t3s:ppn=8".

To see the current usage of the blocks, run:


[name@server $] bqmon|head -6;bqmon -p @ms |grep '^t[0-9]*s'


List of queues

Queue | Minimum number of nodes | Available number of nodes | Maximum run time | Memory
qwork | 1 | 246 | 120 h | 16 GB
qwork (m32G) | 1 | 44 | 120 h | 32 GB
qlong | 1 | 6 | 1000 h | 16 GB

The default queue is "qwork".

You can run the "bqmon" command to get to know how many nodes are free in each queue.

Multiple simulations per node

If a user launches 8 serial jobs on Ms2, Maui will try to group them onto the same node, and similarly for parallel jobs that use fewer than 8 cores per node. If this is not the behaviour you want, you should ask for complete nodes (ppn=8) or specify a sufficiently large memory requirement so that Maui does not place more than one job on each node.

Only jobs from the same user can share nodes.

Instead of submitting independent serial jobs, you can also ask for a complete node and group multiple computations into one job.

Another possibility is to use the "bqsub_mono" command to accumulate calculations on 8 cores before submitting them. "bqsub_mono" uses the same syntax as "qsub". For example:


 
 [name@server $] bqsub_mono -q qwork@ms -l nodes=1:ppn=8,walltime=00:05:00 pbs1core_1.sh
 [name@server $] bqsub_mono -q qwork@ms -l nodes=1:ppn=8,walltime=00:05:00 pbs1core_2.sh
 [name@server $] bqsub_mono -q qwork@ms -l nodes=1:ppn=8,walltime=00:05:00 pbs1core_3.sh
 [name@server $] bqsub_mono -q qwork@ms -l nodes=1:ppn=8,walltime=00:05:00 pbs1core_4.sh
 


and so on. If the number of calculations accumulated with "bqsub_mono" is not a multiple of 8, the remaining calculations will in any case be run after 15 minutes.

Using MPI

To begin with, you should choose the desired MPI version using the "module" command:


 
 [name@server $] module add mvapich2_intel64
 [name@server $] module initadd mvapich2_intel64
 


and then compile your MPI program with the following command:


[name@server $] mpicc calc_pi.c -o calc_pi


Finally, you launch your job using a PBS script or the "bqsub" command. Here is an example submitted to two nodes, with 8 processes per node.


[name@server $] bqsub -q qwork@ms -P "applicationType=mpi" -P "command=./calc_pi" -l nodes=2:ppn=8 -l walltime=00:10:00


or


[name@server $] qsub -q qwork@ms -l nodes=2:ppn=8,walltime=00:10:00 calc_pi.sh

Psi - (archives)

The job submission system on Psi consists of the Torque task manager together with the Maui scheduler. Jobs are submitted using the command "qsub":

[name@server $] qsub [options] script.pbs


Choice of computing nodes

All computing nodes on Psi are identical. They all have 12 cores and 72 GB of RAM (6 GB per core). So you do not need to choose your nodes.

Queues

There is only one queue on Psi. As it is defined as the default queue, you never need to specify it.

Multiple simulations per node

Multiple jobs can run on the same node on Psi, for the same user or for different users. If this is not what you would like, you should ask for complete nodes (ppn=12) or specify sufficiently large memory so that Maui does not put more than one job on each node.

Using MPI

Example submission scripts can be found in the "/export/home/SCRIPTS" directory on Psi.

