How to use jobs
Most of us are used to computers with graphical user interfaces (GUIs): there are windows, menus and buttons; you click here and there, and the system responds immediately. On Calcul Québec's computers it is different. To start with, the environment uses a command line interface. Furthermore, the jobs you would like to run are not run immediately, but are put into a waiting list, or queue. A job runs only once the necessary resources are available; otherwise jobs would step on each other's toes. Hence you should write a file (a submission script) that describes the job to run and the resources it needs, put the job into the queue, and come back later once the job has finished. That means there is no interaction between the user and the program during the job's execution.
The job submission system acts like the conductor of an orchestra. It has multiple responsibilities: it must maintain a database of all submitted jobs until they have finished, enforce certain conditions (limits, priorities), make sure that each available resource is assigned to only one job at a time, decide which jobs to run and on which compute nodes, launch them on those nodes, and clean up after them once they have finished.
On Calcul Québec's computers, those responsibilities are given to the software Torque, combined with a scheduler that decides which jobs to run. We use the schedulers Maui and Moab, depending on the server.
On a personal computer, there is usually only one user at a time, whereas hundreds of users can be logged in at the same time on a cluster (the type of computer offered at Calcul Québec). A cluster typically consists of hundreds of nodes that contain between 8 and 24 cores each. Every user must explicitly ask for the resources that he or she needs. A request is defined mainly by two parameters: the time needed to complete the task (wall time) and the number of processors (number of nodes and cores) used by the job. In certain cases, one must also specify the amount of memory needed, so that the scheduler can choose the nodes that are most appropriate for the job.
It is important to specify those parameters well. If they are too high, the job may wait longer than necessary in the queue and may keep resources away from other users who need them. If they are too low, the job may be killed before it finishes or may run out of memory. These parameters allow the scheduler to choose which of the queued jobs to run.
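To get a feeling for what a request amounts to, the following minimal sketch computes the number of cores and the core-hours that a request reserves. The helper names total_cores and core_hours are hypothetical (they are not part of Torque), and the values used (2 nodes, 8 cores per node, 30 hours) are only an illustration.

```bash
#!/bin/bash
# Hypothetical helpers (not part of Torque) showing how much a request reserves.

# total_cores NODES PPN: number of cores reserved by the job
total_cores() {
    echo $(( $1 * $2 ))
}

# core_hours NODES PPN HOURS: reserved cores multiplied by the wall time in hours
core_hours() {
    echo $(( $1 * $2 * $3 ))
}

echo "cores reserved:      $(total_cores 2 8)"
echo "core-hours reserved: $(core_hours 2 8 30)"
```

A request that is twice as long or twice as wide reserves twice as many core-hours, which is why overestimating either parameter can delay both your job and those of other users.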
A sample submission file
In this section we show a minimal example submission file. The exact contents vary from server to server. Please consult the section "Additional server-specific documentation" below for more details.
In this file, lines starting with #PBS are options that are given to the task manager. Lines starting with # (but not #PBS or #!) are comments, and are ignored. All other lines form the script that will run on a compute node. A detailed explanation follows:
- #!/bin/bash : this line, which must be the first line of the file, indicates that the script is interpreted by /bin/bash. Other interpreters are possible (/bin/tcsh, /bin/sh or even /bin/env perl).
- #PBS -A abc-123-aa : this defines the project identifier to use (Rap ID). It is compulsory for all jobs on Colosse and Helios, and optional on Briarée, Cottos, Guillimin and Hadès, where a default project is used if it is omitted. The identifier determines which resources can be allocated to the job and charges the resources used to the correct project. For the moment, it does not apply to Mp2, Ms2 and Psi. On Colosse and Helios you can find out your Rap ID using the command serveur-info.
- #PBS -l walltime=30:00:00 : this is the wall time reserved for the job, in hours, minutes and seconds. After this time, Torque kills the job, whether it has finished or not.
- #PBS -l nodes=2:ppn=8 : this is the number of nodes and the number of cores ("processors") per node ("ppn") that you need.
- #PBS -q queue : this is the queue you want to use. All servers define a default queue, so this line may not be necessary.
- #PBS -r n : indicates that the job cannot be restarted. Certain servers restart jobs in case of error, in which case Torque restarts the job from the beginning. If you allow this, you must ensure that the job can be restarted without undesirable side effects.
- module load compilers/intel/12.0.4 : we load the Intel compiler. The module name varies from machine to machine.
- module load mpi/openmpi/1.4.5_intel : we load the MPI library that we would like to use (choose the one that was used when the program was compiled).
- cd $SCRATCH/workdir : when the script starts, the initial directory is the user's home directory ($HOME). Usually one wants to work in another directory, typically on a file system that is appropriate for large files (such as the one that $SCRATCH points to).
- mpiexec /path/to/my/mpi_program : here you call the program. For an MPI program, the mpiexec command launches the processes on all processors available to the job. Here we have not specified the number of processes to start; note that the way to start an MPI program may vary from server to server.
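Assembled from the lines explained above, a submission file would look roughly like the following sketch. The project identifier, queue name, module names, working directory and program path are the placeholders used in the explanation; they are not valid on any particular server and must be adapted.

```bash
#!/bin/bash
#PBS -A abc-123-aa
#PBS -l walltime=30:00:00
#PBS -l nodes=2:ppn=8
#PBS -q queue
#PBS -r n

# Load the compiler and the MPI library that the program was built with
module load compilers/intel/12.0.4
module load mpi/openmpi/1.4.5_intel

# Work in a directory on a file system suited to large files
cd $SCRATCH/workdir

# Launch the MPI program on the processors allocated to the job
mpiexec /path/to/my/mpi_program
```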
Interaction with the job submission system
After having written the script, you should submit it to Torque. The command to do this is qsub. On Colosse, you should instead use the Moab command named msub. Once your job has been successfully put into the queue, the command prints the job ID. In addition to the options included in the script with the #PBS syntax, one can give the same options to qsub and msub on the command line. Command line options take precedence over those specified in the script.
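As a minimal sketch of a submission, the following wraps qsub in a small function that passes extra options on the command line and prints the job ID that qsub reports. The function name submit_job and the file name pbs_job.sh are hypothetical; the -l walltime option is the same one used in the #PBS lines.

```bash
#!/bin/bash
# Sketch only: "pbs_job.sh" is a hypothetical name for your submission script.

# submit_job SCRIPT [QSUB_OPTIONS...]: submit SCRIPT and print the job ID
# that qsub reports. Extra options are given on the command line and
# therefore take precedence over the #PBS lines inside SCRIPT.
submit_job() {
    local script=$1
    shift
    qsub "$@" "$script"
}

if command -v qsub >/dev/null 2>&1; then
    # Same script, but with a shorter wall time than the one it requests
    job_id=$(submit_job pbs_job.sh -l walltime=10:00:00)
    echo "Submitted job: $job_id"
else
    echo "qsub is not available here; run this on a login node of the cluster."
fi
```

Keeping the job ID in a variable is convenient because the commands that list or remove jobs, and the names of the output files, all refer to it.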
Showing the list of jobs
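With Torque, jobs are usually listed with the qstat command. The sketch below shows the jobs of the current user; the wrapper name list_my_jobs is hypothetical, and the exact output columns depend on the server.

```bash
#!/bin/bash
# list_my_jobs: show the jobs of the current user as seen by Torque.
# (list_my_jobs is a hypothetical wrapper; qstat -u is the Torque command.)
list_my_jobs() {
    qstat -u "$USER"
}

if command -v qstat >/dev/null 2>&1; then
    list_my_jobs
else
    echo "qstat is not available here; run this on a login node of the cluster."
fi
```

On servers scheduled by Maui or Moab, the scheduler's showq command gives its own view of the queue.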
Removing or killing a job
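With Torque, a queued job is removed, or a running job is killed, with the qdel command followed by the job ID. In the sketch below, the wrapper name cancel_job is hypothetical and 99485 is only an example ID.

```bash
#!/bin/bash
# cancel_job JOB_ID: remove a queued job or kill a running one with Torque's qdel.
# (cancel_job is a hypothetical wrapper; 99485 is only an example job ID.)
cancel_job() {
    if [ -z "$1" ]; then
        echo "usage: cancel_job JOB_ID" >&2
        return 1
    fi
    qdel "$1"
}

if command -v qdel >/dev/null 2>&1; then
    cancel_job 99485
else
    echo "qdel is not available here; run this on a login node of the cluster."
fi
```

On servers where jobs are submitted with msub, Moab also provides mjobctl -c for the same purpose.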
Obtaining standard output and standard error
Messages that would normally be displayed on the terminal when you run something interactively are instead written to files by Torque, and only at the end of the job's execution. Look for files whose extensions are .o followed by the job ID (for standard output) and .e followed by the job ID (for standard error), for example pbs_job.o99485 and pbs_job.e99485. There you will find the results of your simulations, and error messages if something did not go as intended.
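The file names can be derived from the job name and the job ID, as in the following sketch. The function name output_files is hypothetical, and "server" stands for whatever host name follows the number in the ID printed by qsub.

```bash
#!/bin/bash
# output_files JOB_NAME JOB_ID: print the names of the standard output and
# standard error files that Torque creates for a job.
# (output_files is a hypothetical helper, not a Torque command.)
output_files() {
    local job_name=$1
    local number=${2%%.*}   # keep only the numeric part of an ID such as 99485.server
    echo "${job_name}.o${number}"
    echo "${job_name}.e${number}"
}

output_files pbs_job 99485.server
# prints:
#   pbs_job.o99485
#   pbs_job.e99485
```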
If the job writes a large amount of data to standard output, it is better to redirect it to a file:
mpiexec /path/to/my/mpi_program > output_file
Job priority is calculated following the fair share algorithm. Jobs are not necessarily run in the order in which they were submitted; the scheduler orders them according to the share of the server attributed to each group (based on an annual allocation) and to that group's use of the server during the previous month. The scheduler also accounts for the available resources and for the time a job has waited in the queue, so that all jobs that respect the limits are eventually run.
Because the number of hours allocated to a group determines execution priority rather than setting a hard limit, it is always possible to submit jobs that exceed the allocation. We reserve the right to decrease the priority of a computation if a group consumes more than double its fair share during one month. We normally only apply this measure in the event of exceptionally heavy use of the systems.
In case of problems
If your job does not run, please send us your script, the server name, and all information needed to reproduce your problem. We will also need your job ID if you obtained one.
Additional server-specific documentation