BqTools

From the Calcul Québec wiki

Description

BQTools (Batch Queueing Tools) was developed to simplify the submission of a large number of jobs on a cluster. This tool does not submit jobs directly on the nodes, but rather calls Torque to do so.

This tool can generate a very large number of job scripts, manages the working directories automatically, and allows pre- and postprocessing for each job or for the complete set of jobs. The following figure illustrates the work accomplished by BQTools. Refer to the following sections to learn how to use it.

Each group of tasks is called a batch. It is assigned its own id (batchId). A user can get information on a batch or delete a full batch using this id. Since BQTools is built on Torque, each task which is part of a batch gets its own id (jobId). A user may then request information on a specific task using this id. For more information, see our page on Torque.

[Figure: Bqtools.png, overview of the work accomplished by BQTools]

Commands

bqsub

Usage: bqsub [-P key=value] [qsub options]

This command submits a script or a command. bqsub is built on top of the bqsubmit command, but does not require a configuration file. It accepts all the options of the Torque qsub command. Use this command to submit a single task.
Options:
-P key=value: Sets a key-value pair for BQTools options (keys may be command or applicationType).
qsub options: Specifies qsub options, such as -l walltime=hh:mm:ss,nodes=nodes:ppn=ppn -q queue. See the Torque page for more details.

bqdel

Usage: bqdel jobId

Cancels the job jobId. Use bqstat to find the jobId of a job.

bqstat

Usage: bqstat [qstat options]

Returns the state of jobs on the system.
Options:
-q target queue: Displays the state of a single queue.
-u username: Displays the state of jobs for a specific user.
-f jobId: Displays more information on the state of job jobId.

bqsubmit

Usage: bqsubmit [-x] [-P key=value] [config file]

Used to submit a batch of jobs in a single command.
-x: Creates the directories for the batch submission, but does not submit it.
-P key=value: Changes a key-value pair already in the configuration file, or adds a new one.
config file: Name of the configuration file. By default, bqsubmit looks for the file bqsubmit.dat.

bqdelete

Usage: bqdelete [batchId]

Cancels all jobs contained in a batch. To find the batchId of your batch, use the bqstatus command or look at the name of the .status file created in the directory (batchName_batchId.status).
If no batchId is given, the command displays the list of batches submitted.

bqstatus

Usage: bqstatus [batchId]

Displays the status of batches that were submitted with bqsubmit. If a specific batch is given, displays the status of the tasks of this batch.
Note that the result may differ from that of qstat, since bqstatus does not query Torque but relies instead on files created by BQTools.

bqmon

Usage: bqmon [-u user]

Displays system utilization. If a user is given, displays the list of nodes used by this user's jobs.

bqconcurrentjobs

Usage: bqconcurrentjobs [batchId value]

Displays the number of tasks that can be run concurrently for each batch. If a specific batch and a value are specified, modifies the number of tasks that may run concurrently and sets it to the given value.

Configuration file

The configuration file is what allows BQTools to build your job batches. The first one you will write will require a bit of work, but it will make it easier to submit a large number of jobs. If you need help doing so, you can contact us.

Many keywords are used to build your configuration file. We describe those in the following sections. You will also find configuration file examples below.

Required keywords

command
Name of the command to run on each compute node.

Optional keywords

batchName
Name of the batch. This name may not be used for more than one batch in any given working directory. See the section Resubmitting tasks.
applicationType
Type of application used. The only option is mpi. When you use this option, you do not have to include mpirun -n np in your command.
submitOptions
Resources used by each task, given as qsub options.
copyFiles
Files that should be copied for each job (must not be large files).
linkFiles
List of files that must be accessible for each task.
templateFiles
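List of files that should be modified for each task. In these files, each value that changes from task to task must be replaced by the name of the corresponding parameter from the configuration file, enclosed in ~~ symbols (for example ~~temperature~~). See the section Syntax for parameters for more information.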
preBatch
Command to be run on the head node, in the batch working directory. This command is run only once, before any task is executed.
preJob
Command to be run on the head node, in the task directory. This command is run once, independently, before each task.
postJob
Command to be run on the head node, in the task directory. This command is run once, independently, after each task.
postBatch
Command to be run on the head node, in the batch working directory. This command is run only once, after every task is completed.
paramSymLinks
1 or 0, indicating whether symbolic links pointing to each task directory should be created. Those links are more explicit, as they contain the parameters used and their values. Default value: 1.
concurrentJobs
Number of tasks (or group of tasks) that may be run at the same time. If a task requires that the previous task be completed before running, you can set this value to 1.

Keywords specific to Mp2

runJobsPerNode = number
Number of tasks that may be run concurrently on a given compute node. By default, this is equal to the number of cores. If each job is parallelized with OpenMP and runJobsPerNode is greater than 1, remember to set the value of OMP_NUM_THREADS in your command.
accJobsPerNode = number
Number of tasks submitted on each node. This number should be a multiple of runJobsPerNode. For example, if accJobsPerNode is twice the value of runJobsPerNode, runJobsPerNode tasks will be run concurrently and, when they are done, another group of runJobsPerNode tasks will be started. Note that the value of walltime given in the submitOptions parameter must be large enough to cover the runtime of both groups of tasks. This option is useful when you have a very large number of short tasks. By default, accJobsPerNode is the number of cores on the node.
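
The walltime rule above can be checked with a little arithmetic. The following is only a sketch: min_walltime_hours is a hypothetical helper, not a BQTools command, and it assumes that groups of runJobsPerNode tasks run strictly one after the other and that runtimes are given in whole hours.

```shell
# Hypothetical helper (not part of BQTools): lower bound, in hours, for the
# walltime of one node job.
# Arguments: accJobsPerNode runJobsPerNode longest_task_runtime_in_hours
min_walltime_hours() {
    acc=$1; run=$2; task_hours=$3
    # Number of groups that run one after the other (rounded up)
    groups=$(( (acc + run - 1) / run ))
    echo $(( groups * task_hours ))
}

# 4 tasks per node, 2 at a time, at most 2 hours each
min_walltime_hours 4 2 2    # prints 4
```

In practice you would probably request a little more than this lower bound, since the runtime of a task is rarely known exactly.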

Keywords specific to Ms2

microJobs = number
Groups jobs together, with the given number of jobs in each group. The jobs of a group run one after the other on the same cores.
Resources given with submitOptions must be sufficient for the whole group. This is especially important for walltime.

Syntax for parameters

There are many ways to specify the parameters and their values. For example, two parameters may be part of a single loop, one parameter can be constant, etc. In the following sections, we explain those syntaxes.

Constant parameter

Ex: temperature = 10

Parameter with a list of values

Ex: param1 = temperature = [10, 11, 14, 15]

Single loop with a variable

Usage: param1 = token = startValue : increment : stopValue
Ex: param1 = temperature = 10 : 1 : 15
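
To see which values such a loop produces, here is a minimal sketch. expand_range is a hypothetical helper, not a BQTools command. It assumes integer values, a positive increment, and that the stop value is included, which is consistent with the six temperature values this page counts for 10 : 1 : 15.

```shell
# Hypothetical helper (not part of BQTools): print the values described by
# startValue : increment : stopValue, one per line.
# Assumes integers, a positive increment and an inclusive stop value.
expand_range() {
    value=$1; increment=$2; stop=$3
    while [ "$value" -le "$stop" ]; do
        echo "$value"
        value=$(( value + increment ))
    done
}

expand_range 10 1 15    # prints 10, 11, 12, 13, 14, 15 on separate lines
```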

Simple loop with many variables

Ex: param1 = (temperature, pressure) = [(10:1:15, 100:1:105)]
It is also possible to use lists with many variables.
Ex: param1 = (temperature, pressure) = [(10,100), (11,101), (12,102), (13,103), (14,104), (15,105)]
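
The two forms above appear to describe the same six tasks: the variables advance together, one pair per task, rather than being combined in every possible way. The sketch below illustrates that reading. paired_ranges is a hypothetical helper, not a BQTools command, and it assumes integer values and ranges of equal length.

```shell
# Hypothetical helper (not part of BQTools): advance two ranges together and
# print one "first second" pair per task.
# Arguments: start1 increment1 stop1 start2 increment2
paired_ranges() {
    a=$1; a_inc=$2; a_stop=$3; b=$4; b_inc=$5
    while [ "$a" -le "$a_stop" ]; do
        echo "$a $b"
        a=$(( a + a_inc ))
        b=$(( b + b_inc ))
    done
}

# temperature 10..15 paired with pressure 100..105
paired_ranges 10 1 15 100 1
```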

Multiple loops

Ex: param1 = temperature = 10 : 1 : 15

param2 = pressure = [101, 102]

In this case, there are six different values of temperature and two values of pressure, so 6 × 2 = 12 cases will be run.
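
The count can be verified by listing every combination, as in the sketch below. It only illustrates the arithmetic: the order in which BQTools actually walks through the combinations is not specified on this page, so the nesting order used here is an assumption.

```shell
# Illustration only: enumerate every (temperature, pressure) combination
# for param1 = temperature = 10 : 1 : 15 and param2 = pressure = [101, 102].
count=0
for temperature in 10 11 12 13 14 15; do
    for pressure in 101 102; do
        echo "temperature=$temperature pressure=$pressure"
        count=$(( count + 1 ))
    done
done
echo "cases: $count"    # prints "cases: 12"
```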

Loop from a file

Ex: param1 = (temperature, pressure) = load param.txt
Note that the number of columns in the file param.txt must be the same as the number of variables in the parameter. Columns must be separated by spaces or commas. The file may also contain empty lines and comments (lines beginning with a #); those lines are ignored when the file is read. As an example, the param.txt file could contain:

# temperature pressure
    10.         101.
    10.         102.
    11.         101.
    11.         102.
    12.         101.
    12.         102.
    13.         101.
    13.         102.
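
Before submitting, it can be worth checking that every data line of such a file has the expected number of columns. The following is a minimal sketch: check_columns is a hypothetical helper, not a BQTools command, and it simply applies the rules stated above (columns separated by spaces or commas, empty lines and # comments ignored).

```shell
# Hypothetical helper (not part of BQTools): succeed only if every data line
# of a parameter file has the expected number of columns.
# Arguments: file expected_columns
check_columns() {
    file=$1; expected=$2
    awk -v n="$expected" '
        /^[ \t]*#/ { next }    # skip comment lines
        /^[ \t]*$/ { next }    # skip empty lines
        {
            sub(/^[ \t]+/, ""); sub(/[ \t]+$/, "")    # trim the line
            if (split($0, cols, /[ ,\t]+/) != n) {
                printf "line %d: unexpected number of columns\n", NR
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    ' "$file"
}

# Small demonstration on a temporary file
demo_dir=$(mktemp -d)
printf '# temperature pressure\n    10.         101.\n\n    10.,102.\n' > "$demo_dir/param.txt"
check_columns "$demo_dir/param.txt" 2 && echo "param.txt looks consistent"
```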


Example for Ms2

Consider a parallel OpenMP application that runs on four threads for each job. Assume that 100 tasks must be run and that each task requires less than one hour. We will combine jobs in groups of five tasks. The configuration file could be as follows.

File : bqsubmit.dat
# Name of the batch
batchName = multithreadCase
 
# Binary file to run
copyFiles = my_exec
 
# Input file with the values of variables
templateFiles = input.txt
 
# Task to be run before jobs are submitted to the scheduler
preBatch = rm -f totalOutput.txt
 
# Command to run on the compute node
command = export OMP_NUM_THREADS=4; ./my_exec input.txt > output.txt
 
# Gathering of results in a single file
postBatch = cat *.BQ/*.BQ/output.txt >> totalOutput.txt
 
# We do not want symbolic links
paramSymLinks = 0
 
# Combining tasks in groups of five
microJobs = 5
 
# Required resources for each group of five
submitOptions = -q qwork -l walltime=5:00:00,nodes=1:ppn=4
 
# List of parameters for each task
param1 = (temperature, pressure) = load values.txt
 
# Number of groups of five tasks that can run concurrently
concurrentJobs = 20


The working directory should contain four files before submitting the batch:

bqsubmit.dat
The configuration file as described above
my_exec
The binary file to be run.
values.txt
Text file containing two columns: the first one is temperature and the second is pressure, for each task. The file should therefore contain 100 lines without counting comments and empty lines.
input.txt
File read by the binary. The values must be replaced by the parameter names (temperature and pressure), enclosed in "~~" symbols. BQTools will replace those names by the appropriate values for each task. The file should look similar to:
~~temperature~~  ~~pressure~~
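
To preview what a filled-in template would look like for one task, the substitution can be imitated with sed. This is only an illustration of the idea, not the way BQTools performs it: fill_template is a hypothetical helper, and it assumes the values contain no / or & characters, which sed would treat specially.

```shell
# Hypothetical helper (not part of BQTools): print a template with the
# ~~temperature~~ and ~~pressure~~ markers replaced by the given values.
# Arguments: template_file temperature pressure
fill_template() {
    template=$1; temperature=$2; pressure=$3
    sed -e "s/~~temperature~~/$temperature/g" \
        -e "s/~~pressure~~/$pressure/g" "$template"
}

# Small demonstration on a temporary copy of input.txt
demo_dir=$(mktemp -d)
echo '~~temperature~~  ~~pressure~~' > "$demo_dir/input.txt"
fill_template "$demo_dir/input.txt" 10. 101.    # prints "10.  101."
```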


Example for Mp2

We now consider a hybrid case. Since Mp2 has many cores for each InfiniBand card, hybrid codes usually perform better.

We assume here that we want to submit 400 tasks, and that each hybrid task uses four MPI processes and three threads per process. Each task will therefore need 12 cores. We can therefore run two tasks per node (runJobsPerNode = 2). We assume that each task requires at most two hours to run, and that we want to run two tasks one after the other on the same cores. Finally, we consider that the input file will change for each task, and assume that the binary to be run (my_bin) is available through your $PATH environment variable. The working directory contains the following files:

bqsubmit.dat
Configuration file for BQTools. It is given below.
myCases
A directory containing the input files for each case to be run. Since we have 400 tasks to run, this directory should contain 400 different files.
runTask.sh
A script to start tasks. This file should look like this:
#!/bin/bash
module load openmpi_intel64/1.4.3_ofed
export OMP_NUM_THREADS=3
mpirun -n 4 my_bin ~~input~~

The configuration file (bqsubmit.dat) should look like this:

File : bqsubmit.dat
# Name of the batch
batchName = hybridCase
 
# Script to run a task
templateFiles = runTask.sh
 
# Link to access input files
linkFiles = myCases
 
# Command to run on a compute node
command = /bin/bash runTask.sh
 
# Number of tasks on a node at any given time
runJobsPerNode = 2
 
# Number of tasks that are submitted to a given node
accJobsPerNode = 4
 
# Required resources for each group of four tasks
submitOptions = -q qwork@mp2 -l walltime=4:00:00,nodes=1
 
# List of parameters for each task
param1 = input = load listInput.txt
 
# Number of groups of 4 tasks that can run concurrently.
concurrentJobs = 100


Before submitting those tasks, you need to create the file listInput.txt which contains the list of tasks to be run. This file can be created easily with the command:

[name@server $] ls -1 myCases/* > listInput.txt
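
Once the list exists, a quick count shows how many node jobs the batch will need, which can be compared with concurrentJobs. The sketch below is an illustration under assumptions: node_jobs is a hypothetical helper, not a BQTools command, it assumes one task per non-empty line and accJobsPerNode tasks packed into each node job, and the file names in the demonstration are made up.

```shell
# Hypothetical helper (not part of BQTools): number of node jobs needed for
# a list of tasks, rounded up.
# Arguments: list_file accJobsPerNode
node_jobs() {
    list=$1; acc=$2
    tasks=$(grep -c . "$list")    # count non-empty lines, one task per line
    echo $(( (tasks + acc - 1) / acc ))
}

# Demonstration with a temporary list of 400 made-up file names
demo_dir=$(mktemp -d)
i=1
while [ "$i" -le 400 ]; do
    echo "myCases/case_$i.in"
    i=$(( i + 1 ))
done > "$demo_dir/listInput.txt"
node_jobs "$demo_dir/listInput.txt" 4    # prints 100
```

With 400 tasks and accJobsPerNode = 4, the result matches the 100 groups allowed by concurrentJobs = 100 in the configuration file above.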


Resubmitting tasks

To protect your data, BQTools does not, by default, allow resubmission of a batch in the same directory. However, this might be desirable for several reasons. The appropriate way of resubmitting a batch depends on the reason for doing so. Three situations may arise:

  1. Some tasks did not complete successfully, but some did. In this case, you need to make sure there is no .bqdone file in the directories of the cases to rerun. The .bqdone file is created by BQTools in the higher level directories, and indicates which tasks completed successfully. However, what BQTools considers a success might not be what you consider a success. You might therefore need to delete the .bqdone file in the folders of the cases you want to rerun.
  2. None of the results are good, and all tasks must be rerun. You then need to erase all directories created by BQTools (rm -r *.BQ), and resubmit the batch once the required corrections have been made.
  3. All results are good, but you want to resubmit different cases in the same working directory. In this case, change the value of batchName and submit the new batch.
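
For the first situation, it helps to see which .bqdone markers currently exist before deleting any of them. The following is a minimal sketch that makes no assumption about where exactly BQTools places the markers: list_done_markers is a hypothetical helper, not a BQTools command, and the directory names in the demonstration are made up.

```shell
# Hypothetical helper (not part of BQTools): list every .bqdone marker found
# under a batch working directory.
# Argument: directory to search
list_done_markers() {
    find "$1" -type f -name .bqdone | sort
}

# Demonstration in a temporary directory with made-up directory names
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/case_1.BQ" "$demo_dir/case_2.BQ"
touch "$demo_dir/case_1.BQ/.bqdone"
list_done_markers "$demo_dir"    # lists only the marker in case_1.BQ
```

Review the list, then remove by hand the markers of the cases you want to rerun.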