FAQ Moab Colosse

From the Calcul Québec Wiki

This page lists error messages and problems that you may encounter when working with Moab. The list is specific to Colosse.


My Jobs Become "Deferred"

Moab or its resource manager, Torque, may have encountered an error when starting the job. This can happen, for example, if there is a temporary communication problem with the node where the job is being started. When this occurs, Moab labels the job as "deferred". A deferred job remains in this state for twenty minutes, after which Moab tries to start it again. This doesn't affect the priority of your job, and it will be started as soon as possible. You can ignore this phenomenon unless the job is repeatedly "deferred". You can check how many times a job has been deferred, and how much time remains until it is put back in the queue, with the command:

[name@server $] checkjob -v <jobid>
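
If a job is deferred repeatedly, it can help to pull only the deferral-related lines out of the verbose report. The following is a minimal sketch, not an official tool: show_deferrals is a hypothetical helper name, and it assumes that the relevant lines of the checkjob -v output contain the text "defer" in some capitalization, which this page does not specify.

```shell
# Hypothetical helper: print only the lines of `checkjob -v` that mention deferral.
# Assumption: those lines contain the string "defer" (in any capitalization).
show_deferrals() {
    jobid="$1"
    if [ -z "$jobid" ]; then
        echo "usage: show_deferrals <jobid>" >&2
        return 2
    fi
    # grep exits non-zero when nothing matches; report that case explicitly.
    checkjob -v "$jobid" 2>&1 | grep -i "defer" \
        || echo "no deferral information found for job $jobid"
}

# Example (not run here):
# show_deferrals <jobid>
```

If the assumption about the output does not hold on your system, read the full checkjob -v report instead.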

There are nodes available but my job isn't starting

The commands mshow and showq may show free processors even though your job has not started. There are several possible explanations: certain nodes may be reserved for maintenance or for other users, or a multi-node job may have just finished and its nodes are still being prepared for the next job.

My job doesn't appear in showq

The output of showq is cached by Moab, so it may not be synchronized with the jobs that are actually in the queue. To force synchronization with Moab, add the option --blocking:

[name@server $] showq --blocking <other options>

You can also use mshow, which is always in sync with Moab but does not support the same options as showq.

My job is blocked

There are several possible explanations: the block may be temporary (for example, the job is "deferred") or permanent. To find out why a job is blocked, use the command:

[name@server $] checkjob -v -v <jobid>

Module Error

I get the following error message in my output file:

   /bin/bash: module: line 1: syntax error: unexpected end of file
   /bin/bash: error importing function definition for module

This is a known error and doesn't affect your computations - you can ignore it.

ERROR: connection refused - no service listening at moab.colosse.clumeq.ca:42559

This error happens when we restart the Moab server. It's a temporary problem that usually lasts less than a minute. Try again in a few seconds.
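
Because the outage is normally short, a script that calls Moab commands can simply retry. The following is a minimal sketch that assumes nothing about Moab beyond what is stated above: retry is a hypothetical helper, and the attempt count and delay in the example are illustrative values, not recommendations from this page.

```shell
# Hypothetical helper: run a command, retrying a few times if it fails.
# usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
    max_attempts="$1"
    delay="$2"
    shift 2
    attempt=1
    while true; do
        # Stop as soon as the command succeeds.
        "$@" && return 0
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example (not run here): try showq up to 6 times, 10 seconds apart.
# retry 6 10 showq
```

The helper retries on any failure, not only on "connection refused", so keep the number of attempts small.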

I can't delete my job

When you try to delete your job with the command

[name@server $] mjobctl -c <jobid>

you get the message

   Message[0] job cannot be cancelled, reason== - job XXXXXXX - unknown error from resource manager torque

This arises from an error, usually temporary, with the resource manager Torque. Trying again in a few minutes normally solves the problem. If that doesn't work, you can try the command

[name@server $] mjobctl -F <jobid>
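
The procedure above (try a normal cancel, wait, then fall back to mjobctl -F) could be scripted roughly as follows. This is a sketch under stated assumptions: cancel_job is a hypothetical helper, the default number of attempts and the waiting time are illustrative, and the function assumes that mjobctl returns a non-zero exit status when the cancellation fails, which this page does not confirm.

```shell
# Hypothetical helper: try a normal cancel a few times, then fall back to -F.
# Uses only the two commands shown above: `mjobctl -c` and `mjobctl -F`.
# Assumption: mjobctl exits with a non-zero status when the cancel fails.
# usage: cancel_job <jobid> [tries] [wait_seconds]
cancel_job() {
    jobid="$1"
    tries="${2:-3}"      # number of normal attempts (illustrative default)
    wait_s="${3:-120}"   # seconds between attempts (illustrative default)
    i=1
    while [ "$i" -le "$tries" ]; do
        if mjobctl -c "$jobid"; then
            return 0
        fi
        i=$((i + 1))
        # Give Torque time to recover before the next attempt.
        sleep "$wait_s"
    done
    echo "normal cancel failed $tries times, trying mjobctl -F: $jobid" >&2
    mjobctl -F "$jobid"
}

# Example (not run here):
# cancel_job <jobid>
```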

How can I know the priority of my jobs?

You can see the priority of your jobs with the command

[name@server $] mdiag -p

How can I find the job ID of my jobs?

You can list your jobs, with their job IDs and the names you gave them, with the command:

[name@server $] checkjob ALL
