Checkpoints and restarts
While hardware and software crashes are relatively rare on a modern PC, they become much more frequent when a computation requires a large number of processors, memory modules, hard disks and a complex network architecture. Under these circumstances, checkpoint-restart becomes a crucial component of any application carrying out a large-scale computation on a supercomputer. In short, if your problem needs a supercomputer, it probably also needs a checkpoint-restart strategy to avoid wasting resources.
Beyond protecting you from the consequences of hardware and software crashes, a checkpoint-restart strategy lets you run a job for a virtually unlimited time. All of the Calcul Québec supercomputers impose limits on job duration, for both technical and policy reasons, but nothing prevents you from resubmitting a job starting from its last checkpoint. A well-constructed job can thus resubmit itself at the end of its allocated time and continue executing indefinitely.
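The self-resubmission logic can be sketched as follows. This is a minimal illustration, not a recipe for any particular Calcul Québec system: the submission command (`sbatch`, as used by Slurm-like schedulers) and the job-script name are assumptions, and your scheduler may use a different command.

```python
# Sketch of self-resubmission logic. The scheduler command "sbatch" is
# an assumption (Slurm-like); substitute your system's actual command.
def build_resubmit_command(job_script):
    """Return the command that would resubmit this job script."""
    return ["sbatch", job_script]

def maybe_resubmit(job_script, finished):
    """Return the resubmission command unless the computation is done.

    A real job script would execute the returned command (for example
    with subprocess.run) just before its wall-time limit, after writing
    a final checkpoint.
    """
    if finished:
        return None
    return build_resubmit_command(job_script)
```

In practice, the job script calls this near the end of its allocated time, immediately after writing a checkpoint, so that the resubmitted job can pick up where the current one left off.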
To decide how to carry out your checkpoints, first estimate how much time is needed to write a checkpoint file. Ideally, the time spent checkpointing should be negligible compared to the compute time. At the same time, you must checkpoint often enough that a system crash does not cost you the entire computation.
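The trade-off can be quantified with simple arithmetic. The numbers below (a 2-minute checkpoint every 2 hours) are illustrative assumptions, not measurements from any particular system:

```python
# Rough checkpoint-overhead estimate; the 2-minute write time and
# 2-hour interval are illustrative assumptions, not measurements.
def checkpoint_overhead(write_seconds, interval_seconds):
    """Fraction of wall time spent writing checkpoints."""
    return write_seconds / interval_seconds

def expected_loss_on_crash(interval_seconds):
    """A crash loses at most one interval of work, half on average."""
    return interval_seconds / 2

overhead = checkpoint_overhead(120, 7200)   # 2 min every 2 h
loss = expected_loss_on_crash(7200)         # at most 2 h, ~1 h on average
print(f"overhead: {overhead:.1%}, mean loss on crash: {loss / 3600:.1f} h")
# → overhead: 1.7%, mean loss on crash: 1.0 h
```

With these assumed numbers, checkpointing costs under 2% of wall time while capping the loss from a crash at two hours of work, which is the kind of balance to aim for.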
Duration of a Checkpoint
On a high-performance system like those offered by Calcul Québec, writing a checkpoint file should never take more than a few minutes. All of Calcul Québec's supercomputers have a filesystem capable of reading and writing data with a bandwidth of several GB/s. Even with a job that uses a lot of memory, for instance a few tens of GB per compute node, you should still expect a checkpoint duration of around a few minutes. If this isn't the case for your application, it's likely not making very good use of the filesystem; an analyst can help you find the source of the problem and optimize the application's performance.
Assuming that writing a checkpoint file takes no more than a few minutes, you should aim for a checkpoint every few hours, or even hourly if writing the checkpoint takes only a few seconds.
Checkpointing is only half the task; you also need to be able to restart from a checkpoint. To do this, write your job submission script so that it detects the presence or absence of a checkpoint file and starts the computation from scratch or from the checkpoint, depending on whether such a file exists.
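The detect-and-restart logic can be sketched as below. The file name `checkpoint.json` and the JSON state format are arbitrary choices for this example; any format your application can read back works:

```python
# Sketch of restart logic: resume from a checkpoint file if one exists,
# otherwise start from scratch. File name and format are arbitrary
# choices for this example.
import json
import os

CHECKPOINT = "checkpoint.json"

def load_or_init_state():
    """Return the saved state if a checkpoint exists, else a fresh state."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"step": 0, "result": 0.0}

def save_state(state):
    """Write the checkpoint atomically: write to a temporary file, then
    rename, so a crash mid-write never leaves a truncated checkpoint."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT)
```

The write-then-rename pattern in `save_state` is worth keeping whatever format you use: if the job crashes while writing, the previous checkpoint remains intact.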
How to Checkpoint?
There are three principal methods for obtaining checkpoint-restart functionality.
The simplest approach is to check whether the application you use already supports checkpoint-restart, as a great many applications used on supercomputers do. If yours does, verify that its checkpoints are written efficiently (see the section above on the duration of a checkpoint).
Several libraries offer transparent checkpoint-restart functionality, most notably BLCR (Berkeley Lab Checkpoint/Restart), which is available on Colosse. Subject to certain limitations, these libraries allow the entire memory of an application to be written to disk and the application to be restarted from that checkpoint file at a later time.
If you wrote the application yourself, the best-performing solution (though also the one requiring the greatest investment of time) is to write the checkpoint and restart routines yourself. The complexity of this task depends on how your data are organized in memory, but you gain complete control over which data are written to disk and which are not.
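As a sketch of that control, the hypothetical class below (all names are invented for illustration) saves only the essential state and recomputes cheap derived quantities on restart, rather than storing everything in memory the way a transparent library would:

```python
# Sketch of hand-written checkpoint-restart using pickle. Only the
# essential state (step counter and main data) is written; derived
# quantities are recomputed on restart instead of stored. All names
# here are hypothetical.
import pickle

class Simulation:
    def __init__(self, n):
        self.step = 0
        self.values = [0.0] * n          # essential state: must be saved
        self.derived = sum(self.values)  # cheap to recompute: not saved

    def checkpoint(self, path):
        """Write only the data needed to resume the computation."""
        with open(path, "wb") as f:
            pickle.dump({"step": self.step, "values": self.values}, f)

    @classmethod
    def restart(cls, path):
        """Rebuild a full Simulation from the minimal saved state."""
        with open(path, "rb") as f:
            state = pickle.load(f)
        sim = cls(len(state["values"]))
        sim.step = state["step"]
        sim.values = state["values"]
        sim.derived = sum(sim.values)    # recompute derived data
        return sim
```

Deciding what to exclude is where the performance gain lies: large arrays that can be regenerated from the saved state need not consume checkpoint bandwidth at all.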