Using available storage

Description

On a Linux cluster, various types of storage are commonly available. The list of choices varies from one system to the next, so it is important to check what the system you are using offers; server-specific documentation is given at the bottom of this page. The appropriate choice for your needs depends on several parameters:

  • What is the size of the files in question?
  • How many files are needed?
  • Are your files temporary or do they need to be preserved?
  • What is the files' format?
  • Are the files accessed sequentially?

Once you have answered these questions, you can choose what kind of storage is most appropriate for your needs.

Best practices

  • Only use text format for files that are smaller than a few MB.
  • As far as possible, use local storage for temporary files.
  • If your program must search within a file, it is fastest to read the file completely into memory before searching, or to use a RAM disk ($RAMDISK or /dev/shm); see the sketch after this list.
  • Regularly clean up parallel file systems, because those systems are used for huge data collections.
  • If you no longer use certain files, group them into an archive, compress it, and back it up (if possible).
  • If your needs are not well served by the available storage options, please contact Calcul Québec's user support team.
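
For illustration, here is a minimal Python sketch of the search-related practice above: the file (whose name is hypothetical) is read completely into memory once, and all subsequent searches happen in RAM rather than on disk.

    # Minimal sketch: read a file once, then search it repeatedly in memory.
    import os

    # Temporary copies can go on the RAM disk when one is available.
    tmp_dir = os.environ.get("RAMDISK", "/dev/shm")

    with open("input.dat", "rb") as f:   # hypothetical input file
        data = f.read()                  # one sequential read

    # Search in memory as many times as needed; no further disk access occurs.
    offsets = []
    position = data.find(b"RECORD")      # hypothetical record marker
    while position != -1:
        offsets.append(position)
        position = data.find(b"RECORD", position + 1)
    print(f"Found {len(offsets)} records; temporary files can go in {tmp_dir}")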

Storage types

Storage options are distinguished by the underlying hardware, the access mode and the file system used. Typically, most systems offer the following storage types:

Network file system (NFS)
This type of storage is generally equally visible on both login and compute nodes. This is the appropriate place to put files that are regularly used: source code, programs and configuration files. This type of storage offers performance comparable to a conventional hard disk.
Parallel file system (Lustre or GPFS)
This type of storage is generally equally visible on both login and compute nodes. Combining multiple disk arrays and fast servers, it offers excellent performance for large files and large input/output operations. Often two types of storage are distinguished on such systems: long term storage and temporary storage (scratch). Performance is subject to variations caused by other users.
Local file system
This type of storage consists of a local hard drive in every compute node. Its advantage is that its write performance is stable, because only one user can use it at a time. However, do not forget to copy your files back before closing your session, because everything is cleaned after each job.
RAM (memory) file system
This is a file system that exists within a compute node's RAM, so using it reduces the memory available for computations. Such file systems are very fast for small files and are especially faster than the other storage types when file access is random. A RAM disk is always cleaned at the end of a session.

The following table summarizes the properties of these storage types.

Description of storage types

Storage type                    | Typical directory name                        | Accessibility     | Throughput (large operations, > 1 MB) | Latency (small operations) | Life span
Network file system (NFS)       | $HOME                                         | All nodes         | 100 MB/s, shared                      | High                       | Long term
Long term parallel file system  | $HOME, $RAP, /home, /sb/project/, /gs/project | All nodes         | 1-10 GB/s, shared                     | High                       | Long term
Short term parallel file system | $SCRATCH                                      | All nodes         | 1-10 GB/s, shared                     | High                       | Short term (periodically cleaned)
Local file system               | $LSCRATCH                                     | Local to the node | 100 MB/s                              | Medium                     | Very short term
Memory (RAM) file system        | $RAMDISK, /dev/shm                            | Local to the node | 1-10 GB/s                             | Very low                   | Very short term, cleaned after every job
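
As an illustration, the following Python sketch checks which of these storage areas are defined on the current system and how much space is free on each. The environment variable names follow the table above; any that are not defined on your server are simply skipped.

    # Minimal sketch: report free space for the usual storage areas.
    import os
    import shutil

    for var in ("HOME", "SCRATCH", "LSCRATCH", "RAMDISK"):
        path = os.environ.get(var)
        if path is None or not os.path.isdir(path):
            continue  # this storage unit is not available on this system
        usage = shutil.disk_usage(path)
        print(f"${var}: {path} - {usage.free / 1e9:.1f} GB free of {usage.total / 1e9:.1f} GB")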

Storage units

Storage units are distinguished by the underlying hardware, the access mode and the file system used. Typically, most systems offer the following storage units (note that the exact names may change from one server to the other):

$HOME
This is the directory where you arrive when you start a session. You can see it on all login nodes and all worker nodes. This is the appropriate place to put files that you regularly use: source code, programs, and configuration files. Often this data is backed up and can be retrieved in case of loss. For Cottos, Mp2, Ms2, and Psi this is a network file system, whereas it is a parallel file system for Briarée, Colosse, Guillimin and Hadès.
$SCRATCH
This directory is placed on a parallel filesystem, Lustre (for Colosse, Mp2, Ms2 and Cottos) or GPFS (for Briarée and Guillimin). It is generally visible from all nodes. Using it is very fast for large files, but not very efficient for many small files. This is the appropriate place to store large files that you use for a few days or weeks only. Periodically, it may be automatically cleaned (files being deleted).
$LSCRATCH
If available, this storage unit is local, since it is located on every worker node's hard drive. Its advantage is that its write performance is stable, because only one user can use it at a time. However, do not forget to copy your files back before closing the session, because this space is cleaned after every job; see the sketch below.
$RAMDISK
This unit is located on a virtual drive within the node's RAM, so using it reduces the memory available for computations. Such file systems are very fast for small files and are especially faster than the other storage types when file access is random. A RAM disk is always cleaned at the end of a session.

Certain systems offer a longer list of storage units. For the specifics of the system you use, see the server-specific documentation at the bottom of this page.
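
The typical workflow with these units can be sketched as follows in Python (the file names are hypothetical): stage the input from $SCRATCH onto the node-local $LSCRATCH, do the heavy input/output locally, then copy the results back to permanent storage before the job ends.

    # Minimal sketch of the local-scratch workflow.
    import os
    import shutil

    scratch = os.environ["SCRATCH"]                  # shared, survives the job
    lscratch = os.environ.get("LSCRATCH", "/tmp")    # node-local, wiped after the job

    # 1. Stage the input onto the local disk.
    local_input = shutil.copy(os.path.join(scratch, "input.dat"), lscratch)

    # 2. Do the intensive I/O locally (placeholder for the real computation).
    local_output = os.path.join(lscratch, "output.dat")
    with open(local_input, "rb") as src, open(local_output, "wb") as dst:
        dst.write(src.read())

    # 3. Copy the results back to permanent storage before the job ends.
    shutil.copy(local_output, scratch)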

Storage formats

Whatever storage unit you use, several choices exist for your files' format. The choice depends on the type of application (serial or parallel), the language it is written in (C, C++, Fortran, Python, etc.), the size of the data you need to write, and so on. Below we describe the most commonly used formats; the first two are the base formats on top of which the others are built. Short Python examples follow the summary table at the end of this section.

Text
Also called ASCII (though other character encodings are possible), this format is usually human readable. It may be viewed and modified in any text editor. However, reading and writing in this format is slow, and the files use more space. Although portable, this format may require some adjustments, especially regarding carriage returns and line feeds. It is practical for configuration and parameter files; structured files such as XML files use this format. It can be read and written by any programming language.
Binary
The main drawback of this format is that it is not human readable, even though some editors can edit such files. However, it provides much faster reading and writing and requires less space than the text format. Portability is somewhat limited if you change platform, due to endianness issues. This format can be read and written by any language, but portability is limited between Fortran and other languages.
MPI-IO
Only for parallel codes using MPI. This format is a subcategory of the binary format; the difference lies in the process of reading and writing rather than in the file itself. Using MPI-IO can help make a code's output independent of the number of MPI ranks (compared to writing in binary from a single rank). Reading/writing speed is similar to binary, but can be faster on a parallel file system such as Lustre or GPFS, depending on the network. Portability issues related to Fortran are partially avoided.
HDF5
This library makes structuring complex data easier. Since it is a self-describing format, it helps maintain a code whose data layout evolves. Endianness issues are managed by the library, making the files more portable. Reading/writing speeds are similar to what can be achieved with MPI-IO. HDF5 also supports compression, which may reduce file sizes, and contains optimizations for Lustre and GPFS.
netCDF
This is another library for structuring complex data. Since version 4, it is built on top of HDF5. Its main advantage is its interface, which is simpler than that of HDF5; however, it only offers a subset of the capabilities of HDF5.
Storage format description

Format | Size                 | Speed | Portability
Text   | Larger than required | Slow  | Very good
Binary | Good                 | Fast  | Limited by endianness
MPI-IO | Good                 | Fast  | Library required
HDF5   | Compressed           | Fast  | Library required
netCDF | Compressed           | Fast  | Library required
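
As an illustration, the following Python sketch writes the same array in text and in binary form with NumPy (the array and file names are only examples). The binary file is smaller and much faster to read back, but it is not human readable.

    # Minimal sketch: text versus binary output with NumPy.
    import numpy as np

    data = np.random.rand(1000, 3)

    np.savetxt("data.txt", data)   # text: readable in any editor, larger, slower
    np.save("data.npy", data)      # binary (NumPy's own format): compact, fast

    text_copy = np.loadtxt("data.txt")
    binary_copy = np.load("data.npy")
    assert np.allclose(text_copy, binary_copy)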
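
The next sketch uses the mpi4py bindings to write a single shared binary file with MPI-IO (the file name is illustrative). Each rank writes its own block at a fixed offset with a collective call, so the resulting file does not depend on the number of ranks; run it with, for example, mpiexec -n 4 python write_mpiio.py.

    # Minimal sketch: collective MPI-IO write with mpi4py.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    local = np.full(100, rank, dtype=np.float64)   # this rank's block of data

    fh = MPI.File.Open(comm, "output.bin", MPI.MODE_WRONLY | MPI.MODE_CREATE)
    offset = rank * local.nbytes                   # each rank writes at its own offset
    fh.Write_at_all(offset, local)                 # collective write
    fh.Close()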
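
With HDF5, a minimal example using the h5py Python bindings could look as follows (the file, dataset and attribute names are illustrative). Compression is enabled per dataset, and the metadata travels in the same self-describing file as the data.

    # Minimal sketch: compressed, self-describing output with h5py.
    import h5py
    import numpy as np

    data = np.random.rand(1000, 1000)

    with h5py.File("results.h5", "w") as f:
        f.create_dataset("temperature", data=data, compression="gzip")
        f["temperature"].attrs["units"] = "K"      # metadata stored with the data

    with h5py.File("results.h5", "r") as f:
        subset = f["temperature"][0:10, :]         # read only the slice you need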
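
Finally, the equivalent with the netCDF4 Python module (again with illustrative names) shows the simpler interface offered on top of HDF5.

    # Minimal sketch: writing and reading a netCDF-4 file.
    from netCDF4 import Dataset
    import numpy as np

    with Dataset("results.nc", "w") as ds:
        ds.createDimension("x", 100)
        temp = ds.createVariable("temperature", "f8", ("x",), zlib=True)
        temp.units = "K"
        temp[:] = np.linspace(250.0, 300.0, 100)

    with Dataset("results.nc", "r") as ds:
        print(ds["temperature"][:10])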

Server-specific documentation

$HOME
Individual space, different for each user.
Read- and write-accessible from all nodes, by the user only (by default).
Shared GPFS file system of 7.3 TB.
Data persists.
Regular backups.
$SCRATCH
Individual space, different for each user.
Read- and write-accessible from all nodes by the user.
Read-only access by group members.
Shared GPFS file system of 219 TB.
4 to 16 times faster than $HOME.
Data persists.
No backups.
$LSCRATCH
Local storage space for each node a job uses.
Temporary directory created for the job at the job's start, erased at the end.
Local ext4 file system of 182 GB.
No backups, please copy your results elsewhere before the job ends.
$PARALLEL_LSCRATCH
Distributed local storage shared between the nodes associated with a given job (uses their local disks).
Parallel file system between those nodes (FhGFS).
The available space is the sum of the $LSCRATCH spaces of those nodes.
Available upon request: add ENABLE_PARALLEL_LSCRATCH=1 to your submit script.
Temporary files for the duration of the job. You must copy files to your $HOME or $SCRATCH before the job is over.
$RAMDISK
Local storage space for each node, in memory.
Very fast.
Size smaller than one half of the node's memory.
Temporary directory created for the job at the job's start, erased at the end.
No backups, please copy your results elsewhere before the job ends.
$PARALLEL_RAMDISK
Distributed storage shared between the nodes associated with a given job (uses their RAM disks).
Parallel file system between those nodes (FhGFS).
The available space is the sum of the $RAMDISK spaces of those nodes.
Available upon request: add ENABLE_PARALLEL_RAMDISK=1 to your submit script.
Temporary files for the duration of the job. You must copy files to your $HOME or $SCRATCH before the job is over.
Colosse does not feature any local disk space on its nodes, apart from the RAM disk. All user-accessible storage resides on the Lustre file system. Moreover, no backups are made: researchers are entirely responsible for the integrity of their data.
$HOME
Accessible from all nodes.
Shared throughput of 10 GB/s.
The $HOME directory is usually accessible for reading by all group members, but only the owner can write.
$RAP
Accessible from all nodes.
Shared throughput of 10 GB/s.
Read- and write-accessible by all research group members.
$SCRATCH
Accessible from all nodes.
Shared throughput of 10 GB/s.
Read- and write-accessible by all research group members.
Periodically cleaned.
$RAMDISK
Local file system.
Throughput of 10 GB/s, size smaller than 12 GB.
Temporary, cleaned after every job.

Usage policy

Colosse's file system usage policy is documented on a separate page.
$HOME
Individual space, different for each user.
Read- and write-accessible from all nodes, by the user only (by default).
Shared GPFS file system of 745 GB.
Data persists.
Regular backups.
$SCRATCH
Individual space, different for each user.
Read- and write-accessible from all nodes by the user.
Read-only access by group members.
Shared GPFS file system of 151 TB.
Fast access for large files.
Data persists.
No backups.
$LSCRATCH
Local storage space for each node a job uses.
Local ext3 file system of 129 GB.
Create your own sub-directory in which to write, and delete it yourself before the end of the job.
No backups, please copy your results elsewhere before the job ends.
/dev/shm
Local storage space for each node, in memory.
Very fast.
Size smaller than one half of the node's memory.
Create your own sub-directory in which to write, and delete it yourself before the end of the job.
No backups, please copy your results elsewhere before the job ends.
$HOME
Individual space, different for each user.
The $HOME directory is usually accessible for reading by all group members, but only the owner can write.
Shared file system (GPFS) of 3.7 PB, with a 10 GB quota per user.
Data persists.
Regular backups.
/gs/project/rapID or /sb/project/rapID
Group space, different for each group.
Read- and write-accessible from all nodes.
Read- and write-accessible to all group members.
Shared file system (GPFS) of 3.7 PB, with a 1 TB quota per group.
Data persists.
No backups.
$SCRATCH (/gs/scratch/username)
Individual space, different for each user.
Read- and write-accessible from all nodes.
Usually accessible for reading by all group members, but only the owner can write.
Shared file system (GPFS) of 3.7 PB.
Data persists.
No backups.
Periodically cleaned: files not modified in the last 45 days will be deleted on the 15th of every month.
$LSCRATCH (/localscratch/$PBS_JOBID)
Local storage space for each node a job uses.
Local ext4 file system of 343 GB.
Temporary directory created for the job at the job's start, erased at the end.
No backups, please copy your results elsewhere before the job ends.
$RAMDISK (/dev/shm/$PBS_JOBID)
Local storage space for each node, in memory.
Very fast.
Size smaller than one half of the node's memory.
Temporary directory created for the job at the job's start, erased at the end.
No backups, please copy your results elsewhere before the job ends.
$HOME
Individual space, different for each user.
Read- and write-accessible from all nodes, by the user only (by default).
Shared GPFS file system of 7.3 TB.
Data persists.
Regular backups.
$SCRATCH
Individual space, different for each user.
Read- and write-accessible from all nodes by the user.
Read-only access by group members.
Shared GPFS file system of 219 TB.
4 to 16 times faster than $HOME.
Data persists.
No backups.
$LSCRATCH
Local storage space for each node a job uses.
Temporary directory created for the job at the job's start, erased at the end.
Local ext4 file system of 412 GB.
No backups, please copy your results elsewhere before the job ends.
$RAMDISK
Local storage space for each node, in memory.
Very fast.
Size smaller than one half of the node's memory.
Temporary directory created for the job at the job's start, erased at the end.
No backups, please copy your results elsewhere before the job ends.
Helios shares the same filesystems as Colosse. No backup is performed. Researchers are responsible for the integrity of their data.
$HOME
Accessible from all nodes.
Shared throughput of 10 GB/s.
The $HOME directory is usually accessible for reading by all group members, but only the owner can write.
$RAP
Accessible from all nodes.
Shared throughput of 10 GB/s.
Read- and write-accessible by all research group members.
$SCRATCH
Accessible from all nodes.
Shared throughput of 10 GB/s.
Read- and write-accessible by all research group members.
Periodically cleaned.
$RAMDISK
Local file system.
Throughput of 10 GB/s, size smaller than 12 GB.
Temporary, cleaned after every job.
$LSCRATCH (or $LSCRATCH_JOB)
Local storage space for each node, on disk.
Throughput of about 1 GB/s, 2 TB available.
Temporary. Deleted after every job.
$LSCRATCH_USER
Local storage space for each node, on disk.
Throughput of about 1 GB/s, 2 TB available.
Temporary. Deleted once there are no remaining jobs from this user running on the node.


Usage policy

Colosse's file system usage policy is documented on a separate page.

Mammouth parallèle II

Some names were changed in 2014, based on the following ideas:

A name must describe the system well: Is it a parallel system? Is it for temporary storage? When is it purged?
It does not matter if the name is long: users can create aliases, and autocompletion can be used.
$HOME
Accessible from all nodes (including Ms2).
Shared throughput of 105 MB/s, size of 100 GB/group.
Backed up.
Please use the $HOME_GROUP directory to share data with colleagues.
$PARALLEL_SCRATCH_MP2_WIPE_ON_...
There are two versions with different purging dates.
Accessible from all nodes (but not from Ms2).
Shared throughput of 10 GB/s, size of 1 TB/group.
No backups.
Lustre distributed file system.
Please use the $PARALLEL_SCRATCH_GROUP_MP2_WIPE_ON_... directory to share data with colleagues.
$ARCHIVE
Accessible from all login nodes (including those of Ms2).
Shared throughput of 80 MB/s, size of 1 TB/group.
Backed up.
For long-term storage.
Please use the $ARCHIVE_GROUP directory to share data with colleagues.
$LSCRATCH
Local storage space on a node.
Throughput of 120 MB/s, size of 820 GB.
Temporary.
$PARALLEL_LOCAL_SCRATCH
Parallel storage shared between the nodes of a single job (uses their local disks).
Parallel file system (FhGFS) with compression (lz4).
1.8 TB times the number of nodes.
Striping of 512 KB.
Available on demand (define ENABLE_PARALLEL_LOCAL_SCRATCH=1 in your submit script).
Temporary files for the duration of the computation. You must copy files to $HOME or $PARALLEL_SCRATCH_MP2_... before the job ends.
$RAMDISK
Local storage space on a node.
Throughput of 1.8 GB/s, size lower than 32 GB.
Temporary.

Mammouth série II

Some names were changed in 2014, based on the following ideas:

A name must describe the system well: Is it a parallel system? Is it for temporary storage? When is it purged?
It does not matter if the name is long: users can create aliases, and autocompletion can be used.
$HOME
Accessible from all nodes (including Mp2).
Shared throughput of 80 MB/s, size of 100 GB/group.
Backed up.
Please use the $HOME_GROUP directory to share data with colleagues.
$PARALLEL_SCRATCH_MS2_WIPE_ON_...
Accessible from all nodes (except Mp2).
Shared throughput of 5.5 GB/s, size of 1 TB/group.
No backups.
Lustre distributed file system.
Please use the $PARALLEL_SCRATCH_GROUP_MS2_WIPE_ON_... directory to share data with colleagues.
$ARCHIVE
Accessible from all login nodes (including those of Mp2).
Shared throughput of 80 MB/s, size of 1 TB/group.
Backed up.
For long-term storage.
Please use the $ARCHIVE_GROUP directory to share data with colleagues.
$LSCRATCH
Local storage space on a node.
Throughput of 120 MB/s, size of 500 GB.
Temporary.
$RAMDISK
Local storage space on a node.
Throughput of 1.8 GB/s, size lower than 32 GB.
Temporary.

