Throughout much of the modern computing era, accelerating computations was accomplished by increasing the processor's clock frequency. Since the number of operations needed to solve a given problem is constant (for a deterministic computation), increasing the clock frequency allowed more operations to be performed per second and thus reduced the computation time. Unfortunately, due to physical constraints such as heat dissipation, it became impossible around 2004 to increase the clock frequency beyond 3 or 4 GHz. With the clock frequency now fixed at 2 to 4 GHz, programmers can no longer rely on advances in processor design to automatically make programs run faster; the only way to achieve dramatic performance improvements is to parallelize the code so that it can run on several CPUs at the same time.
Kinds of Parallelization
There exist several levels of parallelism, including some which are automatic like bit-level parallelism and instruction-level parallelism. These two kinds of parallelism are handled by the processor, using special instructions, and by compilers which make use of them.
Next is data parallelism. In this case, the same task is executed on different sets of data or with different parameters, so the tasks are completely independent of one another. Good examples of this kind of parallelism are the exploration of a system's phase space and parameter sweeps. Note that despite the independence of the tasks, they can still share certain resources: on a supercomputer, two independent tasks will often share the same memory bus and almost always the same network and filesystem. It's important to be aware of these shared resources in order to optimize the tasks; in general, you should limit your filesystem and network operations as much as possible in order to keep your tasks independent of those of other users.
Finally there is also task parallelism. In this case, a task working on a given set of data is divided into several independent sub-tasks that are executed by distinct CPUs. The computation of the principal task generally requires synchronization and the exchange of data between the sub-tasks. These communications will generally limit the performance gain that can be achieved. Implementing this kind of parallelism in an application will demand modifications - sometimes substantial - to the source code.
Amdahl's Law expresses, in mathematical terms, the fact that the maximum possible performance gain from a given optimization of an algorithm is limited by the non-optimized part of the algorithm. Applied to parallelization as a form of optimization, it asserts that the attainable performance gain is limited by the part of the algorithm that remains serial. In the best-case scenario, parallelizing the entire algorithm offers a linear gain: by doubling the available resources, the execution time is halved. In practice, only part of the algorithm is ever parallelized, so that if for instance we parallelize a part of the code which accounts for 10% of the runtime, we cannot hope to reduce the total runtime by more than 10%, even with an infinite amount of resources. What's more, parallelization almost always requires a certain amount of additional communication and synchronization, which further diminishes this potential gain. What's generally observed is that the performance gain for the parallel part of the code is roughly linear for a few processors and then levels off as more processors are added.
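In symbols, if p is the fraction of the runtime that can be parallelized and N is the number of processors, the speedup S is:

```latex
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}}, \qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}
```

For the 10% example above, p = 0.1 gives a maximum speedup of 1/0.9 ≈ 1.11, i.e. a runtime reduction of at most 10%, no matter how many processors are used.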
Data parallelism is probably the simplest method of parallelizing a task. In general no modification of the algorithm is necessary, all that's needed is to start several independent processes on one or more compute nodes, each process carrying out the computation on a given set of parameters.
Even though it isn't necessary to modify the algorithm in this case, it may be necessary or useful to modify your job submission script. This is necessary, for example, if the nodes are allocated in such a way that each job gets an entire node, as is the case on Colosse and Mp2. In such circumstances, if you don't adapt your script you will waste a significant fraction of the resources that have been allocated to you.
As demonstrated in the section Multiple serial jobs, you can start several tasks in parallel using a single job submission script by adding the character & at the end of each command and the line wait at the end of your script. The ampersand tells the shell to execute the command in the background; the wait command tells the shell not to exit until all of the background processes have finished.
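As a minimal sketch of this technique (with sleep standing in for a real simulation program):

```shell
#!/bin/bash
# Three independent computations started in the background; the & returns
# control to the shell immediately after each one is launched.
(sleep 1; echo "task 1 done") > task1.log &
(sleep 1; echo "task 2 done") > task2.log &
(sleep 1; echo "task 3 done") > task3.log &

# wait blocks until every background process has finished; without it the
# script would exit immediately and the background tasks would be killed.
wait
echo "all tasks finished"
```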
Note: If you forget the wait command, the script will terminate and the any background processes will be killed.
Arrays of serial jobs, supported by Torque and Moab, allow the user to manage the execution of a large number of jobs by varying parameters. You can combine this technique with that of background processing in order to execute hundreds or even thousands of jobs with a single submission. For an example of a job array submission script, see the section Sequential task array.
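As an illustration, a Torque job array script might look like the following sketch; the program name sim and its parameter handling are placeholders:

```shell
#!/bin/bash
#PBS -N param_sweep
#PBS -t 1-100
# The -t directive creates 100 sub-jobs with indices 1 to 100. Each sub-job
# receives its index in the PBS_ARRAYID environment variable and uses it to
# select its own parameter value.
cd $PBS_O_WORKDIR
./sim --param $PBS_ARRAYID > out_$PBS_ARRAYID.log
```

This is a configuration sketch; it only runs when submitted to a Torque/Moab scheduler with qsub.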
GNU Parallel is a tool available on certain Calcul Québec supercomputers. Compared to running jobs in the background or using job arrays, GNU Parallel facilitates the management of a large number of tasks that are short-lived (compared to the maximal compute time of a job) or of very irregular duration. It's particularly useful for optimizing your resource usage because it automatically starts a new task as soon as an earlier one completes.
BqTools is another tool which simplifies the submission of sets of serial jobs. It's available on Ms2 and Mp2 and is especially useful for varying the parameters of a simulation (phase space exploration).
Points to Consider
Though parallelization based on data parallelism should in theory give an optimal performance improvement, in practice this is not always achieved because of shared resources. It's important to reduce the use of these shared resources as much as possible; for a serial process, you should reduce, in order of importance, the use of the filesystem and then memory accesses.
Task parallelism refers to the case in which it is the tasks to be accomplished which need to be parallelized, rather than the data or input parameters. In this case it's almost always necessary to modify the source code of the application. An exception: in certain very simple situations, the compiler is sometimes capable of automatically parallelizing part of the code. Traditionally, there are two techniques for parallelizing a task: using several threads or using several processes.
In Linux, threads have access to a shared pool of memory. They can therefore operate on the same data. This offers the advantage that it isn't necessary to copy data from one thread to another and thus eliminates much of the need for communication between the threads. The catch however is that the programmer must ensure the coherence of the memory among the threads. For example, if thread 0 needs to read and modify an element x in the memory, it's vital to ensure that no other thread will modify this same element between the reading and writing of it. Programmers typically employ locks to "serialize" the write access to variables among the threads. Additionally, the threads must all be running on the same node, which obviously limits the number of CPU cores (and thus threads) available for computation, so that on most Calcul Québec machines a thread-based approach cannot use more than 8 to 12 threads.
To parallelize a program using threads in Linux, you can use pthreads (POSIX threads). This technique is generally recommended for the parallelization of general-purpose programs. For the parallelization of scientific and HPC programs, which make intensive use of loops, it's recommended to instead consider OpenMP. This technique involves a set of compiler directives - limited but simple and powerful - which are interpreted by C, C++ and Fortran compilers and enormously simplify thread management. Parallelizing a program using OpenMP can be as simple as adding two or three well-placed directives. A program modified in this way also has the advantage that it can be compiled to execute with several threads or with just one.
Some introductory tutorials on pthreads are available on the Web, here's a short list:
By Process: MPI
Unlike threads, processes in Linux don't share their memory space. This has the advantage of making it easier to ensure the integrity of the data in memory; leaving aside advanced techniques such as RDMA, a process cannot, even by accident, alter the memory space of another process. The disadvantage of this isolation is that explicit communication is now needed to copy data between two processes. For this, we generally use the Message Passing Interface (MPI) library.
Unlike threads, parallelization by process using MPI allows you to use a much greater amount of resources, since the processes can be on the same or on different compute nodes. MPI allows communication between processes regardless of the interconnect (Ethernet, InfiniBand or shared memory, for instance).
Massively Parallel Architectures
Following the plateau in processor frequencies, the tendency has been to manufacture processors with several cores. Researchers soon realized that they could exploit graphics processors, which are massively parallel processors, in the same way that they use multi-core processors. This tendency accelerated further with the introduction of Intel's Many Integrated Core (MIC) platform in 2012.
Graphics processors or GPUs are low-frequency processors with hundreds or even thousands of cores. For example, NVIDIA's K20X processor has 2688 cores clocked at 732 MHz. GPUs have their own memory, separate from the system's main memory. This separation creates a further challenge: data must be transferred from the system memory to the GPU memory, the computation performed, and the results transferred back to the system memory. To obtain a worthwhile performance gain, an algorithm should be designed to minimize these transfers.
Graphics processors have historically been optimized for single precision (32 bit) calculations, which was sufficient for their primary application, video games. With the development of high-performance computing using GPUs, double precision calculations have become possible in the past few years but are generally slower than single precision calculations. For example, the GTX 580 cards in Hadès carry out double precision calculations at an eighth of the speed of single precision calculations. However, with the Tesla family, the GPUs belonging to the Kepler 20 generation are much better at carrying out double precision calculations, able to achieve a third of the speed of single precision calculations.
Graphics processors excel at carrying out the same operation on thousands of different elements. They are, however, ill-adapted to code that involves many conditional branches (if statements).
Many-Integrated Cores (MIC) Processors
Processors with many integrated cores are Intel's response to graphics processors. In fact these are co-processors, designed to offload certain calculations from the main processor. Unlike graphics processors, they can directly access the system memory. However, they have fewer processing cores than GPUs. The recent Xeon Phi processors for example have around 60 processing cores.
Usage of Massively Parallel Architectures
Several different technologies are competing to facilitate the use of massively parallel architectures. For graphics processors, NVIDIA offers CUDA, a proprietary API that has become the de facto standard. OpenCL is an open alternative to CUDA and also supports graphics processors from ATI and Intel. Finally, OpenACC has emerged as a new standard intended to simplify the use of these architectures. Following the same approach as OpenMP, OpenACC allows graphics processors to be used by simply adding a few compiler directives.
Hybrid parallelism consists of using two or more forms of parallelism at the same time. For example, one part of the code could be parallelized by means of MPI while another part uses OpenMP. A typical situation would be a master-slave algorithm, where one process distributes work to several others. This distribution could be carried out across several compute nodes, each of which corresponds to a slave process. These slaves could then do their work in parallel using several threads. In such an example it's important to inform the process manager mpiexec that it should only start a single process per node. Other points to consider include specifying which of the threads should handle the communications, but this goes beyond the scope of this introduction. Another example of hybrid parallelism would be to use several different graphics cards.