Using the Intel Xeon Phi
Intel Xeon Phi 5110P co-processors are available on the Guillimin cluster. The following table summarizes the Xeon Phi nodes available on Guillimin.
|Node Type|Count|Processors/node|Total Cores|Memory (GB/core)|Total Memory (GB)|Cards|Total peak SP FP (TFlops)|Total peak DP FP (TFlops)|
|---|---|---|---|---|---|---|---|---|
|AW Phi|50|16 Sandybridge|800|4|3,200|Dual Intel Phi 5110P, 60 cores, 1.053 GHz, 30 MB cache, 8 GB memory, Peak SP FP: 2.0 TFlops, Peak DP FP: 1.0 TFlops|200|100|
The Intel Xeon Phi has a less specialized architecture than a GPU and is designed to be familiar to anyone with experience of parallel programming in an x86 environment. Its cores are based on Intel's Pentium design, and the card runs a version of the Linux operating system. It can therefore execute parallel code written for 'normal' computers using a wide variety of modern and legacy programming models, including Pthreads, OpenMP, MPI, and even GPU programming models such as OpenCL. You may thus be able to port your applications to the Phi without much modification, although optimizing your application specifically for the Phi is still recommended to achieve the best performance. Because of their compatibility with standard x86 hardware, Phi programmers can continue to use their favourite compilers, profilers, and debuggers. The Xeon Phi supports an offload programming model similar to how GPUs are used, but programs can also be run natively, directly on the card. Some of the most exciting new capabilities of the Kepler generation of Nvidia GPUs (Hyper-Q and dynamic parallelism) are quite natural on the Xeon Phi.
Submitting Xeon Phi jobs
Submitting a Xeon Phi job is similar to submitting a regular job. The main difference is that the submission script and/or the qsub command should specify the "mics=2" resource in the resource list and the job must be submitted to the phi queue (see the example below).
Example Intel Xeon Phi job:
$ qsub -q phi -l nodes=1:ppn=16:mics=2 ./submit_script.sh
Each Xeon Phi node has 2 co-processors and 16 cores. You may request mics=1 or mics=2, but your access will be limited to 1 or 2 accelerators (respectively) for each node reserved for your job. You must request at least one core per node to gain access to the Xeon Phis.
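Equivalently, the resource requests can be embedded in the submission script itself as PBS directives. The following is a minimal sketch; the walltime value and the program being launched (`./my_phi_program`) are placeholders for illustration, not requirements of the cluster.

```shell
#!/bin/bash
#PBS -q phi
#PBS -l nodes=1:ppn=16:mics=2
#PBS -l walltime=01:00:00

# Run from the directory the job was submitted from
cd $PBS_O_WORKDIR

# Load the Xeon Phi environment before launching Phi software
module add ifort_icc MIC

./my_phi_program
```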
Xeon Phi software modules
If you are using our Xeon Phi nodes, please note the availability of the following modules:
- PGI accelerator compilers (including OpenACC and CUDA-fortran) - module add pgiaccel
- Intel compilers version 14.0 (including support for Xeon Phi) - module add ifort_icc/14.0
- Xeon Phi development environment variables (equivalent to source compilervars.sh) - module add MIC
- Intel SDK for OpenCL applications 1.2 (OpenCL compilers and ICD) - module add intel_opencl MIC
- Intel MPI 4.4.1 (including support for Xeon Phi) - module add ifort_icc/14.0 intel_mpi MIC
Please note that loading the MIC module is necessary to use many functions of the Xeon Phi cards. Please make sure this module is loaded if you encounter any errors trying to run Xeon Phi software.
Offload Mode Jobs
In offload mode, the accelerator is sent work (computational hotspots) by a process running on the host CPU(s). To use this mode, special instructions (such as directives or pragmas) must be used in the source code to indicate to the compiler how the accelerator is to be used. Please see our training materials for examples.
$ module add ifort_icc MIC
$ icc -o offload -openmp offload.c
$ ./offload
Native Mode Jobs
A native mode program is compiled for execution directly on the Xeon Phi and does not normally use any host resources. Often, parallel code will not need to be modified from its host-only version to compile and run it in native mode. To compile for native mode, use the -mmic compiler flag:
$ module add ifort_icc MIC
$ icc -mmic -o mm_omp.MIC -openmp mm_omp.c
$ micnativeloadex ./mm_omp.MIC
It is a good policy to add .MIC extensions to MIC binaries so they are not confused with CPU binaries. micnativeloadex is a program supplied by Intel that attempts to copy over any required libraries to the MIC device and then runs the MIC binary. In some cases, users may wish to manually set their paths instead of using micnativeloadex.
$ scp mm_omp.MIC mic0:~/.
$ ssh mic0 "export LD_LIBRARY_PATH=/software/compilers/Intel/2013-sp1-14.0/lib/mic:$LD_LIBRARY_PATH; ./mm_omp.MIC"
To use MPI in native mode, please use the intel_mpi module and set the I_MPI_MIC environment variable:
$ module add intel_mpi ifort_icc MIC
$ export I_MPI_MIC=enable
$ export I_MPI_MIC_POSTFIX=.MIC
$ mpiicc -mmic -o hello.MIC hello.c
$ mpirun -n 60 -host mic0 ./hello
MIC: Hello from aw-4r12-n37-mic0 4 of 60
MIC: Hello from aw-4r12-n37-mic0 21 of 60
MIC: Hello from aw-4r12-n37-mic0 30 of 60
...
Symmetric Mode Jobs
It is possible to have MPI processes run on both host cores and accelerator cores. Note that the different speeds of these cores can create load-balancing problems that should be addressed when designing MPI code for symmetric mode use on Xeon Phis. To use symmetric mode, the MPI code should be compiled separately for the host and the device. All of the advice for running MPI in native mode should be followed, and additionally the I_MPI_FABRICS environment variable should be set to shm:tcp:
$ module add intel_mpi ifort_icc MIC
$ export I_MPI_MIC=enable
$ export I_MPI_MIC_POSTFIX=.MIC
$ export I_MPI_FABRICS=shm:tcp
$ mpiicc -o hello hello.c
$ mpiicc -mmic -o hello.MIC hello.c
$ mpirun -perhost 1 -n 4 -host $(cat $PBS_NODEFILE $PBS_MICFILE | tr '\n' ',') ./hello
CPU: Hello from aw-4r12-n40 1 of 4
CPU: Hello from aw-4r12-n37 0 of 4
MIC: Hello from aw-4r12-n37-mic0 2 of 4
MIC: Hello from aw-4r12-n40-mic0 3 of 4
Xeon Phi Training/Education
Please see the recent Xeon Phi training event materials for more information about how to use Intel Xeon Phi co-processors effectively for your research. General information about parallel programming with MPI or OpenMP can also be found on this Wiki.