Writing Your Own Code: Best Practices, Pitfalls and Tips
When you write your own code to carry out a numerical simulation, how you go about writing the source code can have a major influence on the program's performance and the time needed to obtain the desired results. You could write a parallel program where each process takes advantage of the server's hardware and produces results in a few days or alternatively you could write a program that is difficult to parallelize and whose inefficient algorithms will require years or even decades to produce useful results.
The following is some advice to avoid ending up in the latter situation.
Choose the Right Algorithm
The algorithm you use should be adapted to the size of the data. If the data size can increase without limit, it is vital to employ an algorithm whose computation time increases slowly with the size of the problem. For example, if you need to sort an array of size n where n is large, it's preferable to use algorithms like merge sort, quick sort or heap sort, which all have an algorithmic complexity O(n log(n)) instead of a bubble sort, selection sort and insertion sort, which have an algorithmic complexity of O(n2).
Other criteria than just the algorithmic complexity can be important: the precision of the results or the ease of parallelization for example. For sorting algorithms, we could also add the stability, the worst case algorithmic complexity and whether or not the sort is done in place.
Take Advantage of Existing Libraries
Why reinvent the wheel? There are a great number of highly optimized software libraries available, for linear algebra (MKL, SuperLU), Fourier transforms (FFTW) and others. For a complete overview, including packages not available on Calcul Québec servers, you can consult the GAMS or Guide to Available Mathematical Software.
Use a Compiled Language
Compiled languages (Fortran, C/C++) offer in general a better performance than interpreted languages (Java, Python, Perl, R). Most interpreted languages allow you to call functions written in a compiled language like C, so that critical portions of a program can be rewritten in a compiled language as needed.
Understand the Algorithm's Limitations
The performance of an algorithm can be limited by the processor, memory access or the network. The way to optimize the algorithm depends on understanding these limitations.
Profile a Representative Simulation
It's pointless to try to optimize a part of the code which takes a negligible portion of the runtime. To work on the program's hot spots, you need to first identify them using a profiler (for example gprof, Tau or Open SpeedShop). You can then optimize their performance by modifying how the computation is done (changing the algorithm or some other method) or by finding a way to call these code sections less frequently.
Respect the Memory Hierarchy
In increasing order of size and decreasing order of performance, you can write variables in the CPU registers, in levels 1, 2 or 3 of cache, the main memory (local or not) and on disk. To maximize the performance of your code, you should reuse as much as possible the data when they're in the register or a CPU cache. Computations should therefore be ordered with care in performance-critical parts of the code.
Use the Right Compilation Options
Consult the page Compilation de code/en for more details.
Wasting CPU Cycles Instead of Memory
Calcul Québec's servers have from 16 to 512 GB of memory per node, depending on the machine. As this memory is often being used by a single simulation, it makes more sense to use it if that can improve the performance, so avoid recalculating the same quantity several times to save memory.
Excessive Memory Use
If you use more memory than is physically available on the node, the operating system will begin to swap, i.e. to use part of the hard disk space as an extended (and extremely slow) form of memory. Needless to say, allowing this to happen is a bad idea - you should either choose a server or category of nodes which have enough memory for your simulation or else use an algorithm that requires less memory.
Writing Results to Disk Inefficiently or Too Often
Parallel filesystems have an elevated latency, you need to write your data to the disk in large blocks to avoid overloading the filesystem and so slowing down your simulation, as well as those of other users. See the page Using_available_storage.
This is where a typical scientific program will spend most of its time and it's therefore important to put a fair amount of thought into your code's loops, in particular nested loops.
Here are some tips that can improve the performance of loops:
- Move outside of the loop those calculations which are constant from one iteration to the next;
- Sweep through arrays in a continuous fashion in memory by looping over the first dimension of the array in the innermost loop in Fortran (in C/C++, you should loop over the last dimension of the array in the innermost loop);
- Avoid branching (if statements) and input/output operations inside a loop, if possible;
- Unroll loops (in some cases, this is handled automatically by the compiler);
- Process data in blocks or loop blocking (here as well the compiler often does this automatically).