Note: This documentation has been tested on Briarée. Some instructions may be different on other servers.
HADOOP is a tool created to facilitate map/reduce-style calculations. It is developed by Apache, written in Java, and distributed as open source software.
HADOOP has a few idiosyncrasies that are important to understand in order to use it as efficiently as possible on a compute cluster without disturbing other users. On Briarée, HADOOP is configured to use $SCRATCH as a shared global disk space. The more classical use of HADOOP assumes that each node has its own distinct disk space, as would be the case if it were configured to use the local scratch on the compute nodes. As a consequence, your HADOOP job places an additional load on the $SCRATCH filesystem, and you should take any necessary steps to ensure that the I/O operations carried out by HADOOP are reasonable and do not disrupt the activities of other users.
A sample PBS job script for HADOOP is as follows:
This example assumes that you want to run the sample randomwriter program, which is contained in the Java archive distributed with HADOOP. In this case, the output data from HADOOP is written to the directory hadoop_out in your $SCRATCH, whereas the output of the job itself is written to the file hadoop.txt.
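A minimal sketch of a job script matching this description might look like the following. It is an illustration under stated assumptions, not the configuration actually deployed on Briarée. The job name, the resource request (nodes, ppn, walltime), the HADOOP_EXAMPLES_JAR variable, and the jar file name hadoop-examples.jar are all hypothetical, since the name and location of the examples archive vary between HADOOP versions and installations. Any site-specific step needed to load or start HADOOP inside the job is not shown.

```shell
#!/bin/bash
#PBS -N hadoop_randomwriter
#PBS -l nodes=1:ppn=12
#PBS -l walltime=01:00:00
#PBS -j oe
#PBS -o hadoop.txt
# The directives above are assumptions: one node for one hour, with the
# job's stdout and stderr merged (-j oe) into the file hadoop.txt (-o).

# Hypothetical location of the examples archive shipped with HADOOP;
# set HADOOP_EXAMPLES_JAR yourself to match your installation.
HADOOP_EXAMPLES_JAR="${HADOOP_EXAMPLES_JAR:-${HADOOP_HOME:-}/hadoop-examples.jar}"

# HADOOP output directory inside $SCRATCH (falls back to $HOME if unset).
OUT_DIR="${SCRATCH:-$HOME}/hadoop_out"

# Run from the directory the job was submitted from, when known.
cd "${PBS_O_WORKDIR:-$PWD}"

if command -v hadoop >/dev/null 2>&1; then
    # randomwriter takes the output directory as its argument.
    hadoop jar "$HADOOP_EXAMPLES_JAR" randomwriter "$OUT_DIR"
else
    echo "hadoop not found in PATH; make HADOOP available in the job first" >&2
fi
```

Such a script would be submitted with qsub. Because randomwriter generates a large volume of data by default, it is worth checking how much will be written to $SCRATCH before running it on a shared filesystem.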