Apache Spark is an open-source distributed computing framework initially developed by the AMPLab at UC Berkeley and now a project of the Apache Software Foundation. Unlike Hadoop's MapReduce paradigm, which writes intermediate results to local disk, Spark keeps its working data in memory, allowing it to run up to 100 times faster for certain applications. Because data loaded in memory can be accessed repeatedly at low cost, Spark is particularly well suited to machine learning and interactive data analysis.
Usage of Apache Spark
The following instructions have only been tested on Colosse.
Starting a Spark Cluster
In the context of using a Calcul Québec machine, a Spark cluster must be started at the beginning of each job. To accomplish this, enter the following command:
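The command itself is missing from this copy of the page. On a standard Spark installation, a standalone cluster is typically started with Spark's sbin scripts; the site-specific wrapper used on a Calcul Québec machine may have a different name, so treat the following as an assumption:

```shell
# Assumption: standard Spark standalone scripts; adjust SPARK_HOME to
# the local installation path.
export SPARK_HOME=/path/to/spark
$SPARK_HOME/sbin/start-all.sh   # starts the master, then a worker on each configured node
```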
This command first starts the Spark scheduler (master) and then a Spark worker daemon on each node assigned to the job.
Submitting an Application
To submit an application to the Spark scheduler, use the spark-submit command. Its most important arguments when used on a Calcul Québec cluster are described below.
spark-submit --master spark://$HOSTNAME:7077 \
             --executor-memory 20G \
             application [arg1 arg2 ...]
--master spark://$HOSTNAME:7077: the address of the Spark scheduler; at Calcul Québec, this is always the node on which the job script is running, so we can use the environment variable $HOSTNAME, which stores the hostname of that node.
--executor-memory 20G: the amount of memory allocated to each Spark executor (worker process); the value should be less than the memory available on a compute node.
application: the file containing your Spark application; this can be a Java archive (.jar), a Python script (.py) or an R script (.R).
[arg1 arg2 ...]: any arguments required by your application.
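As a concrete illustration, submitting a hypothetical Python application (the file name and input argument here are made up for the example) would look like this:

```shell
# Hypothetical application and argument, shown only to illustrate the syntax.
spark-submit --master spark://$HOSTNAME:7077 \
             --executor-memory 20G \
             wordcount.py input.txt
```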
Stopping a Spark Cluster
When you are finished with Spark, you can stop the Spark cluster with the following command:
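The command itself is not preserved in this copy of the page. With a standard Spark installation, the standalone cluster is stopped with the sbin counterpart of the start script; the exact site-specific command may differ, so this is an assumption:

```shell
# Assumption: standard Spark standalone scripts.
$SPARK_HOME/sbin/stop-all.sh   # stops the worker daemons on each node, then the master
```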
The command first stops the daemons running on each node and then the Spark scheduler.
Calculation of the Digits of Pi using Scala
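The body of this section is not present in this copy. As a sketch of what it covers, the classic Monte Carlo estimate of π distributed with Spark as the SparkPi example looks roughly like this (note that this method estimates π rather than computing its digits exactly):

```scala
import org.apache.spark.sql.SparkSession
import scala.math.random

// Monte Carlo estimation of pi: sample random points in the square
// [-1, 1] x [-1, 1] and count the fraction landing inside the unit circle.
object SparkPi {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SparkPi").getOrCreate()
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = 100000L * slices
    val count = spark.sparkContext
      .parallelize(1L to n, slices)
      .map { _ =>
        val x = random * 2 - 1
        val y = random * 2 - 1
        if (x * x + y * y <= 1) 1 else 0
      }
      .reduce(_ + _)
    // The ratio of hits approximates the area ratio pi/4.
    println(s"Pi is roughly ${4.0 * count / n}")
    spark.stop()
  }
}
```

This program would be compiled into a .jar and passed to spark-submit as the application argument described above.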