Apache Spark

Description

Apache Spark is an open source distributed computing framework initially developed by the AMPLab at UC Berkeley and now a project of the Apache Software Foundation. Unlike the MapReduce paradigm used by Hadoop, which relies on local disk storage, Spark works with primitives held in memory, which can make it up to 100 times faster for certain applications. Keeping data in memory allows it to be accessed repeatedly, which makes Spark particularly well suited to machine learning and interactive data analysis.
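
As a rough illustration of this in-memory model, here is a minimal Scala sketch (the object name and values are only examples, not part of the Calcul Québec setup) that caches an RDD once and then runs two computations on it without rebuilding it:

import org.apache.spark.sql.SparkSession

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("CacheSketch").getOrCreate()
    val sc = spark.sparkContext

    // Distribute one million integers across the cluster and keep them in memory.
    val numbers = sc.parallelize(1 to 1000000).cache()

    // Both actions below reuse the cached partitions instead of
    // recomputing the RDD from its source.
    val evens = numbers.filter(_ % 2 == 0).count()
    val threes = numbers.filter(_ % 3 == 0).count()
    println(s"multiples of 2: $evens, multiples of 3: $threes")

    spark.stop()
  }
}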

Usage of Apache Spark

The following instructions have only been tested on Colosse.

Starting a Spark Cluster

When using a Calcul Québec machine, a Spark cluster must be started at the beginning of each job. To do so, run the following command:

start-all.sh

This command first starts the Spark scheduler, then the Spark daemons on each node associated with the job.

Submitting an Application

To submit an application to the Spark scheduler, use the spark-submit command. The most important arguments of this command when used on a Calcul Québec cluster are described below.

spark-submit --master spark://$HOSTNAME:7077 \
             --executor-memory 20G \
             application [arg1 arg2 ...]
  • --master spark://$HOSTNAME:7077 : denotes the address of the Spark scheduler; at Calcul Québec, it will always be the node on which the job script is running, so you can use the environment variable $HOSTNAME, which contains the hostname of that node.
  • --executor-memory 20G : denotes how much memory will be allocated to each of the Spark workers; the value should be less than the amount of memory available on the compute node.
  • application: the file containing your Spark application; this could be a Java archive (.jar), a Python script (.py) or an R script (.R). A minimal sketch of such an application is shown after this list.
  • [arg1 arg2 ...]: any arguments needed for the execution of your application, if necessary.
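
For reference, a minimal Scala application that could be compiled into such a .jar is sketched below; the object name LineCount and the use of the first argument as an input file path are illustrative assumptions rather than anything provided by Calcul Québec:

import org.apache.spark.sql.SparkSession

// Illustrative application: counts the lines of a text file whose path
// is passed as the first command-line argument ([arg1] above).
object LineCount {
  def main(args: Array[String]): Unit = {
    val inputPath = args(0)

    // The master URL comes from the --master option of spark-submit,
    // so nothing about the cluster is hard-coded here.
    val spark = SparkSession.builder.appName("LineCount").getOrCreate()

    val lineCount = spark.read.textFile(inputPath).count()
    println(s"$inputPath contains $lineCount lines")

    spark.stop()
  }
}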

Stopping a Spark Cluster

When you are done using Spark, you can stop the Spark cluster with the following command:

stop-all.sh

The command first stops the daemons running on each node and then the Spark scheduler.

Job Examples

Calculating Pi using Scala

File : submit_sparkpi.pbs
#!/bin/bash
#PBS -N SparkPi
#PBS -l nodes=4:ppn=8
#PBS -l walltime=00:20:00
 
cd "${PBS_O_WORKDIR}"
 
module load apps/spark/2.0.0
 
# Launch Spark cluster
start-all.sh
 
# Submit SparkPi.jar application to the Spark cluster
spark-submit --master spark://$HOSTNAME:7077 \
             --executor-memory 20G \
             SparkPi.jar
 
# Stop Spark cluster
stop-all.sh
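
The SparkPi.jar archive submitted above is not reproduced on this page. As a rough idea of what such an application could contain, here is a Scala sketch modelled on the SparkPi example distributed with Spark; the object name, number of slices and sample count are assumptions:

import scala.math.random

import org.apache.spark.sql.SparkSession

// Monte Carlo estimation of Pi: sample random points in the square
// [-1, 1] x [-1, 1] and count how many fall inside the unit circle.
object SparkPi {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SparkPi").getOrCreate()
    val sc = spark.sparkContext

    val slices = if (args.length > 0) args(0).toInt else 100
    val samplesPerSlice = 100000
    val totalSamples = slices.toLong * samplesPerSlice

    // Each slice draws its samples independently, so the work is
    // distributed across the Spark workers.
    val inside = sc.parallelize(1 to slices, slices).map { _ =>
      var count = 0L
      var i = 0
      while (i < samplesPerSlice) {
        val x = random * 2 - 1
        val y = random * 2 - 1
        if (x * x + y * y <= 1) count += 1
        i += 1
      }
      count
    }.reduce(_ + _)

    println(s"Pi is roughly ${4.0 * inside / totalSamples}")
    spark.stop()
  }
}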
