Running Spark Applications

You can run Spark applications locally or distributed across a cluster, either by using an interactive shell or by submitting an application. Running Spark applications interactively is commonly performed during the data-exploration phase and for ad hoc analysis.

Because of a limitation in the way Scala compiles code, some applications with nested definitions running in an interactive shell may encounter a Task not serializable exception. Cloudera recommends submitting these applications.

To run applications distributed across a cluster, Spark requires a cluster manager. Cloudera supports two cluster managers: YARN and Spark Standalone. When run on YARN, Spark application processes are managed by the YARN ResourceManager and NodeManager roles. When run on Spark Standalone, Spark application processes are managed by Spark Master and Worker roles.

In CDH 5, Cloudera recommends running Spark applications on a YARN cluster manager instead of on a Spark Standalone cluster manager, for the following benefits:

You can dynamically share and centrally configure the same pool of cluster resources among all frameworks that run on YARN.
You can use all the features of YARN schedulers for categorizing, isolating, and prioritizing workloads.
You choose the number of executors to use; in contrast, Spark Standalone requires each application to run an executor on every host in the cluster.
Spark can run against Kerberos-enabled Hadoop clusters and use secure authentication between its processes.

For information on monitoring Spark applications, see Monitoring Spark Applications .

Continue reading:

Submitting Spark Applications
spark-submit Options
Cluster Execution Overview

Submitting Spark Applications

To submit an application consisting of a Python file or a compiled and packaged Java or Spark JAR, use the spark-submit script.

spark-submit Syntax

spark-submit --option value \
  application jar | python file [application arguments]

Example: Running SparkPi on YARN demonstrates how to run one of the sample applications, SparkPi, packaged with Spark. It computes an approximation to the value of pi.

spark-submit Arguments
Option	Description
`application jar`	Path to a JAR file containing a Spark application. For the client deployment mode, the path must point to a local file. For the cluster deployment mode, the path can be either a local file or a URL globally visible inside your cluster; see Advanced Dependency Management.
`python file`	Path to a Python file containing a Spark application. For the client deployment mode, the path must point to a local file. For the cluster deployment mode, the path can be either a local file or a URL globally visible inside your cluster; see Advanced Dependency Management.
`application arguments`	Arguments to pass to the main method of your application.

spark-submit Options

You specify spark-submit options using the form --optionvalue instead of --option=value. (Use a space instead of an equals sign.)

Option	Description
`class`	For Java and Scala applications, the fully qualified classname of the class containing the main method of the application. For example, `org.apache.spark.examples.SparkPi`.
`conf`	Spark configuration property in `key`=`value` format. For values that contain spaces, surround "`key`=`value`" with quotes (as shown).
`deploy-mode`	Deployment mode: cluster and client. In cluster mode, the driver runs on worker hosts. In client mode, the driver runs locally as an external client. Use cluster mode with production jobs; client mode is more appropriate for interactive and debugging uses, where you want to see your application output immediately. To see the effect of the deployment mode when running on YARN, see Deployment Modes. Default: `client`.
`driver-class-path`	Configuration and classpath entries to pass to the driver. JARs added with --jars are automatically included in the classpath.
`driver-cores`	Number of cores used by the driver in cluster mode. Default: 1.
`driver-memory`	Maximum heap size (represented as a JVM string; for example 1024m, 2g, and so on) to allocate to the driver. Alternatively, you can use the `spark.driver.memory` property.
`files`	Comma-separated list of files to be placed in the working directory of each executor. For the client deployment mode, the path must point to a local file. For the cluster deployment mode, the path can be either a local file or a URL globally visible inside your cluster; see Advanced Dependency Management.
`jars`	Additional JARs to be loaded in the classpath of drivers and executors in cluster mode or in the executor classpath in client mode. For the client deployment mode, the path must point to a local file. For the cluster deployment mode, the path can be either a local file or a URL globally visible inside your cluster; see Advanced Dependency Management.
`master`	The location to run the application.
`packages`	Comma-separated list of Maven coordinates of JARs to include on the driver and executor classpaths. The local Maven, Maven central, and remote repositories specified in `repositories` are searched in that order. The format for the coordinates is `groupId:artifactId:version`.
`py-files`	Comma-separated list of .zip, .egg, or .py files to place on `PYTHONPATH`. For the client deployment mode, the path must point to a local file. For the cluster deployment mode, the path can be either a local file or a URL globally visible inside your cluster; see Advanced Dependency Management.
`repositories`	Comma-separated list of remote repositories to search for the Maven coordinates specified in `packages`.

Master Values
Master	Description
local	Run Spark locally with one worker thread (that is, no parallelism).
local[`K`]	Run Spark locally with `K` worker threads. (Ideally, set this to the number of cores on your host.)
local[*]	Run Spark locally with as many worker threads as logical cores on your host.
spark://`host`:`port`	Run using the Spark Standalone cluster manager with the Spark Master on the specified host and port (7077 by default).
yarn	Run using a YARN cluster manager. The cluster location is determined by `HADOOP_CONF_DIR` or `YARN_CONF_DIR`. See Configuring the Environment.

Cluster Execution Overview

Spark orchestrates its operations through the driver program. When the driver program is run, the Spark framework initializes executor processes on the cluster hosts that process your data. The following occurs when you submit a Spark application to a cluster:

The driver is launched and invokes the main method in the Spark application.
The driver requests resources from the cluster manager to launch executors.
The cluster manager launches executors on behalf of the driver program.
The driver runs the application. Based on the transformations and actions in the application, the driver sends tasks to executors.
Tasks are run on executors to compute and save results.
If dynamic allocation is enabled, after executors are idle for a specified period, they are released.
When driver's main method exits or calls SparkContext.stop, it terminates any outstanding executors and releases resources from the cluster manager.

Configuring Spark Applications

Running Spark Applications on YARN