Running Your First Spark Application
The simplest way to run a Spark application is by using the Scala or Python shells.
- To start one of the shell applications, run one of the following commands:
- Scala:
    $ SPARK_HOME/bin/spark-shell
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version ...
          /_/

    Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
    Type in expressions to have them evaluated.
    Type :help for more information
    ...
    SQL context available as sqlContext.

    scala>
- Python:
    $ SPARK_HOME/bin/pyspark
    Python 2.6.6 (r266:84292, Jul 23 2015, 15:22:56)
    [GCC 4.4.7 20120313 (Red Hat 4.4.7-11)] on linux2
    Type "help", "copyright", "credits" or "license" for more information
    ...
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version ...
          /_/

    Using Python version 2.6.6 (r266:84292, Jul 23 2015 15:22:56)
    SparkContext available as sc, HiveContext available as sqlContext.
    >>>
In a CDH deployment, SPARK_HOME defaults to /usr/lib/spark in package installations and /opt/cloudera/parcels/CDH/lib/spark in parcel installations. In a Cloudera Manager deployment, the shells are also available from /usr/bin.
For a complete list of shell options, run spark-shell or pyspark with the -h flag.
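The default locations mentioned above can be checked programmatically before launching a shell. The following is a minimal sketch, not part of the Spark or CDH tooling: the function name resolve_spark_home is hypothetical, and giving an explicitly set SPARK_HOME environment variable priority over the defaults is an assumption made for illustration.

```python
import os

# Default locations quoted in the text above for CDH deployments.
DEFAULT_SPARK_HOMES = (
    "/usr/lib/spark",                       # package installation
    "/opt/cloudera/parcels/CDH/lib/spark",  # parcel installation
)


def resolve_spark_home(environ=None, candidates=DEFAULT_SPARK_HOMES):
    """Hypothetical helper: return a usable SPARK_HOME, or None.

    Assumption: an explicitly set SPARK_HOME wins; otherwise the first
    candidate directory that exists on this machine is returned.
    """
    environ = os.environ if environ is None else environ
    explicit = environ.get("SPARK_HOME")
    if explicit:
        return explicit
    for path in candidates:
        if os.path.isdir(path):
            return path
    return None


if __name__ == "__main__":
    home = resolve_spark_home()
    if home is None:
        print("No Spark installation found in the default locations")
    else:
        # Path of the Scala shell launcher under the resolved directory.
        print(os.path.join(home, "bin", "spark-shell"))
```

On a Cloudera Manager deployment this lookup is usually unnecessary, because the shells are also on the PATH through /usr/bin.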
- To run the classic Hadoop word count application, copy an input file to HDFS:
$ hdfs dfs -put input
- Within a shell, run the word count application using the following code examples, substituting your own values for namenode_host, path/to/input, and path/to/output:
- Scala
    scala> val myfile = sc.textFile("hdfs://namenode_host:8020/path/to/input")
    scala> val counts = myfile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    scala> counts.saveAsTextFile("hdfs://namenode_host:8020/path/to/output")
- Python
    >>> myfile = sc.textFile("hdfs://namenode_host:8020/path/to/input")
    >>> counts = myfile.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda v1, v2: v1 + v2)
    >>> counts.saveAsTextFile("hdfs://namenode_host:8020/path/to/output")
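To see what the three chained transformations compute, it can help to trace them on a few lines without a cluster. The sketch below mimics flatMap, map, and reduceByKey in plain Python on an in-memory list; it does not use Spark, and the function name word_count and the sample input are made up for illustration.

```python
from collections import defaultdict


def word_count(lines):
    """Plain-Python imitation of the flatMap / map / reduceByKey pipeline."""
    # flatMap: split each line on single spaces and flatten into one list.
    words = [word for line in lines for word in line.split(" ")]
    # map: pair every word with an initial count of 1.
    pairs = [(word, 1) for word in words]
    # reduceByKey: add up the counts that share the same word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


if __name__ == "__main__":
    sample = ["to be or not", "to be"]
    print(word_count(sample))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Because the split is on a single space, consecutive spaces in the input produce empty-string "words" in this sketch, and the Python shell example behaves the same way since it uses the same split call. Splitting with line.split() (no argument) instead treats any run of whitespace as one separator.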