Building and Running a Crunch Application with Spark

Developing and Running a Spark WordCount Application provides a tutorial on writing, compiling, and running a Spark application. Using the tutorial as a starting point, do the following to build and run a Crunch application with Spark:

Along with the other dependencies shown in the tutorial, add the appropriate version of the crunch-core and crunch-spark dependencies to the Maven project.

<dependency>
  <groupId>org.apache.crunch</groupId>
  <artifactId>crunch-core</artifactId>
  <version>${crunch.version}</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.crunch</groupId>
  <artifactId>crunch-spark</artifactId>
  <version>${crunch.version}</version>
  <scope>provided</scope>
</dependency>

Use SparkPipeline where you would have used MRPipeline in the declaration of your Crunch pipeline. SparkPipeline takes either a String that contains the connection string for the Spark master (local for local mode, yarn for YARN) or a JavaSparkContext instance.
As you would for a Spark application, use spark-submit start the pipeline with your Crunch application app-jar-with-dependencies.jar file.

For an example, see Crunch demo. After building the example, run with the following command:

spark-submit --class com.example.WordCount crunch-demo-1.0-SNAPSHOT-jar-with-dependencies.jar \
hdfs://namenode_host:8020/user/hdfs/input hdfs://namenode_host:8020/user/hdfs/output

Categories: Crunch | Developers | Spark | All Categories

Spark and Hadoop Integration

Cloudera Glossary