Snappy Compression
Snappy is supported for all CDH components. How you specify compression depends on the component.
Using Snappy with HBase
If you install Hadoop and HBase from RPM or Debian packages, Snappy requires no HBase configuration.
Using Snappy with Hive or Impala
To enable Snappy compression for Hive output when creating SequenceFile outputs, use the following settings:
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
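As an illustration, the settings above can be applied in a Hive session before writing a SequenceFile table. This is a minimal sketch; the table names logs_text and logs_seq and the column are hypothetical, not part of the original documentation:

```sql
-- Enable Snappy block compression for this session's output.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;

-- Hypothetical destination table stored as SequenceFile.
CREATE TABLE logs_seq (line STRING)
STORED AS SEQUENCEFILE;

-- The INSERT writes Snappy-compressed SequenceFile blocks.
INSERT OVERWRITE TABLE logs_seq
SELECT line FROM logs_text;
```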
For information about configuring Snappy compression for Parquet files with Hive, see Using Parquet Tables in Hive. For information about using Snappy compression for Parquet files with Impala, see Snappy and GZip Compression for Parquet Data Files in the Impala Guide.
Using Snappy with MapReduce
Enabling MapReduce intermediate compression can make jobs run faster without requiring application changes. Only the temporary intermediate files created by Hadoop for the shuffle phase are compressed; the final output may or may not be compressed. Snappy is ideal in this case because it compresses and decompresses very quickly compared to other compression algorithms, such as Gzip. For information about choosing a compression format, see Choosing and Configuring Data Compression.
To enable Snappy for MapReduce intermediate compression for the whole cluster, set the following properties in mapred-site.xml:
- MRv1
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
- YARN
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
You can also set these properties on a per-job basis.
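For example, a per-job setting can be passed on the command line with -D. This is a sketch, not a definitive invocation: the jar name, driver class, and input/output paths are hypothetical, and it assumes the driver uses ToolRunner so that generic options are parsed (YARN property names shown):

```shell
# Hypothetical job; -D options apply only to this run.
hadoop jar wordcount.jar WordCount \
  -D mapreduce.map.output.compress=true \
  -D mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
  input output
```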
Use the properties in the following table to compress the final output of a MapReduce job. These are usually set on a per-job basis.
MRv1 Property | YARN Property | Description
---|---|---
mapred.output.compress | mapreduce.output.fileoutputformat.compress | Whether to compress the final job outputs (true or false).
mapred.output.compression.codec | mapreduce.output.fileoutputformat.compress.codec | If the final job outputs are to be compressed, the codec to use. Set to org.apache.hadoop.io.compress.SnappyCodec for Snappy compression.
mapred.output.compression.type | mapreduce.output.fileoutputformat.compress.type | For SequenceFile outputs, the type of compression to use (NONE, RECORD, or BLOCK). Cloudera recommends BLOCK.
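The output-compression properties above can likewise be set per job with -D. A minimal sketch, assuming a hypothetical driver (myjob.jar, MyJob) that uses ToolRunner, with illustrative paths (YARN property names shown):

```shell
# Hypothetical job that writes Snappy-compressed SequenceFile output.
hadoop jar myjob.jar MyJob \
  -D mapreduce.output.fileoutputformat.compress=true \
  -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
  -D mapreduce.output.fileoutputformat.compress.type=BLOCK \
  input output
```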
Using Snappy with Pig
Set the same properties for Pig as for MapReduce.
Using Snappy with Spark SQL
To enable Snappy compression for Parquet output in Spark SQL, set the following configuration:
sqlContext.setConf("spark.sql.parquet.compression.codec","snappy")
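In context, the setting applies to subsequent Parquet writes. A minimal sketch, assuming df is an existing DataFrame and an illustrative output path:

```scala
// Configure Snappy as the Parquet compression codec for this session.
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

// Hypothetical DataFrame write; the Parquet files are Snappy-compressed.
df.write.parquet("/user/example/output_snappy")
```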
Using Snappy Compression with Sqoop 1 and Sqoop 2 Imports
- Sqoop 1 - On the command line, use the following option to enable Snappy compression:
--compression-codec org.apache.hadoop.io.compress.SnappyCodec
Cloudera recommends using the --as-sequencefile option with this compression option.
- Sqoop 2 - When you create a job (sqoop:000> create job), choose 7 (SNAPPY) as the compression format.
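For Sqoop 1, a complete import command might look like the following. This is a sketch only: the JDBC connection string, database, and table name are hypothetical, and credentials are omitted:

```shell
# Hypothetical Sqoop 1 import writing Snappy-compressed SequenceFiles.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --table orders \
  --as-sequencefile \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec
```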