Tuning Apache Hive Performance on the Amazon S3 Filesystem in CDH

Some of the default behaviors of Apache Hive might degrade performance when reading and writing data to tables stored on Amazon S3. Cloudera has introduced the following enhancements that make using Hive with S3 more efficient.

Tuning Hive Write Performance on S3

In releases lower than CDH 5.10, creating or writing Hive tables or partitions to S3 caused performance issues due to the differences between the HDFS and S3 file systems. This occurred because parallel writes to S3 were not supported, and the S3 file system lacks an efficient move operation. In CDH 5.10, these issues are resolved. For details, see HIVE-14269.

These optimizations enable the final job in the query plan to write data efficiently in parallel to the S3 file system. HiveServer2 then uses a thread pool of workers to transfer the data to the final table location on S3. The default values of these parameters yield good performance for a wide range of workloads. However, you can further tune the parameters to optimize for specific workloads.

Hive S3 Write Performance Tuning Parameters

To improve write performance for Hive tables stored on S3, use Cloudera Manager to set the parameters listed below. See Setting Parameters as Service-Wide Defaults with Cloudera Manager.

Parameter Name Description Settings Default
hive.mv.files.threads

Sets the number of threads used to move files in a move task. Increasing the value of this parameter increases the number of parallel copies that can run on S3.

A separate thread pool is used for each Hive query. When running only a few queries in parallel, you can increase this parameter for greater per-query write throughput. However, when you run a large number of queries in parallel, decrease this parameter to avoid thread exhaustion.

To disable multi-threaded file moves, set this parameter to 0. This can prevent thread contention on HiveServer2.

This parameter also controls renames on HDFS, so increasing this value increases the number of threads responsible for renaming files on HDFS.

Range between: 0 and 40 15
hive.blobstore.use.blobstore. as.scratchdir

When set to true, this parameter enables the use of scratch directories directly on S3.

true | false false

Setting Parameters on a Per-Query Basis with the Hive SET Command

Optimize on a per-query basis by setting these parameters in the query code with the Hive SET command.

For example, to set the thread pool to 20 threads and enable scratch directories on S3:

set hive.mv.files.threads=20
set hive.blobstore.use.blobstore.as.scratchdir=true
          

Setting Parameters as Service-Wide Defaults with Cloudera Manager

Use Cloudera Manager to set hive.mv.files.threads and hive.blobstore.use.blobstore.as.scratchdir as service-wide defaults:

  1. In the Cloudera Manager Admin Console, go to the Hive service.
  2. In the Hive service page, click the Configuration tab.
  3. On the Configuration page, click the HiveServer2 scope.
  4. Click the Performance category.
  5. Search for each parameter to set them.
  6. Click Save Changes.

Tuning the S3A Connector to Improve Hive Write Performance on S3

The fs.s3a parameters are used to tune the S3A Connector inside the Hadoop code base. The S3A Connector configurations control the number of threads used to issue concurrent upload and copy requests. A single instance of the S3A Connector is used with a HiveServer2 instance, so different Hive queries can share the same connector instance. The same thread pool is used to issue upload and copy requests. This means that the fs.s3a parameters cannot be set on a per-query basis. Instead, set them for each HiveServer2 instance. In contrast, the thread pool controlled by hive.mv.files.thread is created for each query separately.

Parameter Name How To Tune
fs.s3a.threads.core Increase the value to increase the number of core threads in the thread pool used to run any data uploads or copies.
fs.s3a.threads.max Increase the value to increase the maximum number of concurrent active partition uploads and copies, which each use a thread from the thread pool.
fs.s3a.max.total.tasks Increase the value to increase the number of partition uploads and copies allowed to the queue before rejecting additional uploads.
fs.s3a.connection.maximum

Increase the value to increase the maximum number of simultaneous connections to S3. Cloudera recommends setting this value to 1500.

Setting S3A Connector Parameters as Service-Wide Defaults

Use Cloudera Manager to set the S3A Connector parameters as service-wide defaults for Hive:

  1. In the Cloudera Manager Admin Console, go to the Hive service.
  2. In the Hive service page, click the Configuration tab.
  3. On the Configuration page, click the HiveServer2 scope.
  4. Click the Advanced category.
  5. Search for the HiveServer2 Advanced Configuration Snippet (Safety Valve) for hive-site.xml configuration setting and click the plus sign to add parameters.
  6. For each fs.s3a parameter, type the parameter name into the Name field and the value in the Value field.
  7. Click Save Changes.

Known Limitations

  1. If you have a large number of concurrent Hive query operations running, a deadlock might occur in the S3AFileSystem class of the Hadoop platform. This is caused by thread pool limits and causes HiveServer2 to freeze. If this occurs, you must restart HiveServer2. To work around the issue, increase the values of fs.s3a.threads.core and fs.s3a.threads.max. See HADOOP-13826.

    This behavior might occur more frequently if fs.s3a.blocking.executor.enabled is set to true. This parameter is turned off by default in CDH.

  2. S3 is an eventually consistent storage system. See the S3 documentation. This eventual consistency affects Hive behavior on S3 and, in rare cases, can cause intermittent failures. Retrying the failed query usually works around the issue.

Tuning Hive Dynamic Partitioning Performance on S3

Dynamic partitioning is a Hive feature that enables dynamic insertions of data into partitions based on the value of a column in a record. It is useful for bulk creating or updating partitions. Prior to CDH 5.11, performance of Hive queries that performed dynamic partitioning on S3 was diminished because partitions were loaded into the target table one at a time. CDH 5.11 optimizations change the underlying logic so that partitions are loaded in parallel.

Use the following parameter to tune performance on a wide range of workloads that use dynamic partitioning. This parameter can be set with Cloudera Manager at the service level or on a per-query basis using the Hive SET command. See Setting the Hive Dynamic Partition Loading Parameter as a Service-Wide Default with Cloudera Manager.

Parameter Name Description Settings Default
hive.load.dynamic.partitions.thread

Sets the number of threads used to load dynamically generated partitions.

Loading dynamically generated partitions requires renaming the files to their destination location and updating the new partition metadata.

Increasing the value set for this parameter can improve performance when you have several hundred dynamically generated partitions.

Range between: 0 and 25 15

Tuning Tips

Increase the value set for hive.load.dynamic.partitions.thread to improve dynamic partitioning query performance on S3. However, do not set this parameter to values exceeding 25 to avoid placing an excessive load on S3, which can lead to throttling issues.

Setting the Hive Dynamic Partition Loading Parameter on a Per-Query Basis

Optimize dynamic partitioning at the session level by using the Hive SET command in the query code.

For example, to set the thread pool to 25 threads:

set hive.load.dynamic.partitions.thread=25

Setting the Hive Dynamic Partition Loading Parameter as a Service-Wide Default with Cloudera Manager

Use Cloudera Manager to set hive.load.dynamic.partitions.thread as a service-wide default:

  1. In the Cloudera Manager Admin Console, go to the Hive service.
  2. In the Hive service page, click the Configuration tab.
  3. On the Configuration page, click the HiveServer2 scope.
  4. Click the Performance category.
  5. Search for Load Dynamic Partitions Thread Count and enter the value you want to set as a service-wide default.
  6. Click Save Changes.

Tuning Hive INSERT OVERWRITE Performance on S3

INSERT OVERWRITE queries write data to a specific table or partition, overwriting any existing data. When Hive detects existing data in the target directory, it moves the existing data to the HDFS trash directory. Moving data to the trash directory can significantly degrade performance when it is run on S3. In CDH 5.11, an optimization is added to move data to the trash directory in parallel by using the following parameter. Use Cloudera Manager to set this parameter as a service-wide default or use the Hive SET command to set the parameter on a per-query basis. See Setting the Hive INSERT OVERWRITE Performance Tuning Parameter as a Service-Wide Default with Cloudera Manager.

Parameter Name Description Settings Default
hive.mv.files.thread

Set this parameter to control the number of threads used to delete existing data in the HDFS trash directory for INSERT OVERWRITE queries.

Range between: 0 and 40 15

Tuning Tips

The hive.mv.files.thread parameter can be tuned for INSERT OVERWRITE performance in the same way it is tuned for write performance. See Hive S3 Write Performance Tuning Parameters.

If setting the above parameter does not produce acceptable results, you can disable the HDFS trash feature by setting the fs.trash.interval to 0 on the HDFS service. In Cloudera Manager, choose HDFS > Configuration > NameNode > Main and set Filesystem Trash Interval to 0.

Setting the Hive INSERT OVERWRITE Performance Tuning Parameter on a Per-Query Basis

Configure Hive to move data to the HDFS trash directory in parallel for INSERT OVERWRITE queries using the Hive SET command.

For example, to set the thread pool to use 30 threads at a maximum:
set hive.mv.files.thread=30

Setting the Hive INSERT OVERWRITE Performance Tuning Parameter as a Service-Wide Default with Cloudera Manager

Use Cloudera Manager to set hive.mv.files.threads as a service-wide default:

  1. In the Cloudera Manager Admin Console, go to the Hive service.
  2. In the Hive service page, click the Configuration tab.
  3. On the Configuration page, click the HiveServer2 scope.
  4. Click the Performance category.
  5. Search for Move Files Thread Count and enter the value you want to set as a service-wide default.
  6. Click Save Changes.

Tuning Hive Table Partition Read Performance on S3

Prior to CDH 5.11, Hive queries that read over 1,000 partitions stored on S3 experienced performance degradation because metadata operations against S3 are much slower than metadata operations performed against HDFS. When Hive runs a query, it needs to collect metadata about the files and about the directory it is reading from. This metadata includes information such as number of files or file sizes. To collect this metadata, Hive must make calls to S3. Before CDH 5.11, these metadata calls were issued serially (one at a time). In CDH 5.11, the metadata operations have been optimized so that the calls are now issued in parallel. This optimization delivers the most benefit for queries that read from multiple partitions. Benefits for queries that read from non-partitioned tables are less significant.

Use the following parameters to tune Hive table partition read performance on S3. The default values yield good performance gains for a wide range of workloads, but you can further tune them to optimize for specific workloads. These parameters can be set with Cloudera Manager at the service level or on a per-query basis using the Hive SET command. See Setting Hive Table Partition Read Performance Tuning Parameters as Service-Wide Defaults with Cloudera Manager.

Parameter Name Description Settings Default
hive.exec.input.listing.max.threads

Sets the maximum number of threads that Hive uses to list input files. Increasing this value can improve performance when there are many partitions being read.

Range between: 0 and 50 15
mapreduce.input.fileinputformat.list-status.num-threads

Sets the number of threads used by the FileInputFormat class when listing and fetching block locations for the specified input paths.

Range between: 0 and 50 1

Tuning Tips

If listing input files becomes a bottleneck for the Hive query, increase the values for hive.exec.input.listing.max.threads and mapreduce.input.fileinputformat.list-status.num-threads. This bottleneck might occur if the query takes a long time to list input directories or to run split calculations when reading several thousand partitions. However, do not set these parameters to values over 50 to avoid putting excessive load on S3, which might lead to throttling issues.

Setting the Hive Table Partition Read Performance Tuning Parameters on a Per-Query Basis

Configure Hive to perform metadata collection in parallel when reading table partitions on S3 using the Hive SET command.

For example, to set the maximum number of threads that Hive uses to list input files to 20 and the number of threads used by the FileInputFormat class when listing and fetching block locations for input to 5:

set hive.exec.input.listing.max.threads=20
set mapreduce.input.fileinputformat.list-status.num-threads=5

Setting Hive Table Partition Read Performance Tuning Parameters as Service-Wide Defaults with Cloudera Manager

Use Cloudera Manager to set hive.exec.input.listing.max.threads and mapreduce.input.fileinputformat.list-status.num-threads as service-wide defaults.

To set hive.exec.input.listing.max.threads:

  1. In the Cloudera Manager Admin Console, go to the Hive service.
  2. In the Hive service page, click the Configuration tab.
  3. On the Configuration page, click the HiveServer2 scope.
  4. Click the Performance category.
  5. Search for Input Listing Max Threads and enter the value you want to set as a service-wide default.
  6. Click Save Changes.

To set mapreduce.input.fileinputformat.list-status.num-threads:

  1. In the Cloudera Manager Admin Console, go to the MapReduce service.
  2. In the MapReduce service page, click the Configuration tab.
  3. Search for MapReduce Service Advanced Configuration Snippet (Safety Valve) for mapred-site.xml and enter the parameter, value, and description:
    <property>
       <name>mapreduce.input.fileinputformat.list-status.num-threads</name>
       <value>number_of_threads</value>
       <description>Number of threads used to list and fetch block locations for input paths
       specified by FileInputFormat</description>
    </property>
    
  4. Click Save Changes.

Tuning Hive MSCK (Metastore Check) Performance on S3

Running the MSCK command with the REPAIR TABLE option is a simple way to bulk add partitions to Hive tables. See the Apache Language Manual for details about using MSCK REPAIR TABLE. MSCK REPAIR TABLE scans the file system to look for directories that correspond to a partition and then registers them with the Hive metastore. Prior to CDH 5.11, MSCK performance was slower on S3 when compared to HDFS due to the overhead created by collecting metadata on S3. In CDH 5.11, MSCK metadata calls are now issued in parallel, which significantly improves performance.

Use the following parameters to tune Hive MSCK metadata call performance on S3. The default values yield good performance gains for a wide range of workloads, but you can further tune them to optimize for specific workloads. The hive.metastore.fshandler.threads parameter can be set as a service-wide default with Cloudera Manager, but cannot be set at the session level. The hive.msck.repair.batch.size parameter can be set with Cloudera Manager at the service level or on a per-query basis using the Hive SET command. See Setting the Hive MSCK REPAIR TABLE Tuning Parameters as Service-Wide Defaults with Cloudera Manager.

Parameter Name Description Settings Default
hive.metastore.fshandler.threads

Sets the number of threads that the Hive metastore uses when adding partitions in bulk to the metastore. Each thread performs metadata operations for each partition added, such as collecting statistics for the partition or checking if the partition directory exists.

This parameter is also used to control the size of the thread pool that is used by MSCK whe4n it scans the file system looking for directories that correspond to table partitions. Each thread performs a list status on each possible partition directory.

Range between: 0 and 30 15
hive.msck.repair.batch.size

Sets the number of partition objects sent per batch from the HiveServe2 service to the Hive metastore service with the MSCK REPAIR TABLE command. If this parameter is set to a value higher than zero, new partition information is sent from HiveServer2 to the Hive metastore in batches. Sending this information in batches improves how memory is used in the metastore, avoiding client read timeout exceptions. If this parameter is set to 0, all partition information is sent at once in a single Thrift call.

Range between: 0 and 2,147,483,647 0

Tuning Tips

The hive.metastore.fshandler.threads parameter can be increased if the MSCK REPAIR TABLE command is taking excessive time to scan S3 for potential partitions to add. Do not set this parameter to a value higher than 30 to avoid putting excessive load on S3, which can lead to throttling issues.

Increase the value set for the hive.msck.repair.batch.size parameter if you receive the following exception:

SocketTimeoutException: Read timed out

This exception is thrown by HiveServer2 when a metastore operation takes longer to complete than the time specified for the hive.metastore.client.socket.timeout parameter. If you simply increase the timeout, it must be set across all metastore operations and requires restarting the metastore service. It is preferable to increase the value set for hive.msck.repair.batch.size, which specifies the number of partition objects that are added to the metastore at one time. Increasing hive.msck.repair.batch.size to 3000 can help mitigate timeout exceptions returned when running MSCK commands. Set to a lower value if you have multiple MSCK commands running in parallel.

Setting hive.msck.repair.batch.size on a Per-Query Basis

Use the Hive SET command to specify how many partition objects are sent per batch from the HiveServer2 service to the Hive metastore service at the session level.

For example, to specify that batches containing 3,000 partition objects each are sent:

set hive.msck.repair.batch.size=3000

Setting the Hive MSCK REPAIR TABLE Tuning Parameters as Service-Wide Defaults with Cloudera Manager

Use Cloudera Manager to set the hive.metastore.fshandler.threads and the hive.msck.repair.batch.size parameters as service-wide defaults:

  1. In the Cloudera Manager Admin Console, go to the Hive service.
  2. In the Hive service page, click the Configuration tab.
  3. On the Configuration page, search for each parameter to set them.
  4. Click Save Changes.