Tuning Impala for Performance
The following sections explain the factors affecting the performance of Impala features, and procedures for tuning, monitoring, and benchmarking Impala queries and other SQL operations.
This section also describes techniques for maximizing Impala scalability. Scalability is tied to performance: it means that performance remains high as the system workload increases. For example, reducing the disk I/O performed by a query can speed up an individual query, and at the same time improve scalability by making it practical to run more queries simultaneously. Sometimes, an optimization technique improves scalability more than performance. For example, reducing memory usage for a query might not change the query performance much, but might improve scalability by allowing more Impala queries or other kinds of jobs to run at the same time without running out of memory.
- Partitioning for Impala Tables. This technique physically divides the data based on the different values in frequently queried columns, allowing queries to skip reading a large percentage of the data in a table.
- Performance Considerations for Join Queries. Joins are the main class of queries that you can tune at the SQL level, as opposed to changing physical factors such as the file format or the hardware configuration. The related topics Overview of Column Statistics and Overview of Table Statistics matter mainly because statistics drive join performance.
- Overview of Table Statistics and Overview of Column Statistics. Gathering table and column statistics with the COMPUTE STATS statement helps Impala automatically optimize join queries, without requiring changes to the SQL statements. (This process is greatly simplified in Impala 1.2.2 and higher, because the COMPUTE STATS statement gathers both kinds of statistics in one operation and does not require the setup and configuration that was previously necessary for the ANALYZE TABLE statement in Hive.)
- Testing Impala Performance. Do some post-setup testing to ensure Impala is using optimal settings for performance, before conducting any benchmark tests.
- Benchmarking Impala Queries. The configuration and sample data that you use for initial experiments with Impala are often not appropriate for doing performance tests.
- Controlling Impala Resource Usage. The more memory Impala can utilize, the better query performance you can expect. In a cluster running other kinds of workloads as well, you must make tradeoffs to make sure all Hadoop components have enough memory to perform well, so you might cap the memory that Impala can use.
- Using Impala with the Amazon S3 Filesystem. Queries against data stored in the Amazon Simple Storage Service (S3) have different performance characteristics than queries against data stored in HDFS.
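As a rough illustration of how several of the techniques listed above look in practice, the following sketch creates a partitioned table, gathers statistics, inspects a query plan, and caps memory for a session. The table name sales, its columns, the filter values, and the 2g limit are all hypothetical, chosen only for this example; appropriate choices depend on your data and cluster.

```sql
-- Hypothetical table: names, columns, and values are for illustration only.
CREATE TABLE sales (id BIGINT, amount DOUBLE)
  PARTITIONED BY (year INT, month INT)
  STORED AS PARQUET;

-- Gather table and column statistics in one operation
-- (Impala 1.2.2 and higher).
COMPUTE STATS sales;

-- Check which statistics are now available.
SHOW TABLE STATS sales;
SHOW COLUMN STATS sales;

-- A filter on the partition key columns lets Impala skip partitions
-- that cannot match. EXPLAIN shows the plan without running the query.
EXPLAIN SELECT SUM(amount) FROM sales WHERE year = 2014 AND month = 6;

-- Cap the memory a query can use on each node for this session
-- (the value here is arbitrary).
SET MEM_LIMIT=2g;
```

Run COMPUTE STATS again after loading or substantially changing the data, so that the statistics reflect the current contents of the table.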
A good source of tips related to scalability and performance tuning is the Impala Cookbook presentation. These slides are updated periodically as new features come out and new benchmarks are performed.