Using & Managing Apache Hive in CDH
Apache Hive is a powerful data warehousing application for Hadoop. It enables you to access your data using HiveQL, a language similar to SQL.
Hive Roles
- Hive metastore - Provides metastore services when Hive is configured with a remote metastore.
Cloudera recommends using a remote Hive metastore, especially for CDH 4.2 or higher. Because the remote metastore is recommended, Cloudera Manager treats the Hive Metastore Server as a required role for all Hive services. A remote metastore provides the following benefits:
- The Hive metastore database password and JDBC drivers do not need to be shared with every Hive client; only the Hive Metastore Server does. Sharing passwords with many hosts is a security issue.
- You can control activity on the Hive metastore database. To stop all activity on the database, stop the Hive Metastore Server. This makes it easy to back up and upgrade, which require all Hive activity to stop.
For information about configuring a remote Hive metastore database with Cloudera Manager, see Cloudera Manager and Managed Service Datastores. To configure high availability for the Hive metastore, see Configuring Apache Hive Metastore High Availability in CDH.
- HiveServer2 - Enables remote clients to run Hive queries, and supports a Thrift API tailored for JDBC and ODBC clients, Kerberos authentication, and multi-client concurrency. A CLI named Beeline is also included. See HiveServer2 documentation (CDH 4) or HiveServer2 documentation (CDH 5) for more information.
- WebHCat - HCatalog is a table and storage management layer for Hadoop that makes the same table information available to Hive, Pig, MapReduce, and Sqoop. Table definitions are maintained in the Hive metastore, which HCatalog requires. WebHCat allows you to access HCatalog using an HTTP (REST style) interface.
Hive Execution Engines
- Beeline - (Can be set per query) Run the set hive.execution.engine=engine command, where engine is either mr or spark. The default is
mr. For example:
set hive.execution.engine=spark;
To determine the current setting, runset hive.execution.engine;
- Cloudera Manager (Affects all queries, not recommended).
- Go to the Hive service.
- Click the Configuration tab.
- Search for "execution".
- Set the Default Execution Engine property to MapReduce or Spark. The default is MapReduce.
- Click Save Changes to commit the changes.
- Return to the Home page by clicking the Cloudera Manager logo.
- Click to invoke the cluster restart wizard.
- Click Restart Stale Services.
- Click Restart Now.
- Click Finish.
Use Cases for Hive
Because Hive is a petabyte-scale data warehouse system built on the Hadoop platform, it is a good choice for environments experiencing phenomenal growth in data volume. The underlying MapReduce interface with HDFS is hard to program directly, but Hive provides an SQL interface, making it possible to use existing programming skills to perform data preparation.
Hive on MapReduce or Spark is best-suited for batch data preparation or ETL:
-
You must run scheduled batch jobs with very large ETL sorts with joins to prepare data for Hadoop. Most data served to BI users in Impala is prepared by ETL developers using Hive.
-
You run data transfer or conversion jobs that take many hours. With Hive, if a problem occurs partway through such a job, it recovers and continues.
-
You receive or provide data in diverse formats, where the Hive SerDes and variety of UDFs make it convenient to ingest and convert the data. Typically, the final stage of the ETL process with Hive might be to a high-performance, widely supported format such as Parquet.