Viewing Lineage Information for Impala Data
Lineage is a feature in the Cloudera Navigator data management component that helps you track where data originated, and how data propagates through the system by way of SQL statements such as SELECT, INSERT, and CREATE TABLE AS SELECT. Impala is covered by the Cloudera Navigator lineage features in CDH 5.4 / Impala 2.2 and higher.
This type of tracking is important in high-security configurations, especially in highly regulated industries such as healthcare, pharmaceuticals, financial services, and intelligence. For this kind of sensitive data, it is important to know all the places in the system that contain that data or other data derived from it; to verify who has accessed that data; and to be able to double-check that the data used to make a decision was processed correctly and not tampered with.
You interact with this feature through lineage diagrams showing relationships between tables and columns. For instructions about interpreting lineage diagrams, see Cloudera Navigator Lineage Diagram Reference.
Column Lineage
Column lineage tracks information in fine detail, at the level of particular columns rather than entire tables.
For example, if you have a table with information derived from web logs, you might copy that data into other tables as part of the ETL process. The ETL operations might involve transformations through expressions and function calls, and rearranging the columns into more or fewer tables (normalizing or denormalizing the data). Then for reporting, you might issue queries against multiple tables and views. In this example, column lineage helps you determine that data that entered the system as RAW_LOGS.FIELD1 was then turned into WEBSITE_REPORTS.IP_ADDRESS through an INSERT ... SELECT statement. Or, conversely, you could start with a reporting query against a view, and trace the origin of the data in a field such as TOP_10_VISITORS.USER_ID back to the underlying table and even further back to the point where the data was first loaded into Impala.
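The backward trace described above can be pictured as a walk over a graph of column-to-column dependencies. The following Python sketch is only an illustration of that idea, not how Cloudera Navigator stores or queries lineage. The names RAW_LOGS.FIELD1, WEBSITE_REPORTS.IP_ADDRESS, and TOP_10_VISITORS.USER_ID come from the example; RAW_LOGS.FIELD2, WEBSITE_REPORTS.USER_ID, the DERIVED_FROM mapping, and the trace_origins function are hypothetical.

```python
from collections import deque

# Hypothetical column-level lineage: each derived column maps to the columns
# it was computed from. RAW_LOGS.FIELD2 and WEBSITE_REPORTS.USER_ID are
# invented for this sketch; they do not appear in the documentation.
DERIVED_FROM = {
    "WEBSITE_REPORTS.IP_ADDRESS": ["RAW_LOGS.FIELD1"],
    "WEBSITE_REPORTS.USER_ID": ["RAW_LOGS.FIELD2"],
    "TOP_10_VISITORS.USER_ID": ["WEBSITE_REPORTS.USER_ID"],
}


def trace_origins(column, derived_from=DERIVED_FROM):
    """Return the columns with no recorded parents that feed `column`."""
    origins = set()
    seen = {column}
    queue = deque([column])
    while queue:
        current = queue.popleft()
        parents = derived_from.get(current, [])
        if not parents:
            # Nothing upstream is recorded, so this is where the data entered.
            origins.add(current)
            continue
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return origins
```

Under these assumed edges, calling trace_origins("TOP_10_VISITORS.USER_ID") walks from the view column through WEBSITE_REPORTS.USER_ID back to the raw log column that first held the data.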
When you have tables where you need to track or control access to sensitive information at the column level, see Enabling Sentry Authorization for Impala for how to implement column-level security. You set up authorization using the Sentry framework, create views that refer to specific sets of columns, and then assign authorization privileges to those views rather than the underlying tables.
Lineage Data for Impala
The lineage feature is enabled by default. When lineage logging is enabled, the serialized column lineage graph is computed for each query and stored in a specialized log file in JSON format.
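Because the lineage graph is stored as JSON, a script can in principle read it. The Python sketch below shows one way that might look. It assumes each log line holds one JSON object with a "vertices" list (entries with "id" and "vertexId") and an "edges" list (entries with "sources" and "targets"). These field names, the sample record, and the column_dependencies function are assumptions made for illustration; check them against a real log file from your Impala version before relying on them.

```python
import json

# Invented sample record. The field names are assumptions about the log
# layout, not a documented schema.
SAMPLE_LINE = json.dumps({
    "queryText": "insert into website_reports select field1 from raw_logs",
    "user": "etl_user",
    "vertices": [
        {"id": 0, "vertexType": "COLUMN",
         "vertexId": "default.website_reports.ip_address"},
        {"id": 1, "vertexType": "COLUMN",
         "vertexId": "default.raw_logs.field1"},
    ],
    "edges": [{"sources": [1], "targets": [0], "edgeType": "PROJECTION"}],
})


def column_dependencies(line):
    """Return (source_column, target_column) pairs from one lineage record."""
    record = json.loads(line)
    # Map numeric vertex ids to their readable column names.
    names = {v["id"]: v["vertexId"] for v in record.get("vertices", [])}
    pairs = []
    for edge in record.get("edges", []):
        for src in edge.get("sources", []):
            for dst in edge.get("targets", []):
                pairs.append((names[src], names[dst]))
    return pairs
```

For the sample record, column_dependencies(SAMPLE_LINE) yields a single pair linking default.raw_logs.field1 to default.website_reports.ip_address.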
Impala records queries in the lineage log if they complete successfully, or fail due to authorization errors. For write operations such as INSERT and CREATE TABLE AS SELECT, the statement is recorded in the lineage log only if it successfully completes. Therefore, the lineage feature tracks data that was accessed by successful queries, and data that unsuccessful queries attempted to access before being blocked by an authorization failure. These kinds of queries represent data that really was accessed, or where the attempted access could represent malicious activity.
Impala does not record queries in the lineage log if they fail due to syntax errors, or if they fail or are cancelled before they reach the stage of requesting rows from the result set.
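The recording rules in the two preceding paragraphs can be summarized as a small decision function. This Python sketch restates the prose literally and is not taken from the Impala source code. The Outcome enumeration and the is_recorded function are hypothetical names, and the treatment of a write statement that fails authorization follows the statement that writes are recorded only when they complete successfully.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical labels for how a statement ended."""
    SUCCEEDED = "succeeded"
    AUTHORIZATION_ERROR = "authorization_error"
    SYNTAX_ERROR = "syntax_error"
    FAILED_BEFORE_FETCH = "failed_before_fetch"
    CANCELLED_BEFORE_FETCH = "cancelled_before_fetch"


def is_recorded(outcome, is_write):
    """Return True if the prose rules say the statement reaches the lineage log."""
    if outcome is Outcome.SUCCEEDED:
        return True
    if outcome is Outcome.AUTHORIZATION_ERROR:
        # Writes (INSERT, CREATE TABLE AS SELECT) are logged only on success.
        return not is_write
    # Syntax errors, and failures or cancellations before rows are requested.
    return False
```

For example, under this reading, is_recorded(Outcome.AUTHORIZATION_ERROR, is_write=False) is True for a blocked SELECT, while is_recorded(Outcome.SYNTAX_ERROR, is_write=False) is False.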
To enable or disable this feature on a system not managed by Cloudera Manager, set or remove the -lineage_event_log_dir configuration option for the impalad daemon. For information about turning the lineage feature on and off through Cloudera Manager, see Managing Hive and Impala Lineage Properties.