Configuring and Managing Extraction
Entities extracted from cluster services, and the metadata applied to them, support Cloudera Navigator features such as tracing entities to their source in lineage diagrams and searching for specific entities. Extraction is enabled by default for some services, such as Spark, and must be specifically enabled for others. Extraction consumes computing resources, such as memory and storage, so administrators may want to disable extraction for some services entirely, or configure more selective extraction for specific services.
In addition to configuring extraction for specific services, you can configure filters that blacklist specific HDFS paths and remove them from the extraction process. Filtering speeds up extraction, cuts down on indexing time, and reduces the amount of storage consumed by the datadir. Filters can also be configured to blacklist or whitelist Amazon S3 buckets.
Cloudera Manager Required Role: Navigator Administrator (or Full Administrator)
Enabling Hive Metadata Extraction in a Secure Cluster
- Hive Metastore Access Control and Proxy User Groups Override (inherits from Hive Proxy User Groups when left empty, which is the default)
- Hive Proxy User Groups
- Log in to Cloudera Manager Admin Console.
- Select .
- Click the Configuration tab.
- Select Proxy for the Category filter.
- Add hue to the Hive Metastore Access Control and Proxy User Groups Override list if necessary:
- Click the plus icon to open an entry field to add a row to the property.
- Type hue in the entry field.
- Click Save Changes, and repeat the process to update the Hive Proxy User Groups property for the HDFS service:
- Select .
- Click the Configuration tab.
- Select Proxy for the Category filter.
- Add hue to the Hive Proxy User Groups list by adding a row and typing hue in the entry field.
- Click Save Changes.
- Restart the Hive service.
Disabling Spark Metadata Extraction
- Log in to Cloudera Manager Admin Console.
- Select .
- Click the Configuration tab.
- Select Cloudera Navigator for the Category filter. The current state of the Enable Lineage Collection property is displayed.
- To disable lineage collection, clear the Enable Lineage Collection checkbox.
Removing Manually Enabled Lineage Collection Property
Prior to Cloudera Navigator 2.10 (Cloudera Manager 5.11), enabling lineage collection from Spark required setting a safety valve. If the cluster was upgraded from a previous release of Cloudera Navigator and an Advanced Configuration Snippet (Safety Valve) was used to enable lineage, you must remove that snippet to avoid conflict with the new Enable Lineage Collection property. Using a safety valve to enable Spark metadata extraction has been deprecated.
- Log in to the Cloudera Manager Admin Console.
- Select .
- Click the Configuration tab.
- Select Navigator Metadata Server for the Scope filter.
- Select Advanced for the Category filter.
- Scroll to the Navigator Metadata Server Advanced Configuration Snippet (Safety Valve) for cloudera-navigator.properties setting.
- Remove any deprecated setting for Spark extraction, such as:
nav.spark.extraction.enable
- Click Save Changes.
- Restart the Navigator Metadata Server role.
Filtering File System Metadata
You can exclude some file system paths from the information tracked in Cloudera Navigator. Cloudera Manager provides a blacklist where you can specify file system paths to filter out of the metadata extracted from HDFS and S3.
To filter file system paths from tracked metadata:
- Log in to Cloudera Manager Admin Console.
- Select .
- Click the Configuration tab.
- Select Navigator Metadata Server for the Scope filter.
- Select Extractor Filter for the Category filter.
- Enable the filter:
- HDFS Filter Enable
- S3 Filter Enable
- In the appropriate filter list, include the file system path that you want to exclude from Navigator Metadata Server tracking:
- HDFS Filter Blacklist
- S3 Filter List
The entry can be a specific path or a Java regular expression specifying a path. For example, to specify a directory and all of its subdirectories, use an expression such as:
/path/to/dir(?:/.*)?
- Enter additional entries in the filter list by clicking the plus icon to open another entry.
- For S3, set the S3 Filter Default Action to DISCARD.
- Click Save Changes.
- Click the Instances tab.
- Restart the role.
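The directory expression used in the filter steps above, /path/to/dir(?:/.*)?, can be checked locally before it is entered in Cloudera Manager. The following is a minimal sketch, assuming the filter compares the expression against the entire path (a full match). The class name, method name, and sample paths are hypothetical and are not part of Cloudera Navigator.

```java
import java.util.regex.Pattern;

public class PathFilterSketch {
    // Expression taken from the text: the directory itself, optionally
    // followed by "/" and anything beneath it.
    static final Pattern BLACKLIST = Pattern.compile("/path/to/dir(?:/.*)?");

    // Assumption: the whole path must match the expression, which is what
    // Matcher.matches() checks (as opposed to find(), which allows substrings).
    static boolean isExcluded(String path) {
        return BLACKLIST.matcher(path).matches();
    }

    public static void main(String[] args) {
        // Hypothetical sample paths used only for illustration.
        String[] samples = {
            "/path/to/dir",
            "/path/to/dir/sub/file.txt",
            "/path/to/dir2",
            "/path/to"
        };
        for (String path : samples) {
            System.out.println(path + " -> " + (isExcluded(path) ? "excluded" : "kept"));
        }
    }
}
```

Under the full-match assumption, the sibling directory /path/to/dir2 is kept, because the optional group only accepts a "/" immediately after dir.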
- Normal operation
(num_nav_elements * 200 bytes) + 2 GB
- Upgrade between CM 5.9 and 5.10
((num_nav_elements + num_nav_relations) * 200 bytes) + 2 GB
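As a worked example of the two sizing formulas above, the following sketch evaluates them for sample counts. The element and relation counts are made-up values, the class and method names are hypothetical, and 2 GB is assumed to mean 2 x 1024^3 bytes.

```java
public class SizingEstimateSketch {
    static final long BYTES_PER_ITEM = 200L;
    // Assumption: "2 GB" is interpreted as binary gigabytes.
    static final long BASE_BYTES = 2L * 1024 * 1024 * 1024;

    // (num_nav_elements * 200 bytes) + 2 GB
    static long normalOperationBytes(long numNavElements) {
        return numNavElements * BYTES_PER_ITEM + BASE_BYTES;
    }

    // ((num_nav_elements + num_nav_relations) * 200 bytes) + 2 GB
    static long upgradeBytes(long numNavElements, long numNavRelations) {
        return (numNavElements + numNavRelations) * BYTES_PER_ITEM + BASE_BYTES;
    }

    public static void main(String[] args) {
        long elements = 10_000_000L;   // hypothetical count
        long relations = 20_000_000L;  // hypothetical count
        System.out.println("Normal operation: " + normalOperationBytes(elements) + " bytes");
        System.out.println("Upgrade: " + upgradeBytes(elements, relations) + " bytes");
    }
}
```

With these sample counts, the normal-operation estimate is 4,147,483,648 bytes and the upgrade estimate is 8,147,483,648 bytes.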
Editing MapReduce Custom Metadata
You can associate custom metadata, in the form of arbitrary configuration parameters, with MapReduce jobs and job executions. The configuration parameters to be extracted by Cloudera Navigator can be specified statically or dynamically.
- Log in to Cloudera Manager Admin Console.
- Select .
- Click the Configuration tab.
- Select Navigator Metadata Server for the Scope filter.
- Select Advanced for the Category filter.
- Scroll to find the Navigator Metadata Server Advanced Configuration Snippet for cloudera-navigator.properties and enter the custom
- Specify values for the following properties:
- nav.user_defined_properties - A comma-separated list of user-defined property names.
- nav.tags - A comma-separated list of property names that serve as tags. The property nav.tags can point to multiple property names that serve as tags, but each of those property names can only specify a single tag.
- Click Save Changes.
- Click the Instances tab.
- Restart the role.
- In the MapReduce job configuration, set the value of the property names you specified in nav.user_defined_properties and nav.tags.
- Specify one or more of the following properties in a job configuration:
- Job properties (type:OPERATION)
- nav.job.user_defined_properties - A comma-separated list of user-defined property names
- nav.job.tags - A comma-separated list of property names that serve as tags
- Job execution properties (type:OPERATION_EXECUTION)
- nav.jobexec.user_defined_properties - A comma-separated list of user-defined property names
- nav.jobexec.tags - A comma-separated list of property names that serve as tags
- In the MapReduce job configuration, set the value of the property names you specified in step 1.
Setting Properties Dynamically
- Dynamically add the job_tag1 and job_tag2 properties:
conf.set("nav.job.tags", "job_tag1, job_tag2");
- Set the job_tag1 property to onetag:
conf.set("job_tag1", "onetag");
- Set the job_tag2 property to twotag:
conf.set("job_tag2", "twotag");
- Dynamically add the job_exec_tag property:
conf.set("nav.jobexec.tags","job_exec_tag");
- Set the job_exec_tag property to atag:
conf.set("job_exec_tag", "atag");
- Dynamically add the user-defined property key:
conf.set("nav.job.user_defined_properties", "key");
- Set the value of the user-defined key key to value:
conf.set("key", "value");
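The calls above rely on one level of indirection: a list property such as nav.job.tags names other properties, and each named property holds a single value. The following sketch illustrates that lookup without a Hadoop dependency. A plain Map stands in for the job configuration, the resolve helper is hypothetical, and trimming whitespace around names is an assumption. It is not how Cloudera Navigator itself reads the configuration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TagIndirectionSketch {
    // Hypothetical helper: reads the comma-separated property names stored
    // under listKey, then collects the single value held by each name.
    static List<String> resolve(Map<String, String> conf, String listKey) {
        List<String> values = new ArrayList<>();
        String names = conf.get(listKey);
        if (names == null) {
            return values;
        }
        for (String name : names.split(",")) {
            // Assumption: surrounding whitespace in the list is ignored.
            String value = conf.get(name.trim());
            if (value != null) {
                values.add(value);
            }
        }
        return values;
    }

    // Map stand-in holding the same keys and values as the conf.set() calls above.
    static Map<String, String> sampleConf() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("nav.job.tags", "job_tag1, job_tag2");
        conf.put("job_tag1", "onetag");
        conf.put("job_tag2", "twotag");
        conf.put("nav.jobexec.tags", "job_exec_tag");
        conf.put("job_exec_tag", "atag");
        conf.put("nav.job.user_defined_properties", "key");
        conf.put("key", "value");
        return conf;
    }

    public static void main(String[] args) {
        Map<String, String> conf = sampleConf();
        System.out.println(resolve(conf, "nav.job.tags"));                    // [onetag, twotag]
        System.out.println(resolve(conf, "nav.jobexec.tags"));                // [atag]
        System.out.println(resolve(conf, "nav.job.user_defined_properties")); // [value]
    }
}
```

Each named property contributes exactly one value, which matches the rule that a property name listed in a tags property can specify only a single tag.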
Enabling Inputs and Outputs to Display
The Cloudera Navigator console displays a Details page for selected entities. Details include an entity's type and can optionally include table inputs and operation inputs and outputs. The inputs and outputs are not displayed by default because rendering them can slow down the display. To display inputs and outputs in the Details page, set the nav.ui.details_io_enabled property on the Navigator Metadata Server to true, as follows:
- Log in to the Cloudera Manager Admin Console.
- Select .
- Click the Configuration tab.
- Select Navigator Metadata Server for the Scope filter.
- Select Advanced for the Category filter.
- In the Navigator Metadata Server Advanced Configuration Snippet (Safety Valve) for cloudera-navigator.properties, enter the following:
nav.ui.details_io_enabled=true
- Click Save Changes.
- Restart the Navigator Metadata Server role.