Configuring Extraction for Altus Clusters on AWS

Follow the steps below to configure Cloudera Navigator to extract metadata and lineage from single-user transient clusters deployed to Amazon Web Services using Cloudera Altus. The Cloudera Navigator extraction process for clusters launched by Cloudera Altus works as follows:

Any HDFS paths in a job, query, or data entity are extracted as proxy entities for the path, similar to how Hive entities are extracted. That means that HDFS is not bulk extracted from an Altus cluster.
Hive Metastore (HMS) entities are also not bulk extracted. Cloudera Navigator extracts Hive entities used in queries that generate lineage, such as databases, tables, and so on.

Continue reading:

Requirements
Obtaining AWS Credentials for the Amazon S3 Bucket
Cloudera Altus Configuration
Cloudera Navigator Configuration

Requirements

Cloudera Navigator collects metadata and lineage entities from transient clusters deployed to AWS by Cloudera Altus users. The metadata and lineage data is not collected directly from the transient clusters but rather from an Amazon S3 bucket that serves as the storage mechanism for the Telemetry Publisher running in the cluster (see How it Works: Background to the Setup Tasks for details).

As detailed in the steps below, successful integration of Cloudera Navigator and Cloudera Altus clusters depends on correctly:

Identifying the Amazon S3 bucket, and
Configuring correct access permissions to it from both Cloudera Altus and Cloudera Navigator. Transient clusters instantiated by Altus users must have read and write permissions to the Amazon S3 bucket used by Telemetry Publisher. The on-premises centralized Cloudera Navigator instance must have read permissions on the same Amazon S3 bucket.

—Altus user account (with cross-account privileges to AWS account). Privileges to launch EC2 clusters and use other AWS resources, including Amazon S3 buckets identified in the Environment (data input, data output, logs, and the bucket for Telemetry Publisher).

— Read and write privileges to the Amazon S3 bucket configured in the Environment for the Altus user.

-- AWS access key ID and AWS secret key for the AWS account associated with the Amazon S3 bucket.

The steps below assume that you have:

An Amazon Web Services account.
A Cloudera Altus account.
A Cloudera Altus user account that can run jobs on transient clusters deployed to AWS.
Access to the on-premises or persistent Cloudera Manager cluster running Cloudera Navigator. The Cloudera Manager user role of Full Administrator and the ability to log in to the Cloudera Manager Admin Console is required.
AWS Credentials for the AWS account hosting the Amazon S3 bucket that serves as the storage mechanism for metadata and lineage data from clusters on AWS launched by Cloudera Altus.

Obtaining AWS Credentials for the Amazon S3 Bucket

AWS Credentials are available to be downloaded whenever you create an IAM user account through the AWS Management Console. If you are configuring an existing Amazon S3 bucket and you do not have the AWS Credentials for it, you can generate new AWS Credentials from the AWS account using either the AWS Management Console or the AWS CLI.

Generating new AWS Credentials deactivates any previously issued credentials and makes the newly generated credentials Active for the AWS account. Keep that in mind if you obtain new AWS Credentials to use for the Cloudera Navigator-Cloudera Altus integration.

These steps assume you have an AWS account and that an Amazon S3 bucket exists on that account that you want to use as the storage location for metadata and lineage.

Log in to the AWS Management Console using the account associated with the Amazon S3 bucket.
Navigate to the Security credentials section of the Users page in IAM for this account. For example:
Click the Create access key button to generate new AWS Credentials. Extract the credentials (the Access Key Id and Secret Key) from the user interface or download the credentials.csv for later use.

New credentials can be created by using the AWS CLI rather than the AWS Management Console. See Amazon documentation for details.

Cloudera Altus Configuration

Cloudera Altus instantiates single-user transient clusters focused on data engineering workloads that use compute services such as Hive, Impala, or MapReduce2. The typical deployment scenario involves running scripts that invoke the Cloudera Altus CLI to instantiate the cluster, in this case, using Amazon Web Services according to the details specified in the Altus Environment. An Altus environment specifies all resources needed by the cluster, including the AWS account that will be used to instantiate the cluster. The Cloudera Altus user account is configured to provide cross-account access to the AWS account that has permissions to launch AWS Elastic Compute Cloud (EC2) instances and use other AWS resources, including Amazon S3 buckets.

Although the Altus Environment can be created using Quickstart, Cloudera recommends using the Environment Wizard or the Altus CLI instead. The Wizard provides better control over configuring resources, including letting you specify the Amazon S3 bucket that clusters will use to store metadata and lineage information for collection by Cloudera Navigator. Specifically, the Resources & Policies page of the Configuration Wizard lets you enable integration with Cloudera Navigator and specify the Amazon S3 bucket that will hold collected metadata and lineage information.

On the Resources & Policies page of the Configuration Wizard, under the EC2 Instance Settings:

Click the Enable Cloudera Navigator checkbox to enable the feature.
In the Cloudera Navigator S3 Data Bucket field, enter the path to the Amazon S3 bucket, including the final /, which identifies the target as an S3 bucket. For example:
```
s3a://cluster-lab.example.com/cust-input/
```

To provide the correct access to the S3 bucket, you must also create the appropriate policy in the AWS Management Console and apply the policy to the Amazon S3 bucket. See Cloudera Altus documentation for details.

Cloudera Navigator Configuration

The Cloudera Navigator runs in the context of Cloudera Manager Server. Its two role instances, the Navigator Audit Server and Navigator Metadata Server, run on the Cloudera Management Service. The Navigator Metadata Server role instance is the component that extracts metadata and lineage from the Amazon S3 bucket using the AWS Credentials configured for connectivity in the steps below:

Follow the steps in Adding AWS Credentials and Configuring Connectivity to add new or regenerated AWS Credentials to the Cloudera Manager Server and then configure connectivity.
Follow the steps in Configuring Connectivity for AWS Credentials to configure connectivity for AWS Credentials that are already available to be used for the Amazon S3 bucket but have not yet been configured for connectivity.

Adding AWS Credentials and Configuring Connectivity

Cloudera Manager Required Role: Full Administrator

The AWS Credentials must be added to the Cloudera Manager Server for use by Cloudera Navigator. These credentials must be from the AWS account hosting the Amazon S3 bucket that is configured in the Altus environment.

Log in to the Cloudera Manager Admin Console.
Select Administration > AWS Credentials.
Click the Add Access Key Credentials button on the AWS Credentials page.
1. Enter a meaningful name for the AWS Credential, such as the type of jobs the associated clusters will run (for example, etl-processing). This name is for your own information and is not checked against any Cloudera Altus or AWS attributes.
2. Enter the AWS Access Key ID and the AWS Secret Key.
Click Add to save the credentials. The S3Guard option page displays, reflecting the credential name (for example, Edit S3Guard: etl-processing). Disregard this option.
Click Save. The Connect to Amazon Web Services page displays, showing the options available for this specific AWS credential.
Click Enable Metadata and Lineage Extraction from Cloudera Altus. The Metadata and Lineage Extraction Configuration setting page displays a field for specifying the Amazon S3 bucket name.
Enter the name of the Amazon S3 bucket. For example:
Click OK. The AWS Credentials page re-displays, and the newly added AWS Credential is listed with any other AWS Credentials held by the Cloudera Manager Server.
Restart the Cloudera Management Service.

When the service restarts, the AWS credentials will be used by Cloudera Navigator to authenticate to the AWS account and extract metadata and lineage stored on the specified S3 bucket.

Configuring Connectivity for AWS Credentials

If the AWS Credentials are available for the Amazon S3 bucket, you can configure them as follows:

Log in to the Cloudera Manager Admin Console.
Select Administration > AWS Credentials.
Find the available AWS Credentials that provide access to the Amazon S3 bucket used to collect metadata and lineage from transient clusters.
Click the Actions drop-down menu and select Edit Connectivity. The Connect to Amazon Web Services page displays the three sections of possible configurations.
In the Cloudera Navigator section, click the Enable Metadata and Lineage Extraction from Cloudera Altus link. The Metadata and Lineage Extraction Configuration page displays.
Enter the name of the Amazon S3 bucket in the S3 Bucket Name field.
Click OK.
Restart the Cloudera Management Service.

This completes the setup process. After the restart, metadata and lineage for transient clusters deployed using Cloudera Altus should be available in the Cloudera Navigator console.

Technical metadata specific to clusters deployed using Altus include the following property names and types:

Cluster (Source Type)
Cluster-name (Cluster Group)
Transient (Deployment Type)
Cluster Template, Cluster Instance (Classname)

For example:

See Search: Syntax and Properties Reference for more information.

Using Cloudera Navigator with Altus Clusters

Configuring Extraction for Amazon S3