Asset filtering use case for Statistics Collector Profiler

Using the allow and deny lists in Asset filtering, you can create a profiler job that profiles only relevant production tables, saving significant cluster resources by excluding development and temporary tables.

Scenario: Profiling Production Airline Databases

Imagine you're the administrator of a large Cloudera on cloud deployment for an airline. Your environment contains hundreds of databases, including:
  • Production databases like airline_operations and finance_prod.
  • Development databases like airline_dev.
  • Temporary staging databases like airline_staging_20251003.
  • User sandbox databases like user_anna.

Running the Statistics Collector Profiler on every table every night is inefficient and resource-intensive. You need a way to automatically profile only important, active production tables, while ignoring everything else.

Filtering Rules Logic

The job should first exclude any table that matches any of the following Deny List rules:

  1. Database name ends with _dev
  2. Database name starts with airline_staging_
  3. Owner equals temp_user
  4. Creation Date is more than 365 days ago, so that only tables created within the last year are profiled
  5. Name ends with _archive_hive

From the remaining assets, the job profiles only the tables that match at least one of the following Allow List rules:

  1. Database name starts with airline_
  2. Database name equals finance_prod
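
The deny-then-allow evaluation described above can be sketched in a few lines of Python. This is only an illustration of the rule logic, not the profiler's actual implementation: the Table class, its field names, and the functions is_denied, is_allowed, and should_profile are hypothetical, and the sketch assumes that Deny rule 4 excludes tables created more than 365 days before the day the job runs.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Table:
    """Hypothetical stand-in for the asset attributes the rules inspect."""
    database: str
    name: str
    owner: str
    created: date


def is_denied(table: Table, today: date) -> bool:
    """Return True if the table matches ANY Deny List rule."""
    return (
        table.database.endswith("_dev")                      # rule 1
        or table.database.startswith("airline_staging_")     # rule 2
        or table.owner == "temp_user"                        # rule 3
        or table.created < today - timedelta(days=365)       # rule 4 (assumed meaning)
        or table.name.endswith("_archive_hive")              # rule 5
    )


def is_allowed(table: Table) -> bool:
    """Return True if the table matches ANY Allow List rule."""
    return (
        table.database.startswith("airline_")                # rule 1
        or table.database == "finance_prod"                  # rule 2
    )


def should_profile(table: Table, today: date) -> bool:
    """Deny rules are checked first; allow rules apply to what remains."""
    return not is_denied(table, today) and is_allowed(table)


# Example run date and tables, using the database names from the scenario.
today = date(2025, 10, 3)
recent = date(2025, 6, 1)
candidates = [
    Table("airline_operations", "flights", "etl", recent),
    Table("airline_dev", "flights", "etl", recent),
    Table("airline_staging_20251003", "tmp_load", "etl", recent),
    Table("user_anna", "scratch", "anna", recent),
    Table("finance_prod", "ledger_archive_hive", "fin", recent),
]
selected = [f"{t.database}.{t.name}" for t in candidates if should_profile(t, today)]
```

Note that airline_dev and airline_staging_20251003 both start with airline_ and would pass the Allow List on their own. They are excluded only because the Deny List is evaluated first, which is why the order of evaluation matters.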

Configuration Steps

To set up these rules:

  1. Go to Profilers > Statistics Collector Profiler > Configuration > Asset Filtering Rules.
  2. Click Add New Rule.
  3. Set up your rules as follows:
    Figure 1. Allow rules
    Figure 2. Deny rules

Validation

You can check which assets your rules affect by clicking the icon and selecting Affected Assets:
Figure 3. List of affected assets

In this example, tables whose names end with _archive_hive are not processed.