What's new in Cloudera Data Warehouse on premises

Review the new features in this release of the Cloudera Data Warehouse on premises service.

Cloudera Data Warehouse on premises

Integrating third-party Certificate Manager
The cert-manager tool is an open-source Kubernetes add-on that automates the provisioning, management, and renewal of TLS certificates. Its documentation at https://cert-manager.io/docs/ provides comprehensive guidance on installing, configuring, and using cert-manager to secure workloads with trusted X.509 certificates. Cloudera provides out-of-the-box support for Venafi TPP as part of the Cloudera Embedded Container Service installation. By integrating cert-manager, Cloudera Data Services on premises gain secure communication, reduced manual overhead, and compliance with security standards through automated certificate handling. For more information on integrating cert-manager using Venafi TPP in Cloudera Data Warehouse, see Configuring cluster issuer for Certificate Manager.
Quota management improvements to support multiple environments
As part of this release, Quota Management capabilities have been enhanced to support multiple environments. Previously, root served as the top-level resource for the cluster. With the new changes, each environment now has its own resource pool for the respective data service.

When an environment is activated in Cloudera Data Warehouse, a root.<environment-name>.cdw resource pool is automatically created. This newly created resource pool can be selected as the top-level resource pool. For more details, refer to Quota management in Cloudera Data Warehouse on premises.
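
As a small illustration of the naming pattern described above, the following Python sketch builds the pool path for a given environment. The function name `cdw_resource_pool` and the example environment name are hypothetical; only the `root.<environment-name>.cdw` pattern comes from this release note.

```python
def cdw_resource_pool(environment_name: str) -> str:
    """Return the pool path root.<environment-name>.cdw for an environment.

    Hypothetical helper for illustration; not part of any Cloudera API.
    """
    name = environment_name.strip()
    if not name:
        raise ValueError("environment name must not be empty")
    return f"root.{name}.cdw"


if __name__ == "__main__":
    # "my-env" is a made-up environment name.
    print(cdw_resource_pool("my-env"))
```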

Improvements to Impala Autoscaler Dashboard

The following improvements were introduced for the Impala Autoscaler Dashboard:

  • Ability to select the log-level configuration for the autoscaler and autoscaler metrics containers.
  • A new “Understanding The Dashboard” page has been added which explains the metrics displayed on the UI and how they are calculated.
  • Empty data points that manifest as gaps in the graphs are skipped. Zero values are accurately displayed.

For more information, see About Impala Autoscaling dashboard.
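
The difference between an empty data point and a zero value can be sketched as follows. This is a minimal Python illustration, not the dashboard's implementation: it assumes a missing sample is represented as `None`, and the function name `plottable_points` is hypothetical.

```python
from typing import List, Optional, Tuple

Sample = Tuple[int, Optional[float]]


def plottable_points(samples: List[Sample]) -> List[Tuple[int, float]]:
    """Drop samples that carry no value (None) but keep genuine zero readings."""
    return [(ts, value) for ts, value in samples if value is not None]


if __name__ == "__main__":
    # The sample at t=1 is missing; the sample at t=2 is a real zero.
    samples = [(0, 3.0), (1, None), (2, 0.0), (3, 5.0)]
    print(plottable_points(samples))
```

Testing with `is not None` rather than truthiness is what keeps the zero reading in the series.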

Ability to view end-of-support information through UI and CDP CLI
Cloudera Data Warehouse releases reach the end of support every six months. The Cloudera Data Warehouse UI displays whether your deployment is nearing its end of support time or is unsupported, enabling you to plan an upgrade. You can also view the upgrade instructions on the UI. The end of support information is also displayed when you run the list-clusters and describe-clusters CDP CLI commands.
Streamlined option for downloading Cloudera Data Warehouse diagnostic bundles
Cloudera Data Warehouse users can now download diagnostic bundles with a direct Collect option, which reduces the need to adjust the time interval and log selection beforehand. This update enables faster, more efficient access to relevant diagnostic data. For more information, see Downloading diagnostic bundles and Accessing and generating diagnostic bundles.
Security improvement: use of Chainguard images
To enhance security, Cloudera Data Warehouse now uses Chainguard hardened images for its base images, Impala, Hue, and third-party images. The Kubernetes Dashboard is excluded from this change.

These changes help us address CVEs and offer improved security and stability. For more information, see Chainguard container images.

What's new in Hive on Cloudera Data Warehouse on premises

Hive Query History Service
The Hive query history service provides a scalable solution for storing and analyzing historical Hive query data. It captures detailed information about completed queries, such as runtime, accessed tables, errors, and metadata, and stores it in an efficient Iceberg table format. For more information, see Hive query history service.
OpenTelemetry integration for Hive
Hive now integrates with OpenTelemetry (OTel) to improve query monitoring by collecting and exporting telemetry data, including infrastructure and workload metrics. An OTel agent in Cloudera Data Warehouse helps monitor query performance and troubleshoot failures. For more information, see OpenTelemetry support for Hive.

Apache Jira: HIVE-28504

What's new in Impala on Cloudera Data Warehouse on premises

Improved Cardinality Estimation for Aggregation Queries
Impala now provides more accurate cardinality estimates for aggregation queries by considering data distribution, predicates, and tuple tracing. Enhancements include:
  • Pre-aggregation Cardinality Adjustments: A new estimation model accounts for duplicate keys across nodes, reducing underestimation errors.
  • Predicate-Aware Cardinality Calculation: The planner now considers filtering conditions on group-by columns to refine cardinality estimates.
  • Tuple Tracing for Better Accuracy: Improved tuple analysis allows deeper tracking across views and intermediate aggregation nodes.
  • Consistent Aggregation Node Stats Computation: The planning process now ensures consistent and efficient recomputation of aggregation node statistics.
  • Tuple-Based Cardinality Analysis: The planner analyzes grouping expressions from the same tuple to ensure that their combined number of distinct values (NDV) does not exceed the output cardinality of the source PlanNode, reducing overestimation.
  • Refined NDV Calculation for CPU Costing: The new approach applies a probabilistic formula to a single global NDV estimate, improving accuracy and reducing overestimation in processing cost calculations.

These improvements lead to better memory estimates, optimized query execution, and more efficient resource utilization.

Apache Jira: IMPALA-2945, IMPALA-13086, IMPALA-13465, IMPALA-13526, IMPALA-13405, IMPALA-13644
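
The tuple-based cap described in the list above can be illustrated with a short Python sketch. It is not Impala's planner code: it assumes, for illustration only, that the combined number of distinct values (NDV) of several grouping columns is estimated as the product of their individual NDVs, and then limits that product to the row count of the source node. The function name `capped_group_ndv` is hypothetical.

```python
from math import prod
from typing import Sequence


def capped_group_ndv(column_ndvs: Sequence[int], source_cardinality: int) -> int:
    """Estimate the number of groups for columns that come from one tuple.

    Assumption: the naive combined estimate is the product of per-column NDVs.
    The result is capped because a grouping cannot produce more groups than
    the source node produces rows.
    """
    return min(prod(column_ndvs), source_cardinality)


if __name__ == "__main__":
    # 100 * 50 = 5000 combinations, but the source only yields 2000 rows.
    print(capped_group_ndv([100, 50], 2000))
    # 10 * 5 = 50 combinations is already below the source row count.
    print(capped_group_ndv([10, 5], 2000))
```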

Cleanup of host-level remote scratch dir on startup and exit
Impala now removes leftover scratch files from remote storage during startup and shutdown, ensuring efficient storage management. The cleanup targets files in the host-specific directory within the configured remote scratch location.
A new flag, remote_scratch_cleanup_on_start_stop, controls this behavior. Cleanup is enabled by default. To prevent unintended deletions, disable it if multiple Impala daemons on a host, or multiple clusters, share the same remote scratch directory.

Apache Jira: IMPALA-13677, IMPALA-13798

Graceful shutdown with query cancellation
Impala now attempts to cancel running queries before reaching the graceful shutdown deadline, ensuring resources are released properly. The new shutdown_query_cancel_period_s flag controls this behavior. The default value is 60 seconds. If the flag is set to a value greater than 0, Impala tries to cancel running queries within this period before forcing shutdown. If the value exceeds 20% of the total shutdown deadline, it is automatically capped at that limit to prevent excessive delays. This approach helps prevent unfinished queries and unreleased resources during shutdown. For more information, see Setting Impala Query Cancellation on Shut down.
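
The capping rule can be worked through with a small Python sketch. This is an illustration of the arithmetic only, not Impala's implementation: the function name `effective_cancel_period_s` and its parameter names are hypothetical, and treating a value of 0 or less as "no cancellation period" is an assumption drawn from the "greater than 0" wording.

```python
def effective_cancel_period_s(cancel_period_s: int, shutdown_deadline_s: int,
                              max_percent: int = 20) -> int:
    """Return the cancellation period that would apply after capping.

    cancel_period_s     value of shutdown_query_cancel_period_s
    shutdown_deadline_s total graceful shutdown deadline, in seconds
    max_percent         cap as a percentage of the deadline (20 per the note)
    """
    if cancel_period_s <= 0:
        # Assumption: a non-positive value means no cancellation attempt.
        return 0
    cap = shutdown_deadline_s * max_percent // 100
    return min(cancel_period_s, cap)


if __name__ == "__main__":
    # 20% of a 3600 s deadline is 720 s, so the 60 s default is kept.
    print(effective_cancel_period_s(60, 3600))
    # 20% of a 100 s deadline is 20 s, so 60 s is capped to 20 s.
    print(effective_cancel_period_s(60, 100))
```
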
Programmatic query termination
Impala now supports the KILL QUERY statement, enabling you to forcibly terminate queries for better workload management. The KILL QUERY statement cancels and unregisters queries on any coordinator. For more information, see KILL QUERY statement.
Ability to log and manage Impala workloads
Cloudera Data Warehouse lets you enable logging of Impala queries on an existing Virtual Warehouse or while creating a new Impala Virtual Warehouse. Information about all completed Impala queries is stored in the sys.impala_query_log system table. Information about all actively running and recently completed Impala queries is stored in the sys.impala_query_live system table. Users with appropriate permissions can query these tables using SQL to monitor and optimize the Impala engine. For more information, see Impala workload management.

What's new in Iceberg on Cloudera Data Warehouse on premises

Cloudera support for Apache Iceberg version 1.5.2
The Apache Iceberg component has been upgraded from 1.4.3 to 1.5.2.
Reading Iceberg Puffin statistics
Impala supports reading Puffin statistics from current and older snapshots. When there are Puffin statistics for multiple snapshots, Impala chooses the most recent statistics for each column. This means that statistics for different columns may come from different snapshots. If a column has both Hive Metastore (HMS) and Puffin statistics, the more recent of the two is used. To decide which is more recent, Impala compares the impala.lastComputeStatsTime property for HMS statistics with the snapshot timestamp for Puffin statistics. For more information, see Iceberg Puffin statistics.
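
The per-column selection rule can be sketched in Python as follows. This is a simplified model, not Impala's code: the `ColumnStats` class, the `pick_stats` function, and the `ndv` field are hypothetical, timestamps are assumed to be in the same unit, and keeping the HMS entry on an exact tie is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ColumnStats:
    source: str     # "HMS" or "Puffin"
    timestamp: int  # HMS: impala.lastComputeStatsTime; Puffin: snapshot timestamp
    ndv: int        # illustrative payload only


def pick_stats(hms: Dict[str, ColumnStats],
               puffin_by_snapshot: List[Dict[str, ColumnStats]]) -> Dict[str, ColumnStats]:
    """Choose, for each column, the statistics with the newest timestamp."""
    chosen = dict(hms)
    for snapshot_stats in puffin_by_snapshot:
        for column, stats in snapshot_stats.items():
            current = chosen.get(column)
            # Strictly newer wins; on a tie the earlier entry is kept (assumption).
            if current is None or stats.timestamp > current.timestamp:
                chosen[column] = stats
    return chosen


if __name__ == "__main__":
    hms = {"id": ColumnStats("HMS", 200, 10)}
    puffin = [
        {"id": ColumnStats("Puffin", 100, 8), "name": ColumnStats("Puffin", 100, 5)},
        {"name": ColumnStats("Puffin", 300, 7)},
    ]
    for column, stats in sorted(pick_stats(hms, puffin).items()):
        print(column, stats.source, stats.timestamp)
```

In this made-up data, `id` keeps its HMS statistics because they are newer than the only Puffin entry for that column, while `name` takes the Puffin statistics from the later snapshot.
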
Enhancements to Iceberg data compaction
The OPTIMIZE TABLE statement is enhanced with the following improvements:
  • Supports partition evolution

    The OPTIMIZE TABLE statement in Hive and Impala, which compacts Iceberg tables and optimizes them for read operations, now supports compaction of Iceberg tables with partition evolution.

  • Supports data compaction based on file size threshold

    The Impala OPTIMIZE TABLE statement has been enhanced to include a FILE_SIZE_THRESHOLD_MB option that enables you to specify the maximum size of files (in MB) that should be considered for compaction.

For more information, see Iceberg data compaction.
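
The effect of a file size threshold can be illustrated with a short Python sketch. It models only the selection idea, not Impala's compaction logic: the function name `files_to_compact` and the file names are hypothetical, 1 MB is taken as 1024 * 1024 bytes, and selecting files strictly smaller than the threshold is an assumption.

```python
from typing import Dict, List

BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes


def files_to_compact(file_sizes_bytes: Dict[str, int], threshold_mb: int) -> List[str]:
    """Return the files small enough to be considered for compaction."""
    threshold_bytes = threshold_mb * BYTES_PER_MB
    # Assumption: only files strictly below the threshold qualify.
    return sorted(name for name, size in file_sizes_bytes.items()
                  if size < threshold_bytes)


if __name__ == "__main__":
    sizes = {
        "a.parquet": 5 * BYTES_PER_MB,
        "b.parquet": 90 * BYTES_PER_MB,
        "c.parquet": 300 * BYTES_PER_MB,
    }
    # With a 100 MB threshold, only the 5 MB and 90 MB files are selected.
    print(files_to_compact(sizes, 100))
```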

Impala supports the MERGE INTO statement for Iceberg tables
You can use Impala to run a MERGE INTO statement on an Iceberg table based on the results of a join between a target and source Iceberg table. For more information, see the Iceberg Merge feature.

What's new in Hue on Cloudera Data Warehouse on premises

Enhanced AI Integration in Hue SQL AI Assistant
The Hue SQL AI Assistant now supports Cloudera AI Workbench and Cloudera AI Inference service. These integrations enable the assistant to use private models hosted within Cloudera-managed infrastructure, which improves security and privacy while applying GenAI to SQL-related tasks in Hue.
Hue SQL AI: Multi database querying now supported
The Hue SQL AI Assistant now supports multi-database querying, allowing you to retrieve data from multiple databases simultaneously. This enhancement simplifies managing large datasets across different systems and enables seamless cross-database queries.
  • Support for cross-database queries.
  • Ability to retrieve and combine data from multiple sources in a single query.
For more information, see Multi database support for SQL query.
User Input Validation for Hue SQL AI
Hue SQL AI now supports secure and optimized integration with large language models (LLMs). You can configure user input validation, such as prompt length limits, regex restrictions, and HTML tag handling, to improve both security and system performance.

For more information, see User Input Validation for Hue SQL AI.
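
The three kinds of checks named above can be sketched in Python. This is a generic illustration, not Hue's implementation or configuration: the limit, the blocked pattern, the tag-stripping behavior, and the function name `validate_prompt` are all hypothetical choices made for the example.

```python
import re

MAX_PROMPT_LENGTH = 500  # hypothetical length limit
# Hypothetical regex restriction: reject prompts asking to drop or truncate tables.
BLOCKED_PATTERN = re.compile(r"\b(drop|truncate)\s+table\b", re.IGNORECASE)
# Simple matcher for HTML-like tags such as <b> or </script>.
TAG_PATTERN = re.compile(r"<[^>]+>")


def validate_prompt(prompt: str) -> str:
    """Apply length, regex, and HTML tag checks; return the cleaned prompt."""
    if len(prompt) > MAX_PROMPT_LENGTH:
        raise ValueError("prompt exceeds the configured length limit")
    if BLOCKED_PATTERN.search(prompt):
        raise ValueError("prompt matches a restricted pattern")
    # HTML tag handling in this sketch: strip tags instead of rejecting.
    return TAG_PATTERN.sub("", prompt).strip()


if __name__ == "__main__":
    print(validate_prompt("<b>top 5</b> customers by revenue"))
```

Whether tags are stripped, escaped, or rejected outright is a policy decision; stripping is used here only to keep the example short.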