What's New in Cloudera Data Warehouse on cloud

Review the new features introduced in this release of Cloudera Data Warehouse service on Cloudera on cloud.

What's new in Cloudera Data Warehouse on cloud

Improvements to Impala Autoscaler Dashboard - view historical data: You can now view historical autoscaler metrics data for a specified period of time by choosing the Historic Data option and specifying the start and end timestamps for which you want to view the data. Note that this feature is currently available only for AWS environments. For more information, see About Impala Autoscaling Dashboard.
Publishing Cloudera Data Warehouse telemetry data in Cloudera Observability: When you activate an AWS or Azure environment in Cloudera Data Warehouse, the global option set in Cloudera Management Console under Environments > Summary > Telemetry > Cloudera Observability - Workload Analytics determines whether diagnostic information about job and query execution is sent to Workload Manager.
If the Cloudera Observability - Workload Analytics option is enabled, Cloudera Data Warehouse publishes Hive or Impala query data to Cloudera Observability. If the option is disabled, users do not see any diagnostic data related to their queries.
Note: This change affects only newly activated environments. Existing cluster instances continue to publish diagnostic data until the environment is reactivated. Cloudera Data Warehouse applies any change to this option only when the environment is reactivated.
Removal of docker custom registry type: Starting from this release, the "docker" custom image registry type is no longer supported in Cloudera Data Warehouse and the option to choose the "docker" registry type during environment activation is removed. Cloudera Data Warehouse only supports the ACR and ECR image registries.
Security improvement: use of Chainguard images: To enhance security, Cloudera Data Warehouse now uses Chainguard hardened images for its base images, Hue, and third-party images. The Kubernetes Dashboard is excluded from this change.
These changes help us address CVEs and offer improved security and stability. For more information, see Chainguard container images.

What's new in Hive on Cloudera Data Warehouse on cloud

OpenTelemetry integration for Hive: Hive now integrates with OpenTelemetry (OTel) to enhance query monitoring by collecting and exporting telemetry data, including infrastructure and workload metrics. An OTel agent in Cloudera Data Warehouse helps monitor query performance and troubleshoot failures. For more information, see OpenTelemetry support for Hive.
Apache Jira: HIVE-28504
Common table expression detection and rewrites using cost-based optimizer: Hive's existing shared work optimizer detects and optimizes common table expressions heuristically, but it lacks cost-based analysis and offers limited customization. This release introduces new APIs and configuration options that support common table expression optimizations at the cost-based optimizer level. The feature is experimental and disabled by default.
Apache Jira: HIVE-28259
Upgraded Avro to version 1.11.3

What's new in Impala on Cloudera Data Warehouse on cloud

Improved Cardinality Estimation for Aggregation Queries

Impala now provides more accurate cardinality estimates for aggregation queries by considering data distribution, predicates, and tuple tracing. Enhancements include:

Pre-aggregation Cardinality Adjustments: A new estimation model accounts for duplicate keys across nodes, reducing underestimation errors.
Predicate-Aware Cardinality Calculation: The planner now considers filtering conditions on group-by columns to refine cardinality estimates.
Tuple Tracing for Better Accuracy: Improved tuple analysis allows deeper tracking across views and intermediate aggregation nodes.
Consistent Aggregation Node Stats Computation: The planning process now ensures consistent and efficient recomputation of aggregation node statistics.
Tuple-Based Cardinality Analysis: The planner analyzes grouping expressions from the same tuple to ensure that their combined number of distinct values (NDV) does not exceed the output cardinality of the source PlanNode, reducing overestimation.
Refined NDV Calculation for CPU Costing: The new approach applies a probabilistic formula to a single global NDV estimate, improving accuracy and reducing overestimation in processing cost calculations.

These improvements lead to better memory estimates, optimized query execution, and more efficient resource utilization.

Apache Jira: IMPALA-2945, IMPALA-13086, IMPALA-13465, IMPALA-13526, IMPALA-13405, IMPALA-13644
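The tuple-based cap described above can be illustrated with a minimal sketch. It assumes a naive baseline that multiplies the per-column NDVs, then bounds the result by the cardinality of the source node. The function name and the baseline are illustrative assumptions and are not taken from the Impala planner.

```python
from math import prod

def capped_group_cardinality(ndvs, source_cardinality):
    """Estimate the number of groups for grouping columns from one tuple.

    Assumption: the uncapped estimate is the product of per-column NDVs.
    The cap reflects that a grouping cannot produce more groups than the
    source node produces rows.
    """
    if not ndvs:
        return 1  # no grouping columns: a single output group
    combined = prod(ndvs)
    return min(combined, source_cardinality)

print(capped_group_cardinality([1000, 500], 20000))  # 20000 (capped)
print(capped_group_cardinality([10, 5], 20000))      # 50 (below the cap)
```

In the first call the naive product (500,000) exceeds the 20,000 input rows, so the estimate is reduced to the input cardinality.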

Cleanup of host-level remote scratch dir on startup and exit

Impala now removes leftover scratch files from remote storage during startup and shutdown, ensuring efficient storage management. The cleanup targets files in the host-specific directory within the configured remote scratch location.

A new flag, remote_scratch_cleanup_on_start_stop, controls this behavior. Cleanup is enabled by default. To prevent unintended deletions, disable it if multiple Impala daemons on a host, or multiple clusters, share the same remote scratch directory.

Apache Jira: IMPALA-13677, IMPALA-13798

Graceful shutdown with query cancellation

Impala now attempts to cancel running queries before reaching the graceful shutdown deadline, ensuring resources are released properly. The new shutdown_query_cancel_period_s flag controls this behavior. The default value is 60 seconds. If the flag is set to a value greater than 0, Impala tries to cancel running queries within this period before forcing shutdown. If the value exceeds 20% of the total shutdown deadline, it is automatically capped to prevent excessive delays. This approach helps prevent unfinished queries and unreleased resources during shutdown. For more information, see Setting Impala Query Cancellation on Shut down.
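The capping rule can be worked through with a small sketch. It assumes the cap is exactly 20% of the shutdown deadline and that a value of 0 or less disables the cancellation step. The function and parameter names are hypothetical and do not come from the Impala code base.

```python
def effective_cancel_period_s(cancel_period_s, shutdown_deadline_s):
    """Return the cancellation period that would apply after capping.

    Assumptions: values <= 0 disable cancellation, and the cap is
    20% of the total graceful shutdown deadline.
    """
    if cancel_period_s <= 0:
        return 0  # cancellation step is skipped
    cap = shutdown_deadline_s / 5  # 20% of the deadline
    return min(cancel_period_s, cap)

# Default 60 s period with a 1-hour deadline: under the cap, unchanged.
print(effective_cancel_period_s(60, 3600))  # 60
# Same period with a 2-minute deadline: capped to 20% of 120 s.
print(effective_cancel_period_s(60, 120))   # 24.0
```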

Programmatic query termination

Impala now supports the KILL QUERY statement, enabling you to forcibly terminate queries for better workload management. The KILL QUERY statement cancels and unregisters queries on any coordinator. For more information, see KILL QUERY statement.

Ability to log and manage Impala workloads is now GA

Cloudera Data Warehouse lets you enable logging of Impala queries on an existing Virtual Warehouse or while creating a new Impala Virtual Warehouse. Information about all completed Impala queries is stored in the sys.impala_query_log system table. Information about all actively running and recently completed Impala queries is stored in the sys.impala_query_live system table. Users with appropriate permissions can query these tables using SQL to monitor and optimize the Impala engine. For more information, see Impala workload management.

AI Functions in Impala is now GA

Cloudera Data Warehouse introduces Impala’s built-in ai_generate_text function, which integrates Large Language Models (LLMs) into SQL for tasks such as sentiment analysis and translation. It simplifies workflows, requires no ML expertise, and supports default or custom UDF configurations.

Secure API key storage is supported through a JCEKS keystore. A lightweight tool included in the UDF SDK helps create or update keystores on Amazon S3 or Azure ABFS without a local Hadoop setup.

For more information, see Advantages and use cases of Impala AI functions.

What's new in Iceberg on Cloudera Data Warehouse on cloud

Cloudera support for Apache Iceberg version 1.5.2

The Apache Iceberg component has been upgraded from 1.4.3 to 1.5.2.

Reading Iceberg Puffin statistics

Impala supports reading Puffin statistics from current and older snapshots. When there are Puffin statistics for multiple snapshots, Impala chooses the most recent statistics for each column. This means that statistics for different columns may come from different snapshots. If a column has both Hive Metastore (HMS) and Puffin statistics, the more recent of the two is used. To determine which is more recent, Impala uses the impala.lastComputeStatsTime property for HMS statistics and the snapshot timestamp for Puffin statistics. For more information, see Iceberg Puffin statistics.
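The per-column selection rule can be sketched as follows. This is an illustration only: the ColumnStats type and pick_stats function are hypothetical, the timestamps are assumed to be in comparable units, and the behavior when two timestamps are equal is not specified by the text above.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ColumnStats:
    source: str     # "HMS" or "Puffin"
    timestamp: int  # lastComputeStatsTime (HMS) or snapshot timestamp (Puffin)
    ndv: int

def pick_stats(hms: Optional[ColumnStats],
               puffin: List[ColumnStats]) -> Optional[ColumnStats]:
    """Pick the most recent statistics available for a single column."""
    candidates = list(puffin)
    if hms is not None:
        candidates.append(hms)
    if not candidates:
        return None  # no statistics exist for this column
    return max(candidates, key=lambda s: s.timestamp)

hms = ColumnStats("HMS", 200, 10)
puffin = [ColumnStats("Puffin", 100, 8), ColumnStats("Puffin", 300, 12)]
print(pick_stats(hms, puffin).source)      # Puffin
print(pick_stats(hms, puffin[:1]).source)  # HMS
```

Because the rule is applied column by column, one column can end up with Puffin statistics from an older snapshot while another uses HMS statistics.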

Enhancements to Iceberg data compaction

The OPTIMIZE TABLE statement is enhanced with the following improvements:

Supports partition evolution
The Hive and Impala OPTIMIZE TABLE statement, which compacts Iceberg tables and optimizes them for read operations, now supports compaction of Iceberg tables with partition evolution.
Supports data compaction based on file size threshold
The Impala OPTIMIZE TABLE statement has been enhanced to include a FILE_SIZE_THRESHOLD_MB option that enables you to specify the maximum size of files (in MB) that should be considered for compaction.

For more information, see Iceberg data compaction.
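The effect of a file size threshold on compaction can be pictured with a minimal sketch. It assumes that files strictly smaller than the threshold are selected and that 1 MB is 1024 * 1024 bytes. Both are assumptions for illustration, and the function name is hypothetical.

```python
def files_to_compact(file_sizes_bytes, threshold_mb):
    """Return the names of files small enough to be considered for compaction.

    Assumptions: the comparison is strict (<) and 1 MB = 1024 * 1024 bytes.
    """
    threshold_bytes = threshold_mb * 1024 * 1024
    return [name for name, size in file_sizes_bytes.items()
            if size < threshold_bytes]

MB = 1024 * 1024
sizes = {"a.parquet": 10 * MB, "b.parquet": 200 * MB, "c.parquet": 99 * MB}
print(files_to_compact(sizes, 100))  # ['a.parquet', 'c.parquet']
```

Files at or above the threshold, such as b.parquet here, are left as they are, which avoids rewriting data that is already stored in large files.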

Impala supports the MERGE INTO statement for Iceberg tables

You can use Impala to run a MERGE INTO statement on an Iceberg table based on the results of a join between a target and source Iceberg table. For more information, see the Iceberg Merge feature.

What's new in Hue on Cloudera Data Warehouse on cloud

General availability of deploying a shared Hue service: Cloudera Data Warehouse now supports the deployment of a shared Hue service, enabling cost-efficient management by ensuring that only the necessary Virtual Warehouses remain active. Organizations can enhance team isolation by running multiple shared Hue instances, providing flexibility and control. The shared Hue service remains available as long as the environment is active. For more information, see About deploying the shared Hue service.

Hue SQL AI: Multi database querying now supported

The Hue SQL AI Assistant now supports multi-database querying, allowing you to retrieve data from multiple databases simultaneously. This enhancement simplifies managing large datasets across different systems and enables seamless cross-database queries.

Support for cross-database queries.
Ability to retrieve and combine data from multiple sources in a single query.

For more information, see Multi database support for SQL query.

User Input Validation for Hue SQL AI: Hue SQL AI now supports secure and optimized integration with large language models (LLMs). You can configure user input validation, such as prompt length limits, regex restrictions, and HTML tag handling, to enhance both security and system performance. For more information, see User Input Validation for Hue SQL AI.
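The three kinds of checks named above can be sketched in a few lines. This does not show Hue's implementation or its configuration keys. The limit, the blocked pattern, and the choice to strip HTML tags rather than reject the prompt are all hypothetical values chosen for illustration.

```python
import re

MAX_PROMPT_LENGTH = 500                                 # hypothetical limit
BLOCKED_PATTERN = re.compile(r"(?i)\bdrop\s+table\b")   # hypothetical regex rule
TAG_PATTERN = re.compile(r"<[^>]+>")                    # matches simple HTML tags

def validate_prompt(prompt: str) -> str:
    """Validate a user prompt before it is sent to an LLM.

    Raises ValueError on a length or regex violation and returns the
    prompt with HTML tags removed otherwise.
    """
    if len(prompt) > MAX_PROMPT_LENGTH:
        raise ValueError("prompt exceeds the configured length limit")
    if BLOCKED_PATTERN.search(prompt):
        raise ValueError("prompt matches a restricted pattern")
    return TAG_PATTERN.sub("", prompt).strip()

print(validate_prompt("<b>top customers</b> by revenue"))
# top customers by revenue
```

Rejecting oversized or restricted prompts before they reach the model limits both the attack surface and the cost of each request.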