What's New in Cloudera Data Warehouse on cloud

Review the new features introduced in this release of the Cloudera Data Warehouse service on Cloudera on cloud.

What's new in Cloudera Data Warehouse on cloud

Improvements to Impala Autoscaler Dashboard - view historical data
You can now view historical autoscaler metrics for a specified time range by choosing the Historic Data option and specifying the start and end timestamps of the range. Note that this feature is currently available only for AWS environments. For more information, see About Impala Autoscaling Dashboard.
Publishing Cloudera Data Warehouse telemetry data in Cloudera Observability
When you activate an AWS or Azure environment in Cloudera Data Warehouse, the global option set in the Cloudera Management Console under Environments > Summary > Telemetry > Cloudera Observability - Workload Analytics determines whether diagnostic information about job and query execution is sent to Cloudera Observability.
If the Cloudera Observability - Workload Analytics option is enabled, Cloudera Data Warehouse publishes Hive or Impala query data to Cloudera Observability. If the option is disabled, users do not see any diagnostic data related to their queries.
Removal of docker custom registry type
Starting with this release, the "docker" custom image registry type is no longer supported in Cloudera Data Warehouse, and the option to choose the "docker" registry type during environment activation has been removed. Cloudera Data Warehouse supports only the ACR and ECR image registries.

What's new in Hive on Cloudera Data Warehouse on cloud

OpenTelemetry integration for Hive
Hive now integrates with OpenTelemetry (OTel) to enhance query monitoring by collecting and exporting telemetry data, including infrastructure and workload metrics. An OTel agent in Cloudera Data Warehouse helps monitor query performance and troubleshoot failures. For more information, see OpenTelemetry support for Hive.

Apache Jira: HIVE-28504

Common table expression detection and rewrites using cost-based optimizer
Hive's existing shared work optimizer detects and optimizes common table expressions heuristically, but it lacks cost-based analysis and offers limited customization. This release introduces new APIs and configuration options that support common table expression optimization at the cost-based optimizer level. The feature is experimental and disabled by default.
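For illustration, the kind of query this targets references the same common table expression more than once; with the new cost-based path enabled, the optimizer can weigh whether reusing a single computation of the expression is cheaper than inlining it at each reference. The table and column names below are illustrative, not part of the release.

    -- The CTE recent_orders is referenced twice. With cost-based CTE
    -- optimization enabled, Hive can decide whether materializing it once
    -- and reusing the result is cheaper than recomputing it per reference.
    WITH recent_orders AS (
      SELECT customer_id, order_total
      FROM orders
      WHERE order_date >= '2024-01-01'
    )
    SELECT a.customer_id, a.order_total
    FROM recent_orders a
    JOIN (
      SELECT customer_id, AVG(order_total) AS avg_total
      FROM recent_orders
      GROUP BY customer_id
    ) b ON a.customer_id = b.customer_id
    WHERE a.order_total > b.avg_total;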

Apache Jira: HIVE-28259

Upgraded Avro to version 1.11.3

What's new in Impala on Cloudera Data Warehouse on cloud

Improved Cardinality Estimation for Aggregation Queries
Impala now provides more accurate cardinality estimates for aggregation queries by considering data distribution, predicates, and tuple tracing. Enhancements include:
  • Pre-aggregation Cardinality Adjustments: A new estimation model accounts for duplicate keys across nodes, reducing underestimation errors.
  • Predicate-Aware Cardinality Calculation: The planner now considers filtering conditions on group-by columns to refine cardinality estimates, as illustrated in the example after this list.
  • Tuple Tracing for Better Accuracy: Improved tuple analysis allows deeper tracking across views and intermediate aggregation nodes.
  • Consistent Aggregation Node Stats Computation: The planning process now ensures consistent and efficient recomputation of aggregation node statistics.
  • Tuple-Based Cardinality Analysis: Grouping expressions from the same tuple are analyzed to ensure that their combined number of distinct values (NDV) does not exceed the output cardinality of the source PlanNode, reducing overestimation.
  • Refined NDV Calculation for CPU Costing: The new approach applies a probabilistic formula to a single global NDV estimate, improving accuracy and reducing overestimation in processing cost calculations.
Together, these improvements lead to better memory estimates, optimized query execution, and more efficient resource utilization.
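As referenced in the list above, a hedged illustration of the predicate-aware estimate; the table, column, and values are illustrative, not from the release:

    -- The IN predicate admits only two values of the group-by column, so the
    -- planner can cap the estimated number of output groups at 2 instead of
    -- using the column-wide number of distinct values.
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    WHERE region IN ('EMEA', 'APAC')
    GROUP BY region;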

Apache Jira: IMPALA-2945, IMPALA-13086, IMPALA-13465, IMPALA-13526, IMPALA-13405, IMPALA-13644

Cleanup of host-level remote scratch dir on startup and exit
Impala now removes leftover scratch files from remote storage during startup and shutdown, ensuring efficient storage management. The cleanup targets files in the host-specific directory within the configured remote scratch location.

A new flag, remote_scratch_cleanup_on_start_stop, controls this behavior. Cleanup is enabled by default. To prevent unintended deletions, you can disable it when multiple Impala daemons on a host, or multiple clusters, share the same remote scratch directory.

Apache Jira: IMPALA-13677, IMPALA-13798

Graceful shutdown with query cancellation
Impala now attempts to cancel running queries before reaching the graceful shutdown deadline, ensuring that resources are released properly. The new shutdown_query_cancel_period_s flag controls this behavior, with a default value of 60 seconds. If the flag is set to a value greater than 0, Impala tries to cancel running queries within this period before forcing shutdown. If the value exceeds 20% of the total shutdown deadline, it is automatically capped to prevent excessive delays. This approach helps avoid leaving queries unfinished and resources unreleased at shutdown. For more information, see Setting Impala Query Cancellation on Shut down.
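As a sketch, a graceful shutdown initiated with Impala's existing :shutdown() statement; the host, port, and deadline below are placeholders:

    -- Start a graceful shutdown of a coordinator with a 600-second deadline.
    -- With shutdown_query_cancel_period_s at its default of 60, Impala attempts
    -- to cancel queries that are still running within that period before the
    -- shutdown is forced, instead of leaving them unfinished.
    :shutdown('coordinator-host:27000', 600);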
Programmatic query and session termination
Impala now supports the KILL QUERY statement, enabling you to forcibly terminate queries and sessions for better workload management. The KILL QUERY statement cancels and unregisters queries on any coordinator. For more information, see KILL QUERY statement.
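A minimal example, assuming the query ID is passed as a quoted string; the ID shown is a placeholder:

    -- Terminate a specific query from any coordinator.
    KILL QUERY 'd424420e0c44ab9:c1f9d9fc00000000';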
Ability to log and manage Impala workloads is now GA
Cloudera Data Warehouse provides the option to enable logging of Impala queries on an existing Virtual Warehouse or while creating a new Impala Virtual Warehouse. Information about all completed Impala queries is stored in the sys.impala_query_log system table, and information about all actively running and recently completed Impala queries is stored in the sys.impala_query_live system table. Users with appropriate permissions can query these tables using SQL to monitor and optimize the Impala engine. For more information, see Impala workload management.
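For example, a user with the required permissions can inspect the tables directly; the queries below keep to simple projections because individual columns are not listed in this note:

    -- Actively running and recently completed queries:
    SELECT * FROM sys.impala_query_live LIMIT 10;

    -- Completed queries retained in the query log:
    SELECT * FROM sys.impala_query_log LIMIT 10;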
AI-enhanced UDF development package in Impala is now GA
Cloudera Data Warehouse introduces Impala's built-in ai_generate_text function, which integrates Large Language Models (LLMs) into SQL for tasks such as sentiment analysis and translation. It simplifies workflows, requires no ML expertise, and supports default or custom UDF configurations.

Secure API key storage is supported through a JCEKS keystore. A lightweight tool included in the UDF SDK helps create or update keystores on Amazon S3 or Azure ABFS without a local Hadoop setup.
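A minimal sketch, assuming the default-configuration variant of the function (ai_generate_text_default); the table, column, and prompt are illustrative:

    -- Classify the sentiment of free-text reviews using the default LLM
    -- configuration set for the Virtual Warehouse.
    SELECT review_id,
           ai_generate_text_default(
             CONCAT('Classify the sentiment of this review as positive, ',
                    'negative, or neutral: ', review_text)) AS sentiment
    FROM customer_reviews
    LIMIT 5;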

For more information, see Advantages and use cases of Impala AI functions.

What's new in Iceberg on Cloudera Data Warehouse on cloud

Cloudera support for Apache Iceberg version 1.5.2
The Apache Iceberg component has been upgraded from 1.4.3 to 1.5.2.
Reading Iceberg Puffin statistics
Impala supports reading Puffin statistics from current and older snapshots. When Puffin statistics exist for multiple snapshots, Impala chooses the most recent statistics for each column, which means that statistics for different columns may come from different snapshots. If both Hive Metastore (HMS) and Puffin statistics exist for a column, the more recent of the two is used: the impala.lastComputeStatsTime property determines the age of the HMS statistics, and the snapshot timestamp determines the age of the Puffin statistics. For more information, see Iceberg Puffin statistics.
Enhancements to Iceberg data compaction
The OPTIMIZE TABLE statement is enhanced with the following improvements:
  • Supports partition evolution

    The Hive and Impala OPTIMIZE TABLE statement, which is used to compact Iceberg tables and optimize them for read operations, now supports compaction of Iceberg tables that have undergone partition evolution.

  • Supports data compaction based on file size threshold

    The Impala OPTIMIZE TABLE statement now includes a FILE_SIZE_THRESHOLD_MB option that enables you to specify the maximum size (in MB) of files that should be considered for compaction, as shown in the example after this list.

For more information, see Iceberg data compaction.
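As noted above, a hedged example of the file-size option; the table name and threshold value are illustrative:

    -- Rewrite only data files smaller than 128 MB in the Iceberg table.
    OPTIMIZE TABLE ice_sales (FILE_SIZE_THRESHOLD_MB=128);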

Impala supports the MERGE INTO statement for Iceberg tables
You can use Impala to run a MERGE INTO statement on an Iceberg table based on the results of a join between a target and source Iceberg table. For more information, see the Iceberg Merge feature.
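A minimal sketch of the statement shape; the target and source table names and columns are illustrative:

    -- Upsert rows from a source Iceberg table into a target Iceberg table.
    MERGE INTO store_sales t
    USING daily_sales s
      ON t.sale_id = s.sale_id
    WHEN MATCHED THEN
      UPDATE SET t.quantity = s.quantity, t.amount = s.amount
    WHEN NOT MATCHED THEN
      INSERT (sale_id, quantity, amount)
      VALUES (s.sale_id, s.quantity, s.amount);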

What's new in Hue on Cloudera Data Warehouse on cloud

General availability of deploying a shared Hue service
Cloudera Data Warehouse now supports the deployment of a shared Hue service, enabling cost-efficient management by ensuring that only the necessary Virtual Warehouses remain active. Organizations can enhance team isolation by running multiple shared Hue instances, providing flexibility and control. The shared Hue service remains available as long as the environment is active.
For more information, see About deploying the shared Hue service.
Hue SQL AI: Multi database querying now supported
The Hue SQL AI Assistant now supports multi-database querying, allowing you to retrieve data from multiple databases simultaneously. This enhancement simplifies managing large datasets across different systems and enables seamless cross-database queries.
  • Support for cross-database queries.
  • Ability to retrieve and combine data from multiple sources in a single query.
For more information, see Multi database support for SQL query.
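For illustration, a cross-database query of the kind the assistant can now generate; the database, table, and column names are illustrative:

    -- Combine data from two databases in a single query.
    SELECT c.customer_name, o.order_total
    FROM sales_db.orders o
    JOIN crm_db.customers c ON o.customer_id = c.customer_id;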
User Input Validation for Hue SQL AI
Hue SQL AI now supports secure and optimized integration with large language models (LLMs). You can now configure user input validation, such as prompt length limits, regex restrictions, and HTML tag handling, to enhance both security and system performance.
For more information, see User Input Validation for Hue SQL AI.