What's New in Cloudera Data Warehouse on cloud
Review the new features introduced in this release of Cloudera Data Warehouse service on Cloudera on cloud.
What's new in Cloudera Data Warehouse on cloud
- Improvements to Impala Autoscaler Dashboard - view historical data
- You can now view historical autoscaler metrics data for a specified period of time by choosing the Historic Data option and specifying the start and end timestamps for which you want to view the data. Note that this feature is currently available only for AWS environments. For more information, see About Impala Autoscaling Dashboard.
- Publishing Cloudera Data Warehouse telemetry data in Cloudera Observability
- While activating an AWS or Azure environment in Cloudera Data Warehouse, the
global option set in Cloudera Management Console determines whether diagnostic information about job and query execution
is sent to Workload Manager. If the Cloudera Observability - Workload Analytics option is enabled, Cloudera Data Warehouse publishes Hive or Impala query data to Cloudera Observability. If the option is disabled, users do not see any diagnostic data related to their queries.
- Removal of docker custom registry type
- Starting from this release, the "docker" custom image registry type is no longer supported in Cloudera Data Warehouse and the option to choose the "docker" registry type during environment activation is removed. Cloudera Data Warehouse only supports the ACR and ECR image registries.
What's new in Hive on Cloudera Data Warehouse on cloud
- OpenTelemetry integration for Hive
- Hive now integrates with OpenTelemetry (OTel) to enhance query monitoring by collecting and exporting
telemetry data, including infrastructure and workload metrics. An OTel agent in Cloudera Data Warehouse helps monitor query performance and troubleshoot failures.
For more information, see OpenTelemetry support for Hive.
Apache Jira: HIVE-28504
- Common table expression detection and rewrites using cost-based optimizer
- Hive's existing shared work optimizer detects and optimizes common table expressions
heuristically, but it lacks cost-based analysis and has limited customization. This release introduces new
APIs and configuration options to support common table expression optimizations at the
cost-based optimizer level. The feature is experimental and disabled by default.
Apache Jira: HIVE-28259
- Upgraded Avro to version 1.11.3
What's new in Impala on Cloudera Data Warehouse on cloud
- Improved Cardinality Estimation for Aggregation Queries
- Impala now provides more accurate cardinality estimates for aggregation queries by
considering data distribution, predicates, and tuple tracing. Enhancements include:
- Pre-aggregation Cardinality Adjustments: A new estimation model accounts for duplicate keys across nodes, reducing underestimation errors.
- Predicate-Aware Cardinality Calculation: The planner now considers filtering conditions on group-by columns to refine cardinality estimates.
- Tuple Tracing for Better Accuracy: Improved tuple analysis allows deeper tracking across views and intermediate aggregation nodes.
- Consistent Aggregation Node Stats Computation: The planning process now ensures consistent and efficient recomputation of aggregation node statistics.
- Tuple-Based Cardinality Analysis: The planner analyzes grouping expressions from the same tuple to ensure that their combined number of distinct values (NDV) does not exceed the output cardinality of the source PlanNode, reducing overestimation.
- Refined NDV Calculation for CPU Costing: The new approach applies a probabilistic formula to a single global NDV estimate, improving accuracy and reducing overestimation in processing cost calculations.
These improvements lead to better memory estimates, optimized query execution, and more efficient resource utilization.
Apache Jira: IMPALA-2945, IMPALA-13086, IMPALA-13465, IMPALA-13526, IMPALA-13405, IMPALA-13644
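The two NDV-related ideas above can be illustrated with a small Python sketch. This is not Impala's planner code: the function names are hypothetical, and the probabilistic formula shown is a common textbook estimate chosen for illustration, since the release note does not state which formula Impala applies.

```python
from math import prod


def capped_group_ndv(column_ndvs, source_cardinality):
    """Combined NDV of grouping columns from one tuple, capped at the source row count.

    Multiplying per-column NDVs assumes the columns are independent and can
    overshoot, so the product is capped at the number of rows the source
    plan node is expected to produce.
    """
    return min(prod(column_ndvs), source_cardinality)


def expected_distinct(ndv, rows):
    """Expected number of distinct values observed when `rows` rows are drawn
    uniformly from `ndv` possible values (assumed formula, for illustration)."""
    if ndv <= 0 or rows <= 0:
        return 0.0
    return ndv * (1.0 - (1.0 - 1.0 / ndv) ** rows)


if __name__ == "__main__":
    # Two grouping columns with 1,000 and 500 distinct values over 100,000 rows:
    # the naive product (500,000) exceeds the input size, so it is capped.
    print(capped_group_ndv([1000, 500], 100_000))
    # With fewer rows than distinct values, not every value is expected to appear.
    print(expected_distinct(1000, 500))
```

The cap reduces overestimation for correlated grouping columns, while the probabilistic estimate avoids assuming that every distinct value appears in a small input.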
- Cleanup of host-level remote scratch dir on startup and exit
- Impala now removes leftover scratch files from remote storage during startup and shutdown,
ensuring efficient storage management. The cleanup targets files in the host-specific
directory within the configured remote scratch location.
A new flag, remote_scratch_cleanup_on_start_stop, controls this behavior. By default, cleanup is enabled, but you can disable it if multiple Impala daemons on a host or multiple clusters share the same remote scratch directory to prevent unintended deletions.
Apache Jira: IMPALA-13677, IMPALA-13798
- Graceful shutdown with query cancellation
- Impala now attempts to cancel running queries before reaching the graceful shutdown
deadline, ensuring resources are released properly. The new shutdown_query_cancel_period_s flag
controls this behavior. The default value is 60 seconds. If set to a value greater than 0, Impala will try to cancel running queries within this period before forcing shutdown. If the value exceeds 20% of the total shutdown deadline, it is automatically capped to prevent excessive delays. This approach helps prevent unfinished queries and unreleased resources during shutdown. For more information, see Setting Impala Query Cancellation on Shut down.
- Programmatic query and session termination
- Impala now supports the KILL QUERY statement, enabling you to forcibly terminate queries and sessions for better workload management. The KILL QUERY statement cancels and unregisters queries on any coordinator. For more information, see KILL QUERY statement.
- Ability to log and manage Impala workloads is now GA
- Cloudera Data Warehouse provides the option to enable logging of Impala
queries on an existing Virtual Warehouse or while creating a new Impala Virtual Warehouse. The
information for all completed Impala queries is stored in the sys.impala_query_log
system table. Information about all actively running and recently completed Impala queries is stored in the sys.impala_query_live
system table. Users with appropriate permissions can query these tables using SQL to monitor and optimize the Impala engine. For more information, see Impala workload management.
- AI-enhanced UDF development package in Impala is now GA
- Cloudera Data Warehouse now provides Impala’s built-in ai_generate_text
function, which integrates Large Language Models (LLMs) into SQL for tasks such as sentiment analysis
and translation. It simplifies workflows, requires no ML expertise, and supports default or
custom UDF configurations.
Secure API key storage is supported through a JCEKS keystore. A lightweight tool included in the UDF SDK helps create or update keystores on Amazon S3 or Azure ABFS without a local Hadoop setup.
For more information, see Advantages and use cases of Impala AI functions.
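As an illustration of the capping rule described earlier in this section for the shutdown_query_cancel_period_s flag, the following Python sketch computes the cancellation window that would take effect. The function name and the treatment of non-positive values are assumptions made for this example. It models only the arithmetic stated in the release note, not Impala's implementation.

```python
def effective_cancel_period_s(cancel_period_s, shutdown_deadline_s):
    """Return the query cancellation window, in seconds, after capping.

    cancel_period_s     -- configured shutdown_query_cancel_period_s value
    shutdown_deadline_s -- total graceful shutdown deadline (a parameter here;
                           the release note does not state its default)
    """
    if cancel_period_s <= 0:
        # Assumption: a value that is not greater than 0 means no
        # cancellation attempt is made before the deadline.
        return 0.0
    # The period may not exceed 20% (one fifth) of the shutdown deadline.
    cap = shutdown_deadline_s / 5.0
    return min(float(cancel_period_s), cap)


if __name__ == "__main__":
    # The default of 60 seconds fits within a one-hour deadline.
    print(effective_cancel_period_s(60, 3600))
    # With a 100-second deadline, the same setting is capped.
    print(effective_cancel_period_s(60, 100))
```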
What's new in Iceberg on Cloudera Data Warehouse on cloud
- Cloudera support for Apache Iceberg version 1.5.2
- The Apache Iceberg component has been upgraded from 1.4.3 to 1.5.2.
- Reading Iceberg Puffin statistics
- Impala supports reading Puffin statistics from current and older snapshots. When there are
Puffin statistics for multiple snapshots, Impala chooses the most recent statistics for each
column. This means that statistics for different columns may come from different
snapshots. If there are both Hive Metastore (HMS) and Puffin statistics for a column, the most
recent statistics are used. To determine which of the two is more recent, Impala uses the
impala.lastComputeStatsTime property for HMS statistics and the snapshot timestamp for Puffin statistics. For more information, see Iceberg Puffin statistics.
- Enhancements to Iceberg data compaction
- The OPTIMIZE TABLE statement is enhanced with the following improvements:
- Supports partition evolution
The Hive and Impala OPTIMIZE TABLE statement, which is used to compact Iceberg tables and optimize them for read operations, is enhanced to support compaction of Iceberg tables with partition evolution.
- Supports data compaction based on file size threshold
The Impala OPTIMIZE TABLE statement has been enhanced to include a FILE_SIZE_THRESHOLD_MB
option that enables you to specify the maximum size of files (in MB) that should be considered for compaction.
For more information, see Iceberg data compaction.
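The effect of a file size threshold can be pictured with a short Python sketch. It is a conceptual model only: the function name is hypothetical, and both the strict "smaller than" comparison and the 1024 * 1024 bytes-per-MB conversion are assumptions, not documented Impala behavior.

```python
BYTES_PER_MB = 1024 * 1024  # assumed conversion for this illustration


def files_considered_for_compaction(file_sizes_bytes, threshold_mb):
    """Return the data files small enough to be considered for compaction.

    file_sizes_bytes -- mapping of file path to file size in bytes
    threshold_mb     -- value that would be passed as FILE_SIZE_THRESHOLD_MB
    """
    limit = threshold_mb * BYTES_PER_MB
    # Assumption: files strictly smaller than the threshold are selected;
    # larger files are left untouched.
    return {path: size for path, size in file_sizes_bytes.items() if size < limit}


if __name__ == "__main__":
    files = {
        "data/a.parquet": 8 * BYTES_PER_MB,
        "data/b.parquet": 40 * BYTES_PER_MB,
        "data/c.parquet": 300 * BYTES_PER_MB,
    }
    # With a 100 MB threshold, only the two smaller files are candidates.
    print(sorted(files_considered_for_compaction(files, 100)))
```

Leaving files above the threshold untouched means that data which is already well sized does not need to be rewritten.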
- Impala supports the MERGE INTO statement for Iceberg tables
- You can use Impala to run a MERGE INTO statement on an Iceberg table based on the results of a join between a target and source Iceberg table. For more information, see the Iceberg Merge feature.
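The per-column selection rule for Puffin and HMS statistics described earlier in this section can be sketched in Python as follows. The ColumnStat class and pick_stats function are hypothetical names for this example. The sketch assumes that the HMS impala.lastComputeStatsTime value and the Puffin snapshot timestamp have already been converted to the same unit. It mirrors the rule as stated in the release note rather than Impala's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ColumnStat:
    source: str     # "puffin" or "hms"
    ndv: int        # number of distinct values recorded for the column
    timestamp: int  # snapshot timestamp (Puffin) or impala.lastComputeStatsTime
                    # (HMS), assumed to be normalized to the same unit


def pick_stats(puffin: Dict[str, List[ColumnStat]],
               hms: Dict[str, ColumnStat]) -> Dict[str, ColumnStat]:
    """Choose, for each column, the most recent statistics from any source."""
    chosen = {}
    for column in set(puffin) | set(hms):
        # Puffin may hold statistics for the column from several snapshots.
        candidates = list(puffin.get(column, []))
        if column in hms:
            candidates.append(hms[column])
        if candidates:
            chosen[column] = max(candidates, key=lambda stat: stat.timestamp)
    return chosen


if __name__ == "__main__":
    puffin = {
        "id": [ColumnStat("puffin", 10, 100), ColumnStat("puffin", 12, 200)],
        "name": [ColumnStat("puffin", 5, 100)],
    }
    hms = {"name": ColumnStat("hms", 7, 150), "city": ColumnStat("hms", 3, 50)}
    for column, stat in sorted(pick_stats(puffin, hms).items()):
        print(column, stat.source, stat.ndv)
```

Because the choice is made per column, the resulting statistics can mix values from different snapshots and from HMS, as the description notes.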
What's new in Hue on Cloudera Data Warehouse on cloud
- General availability of deploying a shared Hue service
- Cloudera Data Warehouse now supports the deployment of a shared Hue service, enabling cost-efficient management by ensuring that only the necessary Virtual Warehouses remain active. Organizations can enhance team isolation by running multiple shared Hue instances, providing flexibility and control. The shared Hue service remains available as long as the environment is active.
- Hue SQL AI: Multi-database querying now supported
- The Hue SQL AI Assistant now supports multi-database querying, allowing you to retrieve
data from multiple databases simultaneously. This enhancement simplifies managing large
datasets across different systems and enables seamless cross-database queries.
- Support for cross-database queries.
- Ability to retrieve and combine data from multiple sources in a single query.
- User Input Validation for Hue SQL AI
- Hue SQL AI now supports secure and optimized integration with large language models (LLMs). You can configure user input validation, such as prompt length limits, regex restrictions, and HTML tag handling, to enhance both security and system performance.
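To make the three kinds of validation concrete, here is a minimal Python sketch of a prompt check that applies a length limit, a regex restriction, and HTML tag handling. It does not use Hue's configuration or code. The function name, the limit of 500 characters, and the blocked pattern are all hypothetical values chosen for illustration.

```python
import re

MAX_PROMPT_LENGTH = 500  # hypothetical limit
# Hypothetical restriction: reject prompts asking to drop or truncate objects.
BLOCKED_PATTERN = re.compile(r"\b(drop|truncate)\s+(table|database)\b", re.IGNORECASE)
HTML_TAG = re.compile(r"<[^>]+>")


def validate_prompt(prompt, max_length=MAX_PROMPT_LENGTH, blocked=BLOCKED_PATTERN):
    """Return a cleaned prompt, or raise ValueError if it fails validation."""
    # HTML tag handling: strip tags before the text is sent to the model.
    cleaned = HTML_TAG.sub("", prompt).strip()
    if not cleaned:
        raise ValueError("prompt is empty")
    # Prompt length limit.
    if len(cleaned) > max_length:
        raise ValueError("prompt exceeds the maximum length")
    # Regex restriction.
    if blocked.search(cleaned):
        raise ValueError("prompt contains a blocked pattern")
    return cleaned


if __name__ == "__main__":
    print(validate_prompt("<b>Show the top 5 customers by revenue</b>"))
```

Rejecting oversized or disallowed prompts before they reach the model serves the two goals named above: it limits what can be injected and avoids spending model calls on invalid input.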