What's new in Cloudera Data Warehouse on premises
Review the new features in this release of Cloudera Data Warehouse on premises service.
Cloudera Data Warehouse on premises
- Integrating third-party Certification Manager
- Cert-manager is an open-source tool for Kubernetes that automates the provisioning, management, and renewal of TLS certificates. Its documentation at https://cert-manager.io/docs/ provides comprehensive guidance on installing, configuring, and using cert-manager to secure workloads with trusted X.509 certificates. Cloudera provides out-of-the-box support for Venafi TPP as part of the Cloudera Embedded Container Service installation. By integrating cert-manager, Cloudera Data Services on premises achieves secure communication, reduced manual overhead, and compliance with security standards through automated certificate management. For more information on integrating cert-manager using Venafi TPP in Cloudera Data Warehouse, see Configuring cluster issuer for Certificate Manager.
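As an illustration of what a cert-manager cluster issuer backed by Venafi TPP can look like, the following Python sketch builds a ClusterIssuer manifest and prints it as JSON, which kubectl accepts in place of YAML. The issuer name, zone, TPP URL, and secret name are placeholder assumptions and are not taken from the Cloudera documentation; the supported procedure is the one in Configuring cluster issuer for Certificate Manager.

```python
import json


def venafi_tpp_cluster_issuer(name, zone, tpp_url, credentials_secret):
    """Build a cert-manager ClusterIssuer manifest for Venafi TPP as a dict."""
    return {
        "apiVersion": "cert-manager.io/v1",
        "kind": "ClusterIssuer",
        "metadata": {"name": name},
        "spec": {
            "venafi": {
                # Policy folder in TPP that governs issued certificates.
                "zone": zone,
                "tpp": {
                    "url": tpp_url,
                    # Kubernetes Secret that holds the TPP credentials.
                    "credentialsRef": {"name": credentials_secret},
                },
            }
        },
    }


# All values below are hypothetical placeholders.
manifest = venafi_tpp_cluster_issuer(
    name="venafi-tpp-issuer",
    zone="DevOps\\cert-manager",
    tpp_url="https://tpp.example.com/vedsdk",
    credentials_secret="tpp-auth-secret",
)
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and reviewed before it is applied to a cluster.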
- Quota management improvements to support multiple environments
- As part of this release, Quota Management capabilities have been enhanced to support multiple environments. Previously, root served as the top-level resource for the cluster. With the new changes, each environment now has its own resource pool for the respective data service. When an environment is activated in Cloudera Data Warehouse, a root.<environment-name>.cdw resource pool is automatically created. This newly created resource pool can be selected as the top-level resource pool. For more details, refer to Quota management in Cloudera Data Warehouse on premises.
- Improvements to Impala Autoscaler Dashboard
- The following improvements were introduced for the Impala Autoscaler Dashboard:
- Ability to select the log-level configuration for the autoscaler and autoscaler metrics containers.
- A new “Understanding The Dashboard” page has been added which explains the metrics displayed on the UI and how they are calculated.
- Empty data points that manifest as gaps in the graphs are skipped. Zero values are accurately displayed.
For more information, see About Impala Autoscaling dashboard.
- Ability to view end-of-support information through UI and CDP CLI
- Cloudera Data Warehouse releases reach the end of support every six months. The Cloudera Data Warehouse UI displays whether your deployment is nearing its end of support time or is unsupported, enabling you to plan an upgrade. You can also view the upgrade instructions on the UI. The end of support information is also displayed when you run the list-clusters and describe-clusters CDP CLI commands.
- Streamlined option for downloading Cloudera Data Warehouse diagnostic bundles
- Cloudera Data Warehouse users can now easily download diagnostic bundles with a direct Collect option that reduces the need for prior time interval and log selection adjustments. This update enables faster, more efficient access to relevant diagnostic data. For more information, see Downloading diagnostic bundles and Accessing and generating diagnostic bundles.
- Security improvement: use of Chainguard images
- To enhance security, Cloudera Data Warehouse now uses Chainguard hardened images for its base images, Impala, Hue, and third-party images. The Kubernetes Dashboard is excluded from this change. These changes help address CVEs and offer improved security and stability. For more information, see Chainguard container images.
What's new in Hive on Cloudera Data Warehouse on premises
- Hive Query History Service
- The Hive query history service provides a scalable solution for storing and analyzing historical Hive query data. It captures detailed information about completed queries, such as runtime, accessed tables, errors, and metadata, and stores it in an efficient Iceberg table format. For more information, see Hive query history service.
- OpenTelemetry integration for Hive
- Hive now integrates with OpenTelemetry (OTel) to enhance query monitoring by collecting and exporting telemetry data, including infrastructure and workload metrics. An OTel agent in Cloudera Data Warehouse helps monitor query performance and troubleshoot failures. For more information, see OpenTelemetry support for Hive.
Apache Jira: HIVE-28504
What's new in Impala on Cloudera Data Warehouse on premises
- Improved Cardinality Estimation for Aggregation Queries
- Impala now provides more accurate cardinality estimates for aggregation queries by
considering data distribution, predicates, and tuple tracing. Enhancements include:
- Pre-aggregation Cardinality Adjustments: A new estimation model accounts for duplicate keys across nodes, reducing underestimation errors.
- Predicate-Aware Cardinality Calculation: The planner now considers filtering conditions on group-by columns to refine cardinality estimates.
- Tuple Tracing for Better Accuracy: Improved tuple analysis allows deeper tracking across views and intermediate aggregation nodes.
- Consistent Aggregation Node Stats Computation: The planning process now ensures consistent and efficient recomputation of aggregation node statistics.
- Tuple-Based Cardinality Analysis: Grouping expressions from the same tuple are analyzed to ensure that their combined number of distinct values (NDV) does not exceed the output cardinality of the source PlanNode, reducing overestimation.
- Refined NDV Calculation for CPU Costing: The new approach applies a probabilistic formula to a single global NDV estimate, improving accuracy and reducing overestimation in processing cost calculations.
These improvements lead to better memory estimates, optimized query execution, and more efficient resource utilization.
Apache Jira: IMPALA-2945, IMPALA-13086, IMPALA-13465, IMPALA-13526, IMPALA-13405, IMPALA-13644
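To make the tuple-based cardinality analysis above concrete, the following Python sketch caps the combined number of distinct values of grouping expressions at the output cardinality of their source. It is a simplified illustration of the idea, not the Impala planner's code; the function name and the use of a plain product for the combined estimate are assumptions.

```python
from math import prod


def capped_group_cardinality(ndvs, source_cardinality):
    """Estimate the number of groups for grouping expressions from one tuple.

    The naive combined estimate is the product of the per-expression NDVs.
    It is capped at the number of rows the source plan node is expected to
    produce, because a grouping cannot yield more groups than input rows.
    """
    return min(prod(ndvs), source_cardinality)


# Two columns with 1,000 and 500 distinct values over a 20,000-row source:
# the naive product (500,000) is capped at 20,000.
print(capped_group_cardinality([1_000, 500], 20_000))

# Small NDVs stay below the cap, so the product (50) is used as is.
print(capped_group_cardinality([10, 5], 20_000))
```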
- Cleanup of host-level remote scratch dir on startup and exit
- Impala now removes leftover scratch files from remote storage during startup and shutdown, ensuring efficient storage management. The cleanup targets files in the host-specific directory within the configured remote scratch location.
- Graceful shutdown with query cancellation
- Impala now attempts to cancel running queries before reaching the graceful shutdown deadline, ensuring resources are released properly. The new shutdown_query_cancel_period_s flag controls this behavior. The default value is 60 seconds. If set to a value greater than 0, Impala tries to cancel running queries within this period before forcing shutdown. If the value exceeds 20% of the total shutdown deadline, it is automatically capped to prevent excessive delays. This approach helps prevent unfinished queries and unreleased resources during shutdown. For more information, see Setting Impala Query Cancellation on Shut down.
- Programmatic query termination
- Impala now supports the KILL QUERY statement, enabling you to forcibly terminate queries for better workload management. The KILL QUERY statement cancels and unregisters queries on any coordinator. For more information, see KILL QUERY statement.
- Ability to log and manage Impala workloads
- Cloudera Data Warehouse provides the option to enable logging of Impala queries on an existing Virtual Warehouse or while creating a new Impala Virtual Warehouse. Information about all completed Impala queries is stored in the sys.impala_query_log system table. Information about all actively running and recently completed Impala queries is stored in the sys.impala_query_live system table. Users with appropriate permissions can query these tables using SQL to monitor and optimize the Impala engine. For more information, see Impala workload management.
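The capping rule for shutdown_query_cancel_period_s, described under Graceful shutdown with query cancellation, can be illustrated with a short calculation. The Python sketch below is only a model of the rule as stated in these notes; the function name is hypothetical, and it assumes that both values are given in seconds and that a value of 0 or less disables the cancellation attempt.

```python
def effective_cancel_period_s(cancel_period_s, shutdown_deadline_s):
    """Return the query cancellation window, in seconds, that would apply.

    Assumptions: a period of 0 or less means no cancellation attempt, and
    a period above 20% of the shutdown deadline is capped at that limit.
    """
    if cancel_period_s <= 0:
        return 0
    cap_s = shutdown_deadline_s * 20 / 100  # 20% of the total deadline
    return min(cancel_period_s, cap_s)


# Default period of 60 seconds with a one-hour deadline: used unchanged.
print(effective_cancel_period_s(60, 3600))

# Same period with a 200-second deadline: capped at 20% of the deadline.
print(effective_cancel_period_s(60, 200))
```

With a short shutdown deadline, the cap rather than the configured value determines how long Impala waits for queries to be cancelled.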
What's new in Iceberg on Cloudera Data Warehouse on premises
- Cloudera support for Apache Iceberg version 1.5.2
- The Apache Iceberg component has been upgraded from 1.4.3 to 1.5.2.
- Reading Iceberg Puffin statistics
- Impala supports reading Puffin statistics from current and older snapshots. When there are Puffin statistics for multiple snapshots, Impala chooses the most recent statistics for each column. This means that statistics for different columns may come from different snapshots. If there are both Hive Metastore (HMS) and Puffin statistics for a column, the most recent statistics are used. To determine which is more recent, the impala.lastComputeStatsTime property is used for HMS statistics and the snapshot timestamp is used for Puffin statistics. For more information, see Iceberg Puffin statistics.
- Enhancements to Iceberg data compaction
- The OPTIMIZE TABLE statement is enhanced with the following improvements:
- Supports partition evolution: The Hive and Impala OPTIMIZE TABLE statement, which is used to compact Iceberg tables and optimize them for read operations, is enhanced to support compaction of Iceberg tables with partition evolution.
- Supports data compaction based on file size threshold: The Impala OPTIMIZE TABLE statement has been enhanced to include a FILE_SIZE_THRESHOLD_MB option that enables you to specify the maximum size of files (in MB) that should be considered for compaction.
For more information, see Iceberg data compaction.
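The effect of a file size threshold on compaction can be pictured with a small selection routine. The Python sketch below is a conceptual model only, not Impala's implementation: the function name and file paths are hypothetical, and it assumes that files strictly smaller than the threshold are the ones considered for compaction.

```python
BYTES_PER_MB = 1024 * 1024


def files_to_compact(file_sizes_bytes, file_size_threshold_mb):
    """Return the data files small enough to be considered for compaction.

    file_sizes_bytes maps a file path to its size in bytes. Files at or
    above the threshold are left untouched (an assumption of this sketch).
    """
    limit_bytes = file_size_threshold_mb * BYTES_PER_MB
    return sorted(
        path for path, size in file_sizes_bytes.items() if size < limit_bytes
    )


# Hypothetical data files of an Iceberg table.
data_files = {
    "data/part-0001.parquet": 4 * BYTES_PER_MB,
    "data/part-0002.parquet": 96 * BYTES_PER_MB,
    "data/part-0003.parquet": 512 * BYTES_PER_MB,
}
print(files_to_compact(data_files, file_size_threshold_mb=100))
```

A lower threshold limits the rewrite to the smallest files, which reduces the amount of data that a compaction run has to process.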
- Impala supports the MERGE INTO statement for Iceberg tables
- You can use Impala to run a MERGE INTO statement on an Iceberg table based on the results of a join between a target and source Iceberg table. For more information, see the Iceberg Merge feature.
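The per-column selection rule described under Reading Iceberg Puffin statistics can be sketched as follows. This Python example is a simplified model, not Impala's code: the ColumnStat type and its fields are hypothetical, and the sketch assumes that the HMS impala.lastComputeStatsTime value and the Puffin snapshot timestamp have already been converted to the same time unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnStat:
    column: str
    ndv: int
    timestamp: int  # HMS: impala.lastComputeStatsTime; Puffin: snapshot timestamp
    source: str     # "hms" or "puffin"


def pick_latest_stats(candidates):
    """Keep, for each column, the candidate with the newest timestamp."""
    latest = {}
    for stat in candidates:
        current = latest.get(stat.column)
        if current is None or stat.timestamp > current.timestamp:
            latest[stat.column] = stat
    return latest


# Hypothetical statistics from two Puffin snapshots and from HMS.
candidates = [
    ColumnStat("id", ndv=100, timestamp=1000, source="puffin"),
    ColumnStat("id", ndv=120, timestamp=2000, source="hms"),
    ColumnStat("name", ndv=40, timestamp=1500, source="puffin"),
    ColumnStat("name", ndv=50, timestamp=3000, source="puffin"),
]
for column, stat in pick_latest_stats(candidates).items():
    print(column, stat.source, stat.ndv)
```

In this example the id column takes its statistics from HMS and the name column from the newer Puffin snapshot, which shows how statistics for different columns can come from different sources.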
What's new in Hue on Cloudera Data Warehouse on premises
- Enhanced AI Integration in Hue SQL AI Assistant
- The Hue SQL AI Assistant now supports Cloudera AI Workbench and Cloudera AI Inference service. These integrations enable the Hue SQL AI Assistant to use private models hosted within Cloudera-managed infrastructure, which improves security and privacy while leveraging GenAI for SQL-related tasks in Hue.
- Cloudera AI Workbench: This enables you to securely deploy and run your own models within a virtual private cloud. This configuration enhances control and privacy within your environment. For more information, see Configure SQL AI Assistant using Cloudera AI Workbench.
- Cloudera AI Inference service: This provides a production-grade serving environment for hosting predictive and generative AI models. This service simplifies model deployment and maintenance. For more information, see Configure SQL AI Assistant using Cloudera AI Inference service.
- Hue SQL AI: Multi-database querying now supported
- The Hue SQL AI Assistant now supports multi-database querying, allowing you to
retrieve data from multiple databases simultaneously. This enhancement simplifies
managing large datasets across different systems and enables seamless cross-database
queries.
- Support for cross-database queries.
- Ability to retrieve and combine data from multiple sources in a single query.
- User Input Validation for Hue SQL AI
- Hue SQL AI now supports secure and optimized integration with large language models (LLMs). You can now configure user input validation, including prompt length limits, regular expression restrictions, and HTML tag handling, to enhance both security and system performance.
For more information, see User Input Validation for Hue SQL AI.
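The kinds of checks listed above (prompt length limits, regular expression restrictions, and HTML tag handling) can be illustrated with a generic validation routine. The Python sketch below is not Hue's implementation and does not use Hue configuration names; the function, its parameters, and the default limit are hypothetical, and escaping is only one possible way to handle HTML tags.

```python
import html
import re


def validate_prompt(prompt, max_length=500, blocked_pattern=None, escape_html=True):
    """Validate a user prompt and return a sanitized copy.

    Raises ValueError when the prompt is too long or matches the blocked
    regular expression. HTML special characters are escaped on request.
    """
    if len(prompt) > max_length:
        raise ValueError(f"prompt exceeds {max_length} characters")
    if blocked_pattern and re.search(blocked_pattern, prompt, re.IGNORECASE):
        raise ValueError("prompt matches a restricted pattern")
    return html.escape(prompt) if escape_html else prompt


# A prompt with markup is accepted, and the tags are escaped.
print(validate_prompt("show <b>sales</b> by region"))

# A prompt that matches the restricted pattern is rejected.
try:
    validate_prompt("drop table sales", blocked_pattern=r"\bdrop\s+table\b")
except ValueError as err:
    print("rejected:", err)
```

Rejecting oversized or restricted prompts before they reach the model avoids spending LLM calls on input that would be discarded anyway.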