Cloudera AI on premises 1.5.5 CHF1

Review the features, fixes, and known issues in the Cloudera AI on premises 1.5.5 Cumulative hotfix 1 release.

For information about repository locations, see Repository Locations for Data Services on premises 1.5.5 CHF1.

Fixed issues in 1.5.5 CHF1

Review the fixed issues in the Cloudera AI 1.5.5 Cumulative hotfix 1 release.

DSE-37313: Dashboards and user_events tables scalability to avoid web pods OOM and performance issues

When the dashboards and user_events tables grew to millions of records, they caused web pod out-of-memory (OOM) restarts and degraded UI performance.

This issue is now resolved by archiving old records into the dashboards_archive table, which reduces the size of the dashboards table. Active and archived data are now managed separately, so queries against active dashboards run faster and the user experience is more consistent. For more details, see Optimized queries with Dashboards Archive table.
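
The archiving approach can be illustrated with a minimal sketch that uses an in-memory SQLite database. Only the dashboards and dashboards_archive table names come from this note. The column names (`id`, `created_at`), the date-based cutoff rule, and the `archive_old_dashboards` helper are assumptions for illustration and do not reflect the actual Cloudera AI schema or implementation.

```python
import sqlite3


def archive_old_dashboards(conn: sqlite3.Connection, cutoff: str) -> int:
    """Move rows older than `cutoff` from dashboards to dashboards_archive.

    Returns the number of rows archived. Column names are hypothetical.
    """
    with conn:  # one transaction: the copy and the delete commit together
        conn.execute(
            "INSERT INTO dashboards_archive (id, created_at) "
            "SELECT id, created_at FROM dashboards WHERE created_at < ?",
            (cutoff,),
        )
        cur = conn.execute(
            "DELETE FROM dashboards WHERE created_at < ?", (cutoff,)
        )
        return cur.rowcount


def make_demo_db() -> sqlite3.Connection:
    """Create a small demo database with three dashboard rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dashboards (id INTEGER PRIMARY KEY, created_at TEXT)")
    conn.execute(
        "CREATE TABLE dashboards_archive (id INTEGER PRIMARY KEY, created_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO dashboards (id, created_at) VALUES (?, ?)",
        [(1, "2023-01-01"), (2, "2023-06-01"), (3, "2025-01-01")],
    )
    conn.commit()
    return conn
```

In this sketch, calling `archive_old_dashboards(make_demo_db(), "2024-01-01")` moves the two 2023 rows to the archive table and leaves one row in the active table, so queries against the active table scan less data.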

DSE-42826: Creator filter shows a user multiple times

The creator filter in the Project job list displayed duplicate entries. This issue occurred when multiple users shared the same full name but had different usernames. The problem is now resolved by implementing a Map-based method to ensure that each creator appears only once in the dropdown, which improves the usability of the creator filter.
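
A Map-based deduplication of this kind might look like the following sketch. The note does not say which key the fix uses. Keying on the username is an assumption, and the `Creator` type and `unique_creators` function are hypothetical names.

```python
from typing import Dict, Iterable, List, TypedDict


class Creator(TypedDict):
    username: str
    full_name: str


def unique_creators(creators: Iterable[Creator]) -> List[Creator]:
    """Return one entry per username, preserving first-seen order."""
    by_username: Dict[str, Creator] = {}
    for creator in creators:
        # setdefault keeps the first entry seen for a given username
        by_username.setdefault(creator["username"], creator)
    return list(by_username.values())
```

Two users who share a full name but have different usernames remain separate entries, while repeated occurrences of the same username collapse into one.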

DSE-40029: The job timeout is not of a valid type

The job timeout field could not be cleared once set, leading to update errors. This issue is now resolved by converting null timeout updates to 0, which the backend interprets as no timeout restriction. This fix ensures that users can reset the timeout value directly through the UI, providing greater flexibility and control over job timeout configurations.
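
The null-to-zero conversion can be sketched as follows. The function name `normalize_timeout` is hypothetical. The only behavior taken from this note is that a cleared (null) timeout is sent as 0, which the backend interprets as no timeout restriction.

```python
from typing import Optional


def normalize_timeout(timeout: Optional[int]) -> int:
    """Map a cleared (None) timeout to 0, which means "no timeout" here."""
    if timeout is None:
        return 0
    return timeout
```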

DSE-45485: Update tenant if the tenant is an empty string in model registry table

If the tenant field in the model registry table was empty in 1.5.4, it led to query failures after upgrading to version 1.5.5. This issue is now resolved by implementing a migration script that sets the empty tenant fields to the appropriate values, ensuring smooth operation after the upgrade.
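
A migration of this shape could be sketched as below, using SQLite for illustration. The table name `registered_models`, the `tenant` column layout, and the replacement value passed in by the caller are all assumptions. The note does not state the real schema or which value the migration script writes.

```python
import sqlite3


def backfill_empty_tenant(conn: sqlite3.Connection, tenant: str) -> int:
    """Set `tenant` on rows where it is an empty string; return rows updated."""
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "UPDATE registered_models SET tenant = ? WHERE tenant = ''",
            (tenant,),
        )
        return cur.rowcount
```

Rows that already have a tenant value are left untouched, so the update is safe to run more than once.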

DSE-25966: Memory leak detected in model proxy

A memory leak in the model proxy caused Kubernetes out-of-memory (OOM) termination and brief model serving outages. This issue is now resolved by replacing the custom cache with the APIv2 cache library, properly closing database calls, eliminating heavy synchronous cache calls, and implementing daily cache pruning. These changes ensure improved stability and performance of the model proxy.

DSE-45050: Large file job attachment crashes web

Large file attachments, ranging from approximately 500 MB to 10 GB, caused the web container to crash due to unhandled gRPC deadline-exceeded errors. This issue is now resolved by implementing a file size limit and graceful error handling. The system now refuses Virtual File System (VFS) calls for files that exceed a predefined size limit, which prevents the processing of excessively large attachments. Errors related to large file attachments are handled gracefully, so the web container remains stable and operational when such files are encountered.
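
The combination of an up-front size check and graceful error handling can be sketched as follows. The limit value, the `AttachmentResult` type, and the `check_attachment` function are hypothetical. The note does not state the actual limit or how the product reports the error.

```python
import os
from dataclasses import dataclass
from typing import Optional

# Hypothetical limit; the real predefined value is not stated in this note.
MAX_ATTACHMENT_BYTES = 100 * 1024 * 1024


@dataclass
class AttachmentResult:
    ok: bool
    error: Optional[str] = None


def check_attachment(path: str, max_bytes: int = MAX_ATTACHMENT_BYTES) -> AttachmentResult:
    """Refuse oversized files up front and report errors instead of raising."""
    try:
        size = os.path.getsize(path)
    except OSError as exc:
        # Unreadable or missing file: return an error rather than crash
        return AttachmentResult(ok=False, error=f"cannot read attachment: {exc}")
    if size > max_bytes:
        return AttachmentResult(
            ok=False,
            error=f"attachment is {size} bytes; limit is {max_bytes} bytes",
        )
    return AttachmentResult(ok=True)
```

The caller inspects `ok` and `error` and can show a message to the user, so an oversized file never reaches the expensive read path.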

DSE-41316: Fix Add model resource profile dropdown

The Add Model Resource Profile dropdown option in the UI was not functioning correctly due to the broadcast function not being passed to the view layer. As a result, POST arguments failed to update despite changes made in the UI.

This issue is now resolved, ensuring that the dropdown properly updates POST arguments in response to UI changes, restoring its intended functionality.

DSE-45141: Project creation fail - Git Clone

Project creation was failing due to Git clone permission errors on filesystems mounted over Network File System (NFS). The root cause was Git attempting to read temporary pack files directly on NFS, which triggered permission issues.

This issue is now resolved by modifying the process to first clone Git repositories to the local filesystem. The cloned files are then copied to the NFS path, effectively avoiding NFS-related permission problems and ensuring successful project creation.
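
The clone-then-copy flow can be sketched as follows. This is an illustration under stated assumptions, not the product's implementation: the function names and the injectable `clone` parameter are hypothetical, and the `git` command-line tool is assumed to be available when the default `git_clone` is used.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Callable


def git_clone(repo_url: str, dest: Path) -> None:
    """Clone with the git CLI; raises CalledProcessError on failure."""
    subprocess.run(["git", "clone", repo_url, str(dest)], check=True)


def clone_via_local_disk(
    repo_url: str,
    nfs_dest: Path,
    clone: Callable[[str, Path], None] = git_clone,
) -> None:
    """Clone into a local temporary directory, then copy the result to nfs_dest."""
    with tempfile.TemporaryDirectory() as tmp:
        local_repo = Path(tmp) / "repo"
        clone(repo_url, local_repo)  # temporary pack files stay on local disk
        # dirs_exist_ok lets the target directory already exist (Python 3.8+)
        shutil.copytree(local_repo, nfs_dest, dirs_exist_ok=True)
```

Because Git only touches the local temporary directory, the NFS mount sees plain file copies and never hosts Git's temporary pack files.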

DSE-44083: Web requests to other services have malformed UUID in logs

Web requests to other services were logging malformed Universally Unique Identifiers (UUIDs), which complicated debugging efforts. The issue was traced back to request IDs from web pods being trimmed in operator logs due to a 30-character limit on the `contextId` element.

This issue is now resolved by increasing the length limit of the `contextId` element to 36 characters, ensuring that full UUIDs are preserved in the logs. This fix improves log accuracy and facilitates more effective debugging.
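
The effect of the two limits can be shown with a short sketch. A UUID in its canonical text form is 36 characters long, so a 30-character field cuts off the last six. The helper names below are hypothetical and only simulate a fixed-length field.

```python
import uuid

OLD_LIMIT = 30  # previous contextId length limit
NEW_LIMIT = 36  # new limit; equals the canonical UUID string length


def clip_context_id(request_id: str, limit: int) -> str:
    """Simulate a fixed-length field by truncating to `limit` characters."""
    return request_id[:limit]


def is_valid_uuid(value: str) -> bool:
    """Return True only for a well-formed, canonical UUID string."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except ValueError:
        return False
```

For the request ID `123e4567-e89b-12d3-a456-426614174000`, the old limit yields `123e4567-e89b-12d3-a456-426614`, which no longer parses as a UUID, while the new limit preserves the full value.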

DSE-44088: Operator pod start failure log missing ID

Operator pod failure logs lacked critical identifiers, such as the engineId identifier and request UUID, which complicated error identification and traceability. This issue is now resolved by enhancing the failure logs to include both the engineId and request UUID identifiers for engine creation by the operator pod. This improvement provides better error traceability and simplifies debugging.

DSE-44091: StartJobRun Kubernetes client failure error lost during processing

During the StartJobRun process, Kubernetes client failure errors were not correctly reported, which produced misleading success messages. Failure responses from the Kubernetes client were not properly translated because the error handling in the GetFailureResponse function lacked robustness. This issue is now resolved: engine start failures are accurately reported with detailed error data, which eliminates misleading success messages and improves error transparency.