Cloudera AI on premises 1.5.4 SP2 CHF1

Review the features, fixes, and known issues in the Cloudera AI 1.5.4 SP2 Cumulative hotfix 1 release.

Fixed issues in 1.5.4 SP2 CHF1

Review the fixed issues in the Cloudera AI 1.5.4 SP2 Cumulative hotfix 1 release.

DSE-44586: Disabled Spark ML Runtime addons are reenabled

Previously, Spark ML Runtime addons were enabled automatically, which was unintended behavior. This issue is resolved, and Spark ML Runtime addons can only be enabled when explicitly configured by the administrator.

DSE-41027: Upgrade path from 1.5.0 -> 1.5.2 -> 1.5.4 -> 1.5.4 CHF3 fails with missing key

Due to a newly introduced strict validation check, certain upgrade paths were missing the required key, which resulted in upgrade failures. Users upgrading to version 1.5.4 might have encountered this issue if their original Cloudera Data Services on premises version at the time of initial installation was 1.5.0 or 1.5.1.

This issue has been resolved.

DSE-40225: Projects/cdn/XXXX-POD folders are not cleaned on NFS

This fix addressed a bug where projects/cdn/XXXX-POD folders were not cleaned up on the Network File System (NFS) after workload termination, resulting in the accumulation of stale directories. These folders, which are created per workload and mounted as writable volumes, were intended to persist only until the session or project was deleted, but were never automatically removed.

DSE-45050: Large file job attachment crashes web

Uploading excessively large files (for example, around 500 MB) through the UI caused the web container to crash. The crash was triggered by an uncaught asynchronous exception in the `send email` method, which was not properly handled when processing attachments. The problem is resolved, and large file uploads no longer crash the web container.

DSE-43104: Timezones in Cloudera AI on premises cause pods to be killed with exit code 34

In Kubernetes clusters on premises, local timezones can be configured. This caused timezone discrepancies in engine pods, where timestamp fields such as scheduling_at, starting_at, running_at, and finished_at were stored and read inconsistently across the Cloudera AI infrastructure.

The issue is now resolved by updating the dashboard timestamp fields to store all timestamps with explicit timezone information and enforcing UTC for all database writes. This fix ensures consistent timestamp handling and prevents premature pod termination due to timezone offsets.
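
The pattern described above can be illustrated with a minimal Python sketch (not the actual dashboard code): timestamps are generated and normalized as timezone-aware UTC values before they reach the database, so reads are unambiguous regardless of the node's local timezone.

```python
from datetime import datetime, timezone

def utc_now() -> datetime:
    """Return a timezone-aware timestamp in UTC, independent of the node's local timezone."""
    return datetime.now(timezone.utc)

def to_utc(ts: datetime) -> datetime:
    """Normalize a timestamp to UTC before it is persisted.

    Naive timestamps (no tzinfo) are assumed to already represent UTC.
    """
    if ts.tzinfo is None:
        return ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc)

# Fields such as scheduling_at or finished_at are stored with explicit
# timezone information, for example '2024-05-01T12:00:00+00:00'.
scheduling_at = utc_now()
print(scheduling_at.isoformat())
```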

DSE-44091: StartJobRun Kubernetes client failure error lost during processing

Kubernetes client failure errors during StartJobRun API calls were not properly reported, leading to misleading success logs. The operator pod failed to correctly translate certain asynchronous action errors, resulting in false Finish StartJobRun success messages even when pod start failures occurred.

The root cause was identified as insufficient error handling in the GetFailureResponse method, which allowed nil errors to bubble up and obscure the actual failures. This issue is now resolved, ensuring accurate error reporting and preventing misleading success logs.

DSE-44083: Web service requests to other services have malformed UUID in logs

This fix addressed a bug where web service requests logged malformed Universally Unique Identifiers (UUIDs), making debugging difficult. The problem occurred because the contextId field was limited to 30 characters, causing the 36-character requestId UUID to be truncated in operator logs.

The issue is now resolved by increasing the contextId length to 36 characters, ensuring full UUIDs are preserved in logs and improving log accuracy for debugging purposes.
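
For context, a canonical UUID string is 36 characters long (32 hexadecimal digits plus 4 hyphens), so a 30-character field necessarily truncates it. The snippet below is only an illustration of that length mismatch, not Cloudera code.

```python
import uuid

request_id = str(uuid.uuid4())
print(len(request_id))    # 36: 32 hexadecimal digits plus 4 hyphens
print(request_id[:30])    # a 30-character field keeps only this prefix,
                          # which is the malformed ID seen in the operator logs
```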

DSE-44088: Operator pod start failure log missing ID

This fix addressed the lack of detailed error logging for operator pod failures: logs did not include the engineId UUID or the request UUID, making issues difficult to trace and debug.

With the fix, failure logs now include both the engineId UUID and the request UUID, significantly improving traceability and aiding the debugging of operator pod failures.
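
The general idea can be sketched as follows (a hypothetical Python logger, not the operator's actual implementation): both identifiers are attached to every failure log line, so a single entry ties the pod start failure back to the originating request.

```python
import logging

logger = logging.getLogger("operator")
logging.basicConfig(level=logging.INFO)

def log_pod_start_failure(engine_id: str, request_id: str, err: Exception) -> None:
    # Both identifiers travel with the failure message, so one log line is
    # enough to correlate the pod start failure with the originating request.
    logger.error("pod start failed: engineId=%s requestId=%s error=%s",
                 engine_id, request_id, err)

log_pod_start_failure("a1b2c3", "123e4567-e89b-12d3-a456-426614174000",
                      RuntimeError("pod failed to start"))
```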

DSE-41733: Spark logs not cleaned up by Livelog cleaner

Spark logs were not cleaned up by the Livelog cleaner because the getCleanableEngines internal API did not include Spark executors. This issue is now resolved: the getCleanableEngines API includes Spark executors, and Spark logs are properly cleaned up by the Livelog cleaner.

DSE-42231: When a workbench experiences heavy usage, the new Session page starts to load slowly

The new Session page and related workload creation pages, such as the new Jobs and Applications pages, experienced significant slowdowns, with load times exceeding 60 seconds and frequent timeouts.

The issue is now resolved by removing unnecessary /usage API calls from workload creation pages, including Sessions, Jobs, Applications, and so on, ensuring faster load times and improved responsiveness.

DSE-40029: The job timeout is not of a valid type

Previously, the job timeout value could not be cleared once it was set, and attempting to update a job with an empty timeout field resulted in a "The job timeout is not of a valid type" validation error. This issue is now resolved, and the timeout field can be cleared as expected.

DSE-43950: Workbench installation is failing as buildkit pod is crashing due to port bind issues

Buildkitd pods in Cloudera AI Workbench could intermittently fail with a CrashLoopBackOff error because BuildKit port 1234 was not properly released during pod restarts or was occupied by another process. This resulted in errors such as:
buildkitd: listen tcp 0.0.0.0:1234: bind: address already in use.

This issue is now resolved, and Buildkitd pods no longer enter a crash loop state due to port binding issues.
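
For illustration only, the underlying error is the standard symptom of binding a listener to a TCP port that another process, or a socket left over from a previous instance, still holds. The following generic Python sketch reproduces the same condition; it is not related to BuildKit itself.

```python
import socket

def try_bind(port: int) -> None:
    """Attempt to listen on a TCP port and report whether it is already held."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("0.0.0.0", port))
        sock.listen()
        print(f"port {port} is free")
    except OSError as err:
        # e.g. [Errno 98] Address already in use, the same condition
        # reported by buildkitd in the log line above
        print(f"port {port} is busy: {err}")
    finally:
        sock.close()

try_bind(1234)
```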

DSE-44700: Web pod crash while fetching data connections from Cloudera Base cluster

This fix addressed a bug that prevented the use of underscores in data connection names. Underscores are now allowed, ensuring greater flexibility when naming data connections.

DSE-43774: Reconciler unresponsiveness issue

Previously, a logging-pipeline issue could cause the reconciler and other microservices that use Cloudera's customized tee binary to freeze. The issue surfaced when the reconciler tried to log a very large Kubernetes object from a DeletedFinalStateUnknown pod event. The object itself was harmless, but the single log line it produced exceeded 64 KB, triggering a hard line-size limit in the customized tee, which blocked the log stream and stalled the entire service.

The issue is now resolved: the customized tee can stream log lines of any length, removing the 64 KB constraint and preventing similar hangs across all components.
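
A minimal sketch of the underlying idea, assuming a simple tee-like reader in Python rather than Cloudera's actual binary: forwarding the stream in fixed-size chunks, instead of buffering one full line at a time, lets arbitrarily long log lines pass through without stalling.

```python
import sys

def tee(src, sinks, chunk_size=64 * 1024):
    """Copy a byte stream to every sink in fixed-size chunks.

    Forwarding chunk by chunk, instead of buffering one full line at a time,
    means a single log line larger than chunk_size (such as a dumped
    Kubernetes object) cannot overflow a line buffer or stall the stream.
    """
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        for sink in sinks:
            sink.write(chunk)
            sink.flush()

if __name__ == "__main__":
    # Mirror stdin to stdout and a log file, similar to `tee /tmp/out.log`.
    with open("/tmp/out.log", "ab") as log_file:
        tee(sys.stdin.buffer, [sys.stdout.buffer, log_file])
```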