Cloudera AI on premises 1.5.4 SP2 CHF1
Review the features, fixes, and known issues in the Cloudera AI 1.5.4 SP2 Cumulative hotfix 1 release.
Fixed issues in 1.5.4 SP2 CHF1
Review the fixed issues in the Cloudera AI 1.5.4 SP2 Cumulative hotfix 1 release.
- DSE-44586: Disabled Spark ML Runtime addons are reenabled
-
Previously, Spark ML Runtime addons were enabled automatically, which was unintended behavior. This issue is resolved, and Spark ML Runtime addons can only be enabled when explicitly configured by the administrator.
- DSE-41027: Upgrade path from 1.5.0 -> 1.5.2 -> 1.5.4 -> 1.5.4 CHF3 fails with missing key
-
Due to a newly introduced strict validation check, a required key was missing for certain upgrade paths, resulting in upgrade failures. Users upgrading to version 1.5.4 might have encountered this issue if their original Cloudera Data Services on premises version at the time of initial installation was 1.5.0 or 1.5.1.
This issue has been resolved.
- DSE-40225: Projects/cdn/XXXX-POD folders are not cleaned on NFS
-
This issue addressed a bug where `projects/cdn/XXXX-POD` folders were not being cleaned up on the Network File System (NFS) after workload termination, resulting in the accumulation of stale directories. These folders, created per workload and mounted as writable volumes, were intended to persist only until the session or project was deleted but were never automatically removed.
- DSE-45050: Large file job attachment crashes web
-
Uploading excessively large files (for example, around 500 MB) through the UI caused the web container to crash. The crash was triggered by an uncaught asynchronous exception in the `send email` method, which was not properly handled when processing attachments. The problem is resolved, ensuring that large file uploads no longer result in web container crashes.
- DSE-43104: Timezones in Cloudera AI on premises cause pods to be killed with exit code 34
-
In Kubernetes clusters on premises, local timezones can be configured. This caused timezone discrepancies in engine pods, in which timestamp fields such as `scheduling_at`, `starting_at`, `running_at`, and `finished_at` were inconsistently stored and read across the Cloudera AI infrastructure.
The issue is now resolved by updating the dashboard timestamp fields to store all timestamps with explicit timezone information and by enforcing UTC for all database writes. This fix ensures consistent timestamp handling and prevents premature pod termination due to timezone offsets.
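The dashboard fields listed above are internal, so the following minimal Go sketch only illustrates the general pattern behind the fix: converting every timestamp to UTC with an explicit offset before it is persisted, instead of relying on the node's local timezone. The helper name `recordTimestamp` is hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// recordTimestamp normalizes a timestamp to UTC and serializes it with an
// explicit offset (RFC 3339), so readers in any timezone see the same value.
func recordTimestamp(t time.Time) string {
	return t.UTC().Format(time.RFC3339)
}

func main() {
	local := time.Now() // wall-clock time in the node's local timezone
	fmt.Println("stored as:", recordTimestamp(local))
}
```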
- DSE-44091: StartJobRun Kubernetes client failure error lost during processing
-
Kubernetes client failure errors during `StartJobRun` API calls were not properly reported, leading to misleading success logs. The operator pod failed to correctly translate certain asynchronous action errors, resulting in false `Finish StartJobRun` success messages even when pod start failures occurred.
The root cause was insufficient error handling in the `GetFailureResponse` method, which allowed nil errors to bubble up and obscure the actual failures. This issue is now resolved, ensuring accurate error reporting and preventing misleading success logs.
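The operator's actual `GetFailureResponse` logic is not shown here; the Go sketch below is only a generic illustration of the pattern described above, with hypothetical names (`startPod`, `startJobRunBuggy`, `startJobRunFixed`). It contrasts dropping an error from an asynchronous action, which yields a misleading success log, with returning the wrapped error so the real failure is reported.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// startPod stands in for the asynchronous action that can fail.
func startPod() error {
	return errors.New("pod start failed: image pull back-off")
}

// Buggy pattern: the error is noticed but never propagated, so the caller
// logs a false "Finish StartJobRun" success message.
func startJobRunBuggy() error {
	if err := startPod(); err != nil {
		_ = fmt.Errorf("translated: %w", err) // error translated, then dropped
	}
	return nil // nil bubbles up even though the pod never started
}

// Fixed pattern: the error is wrapped with context and returned to the caller.
func startJobRunFixed() error {
	if err := startPod(); err != nil {
		return fmt.Errorf("StartJobRun: %w", err)
	}
	return nil
}

func main() {
	if err := startJobRunBuggy(); err == nil {
		log.Println("Finish StartJobRun") // misleading success
	}
	if err := startJobRunFixed(); err != nil {
		log.Println("StartJobRun failed:", err) // accurate failure report
	}
}
```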
- DSE-44083: Web service requests to other services have malformed UUID in logs
-
This issue addressed a bug where web service requests logged malformed Universally Unique Identifiers (UUIDs), making debugging difficult. The problem occurred because the `contextId` UUID field was limited to 30 characters, causing 36-character `requestId` UUIDs to be truncated in operator logs.
The issue is now resolved by increasing the `contextId` length to 36 characters, ensuring that full UUIDs are preserved in logs and improving log accuracy for debugging.
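As a generic illustration (not the actual web service code), the Go sketch below shows why a 30-character field cannot hold a canonical UUID: its textual form is 36 characters (32 hexadecimal digits plus 4 hyphens), so any shorter limit truncates it.

```go
package main

import "fmt"

// A canonical UUID string is 36 characters long.
const requestID = "123e4567-e89b-12d3-a456-426614174000"

// truncate mimics a log context field capped at maxLen characters.
func truncate(s string, maxLen int) string {
	if len(s) > maxLen {
		return s[:maxLen]
	}
	return s
}

func main() {
	fmt.Println("30-char field:", truncate(requestID, 30)) // UUID is cut short
	fmt.Println("36-char field:", truncate(requestID, 36)) // full UUID preserved
}
```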
- DSE-44088: Operator pod start failure log missing ID
-
This issue addressed the lack of detailed error logging for operator pod failures, where logs did not include the `engineId` UUID or the `request` UUID, making issues difficult to trace and debug.
The fix now includes both the `engineId` UUID and the `request` UUID in failure logs, significantly improving traceability and aiding in debugging operator pod failures.
- DSE-41733: Spark logs not cleaned up by Livelog cleaner
-
Spark logs were not cleaned up by the Livelog cleaner because the `getCleanableEngines` internal API did not include Spark executors. This issue is now resolved: the `getCleanableEngines` API now includes Spark executors, ensuring that Spark logs are properly cleaned up by the Livelog cleaner.
- DSE-42231: When a workbench experiences heavy usage, the new Session page starts to load slowly
-
The new Session page and related workload creation pages, such as new Jobs and Applications, experienced significant slowdowns, with load times exceeding 60 seconds and frequent timeouts.
The issue is now resolved by removing unnecessary `/usage` API calls from workload creation pages, including Sessions, Jobs, Applications, and so on, ensuring faster load times and improved responsiveness.
- DSE-40029: The job timeout is not of a valid type
-
Previously, the job timeout value could not be cleared once set, and attempting to update a job with an empty timeout field resulted in a `The job timeout is not of a valid type` validation error. This issue is now resolved, and the timeout field can now be cleared as expected.
- DSE-43950: Workbench installation is failing as buildkit pod is crashing due to port bind issues
-
Buildkitd pods in Cloudera AI Workbench could intermittently fail with a `CrashLoopBackOff` error because BuildKit port 1234 was not properly released during pod restarts or was occupied by another process. This resulted in errors such as: `buildkitd: listen tcp 0.0.0.0:1234: bind: address already in use`.
This issue is now resolved, and Buildkitd pods no longer enter a crash loop state due to port binding issues.
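The BuildKit-side change itself is internal, so purely as an illustration of the failure mode, the Go sketch below uses a hypothetical `listenWithRetry` helper that retries binding a port such as 1234 instead of exiting immediately (and therefore crash-looping) while the address is still held by a previous instance.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// listenWithRetry keeps trying to bind addr until it succeeds or the
// attempts run out, instead of exiting on the first bind failure.
func listenWithRetry(addr string, attempts int) (net.Listener, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		ln, err := net.Listen("tcp", addr)
		if err == nil {
			return ln, nil
		}
		lastErr = err // typically "bind: address already in use"
		time.Sleep(2 * time.Second)
	}
	return nil, fmt.Errorf("could not bind %s: %w", addr, lastErr)
}

func main() {
	ln, err := listenWithRetry("0.0.0.0:1234", 5)
	if err != nil {
		fmt.Println(err)
		return
	}
	defer ln.Close()
	fmt.Println("listening on", ln.Addr())
}
```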
- DSE-44700: Fixing web pod crash while fetching data connections from Cloudera Base cluster
-
This issue addressed a bug that prevented the use of underscores in data connection names. The fix now allows underscores to be used, ensuring greater flexibility in naming data connections.
- DSE-43774: Reconciler unresponsiveness issue
-
Previously, a logging-pipeline issue could cause the reconciler and other microservices that use the Cloudera customized tee binary to freeze. The issue surfaced when the reconciler tried to log a very large Kubernetes object from a `DeletedFinalStateUnknown` pod event. The object itself was harmless, but the single log line it produced exceeded 64 KB, triggering a hard line-size limit in the customized tee, which blocked the log stream and stalled the entire service.
The issue is now resolved: the tee can now stream log lines of any length, removing the 64 KB constraint and preventing similar hangs across all components.
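The customized tee binary is internal to Cloudera, but the class of bug is easy to reproduce with the Go standard library: `bufio.Scanner` rejects any line longer than its default 64 KB token limit, while a plain buffered reader streams the line regardless of length. The sketch below is illustrative only.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	// A single log line well above bufio's default 64 KB token limit,
	// similar to the oversized DeletedFinalStateUnknown event line.
	huge := strings.Repeat("x", 2*bufio.MaxScanTokenSize) + "\n"

	// Limited: a default bufio.Scanner gives up with ErrTooLong.
	sc := bufio.NewScanner(strings.NewReader(huge))
	for sc.Scan() {
	}
	fmt.Fprintln(os.Stderr, "scanner error:", sc.Err()) // bufio.Scanner: token too long

	// Unlimited: bufio.Reader streams the full line, whatever its length.
	r := bufio.NewReader(strings.NewReader(huge))
	line, err := r.ReadString('\n')
	fmt.Println("reader read bytes:", len(line), "err:", err)
}
```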