Known Issues

You might run into some known issues while using Cloudera AI on premises.

DSE-6499: Using dollar character in environment variables in Cloudera AI

Environment variables containing the dollar ($) character are not parsed correctly by Cloudera AI. For example, if you set PASSWORD="pass$123" in the project environment variables and then read it with the echo command, the output is pass23, because $1 is interpreted as a variable reference and expands to an empty string.

Workaround: Use one of the following commands to print the $ sign:
echo 24 | xxd -r -p
or
echo JAo= | base64 -d
To embed the $ character in an environment variable value, wrap one of these commands in command substitution using $() or ``. For example, if you want to set the environment variable to ABC$123, specify:
ABC$(echo 24 | xxd -r -p)123
or
ABC`echo 24 | xxd -r -p`123
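As a quick check, the substitution above can be verified in a session terminal. DEMO_VAR is a hypothetical variable name used only for this sketch:

```shell
# Goal: set an environment variable whose value is ABC$123 without typing "$".
# "JAo=" is base64 for the bytes 0x24 0x0A ("$" plus a newline); command
# substitution strips the trailing newline, leaving just "$".
export DEMO_VAR="ABC$(echo JAo= | base64 -d)123"
echo "$DEMO_VAR"   # prints: ABC$123

# Equivalent with xxd (hex 24 is the ASCII code for "$"):
#   export DEMO_VAR="ABC$(echo 24 | xxd -r -p)123"
```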
DSE-37827: Jupyter's RTC extension throws an error and notebooks become unusable

In certain cases, Jupyter’s RTC (Real Time Collaboration) extension may raise errors claiming either that other sessions are active or that other processes have accessed the notebook files. After these errors appear, the notebook becomes unusable and the Cloudera AI session must be restarted.

Workaround:

You must disable the Jupyter RTC extension by performing the following tasks:
  1. Create a Session.
  2. Open the terminal.
  3. Create the /home/cdsw/.jupyter/labconfig directory if it does not exist, then enter nano /home/cdsw/.jupyter/labconfig/page_config.json.
  4. Add the following lines to the file:
    {
      "disabledExtensions": {
        "@jupyter/collaboration-extension": true
      },
      "lockedExtensions": {
        "@jupyter/collaboration-extension": true
      }
    }
    
  5. Save and close the file.
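The steps above can also be performed non-interactively from the session terminal. This sketch writes the same page_config.json in one step (in a Cloudera AI session, $HOME is /home/cdsw):

```shell
# Disable and lock Jupyter's RTC extension by writing the page config directly.
mkdir -p "$HOME/.jupyter/labconfig"
cat > "$HOME/.jupyter/labconfig/page_config.json" <<'EOF'
{
  "disabledExtensions": {
    "@jupyter/collaboration-extension": true
  },
  "lockedExtensions": {
    "@jupyter/collaboration-extension": true
  }
}
EOF
```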
DSE-36718: Disable auto synchronization feature for users and teams

The automated team and user synchronization feature is disabled. Newly installed or upgraded workbenches do not have the automatic synchronization option in the Cloudera AI UI.

Workaround: None.

DSE-36759: AMPs and Feature Announcement sections do not work in NTP setups

Cloudera AI on premises setups that use a Non Transparent Proxy (NTP) do not function properly, which affects Cloudera Accelerators for Machine Learning Projects (AMPs) and Feature Announcements. The home page freezes, the Feature Announcements section displays an error message, and AMPs do not load.

Workaround:

To avoid the home page freeze, copy the following environment variables from the web deployment and add them to the environment section of the API deployments:
  • HTTP_PROXY
  • HTTPS_PROXY
  • NO_PROXY
  • http_proxy
  • https_proxy
  • no_proxy
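Assuming kubectl access to the workbench namespace, the copy can be scripted. The deployment names used below ("web" as the source and "api" as the target) and the namespace are assumptions; verify the real names with kubectl get deploy:

```shell
# Sketch: copy the proxy environment variables from one deployment to another.
# Usage: copy_proxy_env <namespace> <source-deploy> <target-deploy>
copy_proxy_env() {
  ns=$1; src=$2; dst=$3
  for var in HTTP_PROXY HTTPS_PROXY NO_PROXY http_proxy https_proxy no_proxy; do
    # Read the variable's value from the source deployment's pod template.
    val=$(kubectl -n "$ns" get deploy "$src" \
      -o jsonpath="{.spec.template.spec.containers[0].env[?(@.name=='$var')].value}")
    # Set it on the target deployment only if it was present on the source.
    [ -n "$val" ] && kubectl -n "$ns" set env "deploy/$dst" "$var=$val"
  done
}

# Example (all names are placeholders):
#   copy_proxy_env my-workbench-namespace web api
```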
DSE-32943: Enabling Service Accounts

Teams in Cloudera AI Workbench can run workloads in team projects with the Run as option for a service account only if the service account was previously added manually as a collaborator to the team.
DSE-35013: First Cloudera AI Workbench creation fails

On RHEL 8.8, during the first Cloudera AI Workbench installation on GPU with Cloudera Embedded Container Service external registry, pods might get stuck in the init or CrashLoop state.

The first workbench installation is expected to fail. Treat it as a test workbench and apply the following manual workaround before creating subsequent workbenches:
  1. Restart or delete the pods which are in init or CrashLoop state in the test workbench.
  2. Once all pods are in the running state, create new workbenches as needed.
  3. Delete the test workbench from the Cloudera AI UI if no longer needed.
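Step 1 above can be scripted with kubectl. The namespace name in the example is a placeholder for the test workbench's namespace:

```shell
# Sketch: delete pods stuck in an Init or CrashLoopBackOff state so their
# controllers recreate them.
# Usage: restart_stuck_pods <namespace>
restart_stuck_pods() {
  ns=$1
  # The STATUS column is the third field of "kubectl get pods" output.
  kubectl -n "$ns" get pods --no-headers \
    | awk '$3 ~ /Init|CrashLoopBackOff/ {print $1}' \
    | xargs -r kubectl -n "$ns" delete pod
}

# Example (namespace is a placeholder):
#   restart_stuck_pods test-workbench-namespace
```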
OPSX-4603: Buildkit in Cloudera Embedded Container Service in Cloudera AI on premises

Issue: BuildKit was introduced in Cloudera Embedded Container Service for building images of models and experiments. BuildKit replaces Docker, which was previously used to build these images in Cloudera Embedded Container Service. BuildKit is supported only on RHEL 8.x and CentOS 8.x.

BuildKit in Cloudera AI on premises 1.5.2 is a Technical Preview feature, so Docker must still be installed on the hosts for models and experiments to work smoothly. An upcoming release will eliminate the Docker dependency on the hosts entirely.

Workaround: None.

DSE-32285: Migration: Migrated models are failing due to image pull errors

Issue: After migrating from CDSW to Cloudera AI on premises using the migration tool, migrated models fail on their initial deployment on the Cloudera AI Workbench, because the initial deployment tries to pull images from the original deployment's registry.

Workaround: Redeploy the migrated model. Because redeployment runs the build-and-deploy process, the image is rebuilt, pushed to the registry configured for the on premises Cloudera AI Workbench, and then used for all subsequent deployments.

DSE-28768: Spark Pushdown is not working with Scala 2.11 runtime

Issue: Scala and R are not supported for Spark Pushdown.

Workaround: None.

DSE-32304: Terminal and SSH connections can terminate on Cloudera AI on premises on Cloudera Embedded Container Service

Issue: On Cloudera on premises on Cloudera Embedded Container Service, Cloudera AI terminal and SSH connections can terminate after an indeterminate amount of time, usually 4-10 minutes. This issue affects the use of local IDEs with Cloudera AI, as well as any customer application that uses a WebSocket connection.

Workaround: None.

DSE-35251: Web pod crashes if a project forking takes more than 60 minutes

The web pod crashes if forking a project takes more than 60 minutes, because the timeout is set to 60 minutes through the grpc_git_clone_timeout_minutes property. The following error is displayed after the web pod crashes:
2024-04-23 22:52:36.384   1737    ERROR      AppServer.VFS.grpc                    crossCopy grpc error    data = [{"error":"1"},{"code":4,"details":"2","metadata":"3"},"Deadline exceeded",{}]
          ["Error: 4 DEADLINE_EXCEEDED: Deadline exceeded\n    at callErrorFromStatus (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/call.js:31:19)\n    at Object.onReceiveStatus (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/client.js:192:76)\n    at Object.onReceiveStatus (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:360:141)\n    at Object.onReceiveStatus (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:323:181)\n    at /home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/resolving-call.js:94:78\n    at process.processTicksAndRejections (node:internal/process/task_queues:77:11)\nfor call at\n    at ServiceClientImpl.makeUnaryRequest (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/client.js:160:34)\n    at ServiceClientImpl.crossCopy (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/make-client.js:105:19)\n    at /home/cdswint/services/web/server-dist/grpc/vfs-client.js:235:19\n    at new Promise (<anonymous>)\n    at Object.crossCopy (/home/cdswint/services/web/server-dist/grpc/vfs-client.js:234:12)\n    at Object.crossCopy (/home/cdswint/services/web/server-dist/models/vfs.js:280:38)\n    at projectForkAsyncWrapper (/home/cdswint/services/web/server-dist/models/projects/projects-create.js:229:19)"]
          node:internal/process/promises:288
          triggerUncaughtException(err, true /* fromPromise */);
          ^Error: 4 DEADLINE_EXCEEDED: Deadline exceeded
          at callErrorFromStatus (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/call.js:31:19)
          at Object.onReceiveStatus (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/client.js:192:76)
          at Object.onReceiveStatus (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:360:141)
          at Object.onReceiveStatus (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:323:181)
          at /home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/resolving-call.js:94:78
          at process.processTicksAndRejections (node:internal/process/task_queues:77:11)
          for call at
          at ServiceClientImpl.makeUnaryRequest (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/client.js:160:34)
          at ServiceClientImpl.crossCopy (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/make-client.js:105:19)
          at /home/cdswint/services/web/server-dist/grpc/vfs-client.js:235:19
          at new Promise (<anonymous>)
          at Object.crossCopy (/home/cdswint/services/web/server-dist/grpc/vfs-client.js:234:12)
          at Object.crossCopy (/home/cdswint/services/web/server-dist/models/vfs.js:280:38)
          at projectForkAsyncWrapper (/home/cdswint/services/web/server-dist/models/projects/projects-create.js:229:19) {
          code: 4,
          details: 'Deadline exceeded',
          metadata: Metadata { internalRepr: Map(0) {}, options: {} }
          }  
Workaround: Increase the timeout limit, for example to 120 minutes, by updating the grpc_git_clone_timeout_minutes property in the site_config table:
UPDATE site_config SET grpc_git_clone_timeout_minutes = <new value>; 
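One way to run this statement is with psql from the workbench database pod. The pod label (app=db), the database user name ("sense"), and the value 120 in this sketch are assumptions; adjust them for your deployment:

```shell
# Sketch: raise grpc_git_clone_timeout_minutes in the workbench's site_config
# table via the database pod.
# Usage: raise_git_clone_timeout <namespace> <minutes>
raise_git_clone_timeout() {
  ns=$1; minutes=$2
  # Assumption: the database pod carries the label app=db.
  db_pod=$(kubectl -n "$ns" get pods -l app=db -o jsonpath='{.items[0].metadata.name}')
  # Assumption: the application database user is "sense".
  kubectl -n "$ns" exec "$db_pod" -- \
    psql -U sense -c "UPDATE site_config SET grpc_git_clone_timeout_minutes = $minutes;"
}

# Example (namespace is a placeholder):
#   raise_git_clone_timeout my-workbench-namespace 120
```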
DSE-40198: Resolve painpoints with installations and updates of self-signed certificates

When the TLS certificate used by Cloudera AI is rotated or updated, Cloudera AI does not automatically pull the new certificate from the Cloudera Control Plane. To update Cloudera AI with a new TLS certificate, follow the steps below.

Workaround:

  1. Back up the existing ConfigMap.

    Create a backup of the current private-cloud-ca-certs-pem-2 ConfigMap in your existing Cloudera AI Workbench using the following command:
    kubectl get configmap private-cloud-ca-certs-pem-2 -n [***existing CAI workbench namespace***] -o yaml > private-cloud-ca-certs-pem-2.backup
  2. Create a temporary TLS-enabled workbench.

    Spin up a new, temporary TLS-enabled workbench in the same cluster and environment as the existing workbench. (The workbench does not need to start up correctly, and you do not need to allocate a full set of resources for it.)

  3. Locate the ConfigMap in the new workbench.

    Once the Cloudera AI infrastructure pods in the new workbench are running, retrieve the private-cloud-ca-certs-pem-2 ConfigMap using this command:
    kubectl get configmap private-cloud-ca-certs-pem-2 -n [***new CAI workbench namespace***] -o yaml
  4. Update the existing workbench with the new certificate.

    Replace the binaryData: cacerts value in the existing ConfigMap of the Cloudera AI Workbench with the binaryData: cacerts value from the new workbench. The simplest way to perform this replacement is through the Cloudera Embedded Container Service UI. This data is a large base64-encoded string. To verify the new TLS certificate, decode the string and inspect its content using the OpenSSL tool:
      kubectl get configmap private-cloud-ca-certs-pem-2 -n [***new CAI workbench namespace***] -o yaml | grep cacerts | awk '{print $2}' | base64 -d > decoded-private-cloud-ca-certs-pem.pem
      while openssl x509 -noout -text; do :; done < decoded-private-cloud-ca-certs-pem.pem
  5. Restart pods in the existing workbench. Restart the ds-cdh pod in the old namespace. Additionally, restart any other pods in the old namespace that fail to come up automatically.
  6. Delete the temporary workbench. After confirming that the old Cloudera AI Workbench is functioning correctly with the updated certificate, delete the temporary workbench.

By following these steps, you can successfully update the TLS certificate for Cloudera AI while ensuring minimal disruption to your existing workbench.
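If you prefer the command line over the Cloudera Embedded Container Service UI, the replacement in step 4 can be sketched as follows. The namespace names are placeholders:

```shell
# Sketch: copy the binaryData.cacerts value from the temporary workbench's
# ConfigMap into the existing workbench's ConfigMap.
# Usage: copy_cacerts <new-workbench-namespace> <old-workbench-namespace>
copy_cacerts() {
  new_ns=$1; old_ns=$2
  # Read the base64-encoded certificate bundle from the new workbench.
  cacerts=$(kubectl -n "$new_ns" get configmap private-cloud-ca-certs-pem-2 \
            -o jsonpath='{.binaryData.cacerts}')
  # Patch it into the existing workbench's ConfigMap.
  kubectl -n "$old_ns" patch configmap private-cloud-ca-certs-pem-2 \
    --type merge -p "{\"binaryData\":{\"cacerts\":\"$cacerts\"}}"
}

# Example (namespaces are placeholders):
#   copy_cacerts new-workbench-namespace old-workbench-namespace
```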