Known issues for Cloudera Data Services on premises 1.5.5

Lists the known issues and limitations, their areas of impact, and workarounds in the Cloudera Data Services on premises 1.5.5 release.

Known Issues in Cloudera Data Services on premises 1.5.5

OBS-8038: When using the Grafana Dashboard URL shortener, the shortened URL defaults to localhost:3000
This behavior occurs because the URL shortener uses the local server address instead of the actual domain name of the Cloudera Observability instance. As a result, users cannot access the shortened URL.
Do not use the shortened URL as generated. To ensure users can access the URL, update it to use the correct Cloudera Observability instance domain name, such as cp_domain/{shorten_url}.
ENGESC-31426 - Upgrade fails with pod in ContainerCreating due to Mount Error
During an upgrade, a critical pod (such as vault-0) fails to start and remains in the ContainerCreating status. When you describe the pod (kubectl describe pod <pod-name>), you see a FailedMount error in the events section with the message mount point busy. This issue is caused by a stale volume mount on the Kubernetes worker node. The node's kubelet service incorrectly believes the storage volume is already in use, preventing it from being mounted to the new pod.
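For reference, the following commands show a quick way to locate the affected pod and confirm the FailedMount event; the pod name and namespace are placeholders:

# List pods stuck in ContainerCreating and the nodes they are scheduled on
kubectl get pods -A -o wide | grep ContainerCreating
# Inspect the pod events for the FailedMount / "mount point busy" message
kubectl describe pod <pod-name> -n <namespace>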
The most effective method to resolve this issue is to perform a rolling, graceful reboot of the worker nodes before starting the upgrade. This clears any stale in-memory locks or filesystem handles. Perform the following steps for each worker node, one at a time:
  1. Drain the Node: Safely move all workloads off the node.
    kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
  2. Reboot the Node: Log in to the node and perform a reboot.
    sudo reboot
  3. Uncordon the Node: After the node is back online, make it available for scheduling workloads again.
    kubectl uncordon <node-name>

Performing this rolling maintenance ensures all nodes are in a clean state, preventing this issue from blocking the upgrade process.

DRS automatic backup policy is inconsistent post upgrade
When you update a backup policy manually, the modifications to the policy are not persisted after a Control Plane upgrade.
Ensure that you manually update the policy after the upgrade activity is complete.
CDPQE-32336 - The nvidia-device-plugin-daemonset pods fail with CrashLoopBackoff error
The nvidia-device-plugin-daemonset pods fail with CrashLoopBackoff error with logs showing the following error:

nvidia-container-cli: initialization error: driver rpc error: timed out

To resolve this issue, enable persistence mode through the legacy or daemon modes by following the article at: Nvidia Persistence Mode Documentation.
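For example, legacy persistence mode can typically be enabled on each GPU host with the following command; the daemon-based approach is described in the linked article:

[root@host ~]# nvidia-smi -pm 1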
ENGESC-29640 - Unable to resize the Longhorn volume

Longhorn does not automatically resize the filesystem for RWX volumes backed by NFS when the volume size is expanded.
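As a quick way to confirm you are hitting this limitation, you can compare the size reported by the PVC with the filesystem size seen inside the workload pod; the names and mount path below are placeholders:

# The PVC reflects the new, expanded size
kubectl get pvc <pvc-name> -n <namespace>
# For RWX/NFS volumes, the filesystem inside the pod may still show the old size
kubectl exec -n <namespace> <pod-name> -- df -h <mount-path>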

For an explanation of the RWX/NFS limitation and step-by-step instructions to manually resize the filesystem after volume expansion, see Longhorn Volume Expansion.
OPSX-6209 and DWX-20809: Cloudera Data Services on premises installations on RHEL 8.9 or lower versions may encounter issues
You may notice issues when installing Cloudera Data Services on premises on Cloudera Embedded Container Service clusters running on RHEL 8.9 or lower versions. Pods crash loop with the following error:
Warning  FailedCreatePodSandBox  1s (x2 over 4s)  kubelet  Failed to create pod sandbox: rpc error: 
code = Unknown desc = failed to create containerd task: failed to create shim task: 
OCI runtime create failed: runc create failed: unable to start container process: 
unable to init seccomp: error loading seccomp filter into kernel: 
error loading seccomp filter: errno 524: unknown
The issue is due to a memory leak with 'seccomp' (Secure Computing Mode) in the Linux kernel. If your kernel version is not 6.2 or higher, or is not one of the patched versions mentioned here, you may face issues during installation.
To avoid this issue, increase the value of net.core.bpf_jit_limit by running the following command on all Cloudera Embedded Container Service hosts:
[root@host ~]# sysctl net.core.bpf_jit_limit=528482304
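The sysctl command above does not persist across reboots. A minimal sketch for persisting the setting, assuming a standard /etc/sysctl.d layout (the file name is illustrative):

# Persist the setting and reload sysctl configuration
[root@host ~]# echo "net.core.bpf_jit_limit=528482304" > /etc/sysctl.d/99-bpf-jit-limit.conf
[root@host ~]# sysctl --system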
However, Cloudera recommends upgrading the Linux kernel to an appropriate version that contains a patch for the memory leak issue. For a list of versions that contain this patch, see this link.
COMPX-20705: [153CHF-155] Post Cloudera Embedded Container Service upgrade pods are stuck in ApplicationRejected State
After upgrading the Cloudera installation, pods on Kubernetes could be left in a failure state showing "ApplicationRejected". This is caused by a delay in settings being applied to Kubernetes as part of the post-upgrade steps.
To resolve this issue, restart the YuniKorn scheduler so that it picks up the latest Kubernetes settings, using the following commands:

kubectl scale deployment yunikorn-scheduler --replicas=0 -n yunikorn
kubectl scale deployment yunikorn-scheduler --replicas=1 -n yunikorn
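After scaling the scheduler back up, you can optionally confirm that the YuniKorn pods are running again:

kubectl get pods -n yunikorn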
OPSX-6303 - Cloudera Embedded Container Service server went down - 'etcdserver: mvcc: database space exceeded'

Cloudera Embedded Container Service server may fail with error message - "etcdserver: mvcc: database space exceeded" in large clusters.

  1. Add the following to the safety valve for the server group. The default value for quota-backend-bytes is 2 GB; it can be increased up to 8 GB.
    
    etcd-arg: 
    - "quota-backend-bytes=4294967296"
  2. Restart the stale services (select the option to re-deploy client configurations).
OPSX-6295 - Cloudera Control Plane upgrade failing with cadence-matching and cadence-history

If extra cadence-matching and cadence-history pods are stuck in the Init:CreateContainerError state, the Cloudera Embedded Container Service upgrade to 1.5.5 becomes stuck in a retry loop because the validation that all pods are running fails.

You need to manually apply the workaround to proceed with the upgrade and complete it successfully: delete the stuck cadence pods, as shown in the sketch below.
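A minimal sketch of the workaround, assuming the cadence pods run in a Control Plane namespace on your cluster; the namespace and pod names are placeholders:

# Identify the stuck cadence pods
kubectl get pods -A | grep -E 'cadence-(matching|history)'
# Delete the pods stuck in Init:CreateContainerError so the upgrade can proceed
kubectl delete pod <cadence-pod-name> -n <namespace>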
OPSX-4391 - External docker cert not base64 encoded
When using Cloudera Data Services on premises on Cloudera Embedded Container Service, in some rare situations, the CA certificate for the Docker registry in the cdp namespace is incorrectly encoded, resulting in TLS errors when connecting to the Docker registry.
Compare and edit the contents of the "cdp-private-installer-docker-cert" secret in the cdp namespace so that it matches the contents of the "cdp-private-installer-docker-cert" secret in other namespaces. The secrets and their corresponding namespaces can be identified using the command:
kubectl get secret -A | grep cdp-private-installer-docker-cert
Inspect each secret using the command:
kubectl get secret -n cdp cdp-private-installer-docker-cert -o yaml
Replace "cdp" with the different namespace names. If necessary, modify the secret in the cdp namespace using the command:
kubectl edit secret -n cdp cdp-private-installer-docker-cert
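Optionally, you can confirm that the certificate data in the cdp namespace decodes to a valid PEM block. The data key name varies by environment, so inspect the secret YAML from the previous command to find it (shown here as the placeholder <cert-key>):

kubectl get secret -n cdp cdp-private-installer-docker-cert -o jsonpath='{.data.<cert-key>}' | base64 -d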
OPSX-6245 - Airgap | Multiple pods are in pending state on rolling restart

Performing back-to-back rolling restarts on Cloudera Embedded Container Service clusters can intermittently fail during the Vault unseal step. During rapid consecutive rolling restarts, the kube-controller-manager pod may not return to a ready state promptly. This can cause a cascading effect where other critical pods, including Vault, fail to initialize properly. As a result, the unseal Vault step fails.
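To confirm whether this is the failure mode you are hitting, you can check that the kube-controller-manager and Vault pods have returned to a Ready state:

kubectl get pods -n kube-system | grep kube-controller-manager
kubectl get pods -n vault-system | grep vault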

As a workaround, perform the following steps:
  1. Stop the Cloudera Embedded Container Service role that failed.
  2. Start the Cloudera Embedded Container Service role again.
  3. If required, perform the rolling restart again.
OPSX-4684 - Start ECS command shows green (finished) even though start docker server failed on one of the hosts
The Docker service starts, but one or more Docker roles fail to start because the corresponding host is unhealthy.

Make sure the host is healthy, and then start the Docker role on the host.

OPSX-5986 - Cloudera Embedded Container Service fresh install failing with helm-install-rke2-ingress-nginx pod failing to come into Completed state
Cloudera Embedded Container Service fresh install fails at the "Execute command Reapply All Settings to Cluster on service ECS" step due to a timeout waiting for helm-install.
To confirm the issue, run the following kubectl command on the Cloudera Embedded Container Service host to check if the pod is stuck in a running state:
kubectl get pods -n kube-system | grep helm-install-rke2-ingress-nginx
To resolve the issue, manually delete the pod by running:
kubectl delete pod <helm-install-rke2-ingress-nginx-pod-name> -n kube-system
Then, click Resume to proceed with the fresh install process on the Cloudera Manager UI.
OPSX-6298 - Issue on service namespace cleanup

In some cases, uninstalling services from the Cloudera Data Services on premises UI can fail for various reasons.

If the uninstallation of a service fails, trigger the service uninstall process again and select “Force Delete” to ensure that all metadata of the service is removed from the Cloudera side. Then, in the OpenShift UI, search for that service namespace/project. In that project/namespace, select the Actions button at the top right of the screen and choose Delete Project.

If you move back to the main Projects screen, you can see that the project moves to the “Terminating” status, after which it is removed from the OpenShift Container Platform. Manually terminate the <project_name>-service and <project_name>-monitoring-platform namespaces, as sketched below. This ensures that all the entities linked to that project/namespace are also removed by OpenShift.
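If you prefer the CLI over the OpenShift console, an equivalent cleanup sketch; the namespace names follow the <project_name> pattern described above:

oc delete namespace <project_name>-service
oc delete namespace <project_name>-monitoring-platform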

OPSX-6265 - Setting inotify max_user_instances config

Cloudera cannot recommend an exact value for the inotify max_user_instances configuration; it depends on the workloads that run on a given node.

With newly introduced features such as Istio and cert-manager in the Cloudera Control Plane, set the inotify max_user_instances configuration to 256 instead of 128 to resolve this issue.
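A minimal sketch for applying the suggested value on a host; run it on each affected node, adjust the value for your workloads, and persist it in /etc/sysctl.d if needed:

[root@host ~]# sysctl fs.inotify.max_user_instances=256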

COMPX-20362 - Use API to create a pool that has a subset of resource types

The Resource Management UI supports displaying only three resource types: CPU, memory, and GPU. When creating a quota, the UI always sets all three resource types it knows about: CPU, memory, and GPU (the Kubernetes resource nvidia.com/gpu). If no value is chosen for a resource type, a value of 0 is set, blocking the use of that resource type.

To create a pool that has a subset of resource types, use the REST API as follows:
POST /api/v1/compute/createResourcePool
Payload:

{
    "pool": {
        "path": "root.environment.service.mypool",
        "policy": {
            "allocation": "INELASTIC"
        },
        "quota": {
            "cpu": "100 m",
            "memory": "10 GB"
        }
    }
}
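An illustrative invocation, assuming the payload above is saved to create-pool.json; the control plane host and the authorization header are placeholders for whatever authentication mechanism your environment uses:

curl -X POST "https://<control-plane-host>/api/v1/compute/createResourcePool" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d @create-pool.json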

Known issues from previous releases carried over into Cloudera Data Services on premises 1.5.5

Known Issues identified in 1.5.4

DOCS-21833: Orphaned replicas/pods are not getting auto cleaned up leading to volume fill-up issues

By default, Longhorn does not automatically delete the orphaned replica directory. Automatic deletion can be enabled by setting orphan-auto-deletion to true.

No workaround available.
OPSX-5310: Longhorn engine images were not deployed on Cloudera Embedded Container Service server nodes
Longhorn engine images were not deployed on Cloudera Embedded Container Service nodes due to missing tolerations for Cloudera Control Plane taints. This caused the engine DaemonSet to schedule only on Cloudera Embedded Container Service agent nodes, preventing deployment on Cloudera Control Plane nodes.
  1. Check the Engine DaemonSet Status. Run the following command to check if the Longhorn engine DaemonSet is missing on certain nodes:
    kubectl get ds -n longhorn-system | grep engine
  2. Identify Taints on Affected Nodes. Run the following command to check for taints on affected nodes:
    kubectl describe node <node-name> | grep Taints
  3. Manually Edit the DaemonSet to Add a Toleration. Edit the Longhorn engine DaemonSet YAML:
    kubectl edit ds -n longhorn-system engine-image-ei-<your-engine-id>
  4. Add the following under tolerations:
    
    tolerations:
    - effect: NoSchedule
      key: node-role.kubernetes.io/control-plane
      operator: Equal
      value: "true"
    
  5. Apply the changes and verify the deployment. Save and exit the editor. Then, check if the DaemonSet is now running on all necessary nodes:
    kubectl get pods -n longhorn-system -o wide | grep engine

    Verify that the engine pods are successfully scheduled on the affected Cloudera Embedded Container Service nodes.

OPSX-5155: OS Upgrade | Pods are not starting after the OS upgrade from RHEL 8.6 to 8.8
After an OS upgrade and a subsequent start of the Cloudera Embedded Container Service cluster, pods fail to come up due to stale state.
Restart the Cloudera Embedded Container Service cluster.
OPSX-5055: Cloudera Embedded Container Service upgrade failed at Unseal Vault step

During a Cloudera Embedded Container Service upgrade from the 1.5.2 to the 1.5.4 release, the vault pod fails to start because the Longhorn volume cannot attach to the host. The error is similar to the following:

Warning FailedAttachVolume 3m16s (x166 over 5h26m) attachdetach-controller AttachVolume.Attach failed for volume "pvc-0ba86385-9064-4ef9-9019-71976b4902a5" : rpc error: code = Internal desc = volume pvc-0ba86385-9064-4ef9-9019-71976b4902a5 failed to attach to node host-1.cloudera.com with attachmentID csi-7659ab0e6655d308d2316536269de47b4e66062539f135bf6012bfc8b41fc345: the volume is currently attached to different node host-2.cloudera.com

Follow the steps below, provided by SUSE, to ensure the Longhorn volume is correctly attached to the node where the vault pod is running.

# Find the volume name that is failing to attach to the vault pod,
# for example, pvc-bc73e7d3-c7e7-468a-b8e0-afdb8033e40b from the pod logs.
kubectl edit volumeattachments.longhorn.io -n longhorn-system pvc-bc73e7d3-c7e7-468a-b8e0-afdb8033e40b

# Update the "spec:" section of the volumeattachment, replace the
# attachmentTickets section with {} as shown below, and save.
spec:
 attachmentTickets: {}
 volume: pvc-bc73e7d3-c7e7-468a-b8e0-afdb8033e40b

# Scale down the vault statefulset to 0 and scale it back up.
kubectl scale sts vault --replicas=0 -n vault-system
kubectl scale sts vault --replicas=1 -n vault-system
OPSX-4684: Start Cloudera Embedded Container Service command shows green (finished) even though start docker server failed on one of the hosts

The Docker service starts, but one or more Docker roles fail to start because the corresponding host is unhealthy.

Ensure the host is healthy. Start the Docker role on the host.

OPSX-735: Kerberos service should handle Cloudera Manager downtime

The Cloudera Manager Server in the base cluster must be running to generate Kerberos principals for Cloudera on premises. If it is down, you may observe Kerberos-related errors.

Resolve the downtime on Cloudera Manager. If you encounter Kerberos errors, you can retry the operation (such as retrying creation of the Virtual Warehouse).