Known issues for Cloudera Data Services on premises 1.5.5
This page lists the known issues and limitations, their areas of impact, and workarounds in the Cloudera Data Services on premises 1.5.5 release.
Known Issues in Cloudera Data Services on premises 1.5.5
- OBS-8038: When using the Grafana Dashboard URL shortener, the shortened URL defaults to localhost:3000
- This happens because the URL shortener uses the local server address instead of the actual domain name of the Cloudera Observability instance. As a result, users cannot access the shortened URL.
- ENGESC-31426 - Upgrade fails with pod in ContainerCreating due to Mount Error
- During an upgrade, a critical pod (such as vault-0) fails to start and remains in the ContainerCreating status. When you describe the pod (kubectl describe pod <pod-name>), you see a FailedMount error in the events section with the message "mount point busy". This issue is caused by a stale volume mount on the Kubernetes worker node. The node's kubelet service incorrectly believes the storage volume is already in use, preventing it from being mounted to the new pod.
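As a rough triage sketch for the stale mount (generic Linux commands, not a Cloudera-documented procedure; the PVC name below is a placeholder for the volume named in the FailedMount event):

```shell
# Placeholder volume name; substitute the PVC ID from the FailedMount event.
VOL="pvc-example-1234"

# Look for a leftover kubelet mount of the volume on the affected worker node.
grep "$VOL" /proc/mounts || echo "no stale mount found for $VOL"

# If a stale entry is listed, a lazy unmount (as root) usually releases it so
# the new pod can attach the volume, e.g.:
#   umount -l /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/$VOL/mount
```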
- DRS automatic backup policy is inconsistent post upgrade
- When you update a backup policy manually, the modifications to the policy are not persisted after a Control Plane upgrade.
- CDPQE-32336 - The nvidia-device-plugin-daemonset pods fail with CrashLoopBackOff error
- The nvidia-device-plugin-daemonset pods fail with the CrashLoopBackOff error, with logs showing the following error:
nvidia-container-cli: initialization error: driver rpc error: timed out
- ENGESC-29640 - Unable to resize the Longhorn volume
- Longhorn does not automatically resize the filesystem for RWX volumes backed by NFS when the volume size is expanded.
- OPSX-6209 and DWX-20809: Cloudera Data Services on premises installations on RHEL 8.9 or lower versions may encounter issues
- You may notice issues when installing Cloudera Data Services on premises on Cloudera Embedded Container Service clusters running on RHEL 8.9 or lower versions. Pod crashloops are noticed with the following error:
Warning FailedCreatePodSandBox 1s (x2 over 4s) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524: unknown
The issue is due to a memory leak with 'seccomp' (Secure Computing Mode) in the Linux kernel. If your kernel version is not 6.2 or higher, or if it is not part of the list of versions mentioned here, you may face issues during installation.
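A quick way to check the running kernel against the 6.2 threshold mentioned above (a sketch; the list of patched vendor kernels referenced in the text remains the authoritative check, since fixes are often backported to older version numbers):

```shell
# Extract major.minor from the running kernel, e.g. "4.18" on RHEL 8.
KVER=$(uname -r | cut -d. -f1,2)
MAJ=${KVER%%.*}
MIN=${KVER##*.}
if [ "$MAJ" -gt 6 ] || { [ "$MAJ" -eq 6 ] && [ "$MIN" -ge 2 ]; }; then
  echo "kernel $KVER: has the upstream seccomp fix"
else
  echo "kernel $KVER: may be affected; check for a vendor backport"
fi
```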
- COMPX-20705: [153CHF-155] Post Cloudera Embedded Container Service upgrade pods are stuck in ApplicationRejected State
- After upgrading, the Cloudera installation's pods on Kubernetes may be left in a failure state showing "ApplicationRejected". This is caused by a delay in settings being applied to Kubernetes as part of the post-upgrade steps.
- OPSX-6303 - Cloudera Embedded Container Service server went down - 'etcdserver: mvcc: database space exceeded'
- The Cloudera Embedded Container Service server may fail with the error message "etcdserver: mvcc: database space exceeded" in large clusters.
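When this alarm fires, the generic etcd guidance is to compact old revisions, defragment, and disarm the NOSPACE alarm. A hedged sketch, not a Cloudera-specific procedure; it assumes etcdctl is available on an ECS server node and that its endpoints and TLS flags are already configured:

```shell
# Generic etcd recovery sequence for "mvcc: database space exceeded".
if command -v etcdctl >/dev/null 2>&1; then
  # Compact history up to the current revision, reclaim disk space,
  # then clear the NOSPACE alarm so writes are accepted again.
  rev=$(etcdctl endpoint status --write-out=json | sed -E 's/.*"revision":([0-9]+).*/\1/')
  etcdctl compact "$rev"
  etcdctl defrag
  etcdctl alarm disarm
else
  echo "etcdctl not found - run this on an ECS server node with etcd access"
fi
```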
- OPSX-6295 - Cloudera Control Plane upgrade failing with cadence-matching and cadence-history
- If extra cadence-matching and cadence-history pods are stuck in the Init:CreateContainerError state, the Cloudera Embedded Container Service upgrade to 1.5.5 is stuck in a retry loop because the "all pods running" validation fails.
- OPSX-4391 - External docker cert not base64 encoded
- When using Cloudera Data Services on premises on Cloudera Embedded Container Service, in some rare situations, the CA certificate for the Docker registry in the cdp namespace is incorrectly encoded, resulting in TLS errors when connecting to the Docker registry.
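Whether a stored value is well-formed base64 can be checked with a simple round trip. A sketch with placeholder data ("dummy-ca-pem" stands in for the real CA certificate, which in the cluster would come from the registry secret in the cdp namespace):

```shell
# Placeholder for the certificate value stored in the secret.
cert_b64=$(printf 'dummy-ca-pem' | base64)

# A correctly stored value must decode cleanly; a raw, unencoded PEM
# typically fails this round trip.
if printf '%s' "$cert_b64" | base64 -d >/dev/null 2>&1; then
  echo "value is valid base64"
else
  echo "value is NOT base64 - re-encode before storing it in the secret"
fi
```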
- OPSX-6245 - Airgap | Multiple pods are in pending state on rolling restart
- Performing back-to-back rolling restarts on Cloudera Embedded Container Service clusters can intermittently fail during the Vault unseal step. During rapid consecutive rolling restarts, the kube-controller-manager pod may not return to a ready state promptly. This can cause a cascading effect where other critical pods, including Vault, fail to initialize properly. As a result, the unseal Vault step fails.
- OPSX-4684 - Start ECS command shows green (finished) even though the Docker server failed to start on one of the hosts
- The Docker service starts, but one or more Docker roles fail to start because the corresponding host is unhealthy.
- OPSX-5986 - Cloudera Embedded Container Service fresh install failing with helm-install-rke2-ingress-nginx pod failing to come into Completed state
- Cloudera Embedded Container Service fresh install fails at the "Execute command Reapply All Settings to Cluster on service ECS" step due to a timeout waiting for helm-install.
- OPSX-6298 - Issue on service namespace cleanup
- There might be cases in which uninstalling services from the Cloudera Data Services on premises UI fails for various reasons.
- OPSX-6265 - Setting inotify max_user_instances config
- We cannot recommend an exact value for the inotify max_user_instances configuration; it depends on all the workloads that run on a given node.
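To inspect the current value on a node and raise it, a sketch (the value 8192 below is only an illustration, not a recommendation; size it to the node's workloads):

```shell
# Inspect the current inotify instance limit on this node.
cat /proc/sys/fs/inotify/max_user_instances

# Raising it requires root; persist the change across reboots, e.g.:
#   sysctl -w fs.inotify.max_user_instances=8192
#   echo 'fs.inotify.max_user_instances=8192' > /etc/sysctl.d/99-inotify.conf
```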
- COMPX-20362 - Use API to create a pool that has a subset of resource types
- The Resource Management UI supports displaying only three resource types: CPU, memory, and GPU. When creating a quota, the UI always sets all three resource types it knows about: CPU, memory, and GPU (the Kubernetes resource nvidia.com/gpu). If no value is chosen for a resource type, a value of 0 is set, blocking the use of that resource type.
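As an illustration of that behavior, here is how the equivalent plain Kubernetes ResourceQuota might look (a sketch; the names are hypothetical, and Cloudera's Resource Management may manage quotas differently under the hood):

```yaml
# Illustrative quota covering the three resource types the UI manages.
# A value of "0" effectively blocks that resource type in the pool.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: example-pool-quota
  namespace: example-pool
spec:
  hard:
    requests.cpu: "16"
    requests.memory: 64Gi
    requests.nvidia.com/gpu: "0"   # 0 blocks GPU use in this pool
```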
Known issues from previous releases carried in Cloudera Data Services on premises 1.5.5
Known Issues identified in 1.5.4
- DOCS-21833: Orphaned replicas/pods are not getting auto cleaned up leading to volume fill-up issues
- By default, Longhorn does not automatically delete the orphaned replica directory. You can enable automatic deletion by setting orphan-auto-deletion to true.
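A sketch of enabling this via Longhorn's Setting custom resource (assumes Longhorn's usual longhorn-system namespace and v1beta2 API; the setting can also be toggled from the Longhorn UI):

```yaml
# Enables automatic cleanup of orphaned replica directories in Longhorn.
apiVersion: longhorn.io/v1beta2
kind: Setting
metadata:
  name: orphan-auto-deletion
  namespace: longhorn-system
value: "true"
```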
- OPSX-5310: Longhorn engine images were not deployed on Cloudera Embedded Container Service server nodes
- Longhorn engine images were not deployed on Cloudera Embedded Container Service nodes due to missing tolerations for Cloudera Control Plane taints. This caused the engine DaemonSet to schedule only on Cloudera Embedded Container Service agent nodes, preventing deployment on Cloudera Control Plane nodes.
- OPSX-5155: OS Upgrade | Pods are not starting after the OS upgrade from RHEL 8.6 to 8.8
- After an OS upgrade and start of the Cloudera Embedded Container Service service, pods fail to come up due to stale state.
- OPSX-5055: Cloudera Embedded Container Service upgrade failed at Unseal Vault step
- During a Cloudera Embedded Container Service upgrade from the 1.5.2 to the 1.5.4 release, the vault pod fails to start because the Longhorn volume cannot attach to the host. The error is as below:
Warning FailedAttachVolume 3m16s (x166 over 5h26m) attachdetach-controller AttachVolume.Attach failed for volume "pvc-0ba86385-9064-4ef9-9019-71976b4902a5" : rpc error: code = Internal desc = volume pvc-0ba86385-9064-4ef9-9019-71976b4902a5 failed to attach to node host-1.cloudera.com with attachmentID csi-7659ab0e6655d308d2316536269de47b4e66062539f135bf6012bfc8b41fc345: the volume is currently attached to different node host-2.cloudera.com
- OPSX-4684: Start Cloudera Embedded Container Service command shows green (finished) even though the Docker server failed to start on one of the hosts
- The Docker service starts, but one or more Docker roles fail to start because the corresponding host is unhealthy.
- OPSX-735: Kerberos service should handle Cloudera Manager downtime
- The Cloudera Manager server in the base cluster generates the Kerberos principals for Cloudera on premises. If Cloudera Manager is down, you may observe Kerberos-related errors.
