Known issues for Cloudera Data Services on premises 1.5.5 SP1
You must be aware of the known issues and limitations, the areas of impact, and the workarounds in the Cloudera Data Services on premises 1.5.5 SP1 release.
Existing known issues in Cloudera Data Services on premises 1.5.5 are carried forward into Cloudera Data Services on premises 1.5.5 SP1. For more details, see Known Issues.
Known issues identified in 1.5.5 SP1
The following are the known issues identified in 1.5.5 SP1:
- COMPX-20437 - DB connection failures causing RPM and CAM pods to CrashLoopBackOff
- During an upgrade from version 1.5.5 to any 1.5.5 hotfix release, the cluster-access-manager (CAM) and resource-pool-manager (RPM) pods can enter a CrashLoopBackOff state if they are not automatically restarted during the upgrade.
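- The check below is a minimal diagnostic sketch, not a documented procedure; the control plane namespace and the CAM and RPM pod names are assumptions that vary by deployment. Deleting the affected pods lets their Deployments recreate them with fresh database connections.

```bash
# List crash-looping pods in the control plane namespace
# (namespace name is a placeholder; substitute your own).
kubectl get pods -n <control-plane-namespace> | grep -i crashloopbackoff

# Delete the affected CAM and RPM pods so their Deployments recreate them;
# the pod names reported by the previous command replace the placeholders.
kubectl delete pod -n <control-plane-namespace> <cam-pod-name> <rpm-pod-name>
```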
- OPSX-4656 - DRS automatic backup policy is inconsistent post upgrade
- When you update a backup policy manually, the modifications to the policy are not persisted after a Control Plane upgrade.
- OBS-9491 - Prometheus configuration exceeds size limit in large environments
- In environments with a large number of namespaces (approximately 300 or more per environment), the Prometheus configuration for Cloudera Monitoring might exceed the 1 MB Kubernetes Secret size limit. The total size depends on factors such as the number of namespaces, the length and variability of namespace names, and the size of the certificate store. If it exceeds 1 MB, the new Prometheus configuration is not applied and new namespaces are not monitored. As a result, telemetry data is not collected from those namespaces and is not reflected on the corresponding Grafana charts.
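- A quick way to gauge how close you are to the limit is to measure the generated configuration Secret, as in the sketch below; the Secret and namespace names are assumptions that depend on your Cloudera Monitoring deployment, and the counted value is base64-encoded, so it slightly overstates the raw size.

```bash
# Approximate the size of the Prometheus configuration Secret in bytes.
# Secret name and namespace are placeholders; the data values are
# base64-encoded, so the count is roughly 4/3 of the decoded size.
kubectl get secret <prometheus-config-secret> -n <monitoring-namespace> \
  -o jsonpath='{.data}' | wc -c
```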
- OPSX-6618 - During a Cloudera Embedded Container Service upgrade, not all volumes are upgraded to the latest Longhorn version
- During the restart of the Cloudera Embedded Container Service cluster while upgrading from 1.5.5 to 1.5.5 SP1, the upgrade can fail due to Longhorn health issues because one or more volumes are degraded.
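- A sketch for checking volume health before retrying the upgrade is shown below; it assumes Longhorn runs in its default longhorn-system namespace.

```bash
# List Longhorn volumes with their state and robustness; volumes reported
# as "degraded" need to recover before the restart can succeed.
kubectl get volumes.longhorn.io -n longhorn-system \
  -o custom-columns=NAME:.metadata.name,STATE:.status.state,ROBUSTNESS:.status.robustness
```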
- OPSX-6566 - Cloudera Embedded Container Service restart fails with etcd connectivity issues
- A restart of the Cloudera Embedded Container Service server fails with the etcd error "error reading from server".
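- The following sketch checks etcd health from a Cloudera Embedded Container Service server node; the endpoint and certificate paths are assumptions that depend on the installation layout.

```bash
# Query etcd member health directly; persistent failures here correspond to
# the "error reading from server" message seen during the restart.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=<path-to-ca.crt> \
  --cert=<path-to-client.crt> \
  --key=<path-to-client.key> \
  endpoint health
```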
- OPSX-6401 - Istio ingress-default-cert is not created in the upgrade scenario
- After upgrading to 1.5.5 SP1, the ingress-default-cert Secret is not created in the istio-ingress namespace. Because this certificate is expected, its absence causes components such as CAII and MR provisioning to fail.
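- You can confirm whether the Secret is missing with the check below; recreating it manually is shown only as an illustration and assumes you have the certificate and key that the ingress is expected to serve.

```bash
# Check whether the certificate Secret exists in the istio-ingress namespace.
kubectl get secret ingress-default-cert -n istio-ingress

# Illustration only: recreate the Secret from an existing certificate and key
# (file paths are placeholders; the correct certificate source depends on how
# your deployment issues the ingress certificate).
kubectl create secret tls ingress-default-cert -n istio-ingress \
  --cert=<path-to-tls.crt> --key=<path-to-tls.key>
```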
- OPSX-6645 - Cloudera Embedded Container Service upgrade failure at restart step
- When the Cloudera Embedded Container Service role fails to start after a node reboot or service restart, the root cause can be that the etcd defragmentation process, which runs on startup, takes longer than the component timeout thresholds. As a result:
- The kubelet service may fail to start or time out.
- The kube-apiserver, kube-scheduler or kube-controller-manager roles may remain in a NotReady state.
- etcd may perform automatic defragmentation at startup.
- The API server may fail to connect to etcd (connection refused or timeout).
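- The sketch below inspects the etcd database size and runs a defragmentation while the cluster is still healthy, which can shorten the automatic defragmentation at the next startup; the endpoint and certificate paths are assumptions, and defragmenting a live etcd member briefly blocks requests to it.

```bash
# Show etcd database size and status; a large database indicates a long
# defragmentation at startup.
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=<path-to-ca.crt> --cert=<path-to-client.crt> --key=<path-to-client.key> \
  endpoint status --write-out=table

# Defragment ahead of the restart so the startup defragmentation has less to do.
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=<path-to-ca.crt> --cert=<path-to-client.crt> --key=<path-to-client.key> \
  defrag
```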
- OPSX-6767 - Cloudera Embedded Container Service cluster has stale configuration after Cloudera Manager upgrade to 7.13.1.501-b2 from 7.11.3.24
- After upgrading Cloudera Manager to version 7.13.1.501, the Cloudera Embedded Container Service shows a staleness indicator. This occurs due to configuration changes applied by the upgrade:
- worker-shutdown-timeout: reduced from 24 hours (86,400 s) to 15 minutes (900 s).
- smon_host: a new monitoring configuration added.
- smon_port: a new monitoring port configuration (9997).
- OPSX-6638 - After a rolling restart, many pods are stuck in the Pending state
- Pods remain in Pending state and fail to schedule with etcd performance warnings.
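- A minimal sketch for finding and inspecting the affected pods is shown below; the pod name and namespace are placeholders.

```bash
# List pods stuck in Pending across all namespaces.
kubectl get pods -A --field-selector=status.phase=Pending

# Show the scheduler events explaining why a specific pod cannot be placed.
kubectl describe pod <pending-pod-name> -n <namespace>
```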
