Known issues for the Cloudera Data Services on premises 1.5.5 SP1

OPSAPS-70612: Invalid URL error while installing Cloudera Data Services on premises 1.5.5 SP1 from Cloudera Manager 7.13.1.501

When attempting to install Cloudera Data Services on premises using a Cloudera Manager 7.13.1.501 hotfix, the installation may fail.

The installation fails to process the repository URL correctly in this specific scenario, resulting in an "Invalid URL" error, which blocks the installation.

Perform the following steps to mitigate this issue:

Login to Cloudera Manager and click Hosts > Add Hosts.
In the Add Hosts page, select the Add hosts to Cloudera Manager option and then click Continue.
Follow the instructions on the Setup Auto-TLS page and then click Continue.
In the Specify Hosts page, enter a host name or pattern to search for new hosts to add to the cluster, and then click Continue. You can click the Patterns link for more information.
A list of matching hosts are displayed.
Select the hosts that you want to add and click Continue.
Select the repository location where Cloudera Manager can access the software required to install on the new hosts. Choose either the Public Cloudera Repository or a Custom Repository, and provide the URL of the custom repository available on your local network.
Click Continue.

After completing the above tasks, you can proceed with the Cloudera Data Services on premises installation.

OPSX-6794 - Upgrading from Cloudera Data Services on premises 1.5.4 to 1.5.5 Service Pack 1 fails with a Upgrade embedded DB error message midway through the upgrade process.

When attempting to upgrade from Cloudera Data Services on premises 1.5.4 to 1.5.5 Service Pack 1, the upgrade process could fail.

If you have performed a manual Expand Volume operation on the cdp-embedded-db-backend Persistent Volume Claim (PVC) of the cdp-embedded-db service in cdp namespace, it can cause the upgrade process to fail. The upgrade workflow attempts to apply the original value that was used during the initial cluster set up. As Longhorn does not support reducing the volume size, the upgrade process fails.

To avoid stalling the ongoing upgrade process, you must manually exclude the PVC for the cdp-embedded-db-backend in the upgrade process.

Perform the following steps:

#Login to ECS master node
ssh root@ecs_master_node

#Switch to embedded db directory in the ECS parcel
cd /opt/cloudera/parcels/ECS-1.5.5-h1000-b167-ecs-1.5.5-h1000-b167.p0.72522329/installer/helm/cdp-embedded-db

#Create a tmp working directory
mkdir temp-cdp-embedded-db-1.5.5-h1000-b167

#Extract the tgz contents into the temp working directory
tar -xzf cdp-embedded-db-1.5.5-h1000-b167.tgz -C ./temp-cdp-embedded-db-1.5.5-h1000-b167

#Manual changes
vi temp-cdp-embedded-db-1.5.5-h1000-b167/templates/persistentVolumeClaim.yaml

#This file will have two PVCs. cdp-embedded-db-backend and cdp-embedded-db-backend-v17. Remove cdp-embedded-db-backend PVC from the file. Save the file and exit

#Take a backup of the old tgz file
mv cdp-embedded-db-1.5.5-h1000-b167.tgz backup-cdp-embedded-db-1.5.5-h1000-b167.tgz

#Switch to temp working directory
cd temp-cdp-embedded-db-1.5.5-h1000-b167

#Prepare an updated tgz file
tar -czvf ../cdp-embedded-db-1.5.5-h1000-b167.tgz .

#Switch to previous directory and check the tgz file
cd ..
ls -lrth

#Move the backup tgz to a different directory
mv backup-cdp-embedded-db-1.5.5-h1000-b167.tgz /tmp/some_directory

#Cleanup the temp working directory
rm -rf temp-cdp-embedded-db-1.5.5-h1000-b167/

#Backup the following secret in cdp namespace to 'some_directory'. Make sure to replace 'some_directory' with an appropriate path
kubectl get secret cdp-thunderheaduserpreference-db-secret -n cdp -o yaml > /tmp/some_directory/backup-cdp-thunderheaduserpreference-db-secret

#Delete the following secret in cdp namespace
kubectl delete secret cdp-thunderheaduserpreference-db-secret -n cdp

Once you complete these steps, you can resume the upgrade process.

OPSX-6797 - A possible upgrade failure resulting from expanding the cdp-embedded-db volume from Longhorn UI after the initial installation of your Data Services on premises cluster

If you have expanded cdp-embedded-db volume from Longhorn UI after the initial installation of your Data Services on premises cluster, you must complete the workaround steps before planning your upgrade to Cloudera Data Services on premises 1.5.5 Service Pack 1 to avoid a potential upgrade failure.

Login to your ECS master node on your cluster and perform the following these steps:

# Find the size of cdp-embedded-db-backend PVC, look for the value in CAPACITY column
kubectl get pvc -n cdp | grep cdp-embedded-db-backend

# Find the latest/current copy of cdp-private-installer-values
# If the cluster is on 1.5.5-b212, then the secret to use will be cdp-private-installer-values-1.5.5-b212
kubectl get secret -n cdp | grep cdp-private-installer-values

# For this example, we will use 'cdp-private-installer-values-1.5.5-b212'
# Please use the correct one from your cluster

# Backup the existing values, in case you need to restore it later
kubectl get secret cdp-private-installer-values-1.5.5-b212 -n cdp -o yaml > backup-cdp-private-installer-values-1.5.5-b212.yaml

# Make a new copy before updating the values
cp backup-cdp-private-installer-values-1.5.5-b212.yaml updated-cdp-private-installer-values-1.5.5-b212.yaml

# Edit 'updated-cdp-private-installer-values-1.5.5-b212.yaml' using the following instructions:
# 1. Base64decode the value present in 'values.yaml.merged' attribute
# 2. Update Database.EmbeddedDbStorage to the latest value obtained in the first step
# 3. Base64encode the updated yaml string and update it in 'values.yaml.merged' attribute
# 4. Delete metadata.creationTimestamp, metadata.name, metadata.resourceVersion and metadata.uid attributes
# 5. Save these changes to updated-cdp-private-installer-values-1.5.5-b212.yaml

# Delete the existing secret
kubectl delete secret cdp-private-installer-values-1.5.5-b212 -n cdp

# Create a new secret using the updated file
kubectl apply -f updated-cdp-private-installer-values-1.5.5-b212.yaml

# Check if the secret got created
kubectl get secret -n cdp | grep cdp-private-installer-values

# Verify if the value got updated
kubectl get secret cdp-private-installer-values-1.5.5-b212 -n cdp -o jsonpath='{.data.values\.yaml\.merged}' | base64 --decode

COMPX-20437 - DB connection failures causing RPM and CAM pods to CrashLoopBackOff

During an upgrade from version 1.5.5 to any 1.5.5 hotfix release, the cluster-access-manager (CAM) and resource-pool-manager (RPM) pods can enter a CrashLoopBackOff state if they are not automatically restarted during the upgrade.

After upgrade, manually restart the CAM pod and then restart the RPM pod (order of restart is very important).

Commands to restart:


kubectl rollout restart deployment <cluster-access-manager-deployment-name> -n <namespace> 
kubectl rollout restart deployment <resource-pool-manager-deployment-name> -n <namespace>

OBS-9491 - Prometheus configuration exceeds size limit in large environments

In environments with a large number of namespaces (approximately 300 or more per environment), the Prometheus configuration for Cloudera Monitoring might exceed the 1 MB Kubernetes Secret size limit. If the total size, which depends on factors such as the number of namespaces, the length of namespace names, their variability, and the size of the certificate store, exceeds 1 MB, the new Prometheus configuration will not be applied, and new namespaces will not be monitored. As a result, the telemetry data will not be collected from those namespaces and will not be reflected on the corresponding Grafana charts.

To resolve this issue, you must enable Prometheus configuration compression at the Cloudera Control Plane level.

Upgrade to Cloudera Data Services on premises 1.5.5 SP1 or a higher version.
Back up the secret from the monitoring namespace to allow for rollback and disabling compression later.
Set the environment variable ENABLE_ENVIRONMENT_PROMETHEUS_CONFIG_COMPRESSION to "true" on the cdp-release-monitoring-pvcservice deployment in the Cloudera Control Plane namespace.
note
The upper bound for monitoring is based on the number of namespaces per environment, the length of the namespace name, the variability of the names, and the size of the cert store.
- For example, about 700 workload namespaces in a single environment with a short, consistent pattern ("aaaa-bbbb-XXXX") generate around 150 KB of configuration, which fits within the 1 MB limit.
- About 3000 namespaces in a single environment with a longer, consistent pattern ("aaaa-cccc-dddd-XXXXX") generate about 900 KB.

Rolling back the workaround:

Set the environment variable ENABLE_ENVIRONMENT_PROMETHEUS_CONFIG_COMPRESSION to "false" on the cdp-release-monitoring-pvcservice deployment in the Cloudera Control Plane namespace.
Reapply the secret from the backup.

OPSX-6618 - In Cloudera Embedded Container Service upgrade not all volumes are upgraded to the latest longhorn version

During restart of the Cloudera Embedded Container Service cluster from 1.5.5 to 1.5.5 SP1, the upgrade failed due to longhorn health issues. This is because one of the volumes was degraded.

Follow the steps to resolve the issue:

Identify the problematic volumes (degraded state).
Set the value of spec.numberOfReplicas of the volume to the number of active replicas. For example, set the value to 3 if two replicas are active.
Apply the fix before or during the upgrade as per the workaround instructions (refer to Longhorn issue #11825). Longhorn Engineering is addressing in v1.11.0 and will back-port the fix. The workaround is included as part of the upgrade. However, if the issue is noticed even after upgrade please follow the workaround steps documented here: https://github.com/longhorn/longhorn/issues/11825

COMPX-23842 / COMPX-24130/ COMPX-23319 - Pod status shown as OutOfcpu, OutOfmemory, or Pending after a cluster restart

During a cluster restart, due to an upgrade or normal maintenance, all nodes in the cluster are restarted. During this process, the cluster operates with reduced resource capacity. When this occurs, pod placement can be rejected, resulting in some pods entering OutOfcpu or OutOfmemory state.

Do not restart failed pods until the cluster restart process has fully completed. Once the cluster is fully operational:

Restart any pods that are in a failed (OutOfcpu or OutOfmemory) state, so they can be rescheduled properly.
If any pods remain in a pending state after the restart, restart the YuniKorn scheduler to refresh scheduling:
```
kubectl rollout restart deployment yunikorn-scheduler -n yunikorn
```

OPSX-6566 - Cloudera Embedded Container Service restart fails with etcd connectivity issues

Restart of the Cloudera Embedded Container Service server fails with etcd error: "error reading from server"

Identify the server role that failed.
Restart only the Cloudera Embedded Container Service server role which failed.
Once the server role is restarted and healthy, proceed with the remainder of the Cloudera Embedded Container Service server role restart sequence.

OPSX-6401 - Istio ingress-default-cert is not created in the upgrade scenario

After upgrading to 1.5.5 SP1, the Secret ingress-default-cert is not created in the istio-ingress namespace. Because this certificate is expected, failing to create it causes components like CAII & MR provisioning to fail.

To resolve this issue:

After upgrading from an older version to 1.5.5 SP1, check for the secret called ingress-default-cert in the istio-ingress namespace (ns).
For example:
```
kubectl get secret ingress-default-cert -n istio-ingress
```
If missing, copy the identical secret from kube-system namespace (or from the pre-upgrade environment) into istio-ingress namespace with the same name and contents.
This will restore the expected certificate and allow CAII & MR provisioning to proceed.

OPSX-6645 - Cloudera Embedded Container Service upgrade failure at restart step

When the Cloudera Embedded Container Service role fails to start after a node reboot or service restart, the root cause can be that the etcd defragmentation process which runs on startup takes longer than the component timeout thresholds. As a result:

The kubelet service may fail to start or time out.
The kube-apiserver, kube-scheduler or kube-controller-manager roles may remain in a NotReady state.
etcd may perform automatic defragmentation at startup.
The API server may fail to connect to etcd (connection refused or timeout).

To fix this issue:

Resume the Cloudera Embedded Container Service start/restart from the Cloudera Manager UI once etcd has come up.
Ensure etcd meets production hardware and configuration requirements: For more information, see etcd requirements and Knowledge Base.

OPSX-6767 - Cloudera Embedded Container Service cluster has stale configuration after Cloudera Manager upgrade to 7.13.1.501-b2 from 7.11.3.24

After upgrading Cloudera Manager to version 7.13.1.501, the Cloudera Embedded Container Service shows a staleness indicator. This occurs due to configuration changes applied by the upgrade:

worker-shutdown-timeout: reduced from 24 hours (86,400 s) to 15 minutes (900 s).
smon_host: a new monitoring configuration added.
smon_port: a new monitoring port configuration (9997).

No action required — the staleness is expected following this upgrade and can be safely ignored. The indicator will automatically clear once Cloudera Embedded Container Service is upgraded to version 1.5.5 SP1 or later.

If you prefer to clear the staleness indicator right away, you may manually refresh the Cloudera Embedded Container Service service through the Cloudera Manager UI.

OPSX-6638 - Post rolling restart many pods are stuck in pending state

Pods remain in Pending state and fail to schedule with etcd performance warnings.

Restart the Cloudera Embedded Container Service master role which has etcd performance issues. For more information, see etcd hardware recommendations and Knowledge Base.

Known issues for the Cloudera Data Services on premises 1.5.5 SP1

Known issue identified in 1.5.5 SP1