General known issues with Cloudera Data Engineering

Learn about the general known issues with the Cloudera Data Engineering service on cloud, their impact or changes to the functionality, and the workarounds.

DEX-17581: Cloudera Data Engineering 1.24.1 is not getting deployed in the East US region
Applies only to Azure. Cloudera Data Engineering service creation fails during the database server provisioning step. This occurs because the Azure API that Cloudera Data Engineering uses to retrieve the supported database instance types for the specified region (for example, eastus) returns an empty response. As a result, database server provisioning cannot proceed. The following error message appears in the Cloudera Data Engineering service logs:

unable to get MySQL flexible server DB instance type for cluster, Error: no instance types available for MySQL flexible server DB service tier: GeneralPurpose having vCores 2

Cloudera has raised a support ticket with Microsoft regarding this issue. According to their response, the empty API response for the specified location occurs when the quota for Azure MySQL Flexible Server is unavailable or disabled in the given region for the subscription. If you encounter this issue, contact Microsoft Support and request that they enable the quota for MySQL Flexible Servers in the affected region for your subscription.
DEX-17565: Links to download cdeconnect and pyspark tars for Spark Connect are giving HTTP 404 error
Links to download cdeconnect and pyspark tars for Spark Connect give an HTTP 404 error.
Replace 7.2.18.800 with 7.2.18.0 in the URL.
DEX-17519: Sessions are not killed as per the ttl configured in mow-int Azure and AWS
Sessions are not killed according to the TTL configured in mow-int on Azure and AWS. The timeout calculation in the isTimeout method in the Livy code is wrong: the method takes a calculated timeout in milliseconds and converts it to nanoseconds, but the caller already passes the calculated timeout in nanoseconds. Because the isTimeout method converts the calculatedTimeout value a second time, the resulting value is inflated, so (toTime - fromTime) never exceeds the calculated timeout. As a result, sessions are not killed after the timeout is reached.
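The following minimal Scala sketch illustrates the unit mismatch. It is an illustrative reconstruction, not the actual Livy source; the method shape and parameter names are assumptions based on the description above:

import java.util.concurrent.TimeUnit

// Illustrative sketch of the bug, not the actual Livy source.
// The caller passes fromTime, toTime, and calculatedTimeout in nanoseconds.
def isTimeout(fromTime: Long, toTime: Long, calculatedTimeout: Long): Boolean = {
  // Bug: calculatedTimeout is already in nanoseconds, but it is converted
  // again as if it were milliseconds, inflating it by a factor of 1,000,000.
  val inflatedTimeout = TimeUnit.MILLISECONDS.toNanos(calculatedTimeout)
  // The elapsed time almost never exceeds the inflated value, so the
  // session is never considered timed out and is never killed.
  (toTime - fromTime) > inflatedTimeout
}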
DEX-17507: Restore of Scheduled Jobs is failing due to time format
Restoring the Spark Jobs with the Schedule Configuration fails if the start date or end date uses a time format other than RFC3339Nano. This issue affects only jobs created using non-UI options, such as the API or CLI.
Before taking the backup, edit the schedule configuration of the affected Spark Job to use the RFC3339Nano time format.
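RFC3339Nano timestamps allow up to nine fractional-second digits with an explicit time zone, for example 2024-06-01T09:00:00.000000000Z. The following Scala snippet is a minimal sketch of producing a timestamp in that shape; the formatter pattern and sample date are illustrative only:

import java.time.Instant
import java.time.ZoneOffset
import java.time.format.DateTimeFormatter

// Illustrative only: format an instant with nine fractional-second digits
// and a UTC designator, matching the RFC3339Nano shape expected by the
// schedule start and end dates.
val rfc3339Nano = DateTimeFormatter
  .ofPattern("uuuu-MM-dd'T'HH:mm:ss.nnnnnnnnnX")
  .withZone(ZoneOffset.UTC)
println(rfc3339Nano.format(Instant.parse("2024-06-01T09:00:00Z")))
// prints: 2024-06-01T09:00:00.000000000Z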
DEX-17500: [CDP Cli] Spark OsName "chainguard" Not Triggering Error in Cloudera Data Engineering Version 1.23.1 Virtual Cluster
Cloudera Data Engineering version 1.23.1 allows the creation of a Virtual Cluster with the securityhardened option without raising an error message. The cluster actually uses UBI (Red Hat) underneath, which is correct behavior, but it can cause confusion because the Virtual Cluster property still states securityhardened.
Avoid using the CDP CLI to provide the spark.osname[="securityhardened"] option when creating a Virtual Cluster, as it is not supported in Cloudera Data Engineering versions lower than 1.24.1.
DEX-17458: Cloudera Data Engineering session creation is failing with java.util.concurrent.ExecutionException: javax.security.sasl.SaslException
Cloudera Data Engineering session creation fails in Spark 3.3.0 Virtual Clusters. The following error is listed in the driver logs:

Exception in thread "main" java.util.concurrent.ExecutionException: javax.security.sasl.SaslException: Client closed before SASL negotiation finished

DEX-16747: Cloudera Data Engineering 1.23.1-b114 - Driver container stderr and stdout logs are missing for some Spark jobs
Intermittently, the driver stderr and stdout logs are missing for some job runs.
DEX-15884: Resource file upload intermittently does not pick up the modified file
When you attempt to update a file by uploading a new version with the exact same filename, the operation appears to succeed, but the content of the file is not updated. The system continues to serve the previous version of the file. This issue has been observed to occur intermittently under the following conditions:
  • Uploading a file to overwrite an existing file with the same name.
  • Deleting the original file first and then uploading a new file with the same name.
Specify a different filename for the resource.
DEX-15714: Proxy settings are not propagating to Cloudera Data Engineering sessions
Proxy settings from a configured CDP proxy (configmap: cdp-proxy-config) are not propagated to Cloudera Data Engineering sessions. Proxy settings for Cloudera Data Engineering jobs are propagated through spark.driver.extraJavaOptions and spark.executor.extraJavaOptions, as standard JAVA_OPTS. For more information, see Cloudera public proxy documentation.
Add the proxy settings manually to spark.executor.extraJavaOptions and spark.driver.extraJavaOptions, and then create the session.
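For example, assuming a hypothetical proxy at proxy.example.com on port 3128, the standard JVM proxy system properties can be passed as follows (the host and port values are placeholders):

spark.driver.extraJavaOptions=-Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=3128 -Dhttps.proxyHost=proxy.example.com -Dhttps.proxyPort=3128
spark.executor.extraJavaOptions=-Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=3128 -Dhttps.proxyHost=proxy.example.com -Dhttps.proxyPort=3128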
DEX-15461: Writing Spark Dataframe to Hive using HWC Fails with java.util.NoSuchElementException: None.get
This is a known issue when writing data in ORC format. The issue has been fixed internally, but more testing is needed. The fix will be part of the Hive Warehouse Connector and Cloudera Data Engineering certification in the future.
DEX-14725: virtualenv cannot access pypi mirror
When a Python virtual environment is created, virtualenv needs internet access to seed packages such as pip, setuptools, and wheel. If public internet access is blocked (for example, in a private network), certain packages, such as requests-kerberos, fail to build.
Use custom Docker images. For an example, see https://community.cloudera.com/t5/Community-Articles/Creating-Custom-Runtimes-with-Spark3-Python-3-9-on-Cloudera/ta-p/368867.
DEX-14385: Backup fails if there is a Git repository resource
In the Cloudera Data Engineering 1.20.3 services, if there is a Git repository resource, the cluster backup fails.
Remove the Git repository resource.
DEX-12616: Node Count shows zero in /metrics request

Cloudera Data Engineering 1.20.3 introduced compatibility with Kubernetes version 1.27. With this update, kube_state_metrics no longer provides label and annotation metrics by default.

Earlier, Cloudera Data Engineering used label information to calculate the Node Count for both Core and All-Purpose nodes, which was automatically exposed. However, due to the changes in kube_state_metrics, this functionality is no longer available by default. As a result, the Node count shows zero in /metrics, charts, and the user interface.

  1. Make sure that you set a kube-config for the Cloudera Data Engineering service.
  2. Run the following command, which exposes the label manually in the prometheus-kube-state-metrics container:
    kubectl patch deployment monitoring-prometheus-kube-state-metrics --namespace monitoring --type='json' \
      -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args", "value": ["--metric-labels-allowlist=nodes=[role]"]}]'
For more information about how to see the node count, see Checking the node count on your Cloud Service Provider's website.
DEX-11340: Kill all the alive sessions in prepare-for-upgrade phase of stop-gap solution for upgrade
If Spark sessions are running during the Cloudera Data Engineering upgrade, they are not automatically killed, leaving them in an unknown state during and after the upgrade.
Ensure that you do not have any Spark sessions running during the Cloudera Data Engineering upgrade. If any are running, kill them before proceeding with the upgrade.
DEX-14084: No error response for Airflow Python virtual environment at Virtual Cluster level for view only access user
If a user with a view-only role on a Virtual Cluster (VC) tries to create an Airflow Python virtual environment on the VC, access is blocked with a 403 error. However, the 403 no-access error is not displayed in the UI.
DEX-11639: "CPU" and "Memory" Should Match Tier 1 and Tier 2 Virtual Clusters AutoScale
CPU and Memory options on the service or cluster edit page display the values for Core (tier 1) and All-Purpose (tier 2) together. However, they should be displayed as separate values for Core and All-Purpose.
DEX-12482: [Intermittent] Diagnostic Bundle generation taking several hours to generate
Diagnostic bundle generation can intermittently take several hours due to low EBS throughput and IOPS on the base node.
Increase the EBS throughput and IOPS values in the CloudFormation template, then trigger new diagnostic bundles.
DEX-14253: Cloudera Data Engineering Spark Jobs are getting stuck due to the unavailability of the spot instances
The unavailability of AWS spot instances may cause Cloudera Data Engineering Spark jobs to get stuck.
Re-create the Virtual Cluster with on-demand instances.
DEX-14192: Some Spark 3.5.1 jobs have slightly higher memory requirements
Some jobs running on Spark 3.5.1 have slightly higher memory requirements, resulting in driver pods being terminated by Kubernetes with an OOMKilled error.
Increase the driver pod memory from the default 1GB to 1.2GB in the job's configuration.
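For example, assuming the job accepts standard Spark properties, the driver memory can be raised through the job's Spark configuration (the exact value to use depends on the workload):

spark.driver.memory=1200m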
DEX-14173: VC Creation is failing with "Helm error: 'timed out waiting for the condition', no events found for chart"
On busy Kubernetes clusters, installing a Virtual Cluster or Cloudera Data Engineering may fail with an error message showing Helm error: 'timed out waiting for the condition', no events found for chart.
Try the installation again. The failure is caused by image pulls timing out; the installation succeeds as more resources become available.
DEX-13957: Cloudera Data Engineering metrics and graphs show no data
Cloudera Data Engineering versions 1.20.3 and 1.21 use Kubernetes version 1.27. In Kubernetes version 1.27, by default, the kube_state_metrics does not provide label and annotation metrics. For this reason, the node count shows zero for core and all-purpose nodes in the Cloudera Data Engineering UI and in charts.
As a prerequisite, set a kube-config for the Cloudera Data Engineering service and run:
kubectl patch deployment monitoring-prometheus-kube-state-metrics --namespace monitoring --type='json' \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args", "value": ["--metric-labels-allowlist=nodes=[role]"]}]'
DEX-11498: Spark job failing with error: "Exception in thread "main" org.apache.hadoop.fs.s3a.AWSBadRequestException:"
When users in the Milan or Jakarta regions use the Hadoop s3a client to access AWS S3 storage, that is, use s3a://bucket-name/key to access a file, an error may occur. This is a known issue in Hadoop.
Set the region manually as spark.hadoop.fs.s3a.endpoint.region=<region code>. For region codes, see https://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region.
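For example, for a bucket in the Milan region (region code eu-south-1):

spark.hadoop.fs.s3a.endpoint.region=eu-south-1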
DEX-10147: Grafana issue for virtual clusters with the same name
In Cloudera Data Engineering 1.19, when you have two different Cloudera Data Engineering services with the same name under the same environment and you click the Grafana charts for the second Cloudera Data Engineering service, metrics for the Virtual Cluster in the first Cloudera Data Engineering service are displayed.
After you upgrade Cloudera Data Engineering, verify the upgraded cluster using information other than the data shown in Grafana. Once you have verified everything in the upgraded Cloudera Data Engineering service, delete the old Cloudera Data Engineering service; this resolves the Grafana issue.
DEX-9112: VC deployment frequently fails when deployed through the CDP CLI
In Cloudera Data Engineering 1.19, when a Virtual Cluster is deployed using the CDP CLI, it frequently fails because the pods fail to start. However, creating a Virtual Cluster using the UI is successful.
Ensure that you are using proper units with --memory-requests in the "cdp de" CLI, for example, "--memory-requests 10Gi".
DEX-9879: Infinite while loops not working in Cloudera Data Engineering Sessions
If an infinite while loop is submitted as a statement, the session hangs indefinitely. This means that new statements cannot be submitted and the session stays in a busy state. Sample input:
while (true) {
  print("hello")
}
  1. Copy the DEX API URL from the Virtual Cluster details page and use it to cancel the statement: POST $DEX_API/sessions/{session-name}/statements/{statement-id}/cancel. The statement ID can be found by running the cde sessions statements command from the CDE CLI.
  2. Kill the Session and create a new one.
DEX-9898: CDE CLI input reads break after interacting with a Session
After interacting with a Session through the sessions interact command, input to the CDE CLI on the terminal breaks. In the example below, ^M displays instead of proceeding:
> cde session interact --name sparkid-test-6
WARN: Plaintext or insecure TLS connection requested, take care before continuing. Continue? yes/no [no]: yes^M 
Open a new terminal and type your Cloudera Data Engineering commands.
DEX-9881: Multi-line command error for Spark-Scala Session types in the CDE CLI
In Cloudera Data Engineering 1.19, multi-line input into a Scala session on the CDE CLI does not work as expected in some cases. The CLI interaction throws an error before reading the complete input. Sample input:
scala> type
     |  
Use the UI to interact with Scala sessions. A newline is expected in the above situation. In Cloudera Data Engineering 1.19, only unbalanced brackets generate a new line. In Cloudera Data Engineering 1.20, all valid Scala newline conditions are handled:
scala> customFunc(
     | (
     | )
     | )
     | 
DEX-9756: Unable to run large raw Scala jobs
Scala code with more than 2000 lines could result in an error.
To avoid the error, increase the stack size, for example, "spark.driver.extraJavaOptions=-Xss4M" or "spark.driver.extraJavaOptions=-Xss8M".
DEX-8679: Job fails with permission denied on a RAZ environment
When a job that accesses files runs longer than the delegation token renewal time on a RAZ-enabled Cloudera environment, the job fails with the following error:
Failed to acquire a SAS token for get-status on /.../words.txt due to org.apache.hadoop.security.AccessControlException: Permission denied.
DEX-3706: The Cloudera Data Engineering home page not displaying for some users
The Cloudera Data Engineering home page does not display Virtual Clusters or the Quick Action bar if the user is part of hundreds of user groups or subgroups.
The user must access the Administration page and open the Virtual Cluster of choice to perform all job-related actions. This issue will be fixed in Cloudera Data Engineering 1.18.1.
DEX-8283: False Positive Status is appearing for the Raw Scala Syntax issue
Raw Scala jobs that fail due to syntax errors are reported as succeeded by Cloudera Data Engineering, as displayed in this example:
spark.range(3)..show() 
The job fails with the following error, which is logged in the driver stdout log:
/opt/spark/optional-lib/exec_invalid.scala:3: error: identifier expected but '.' found.
    spark.range(3)..show()
                   ^ 
This issue will be fixed in Cloudera Data Engineering 1.18.1.
DEX-8281: Raw Scala Scripts fail due to the use of the case class
Implicit conversions that involve implicit Encoders for case classes, which are usually enabled by importing spark.implicits._, do not work in Raw Scala jobs in Cloudera Data Engineering. These include conversions between Scala objects such as RDD, Dataset, DataFrame, and Columns. For example, the following operations fail on Cloudera Data Engineering:
import org.apache.spark.sql.Encoders
import spark.implicits._
case class Case(foo:String, bar:String)

// 1: an attempt to obtain schema via the implicit encoder for case class fails
val encoderSchema = Encoders.product[Case].schema
encoderSchema.printTreeString()

// 2: an attempt to convert RDD[Case] to DataFrame fails
val caseDF = sc
	.parallelize(1 to 3)
	.map(i => Case(f"$i", "bar"))
	.toDF

// 3: an attempt to convert DataFrame to Dataset[Case] fails
val caseDS = spark
	.read
	.json(List("""{"foo":"1","bar":"2"}""").toDS)
	.as[Case]
Whereas conversions that involve implicit encoders for primitive types are supported:
val ds = Seq("I am a Dataset").toDS
val df = Seq("I am a DataFrame").toDF
Notice that List, Row, StructField, and createDataFrame are used below instead of a case class and .toDF():
val bankRowRDD = bankText.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
  s => Row(
    s(0).toInt,
    s(1).replaceAll("\"", ""),
    s(2).replaceAll("\"", ""),
    s(3).replaceAll("\"", ""),
    s(5).replaceAll("\"", "").toInt
  )
)

val bankSchema = List(
  StructField("age", IntegerType, true),
  StructField("job", StringType, true),
  StructField("marital", StringType, true),
  StructField("education", StringType, true),
  StructField("balance", IntegerType, true)
)

val bank = spark.createDataFrame(
  bankRowRDD,
  StructType(bankSchema)
)


bank.registerTempTable("bank")
DEX-7051: EnvironmentPrivilegedUser role cannot be used with Cloudera Data Engineering
Users who hold the EnvironmentPrivilegedUser role cannot access Cloudera Data Engineering; any interaction with the service fails with an "access denied" error.
Cloudera recommends that you do not use or assign the EnvironmentPrivilegedUser role to users who need to access Cloudera Data Engineering.
Strict DAG declaration in Airflow 2.2.5
Cloudera Data Engineering 1.16 introduces Airflow 2.2.5, which is stricter about DAG declaration than the Airflow version previously supported in Cloudera Data Engineering. In Airflow 2.2.5, the DAG timezone must be a pendulum.tz.Timezone, not datetime.timezone.utc.
If you upgrade to Cloudera Data Engineering 1.16, make sure that you update your DAGs according to the Airflow documentation; otherwise, your DAGs cannot be created in Cloudera Data Engineering and the restore process cannot restore them.

Example of valid DAG:

import pendulum
from airflow import DAG
from airflow.operators.dummy import DummyOperator

dag = DAG("my_tz_dag", start_date=pendulum.datetime(2016, 1, 1, tz="Europe/Amsterdam"))
op = DummyOperator(task_id="dummy", dag=dag)

Example of invalid DAG:

from datetime import timezone
from dateutil import parser
from airflow import DAG
from airflow.operators.dummy import DummyOperator

dag = DAG("my_tz_dag", start_date=parser.isoparse('2020-11-11T20:20:04.268Z').replace(tzinfo=timezone.utc))
op = DummyOperator(task_id="dummy", dag=dag)
COMPX-6949: Stuck jobs prevent cluster scale down

Because of hanging jobs, the cluster is unable to scale down even when there are no ongoing activities. This may happen when an unexpected node removal occurs, causing some pods to be stuck in the Pending state. These pending pods prevent the cluster from downscaling.

Terminate the jobs manually.