Error Messages and Various Failures
The symptoms and error messages below may indicate a problem with Kerberos authentication or an unrelated configuration issue.
Continue reading:
- Cluster cannot run jobs after Kerberos enabled
- NameNode fails to start
- Clients cannot connect to NameNode
- Hadoop commands run in local realm but not in remote realm
- Users cannot obtain credentials when running Hadoop jobs or commands
- Bogus replay exceptions in service logs
- Cloudera Manager cluster services fail to start
- Error Messages
Cluster cannot run jobs after Kerberos enabled
Symptom: Cluster previously configured without Kerberos authentication may fail to run jobs for certain users on certain TaskTrackers (MRv1) or NodeManagers (YARN) after enabling Kerberos for the cluster. Errors may display in the TaskTracker or NodeManager logs. The following example errors are from TaskTracker on MRv1:
10/11/03 01:29:55 INFO mapred.JobClient: Task Id : attempt_201011021321_0004_m_000011_0, Status : FAILED
Error initializing attempt_201011021321_0004_m_000011_0:
java.io.IOException: org.apache.hadoop.util.Shell$ExitCodeException:
    at org.apache.hadoop.mapred.LinuxTaskController.runCommand(LinuxTaskController.java:212)
    at org.apache.hadoop.mapred.LinuxTaskController.initializeUser(LinuxTaskController.java:442)
    at org.apache.hadoop.mapreduce.server.tasktracker.Localizer.initializeUserDirs(Localizer.java:272)
    at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:963)
    at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2209)
    at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2174)
Caused by: org.apache.hadoop.util.Shell$ExitCodeException:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:250)
    at org.apache.hadoop.util.Shell.run(Shell.java:177)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:370)
    at org.apache.hadoop.mapred.LinuxTaskController.runCommand(LinuxTaskController.java:203)
    ... 5 more
Possible cause: The problem typically follows this sequence of events:
- A cluster that had not been configured for Kerberos authentication was used to run jobs, which created a local user directory (or directories) on each TaskTracker or NodeManager host.
- The cluster was then configured to use Kerberos authentication.
- Users try to run jobs on the newly secured cluster, but the local user directories on the TaskTrackers or NodeManagers are owned by the wrong user or have overly permissive permissions.
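Steps to resolve: For each affected user, delete the user's local directories under the paths configured in mapred.local.dir (MRv1) or yarn.nodemanager.local-dirs (YARN) on every TaskTracker or NodeManager host in the cluster.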
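As a rough aid for the cleanup, the following Python sketch only builds the list of candidate per-user directories for review; it deletes nothing. The function name is hypothetical, the example paths are illustrative, and the subdirectory names are assumptions: taskTracker/<user> matches the MRv1 paths that appear in TaskTracker logs, and usercache/<user> is assumed for YARN. Verify both against your own hosts before removing anything.

```python
import posixpath


def user_local_dirs(local_dirs: str, user: str, yarn: bool) -> list:
    """List candidate per-user directories under a comma-separated
    mapred.local.dir or yarn.nodemanager.local-dirs value.

    Assumed layout (check on your hosts): <dir>/taskTracker/<user> for
    MRv1 and <dir>/usercache/<user> for YARN.
    """
    subdir = "usercache" if yarn else "taskTracker"
    roots = [d.strip() for d in local_dirs.split(",") if d.strip()]
    return [posixpath.join(root, subdir, user) for root in roots]


# Illustrative values only; use the values from your own configuration.
for path in user_local_dirs("/data/1/mapred/local,/data/2/mapred/local", "atm", yarn=False):
    print(path)
```

Printing the paths first makes it easier to confirm ownership and permissions on each host before deleting the directories.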
NameNode fails to start
Symptom: The NameNode fails to start, and its log contains one of the following exceptions:
Caused by: KrbException: Integrity check on decrypted field failed (31) - PREAUTH_FAILED
Caused by: KrbException: Identifier does not match expected value (906)
Possible cause: This issue may be due to an incorrect AES-256 configuration. By default, certain operating systems (CentOS/Red Hat Enterprise Linux 5.6 and higher, Ubuntu) use AES-256 encryption, which requires either installing the Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy File on all hosts (see JCE Policy File for AES-256 Encryption for details) or disabling AES-256 support in the kdc.conf or krb5.conf file (see disable AES-256 encryption from the Kerberos instance for details).
Steps to resolve: KrbException 31 and KrbException 906 can be caused by various issues, but the most likely cause is incorrectly configured AES-256 encryption. Resolving the issue should start by determining the type of encryption configured for the cluster.
To verify the type of encryption configured for the cluster:
- On the local KDC host, type this command to create a test principal:
$ kadmin -q "addprinc test"
- On a cluster host, type this command to start a Kerberos session as test:
$ kinit test
- On a cluster host, type this command to view the encryption type in use:
$ klist -e
If AES-256 is being used, output such as the following displays:
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: test@SCM
Valid starting     Expires            Service principal
05/19/11 13:25:04  05/20/11 13:25:04  krbtgt/SCM@SCM
    Etype (skey, tkt): AES-256 CTS mode with 96-bit SHA-1 HMAC, AES-256 CTS mode with 96-bit SHA-1 HMAC
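To check several hosts, the inspection of the klist -e output can be scripted. The following Python sketch takes captured output as text and reports whether any Etype line mentions AES-256. The function name is hypothetical, and the pattern assumes the encryption type appears either as the long description shown above or as a short name beginning with aes256, which some klist versions may print instead.

```python
import re

# Matches "AES-256 ..." (long description) as well as "aes256-..." (short name).
AES256_PATTERN = re.compile(r"aes-?256", re.IGNORECASE)


def tickets_using_aes256(klist_output: str) -> bool:
    """Return True if any Etype line of `klist -e` output mentions AES-256."""
    for line in klist_output.splitlines():
        if "etype" in line.lower() and AES256_PATTERN.search(line):
            return True
    return False


sample = """\
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: test@SCM
Valid starting     Expires            Service principal
05/19/11 13:25:04  05/20/11 13:25:04  krbtgt/SCM@SCM
    Etype (skey, tkt): AES-256 CTS mode with 96-bit SHA-1 HMAC, AES-256 CTS mode with 96-bit SHA-1 HMAC
"""
print(tickets_using_aes256(sample))  # True
```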
To remove AES-256 encryption from the Kerberos configuration files:
- Remove aes256-cts:normal from the supported_enctypes field of the kdc.conf or krb5.conf file.
- After changing the configuration, restart the KDC and the kadmin servers.
- Change TGT principal (krbtgt/REALM@REALM) and other principal passwords as needed.
- In the [realms] section of the kdc.conf file, for the realm associated with HADOOP.LOCALDOMAIN, add (or replace if it exists already) the following variable:
supported_enctypes = des3-hmac-sha1:normal arcfour-hmac:normal des-hmac-sha1:normal des-cbc-md5:normal des-cbc-crc:normal des-cbc-crc:v4 des-cbc-crc:afs3
- Recreate the hdfs keytab file and mapred keytab file using the -norandkey option in the xst command (see Step 4: Create and Deploy the Kerberos Principals and Keytab Files for details):
kadmin.local: xst -norandkey -k hdfs.keytab hdfs/fully.qualified.domain.name HTTP/fully.qualified.domain.name
kadmin.local: xst -norandkey -k mapred.keytab mapred/fully.qualified.domain.name HTTP/fully.qualified.domain.name
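The edit to supported_enctypes described in the steps above can be previewed before you touch the configuration file. The following Python sketch, with a hypothetical function name, removes every entry that starts with aes256- from a single supported_enctypes line. It assumes the entries are separated by whitespace, as in the example above, and it does not read or write kdc.conf or krb5.conf itself.

```python
def drop_aes256(supported_enctypes_line: str) -> str:
    """Return the supported_enctypes line without any aes256-* entries.

    Assumes a single 'key = value' line with whitespace-separated entries.
    """
    key, _, value = supported_enctypes_line.partition("=")
    kept = [entry for entry in value.split() if not entry.lower().startswith("aes256-")]
    return f"{key.strip()} = {' '.join(kept)}"


line = "supported_enctypes = aes256-cts:normal des3-hmac-sha1:normal arcfour-hmac:normal"
print(drop_aes256(line))
# supported_enctypes = des3-hmac-sha1:normal arcfour-hmac:normal
```

After applying the equivalent change by hand, the restart, password change, and keytab steps still apply.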
Clients cannot connect to NameNode
Symptom: The NameNode keytab file does not have an AES-256 entry but the client tickets do. The NameNode starts but clients cannot connect to it. The error message does not mention "AES256" by name; it contains the enctype code "18" instead.
Possible cause: Issue related to AES-256 encryption and the JCE library.
Steps to resolve: Verify that the Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy File is installed, or remove aes256-cts:normal from the supported_enctypes field of the kdc.conf or krb5.conf file, as detailed above.
Hadoop commands run in local realm but not in remote realm
Symptom: After enabling cross-realm trust, authenticating as a principal in the local realm lets you successfully run Hadoop commands, but authenticating as a principal in the remote realm does not.
Possible cause: This issue is often due to principals in the two realms having different encryption types or different passwords for the cross-realm principal in each realm. Because the local and remote realm each issue their own TGTs, the local commands run but the service ticket needed for the local and remote realms to communicate cannot be granted.
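Steps to resolve: In the kadmin shell, add the cross-realm krbtgt principal with an explicit list of encryption types, where enc_type_list includes at least one encryption type that the remote realm also uses. Use the same password for this principal in both realms.
kadmin: addprinc -e "enc_type_list" krbtgt/LOCAL-REALM.EXAMPLE.COM@MAIN-REALM.COMPANY.COM
For example:
kadmin: addprinc -e "aes256-cts:normal rc4-hmac:normal des3-hmac-sha1:normal" krbtgt/LOCAL-REALM.EXAMPLE.COM@MAIN-REALM.COMPANY.COM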
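Because mismatched encryption types are a common cause of this failure, it can help to compare the lists used in each realm before recreating the principal. The following Python sketch, with hypothetical function names, returns the encryption types that two enc_type_list strings have in common. The remote list in the example is invented for illustration. The alias table reflects the assumption that rc4-hmac and arcfour-hmac-md5 are alternate names for arcfour-hmac, so check it against your Kerberos version.

```python
# Assumed alternate spellings of the same encryption type.
ALIASES = {"rc4-hmac": "arcfour-hmac", "arcfour-hmac-md5": "arcfour-hmac"}


def enctype_names(enc_type_list: str) -> set:
    """Extract normalized enctype names from 'name:salt name:salt ...'."""
    names = set()
    for entry in enc_type_list.split():
        name = entry.split(":", 1)[0].lower()
        names.add(ALIASES.get(name, name))
    return names


def shared_enctypes(local_list: str, remote_list: str) -> list:
    """Return the sorted enctype names present in both lists."""
    return sorted(enctype_names(local_list) & enctype_names(remote_list))


local = "aes256-cts:normal rc4-hmac:normal des3-hmac-sha1:normal"
remote = "arcfour-hmac:normal des-cbc-crc:normal"  # hypothetical remote realm list
print(shared_enctypes(local, remote))  # ['arcfour-hmac']
```

An empty result suggests the two realms cannot agree on an encryption type for the cross-realm ticket.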
Users cannot obtain credentials when running Hadoop jobs or commands
Symptom: The following error appears when running Hadoop jobs or commands:
13/01/15 17:44:48 DEBUG ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Fail to create credential. (63) - No service creds)]
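Possible cause: The ticket message may be too large for UDP, which Kerberos clients use by default to contact the KDC.
Steps to resolve: Force Kerberos to use TCP instead of UDP by adding the following parameter to the [libdefaults] section of the krb5.conf file on the affected clients:
[libdefaults]
udp_preference_limit = 1
If krb5.conf is managed through Cloudera Manager, this setting is added to krb5.conf automatically.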
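To confirm that a client's krb5.conf actually carries the udp_preference_limit setting shown above, a small check along the following lines may help. This Python sketch uses hypothetical function names and a deliberately simple line-based reader: it handles only flat key = value entries inside [libdefaults] and ignores nested blocks and include directives. The sample realm name is illustrative.

```python
from typing import Optional


def libdefaults_value(krb5_conf_text: str, key: str) -> Optional[str]:
    """Return the value of `key` in the [libdefaults] section, or None."""
    section = None
    for raw in krb5_conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip()
            continue
        if section == "libdefaults" and "=" in line:
            name, _, value = line.partition("=")
            if name.strip() == key:
                return value.strip()
    return None


def forces_tcp(krb5_conf_text: str) -> bool:
    """True if udp_preference_limit is set to 1 in [libdefaults]."""
    return libdefaults_value(krb5_conf_text, "udp_preference_limit") == "1"


sample = """\
[libdefaults]
default_realm = EXAMPLE.COM
udp_preference_limit = 1

[realms]
EXAMPLE.COM = {
  kdc = kdc.example.com
}
"""
print(forces_tcp(sample))  # True
```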
Bogus replay exceptions in service logs
Symptom: Multiple valid requests to Kerberos protected services are identified as replay attempts when they are not. The following exception shows up in the logs for one or more of the Hadoop daemons:
2013-02-28 22:49:03,152 INFO ipc.Server (Server.java:doRead(571)) - IPC Server listener on 8020: readAndProcess threw exception javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: Failure unspecified at GSS-API level (Mechanism l
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: Failure unspecified at GSS-API level (Mechanism level: Request is a replay (34))]
    at com.sun.security.sasl.gsskerb.GssKrb5Server.evaluateResponse(GssKrb5Server.java:159)
    at org.apache.hadoop.ipc.Server$Connection.saslReadAndProcess(Server.java:1040)
    at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1213)
    at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:566)
    at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:363)
Caused by: GSSException: Failure unspecified at GSS-API level (Mechanism level: Request is a replay (34))
    at sun.security.jgss.krb5.Krb5Context.acceptSecContext(Krb5Context.java:741)
    at sun.security.jgss.GSSContextImpl.acceptSecContext(GSSContextImpl.java:323)
    at sun.security.jgss.GSSContextImpl.acceptSecContext(GSSContextImpl.java:267)
    at com.sun.security.sasl.gsskerb.GssKrb5Server.evaluateResponse(GssKrb5Server.java:137)
    ... 4 more
Caused by: KrbException: Request is a replay (34)
    at sun.security.krb5.KrbApReq.authenticate(KrbApReq.java:300)
    at sun.security.krb5.KrbApReq.<init>(KrbApReq.java:134)
    at sun.security.jgss.krb5.InitSecContextToken.<init>(InitSecContextToken.java:79)
    at sun.security.jgss.krb5.Krb5Context.acceptSecContext(Krb5Context.java:724)
    ... 7 more
This issue can also manifest as poor performance for clients of the cluster, including dropped connections, timeouts attempting to make RPC calls, and so on.
Possible cause: Kerberos uses a second-resolution timestamp to protect against replay attacks (where an attacker can record network traffic, and play back recorded requests later to gain elevated privileges). That is, incoming requests are cached by Kerberos for a little while, and if there are similar requests within a few seconds, Kerberos will be able to detect them as replay attack attempts (see MIT Kerberos replay cache for more information).
However, if there are multiple valid Kerberos requests coming in at the same time, these may also be misjudged as attacks for the following reasons:
- Multiple services in the cluster are using the same Kerberos principal. All secure clients that run on multiple machines should use unique Kerberos principals for each machine. For example, rather than connecting as a service principal myservice@EXAMPLE.COM, services should have per-host principals such as myservice/host123.example.com@EXAMPLE.COM.
- Clocks not synchronized: All hosts should run NTP so that clocks are kept in sync between clients and servers.
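To audit a list of principals for the first cause above, the distinction between a shared service principal and a per-host principal can be checked mechanically. The following Python sketch, with a hypothetical function name, treats a principal as per-host when it has the form primary/instance@REALM. It assumes the instance component is a host name and does not verify that the host exists.

```python
import re

# primary[/instance]@REALM, with no extra "/" or "@" inside a component.
PRINCIPAL = re.compile(
    r"^(?P<primary>[^/@]+)(?:/(?P<instance>[^/@]+))?@(?P<realm>[^/@]+)$"
)


def is_per_host_principal(principal: str) -> bool:
    """True if the principal has an instance component (primary/instance@REALM)."""
    match = PRINCIPAL.match(principal)
    return bool(match and match.group("instance"))


print(is_per_host_principal("myservice@EXAMPLE.COM"))                     # False
print(is_per_host_principal("myservice/host123.example.com@EXAMPLE.COM")) # True
```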
Steps to resolve:
Using a distinct principal for each service and keeping clocks in sync helps mitigate the issue, but the problem can persist even when both measures are in place. In that case, disabling the replay cache (and, as a consequence, replay protection) allows parallel requests to succeed. To apply this trade-off between usability and security, set the KRB5RCACHETYPE environment variable to none.
Java applications do not automatically detect KRB5RCACHETYPE. For Java-based components:
- Ensure that the cluster runs on JDK 8.
- To disable the replay cache, add -Dsun.security.krb5.rcache=none to the Java options or arguments of the targeted JVM, for example HiveServer2 or the Sentry service.
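The two settings named above can be applied from a launcher script. The following Python sketch shows one way to build them: an environment mapping with KRB5RCACHETYPE=none for native components, and a Java options string with -Dsun.security.krb5.rcache=none for JVMs. The function names are hypothetical. The options string is split on whitespace, so quoted options that contain spaces are not supported.

```python
import os

RCACHE_PREFIX = "-Dsun.security.krb5.rcache="


def java_opts_without_replay_cache(java_opts: str) -> str:
    """Return java_opts with exactly one rcache setting, set to none."""
    kept = [opt for opt in java_opts.split() if not opt.startswith(RCACHE_PREFIX)]
    kept.append(RCACHE_PREFIX + "none")
    return " ".join(kept)


def env_without_replay_cache(base_env=None) -> dict:
    """Return a copy of the environment with KRB5RCACHETYPE set to none."""
    env = dict(os.environ if base_env is None else base_env)
    env["KRB5RCACHETYPE"] = "none"
    return env


print(java_opts_without_replay_cache("-Xmx4g -XX:+UseG1GC"))
# -Xmx4g -XX:+UseG1GC -Dsun.security.krb5.rcache=none
```

Because this disables replay protection, apply it only to the services that show the problem.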
Cloudera Manager cluster services fail to start
Symptom: Cluster services managed by Cloudera Manager fail to start, and an error such as the following is logged:
Exception in secureMain java.lang.ExceptionInInitializerError
    at javax.crypto.KeyGenerator.nextSpi(KeyGenerator.java:324)
    at javax.crypto.KeyGenerator.<init>(KeyGenerator.java:157)
    ...
Caused by: java.lang.SecurityException: The jurisdiction policy files are not signed by a trusted signer!
    at javax.crypto.JarVerifier.verifyPolicySigned(JarVerifier.java:289)
    at javax.crypto.JceSecurity.loadPolicies(JceSecurity.java:316)
    at javax.crypto.JceSecurity.setupJurisdictionPolicies(JceSecurity.java:261)
    ...
Possible cause: This is another example of a mismatch for AES-256 encryption. Services cannot start when the version of the JCE policy file does not match the version of Java installed on a node because the cryptographic signatures for the JCE policy files cannot be verified, resulting in the message shown above.
Steps to resolve:
- Check that the encryption types match between your KDC and krb5.conf on all hosts.
- If you are using AES-256, follow the instructions at Step 2: Installing JCE Policy File for AES-256 Encryption to deploy the JCE policy file that matches the installed Java version on all hosts: download and unpack the zip file, then copy the two JAR files to the $JAVA_HOME/jre/lib/security directory on each host in the cluster.
Error Messages
Incorrect permission Java exception (java.io.IOException)
java.io.IOException: Incorrect permission for /var/folders/B3/B3d2vCm4F+mmWzVPB89W6E+++TI/-Tmp-/tmpYTil84/dfs/data/data1, expected: rwxr-xr-x, while actual: rwxrwxr-x
    at org.apache.hadoop.util.DiskChecker.checkPermission(DiskChecker.java:107)
    at org.apache.hadoop.util.DiskChecker.mkdirsWithExistsAndPermissionCheck(DiskChecker.java:144)
    at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:160)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1484)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1432)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1408)
    at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:418)
    ...
Possible cause: The daemon has umask 0002 rather than 0022.
Steps to resolve: Make sure that the umask for hdfs and mapred is 0022.
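The link between the umask and the permissions in the error message is a short calculation. The following Python sketch, with a hypothetical function name, applies a umask to a directory created with a requested mode of 0777 (an assumption about how the directory is created). It shows that umask 0002 yields the actual permissions in the error and umask 0022 yields the expected ones.

```python
import stat


def dir_permissions_for_umask(umask: int) -> str:
    """Permission string of a directory created with mode 0777 under `umask`."""
    mode = 0o777 & ~umask
    # stat.filemode() prefixes the file type ('d'); strip it to match the log format.
    return stat.filemode(stat.S_IFDIR | mode)[1:]


print(dir_permissions_for_umask(0o002))  # rwxrwxr-x (the "actual" value in the error)
print(dir_permissions_for_umask(0o022))  # rwxr-xr-x (the "expected" value)
```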
MapReduce (MRv1) Errors
These error messages are associated with MapReduce only (not YARN).
Jobs won't run and cannot access files in mapred.local.dir
WARN org.apache.hadoop.mapred.TaskTracker: Exception while localization java.io.IOException: Job initialization failed (1)
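Possible cause: The mapred user may not be a member of both the mapred and hadoop groups on all hosts.
Steps to resolve:
- Add the mapred user to the mapred and hadoop groups on all hosts.
- Restart all TaskTrackers.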
Jobs cannot run and TaskTracker cannot create local mapred directory
11/08/17 14:44:06 INFO mapred.TaskController: main : user is atm
11/08/17 14:44:06 INFO mapred.TaskController: Failed to create directory /var/log/hadoop/cache/mapred/mapred/local1/taskTracker/atm - No such file or directory
11/08/17 14:44:06 WARN mapred.TaskTracker: Exception while localization java.io.IOException: Job initialization failed (20)
    at org.apache.hadoop.mapred.LinuxTaskController.initializeJob(LinuxTaskController.java:191)
    at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1199)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
    at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1174)
    at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1089)
    at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2257)
    at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2221)
Caused by: org.apache.hadoop.util.Shell$ExitCodeException:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:255)
    at org.apache.hadoop.util.Shell.run(Shell.java:182)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
    at org.apache.hadoop.mapred.LinuxTaskController.initializeJob(LinuxTaskController.java:184)
    ... 8 more
Possible cause: Mismatch of mapred.local.dir values specified in mapred-site.xml and taskcontroller.cfg. These values should be the same.
Steps to resolve: Verify that the setting for mapred.local.dir is the same in both mapred-site.xml and taskcontroller.cfg, and reconfigure if necessary.
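The comparison of the two files can be automated. The following Python sketch reads a property from mapred-site.xml text and from taskcontroller.cfg text and reports whether the values are identical. The function names are hypothetical and the paths in the sample are illustrative. It assumes the usual Hadoop <property><name>/<value> layout for the XML file and plain key=value lines for taskcontroller.cfg.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def xml_property(site_xml: str, name: str) -> Optional[str]:
    """Return the <value> of the named <property> in a Hadoop *-site.xml."""
    root = ET.fromstring(site_xml)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == name:
            return (prop.findtext("value") or "").strip()
    return None


def cfg_property(cfg_text: str, name: str) -> Optional[str]:
    """Return the value of `name` from key=value lines, ignoring # comments."""
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            return value.strip()
    return None


def same_setting(site_xml: str, cfg_text: str, name: str = "mapred.local.dir") -> bool:
    """True only if the property is present in both files with identical values."""
    site_value = xml_property(site_xml, name)
    return site_value is not None and site_value == cfg_property(cfg_text, name)


# Illustrative file contents; substitute the text of your own files.
site = """<configuration>
  <property>
    <name>mapred.local.dir</name>
    <value>/data/1/mapred/local</value>
  </property>
</configuration>"""
cfg = "mapred.local.dir=/data/1/mapred/local\nhadoop.log.dir=/var/log/hadoop-0.20-mapreduce\n"
print(same_setting(site, cfg))  # True
```

The comparison is a plain string match, so differences in path order or trailing slashes are reported as a mismatch.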
Jobs cannot run and TaskTracker cannot create Hadoop logs directory
11/08/17 14:48:23 INFO mapred.TaskController: Failed to create directory /home/atm/src/cloudera/hadoop/build/hadoop-0.23.2-cdh3u1-SNAPSHOT/logs1/userlogs/job_201108171441_0004 - No such file or directory
11/08/17 14:48:23 WARN mapred.TaskTracker: Exception while localization java.io.IOException: Job initialization failed (255)
    at org.apache.hadoop.mapred.LinuxTaskController.initializeJob(LinuxTaskController.java:191)
    at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1199)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
    at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1174)
    at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1089)
    at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2257)
    at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2221)
Caused by: org.apache.hadoop.util.Shell$ExitCodeException:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:255)
    at org.apache.hadoop.util.Shell.run(Shell.java:182)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
    at org.apache.hadoop.mapred.LinuxTaskController.initializeJob(LinuxTaskController.java:184)
    ... 8 more
Possible cause: The hadoop.log.dir setting is misconfigured: either the path is not owned by and writable by the mapred user, or its value differs between mapred-site.xml and taskcontroller.cfg.
Steps to resolve: In MRv1, the default value specified for hadoop.log.dir in mapred-site.xml is /var/log/hadoop-0.20-mapreduce. The path must be owned by and writable by the mapred user. If you change the default value specified for hadoop.log.dir, make sure the value is identical in mapred-site.xml and taskcontroller.cfg. If the values are different, the error message above is returned.