Securing Impala Data and Log Files
One aspect of security is to protect files from unauthorized access at the filesystem level. For example, if you store sensitive data in HDFS, you specify permissions on the associated files and directories in HDFS to restrict read and write permissions to the appropriate users and groups.
If you issue queries containing sensitive values in the WHERE clause, such as financial account numbers, those values are written to Impala log files on the Linux filesystem, and you must secure those files as well. For the locations of Impala log files, see Using Impala Logging.
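As a rough illustration of auditing log file permissions, the following sketch walks a directory and reports regular files that users outside the owning user and group can read or write. The function name and the directory you pass in are assumptions for illustration; substitute the log location configured for your cluster.

```python
import os
import stat


def overly_permissive_files(log_dir):
    """Return sorted paths of regular files under log_dir that grant
    read or write access to 'other' users."""
    flagged = []
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            path = os.path.join(root, name)
            mode = os.lstat(path).st_mode
            # Skip symlinks and special files; the link targets are
            # checked when the walk reaches them.
            if not stat.S_ISREG(mode):
                continue
            if mode & (stat.S_IROTH | stat.S_IWOTH):
                flagged.append(path)
    return sorted(flagged)
```

Any path the function returns is a candidate for a tighter mode, such as one that grants access only to the owning user and group.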
All Impala read and write operations are performed under the filesystem privileges of the impala user. The impala user must be able to read all directories and data files that you query, and write to all directories and data files targeted by INSERT and LOAD DATA statements. At a minimum, make sure the impala user is in the hive group so that it can access files and directories shared between Impala and Hive. See User Account Requirements for more details.
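To see why membership in the hive group matters, here is a minimal sketch of the owner/group/other permission check that HDFS applies, modeled on POSIX. It deliberately ignores HDFS ACLs and the superuser bypass, and the function name, the example owner, and the example modes are all assumptions for illustration.

```python
def can_access(user, user_groups, owner, group, mode, want):
    """Simplified owner/group/other check.

    mode is an octal permission value such as 0o770; want is a string
    made up of 'r', 'w', and 'x'. ACLs and superuser rules are ignored.
    """
    if user == owner:
        bits = (mode >> 6) & 0o7   # owner class applies
    elif group in user_groups:
        bits = (mode >> 3) & 0o7   # group class applies
    else:
        bits = mode & 0o7          # other class applies
    needed = 0
    for ch in want:
        needed |= {"r": 4, "w": 2, "x": 1}[ch]
    return (bits & needed) == needed


# Hypothetical table directory owned by hive:hive with mode 770.
in_group = can_access("impala", {"impala", "hive"}, "hive", "hive", 0o770, "rwx")
not_in_group = can_access("impala", {"impala"}, "hive", "hive", 0o770, "r")
```

With these example values, `in_group` is `True` and `not_in_group` is `False`: without the hive group, the impala user falls into the "other" class and gets no access to the directory.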
Setting file permissions is necessary for Impala to function correctly, but is not an effective security practice by itself:
- The way to ensure that only authorized users can submit requests for databases and tables they are allowed to access is to set up Sentry authorization, as explained in Enabling Sentry Authorization for Impala. With authorization enabled, the checking of the user ID and group is done by Impala, and unauthorized access is blocked by Impala itself. The actual low-level read and write requests are still done by the impala user, so you must have appropriate file and directory permissions for that user ID.
- You must also set up Kerberos authentication, as described in Enabling Kerberos Authentication for Impala, so that users can only connect from trusted hosts. With Kerberos enabled, if someone connects a new host to the network and creates user IDs that match your privileged IDs, they will be blocked from connecting to Impala at all from that host.