How to Use S3 as Source or Sink in Hue

On this page, we demonstrate how to write to, and read from, an S3 bucket in Hue.

Populate S3 Bucket

In this section, we use open data from the U.S. Geological Survey.

Download 30 days of earthquake data (all_month.csv) from the USGS (~2 MB).
Log on to the Hue Web UI from Cloudera Manager.
Select File Browser > S3 Browser.
Click New > Bucket, name it "quakes-<any unique id>" and click Create.
Tip: S3 bucket names must be globally unique and may contain only lowercase letters, numbers, hyphens, and periods, so append a unique ID rather than using "quakes" alone.
Navigate into the bucket by clicking the bucket name.
Click New > Directory, name it "input" and click Create.
Navigate into the directory by clicking the directory name.
Click Upload and select, or drag, all_month.csv. The path is s3a://quakes/input/all_month.csv.
Important: Do not add anything else to the "input" directory: no extra files and no directories.

Go to the Metastore Manager by clicking Data Browsers > Metastore Tables.
Create a new table from a file by clicking the corresponding icon.
Enter a Table Name such as "earthquakes".
Browse for the Input Directory, s3a://quakes/input/, and click Select this folder.
Select Create External Table from the Load Data menu and click Next.
Delimit by Comma (,) and click Next.
Click Create Table.
Click Browse Data to automatically generate a SELECT query in the Hive editor:
```
SELECT * FROM `default`.`earthquakes` LIMIT 10000;
```
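For readers who prefer to see the table definition as a statement, the wizard steps above correspond roughly to the following HiveQL. This is a minimal sketch, not the DDL that Hue actually generates: the table name `earthquakes_manual` is hypothetical, the column list assumes that the USGS CSV begins with the columns time, latitude, longitude, depth, mag, and magType (trailing columns are simply not mapped), and it assumes the file's first line is a header row.

```sql
-- Hypothetical hand-written counterpart of the wizard; Hue's generated DDL may differ.
-- Assumes the CSV starts with: time, latitude, longitude, depth, mag, magType.
-- `time` is backquoted because newer Hive releases treat it as a reserved word.
CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`earthquakes_manual` (
  `time`    STRING,
  latitude  DOUBLE,
  longitude DOUBLE,
  depth     DOUBLE,
  mag       DOUBLE,
  magtype   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3a://quakes/input/'
-- Assumption: skip the header line of all_month.csv.
TBLPROPERTIES ('skip.header.line.count' = '1');
```

Because the table is external, dropping it removes only the metadata and leaves all_month.csv in the bucket. The LOCATION points at the directory rather than the file, which is why the "input" directory must contain nothing else.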

Run and Export Results in Hive
1. Run the query by clicking Execute.
2. Click Get Results.
3. Select Export to open the Save query result dialog.
Save Results as Custom File
1. Select In store (max 10000000 cells) and open the Path to CSV file dialog.
2. Navigate into the bucket, s3a://quakes.
3. Create a folder named "output".
4. Navigate into the output directory and click Select this folder.
5. Append a file name to the path, such as quakes.csv.
6. Click Save. The results are saved as s3a://quakes/output/quakes.csv.
Save Results as MapReduce files
1. Select In store (large result) and open the Path to empty directory dialog.
2. Navigate into the bucket, s3a://quakes.
3. If you have not done so, create a folder named "output".
4. Navigate into the output directory and click Select this folder.
5. Click Save. A MapReduce job is run and results are stored in s3a://quakes/output/.
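
A comparable result can be produced directly from the Hive editor. The statement below is a sketch of that approach, not necessarily what Hue runs behind the Save dialog; it assumes the `earthquakes` table and the s3a://quakes/output/ directory from the steps above.

```sql
-- Sketch only: writes the query result as comma-delimited files into the directory.
-- Caution: INSERT OVERWRITE DIRECTORY replaces whatever the target directory holds.
INSERT OVERWRITE DIRECTORY 's3a://quakes/output/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM `default`.`earthquakes`;
```

Hive chooses the output file names itself, so the target should be a directory reserved for this result rather than one shared with other data.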
Save Results as Table
1. Run a query for earthquakes measured on a moment magnitude scale, then export the results:
```
SELECT time,
         latitude,
         longitude,
         mag
FROM `default`.`earthquakes`
WHERE magtype IN ('mw','mwb','mwc','mwr','mww');
```
2. Select A new table and input <database>.<new table name>.
3. Click Save.
4. Click Browse Data to view the new table.
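
The same outcome can be sketched as a single CREATE TABLE AS SELECT statement. The table name `moment_quakes` is hypothetical, and this is an illustration of the idea rather than the statement Hue issues when you choose A new table.

```sql
-- Sketch only: materialize the moment-magnitude subset as a new managed table.
-- `moment_quakes` is a hypothetical name; substitute <database>.<new table name>.
CREATE TABLE `default`.`moment_quakes` AS
SELECT time,
       latitude,
       longitude,
       mag
FROM `default`.`earthquakes`
WHERE magtype IN ('mw','mwb','mwc','mwr','mww');
```

On Hive releases that reserve the word TIME, the `time` column needs backquotes.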

This section addresses some error messages you may encounter when attempting to use Hue with S3.

Failed to access path
```
Failed to access path: "s3a://quakes". Check that you have access to read this bucket and that the region is correct.
```
Possible solution: Check your bucket region:
1. Log on to your AWS account and navigate to the S3 service.
2. Select your bucket, for example "quakes", and click Properties.
3. Find your region. If it says US Standard, then region=us-east-1.
4. Update your configuration in Hue Service Advanced Configuration Snippet (Safety Valve) for hue_safety_valve.ini.
5. Save your changes and restart Hue.
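
As an illustration of step 4, the safety valve entry might look like the fragment below. It assumes the standard `[aws]` section layout of hue.ini and an account named `default`; replace the region value with the one shown for your bucket.

```ini
# Sketch of a hue_safety_valve.ini entry; assumes the account is named "default".
[aws]
  [[aws_accounts]]
    [[[default]]]
      region=us-east-1
```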

The table could not be created
```
The table could not be created. Error while compiling statement: FAILED: SemanticException com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain.
```
Possible solution: Set your S3 credentials in Hive core-site.xml:
1. In Cloudera Manager, go to Hive > Configuration.
2. Filter by Category > Advanced.
3. Set your credentials in Hive Service Advanced Configuration Snippet (Safety Valve) for core-site.xml.
  1. Click the + button and input Name and Value for fs.s3a.AccessKeyId.
  2. Click the + button and input Name and Value for fs.s3a.SecretAccessKey.
4. Save your changes and restart Hive.
The target path is a directory

Possible solution: Remove any directories or files that may have been added to s3a://quakes/input/ (so that all_month.csv is alone).
Bad status for request TFetchResultsReq … Not a file
```
Bad status for request TFetchResultsReq(...): TFetchResultsResp(status=TStatus(errorCode=0, errorMessage='java.io.IOException: java.io.IOException: Not a file: s3a://Not a file: s3a://quakes/input/output' ...
```
Possible solution: Remove any directories or files that may have been added to s3a://quakes/input/ (so that all_month.csv is alone). Here, Hive cannot successfully query the earthquakes table (based on all_month.csv) due to the directory, s3a://quakes/input/output.

Tip: Run tail -f against the Hive server log in: /var/log/hive/.