Setting Up Apache Sqoop Using the Command Line
Apache Sqoop 1 is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. You can use Sqoop 1 to import data from external structured datastores into the Hadoop Distributed File System (HDFS) or related systems such as Hive and HBase. Conversely, you can use Sqoop 1 to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses.
Upgrading Sqoop 1 from an Earlier CDH 5 release
These instructions assume that you are upgrading Sqoop 1 as part of an upgrade to the latest CDH 5 release, and have already performed the steps under Upgrading from an Earlier CDH 5 Release to the Latest Release.
To upgrade Sqoop 1 from an earlier CDH 5 release, install the new version of Sqoop 1 using the RPM or Debian packages.
Sqoop 1 Prerequisites
Sqoop 1 requires the following:
- An operating system supported by CDH 5.
- Oracle JDK.
- Services that you want to use with Sqoop, such as HBase, Hive HCatalog, and Accumulo. When you run Sqoop, it checks to see if these services are installed and configured. It logs
warnings for services it does not find. These warnings, shown below, are harmless. You can suppress these error messages by setting the variables $HBASE_HOME,
$HCAT_HOME and $ACCUMULO_HOME to any existing directory.
> Warning: /usr/lib/sqoop/../hbase does not exist! HBase imports will fail. > Please set $HBASE_HOME to the root of your HBase installation. > Warning: /usr/lib/sqoop/../hive-hcatalog does not exist! HCatalog jobs will fail. > Please set $HCAT_HOME to the root of your HCatalog installation. > Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail. > Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Installing the Sqoop 1 RPM or Debian Packages
The Sqoop 1 packages consist of:
- sqoop — Complete Sqoop 1 distribution
- sqoop-metastore — For installation of the Sqoop 1 metastore only
To install Sqoop 1 on a RHEL-compatible system:
$ sudo yum install sqoop
To install Sqoop 1 on an Ubuntu or other Debian system:
$ sudo apt-get install sqoop
To install Sqoop 1 on a SLES system:
$ sudo zypper install sqoop
If you have already configured CDH on your system, there is no further configuration necessary for Sqoop 1. You can start using Sqoop 1 by using commands such as:
$ sqoop help $ sqoop version $ sqoop import
Installing the JDBC Drivers for Sqoop 1
Before you begin:
mkdir -p /var/lib/sqoop chown sqoop:sqoop /var/lib/sqoop chmod 755 /var/lib/sqoop
This sets permissions to drwxr-xr-x.
For JDBC drivers for Hive, Impala, Teradata, or Netezza, see the Connectors documentation.
Installing the MySQL JDBC Driver
Download the MySQL JDBC driver from http://www.mysql.com/downloads/connector/j/5.1.html. You will need to sign up for an account if you do not already have one, and log in, before you can download it. Then copy it to the /var/lib/sqoop/ directory. For example:
$ sudo cp mysql-connector-java-version/mysql-connector-java-version-bin.jar /var/lib/sqoop/
Installing the Oracle JDBC Driver
You can download the JDBC Driver from the Oracle website, for example http://www.oracle.com/technetwork/database/enterprise-edition/jdbc-112010-090769.html. You must accept the license agreement before you can download the driver. Download the ojdbc6.jar file and copy it to the /var/lib/sqoop/ directory:
$ sudo cp ojdbc6.jar /var/lib/sqoop/
Installing the Microsoft SQL Server JDBC Driver
Download the Microsoft SQL Server JDBC driver from http://www.microsoft.com/en-us/download/details.aspx?displaylang=en&id=11774 and copy it to the /var/lib/sqoop/ directory. For example:
$ curl -L 'http://download.microsoft.com/download/0/2/A/02AAE597-3865-456C-AE7F-613F99F850A8/sqljdbc_4.0.2206.100_enu.tar.gz' | tar xz $ sudo cp sqljdbc_4.0/enu/sqljdbc4.jar /var/lib/sqoop/
Installing the PostgreSQL JDBC Driver
Download the PostgreSQL JDBC driver from http://jdbc.postgresql.org/download.html and copy it to the /var/lib/sqoop/ directory. For example:
$ curl -L 'http://jdbc.postgresql.org/download/postgresql-9.2-1002.jdbc4.jar' -o postgresql-9.2-1002.jdbc4.jar $ sudo cp postgresql-9.2-1002.jdbc4.jar /var/lib/sqoop/
Syntax for Configuring JDBC Connection Strings
These are the JDBC connection strings for supported databases.
MySql Connection String
Syntax:
jdbc:mysql://<HOST>:<PORT>/<DATABASE_NAME>
Example:
jdbc:mysql://my_mysql_server_hostname:3306/my_database_name
Oracle Connection String
Syntax:
jdbc:oracle:thin:@<HOST>:<PORT>:<DATABASE_NAME>
Example:
jdbc:oracle:thin:@my_oracle_server_hostname:1521:my_database_name
PostgreSQL Connection String
Syntax:
jdbc:postgresql://<HOST>:<PORT>/<DATABASE_NAME>
Example:
jdbc:postgresql://my_postgres_server_hostname:5432/my_database_name
Setting HADOOP_MAPRED_HOME for Sqoop 1
- For each user who will be submitting MapReduce jobs using MapReduce v2 (YARN), or running Pig, Hive, or Sqoop 1 in a YARN installation, make sure that the HADOOP_MAPRED_HOME environment variable is set correctly, as follows:
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
Viewing the Sqoop 1 Documentation
For additional documentation see the Sqoop user guides.