Setting Up Apache Sqoop 2 Using the Command Line
Sqoop 2 is a server-based tool designed to transfer data between Hadoop and relational databases. You can use Sqoop 2 to import data from a relational database management system (RDBMS), such as MySQL or Oracle, into the Hadoop Distributed File System (HDFS), transform the data with Hadoop MapReduce, and then export it back into an RDBMS.
Sqoop 2 has three packaging options for installation:
- Tarball (.tgz) that contains both the Sqoop 2 server and the client.
- Separate RPM packages for Sqoop 2 server (sqoop2-server) and client (sqoop2-client)
- Separate Debian packages for Sqoop 2 server (sqoop2-server) and client (sqoop2-client)
These topics describe the steps to install Sqoop 2.
Feature Differences - Sqoop 1 and Sqoop 2
Feature | Sqoop 1 | Sqoop 2 |
---|---|---|
Connectors for all major RDBMS | Supported. |
Not supported. Workaround: Use the generic JDBC Connector which has been tested on the following databases: Microsoft SQL Server, PostgreSQL, MySQL and Oracle. This connector should work on any other JDBC compliant database. However, performance might not be comparable to that of specialized connectors in Sqoop. |
Kerberos Security Integration | Supported. |
Supported. |
Data transfer from RDBMS to Hive or HBase | Supported. |
Not supported. Workaround: Follow this two-step approach.
|
Data transfer from Hive or HBase to RDBMS | Not supported.
Workaround: Follow this two-step approach.
|
Not supported. Follow the same workaround as for Sqoop 1. |
Sqoop 2 Prerequisites
- An operating system supported by CDH 5.
- Oracle JDK.
-
Hadoop must be installed on the host that runs the Sqoop 2 server component.
- Services that you want to use with Sqoop, such as HBase, Hive HCatalog, and Accumulo. Sqoop checks for these services when you run it, and finds services that are installed and
configured. It logs warnings for services it does not find. These warnings, shown below, are harmless.
> Warning: /usr/lib/sqoop/../hbase does not exist! HBase imports will fail. > Please set $HBASE_HOME to the root of your HBase installation. > Warning: /usr/lib/sqoop/../hive-hcatalog does not exist! HCatalog jobs will fail. > Please set $HCAT_HOME to the root of your HCatalog installation. > Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail. > Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Installing Sqoop 2
Sqoop 2 is distributed as two separate packages: a client package (sqoop2-client) and a server package (sqoop2-server). Install the server package on one host in the cluster; because the Sqoop 2 server acts as a MapReduce client, this host must have Hadoop installed and configured.
Install the client package on each host that acts as a client. A Sqoop 2 client always connects to the Sqoop 2 server to perform any actions, so Hadoop does not need to be installed on the client hosts.
Depending on what you are planning to install, choose the appropriate package and install it using your preferred package manager application.
To install the Sqoop 2 server package on a RHEL-compatible system:
$ sudo yum install sqoop2-server
To install the Sqoop 2 client package on a RHEL-compatible system:
$ sudo yum install sqoop2-client
To install the Sqoop 2 server package on a SLES system:
$ sudo zypper install sqoop2-server
To install the Sqoop 2 client package on a SLES system:
$ sudo zypper install sqoop2-client
To install the Sqoop 2 server package on an Ubuntu or Debian system:
$ sudo apt-get install sqoop2-server
To install the Sqoop 2 client package on an Ubuntu or Debian system:
$ sudo apt-get install sqoop2-client
Configuring Sqoop 2
This section explains how to configure the Sqoop 2 server.
Configuring which Hadoop Version to Use
The Sqoop 2 client does not interact directly with Hadoop MapReduce, and so it does not require any MapReduce configuration.
- To use YARN:
alternatives --set sqoop2-tomcat-conf /etc/sqoop2/tomcat-conf.dist
- To use MRv1:
alternatives --set sqoop2-tomcat-conf /etc/sqoop2/tomcat-conf.mr1
Configuring Sqoop 2 to Use PostgreSQL instead of Apache Derby
Deciding which Database to Use
- Derby runs in embedded mode and it is not possible to monitor its health.
- Though it might be possible, Cloudera currently has no live backup strategy for the embedded Derby database.
- Under load, Cloudera has observed locks and rollbacks with the embedded Derby database that do not happen with server-based databases.
Use the procedure that follows to configure Sqoop 2 to use PostgreSQL instead of Apache Derby.
Create the Sqoop User and Sqoop Database
For example, using the PostgreSQL psql command-line tool:
$ psql -U postgres Password for user postgres: ***** postgres=# CREATE ROLE sqoop LOGIN ENCRYPTED PASSWORD 'sqoop' NOSUPERUSER INHERIT CREATEDB NOCREATEROLE; CREATE ROLE postgres=# CREATE DATABASE "sqoop" WITH OWNER = sqoop ENCODING = 'UTF8' TABLESPACE = pg_default LC_COLLATE = 'en_US.UTF8' LC_CTYPE = 'en_US.UTF8' CONNECTION LIMIT = -1; CREATE DATABASE postgres=# \q
Configure Sqoop 2 to use PostgreSQL
Edit the sqoop.properties file (normally /etc/sqoop2/conf) as follows:
org.apache.sqoop.repository.jdbc.handler=org.apache.sqoop.repository.postgresql.PostgresqlRepositoryHandler org.apache.sqoop.repository.jdbc.transaction.isolation=isolation level org.apache.sqoop.repository.jdbc.maximum.connections=max connections org.apache.sqoop.repository.jdbc.url=jdbc URL org.apache.sqoop.repository.jdbc.driver=org.postgresql.Driver org.apache.sqoop.repository.jdbc.user=username org.apache.sqoop.repository.jdbc.password=password org.apache.sqoop.repository.jdbc.properties.property=value
Installing the JDBC Drivers
Installing the MySQL JDBC Driver
Download the MySQL JDBC driver here. You must sign up for an account if you do not already have one, then log in before you can download the driver. Copy it to the /var/lib/sqoop2/ directory. For example:
$ sudo cp mysql-connector-java-version/mysql-connector-java-version-bin.jar /var/lib/sqoop2/At the time of publication, version was 5.1.31, but the version might change by the time you read this.
Installing the Oracle JDBC Driver
You can download the JDBC Driver from the Oracle website, for example here. You must accept the license agreement before you can download the driver. Download the ojdbc6.jar file and copy it to /var/lib/sqoop2/ directory:
$ sudo cp ojdbc6.jar /var/lib/sqoop2/
Installing the Microsoft SQL Server JDBC Driver
Download the Microsoft SQL Server JDBC driver here and copy it to the /var/lib/sqoop2/ directory. For example:
$ curl -L 'http://download.microsoft.com/download/0/2/A/02AAE597-3865-456C-AE7F-613F99F850A8/sqljdbc_4.0.2206.100_enu.tar.gz' | tar xz $ sudo cp sqljdbc_4.0/enu/sqljdbc4.jar /var/lib/sqoop2/
Installing the PostgreSQL JDBC Driver
Download the PostgreSQL JDBC driver here and copy it to the /var/lib/sqoop2/ directory. For example:
$ curl -L 'http://jdbc.postgresql.org/download/postgresql-9.2-1002.jdbc4.jar' -o postgresql-9.2-1002.jdbc4.jar $ sudo cp postgresql-9.2-1002.jdbc4.jar /var/lib/sqoop2/
Syntax for Configuring JDBC Connection Strings
These are the JDBC connection strings for supported databases.
MySql Connection String
Syntax:
jdbc:mysql://<HOST>:<PORT>/<DATABASE_NAME>
Example:
jdbc:mysql://my_mysql_server_hostname:3306/my_database_name
Oracle Connection String
Syntax:
jdbc:oracle:thin:@<HOST>:<PORT>:<DATABASE_NAME>
Example:
jdbc:oracle:thin:@my_oracle_server_hostname:1521:my_database_name
PostgreSQL Connection String
Syntax:
jdbc:postgresql://<HOST>:<PORT>/<DATABASE_NAME>
Example:
jdbc:postgresql://my_postgres_server_hostname:5432/my_database_name
Starting, Stopping, and Accessing the Sqoop 2 Server
Starting the Sqoop 2 Server
After you have completed all of the required configuration steps, you can start Sqoop 2 server:
$ sudo /sbin/service sqoop2-server start
Stopping the Sqoop 2 Server
$ sudo /sbin/service sqoop2-server stop
Checking that the Sqoop 2 Server has Started
You can verify whether the server has started correctly by connecting to its HTTP interface. The simplest way is to get the server version using following command:
$ wget -qO - localhost:12000/sqoop/version
You should get a text fragment in JSON format similar to the following:
{"version":"1.99.2-cdh5.0.0",...}
Accessing the Sqoop 2 Server with the Sqoop 2 Client
Start the Sqoop 2 client:
sqoop2
Identify the host where your server is running (we will use localhost in this example):
sqoop:000> set server --host localhost
Test the connection by running the command show version --all to obtain the version number from server. You should see output similar to the following:
sqoop:000> show version --all server version: Sqoop 1.99.2-cdh5.0.0 revision ... Compiled by jenkins on ... client version: Sqoop 1.99.2-cdh5.0.0 revision ... Compiled by jenkins on ... Protocol version: [1]