Setting Up Apache Pig Using the Command Line

Apache Pig enables you to analyze large amounts of data using its query language, Pig Latin. Pig Latin queries run as distributed jobs on a Hadoop cluster.

Installing Pig

To install Pig on RHEL-compatible systems:

$ sudo yum install pig

To install Pig on SLES systems:

$ sudo zypper install pig

To install Pig on Ubuntu and other Debian systems:

$ sudo apt-get install pig
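
The three commands above differ only in the package manager. A minimal shell sketch that picks the matching command might look like the following. The function name pig_install_cmd is hypothetical and not part of Pig or any distribution, and the sketch only prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper (not part of Pig): print the install command
# that this guide gives for a named package manager.
pig_install_cmd() {
    case "$1" in
        yum)     echo "sudo yum install pig" ;;      # RHEL-compatible systems
        zypper)  echo "sudo zypper install pig" ;;   # SLES systems
        apt-get) echo "sudo apt-get install pig" ;;  # Ubuntu and other Debian systems
        *)       echo "unsupported package manager: $1" >&2; return 1 ;;
    esac
}

# Probe for the first of the three package managers present on this machine
# and print the corresponding command. Nothing is installed by this loop.
for pm in yum zypper apt-get; do
    if command -v "$pm" >/dev/null 2>&1; then
        pig_install_cmd "$pm"
        break
    fi
done
```

Printing the command instead of executing it keeps the sketch safe to run on any machine; run the printed command yourself once it looks right.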

To start Pig in interactive mode (YARN)

Use the following command:

$ pig

To start Pig in interactive mode (MRv1)

Use the following command:

$ pig

You should see output similar to the following:

2012-02-08 23:39:41,819 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/arvind/pig-0.11.0-cdh5b1/bin/pig_1328773181817.log
2012-02-08 23:39:41,994 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost/
...
grunt>

Examples

To verify that the input and output directories from the YARN or MRv1 example grep job exist, list an HDFS directory from the Grunt shell:

grunt> ls
hdfs://localhost/user/joe/input <dir>
hdfs://localhost/user/joe/output <dir>

To run the grep example as a Pig job on the same input directory:

grunt> A = LOAD 'input';
grunt> B = FILTER A BY $0 MATCHES '.*dfs[a-z.]+.*';
grunt> DUMP B;
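
The same three statements can also be run without opening the Grunt shell. The sketch below assumes that pig is on the PATH and uses a file name, grep_example.pig, chosen only for illustration. It writes the statements to a script file and passes the file to pig, which runs a script given as an argument in batch mode.

```shell
#!/bin/sh
# Write the Grunt statements from the example into a script file.
# The quoted 'EOF' delimiter stops the shell from expanding $0.
# grep_example.pig is an arbitrary, hypothetical file name.
cat > grep_example.pig <<'EOF'
A = LOAD 'input';
B = FILTER A BY $0 MATCHES '.*dfs[a-z.]+.*';
DUMP B;
EOF

# Run the script in batch mode if Pig is installed; otherwise just report.
if command -v pig >/dev/null 2>&1; then
    pig grep_example.pig
else
    echo "pig not found on PATH; script written to grep_example.pig"
fi
```

As in the interactive session, the relative path 'input' is resolved against the user's HDFS home directory when Pig runs against the cluster.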