Using MapReduce Batch Indexing to Index Sample Tweets

The following sections include examples that illustrate how to use MapReduce for batch indexing. Batch indexing is useful for periodically indexing large amounts of data, or for indexing a dataset for the first time. Before continuing, make sure that you have:

Batch Indexing into Online Solr Servers Using GoLive

MapReduceIndexerTool is a MapReduce batch job driver that creates a set of Solr index shards from a set of input files and writes the indexes into HDFS in a flexible, scalable, and fault-tolerant manner. Using GoLive, MapReduceIndexerTool also supports merging the output shards into a set of live customer-facing Solr servers. The following steps demonstrate these capabilities.

  1. If you are working with a secured cluster, configure your client jaas.conf file as described in Configuring SolrJ Library Usage.
  2. If you are using Kerberos, kinit as the user that has privileges to update the collection:
    kinit jdoe@EXAMPLE.COM

    Replace EXAMPLE.COM with your Kerberos realm name.

  3. Delete any existing documents in the cloudera_tutorial_tweets collection. If your cluster does not have security enabled, run the following commands as the solr user by adding sudo -u solr before the command:
    solrctl collection --deletedocs cloudera_tutorial_tweets
  4. Run the MapReduce job with the --go-live parameter. Replace nn01.example.com and zk01.example.com with your NameNode and ZooKeeper hostnames, respectively.
    • Parcel-based Installation (Security Enabled):
      HADOOP_OPTS="-Djava.security.auth.login.config=/path/to/jaas.conf" hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool -D 'mapred.child.java.opts=-Xmx500m' --log4j /opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file /opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf --output-dir hdfs://nn01.example.com:8020/user/jdoe/outdir --verbose --go-live --zk-host zk01.example.com:2181/solr --collection cloudera_tutorial_tweets hdfs://nn01.example.com:8020/user/jdoe/indir
    • Parcel-based Installation (Security Disabled):
      hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool -D 'mapred.child.java.opts=-Xmx500m' --log4j /opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file /opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf --output-dir hdfs://nn01.example.com:8020/user/jdoe/outdir --verbose --go-live --zk-host zk01.example.com:2181/solr --collection cloudera_tutorial_tweets hdfs://nn01.example.com:8020/user/jdoe/indir
    • Package-based Installation (Security Enabled):
      HADOOP_OPTS="-Djava.security.auth.login.config=/path/to/jaas.conf" hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool -D 'mapred.child.java.opts=-Xmx500m' --log4j /usr/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file /usr/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf --output-dir hdfs://nn01.example.com:8020/user/jdoe/outdir --verbose --go-live --zk-host zk01.example.com:2181/solr --collection cloudera_tutorial_tweets hdfs://nn01.example.com:8020/user/jdoe/indir
    • Package-based Installation (Security Disabled):
      hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool -D 'mapred.child.java.opts=-Xmx500m' --log4j /usr/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file /usr/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf --output-dir hdfs://nn01.example.com:8020/user/jdoe/outdir --verbose --go-live --zk-host zk01.example.com:2181/solr --collection cloudera_tutorial_tweets hdfs://nn01.example.com:8020/user/jdoe/indir
    This command requires a morphline file, which must include a SOLR_LOCATOR directive. Any CLI parameters for --zkhost and --collection override the parameters of the solrLocator. The SOLR_LOCATOR directive might appear as follows:
    SOLR_LOCATOR : {
      # Name of solr collection
      collection : collection_name
    
      # ZooKeeper ensemble
      zkHost : "zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181/solr"
    }
    
    morphlines : [
      {
        id : morphline1
        importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
        commands : [
          { generateUUID { field : id } }
    
          { # Remove record fields that are unknown to Solr schema.xml.
            # Recall that Solr throws an exception on any attempt to load a document that
            # contains a field that isn't specified in schema.xml.
            sanitizeUnknownSolrFields {
              solrLocator : ${SOLR_LOCATOR} # Location from which to fetch Solr schema
            }
          }
    
          { logDebug { format : "output record: {}", args : ["@{}"] } }
    
          {
            loadSolr {
              solrLocator : ${SOLR_LOCATOR}
            }
          }
        ]
      }
    ]
    For help on how to run the MapReduce job, run the following command:
    • Parcel-based Installation:
      hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool --help
    • Package-based Installation:
      hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool --help

    For development purposes, you can use the --dry-run option to run in local mode and print documents to stdout instead of loading them into Solr. Using this option causes the morphline to run locally without submitting a job to MapReduce. Running locally provides faster turnaround during early trial and debug sessions.

    To print diagnostic information, such as the content of records as they pass through the morphline commands, enable TRACE log level diagnostics by adding the following entry to your log4j.properties file:
    log4j.logger.org.kitesdk.morphline=TRACE
    The log4j.properties file location can be specified by using the --log4j /path/to/log4j.properties command-line option.
  5. Check the job status at:
    • MRv1: http://jt01.example.com:50030/jobtracker.jsp
    • YARN: http://rm01.example.com:8090/cluster
    For secure clusters, replace http with https.
  6. When the job is complete, run a Solr query. For example, for a Solr server running on search01.example.com, go to one of the following URLs in a browser, depending on whether you have enabled security on your cluster:
    • Security Enabled: https://search01.example.com:8985/solr/cloudera_tutorial_tweets/select?q=*%3A*&wt=json&indent=true
    • Security Disabled: http://search01.example.com:8983/solr/cloudera_tutorial_tweets/select?q=*%3A*&wt=json&indent=true
    If indexing was successful, this page displays the first 10 query results.

Batch Indexing into Offline Solr Shards

Running the MapReduce job without GoLive causes the job to create a set of Solr index shards from a set of input files and write the indexes to HDFS. You can then explicitly point each Solr server to one of the HDFS output shard directories.

Batch indexing into offline Solr shards is mainly intended for offline use-cases by advanced users. Use cases requiring read-only indexes for searching can be handled by using batch indexing without the --go-live option. By not using GoLive, you can avoid copying datasets between segments, thereby reducing resource utilization.

  1. If you are working with a secured cluster, configure your client jaas.conf file as described in Configuring SolrJ Library Usage.
  2. If you are using Kerberos, kinit as the user that has privileges to update the collection:
    kinit jdoe@EXAMPLE.COM

    Replace EXAMPLE.COM with your Kerberos realm name.

  3. Delete any existing documents in the cloudera_tutorial_tweets collection. If your cluster does not have security enabled, run the following commands as the solr user by adding sudo -u solr before the command:
    solrctl collection --deletedocs cloudera_tutorial_tweets
  4. Delete the contents of the outdir directory:
    hdfs dfs -rm -r -skipTrash /user/jdoe/outdir/*
  5. Run the MapReduce job as follows, replacing nn01.example.com in the command with your NameNode hostname.
    • Parcel-based Installation (Security Enabled):
      HADOOP_OPTS="-Djava.security.auth.login.config=/path/to/jaas.conf" hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool -D 'mapred.child.java.opts=-Xmx500m' --log4j /opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file /opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf --output-dir hdfs://nn01.example.com:8020/user/jdoe/outdir --verbose --solr-home-dir $HOME/cloudera_tutorial_tweets_config --collection cloudera_tutorial_tweets --shards 2 hdfs://nn01.example.com:8020/user/jdoe/indir
    • Parcel-based Installation (Security Disabled):
      hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool -D 'mapred.child.java.opts=-Xmx500m' --log4j /opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file /opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf --output-dir hdfs://nn01.example.com:8020/user/jdoe/outdir --verbose --solr-home-dir $HOME/cloudera_tutorial_tweets_config --collection cloudera_tutorial_tweets --shards 2 hdfs://nn01.example.com:8020/user/jdoe/indir
    • Package-based Installation (Security Enabled):
      HADOOP_OPTS="-Djava.security.auth.login.config=/path/to/jaas.conf" hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool -D 'mapred.child.java.opts=-Xmx500m' --log4j /usr/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file /usr/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf --output-dir hdfs://nn01.example.com:8020/user/jdoe/outdir --verbose --solr-home-dir $HOME/cloudera_tutorial_tweets_config --collection cloudera_tutorial_tweets --shards 2 hdfs://nn01.example.com:8020/user/jdoe/indir
    • Package-based Installation (Security Disabled):
      hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool -D 'mapred.child.java.opts=-Xmx500m' --log4j /usr/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file /usr/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf --output-dir hdfs://nn01.example.com:8020/user/jdoe/outdir --verbose --solr-home-dir $HOME/cloudera_tutorial_tweets_config --collection cloudera_tutorial_tweets --shards 2 hdfs://nn01.example.com:8020/user/jdoe/indir
  6. Check the job status at:
    • MRv1: http://jt01.example.com:50030/jobtracker.jsp
    • YARN: http://rm01.example.com:8090/cluster
    For secure clusters, replace http with https.
  7. After the job is completed, check the generated index files. Individual shards are written to the results directory with names of the form part-00000, part-00001, part-00002, and so on. This example has two shards:
    hdfs dfs -ls /user/jdoe/outdir/results
    hdfs dfs -ls /user/jdoe/outdir/results/part-00000/data/index
  8. Stop the Solr service:
    • Cloudera Manager: Solr service > Actions > Stop
    • Unmanaged: On each Solr server host, run:
      sudo service solr-server stop
  9. Identify the paths to each Solr core:
    hdfs dfs -ls /solr/cloudera_tutorial_tweets
    Found 2 items
    drwxr-xr-x   - solr solr          0 2017-03-13 06:20 /solr/cloudera_tutorial_tweets/core_node1
    drwxr-xr-x   - solr solr          0 2017-03-13 06:20 /solr/cloudera_tutorial_tweets/core_node2
  10. Move the index shards into place.
    1. (Kerberos only) Switch to the solr user:
      kinit solr@EXAMPLE.COM
    2. Remove outdated files. If your cluster does not have security enabled, run the following commands as the solr user by adding sudo -u solr before the command:
      hdfs dfs -rm -r -skipTrash /solr/cloudera_tutorial_tweets/core_node1/data/index
      hdfs dfs -rm -r -skipTrash /solr/cloudera_tutorial_tweets/core_node1/data/tlog
      hdfs dfs -rm -r -skipTrash /solr/cloudera_tutorial_tweets/core_node2/data/index
      hdfs dfs -rm -r -skipTrash /solr/cloudera_tutorial_tweets/core_node2/data/tlog
    3. Change ownership of the results directory to solr. If your cluster has security enabled, kinit as the HDFS superuser (hdfs by default) before running the following command. If your cluster does not have security enabled, run the command as the HDFS superuser by adding sudo -u hdfs before the command:
      hdfs dfs -chown -R solr /user/jdoe/outdir/results
    4. (Kerberos only) Switch to the solr user:
      kinit solr@EXAMPLE.COM
    5. Move the two index shards into place. If your cluster does not have security enabled, run the following commands as the solr user by adding sudo -u solr before the command
      hdfs dfs -mv /user/jdoe/outdir/results/part-00000/data/index /solr/cloudera_tutorial_tweets/core_node1/data
      hdfs dfs -mv /user/jdoe/outdir/results/part-00001/data/index /solr/cloudera_tutorial_tweets/core_node2/data
  11. Start the Solr service:
    • Cloudera Manager: Solr service > Actions > Start
    • Unmanaged: On each Solr server host, run:
      sudo service solr-server start
  12. Run some Solr queries. For example, for a Solr server running on search01.example.com, go to one of the following URLs in a browser, depending on whether you have enabled security on your cluster:
    • Security Enabled: https://search01.example.com:8985/solr/cloudera_tutorial_tweets/select?q=*%3A*&wt=json&indent=true
    • Security Disabled: http://search01.example.com:8983/solr/cloudera_tutorial_tweets/select?q=*%3A*&wt=json&indent=true
    If indexing was successful, this page displays the first 10 query results.

To index live tweets from the Twitter public stream, continue on to Near Real Time (NRT) Indexing Tweets Using Flume.