Lily HBase Batch Indexing for Cloudera Search

With Cloudera Search, you can batch index HBase tables using the Lily HBase batch indexer MapReduce job, also known as HBaseMapReduceIndexerTool. This batch indexing does not require:

  • HBase replication
  • The Lily HBase Indexer Service
  • Registering a Lily HBase Indexer configuration with the Lily HBase Indexer Service

The indexer supports flexible, custom, application-specific rules to extract, transform, and load HBase data into Solr. Solr search results can contain columnFamily:qualifier links back to the data stored in HBase. This way, applications can use the search result set to directly access matching raw HBase cells.

Populating an HBase Table

After configuring and starting your system, create an HBase table and add rows to it. For example:

$ hbase shell

hbase(main):002:0> create 'sample_table', {NAME => 'data'}
hbase(main):002:0> put 'sample_table', 'row1', 'data', 'value'
hbase(main):001:0> put 'sample_table', 'row2', 'data', 'value2'

Creating a Collection in Cloudera Search

A collection in Search used for HBase indexing must have a Solr schema that accommodates the types of HBase column families and qualifiers that are being indexed. To begin, consider adding the all-inclusive data field to a default schema. Once you decide on a schema, create a collection using commands similar to the following:

$ solrctl instancedir --generate $HOME/hbase_collection_config
## Edit $HOME/hbase_collection_config/conf/schema.xml as needed ##
## If you are using Sentry for authorization, copy solrconfig.xml.secure to solrconfig.xml as follows: ##
## cp $HOME/hbase_collection_config/conf/solrconfig.xml.secure $HOME/hbase_collection_config/conf/solrconfig.xml ##
$ solrctl instancedir --create hbase_collection_config $HOME/hbase_collection_config
$ solrctl collection --create hbase_collection -s <numShards> -c hbase_collection_config

Creating a Lily HBase Indexer Configuration File

Configure individual Lily HBase Indexers using the hbase-indexer command-line utility. Typically, there is one Lily HBase Indexer configuration file for each HBase table, but there can be as many Lily HBase Indexer configuration files as there are tables, column families, and corresponding collections in Search. Each Lily HBase Indexer configuration is defined in an XML file, such as morphline-hbase-mapper.xml.

An indexer configuration XML file must refer to the MorphlineResultToSolrMapper implementation and point to the location of a Morphline configuration file, as shown in the following morphline-hbase-mapper.xml indexer configuration file. For Cloudera Manager managed environments, set morphlineFile to the relative path morphlines.conf. For unmanaged environments, specify the absolute path to a morphlines.conf file that exists on the Lily HBase Indexer host. Make sure the file is readable by the HBase system user (hbase by default).

$ cat $HOME/morphline-hbase-mapper.xml

<?xml version="1.0"?>
<indexer table="sample_table"
mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper">

   <!-- The relative or absolute path on the local file system to the
   morphline configuration file. -->
   <!-- Use relative path "morphlines.conf" for morphlines managed by
   Cloudera Manager -->
   <param name="morphlineFile" value="/path/to/morphlines.conf"/>

   <!-- The optional morphlineId identifies a morphline if there are multiple
   morphlines in morphlines.conf -->
   <!-- <param name="morphlineId" value="morphline1"/> -->

</indexer>

The Lily HBase Indexer configuration file also supports the standard attributes of any HBase Lily Indexer on the top-level <indexer> element: table, mapping-type, read-row, mapper, unique-key-formatter, unique-key-field, row-field, column-family-field, and table-family-field. It does not support the <field> element and <extract> elements.

Creating a Morphline Configuration File

After creating an indexer configuration XML file, you can configure morphline ETL transformation commands in a morphlines.conf configuration file. The morphlines.conf configuration file can contain any number of morphline commands. Typically, an extractHBaseCells command is the first command. The readAvroContainer or readAvro morphline commands are often used to extract Avro data from the HBase byte array. This configuration file can be shared among different applications that use morphlines.

If you are using Cloudera Manager, the morphlines.conf file is edited within Cloudera Manager (Key-Value Store Indexer service > Configuration > Category > Morphlines > Morphlines File).

For unmanaged environments, create the morphlines.conf file on the Lily HBase Indexer host:

$ cat /etc/hbase-solr/conf/morphlines.conf

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.morphline.**", "com.ngdata.**"]

    commands : [
      {
        extractHBaseCells {
          mappings : [
            {
              inputColumn : "data:*"
              outputField : "data"
              type : string
              source : value
            }

            #{
            #  inputColumn : "data:item"
            #  outputField : "_attachment_body"
            #  type : "byte[]"
            #  source : value
            #}
          ]
        }
      }

      #for avro use with type : "byte[]" in extractHBaseCells mapping above
      #{ readAvroContainer {} }
      #{
      #  extractAvroPaths {
      #    paths : {
      #      data : /user_name
      #    }
      #  }
      #}

      { logTrace { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]

Understanding the extractHBaseCells Morphline Command

The extractHBaseCells morphline command extracts cells from an HBase result and transforms the values into a SolrInputDocument. The command consists of an array of zero or more mapping specifications.

Each mapping has:

  • The inputColumn parameter, which specifies the data from HBase for populating a field in Solr. It has the form of a column family name and qualifier, separated by a colon. The qualifier portion can end in an asterisk, which is interpreted as a wildcard. In this case, all matching column-family and qualifier expressions are used. The following are examples of valid inputColumn values:
    • mycolumnfamily:myqualifier
    • mycolumnfamily:my*
    • mycolumnfamily:*
  • The outputField parameter specifies the morphline record field to which to add output values. The morphline record field is also known as the Solr document field. Example: first_name.
  • Dynamic output fields are enabled by the outputField parameter ending with a wildcard (*). For example:
    inputColumn : "mycolumnfamily:*"
    outputField : "belongs_to_*"
    In this case, if you make these puts in HBase:
    put 'table_name' , 'row1' , 'mycolumnfamily:1' , 'foo'
    put 'table_name' , 'row1' , 'mycolumnfamily:9' , 'bar'
    Then the fields of the Solr document are as follows:
    belongs_to_1 : foo
    belongs_to_9 : bar
  • The type parameter defines the data type of the content in HBase. All input data is stored in HBase as byte arrays, but all content in Solr is indexed as text, so a method for converting byte arrays to the actual data type is required. The type parameter can be the name of a type that is supported by org.apache.hadoop.hbase.util.Bytes.to* (which currently includes byte[], int, long, string, boolean, float, double, short, and bigdecimal). Use type byte[] to pass the byte array through to the morphline without conversion.
    • type:byte[] copies the byte array unmodified into the record output field
    • type:int converts with org.apache.hadoop.hbase.util.Bytes.toInt
    • type:long converts with org.apache.hadoop.hbase.util.Bytes.toLong
    • type:string converts with org.apache.hadoop.hbase.util.Bytes.toString
    • type:boolean converts with org.apache.hadoop.hbase.util.Bytes.toBoolean
    • type:float converts with org.apache.hadoop.hbase.util.Bytes.toFloat
    • type:double converts with org.apache.hadoop.hbase.util.Bytes.toDouble
    • type:short converts with org.apache.hadoop.hbase.util.Bytes.toShort
    • type:bigdecimal converts with org.apache.hadoop.hbase.util.Bytes.toBigDecimal
    Alternatively, the type parameter can be the name of a Java class that implements the com.ngdata.hbaseindexer.parse.ByteArrayValueMapper interface.
    HBase data formatting does not always match what is specified by org.apache.hadoop.hbase.util.Bytes.*. For example, this can occur with data of type float or double. You can enable indexing of such HBase data by converting the data. There are various ways to do so, including:
    • Using Java morphline command to parse input data, converting it to the expected output. For example:
      {
       imports : "import java.util.*;" code: """ // manipulate the contents of a record field
       String stringAmount = (String) record.getFirstValue("amount");
       Double dbl = Double.parseDouble(stringAmount); record.replaceValues("amount",dbl);
       return child.process(record); // pass record to next command in chain """
      }
    • Creating table fields with binary format and then using types such as double or float in a morphline.conf. You could create a table in HBase for storing doubles using commands similar to:
      CREATE TABLE sample_lily_hbase ( id string, amount double, ts timestamp )
      STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,ti:amount#b,ti:ts,')
      TBLPROPERTIES ('hbase.table.name' = 'sample_lily'); 
  • The source parameter determines which portion of an HBase KeyValue is used as indexing input. Valid choices are value or qualifier. When value is specified, the HBase cell value is used as input for indexing. When qualifier is specified, then the HBase column qualifier is used as input for indexing. The default is value.

Running HBaseMapReduceIndexerTool

HBaseMapReduceIndexerTool is a MapReduce batch job driver that takes input data from an HBase table, creates Solr index shards, and writes the indexes to HDFS in a flexible, scalable, and fault-tolerant manner. It also supports merging the output shards into a set of live customer-facing Solr servers in SolrCloud.

Run the command as follows:

  • For package-based deployments:

    hadoop --config /etc/hadoop/conf \
    jar /usr/lib/hbase-solr/tools/hbase-indexer-mr-*-job.jar \
    --conf /etc/hbase/conf/hbase-site.xml -Dmapreduce.map.java.opts="-Xmx512m" \
    -Dmapreduce.reduce.java.opts="-Xmx512m" \
    --hbase-indexer-file $HOME/morphline-hbase-mapper.xml \
    --zk-host 127.0.0.1/solr --collection hbase-collection1 \
    --go-live --log4j src/test/resources/log4j.properties
  • For parcel-based deployments:
    hadoop --config /etc/hadoop/conf \
    jar /opt/cloudera/parcels/CDH/lib/hbase-solr/tools/hbase-indexer-mr-*-job.jar \
    --conf /etc/hbase/conf/hbase-site.xml -Dmapreduce.map.java.opts="-Xmx512m" \
    -Dmapreduce.reduce.java.opts="-Xmx512m" \
    --hbase-indexer-file $HOME/morphline-hbase-mapper.xml \
    --zk-host 127.0.0.1/solr --collection hbase-collection1 \
    --go-live --log4j src/test/resources/log4j.properties
To invoke the command-line help in a default parcels installation, use:
$ hadoop jar /opt/cloudera/parcels/CDH/jars/hbase-indexer-mr-*-job.jar --help
To invoke the command-line help in a default packages installation, use:
$ hadoop jar /usr/lib/hbase-solr/tools/hbase-indexer-mr-*-job.jar --help

The full command line usage help is as follows:

usage: hadoop [GenericOptions]... jar hbase-indexer-mr-*-job.jar
       [--hbase-indexer-zk STRING] [--hbase-indexer-name STRING]
       [--hbase-indexer-file FILE]
       [--hbase-indexer-component-factory STRING]
       [--hbase-table-name STRING] [--hbase-start-row BINARYSTRING]
       [--hbase-end-row BINARYSTRING] [--hbase-start-time STRING]
       [--hbase-end-time STRING] [--hbase-timestamp-format STRING]
       [--zk-host STRING] [--go-live] [--collection STRING]
       [--go-live-threads INTEGER] [--help] [--output-dir HDFS_URI]
       [--overwrite-output-dir] [--morphline-file FILE]
       [--morphline-id STRING] [--solr-home-dir DIR]
       [--update-conflict-resolver FQCN] [--reducers INTEGER]
       [--max-segments INTEGER] [--fair-scheduler-pool STRING] [--dry-run]
       [--log4j FILE] [--verbose] [--clear-index] [--show-non-solr-cloud]

MapReduce batch job driver that takes  input  data  from an HBase table and
creates Solr index shards and writes the  indexes into HDFS, in a flexible,
scalable, and fault-tolerant manner.  It  also  supports merging the output
shards into a  set  of  live  customer-facing  Solr  servers  in SolrCloud.
Optionally, documents  can  be  sent  directly  from  the  mapper  tasks to
SolrCloud, which is a  much  less  scalable  approach  but enables updating
existing documents in SolrCloud. The  program  proceeds  in one or multiple
consecutive MapReduce-based phases, as follows:

1) Mapper phase: This (parallel)  phase  scans  over the input HBase table,
extracts the relevant content,  and  transforms it into SolrInputDocuments.
If run as a mapper-only job,  this phase also writes the SolrInputDocuments
directly to a live  SolrCloud  cluster.  The  conversion from HBase records
into Solr documents  is  performed  via  a  hbase-indexer configuration and
typically based on a morphline.

2)   Reducer   phase:   This   (parallel)    phase   loads   the   mapper's
SolrInputDocuments into  one  EmbeddedSolrServer  per  reducer.  Each  such
reducer and Solr server can be  seen  as  a (micro) shard. The Solr servers
store their data in HDFS.

3) Mapper-only  merge  phase:  This  (parallel)  phase  merges  the  set of
reducer shards into the number of  Solr  shards expected by the user, using
a mapper-only job.  This  phase  is  omitted  if  the  number  of shards is
already equal to the number of shards expected by the user

4) Go-live phase: This optional  (parallel)  phase merges the output shards
of the previous phase into a  set  of  live customer-facing Solr servers in
SolrCloud. If this phase  is  omitted  you  can  explicitly point each Solr
server to one of the HDFS output shard directories

Fault Tolerance: Mapper and reducer  task  attempts  are retried on failure
per the standard MapReduce semantics. On program startup all data in the --
output-dir is deleted  if  that  output  directory  already  exists  and --
overwrite-output-dir is specified. This means  that  if the whole job fails
you can  retry  simply  by  rerunning  the  program  again  using  the same
arguments.

HBase Indexer parameters:
  Parameters for specifying the  HBase  indexer  definition and/or where it
  should be loaded from.

  --hbase-indexer-zk STRING
                         The address of the  ZooKeeper  ensemble from which
                         to fetch  the  indexer  definition  named --hbase-
                         indexer-name.  Format   is:   a   list   of  comma
                         separated host:port pairs,  each  corresponding to
                         a zk  server.  Example: '127.0.0.1:2181,127.0.0.1:
                         2182,127.0.0.1:2183'
  --hbase-indexer-name STRING
                         The name of  the  indexer  configuration  to fetch
                         from the  ZooKeeper  ensemble  specified  with  --
                         hbase-indexer-zk. Example: myIndexer
  --hbase-indexer-file FILE
                         Relative  or  absolute  path   to  a  local  HBase
                         indexer XML configuration file.  If supplied, this
                         overrides --hbase-indexer-zk  and --hbase-indexer-
                         name. Example: /path/to/morphline-hbase-mapper.xml
  --hbase-indexer-component-factory STRING
                         Classname of the hbase indexer component factory.

HBase scan parameters:
  Parameters for specifying what data is included while reading from HBase.

  --hbase-table-name STRING
                         Optional name of  the  HBase  table containing the
                         records  to   be   indexed.   If   supplied,  this
                         overrides the  value  from  the  --hbase-indexer-*
                         options. Example: myTable
  --hbase-start-row BINARYSTRING
                         Binary string  representation  of  start  row from
                         which to start  indexing  (inclusive).  The format
                         of the supplied row  key  should use two-digit hex
                         values prefixed by \x for non-ascii characters (e.
                         g. 'row\x00'). The semantics  of this argument are
                         the same as those  for  the HBase Scan#setStartRow
                         method. The default is  to  include  the first row
                         of the table. Example: AAAA
  --hbase-end-row BINARYSTRING
                         Binary string representation of  end row prefix at
                         which  to  stop  indexing   (exclusive).  See  the
                         description   of   --hbase-start-row    for   more
                         information. The default  is  to  include the last
                         row of the table. Example: CCCC
  --hbase-start-time STRING
                         Earliest timestamp (inclusive)  in  time  range of
                         HBase cells  to  be  included  for  indexing.  The
                         default is to include all cells. Example: 0
  --hbase-end-time STRING
                         Latest timestamp (exclusive) of  HBase cells to be
                         included for indexing. The  default  is to include
                         all cells. Example: 123456789
  --hbase-timestamp-format STRING
                         Timestamp format to be  used to interpret --hbase-
                         start-time and --hbase-end-time.  This  is a java.
                         text.SimpleDateFormat compliant format  (see http:
                         //docs.oracle.
                         com/javase/8/docs/api/java/text/SimpleDateFormat.
                         html). If  this  parameter  is  omitted  then  the
                         timestamps   are   interpreted    as   number   of
                         milliseconds  since  the   standard   epoch  (Unix
                         time). Example: yyyy-MM-dd'T'HH:mm:ss.SSSZ

Solr cluster arguments:
  Arguments that provide information about your Solr cluster.

  --zk-host STRING       The address of a ZooKeeper  ensemble being used by
                         a SolrCloud cluster. This  ZooKeeper ensemble will
                         be examined  to  determine  the  number  of output
                         shards to create  as  well  as  the  Solr  URLs to
                         merge the output shards into  when using the --go-
                         live option. Requires that  you  also  pass the --
                         collection to merge the shards into.

                         The   --zk-host   option   implements   the   same
                         partitioning semantics as  the  standard SolrCloud
                         Near-Real-Time (NRT)  API.  This  enables  to  mix
                         batch  updates  from   MapReduce   ingestion  with
                         updates from standard  Solr  NRT  ingestion on the
                         same SolrCloud  cluster,  using  identical  unique
                         document keys.

                         Format is: a  list  of  comma  separated host:port
                         pairs,  each  corresponding   to   a   zk  server.
                         Example: '127.0.0.1:2181,127.0.0.1:2182,127.0.0.1:
                         2183' If the optional  chroot  suffix  is used the
                         example  would  look  like:  '127.0.0.1:2181/solr,
                         127.0.0.1:2182/solr,127.0.0.1:2183/solr'     where
                         the client would  be  rooted  at  '/solr'  and all
                         paths would  be  relative  to  this  root  -  i.e.
                         getting/setting/etc... '/foo/bar' would  result in
                         operations being run on  '/solr/foo/bar' (from the
                         server perspective).

                         If --solr-home-dir  is  not  specified,  the  Solr
                         home  directory  for   the   collection   will  be
                         downloaded from this ZooKeeper ensemble.

Go live arguments:
  Arguments for  merging  the  shards  that  are  built  into  a  live Solr
  cluster. Also see the Cluster arguments.

  --go-live              Allows you to  optionally  merge  the  final index
                         shards into a  live  Solr  cluster  after they are
                         built. You can pass the  ZooKeeper address with --
                         zk-host and the relevant  cluster information will
                         be auto detected.  (default: false)
  --collection STRING    The SolrCloud  collection  to  merge  shards  into
                         when  using  --go-live   and  --zk-host.  Example:
                         collection1
  --go-live-threads INTEGER
                         Tuning knob that indicates  the  maximum number of
                         live merges  to  run  in  parallel  at  one  time.
                         (default: 1000)

Optional arguments:
  --help, -help, -h      Show this help message and exit
  --output-dir HDFS_URI  HDFS directory to  write  Solr  indexes to. Inside
                         there one  output  directory  per  shard  will  be
                         generated.    Example:     hdfs://c2202.mycompany.
                         com/user/$USER/test
  --overwrite-output-dir
                         Overwrite the directory  specified by --output-dir
                         if it already  exists.  Using  this parameter will
                         result in the  output  directory being recursively
                         deleted at job startup. (default: false)
  --morphline-file FILE  Relative or absolute path  to  a local config file
                         that contains one  or  more  morphlines.  The file
                         must be UTF-8 encoded.  The  file will be uploaded
                         to each MR task.  If  supplied, this overrides the
                         value   from   the    --hbase-indexer-*   options.
                         Example: /path/to/morphlines.conf
  --morphline-id STRING  The identifier  of  the  morphline  that  shall be
                         executed within the  morphline  config  file, e.g.
                         specified by --morphline-file. If the --morphline-
                         id option is  ommitted  the  first (i.e. top-most)
                         morphline within  the  config  file  is  used.  If
                         supplied, this overrides  the  value  from  the --
                         hbase-indexer-* options. Example: morphline1
  --solr-home-dir DIR    Optional relative or absolute path  to a local dir
                         containing  Solr  conf/  dir   and  in  particular
                         conf/solrconfig.xml and optionally  also lib/ dir.
                         This directory will be  uploaded  to each MR task.
                         Example: src/test/resources/solr/minimr
  --update-conflict-resolver FQCN
                         Fully qualified class name  of  a  Java class that
                         implements the  UpdateConflictResolver  interface.
                         This  enables  deduplication  and  ordering  of  a
                         series of document  updates  for  the  same unique
                         document key. For example,  a  MapReduce batch job
                         might index multiple files  in  the same job where
                         some of the files contain  old and new versions of
                         the very  same  document,  using  the  same unique
                         document key.
                         Typically,  implementations   of   this  interface
                         forbid collisions  by  throwing  an  exception, or
                         ignore all but the  most  recent document version,
                         or, in the general  case,  order colliding updates
                         ascending  from  least   recent   to  most  recent
                         (partial) update. The caller of this interface (i.
                         e.  the  Hadoop  Reducer)   will  then  apply  the
                         updates to  Solr  in  the  order  returned  by the
                         orderUpdates() method.
                         The                                        default
                         RetainMostRecentUpdateConflictResolver
                         implementation ignores  all  but  the  most recent
                         document version, based on  a configurable numeric
                         Solr    field,    which     defaults     to    the
                         file_last_modified timestamp (default: org.apache.
                         solr.hadoop.dedup.
                         RetainMostRecentUpdateConflictResolver)
  --reducers INTEGER     Tuning knob that indicates  the number of reducers
                         to  index  into.  0  indicates  that  no  reducers
                         should be  used,  and  documents  should  be  sent
                         directly  from  the  mapper  tasks  to  live  Solr
                         servers.  -1  indicates   use   all  reduce  slots
                         available on the  cluster.  -2  indicates  use one
                         reducer  per  output  shard,  which  disables  the
                         mtree merge  MR  algorithm.  The  mtree  merge  MR
                         algorithm improves scalability  by  spreading load
                         (in  particular  CPU  load)   among  a  number  of
                         parallel reducers that  can  be  much  larger than
                         the number of solr  shards  expected  by the user.
                         It can  be  seen  as  an  extension  of concurrent
                         lucene merges  and  tiered  lucene  merges  to the
                         clustered case. The  subsequent  mapper-only phase
                         merges  the  output  of   said   large  number  of
                         reducers to the number  of  shards expected by the
                         user,   again   by    utilizing   more   available
                         parallelism on the cluster. (default: -1)
  --max-segments INTEGER
                         Tuning knob that indicates  the  maximum number of
                         segments to be contained  on  output  in the index
                         of each reducer shard.  After  a reducer has built
                         its output index  it  applies  a  merge  policy to
                         merge segments  until  there  are  <=  maxSegments
                         lucene  segments  left  in   this  index.  Merging
                         segments involves reading  and  rewriting all data
                         in all these  segment  files, potentially multiple
                         times,  which  is  very  I/O  intensive  and  time
                         consuming. However, an  index  with fewer segments
                         can later be merged  faster,  and  it can later be
                         queried  faster  once  deployed  to  a  live  Solr
                         serving shard. Set  maxSegments  to  1 to optimize
                         the index for low query  latency. In a nutshell, a
                         small maxSegments  value  trades  indexing latency
                         for subsequently improved query  latency. This can
                         be  a  reasonable  trade-off  for  batch  indexing
                         systems. (default: 1)
  --fair-scheduler-pool STRING
                         Optional tuning knob  that  indicates  the name of
                         the fair scheduler  pool  to  submit  jobs to. The
                         Fair Scheduler is a  pluggable MapReduce scheduler
                         that provides a way to  share large clusters. Fair
                         scheduling is a method  of  assigning resources to
                         jobs such that all jobs  get, on average, an equal
                         share of resources  over  time.  When  there  is a
                         single job  running,  that  job  uses  the  entire
                         cluster. When  other  jobs  are  submitted,  tasks
                         slots that free up are  assigned  to the new jobs,
                         so that each job gets  roughly  the same amount of
                         CPU time.  Unlike  the  default  Hadoop scheduler,
                         which forms a queue of  jobs, this lets short jobs
                         finish in reasonable time  while not starving long
                         jobs. It is also an  easy  way  to share a cluster
                         between multiple of users.  Fair  sharing can also
                         work with  job  priorities  -  the  priorities are
                         used as  weights  to  determine  the  fraction  of
                         total compute time that each job gets.
  --dry-run              Run in local mode  and  print  documents to stdout
                         instead of loading them  into  Solr. This executes
                         the  morphline  in  the  client  process  (without
                         submitting a job  to  MR)  for  quicker turnaround
                         during early  trial  &  debug  sessions. (default:
                         false)
  --log4j FILE           Relative or absolute  path  to  a log4j.properties
                         config file on the  local  file  system. This file
                         will  be  uploaded  to   each  MR  task.  Example:
                         /path/to/log4j.properties
  --verbose, -v          Turn on verbose output. (default: false)
  --clear-index          Will attempt  to  delete  all  entries  in  a solr
                         index before starting  batch  build.  This  is not
                         transactional so  if  the  build  fails  the index
                         will be empty. (default: false)
  --show-non-solr-cloud  Also show options for  Non-SolrCloud  mode as part
                         of --help. (default: false)

Generic options supported are
  --conf <configuration file>
                         specify an application configuration file
  -D <property=value>    use value for given property
  --fs <local|namenode:port>
                         specify a namenode
  --jt <local|resourcemanager:port>
                         specify a ResourceManager
  --files <comma separated list of files>
                         specify comma separated files to  be copied to the
                         map reduce cluster
  --libjars <comma separated list of jars>
                         specify comma separated  jar  files  to include in
                         the classpath.
  --archives <comma separated list of archives>
                         specify comma separated archives  to be unarchived
                         on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

Examples:

# (Re)index a table in GoLive mode based on a local indexer config file
hadoop --config /etc/hadoop/conf \
  jar hbase-indexer-mr-*-job.jar \
  --conf /etc/hbase/conf/hbase-site.xml \
  -D 'mapred.child.java.opts=-Xmx500m' \
  --hbase-indexer-file indexer.xml \
  --zk-host 127.0.0.1/solr \
  --collection collection1 \
  --go-live \
  --log4j src/test/resources/log4j.properties

# (Re)index a table in GoLive mode using a local morphline-based indexer config file
# Also include extra library jar file containing JSON tweet Java parser:
hadoop --config /etc/hadoop/conf \
  jar hbase-indexer-mr-*-job.jar \
  --conf /etc/hbase/conf/hbase-site.xml \
  --libjars /path/to/kite-morphlines-twitter-0.10.0.jar \
  -D 'mapred.child.java.opts=-Xmx500m' \
  --hbase-indexer-file src/test/resources/morphline_indexer_without_zk.xml \
  --zk-host 127.0.0.1/solr \
  --collection collection1 \
  --go-live \
  --morphline-file src/test/resources/morphlines.conf \
  --output-dir hdfs://c2202.mycompany.com/user/$USER/test \
  --overwrite-output-dir \
  --log4j src/test/resources/log4j.properties

# (Re)index a table in GoLive mode
hadoop --config /etc/hadoop/conf \
  jar hbase-indexer-mr-*-job.jar \
  --conf /etc/hbase/conf/hbase-site.xml \
  -D 'mapred.child.java.opts=-Xmx500m' \
  --hbase-indexer-file indexer.xml \
  --zk-host 127.0.0.1/solr \
  --collection collection1 \
  --go-live \
  --log4j src/test/resources/log4j.properties

# (Re)index a table with direct writes to SolrCloud
hadoop --config /etc/hadoop/conf \
  jar hbase-indexer-mr-*-job.jar \
  --conf /etc/hbase/conf/hbase-site.xml \
  -D 'mapred.child.java.opts=-Xmx500m' \
  --hbase-indexer-file indexer.xml \
  --zk-host 127.0.0.1/solr \
  --collection collection1 \
  --reducers 0 \
  --log4j src/test/resources/log4j.properties

# (Re)index a table based on a indexer config stored in ZK
hadoop --config /etc/hadoop/conf \
  jar hbase-indexer-mr-*-job.jar \
  --conf /etc/hbase/conf/hbase-site.xml \
  -D 'mapred.child.java.opts=-Xmx500m' \
  --hbase-indexer-zk zk01 \
  --hbase-indexer-name docindexer \
  --go-live \
  --log4j src/test/resources/log4j.properties

# MapReduce on Yarn - Pass custom JVM arguments
HADOOP_CLIENT_OPTS='-DmaxConnectionsPerHost=10000 -DmaxConnections=10000'; \
hadoop --config /etc/hadoop/conf \
  jar hbase-indexer-mr-*-job.jar \
  --conf /etc/hbase/conf/hbase-site.xml \
  -D 'mapreduce.map.java.opts=-DmaxConnectionsPerHost=10000 -DmaxConnections=10000' \
  -D 'mapreduce.reduce.java.opts=-DmaxConnectionsPerHost=10000 -DmaxConnections=10000' \
  --hbase-indexer-zk zk01 \
  --hbase-indexer-name docindexer \
  --go-live \
  --log4j src/test/resources/log4j.properties

# MapReduce on MR1 - Pass custom JVM arguments
HADOOP_CLIENT_OPTS='-DmaxConnectionsPerHost=10000 -DmaxConnections=10000'; \
hadoop --config /etc/hadoop/conf \
  jar hbase-indexer-mr-*-job.jar \
  --conf /etc/hbase/conf/hbase-site.xml \
  -D 'mapreduce.child.java.opts=-DmaxConnectionsPerHost=10000 -DmaxConnections=10000' \
  --hbase-indexer-zk zk01 \
  --hbase-indexer-name docindexer \
  --go-live \
  --log4j src/test/resources/log4j.properties

Using --go-live with SSL or Kerberos

The go-live phase of the indexer jobs sends a MERGEINDEXES request from the indexer client (the node from which the MR job was submitted) to the live Solr servers. If the Solr server has SSL enabled, you need to ensure that the indexer client trusts the certificate presented by the Solr server(s), otherwise you get an SSLPeerUnverifiedException.

  1. Specify the location of the trust store by setting the following HADOOP_OPTS variable before launching the indexer job:
    HADOOP_OPTS="-Djavax.net.ssl.trustStore=/etc/cdep-ssl-conf/CA_STANDARD/truststore.jks "
  2. If the Solr servers have Kerberos authentication enabled, you need to ensure that the indexer client can authenticate via Kerberos to the Solr servers. For this, you need to create a Java Authentication and Authorization Service configuration (JAAS) file locally on the node where the indexing job is launched:

    • If you are authenticating using kinit to obtain credentials, you can configure the client to use your credential cache by creating a jaas.conf file with the following contents:
      Client {
       com.sun.security.auth.module.Krb5LoginModule required
       useKeyTab=false
       useTicketCache=true
       principal="<user>@EXAMPLE.COM";
       };
      Replace <user> with your username, and EXAMPLE.COM with your Kerberos realm.
    • If you want the client application to authenticate using a keytab, modify jaas-client.conf as follows:
      Client {
       com.sun.security.auth.module.Krb5LoginModule required
       useKeyTab=true
       keyTab="/path/to/user.keytab"
       storeKey=true
       useTicketCache=false
       principal="<user>@EXAMPLE.COM";
      };
      Replace /path/to/user.keytab with the keytab file you want to use and <user>@EXAMPLE.COM with the principal in the keytab. If you are using a service principal that includes the hostname, make sure that it is included in the jaas.conf file (for example, solr/solr01.example.com@EXAMPLE.COM).
  3. If you are using a ticket cache, you need to do a kinit to acquire a ticket for the configured principal before launching the indexer.

  4. Specify the authentication configuration in the HADOOP_OPTS environment variable:

    • For package-based installations:

      HADOOP_OPTS="-Djava.security.auth.login.config=jaas.conf -Djavax.net.ssl.trustStore=/etc/cdep-ssl-conf/CA_STANDARD/truststore.jks" \
      hadoop --config /etc/hadoop/conf \
      jar /usr/lib/hbase-solr/tools/hbase-indexer-mr-*-job.jar \
      --conf /etc/hbase/conf/hbase-site.xml -Dmapreduce.map.java.opts="-Xmx512m" -Dmapreduce.reduce.java.opts="-Xmx512m" \
      --hbase-indexer-file /home/systest/hbasetest/morphline-hbase-mapper.xml \
      --zk-host 127.0.0.1/solr \
      --collection hbase-collection1 \
      --go-live --log4j src/test/resources/log4j.properties
    • For parcel-based installations:
      HADOOP_OPTS="-Djava.security.auth.login.config=jaas.conf -Djavax.net.ssl.trustStore=/etc/cdep-ssl-conf/CA_STANDARD/truststore.jks" \
      hadoop --config /etc/hadoop/conf \
      jar /opt/cloudera/parcels/CDH/lib/hbase-solr/tools/hbase-indexer-mr-*-job.jar \
      --conf /etc/hbase/conf/hbase-site.xml -Dmapreduce.map.java.opts="-Xmx512m" -Dmapreduce.reduce.java.opts="-Xmx512m" \
      --hbase-indexer-file /home/systest/hbasetest/morphline-hbase-mapper.xml \
      --zk-host 127.0.0.1/solr \
      --collection hbase-collection1 \
      --go-live --log4j src/test/resources/log4j.properties

Understanding --go-live and HDFS ACLs

When run with a reduce phase, as opposed to as a mapper-only job, the indexer creates an offline index on HDFS in the output directory specified by the --output-dir parameter. If the --go-live parameter is specified, Solr merges the resulting offline index into the live running service. Thus, the Solr service must have read access to the contents of the output directory in order to complete the --go-live step. If --overwrite-output-dir is specified, the indexer deletes and recreates any existing output directory; in an environment with restrictive permissions, such as one with an HDFS umask of 077, the Solr user may not be able to read the contents of the newly created directory. To address this issue, the indexer automatically applies the HDFS ACLs to enable Solr to read the output directory contents. These ACLs are only applied if HDFS ACLs are enabled on the HDFS NameNode. For more information, see HDFS Extended ACLs.

The indexer only makes ACL updates to the output directory and its contents. If the output directory's parent directories do not include the execute permission, the Solr service cannot access the output directory. Solr must have execute permissions from standard permissions or ACLs on the parent directories of the output directory.