Using the Lily HBase Batch Indexer for Indexing

With Cloudera Search, you can batch index HBase tables using MapReduce jobs. This batch indexing does not require:

  • HBase replication
  • The Lily HBase Indexer Service
  • Registering a Lily HBase Indexer configuration with the Lily HBase Indexer Service

The indexer supports flexible, custom, application-specific rules to extract, transform, and load HBase data into Solr. Solr search results can contain columnFamily:qualifier links back to the data stored in HBase. This way, applications can use the search result set to directly access matching raw HBase cells.

Batch indexing column families of tables in an HBase cluster requires:

  • Populating an HBase table
  • Creating a corresponding collection in Search
  • Creating a Lily HBase Indexer configuration
  • Creating a Morphline configuration file
  • Understanding the extractHBaseCells morphline command
  • Running HBaseMapReduceIndexerTool

Populating an HBase Table

After configuring and starting your system, create an HBase table and add rows to it. For example:

$ hbase shell

hbase(main):001:0> create 'record', {NAME => 'data'}
hbase(main):002:0> put 'record', 'row1', 'data', 'value'
hbase(main):003:0> put 'record', 'row2', 'data', 'value2'
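
You can optionally confirm that the rows were written before you index them:

hbase(main):004:0> scan 'record'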

Creating a Corresponding Collection in Search

A collection in Search used for HBase indexing must have a Solr schema that accommodates the types of HBase column families and qualifiers being indexed. To begin, consider adding the all-inclusive data field to a default schema. Once you decide on a schema, create the collection using commands of the form:

$ solrctl instancedir --generate $HOME/hbase-collection1
$ edit $HOME/hbase-collection1/conf/schema.xml
$ solrctl instancedir --create hbase-collection1 $HOME/hbase-collection1
$ solrctl collection --create hbase-collection1
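
For example, the all-inclusive data field referenced above can be declared in the schema.xml that you edit with a definition along the following lines. This is only a sketch; the text_general field type is an assumption, so substitute a field type that exists in your generated schema:

<field name="data" type="text_general" indexed="true" stored="true" multiValued="true"/>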

Creating a Lily HBase Indexer Configuration

Configure individual Lily HBase Indexers using the hbase-indexer command-line utility. Typically, there is one Lily HBase Indexer configuration for each HBase table, but there can be as many Lily HBase Indexer configurations as there are tables, column families, and corresponding collections in Search. Each Lily HBase Indexer configuration is defined in an XML file, such as morphline-hbase-mapper.xml.

An indexer configuration XML file must refer to the MorphlineResultToSolrMapper implementation and point to the location of a Morphline configuration file, as shown in the following morphline-hbase-mapper.xml indexer configuration file:

$ cat $HOME/morphline-hbase-mapper.xml

<?xml version="1.0"?>
<indexer table="record"
mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper">

   <!-- The relative or absolute path on the local file system to the
   morphline configuration file. -->
   <!-- Use relative path "morphlines.conf" for morphlines managed by
   Cloudera Manager -->
   <param name="morphlineFile" value="/etc/hbase-solr/conf/morphlines.conf"/>

   <!-- The optional morphlineId identifies a morphline if there are multiple
   morphlines in morphlines.conf -->
   <!-- <param name="morphlineId" value="morphline1"/> -->

</indexer>

The Lily HBase Indexer configuration file also supports the standard attributes of any Lily HBase Indexer on the top-level <indexer> element: table, mapping-type, read-row, mapper, unique-key-formatter, unique-key-field, row-field, column-family-field, and table-family-field. It does not support the <field> or <extract> elements.
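
For reference, the following sketch shows an indexer definition that sets some of these optional attributes. The values shown, such as mapping-type="row" and the unique-key-field name, are illustrative assumptions; choose values that match your table and Solr schema:

<?xml version="1.0"?>
<indexer table="record"
         mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper"
         mapping-type="row"
         unique-key-field="id">
   <param name="morphlineFile" value="/etc/hbase-solr/conf/morphlines.conf"/>
</indexer>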

Creating a Morphline Configuration File

After creating an indexer configuration XML file, control its behavior by configuring morphline ETL transformation commands in a morphlines.conf configuration file. The morphlines.conf configuration file can contain any number of morphline commands. Typically, an extractHBaseCells command is the first command. The readAvroContainer or readAvro morphline commands are often used to extract Avro data from the HBase byte array. This configuration file can be shared among different applications that use morphlines.

$ cat /etc/hbase-solr/conf/morphlines.conf

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.morphline.**", "com.ngdata.**"]

    commands : [
      {
        extractHBaseCells {
          mappings : [
            {
              inputColumn : "data:*"
              outputField : "data"
              type : string
              source : value
            }

            #{
            #  inputColumn : "data:item"
            #  outputField : "_attachment_body"
            #  type : "byte[]"
            #  source : value
            #}
          ]
        }
      }

      #for avro use with type : "byte[]" in extractHBaseCells mapping above
      #{ readAvroContainer {} }
      #{
      #  extractAvroPaths {
      #    paths : {
      #      data : /user_name
      #    }
      #  }
      #}

      { logTrace { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]

Understanding the extractHBaseCells Morphline Command

The extractHBaseCells morphline command extracts cells from an HBase result and transforms the values into a SolrInputDocument. The command consists of an array of zero or more mapping specifications.

Each mapping has:

  • The inputColumn parameter, which specifies the data from HBase for populating a field in Solr. It has the form of a column family name and qualifier, separated by a colon. The qualifier portion can end in an asterisk, which is interpreted as a wildcard. In this case, all matching column-family and qualifier expressions are used. The following are examples of valid inputColumn values:
    • mycolumnfamily:myqualifier
    • mycolumnfamily:my*
    • mycolumnfamily:*
  • The outputField parameter specifies the morphline record field to which to add output values. The morphline record field is also known as the Solr document field. Example: first_name.
  • Dynamic output fields are enabled by the outputField parameter ending with a * wildcard. For example:
    inputColumn : "m:e:*"
    outputField : "belongs_to_*"
    In this case, if you make these puts in HBase:
    put 'table_name' , 'row1' , 'm:e:1' , 'foo'
    put 'table_name' , 'row1' , 'm:e:9' , 'bar'
    Then the fields of the Solr document are as follows:
    belongs_to_1 : foo
    belongs_to_9 : bar
  • The type parameter defines the data type of the content in HBase. All input data is stored in HBase as byte arrays, but all content in Solr is indexed as text, so a method for converting byte arrays to the actual data type is required. The type parameter can be the name of a type that is supported by org.apache.hadoop.hbase.util.Bytes.to* (which currently includes byte[], int, long, string, boolean, float, double, short, and bigdecimal). Use type byte[] to pass the byte array through to the morphline without conversion.
    • type:byte[] copies the byte array unmodified into the record output field
    • type:int converts with org.apache.hadoop.hbase.util.Bytes.toInt
    • type:long converts with org.apache.hadoop.hbase.util.Bytes.toLong
    • type:string converts with org.apache.hadoop.hbase.util.Bytes.toString
    • type:boolean converts with org.apache.hadoop.hbase.util.Bytes.toBoolean
    • type:float converts with org.apache.hadoop.hbase.util.Bytes.toFloat
    • type:double converts with org.apache.hadoop.hbase.util.Bytes.toDouble
    • type:short converts with org.apache.hadoop.hbase.util.Bytes.toShort
    • type:bigdecimal converts with org.apache.hadoop.hbase.util.Bytes.toBigDecimal
    Alternatively, the type parameter can be the name of a Java class that implements the com.ngdata.hbaseindexer.parse.ByteArrayValueMapper interface.
    HBase data formatting does not always match what is specified by org.apache.hadoop.hbase.util.Bytes.*. For example, this can occur with data of type float or double. You can enable indexing of such HBase data by converting the data. There are various ways to do so, including:
    • Using the java morphline command to parse input data, converting it to the expected output. For example:
      {
        java {
          imports : "import java.util.*;"
          code: """
            // manipulate the contents of a record field
            String stringAmount = (String) record.getFirstValue("amount");
            Double dbl = Double.parseDouble(stringAmount);
            record.replaceValues("amount", dbl);
            return child.process(record); // pass record to next command in chain
            """
        }
      }
    • Creating table fields with binary format and then using types such as double or float in morphlines.conf. For example, you could create an HBase-backed Hive table that stores doubles using commands similar to:
      CREATE TABLE sample_lily_hbase ( id string, amount double, ts timestamp )
      STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,ti:amount#b,ti:ts')
      TBLPROPERTIES ('hbase.table.name' = 'sample_lily');
  • The source parameter determines which portion of an HBase KeyValue is used as indexing input. Valid choices are value or qualifier. When value is specified, the HBase cell value is used as input for indexing. When qualifier is specified, the HBase column qualifier is used as input for indexing. The default is value. A sketch that uses source : qualifier follows this list.
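
The following sketch shows a mapping that uses source : qualifier against the record table from the earlier examples. The data_qualifiers output field name is an illustrative assumption; the mapping indexes the qualifier names under the data column family rather than the cell values:

{
  extractHBaseCells {
    mappings : [
      {
        inputColumn : "data:*"
        outputField : "data_qualifiers"
        type : string
        source : qualifier   # index the column qualifier names instead of the cell values
      }
    ]
  }
}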

Running HBaseMapReduceIndexerTool

Run HBaseMapReduceIndexerTool to index the HBase table using a MapReduce job, as follows:

  • For package-based deployments:

    hadoop --config /etc/hadoop/conf \
    jar /usr/lib/hbase-solr/tools/hbase-indexer-mr-*-job.jar \
    --conf /etc/hbase/conf/hbase-site.xml -Dmapreduce.map.java.opts="-Xmx512m" \
    -Dmapreduce.reduce.java.opts="-Xmx512m" \
    --hbase-indexer-file $HOME/morphline-hbase-mapper.xml \
    --zk-host 127.0.0.1/solr --collection hbase-collection1 \
    --go-live --log4j src/test/resources/log4j.properties
  • For parcel-based deployments:
    hadoop --config /etc/hadoop/conf \
    jar /opt/cloudera/parcels/CDH/lib/hbase-solr/tools/hbase-indexer-mr-*-job.jar \
    --conf /etc/hbase/conf/hbase-site.xml -Dmapreduce.map.java.opts="-Xmx512m" \
    -Dmapreduce.reduce.java.opts="-Xmx512m" \
    --hbase-indexer-file $HOME/morphline-hbase-mapper.xml \
    --zk-host 127.0.0.1/solr --collection hbase-collection1 \
    --go-live --log4j src/test/resources/log4j.properties
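
When the job and the go-live merge finish, you can spot-check the collection with a standard Solr query. The host name and port in this example are placeholders for one of your Solr servers, and the request assumes a cluster without TLS or Kerberos; adjust it for a secured cluster:

curl 'http://solrhost.example.com:8983/solr/hbase-collection1/select?q=*:*&rows=10&wt=json'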

Using --go-live with SSL or Kerberos

The go-live phase of the indexer jobs sends a MERGEINDEXES request from the indexer client (the node from which the MapReduce job was submitted) to the live Solr servers. If SSL is enabled on the Solr servers, you must ensure that the indexer client trusts the certificate presented by the Solr server(s); otherwise, the go-live step fails with an SSLPeerUnverifiedException.

  1. Specify the location of the trust store by setting the following HADOOP_OPTS variable before launching the indexer job:
    HADOOP_OPTS="-Djavax.net.ssl.trustStore=/etc/cdep-ssl-conf/CA_STANDARD/truststore.jks "
  2. If the Solr servers have Kerberos authentication enabled, you need to ensure that the indexer client can authenticate via Kerberos to the Solr servers. To do this, create a Java Authentication and Authorization Service (JAAS) configuration file locally on the node where the indexing job is launched:

    • If you are authenticating using kinit to obtain credentials, you can configure the client to use your credential cache by creating a jaas.conf file with the following contents:
      Client {
       com.sun.security.auth.module.Krb5LoginModule required
       useKeyTab=false
       useTicketCache=true
       principal="<user>@EXAMPLE.COM";
       };
      Replace <user> with your username, and EXAMPLE.COM with your Kerberos realm.
    • If you want the client application to authenticate using a keytab, create the jaas.conf file with the following contents instead:
      Client {
       com.sun.security.auth.module.Krb5LoginModule required
       useKeyTab=true
       keyTab="/path/to/user.keytab"
       storeKey=true
       useTicketCache=false
       principal="<user>@EXAMPLE.COM";
      };
      Replace /path/to/user.keytab with the keytab file you want to use and <user>@EXAMPLE.COM with the principal in the keytab. If you are using a service principal that includes the hostname, make sure that it is included in the jaas.conf file (for example, solr/solr01.example.com@EXAMPLE.COM).
  3. If you are using a ticket cache, you need to do a kinit to acquire a ticket for the configured principal before launching the indexer.
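
    For example, if the principal configured in jaas.conf is jdoe@EXAMPLE.COM (substitute your own):

    $ kinit jdoe@EXAMPLE.COM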

  4. Specify the authentication configuration in the HADOOP_OPTS environment variable:

    • For package-based installations:

      HADOOP_OPTS="-Djava.security.auth.login.config=jaas.conf -Djavax.net.ssl.trustStore=/etc/cdep-ssl-conf/CA_STANDARD/truststore.jks" \
      hadoop --config /etc/hadoop/conf \
      jar /usr/lib/hbase-solr/tools/hbase-indexer-mr-*-job.jar \
      --conf /etc/hbase/conf/hbase-site.xml -Dmapreduce.map.java.opts="-Xmx512m" -Dmapreduce.reduce.java.opts="-Xmx512m" \
      --hbase-indexer-file /home/systest/hbasetest/morphline-hbase-mapper.xml \
      --zk-host 127.0.0.1/solr \
      --collection hbase-collection1 \
      --go-live --log4j src/test/resources/log4j.properties
    • For parcel-based installations:
      HADOOP_OPTS="-Djava.security.auth.login.config=jaas.conf -Djavax.net.ssl.trustStore=/etc/cdep-ssl-conf/CA_STANDARD/truststore.jks" \
      hadoop --config /etc/hadoop/conf \
      jar /opt/cloudera/parcels/CDH/lib/hbase-solr/tools/hbase-indexer-mr-*-job.jar \
      --conf /etc/hbase/conf/hbase-site.xml -Dmapreduce.map.java.opts="-Xmx512m" -Dmapreduce.reduce.java.opts="-Xmx512m" \
      --hbase-indexer-file /home/systest/hbasetest/morphline-hbase-mapper.xml \
      --zk-host 127.0.0.1/solr \
      --collection hbase-collection1 \
      --go-live --log4j src/test/resources/log4j.properties

Understanding --go-live and HDFS ACLs

When run with a reduce phase, as opposed to a mapper-only job, the indexer creates an offline index on HDFS in the output directory specified by the --output-dir parameter. If the --go-live parameter is specified, Solr merges the resulting offline index into the live running service. Thus, the Solr service must have read access to the contents of the output directory to complete the --go-live step. If --overwrite-output-dir is specified, the indexer deletes and recreates any existing output directory; in an environment with restrictive permissions, such as one with an HDFS umask of 077, the Solr user may not be able to read the contents of the newly created directory. To address this issue, the indexer automatically applies HDFS ACLs to enable Solr to read the output directory contents. These ACLs are applied only if HDFS ACLs are enabled on the HDFS NameNode. For more information, see HDFS Extended ACLs.

The indexer only makes ACL updates to the output directory and its contents. If the output directory's parent directories do not grant the execute (traverse) permission, the Solr service cannot reach the output directory. Solr must be granted execute permission on the parent directories of the output directory through either standard permissions or ACLs.
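
If you need to grant that access yourself, you can use HDFS ACLs on the parent directories. The following is a sketch that assumes /user/systest/output is the --output-dir and solr is the Solr service user; adjust both for your environment:

# Grant the solr user execute (traverse) access to a parent directory
hdfs dfs -setfacl -m user:solr:--x /user/systest

# Review the resulting permissions and ACLs on the output directory
hdfs dfs -getfacl /user/systest/output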