Lily HBase Batch Indexing for Cloudera Search
With Cloudera Search, you can batch index HBase tables using the Lily HBase batch indexer MapReduce job, also known as HBaseMapReduceIndexerTool. This batch indexing does not require:
- HBase replication
- The Lily HBase Indexer Service
- Registering a Lily HBase Indexer configuration with the Lily HBase Indexer Service
The indexer supports flexible, custom, application-specific rules to extract, transform, and load HBase data into Solr. Solr search results can contain columnFamily:qualifier links back to the data stored in HBase. This way, applications can use the search result set to directly access matching raw HBase cells.
The following procedures demonstrate creating a small HBase table and using the HBaseMapReduceIndexerTool to index the table into a collection:
- Populating an HBase Table
- Creating a Collection in Cloudera Search
- Creating a Lily HBase Indexer Configuration File
- Creating a Morphline Configuration File
- Understanding the extractHBaseCells Morphline Command
- Running HBaseMapReduceIndexerTool
- Using --go-live with SSL or Kerberos
- Understanding --go-live and HDFS ACLs
Populating an HBase Table
After configuring and starting your system, create an HBase table and add rows to it. For example:
$ hbase shell hbase(main):002:0> create 'sample_table', {NAME => 'data'} hbase(main):002:0> put 'sample_table', 'row1', 'data', 'value' hbase(main):001:0> put 'sample_table', 'row2', 'data', 'value2'
Creating a Collection in Cloudera Search
A collection in Search used for HBase indexing must have a Solr schema that accommodates the types of HBase column families and qualifiers that are being indexed. To begin, consider adding the all-inclusive data field to a default schema. Once you decide on a schema, create a collection using commands similar to the following:
$ solrctl instancedir --generate $HOME/hbase_collection_config ## Edit $HOME/hbase_collection_config/conf/schema.xml as needed ## ## If you are using Sentry for authorization, copy solrconfig.xml.secure to solrconfig.xml as follows: ## ## cp $HOME/hbase_collection_config/conf/solrconfig.xml.secure $HOME/hbase_collection_config/conf/solrconfig.xml ## $ solrctl instancedir --create hbase_collection_config $HOME/hbase_collection_config $ solrctl collection --create hbase_collection -s <numShards> -c hbase_collection_config
Creating a Lily HBase Indexer Configuration File
Configure individual Lily HBase Indexers using the hbase-indexer command-line utility. Typically, there is one Lily HBase Indexer configuration file for each HBase table, but there can be as many Lily HBase Indexer configuration files as there are tables, column families, and corresponding collections in Search. Each Lily HBase Indexer configuration is defined in an XML file, such as morphline-hbase-mapper.xml.
An indexer configuration XML file must refer to the MorphlineResultToSolrMapper implementation and point to the location of a Morphline configuration file, as shown in the following morphline-hbase-mapper.xml indexer configuration file. For Cloudera Manager managed environments, set morphlineFile to the relative path morphlines.conf. For unmanaged environments, specify the absolute path to a morphlines.conf file that exists on the Lily HBase Indexer host. Make sure the file is readable by the HBase system user (hbase by default).
$ cat $HOME/morphline-hbase-mapper.xml <?xml version="1.0"?> <indexer table="sample_table" mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper"> <!-- The relative or absolute path on the local file system to the morphline configuration file. --> <!-- Use relative path "morphlines.conf" for morphlines managed by Cloudera Manager --> <param name="morphlineFile" value="/path/to/morphlines.conf"/> <!-- The optional morphlineId identifies a morphline if there are multiple morphlines in morphlines.conf --> <!-- <param name="morphlineId" value="morphline1"/> --> </indexer>
The Lily HBase Indexer configuration file also supports the standard attributes of any HBase Lily Indexer on the top-level <indexer> element: table, mapping-type, read-row, mapper, unique-key-formatter, unique-key-field, row-field, column-family-field, and table-family-field. It does not support the <field> element and <extract> elements.
Creating a Morphline Configuration File
After creating an indexer configuration XML file, you can configure morphline ETL transformation commands in a morphlines.conf configuration file. The morphlines.conf configuration file can contain any number of morphline commands. Typically, an extractHBaseCells command is the first command. The readAvroContainer or readAvro morphline commands are often used to extract Avro data from the HBase byte array. This configuration file can be shared among different applications that use morphlines.
If you are using Cloudera Manager, the morphlines.conf file is edited within Cloudera Manager (
).For unmanaged environments, create the morphlines.conf file on the Lily HBase Indexer host:
$ cat /etc/hbase-solr/conf/morphlines.conf morphlines : [ { id : morphline1 importCommands : ["org.kitesdk.morphline.**", "com.ngdata.**"] commands : [ { extractHBaseCells { mappings : [ { inputColumn : "data:*" outputField : "data" type : string source : value } #{ # inputColumn : "data:item" # outputField : "_attachment_body" # type : "byte[]" # source : value #} ] } } #for avro use with type : "byte[]" in extractHBaseCells mapping above #{ readAvroContainer {} } #{ # extractAvroPaths { # paths : { # data : /user_name # } # } #} { logTrace { format : "output record: {}", args : ["@{}"] } } ] } ]
Understanding the extractHBaseCells Morphline Command
The extractHBaseCells morphline command extracts cells from an HBase result and transforms the values into a SolrInputDocument. The command consists of an array of zero or more mapping specifications.
Each mapping has:
- The inputColumn parameter, which specifies the data from HBase for populating a field in Solr. It has the form of a
column family name and qualifier, separated by a colon. The qualifier portion can end in an asterisk, which is interpreted as a wildcard. In this case, all matching column-family and qualifier
expressions are used. The following are examples of valid inputColumn values:
- mycolumnfamily:myqualifier
- mycolumnfamily:my*
- mycolumnfamily:*
- The outputField parameter specifies the morphline record field to which to add output values. The morphline record field is also known as the Solr document field. Example: first_name.
- Dynamic output fields are enabled by the outputField parameter ending with a wildcard (*). For example:
inputColumn : "mycolumnfamily:*" outputField : "belongs_to_*"
In this case, if you make these puts in HBase:put 'table_name' , 'row1' , 'mycolumnfamily:1' , 'foo' put 'table_name' , 'row1' , 'mycolumnfamily:9' , 'bar'
Then the fields of the Solr document are as follows:belongs_to_1 : foo belongs_to_9 : bar
- The type parameter defines the data type of the content in HBase. All input data is stored in HBase as byte arrays, but
all content in Solr is indexed as text, so a method for converting byte arrays to the actual data type is required. The type parameter can be the name of a type that is supported by org.apache.hadoop.hbase.util.Bytes.to* (which currently includes byte[], int, long,
string, boolean, float, double, short, and
bigdecimal). Use type byte[] to pass the byte array through to the morphline without conversion.
- type:byte[] copies the byte array unmodified into the record output field
- type:int converts with org.apache.hadoop.hbase.util.Bytes.toInt
- type:long converts with org.apache.hadoop.hbase.util.Bytes.toLong
- type:string converts with org.apache.hadoop.hbase.util.Bytes.toString
- type:boolean converts with org.apache.hadoop.hbase.util.Bytes.toBoolean
- type:float converts with org.apache.hadoop.hbase.util.Bytes.toFloat
- type:double converts with org.apache.hadoop.hbase.util.Bytes.toDouble
- type:short converts with org.apache.hadoop.hbase.util.Bytes.toShort
- type:bigdecimal converts with org.apache.hadoop.hbase.util.Bytes.toBigDecimal
HBase data formatting does not always match what is specified by org.apache.hadoop.hbase.util.Bytes.*. For example, this can occur with data of type float or double. You can enable indexing of such HBase data by converting the data. There are various ways to do so, including:- Using Java morphline command to parse input data, converting it to the expected output. For example:
{ imports : "import java.util.*;" code: """ // manipulate the contents of a record field String stringAmount = (String) record.getFirstValue("amount"); Double dbl = Double.parseDouble(stringAmount); record.replaceValues("amount",dbl); return child.process(record); // pass record to next command in chain """ }
- Creating table fields with binary format and then using types such as double or float in a morphline.conf. You could create a table in HBase for storing
doubles using commands similar to:
CREATE TABLE sample_lily_hbase ( id string, amount double, ts timestamp ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,ti:amount#b,ti:ts,') TBLPROPERTIES ('hbase.table.name' = 'sample_lily');
- The source parameter determines which portion of an HBase KeyValue is used as indexing input. Valid choices are value or qualifier. When value is specified, the HBase cell value is used as input for indexing. When qualifier is specified, then the HBase column qualifier is used as input for indexing. The default is value.
Running HBaseMapReduceIndexerTool
Run the command as follows:
-
For package-based deployments:
hadoop --config /etc/hadoop/conf \ jar /usr/lib/hbase-solr/tools/hbase-indexer-mr-*-job.jar \ --conf /etc/hbase/conf/hbase-site.xml -Dmapreduce.map.java.opts="-Xmx512m" \ -Dmapreduce.reduce.java.opts="-Xmx512m" \ --hbase-indexer-file $HOME/morphline-hbase-mapper.xml \ --zk-host 127.0.0.1/solr --collection hbase-collection1 \ --go-live --log4j src/test/resources/log4j.properties
-
For parcel-based deployments:
hadoop --config /etc/hadoop/conf \ jar /opt/cloudera/parcels/CDH/lib/hbase-solr/tools/hbase-indexer-mr-*-job.jar \ --conf /etc/hbase/conf/hbase-site.xml -Dmapreduce.map.java.opts="-Xmx512m" \ -Dmapreduce.reduce.java.opts="-Xmx512m" \ --hbase-indexer-file $HOME/morphline-hbase-mapper.xml \ --zk-host 127.0.0.1/solr --collection hbase-collection1 \ --go-live --log4j src/test/resources/log4j.properties
$ hadoop jar /opt/cloudera/parcels/CDH/jars/hbase-indexer-mr-*-job.jar --help
$ hadoop jar /usr/lib/hbase-solr/tools/hbase-indexer-mr-*-job.jar --help
The full command line usage help is as follows:
usage: hadoop [GenericOptions]... jar hbase-indexer-mr-*-job.jar [--hbase-indexer-zk STRING] [--hbase-indexer-name STRING] [--hbase-indexer-file FILE] [--hbase-indexer-component-factory STRING] [--hbase-table-name STRING] [--hbase-start-row BINARYSTRING] [--hbase-end-row BINARYSTRING] [--hbase-start-time STRING] [--hbase-end-time STRING] [--hbase-timestamp-format STRING] [--zk-host STRING] [--go-live] [--collection STRING] [--go-live-threads INTEGER] [--help] [--output-dir HDFS_URI] [--overwrite-output-dir] [--morphline-file FILE] [--morphline-id STRING] [--solr-home-dir DIR] [--update-conflict-resolver FQCN] [--reducers INTEGER] [--max-segments INTEGER] [--fair-scheduler-pool STRING] [--dry-run] [--log4j FILE] [--verbose] [--clear-index] [--show-non-solr-cloud] MapReduce batch job driver that takes input data from an HBase table and creates Solr index shards and writes the indexes into HDFS, in a flexible, scalable, and fault-tolerant manner. It also supports merging the output shards into a set of live customer-facing Solr servers in SolrCloud. Optionally, documents can be sent directly from the mapper tasks to SolrCloud, which is a much less scalable approach but enables updating existing documents in SolrCloud. The program proceeds in one or multiple consecutive MapReduce-based phases, as follows: 1) Mapper phase: This (parallel) phase scans over the input HBase table, extracts the relevant content, and transforms it into SolrInputDocuments. If run as a mapper-only job, this phase also writes the SolrInputDocuments directly to a live SolrCloud cluster. The conversion from HBase records into Solr documents is performed via a hbase-indexer configuration and typically based on a morphline. 2) Reducer phase: This (parallel) phase loads the mapper's SolrInputDocuments into one EmbeddedSolrServer per reducer. Each such reducer and Solr server can be seen as a (micro) shard. The Solr servers store their data in HDFS. 3) Mapper-only merge phase: This (parallel) phase merges the set of reducer shards into the number of Solr shards expected by the user, using a mapper-only job. This phase is omitted if the number of shards is already equal to the number of shards expected by the user 4) Go-live phase: This optional (parallel) phase merges the output shards of the previous phase into a set of live customer-facing Solr servers in SolrCloud. If this phase is omitted you can explicitly point each Solr server to one of the HDFS output shard directories Fault Tolerance: Mapper and reducer task attempts are retried on failure per the standard MapReduce semantics. On program startup all data in the -- output-dir is deleted if that output directory already exists and -- overwrite-output-dir is specified. This means that if the whole job fails you can retry simply by rerunning the program again using the same arguments. HBase Indexer parameters: Parameters for specifying the HBase indexer definition and/or where it should be loaded from. --hbase-indexer-zk STRING The address of the ZooKeeper ensemble from which to fetch the indexer definition named --hbase- indexer-name. Format is: a list of comma separated host:port pairs, each corresponding to a zk server. Example: '127.0.0.1:2181,127.0.0.1: 2182,127.0.0.1:2183' --hbase-indexer-name STRING The name of the indexer configuration to fetch from the ZooKeeper ensemble specified with -- hbase-indexer-zk. Example: myIndexer --hbase-indexer-file FILE Relative or absolute path to a local HBase indexer XML configuration file. If supplied, this overrides --hbase-indexer-zk and --hbase-indexer- name. Example: /path/to/morphline-hbase-mapper.xml --hbase-indexer-component-factory STRING Classname of the hbase indexer component factory. HBase scan parameters: Parameters for specifying what data is included while reading from HBase. --hbase-table-name STRING Optional name of the HBase table containing the records to be indexed. If supplied, this overrides the value from the --hbase-indexer-* options. Example: myTable --hbase-start-row BINARYSTRING Binary string representation of start row from which to start indexing (inclusive). The format of the supplied row key should use two-digit hex values prefixed by \x for non-ascii characters (e. g. 'row\x00'). The semantics of this argument are the same as those for the HBase Scan#setStartRow method. The default is to include the first row of the table. Example: AAAA --hbase-end-row BINARYSTRING Binary string representation of end row prefix at which to stop indexing (exclusive). See the description of --hbase-start-row for more information. The default is to include the last row of the table. Example: CCCC --hbase-start-time STRING Earliest timestamp (inclusive) in time range of HBase cells to be included for indexing. The default is to include all cells. Example: 0 --hbase-end-time STRING Latest timestamp (exclusive) of HBase cells to be included for indexing. The default is to include all cells. Example: 123456789 --hbase-timestamp-format STRING Timestamp format to be used to interpret --hbase- start-time and --hbase-end-time. This is a java. text.SimpleDateFormat compliant format (see http: //docs.oracle. com/javase/8/docs/api/java/text/SimpleDateFormat. html). If this parameter is omitted then the timestamps are interpreted as number of milliseconds since the standard epoch (Unix time). Example: yyyy-MM-dd'T'HH:mm:ss.SSSZ Solr cluster arguments: Arguments that provide information about your Solr cluster. --zk-host STRING The address of a ZooKeeper ensemble being used by a SolrCloud cluster. This ZooKeeper ensemble will be examined to determine the number of output shards to create as well as the Solr URLs to merge the output shards into when using the --go- live option. Requires that you also pass the -- collection to merge the shards into. The --zk-host option implements the same partitioning semantics as the standard SolrCloud Near-Real-Time (NRT) API. This enables to mix batch updates from MapReduce ingestion with updates from standard Solr NRT ingestion on the same SolrCloud cluster, using identical unique document keys. Format is: a list of comma separated host:port pairs, each corresponding to a zk server. Example: '127.0.0.1:2181,127.0.0.1:2182,127.0.0.1: 2183' If the optional chroot suffix is used the example would look like: '127.0.0.1:2181/solr, 127.0.0.1:2182/solr,127.0.0.1:2183/solr' where the client would be rooted at '/solr' and all paths would be relative to this root - i.e. getting/setting/etc... '/foo/bar' would result in operations being run on '/solr/foo/bar' (from the server perspective). If --solr-home-dir is not specified, the Solr home directory for the collection will be downloaded from this ZooKeeper ensemble. Go live arguments: Arguments for merging the shards that are built into a live Solr cluster. Also see the Cluster arguments. --go-live Allows you to optionally merge the final index shards into a live Solr cluster after they are built. You can pass the ZooKeeper address with -- zk-host and the relevant cluster information will be auto detected. (default: false) --collection STRING The SolrCloud collection to merge shards into when using --go-live and --zk-host. Example: collection1 --go-live-threads INTEGER Tuning knob that indicates the maximum number of live merges to run in parallel at one time. (default: 1000) Optional arguments: --help, -help, -h Show this help message and exit --output-dir HDFS_URI HDFS directory to write Solr indexes to. Inside there one output directory per shard will be generated. Example: hdfs://c2202.mycompany. com/user/$USER/test --overwrite-output-dir Overwrite the directory specified by --output-dir if it already exists. Using this parameter will result in the output directory being recursively deleted at job startup. (default: false) --morphline-file FILE Relative or absolute path to a local config file that contains one or more morphlines. The file must be UTF-8 encoded. The file will be uploaded to each MR task. If supplied, this overrides the value from the --hbase-indexer-* options. Example: /path/to/morphlines.conf --morphline-id STRING The identifier of the morphline that shall be executed within the morphline config file, e.g. specified by --morphline-file. If the --morphline- id option is ommitted the first (i.e. top-most) morphline within the config file is used. If supplied, this overrides the value from the -- hbase-indexer-* options. Example: morphline1 --solr-home-dir DIR Optional relative or absolute path to a local dir containing Solr conf/ dir and in particular conf/solrconfig.xml and optionally also lib/ dir. This directory will be uploaded to each MR task. Example: src/test/resources/solr/minimr --update-conflict-resolver FQCN Fully qualified class name of a Java class that implements the UpdateConflictResolver interface. This enables deduplication and ordering of a series of document updates for the same unique document key. For example, a MapReduce batch job might index multiple files in the same job where some of the files contain old and new versions of the very same document, using the same unique document key. Typically, implementations of this interface forbid collisions by throwing an exception, or ignore all but the most recent document version, or, in the general case, order colliding updates ascending from least recent to most recent (partial) update. The caller of this interface (i. e. the Hadoop Reducer) will then apply the updates to Solr in the order returned by the orderUpdates() method. The default RetainMostRecentUpdateConflictResolver implementation ignores all but the most recent document version, based on a configurable numeric Solr field, which defaults to the file_last_modified timestamp (default: org.apache. solr.hadoop.dedup. RetainMostRecentUpdateConflictResolver) --reducers INTEGER Tuning knob that indicates the number of reducers to index into. 0 indicates that no reducers should be used, and documents should be sent directly from the mapper tasks to live Solr servers. -1 indicates use all reduce slots available on the cluster. -2 indicates use one reducer per output shard, which disables the mtree merge MR algorithm. The mtree merge MR algorithm improves scalability by spreading load (in particular CPU load) among a number of parallel reducers that can be much larger than the number of solr shards expected by the user. It can be seen as an extension of concurrent lucene merges and tiered lucene merges to the clustered case. The subsequent mapper-only phase merges the output of said large number of reducers to the number of shards expected by the user, again by utilizing more available parallelism on the cluster. (default: -1) --max-segments INTEGER Tuning knob that indicates the maximum number of segments to be contained on output in the index of each reducer shard. After a reducer has built its output index it applies a merge policy to merge segments until there are <= maxSegments lucene segments left in this index. Merging segments involves reading and rewriting all data in all these segment files, potentially multiple times, which is very I/O intensive and time consuming. However, an index with fewer segments can later be merged faster, and it can later be queried faster once deployed to a live Solr serving shard. Set maxSegments to 1 to optimize the index for low query latency. In a nutshell, a small maxSegments value trades indexing latency for subsequently improved query latency. This can be a reasonable trade-off for batch indexing systems. (default: 1) --fair-scheduler-pool STRING Optional tuning knob that indicates the name of the fair scheduler pool to submit jobs to. The Fair Scheduler is a pluggable MapReduce scheduler that provides a way to share large clusters. Fair scheduling is a method of assigning resources to jobs such that all jobs get, on average, an equal share of resources over time. When there is a single job running, that job uses the entire cluster. When other jobs are submitted, tasks slots that free up are assigned to the new jobs, so that each job gets roughly the same amount of CPU time. Unlike the default Hadoop scheduler, which forms a queue of jobs, this lets short jobs finish in reasonable time while not starving long jobs. It is also an easy way to share a cluster between multiple of users. Fair sharing can also work with job priorities - the priorities are used as weights to determine the fraction of total compute time that each job gets. --dry-run Run in local mode and print documents to stdout instead of loading them into Solr. This executes the morphline in the client process (without submitting a job to MR) for quicker turnaround during early trial & debug sessions. (default: false) --log4j FILE Relative or absolute path to a log4j.properties config file on the local file system. This file will be uploaded to each MR task. Example: /path/to/log4j.properties --verbose, -v Turn on verbose output. (default: false) --clear-index Will attempt to delete all entries in a solr index before starting batch build. This is not transactional so if the build fails the index will be empty. (default: false) --show-non-solr-cloud Also show options for Non-SolrCloud mode as part of --help. (default: false) Generic options supported are --conf <configuration file> specify an application configuration file -D <property=value> use value for given property --fs <local|namenode:port> specify a namenode --jt <local|resourcemanager:port> specify a ResourceManager --files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster --libjars <comma separated list of jars> specify comma separated jar files to include in the classpath. --archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines. The general command line syntax is bin/hadoop command [genericOptions] [commandOptions] Examples: # (Re)index a table in GoLive mode based on a local indexer config file hadoop --config /etc/hadoop/conf \ jar hbase-indexer-mr-*-job.jar \ --conf /etc/hbase/conf/hbase-site.xml \ -D 'mapred.child.java.opts=-Xmx500m' \ --hbase-indexer-file indexer.xml \ --zk-host 127.0.0.1/solr \ --collection collection1 \ --go-live \ --log4j src/test/resources/log4j.properties # (Re)index a table in GoLive mode using a local morphline-based indexer config file # Also include extra library jar file containing JSON tweet Java parser: hadoop --config /etc/hadoop/conf \ jar hbase-indexer-mr-*-job.jar \ --conf /etc/hbase/conf/hbase-site.xml \ --libjars /path/to/kite-morphlines-twitter-0.10.0.jar \ -D 'mapred.child.java.opts=-Xmx500m' \ --hbase-indexer-file src/test/resources/morphline_indexer_without_zk.xml \ --zk-host 127.0.0.1/solr \ --collection collection1 \ --go-live \ --morphline-file src/test/resources/morphlines.conf \ --output-dir hdfs://c2202.mycompany.com/user/$USER/test \ --overwrite-output-dir \ --log4j src/test/resources/log4j.properties # (Re)index a table in GoLive mode hadoop --config /etc/hadoop/conf \ jar hbase-indexer-mr-*-job.jar \ --conf /etc/hbase/conf/hbase-site.xml \ -D 'mapred.child.java.opts=-Xmx500m' \ --hbase-indexer-file indexer.xml \ --zk-host 127.0.0.1/solr \ --collection collection1 \ --go-live \ --log4j src/test/resources/log4j.properties # (Re)index a table with direct writes to SolrCloud hadoop --config /etc/hadoop/conf \ jar hbase-indexer-mr-*-job.jar \ --conf /etc/hbase/conf/hbase-site.xml \ -D 'mapred.child.java.opts=-Xmx500m' \ --hbase-indexer-file indexer.xml \ --zk-host 127.0.0.1/solr \ --collection collection1 \ --reducers 0 \ --log4j src/test/resources/log4j.properties # (Re)index a table based on a indexer config stored in ZK hadoop --config /etc/hadoop/conf \ jar hbase-indexer-mr-*-job.jar \ --conf /etc/hbase/conf/hbase-site.xml \ -D 'mapred.child.java.opts=-Xmx500m' \ --hbase-indexer-zk zk01 \ --hbase-indexer-name docindexer \ --go-live \ --log4j src/test/resources/log4j.properties # MapReduce on Yarn - Pass custom JVM arguments HADOOP_CLIENT_OPTS='-DmaxConnectionsPerHost=10000 -DmaxConnections=10000'; \ hadoop --config /etc/hadoop/conf \ jar hbase-indexer-mr-*-job.jar \ --conf /etc/hbase/conf/hbase-site.xml \ -D 'mapreduce.map.java.opts=-DmaxConnectionsPerHost=10000 -DmaxConnections=10000' \ -D 'mapreduce.reduce.java.opts=-DmaxConnectionsPerHost=10000 -DmaxConnections=10000' \ --hbase-indexer-zk zk01 \ --hbase-indexer-name docindexer \ --go-live \ --log4j src/test/resources/log4j.properties # MapReduce on MR1 - Pass custom JVM arguments HADOOP_CLIENT_OPTS='-DmaxConnectionsPerHost=10000 -DmaxConnections=10000'; \ hadoop --config /etc/hadoop/conf \ jar hbase-indexer-mr-*-job.jar \ --conf /etc/hbase/conf/hbase-site.xml \ -D 'mapreduce.child.java.opts=-DmaxConnectionsPerHost=10000 -DmaxConnections=10000' \ --hbase-indexer-zk zk01 \ --hbase-indexer-name docindexer \ --go-live \ --log4j src/test/resources/log4j.properties
Using --go-live with SSL or Kerberos
The go-live phase of the indexer jobs sends a MERGEINDEXES request from the indexer client (the node from which the MR job was submitted) to the live Solr servers. If the Solr server has SSL enabled, you need to ensure that the indexer client trusts the certificate presented by the Solr server(s), otherwise you get an SSLPeerUnverifiedException.
-
Specify the location of the trust store by setting the following HADOOP_OPTS variable before launching the indexer job:
HADOOP_OPTS="-Djavax.net.ssl.trustStore=/etc/cdep-ssl-conf/CA_STANDARD/truststore.jks "
-
If the Solr servers have Kerberos authentication enabled, you need to ensure that the indexer client can authenticate via Kerberos to the Solr servers. For this, you need to create a Java Authentication and Authorization Service configuration (JAAS) file locally on the node where the indexing job is launched:
- If you are authenticating using kinit to obtain credentials, you can configure the client to use your credential cache by creating a jaas.conf file with the following contents:
Client { com.sun.security.auth.module.Krb5LoginModule required useKeyTab=false useTicketCache=true principal="<user>@EXAMPLE.COM"; };
Replace <user> with your username, and EXAMPLE.COM with your Kerberos realm. - If you want the client application to authenticate using a keytab, modify jaas-client.conf as follows:
Client { com.sun.security.auth.module.Krb5LoginModule required useKeyTab=true keyTab="/path/to/user.keytab" storeKey=true useTicketCache=false principal="<user>@EXAMPLE.COM"; };
Replace /path/to/user.keytab with the keytab file you want to use and <user>@EXAMPLE.COM with the principal in the keytab. If you are using a service principal that includes the hostname, make sure that it is included in the jaas.conf file (for example, solr/solr01.example.com@EXAMPLE.COM).
- If you are authenticating using kinit to obtain credentials, you can configure the client to use your credential cache by creating a jaas.conf file with the following contents:
-
If you are using a ticket cache, you need to do a kinit to acquire a ticket for the configured principal before launching the indexer.
-
Specify the authentication configuration in the HADOOP_OPTS environment variable:
-
For package-based installations:
HADOOP_OPTS="-Djava.security.auth.login.config=jaas.conf -Djavax.net.ssl.trustStore=/etc/cdep-ssl-conf/CA_STANDARD/truststore.jks" \ hadoop --config /etc/hadoop/conf \ jar /usr/lib/hbase-solr/tools/hbase-indexer-mr-*-job.jar \ --conf /etc/hbase/conf/hbase-site.xml -Dmapreduce.map.java.opts="-Xmx512m" -Dmapreduce.reduce.java.opts="-Xmx512m" \ --hbase-indexer-file /home/systest/hbasetest/morphline-hbase-mapper.xml \ --zk-host 127.0.0.1/solr \ --collection hbase-collection1 \ --go-live --log4j src/test/resources/log4j.properties
-
For parcel-based installations:
HADOOP_OPTS="-Djava.security.auth.login.config=jaas.conf -Djavax.net.ssl.trustStore=/etc/cdep-ssl-conf/CA_STANDARD/truststore.jks" \ hadoop --config /etc/hadoop/conf \ jar /opt/cloudera/parcels/CDH/lib/hbase-solr/tools/hbase-indexer-mr-*-job.jar \ --conf /etc/hbase/conf/hbase-site.xml -Dmapreduce.map.java.opts="-Xmx512m" -Dmapreduce.reduce.java.opts="-Xmx512m" \ --hbase-indexer-file /home/systest/hbasetest/morphline-hbase-mapper.xml \ --zk-host 127.0.0.1/solr \ --collection hbase-collection1 \ --go-live --log4j src/test/resources/log4j.properties
-
Understanding --go-live and HDFS ACLs
When run with a reduce phase, as opposed to as a mapper-only job, the indexer creates an offline index on HDFS in the output directory specified by the --output-dir parameter. If the --go-live parameter is specified, Solr merges the resulting offline index into the live running service. Thus, the Solr service must have read access to the contents of the output directory in order to complete the --go-live step. If --overwrite-output-dir is specified, the indexer deletes and recreates any existing output directory; in an environment with restrictive permissions, such as one with an HDFS umask of 077, the Solr user may not be able to read the contents of the newly created directory. To address this issue, the indexer automatically applies the HDFS ACLs to enable Solr to read the output directory contents. These ACLs are only applied if HDFS ACLs are enabled on the HDFS NameNode. For more information, see HDFS Extended ACLs.
The indexer only makes ACL updates to the output directory and its contents. If the output directory's parent directories do not include the execute permission, the Solr service cannot access the output directory. Solr must have execute permissions from standard permissions or ACLs on the parent directories of the output directory.