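Preface:
For a project I needed to set up an HBase secondary-index environment. Every tutorial I found online had pitfalls, so I put together a more complete one. This article trims or skips material that an experienced hand would recognize at a glance. It suits readers who are new to this area but already have some Linux and Hadoop background, and it is not meant for complete beginners.
Environment: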
OS:CentOS6.7-x86_64
JDK:jdk1.7.0_109
hadoop-2.6.0+cdh5.4.1
hbase-solr-1.5+cdh5.4.1 (hbase-indexer-1.5-cdh5.4.1)
solr-4.10.3-cdh5.4.1
zookeeper-3.4.5-cdh5.4.1
hbase-1.0.0-cdh5.4.1
Download page for the CDH software used in this article:
CDH 5.4.x Packaging and Tarball Information | 5.x | Cloudera Documentation
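I. Basic environment preparation
1. A 3-node Hadoop cluster, with server roles planned as follows:
Start the NameNode, DataNode, ZooKeeper, JournalNode, and ZKFC first. How to do that is not the focus of this article and is not covered here.
2. Download the required CDH packages:
Download the tarballs from the page linked at the top of this article. Note that the hbase-solr tarball contains the whole project, while only its deployment archive is needed. Unpack the hbase-solr-1.5+cdh5.4.1 tarball and find hbase-indexer-1.5-cdh5.4.1.tar.gz under hbase-solr-1.5-cdh5.4.1/hbase-indexer-dist/target; it is used later.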
II. Deploying hbase-indexer
Copy hbase-indexer-1.5-cdh5.4.1.tar.gz to node2 or node3 (node2 is assumed below).
Unpack hbase-indexer-1.5-cdh5.4.1.tar.gz:
tar zxvf hbase-indexer-1.5-cdh5.4.1.tar.gz
Edit the hbase-indexer settings:
vim hbase-indexer-1.5-cdh5.4.1/conf/hbase-indexer-site.xml
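<property>
  <name>hbaseindexer.zookeeper.connectstring</name>
  <value>node1:2181,node2:2181,node3:2181</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>node1,node2,node3</value>
</property>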
Configure hbase-indexer-env.sh:
vim hbase-indexer-1.5-cdh5.4.1/conf/hbase-indexer-env.sh
Set JAVA_HOME:
# Set environment variables here.
# This script sets variables multiple times over the course of starting an hbase-indexer process,
# so try to keep things idempotent unless you want to take an even deeper look
# into the startup scripts (bin/hbase-indexer, etc.)
# The java implementation to use. Java 1.6 required.
export JAVA_HOME=/usr/java/jdk1.7.0/
# Adjust to your environment.
# Extra Java CLASSPATH elements. Optional.
# export HBASE_INDEXER_CLASSPATH=
# The maximum amount of heap to use, in MB. Default is 1000.
# export HBASE_INDEXER_HEAPSIZE=1000
# Extra Java runtime options.
# Below are what we set by default. May only work with SUN JVM.
# For more on why as well as other possible settings,
# see http://wiki.apache.org/hadoop/PerformanceTuning
export HBASE_INDEXER_OPTS="$HBASE_INDEXER_OPTS -XX:+UseConcMarkSweepGC"
Use scp to copy the whole hbase-indexer-1.5-cdh5.4.1 directory to node3.
III. Deploying HBase
Unpack the HBase tarball:
tar zxvf hbase-1.0.0-cdh5.4.1.tar.gz
hbase-site.xml needs editing as well:
vim hbase-1.0.0-cdh5.4.1/conf/hbase-site.xml
Add the following properties inside the <configuration> element:
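<property>
  <name>hbase.rootdir</name>
  <value>hdfs://node1:9000/hbase</value>
  <description>The directory shared by RegionServers</description>
</property>
<property>
  <name>hbase.master</name>
  <value>node1:60000</value>
</property>
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
  <description>The mode the cluster will be in. Possible values are
    false: standalone and pseudo-distributed setups with managed Zookeeper
    true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
  </description>
</property>
<property>
  <name>hbase.replication</name>
  <value>true</value>
  <description>SEP is basically replication, so enable it</description>
</property>
<property>
  <name>replication.source.ratio</name>
  <value>1.0</value>
  <description>Source ratio of 100% makes sure that each SEP consumer is actually used (otherwise, some can sit idle, especially with small clusters)</description>
</property>
<property>
  <name>replication.source.nb.capacity</name>
  <value>1000</value>
  <description>Maximum number of hlog entries to replicate in one go. If this is large, and a consumer takes a while to process the events, the HBase rpc call will time out.</description>
</property>
<property>
  <name>replication.replicationsource.implementation</name>
  <value>com.ngdata.sep.impl.SepReplicationSource</value>
  <description>A custom replication source that fixes a few things and adds some functionality (doesn't interfere with normal replication usage).</description>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>node1,node2,node3</value>
  <description>Comma-separated list of servers in the ZooKeeper quorum.</description>
</property>
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>/home/HBasetest/zookeeperdata</value>
  <description>Property from ZooKeeper's config zoo.cfg.
    The directory where the snapshot is stored.
  </description>
</property>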
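Before starting HBase it can help to confirm that the replication settings listed above actually made it into the file. The following is a minimal sketch of such a check, assuming hbase-site.xml uses the usual Hadoop <property>/<name>/<value> layout. The class SiteConfigCheck and its methods are hypothetical names invented for this illustration; they are not part of HBase or hbase-indexer, and only two of the settings are checked.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not shipped with HBase or hbase-indexer.
class SiteConfigCheck {

    // Two of the settings from this section that SEP-based indexing relies on.
    static final String[][] REQUIRED = {
        {"hbase.replication", "true"},
        {"replication.replicationsource.implementation",
         "com.ngdata.sep.impl.SepReplicationSource"},
    };

    /** Parses Hadoop-style site XML into an ordered name -> value map. */
    static Map<String, String> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new LinkedHashMap<String, String>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element property = (Element) nodes.item(i);
                String name = childText(property, "name");
                if (name != null) {
                    props.put(name, childText(property, "value"));
                }
            }
            return props;
        } catch (Exception e) {
            throw new IllegalArgumentException("not a readable site file", e);
        }
    }

    // Trimmed text of the first <tag> element below parent, or null if there is none.
    private static String childText(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? null : list.item(0).getTextContent().trim();
    }

    /** Returns one message per required setting that is missing or has another value. */
    static List<String> problems(Map<String, String> props) {
        List<String> out = new ArrayList<String>();
        for (String[] required : REQUIRED) {
            String actual = props.get(required[0]);
            if (!required[1].equals(actual)) {
                out.add(required[0] + " should be " + required[1] + " but is " + actual);
            }
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path to hbase-site.xml
        String xml = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        for (String problem : problems(parse(xml))) {
            System.out.println(problem);
        }
    }
}
```

Run it with the path to hbase-1.0.0-cdh5.4.1/conf/hbase-site.xml as the only argument. No output means both settings were found with the expected values.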
Similarly, edit hbase-env.sh:
vim hbase-1.0.0-cdh5.4.1/conf/hbase-env.sh
Set JAVA_HOME and HBASE_HOME:
# Set environment variables here.
# This script sets variables multiple times over the course of starting an hbase process,
# so try to keep things idempotent unless you want to take an even deeper look
# into the startup scripts (bin/hbase, etc.)
# The java implementation to use. Java 1.7+ required.
# export JAVA_HOME=/usr/java/jdk1.6.0/
export JAVA_HOME=/opt/jdk1.7.0_79
export HBASE_HOME=/home/HBasetest/hbase-1.0.0-cdh5.4.1
# Adjust to your environment.
# Extra Java CLASSPATH elements. Optional.
# export HBASE_CLASSPATH=
# The maximum amount of heap to use, in MB. Default is 1000.
# export HBASE_HEAPSIZE=1000
# Uncomment below if you intend to use off heap cache.
# export HBASE_OFFHEAPSIZE=1000
# For example, to allocate 8G of offheap, to 8G:
# export HBASE_OFFHEAPSIZE=8G
# Extra Java runtime options.
# Below are what we set by default. May only work with SUN JVM.
# For more on why as well as other possible settings,
# see http://wiki.apache.org/hadoop/PerformanceTuning
export HBASE_OPTS="-XX:+UseConcMarkSweepGC"
Copy the following 4 files from hbase-indexer-1.5-cdh5.4.1/lib into hbase-1.0.0-cdh5.4.1/lib/:
hbase-sep-api-1.5-cdh5.4.1.jar
hbase-sep-impl-1.5-hbase1.0-cdh5.4.1.jar
hbase-sep-impl-common-1.5-cdh5.4.1.jar
hbase-sep-tools-1.5-cdh5.4.1.jar
Change hbase-1.0.0-cdh5.4.1/conf/regionservers to contain the following:
node2
node3
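Then copy the hbase-1.0.0-cdh5.4.1 directory to node2 and node3.
IV. Deploying Solr
Simply unpack the Solr tarball (solr-4.10.3-cdh5.4.1) on node1.
V. Running and testing
1. Start HBase
On node1, run: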
./hbase-1.0.0-cdh5.4.1/bin/start-hbase.sh
2. Start hbase-indexer
On both node2 and node3, run:
./hbase-indexer-1.5-cdh5.4.1/bin/hbase-indexer server
To run it in the background, use screen or nohup.
3. Start Solr
On node1, go to the example subdirectory of the Solr directory and run:
java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkHost=node1:2181,node2:2181,node3:2181/solr -jar start.jar
Likewise, to run it in the background, use screen or nohup.
Open http://node1:8983/solr/#/ to reach the Solr home page.
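The zkHost value is easy to get wrong. A ZooKeeper connection string is a comma-separated list of host:port pairs, and an optional chroot path such as /solr is appended once, after the last host, not after each one. The sketch below assembles such a string; ZkConnectString and build are hypothetical names used only for this illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only.
class ZkConnectString {

    /**
     * Joins hosts as host:port pairs separated by commas and appends the
     * chroot path (if any) exactly once at the end.
     */
    static String build(List<String> hosts, int port, String chroot) {
        StringBuilder sb = new StringBuilder();
        for (String host : hosts) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(host).append(':').append(port);
        }
        if (chroot != null && !chroot.isEmpty()) {
            // Accept both "solr" and "/solr" as input.
            sb.append(chroot.startsWith("/") ? chroot : "/" + chroot);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints node1:2181,node2:2181,node3:2181/solr
        System.out.println(build(Arrays.asList("node1", "node2", "node3"), 2181, "/solr"));
    }
}
```

The same string is what the solr.zk parameter of hbase-indexer expects, so the value given to Solr here and the one given to the indexer later should match.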
VI. Data indexing test
Once the Hadoop cluster, HBase, hbase-indexer, and Solr are all running, first create a table in HBase.
From the HBase installation directory on any node, run:
./bin/hbase shell
create 'indexdemo-user', { NAME => 'info', REPLICATION_SCOPE => '1' }
On a node where hbase-indexer is deployed, go to the hbase-indexer directory and create an indexer from the configuration file shipped in its demo directory:
./bin/hbase-indexer add-indexer -n myindexer -c demo/user_indexer.xml -cp solr.zk=node1:2181,node2:2181,node3:2181/solr -cp solr.collection=collection1
Edit the field definition file under hbase-indexer-1.5-cdh5.4.1/demo/:
Save it as indexdemo-indexer.xml.
Add the indexer instance.
From hbase-indexer-1.5-cdh5.4.1/demo, run:
../bin/hbase-indexer add-indexer -n myindexer -c indexdemo-indexer.xml -cp \
solr.zk=node1:2181,node2:2181,node3:2181/solr -cp solr.collection=collection1 -z node1,node2,node3
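Next, prepare some test data. The project requires index testing on more than ten million records, so typing rows in by hand is not realistic. The HBase shell can also execute a batch of commands stored in a text file, but that still falls short at this scale, so in the end I wrote a Java program to insert records in bulk.
Create a new Java project in Eclipse and import everything under lib in the HBase deployment directory. The source code is as follows: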
package com.hbasetest.hbtest;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DataInput {
    // Column family created with REPLICATION_SCOPE => '1' in the HBase shell
    private static final byte[] FAMILY = Bytes.toBytes("info");
    private static Configuration configuration;

    static {
        configuration = HBaseConfiguration.create();
        configuration.set("hbase.zookeeper.property.clientPort", "2181");
        configuration.set("hbase.zookeeper.quorum", "node1,node2,node3");
    }

    public static void main(String[] args) {
        // try-with-resources closes the table even if the put fails
        try (HTable table = new HTable(configuration, "indexdemo-user")) {
            List<Put> putList = new ArrayList<Put>();
            // Row keys are the decimal strings "0" through "14000000" inclusive
            for (int i = 0; i <= 14000000; i++) {
                Put put = new Put(Bytes.toBytes(Integer.toString(i)));
                put.add(FAMILY, Bytes.toBytes("firstname"), Bytes.toBytes("Java.value.firstname" + i));
                put.add(FAMILY, Bytes.toBytes("lastname"), Bytes.toBytes("Java.value.lastname" + i));
                putList.add(put);
            }
            // Nothing is sent to HBase until this single batched call
            table.put(putList);
            System.out.println("put successfully! rows: " + putList.size());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
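This code collects every Put in one list and sends them in a single batch. If the machine running it does not have enough memory, divide the work and use several smaller putLists instead.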
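Splitting the work into several smaller lists can be sketched as a small buffer that hands over a batch whenever it fills up, so memory use is bounded by the batch size instead of the total row count. The version below is a self-contained illustration with no HBase dependency. ChunkedBatcher and its Sink interface are hypothetical names invented here; in the program above the sink would simply call table.put(batch) on a List of Put objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the HBase client API.
class ChunkedBatcher<T> {

    /** Receives each full (or final partial) batch, e.g. by calling table.put(batch). */
    interface Sink<T> {
        void write(List<T> batch);
    }

    private final int batchSize;
    private final Sink<T> sink;
    private final List<T> buffer;
    private int flushes = 0;

    ChunkedBatcher(int batchSize, Sink<T> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
        this.buffer = new ArrayList<T>(batchSize);
    }

    /** Buffers one item and flushes automatically once the batch is full. */
    void add(T item) {
        buffer.add(item);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Hands the buffered items to the sink; call once more after the last add(). */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.write(new ArrayList<T>(buffer)); // copy, so the sink may keep the list
        buffer.clear();
        flushes++;
    }

    int flushCount() {
        return flushes;
    }

    public static void main(String[] args) {
        ChunkedBatcher<Integer> batcher = new ChunkedBatcher<Integer>(10, new Sink<Integer>() {
            public void write(List<Integer> batch) {
                System.out.println("writing " + batch.size() + " items");
            }
        });
        for (int i = 0; i < 25; i++) {
            batcher.add(i);
        }
        batcher.flush(); // writes the remaining 5 items
        System.out.println("flushes: " + batcher.flushCount()); // prints flushes: 3
    }
}
```

The batch size is a trade-off: larger batches mean fewer round trips but more memory held at once, so the right value depends on the heap available to the client.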
The remaining search tests are straightforward and are not covered here.