Supplement:
So before configuring the HBase service, make sure the HDFS and Zookeeper services are already up and running.
2. HBase Configuration
1) hbase-env.sh
export JAVA_HOME=……
export HBASE_MANAGES_ZK=false   # disable the bundled Zookeeper
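2) hbase-site.xml (not shown above) — a minimal sketch of the properties to set inside <configuration>; the hbase.rootdir value is an assumption, and with HDFS HA it should point at the nameservice URI instead:
<property>
    <name>hbase.rootdir</name>
    <value>hdfs://bigdata-pro01.ynh.com:8020/hbase</value>
</property>
<property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
</property>
<property>
    <name>hbase.zookeeper.quorum</name>
    <value>bigdata-pro01.ynh.com,bigdata-pro02.ynh.com,bigdata-pro03.ynh.com</value>
</property>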
bin/hbase-daemon.sh start master
bin/hbase-daemon.sh start regionserver
Connect to the HBase shell:
bin/hbase shell
Create a table:
create 'tableName','familyName'
Insert data:
put 'table1','nicole','info:username','ynh'
Scan a table:
scan 'tableName'
Drop a table:
Before you can drop a table you must first disable it.
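For example, for the table used in the put example above:
disable 'table1'
drop 'table1'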
Add conf/backup-masters
Start HBase
Check the web UI on port 60010
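conf/backup-masters simply lists the hosts that should run a standby HMaster, one hostname per line. A hypothetical example, assuming machine 2 acts as the backup master:
bigdata-pro02.ynh.com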
Note:
1. Extract the installation package
2. Configure server.properties
3. Configure producer.properties
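A minimal sketch of the key server.properties entries for one broker (broker.id must be unique per machine; the log.dirs path is a hypothetical install location pointing at the kafka-logs directory created later in this series):
# unique id per broker: 0 / 1 / 2 on the three machines
broker.id=0
listeners=PLAINTEXT://bigdata-pro01.ynh.com:9092
log.dirs=/opt/modules/kafka/kafka-logs
zookeeper.connect=bigdata-pro01.ynh.com:2181,bigdata-pro02.ynh.com:2181,bigdata-pro03.ynh.com:2181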
Steps:
1. Start Zookeeper:
bin/zkServer.sh start
2. Start Kafka:
bin/kafka-server-start.sh config/server.properties
3. Create a topic:
bin/kafka-topics.sh --create --zookeeper bigdata-pro01.ynh.com:2181,bigdata-pro02.ynh.com:2181,bigdata-pro03.ynh.com:2181 --replication-factor 1 --partitions 1 --topic protest
4. View the created topic (check it in Zookeeper, since broker topics are managed by Zookeeper):
Start the Zookeeper client:
bin/zkCli.sh
Look under /brokers/topics
5. Have the producer produce a message:
bin/kafka-console-producer.sh --broker-list bigdata-pro01.ynh.com:9092,bigdata-pro02.ynh.com:9092,bigdata-pro03.ynh.com:9092 --topic protest
6. Start a consumer:
bin/kafka-console-consumer.sh --zookeeper bigdata-pro01.ynh.com:2181,bigdata-pro02.ynh.com:2181,bigdata-pro03.ynh.com:2181 --from-beginning --topic protest
Actual test:
Machine 2 publishes messages as the producer:
At the same time, start a consumer on machine 2 (in a cloned session) to receive the messages in real time.
Note:
flume-env.sh:
flume-conf.properties:
flume-env.sh:
flume-conf.properties:
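Assuming the two pairs of files above refer to the log-collection nodes (machines 2 and 3), a minimal sketch of their flume-conf.properties: an exec source tailing the local log, a memory channel, and an avro sink pointing at the aggregation node. The agent name agent2 matches the start script shown later (machine 3 would use its own agent name), and the avro port 5555 is an assumption; the tailed path is the file written by weblog-shell.sh later in this series.
agent2.sources = r1
agent2.channels = c1
agent2.sinks = k1
agent2.sources.r1.type = exec
agent2.sources.r1.command = tail -F /opt/datas/weblog-flume.log
agent2.sources.r1.channels = c1
agent2.channels.c1.type = memory
agent2.channels.c1.capacity = 10000
agent2.channels.c1.transactionCapacity = 10000
agent2.sinks.k1.type = avro
agent2.sinks.k1.hostname = bigdata-pro01.ynh.com
agent2.sinks.k1.port = 5555
agent2.sinks.k1.channel = c1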
Data source: Sogou Labs
Note: for details, refer to the Flume documentation on configuring sources, channels, and sinks when integrating with other frameworks.
Configuration of the Flume data aggregation node (machine 1):
1. flume-env.sh:
2. flume-conf.properties:
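A minimal sketch of what this aggregation-node flume-conf.properties could look like: an avro source fed by the collection nodes, fanning out to both a Kafka sink and an AsyncHBase sink. The agent name, port, and channel names are assumptions; the topic and table refer to the weblogs topic and HBase table created later in this series, and the serializer is the custom class described below.
agent1.sources = r1
agent1.channels = kafkaC hbaseC
agent1.sinks = kafkaSink hbaseSink
agent1.sources.r1.type = avro
agent1.sources.r1.bind = bigdata-pro01.ynh.com
agent1.sources.r1.port = 5555
agent1.sources.r1.channels = kafkaC hbaseC
agent1.channels.kafkaC.type = memory
agent1.channels.kafkaC.capacity = 10000
agent1.channels.hbaseC.type = memory
agent1.channels.hbaseC.capacity = 10000
agent1.sinks.kafkaSink.type = org.apache.flume.sink.kafka.KafkaSink
agent1.sinks.kafkaSink.kafka.bootstrap.servers = bigdata-pro01.ynh.com:9092,bigdata-pro02.ynh.com:9092,bigdata-pro03.ynh.com:9092
agent1.sinks.kafkaSink.kafka.topic = weblogs
agent1.sinks.kafkaSink.channel = kafkaC
agent1.sinks.hbaseSink.type = asynchbase
agent1.sinks.hbaseSink.table = weblogs
agent1.sinks.hbaseSink.columnFamily = info
agent1.sinks.hbaseSink.serializer = org.apache.flume.sink.hbase.MyAsyncHbaseEventSerializer
agent1.sinks.hbaseSink.serializer.payloadColumn = datetime,userid,searchname,retorder,cliorder,cliurl
agent1.sinks.hbaseSink.channel = hbaseC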
1. Replace the tabs in the downloaded log file with commas:
cat weblogs.log | tr "\t" "," > weblog2.log
cat weblog2.log | tr " " "," > weblog3.log
2. Then distribute the data to machines 2 and 3.
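For example, assuming the processed file sits in /opt/datas and the collection nodes expect /opt/datas/weblog.log (the path read by weblog-shell.sh further below):
scp /opt/datas/weblog3.log bigdata-pro02.ynh.com:/opt/datas/weblog.log
scp /opt/datas/weblog3.log bigdata-pro03.ynh.com:/opt/datas/weblog.log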
Create a custom serializer class (modeled on the bundled AsyncHbaseEventSerializer class):
package org.apache.flume.sink.hbase;

import com.google.common.base.Charsets;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.FlumeException;
import org.apache.flume.conf.ComponentConfiguration;
import org.apache.flume.sink.hbase.SimpleHbaseEventSerializer.KeyType;
import org.hbase.async.AtomicIncrementRequest;
import org.hbase.async.PutRequest;

import java.util.ArrayList;
import java.util.List;

public class MyAsyncHbaseEventSerializer implements AsyncHbaseEventSerializer {
    private byte[] table;
    private byte[] cf;
    private byte[] payload;         // the column values (the event body, one log line)
    private byte[] payloadColumn;   // the comma-separated column names
    private byte[] incrementColumn;
    private String rowPrefix;
    private byte[] incrementRow;
    private KeyType keyType;

    @Override
    public void initialize(byte[] table, byte[] cf) {
        this.table = table; // table name
        this.cf = cf;       // column family
    }

    @Override
    public List<PutRequest> getActions() {
        // Flume calls getActions once for every event (one line of data) it fetches.
        List<PutRequest> actions = new ArrayList<PutRequest>();
        if (payloadColumn != null) {
            byte[] rowKey;
            try {
                String[] columns = new String(this.payloadColumn).split(",");
                // the field values of the current line
                String[] values = new String(this.payload).split(",");
                // build a custom rowKey from the first two fields
                String datetime = String.valueOf(values[0]);
                String userid = String.valueOf(values[1]);
                rowKey = SimpleRowKeyGenerator.getMyRowKey(userid, datetime);
                if (columns.length == values.length) {
                    // store the line column by column
                    for (int i = 0; i < columns.length; i++) {
                        byte[] colColumn = columns[i].getBytes();
                        byte[] colValue = values[i].getBytes("UTF8");
                        PutRequest putRequest = new PutRequest(table, rowKey, cf,
                                colColumn, colValue);
                        actions.add(putRequest);
                    }
                }
            } catch (Exception e) {
                throw new FlumeException("Could not get row key!", e);
            }
        }
        return actions;
    }

    @Override
    public List<AtomicIncrementRequest> getIncrements() {
        List<AtomicIncrementRequest> actions = new ArrayList<AtomicIncrementRequest>();
        if (incrementColumn != null) {
            AtomicIncrementRequest inc = new AtomicIncrementRequest(table,
                    incrementRow, cf, incrementColumn);
            actions.add(inc);
        }
        return actions;
    }

    @Override
    public void cleanUp() {
        // nothing to clean up
    }

    @Override
    public void configure(Context context) {
        String pCol = context.getString("payloadColumn", "pCol");
        String iCol = context.getString("incrementColumn", "iCol");
        rowPrefix = context.getString("rowPrefix", "default");
        String suffix = context.getString("suffix", "uuid");
        if (pCol != null && !pCol.isEmpty()) {
            if (suffix.equals("timestamp")) {
                keyType = KeyType.TS;
            } else if (suffix.equals("random")) {
                keyType = KeyType.RANDOM;
            } else if (suffix.equals("nano")) {
                keyType = KeyType.TSNANO;
            } else if (suffix.equals("uuid")) {
                keyType = KeyType.UUID;
            } else {
                // MYKEY is assumed to be a value added to the KeyType enum for this project
                keyType = KeyType.MYKEY;
            }
            payloadColumn = pCol.getBytes(Charsets.UTF_8);
        }
        if (iCol != null && !iCol.isEmpty()) {
            incrementColumn = iCol.getBytes(Charsets.UTF_8);
        }
        incrementRow = context.getString("incrementRow", "incRow").getBytes(Charsets.UTF_8);
    }

    @Override
    public void setEvent(Event event) {
        this.payload = event.getBody();
    }

    @Override
    public void configure(ComponentConfiguration conf) {
        // not used
    }
}
Add the following method to the SimpleRowKeyGenerator class:
public static byte[] getMyRowKey(String userid, String datetime) throws UnsupportedEncodingException {
    return (userid + "-" + datetime + "-" + String.valueOf(System.currentTimeMillis())).getBytes("UTF8");
}
Supplement: problems when building Apache Flume NG 1.7.0 with Maven:
1. The Maven dependency org.apache.flume:flume-ng-sinks:1.7.0 fails to resolve and cannot be imported.
Solution: add the Aliyun mirror to Maven's settings.xml.
2. The imports
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.FlumeException;
import org.apache.flume.conf.ComponentConfiguration;
cannot be resolved.
Solution: add the jars locally from the lib directory of the Flume distribution downloaded from the official site.
For the complete flume-conf.properties, refer to the earlier subsection "Configuration of the Flume data aggregation node (machine 1)".
Note:
package main.java;
import java.io.*;
/**
* Created by niccoleynh on 2019/2/1.
*/
public class LogReaderWriter {
    static String readFileName;
    static String writeFileName;

    public static void main(String args[]) {
        readFileName = args[0];
        writeFileName = args[1];
        try {
            readFileByLines(readFileName);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void readFileByLines(String fileName) {
        FileInputStream fis = null;
        InputStreamReader isr = null;
        BufferedReader br = null;
        String tempString = null;
        try {
            System.out.println("Reading the file line by line, one whole line at a time:");
            // FileInputStream reads bytes from a file in the file system
            fis = new FileInputStream(fileName);
            isr = new InputStreamReader(fis, "GBK");
            br = new BufferedReader(isr);
            int count = 0;
            while ((tempString = br.readLine()) != null) {
                count++;
                // pause briefly between lines to simulate a real-time log stream
                Thread.sleep(300);
                System.out.println("row:" + count + ">>>>>>>>" + tempString);
                method1(writeFileName, tempString);
            }
            br.close();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        } finally {
            if (isr != null) {
                try {
                    isr.close();
                } catch (IOException e1) {
                }
            }
        }
    }

    // append one line of content to the output file
    public static void method1(String file, String content) {
        BufferedWriter out = null;
        try {
            out = new BufferedWriter(new OutputStreamWriter(
                    new FileOutputStream(file, true)));
            out.write("\n");
            out.write(content);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (out != null) {
                try {
                    out.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
chmod u+x XXX.jar
Configure the input and output file paths locally.
Write the log-reading launch script weblog-shell.sh:
#!/bin/bash
echo "start log......"
java -jar /opt/jars/LogReaderWriter.jar /opt/datas/weblog.log /opt/datas/weblog-flume.log
1. Write a start script like the following on each Flume node
flume-ynh-start.sh
#!/bin/bash
echo "flume-2 start......"
bin/flume-ng agent --conf conf -f conf/flume-conf.properties -n agent2 -Dflume.root.logger=INFO,console
Used to test whether the data produced by Flume is actually received:
#!/bin/bash
echo "consumer test……"
bin/kafka-console-consumer.sh --zookeeper bigdata-pro01.ynh.com:2181,bigdata-pro02.ynh.com:2181,bigdata-pro03.ynh.com:2181 --from-beginning --topic protest
Note: in this case study, if your machines are not powerful enough, HA can be skipped.
Steps:
1. Start Zookeeper
2. Start the JournalNodes
3. Start the HDFS service
4. Start the DFSZKFailoverController on each NameNode machine; the NameNode on whichever machine you start it first becomes the active NameNode.
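A sketch of the corresponding commands (assuming a Hadoop 2.x layout; run each from the Zookeeper or Hadoop install directory on the appropriate machines):
bin/zkServer.sh start                      # on every Zookeeper node
sbin/hadoop-daemon.sh start journalnode    # on every JournalNode machine
sbin/start-dfs.sh                          # start the HDFS daemons
sbin/hadoop-daemon.sh start zkfc           # on each NameNode machine; start the intended active side first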
Note:
If errors occur, refer to the earlier chapter "HDFS HA service startup and automatic failover test" to synchronize the NameNode metadata and initialize ZK.
Steps:
1. Start the HBase cluster
bin/start-hbase.sh
2. Check the startup status on the web UI (port 60010)
3. Check that the shell works:
bin/hbase shell
Note:
Steps:
1. Start Kafka:
bin/kafka-server-start.sh config/server.properties
2. Tidy up the topics
Enter zkCli:
bin/zkCli.sh
ls /brokers/topics
Delete the unneeded topics:
rmr /brokers/topics/XXX
1. Create the business topic:
bin/kafka-topics.sh --create --zookeeper bigdata-pro01.ynh.com:2181,bigdata-pro02.ynh.com:2181,bigdata-pro03.ynh.com:2181 --replication-factor 2 --partitions 1 --topic weblogs
2. Check in zkCli.sh:
ls /brokers/topics
1. Enter the HBase shell
2. Create the business data table:
create 'weblogs','info'
Note (see the installation notes on the official website): Flume 1.7.0 must be paired with Kafka 0.9.x or later.
1. Create the Kafka log data directory:
mkdir kafka-logs
2. Start Kafka on all three machines
3. Start the Flume log-collection node services
4. Start the Flume log aggregation and distribution node service
5. Start weblog-shell.sh
6. Start a consumer and check whether data arrives
7. Check whether HBase has received the data
Enter the HBase shell:
count 'weblogs'
scan 'weblogs'
Steps:
1. Log into the virtual machine of machine 1
1) Connect to the external network: switch the NIC to obtain an IP automatically
2) Switch to the root user
3) Clear the yum cache (including downloaded packages and headers):
yum clean all
4) Download and install the MySQL server:
yum install mysql-server
5) Check the MySQL service status:
service mysqld status
6) Start the MySQL service:
sudo service mysqld start
mysql -uroot -p123456
MySQL commands:
show databases;
use test;
show tables;
1. What is Hive?
Key points
Steps:
Download version 0.13.1 from the official site
Extract it on machine 3 (machines 1 and 2 run the HA services, so this relieves their load)
Configure hive-env.sh:
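A minimal sketch of hive-env.sh (the install paths below are hypothetical; point them at your own Hadoop and Hive directories):
export HADOOP_HOME=/opt/modules/hadoop-2.x.x
export HIVE_CONF_DIR=/opt/modules/hive-0.13.1/conf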
Start the HDFS service
Create the Hive directories on HDFS:
bin/hdfs dfs -mkdir -p /user/hive/warehouse
bin/hdfs dfs -chmod g+w /user/hive/warehouse
bin/hive
show databases;
use databaseName;
show tables;
Create a new hive-site.xml file under hive/conf. This is where we configure the storage of Hive metadata in MySQL (similar to a JDBC connection configuration).
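A minimal sketch of the metastore-related properties, assuming MySQL runs on machine 1 with the root/123456 account set up earlier; the metastore database name is a hypothetical choice, and createDatabaseIfNotExist lets it be created on first use:
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://bigdata-pro01.ynh.com:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
</property>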
Set up the MySQL user connection:
Enter MySQL and switch to the mysql database (use mysql;)
Query the user information:
select host,user,password from user;
Update the user information (allow root to connect from any host):
update user set host='%' where user='root' and host='localhost';
Refresh the privileges:
flush privileges;
Copy the driver jar mysql-connector-java-5.1.27-bin.jar into Hive's lib directory.
Make sure machine 3 can SSH to the other machines without a password (see the earlier chapter on setting up passwordless login between machines).
Steps:
1. Start HDFS and YARN
2. Start Hive:
bin/hive
3. Create a table in Hive:
create table test(id int, name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
4. Create the data file test.txt
5. Load the data file into the Hive table:
load data local inpath '/opt/datas/test.txt' into table test;
Note:
Steps:
1. Configure Zookeeper in hive-site.xml; Hive uses this setting to connect to HBase.
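A sketch of that property, added inside <configuration> in hive-site.xml (the hostnames are this cluster's Zookeeper quorum):
<property>
    <name>hbase.zookeeper.quorum</name>
    <value>bigdata-pro01.ynh.com,bigdata-pro02.ynh.com,bigdata-pro03.ynh.com</value>
</property>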
2. Link the following 9 HBase jars into hive/lib. If HBase and Hive are both CDH downloads of the same version, this is unnecessary, because CDH already ships them integrated.
ln -s $HBASE_HOME/lib/hbase-server-0.98.6-cdh5.3.0.jar $HIVE_HOME/lib/hbase-server-0.98.6-cdh5.3.0.jar
ln -s $HBASE_HOME/lib/hbase-client-0.98.6-cdh5.3.0.jar $HIVE_HOME/lib/hbase-client-0.98.6-cdh5.3.0.jar
ln -s $HBASE_HOME/lib/hbase-protocol-0.98.6-cdh5.3.0.jar $HIVE_HOME/lib/hbase-protocol-0.98.6-cdh5.3.0.jar
ln -s $HBASE_HOME/lib/hbase-it-0.98.6-cdh5.3.0.jar $HIVE_HOME/lib/hbase-it-0.98.6-cdh5.3.0.jar
ln -s $HBASE_HOME/lib/htrace-core-2.04.jar $HIVE_HOME/lib/htrace-core-2.04.jar
ln -s $HBASE_HOME/lib/hbase-hadoop2-compat-0.98.6-cdh5.3.0.jar $HIVE_HOME/lib/hbase-hadoop2-compat-0.98.6-cdh5.3.0.jar
ln -s $HBASE_HOME/lib/hbase-hadoop-compat-0.98.6-cdh5.3.0.jar $HIVE_HOME/lib/hbase-hadoop-compat-0.98.6-cdh5.3.0.jar
ln -s $HBASE_HOME/lib/high-scale-lib-1.1.1.jar $HIVE_HOME/lib/high-scale-lib-1.1.1.jar
ln -s $HBASE_HOME/lib/hbase-common-0.98.6-cdh5.3.0.jar $HIVE_HOME/lib/hbase-common-0.98.6-cdh5.3.0.jar
3. Create the Hive external table integrated with HBase:
CREATE EXTERNAL TABLE weblogs (
id STRING,
datetime STRING,
userid STRING,
searchname STRING,
retorder STRING,
cliorder STRING,
cliurl STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:datetime,info:userid,info:searchname,info:retorder,info:cliorder,info:cliurl")
TBLPROPERTIES("hbase.table.name" = "weblogs");
Results:
1) Start the HDFS cluster service
2) Start the YARN cluster service
3) Start the Zookeeper cluster service
4) Start the HBase cluster service
5) Start the Hive service
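For reference, a quick recap of the corresponding start commands used earlier in this series (run from each framework's install directory):
sbin/start-dfs.sh
sbin/start-yarn.sh
bin/zkServer.sh start      # on every Zookeeper node
bin/start-hbase.sh
bin/hive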
Supplement:
Note:
This series of posts is my notes from an online course. I hope they can be of some help to fellow beginners. If I have misunderstood anything, please point it out. Learning is, after all, a process of constantly correcting mistakes and growing.