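Importing Elasticsearch Data into Hive: A Walkthrough

This document walks through one case of importing Elasticsearch data into Hive. Readers can adapt the steps to their own environment.

The example uses the test department's ES host 192.xxx.x.128 and imports its index data into a local Hive.

I. Preparation:

Optionally, list the indices on the ES server first to get a sense of how many targets there are and how large they are (this step can be skipped):
curl -X GET 'http://192.xxx.x.128:9200/_cat/indices?v'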
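When the cluster holds many indices, the full listing can be hard to scan. The sketch below narrows it to the index name, document count and store size, sorted by size. The function name es_indices_url and the 'logstash-*' pattern are illustrative assumptions; h, s and v are standard _cat query parameters.

```shell
# Build a _cat/indices URL that lists only name, document count and size,
# sorted by size (largest first). Nothing is sent over the network here.
es_indices_url() {
  host=$1      # host:port of the ES node, e.g. 192.xxx.x.128:9200
  pattern=$2   # index name pattern, e.g. 'logstash-*'
  printf 'http://%s/_cat/indices/%s?v&h=index,docs.count,store.size&s=store.size:desc\n' "$host" "$pattern"
}

# Usage (needs network access to the cluster):
#   curl -s -X GET "$(es_indices_url 192.xxx.x.128:9200 'logstash-*')"
```

Keeping the URL construction separate from the curl call makes it easy to print and review the request before running it.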

Start the Hive shell, specifying the pre-installed elasticsearch-hadoop Hive connector jar at startup (there is more than one way to do this):
hive -hiveconf hive.aux.jars.path=file:///opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/hive/lib/elasticsearch-hadoop-hive-6.5.3.jar
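If the jar path is wrong, the session typically still starts and the problem only surfaces when a table using the storage handler is created. A minimal pre-flight sketch follows; the function name start_hive_with_es_jar is a hypothetical helper, not part of Hive or elasticsearch-hadoop.

```shell
# Hypothetical helper: refuse to start Hive if the connector jar is missing.
start_hive_with_es_jar() {
  jar=$1   # absolute path to the elasticsearch-hadoop-hive jar
  if [ ! -r "$jar" ]; then
    echo "connector jar not readable: $jar" >&2
    return 1
  fi
  # Same startup form as the command above, with the path checked first.
  hive -hiveconf "hive.aux.jars.path=file://$jar"
}

# Usage:
#   start_hive_with_es_jar /opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/hive/lib/elasticsearch-hadoop-hive-6.5.3.jar
```

Inside an already running session, Hive's ADD JAR statement is another way to load the same jar.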

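II. How it works:

The elasticsearch-hadoop package lets a Hive table be mapped onto ES. Once the mapped Hive table exists, inserting into the Hive table writes to the ES "table", and reading from the Hive table reads from the ES "table". Because the two stores differ in structure and data types, the hard part is mapping every field correctly. The notes below explain how the five kinds of special ES fields in these tables are handled:

Notes:

(1) The _id field, like _index, _score and _type, is an ES metadata field. Metadata reading must be enabled for the mapping:
'es.read.metadata' = 'true'
and the mapping declared in this format:
'es.mapping.names' = 'id:_metadata._id,index:_metadata._index,score:_metadata._score,type:_metadata._type'
(2) The prospector.type field belongs to an ES Object. ES stores it internally as "object.property", so the Hive external table can declare the column with the "." removed:
prospectortype string,
and declare the mapping:
'es.mapping.names' = 'prospectortype:prospector.type'
An Object also maps correctly to a Hive struct, so declaring the column as a struct with no mapping works too (though in my view special data types make later processing awkward):
prospector STRUCT<type:string>
(3) The clientId field name contains an uppercase I. Hive lowercases column names, so the lookup against ES uses the lowercased name, finds no such field, and the value comes back empty. Declare the mapping explicitly:
'es.mapping.names' = 'clientId:clientId'
(4) The syslog_message field contains "_", which Hive could not parse in this setup. Drop the "_" and declare the mapping:
'es.mapping.names' = 'syslogmessage:syslog_message'
(5) The tags field is an array and maps to Hive's array type. Declare the column type in the DDL:
tags array<string>,
Impala cannot read this type, so after the import into Hive it needs one more conversion to String; see the table creation statements below.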
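Note that es.mapping.names is a single table property, so the pairs shown separately in (1) to (4) have to end up in one comma-separated value, as the full DDL statements do. A small sketch of assembling that value from individual hive_column:es_field pairs; join_mapping is a hypothetical helper name.

```shell
# Join "hive_column:es_field" pairs into one comma-separated value.
# The body runs in a subshell so the IFS change does not leak out.
join_mapping() (
  IFS=,
  printf '%s\n' "$*"
)

# Usage: build the property value from the pairs discussed above.
#   join_mapping id:_metadata._id prospectortype:prospector.type \
#                clientId:clientId syslogmessage:syslog_message
```

Keeping one pair per argument makes long mappings easier to review and diff than a single long line.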

III. Table creation statements:

1. cslogdata table

Create the mapped external table:

CREATE EXTERNAL TABLE test (
timestamp Timestamp,
dataId string,
dataTag string,
dataTitle string,
hostname string,
mobile bigint,
module string,
uid string,
unick string,
url string,
version string,
id string,
index string,
score string,
type string,
beathostname string,
beatname string,
beatversion string,
browser string,
browserVersion string,
clientId string,
fieldslogformat string,
inputtype string,
ip string,
logData string,
logType string,
offset bigint,
os string,
platform string,
prospectortype string,
referer string,
sessionid string,
source string,
time bigint,
userAgent string,
userData string,
way string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
'es.nodes' = 'http://192.xxx.x.128:9200',
'es.mapping.names' = 'dataId:dataId,dataTag:dataTag,dataTitle:dataTitle,browserVersion:browserVersion,clientId:clientId,logData:logData,logType:logType,userAgent:userAgent,userData:userData,timestamp:@timestamp,version:@version,id:_metadata._id,index:_metadata._index,score:_metadata._score,type:_metadata._type,beathostname:beat.hostname,beatname:beat.name,beatversion:beat.version,hostname:host.name,inputtype:input.type,prospectortype:prospector.type,fieldslogformat:fields.log_format',
'es.resource' = 'logstash-20*.*/doc',
'es.read.metadata' = 'true'
);
Note: 'es.resource' = 'logstash-20*.*/doc' is the ES index/type.

Check whether the data fields come back empty:

select * from test limit 5;

Create a Hive table and store the data in it:

create table cslogdata as select
timestamp ,
dataId ,
dataTag ,
dataTitle ,
hostname ,
mobile ,
module ,
uid ,
unick ,
url ,
version ,
id ,
index ,
score ,
type ,
beathostname ,
beatname ,
beatversion ,
browser ,
browserVersion ,
clientId ,
fieldslogformat ,
inputtype ,
ip ,
logData ,
logType ,
offset ,
os ,
platform ,
prospectortype ,
referer ,
sessionid ,
source ,
time ,
userAgent ,
userData ,
way
from test;

Check:

select * from cslogdata limit 5;
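Looking at five rows shows that the fields are populated, but not that every document arrived. One possible extra check is to compare the document count reported by ES with the row count in Hive. The sketch below assumes the ES _count endpoint and the Hive CLI's -S (silent) and -e options; es_count_from_json is a hypothetical helper name.

```shell
# Pull the number out of an Elasticsearch _count response such as
# {"count":1234,"_shards":{...}} read from standard input.
es_count_from_json() {
  sed -n 's/.*"count":\([0-9][0-9]*\).*/\1/p'
}

# Usage (needs the cluster and a Hive client):
#   es_rows=$(curl -s 'http://192.xxx.x.128:9200/logstash-20*.*/_count' | es_count_from_json)
#   hive_rows=$(hive -S -e 'select count(*) from cslogdata')
#   [ "$es_rows" = "$hive_rows" ] && echo "row counts match" \
#     || echo "mismatch: es=$es_rows hive=$hive_rows"
```

The two numbers can legitimately differ if documents were still being indexed between the import and the check.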

Drop the mapping table:

drop table test;

2. cssyslogdata table

CREATE EXTERNAL TABLE test (
syslogtimestamp string,
timestamp timestamp,
version string,
id string,
index string,
score string,
type string,
beathostname string,
beatname string,
beatversion string,
fieldslogformat string,
hostname string,
inputtype string,
message string,
offset bigint,
prospectortype string,
source string,
sysloghostname string,
syslogmessage string,
syslogpid string,
syslogprogram string,
tags array<string>
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
'es.nodes' = 'http://192.168.3.128:9200',
'es.mapping.names' = 'syslogtimestamp:syslog_timestamp,timestamp:@timestamp,version:@version,id:_metadata._id,index:_metadata._index,score:_metadata._score,type:_metadata._type,beathostname:beat.hostname,beatname:beat.name,beatversion:beat.version,hostname:host.name,inputtype:input.type,fieldslogformat:fields.log_format,prospectortype:prospector.type,sysloghostname:syslog_hostname,syslogmessage:syslog_message,syslogpid:syslog_pid,syslogprogram:syslog_program',
'es.resource' = 'arlog-syslog-2018.*/doc',
'es.read.metadata' = 'true'
);
select * from test limit 5;

create table cssyslogdata as select
syslogtimestamp ,
timestamp ,
version ,
id ,
index ,
score ,
type ,
beathostname ,
beatname ,
beatversion ,
fieldslogformat ,
hostname ,
inputtype ,
message ,
offset ,
prospectortype ,
source ,
sysloghostname ,
syslogmessage ,
syslogpid ,
syslogprogram ,
tags
from test;
select * from cssyslogdata limit 5;
drop table test;

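It later turned out that Impala does not support array columns, so the following statement converts the array to a string and writes the result to table cssyslogdata2 (syntax: concat_ws(',', tags) tags):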

create table cssyslogdata2 as select
syslogtimestamp ,
timestamp ,
version ,
id ,
index ,
score ,
type ,
beathostname ,
beatname ,
beatversion ,
fieldslogformat ,
hostname ,
inputtype ,
message ,
offset ,
prospectortype ,
source ,
sysloghostname ,
syslogmessage ,
syslogpid ,
syslogprogram ,
concat_ws(',', tags) tags
from cssyslogdata;

3. csmonitordata table

CREATE EXTERNAL TABLE test (
hostname string,
httpurl string,
monitorip string,
monitorname string,
monitorstatus string,
tags array<string>,
timestamp timestamp,
version string,
id string,
index string,
score string,
type string,
beathostname string,
beatname string,
beatversion string,
httpresponsestatus bigint,
httprttcontentus bigint,
httprttresponseheaderus bigint,
httprtttotalus bigint,
httprttvalidateus bigint,
httprttwriterequestus bigint,
monitordurationus bigint,
monitorhost string,
monitorid string,
monitorscheme string,
monitortype string,
resolvehost string,
resolveip string,
resolverttus bigint,
tcpport bigint,
tcprttconnectus bigint,
tlsrtthandshakeus bigint,
type2 string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
'es.nodes' = 'http://192.168.3.128:9200',
'es.mapping.names' = 'hostname:host.name,httpurl:http.url,monitorip:monitor.ip,monitorname:monitor.name,monitorstatus:monitor.status,tags:tags,timestamp:@timestamp,version:@version,id:_metadata._id,index:_metadata._index,score:_metadata._score,type:_metadata._type,beathostname:beat.hostname,beatname:beat.name,beatversion:beat.version,httpresponsestatus:http.response.status,httprttcontentus:http.rtt.content.us,httprttresponseheaderus:http.rtt.response_header.us,httprtttotalus:http.rtt.total.us,httprttvalidateus:http.rtt.validate.us,httprttwriterequestus:http.rtt.write_request.us,monitordurationus:monitor.duration.us,monitorhost:monitor.host,monitorid:monitor.id,monitorscheme:monitor.scheme,monitortype:monitor.type,resolvehost:resolve.host,resolveip:resolve.ip,resolverttus:resolve.rtt.us,tcpport:tcp.port,tcprttconnectus:tcp.rtt.connect.us,tlsrtthandshakeus:tls.rtt.handshake.us,type2:type',
'es.resource' = 'arlog-monitor-2018.*/doc',
'es.read.metadata' = 'true'
);
select * from test limit 5;

create table csmonitordata as select
hostname,
httpurl,
monitorip,
monitorname,
monitorstatus,
timestamp,
version,
id,
index,
score,
type,
beathostname,
beatname,
beatversion,
httpresponsestatus,
httprttcontentus,
httprttresponseheaderus,
httprtttotalus,
httprttvalidateus,
httprttwriterequestus,
monitordurationus,
monitorhost,
monitorid,
monitorscheme,
monitortype,
resolvehost,
resolveip,
resolverttus,
tcpport,
tcprttconnectus,
tlsrtthandshakeus,
type2
from test;
select * from csmonitordata limit 5;

 
