Environment:
CDH 5.16.2 (Hive 1.1.0), ES 6.7.2
1. The pinyin analyzer
org.elasticsearch.hadoop.rest.EsHadoopRemoteException: illegal_argument_exception: startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=0,endOffset=10,lastStartOffset=9 for field 'enterprise_name.pinyin' (the error here is on the pinyin sub-field)
I searched around a lot online; this is the one genuinely reliable reference I found:
https://github.com/medcl/elasticsearch-analysis-pinyin/pull/206/commits/7cbc3d8926c8549b1049b90e90fce415097990be
Following that commit, I patched the pinyin analyzer source and rebuilt it with Maven, renaming elasticsearch-analysis-pinyin-6.3.0.jar to elasticsearch-analysis-pinyin-6.7.2.jar. I then opened the plugin zip, swapped in the freshly built jar, removed the old pinyin plugin on the cluster, installed the new zip, restarted ES, and the error was gone.
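The rebuild-and-replace flow above looks roughly like this as shell commands (a sketch only; the paths, the zip/jar names, and the Maven target layout are assumptions from my environment — adjust to yours):

```sh
# Build the patched analyzer source (assumes you cloned the
# elasticsearch-analysis-pinyin repo and applied the fix from the PR)
mvn clean package -DskipTests

# Put the freshly built jar into a copy of the released plugin zip
cp elasticsearch-analysis-pinyin-6.7.2.zip patched-pinyin.zip
zip -d patched-pinyin.zip "elasticsearch-analysis-pinyin-*.jar"     # drop the old jar
zip -j patched-pinyin.zip target/elasticsearch-analysis-pinyin-6.7.2.jar

# On each ES node: remove the old plugin, install the patched zip, then restart ES
./bin/elasticsearch-plugin remove analysis-pinyin
./bin/elasticsearch-plugin install file:///root/work/patched-pinyin.zip
```

These are deployment commands, not a standalone script — run them node by node and restart each node afterwards.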
2. The IK analyzer
org.elasticsearch.hadoop.rest.EsHadoopRemoteException: illegal_argument_exception: startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=6,endOffset=7,lastStartOffset=7 for field 'enterprise_name' (the error here is on this field, which I analyze with ik_max_word)
Workaround: empty the IK custom dictionary extra_new_word.dic (move its entries somewhere else), run the import normally, then move the entries back after the data is loaded.
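Before moving the dictionary back, it can help to see how ik_max_word actually tokenizes a problematic value via the _analyze API (the sample text below is a made-up illustration). Each token in the response carries start_offset/end_offset; the exception above means two adjacent tokens came back with offsets going backwards:

```json
POST /enterprise_credit_index/_analyze
{
  "analyzer": "ik_max_word",
  "text": "某某科技有限公司"
}
```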
Loading data from Hive into ES requires one extra jar:
add jar /root/work/elasticsearch-hadoop-6.7.2.jar;
Mapping fragment:
PUT /enterprise_credit_index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1,
    "index": {
      "analysis": {
        "analyzer": {
          "pinyin_analyzer": {
            "tokenizer": "my_pinyin"
          }
        },
        "tokenizer": {
          "my_pinyin": {
            "type": "pinyin",
            "keep_first_letter": true,
            "keep_separate_first_letter": true,
            "keep_full_pinyin": true,
            "keep_original": true,
            "limit_first_letter_length": 20,
            "lowercase": true
          }
        }
      }
    }
  },
  "mappings": {
    "enterprise_credit_type": {
      "properties": {
        "enterprise_name": {
          "type": "text",
          "index": true,
          "analyzer": "ik_max_word",
          "fields": {
            "pinyin": {
              "type": "text",
              "store": false,
              "term_vector": "with_offsets",
              "analyzer": "pinyin_analyzer",
              "boost": 10
            }
          }
        },
        "operators": {
          "type": "keyword",
          "index": true,
          "fields": {
            "pinyin": {
              "type": "text",
              "store": false,
              "analyzer": "pinyin_analyzer",
              "boost": 10
            }
          }
        },
        "registered_money": {
          "type": "keyword",
          "ignore_above": 256
        },
        ......
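After the index is created, the custom pinyin_analyzer can be sanity-checked the same way (the sample text is illustrative). With the keep_* settings above, the response should include first-letter, full-pinyin, and original-term tokens:

```json
POST /enterprise_credit_index/_analyze
{
  "analyzer": "pinyin_analyzer",
  "text": "刘德华"
}
```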
Create the external table in Hive:
create external table APP_json_result_external(
  enterprise_name string,
  operators string,
  ......
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = '/enterprise_credit_index/enterprise_credit_type',
'es.mapping.id' = 'unified_social_credit_code',
'es.nodes'='10.10.10.10:9200,10.10.10.10:9201,10.10.10.10:9202',
'es.nodes.wan.only'='true',
'es.index.auto.create' = 'false',
'es.write.operation' = 'upsert',
'es.batch.write.refresh'='true',
'es.index.read.missing.as.empty'='false');
Load the data:
insert overwrite table APP_json_result_external select XXX, XXX, XXX from tableName;
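Once the insert job finishes, a quick doc-count check against the index confirms the load (the count should match the row count of the Hive source table):

```json
GET /enterprise_credit_index/_count
```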
That's all, have fun!