Notes on pitfalls when importing Hive data into Elasticsearch

Environment:

CDH 5.16.2 (Hive 1.1.0), ES 6.7.2

1. The pinyin analyzer

    org.elasticsearch.hadoop.rest.EsHadoopRemoteException: illegal_argument_exception: startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=0,endOffset=10,lastStartOffset=9 for field 'enterprise_name.pinyin' (the failing field here is the pinyin sub-field)

I searched around a lot online; the genuinely useful reference turned out to be this commit:

https://github.com/medcl/elasticsearch-analysis-pinyin/pull/206/commits/7cbc3d8926c8549b1049b90e90fce415097990be

Following that commit, I patched the pinyin analyzer source and rebuilt it with Maven, renamed the resulting elasticsearch-analysis-pinyin-6.3.0.jar to elasticsearch-analysis-pinyin-6.7.2.jar, opened the plugin's zip package and swapped the new jar in, then on the production cluster removed the old pinyin plugin, installed the new zip, and restarted ES. That fixed it.
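The rebuild-and-swap procedure above can be sketched as a handful of shell commands. This is only a sketch: every path, the source checkout location, and the `elasticsearch-plugin` tool location are assumptions for illustration, not exact values from my cluster.

```shell
# Sketch only -- paths and versions are assumptions; adjust to your setup.
cd elasticsearch-analysis-pinyin                     # patched plugin source checkout
mvn clean package -DskipTests                        # rebuild the fixed analyzer jar

# Rename the built jar to match the target ES version
cp target/elasticsearch-analysis-pinyin-6.3.0.jar elasticsearch-analysis-pinyin-6.7.2.jar

# Swap the new jar into the release zip
zip -d elasticsearch-analysis-pinyin-6.7.2.zip elasticsearch-analysis-pinyin-6.3.0.jar
zip -j elasticsearch-analysis-pinyin-6.7.2.zip elasticsearch-analysis-pinyin-6.7.2.jar

# On each node: remove the old plugin, install the patched zip, then restart ES
elasticsearch-plugin remove analysis-pinyin
elasticsearch-plugin install file:///root/work/elasticsearch-analysis-pinyin-6.7.2.zip
```

Remember the plugin has to be reinstalled on every node in the cluster before the rolling restart.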

2. The IK analyzer

org.elasticsearch.hadoop.rest.EsHadoopRemoteException: illegal_argument_exception: startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=6,endOffset=7,lastStartOffset=7 for field 'enterprise_name' (the failing field here is enterprise_name, which uses the ik_max_word analyzer)

The workaround: empty out the IK custom dictionary extra_new_word.dic first (move it somewhere else), run the import normally, and move the dictionary back once the data is loaded.
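That dictionary shuffle is just two moves. A minimal sketch, assuming `IK_CONFIG` points at your IK plugin's config directory (the path below is illustrative; whether ES needs a restart for IK to pick up the change depends on your dictionary-loading configuration):

```shell
# Sketch only -- IK_CONFIG is an assumption; point it at your IK config dir.
IK_CONFIG=/usr/share/elasticsearch/plugins/ik/config

# 1) Park the custom dictionary elsewhere so IK loads without it
mv "$IK_CONFIG/extra_new_word.dic" /tmp/extra_new_word.dic

# 2) ... run the Hive INSERT OVERWRITE that loads the data into ES ...

# 3) Restore the dictionary afterwards
mv /tmp/extra_new_word.dic "$IK_CONFIG/extra_new_word.dic"
```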

Importing data from Hive into ES needs one extra jar: add jar /root/work/elasticsearch-hadoop-6.7.2.jar;

Mapping fragment:

PUT /enterprise_credit_index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1,
    "index": {
      "analysis": {
        "analyzer": {
          "pinyin_analyzer": {
            "tokenizer": "my_pinyin"
          }
        },
        "tokenizer": {
          "my_pinyin": {
            "type" : "pinyin",
            "keep_first_letter":true,
            "keep_separate_first_letter" : true,
            "keep_full_pinyin" : true,
            "keep_original" : true,
            "limit_first_letter_length" : 20,
            "lowercase" : true
          }
        }
      }
    }
  },
  "mappings": {
    "enterprise_credit_type": {
      "properties": {
        "enterprise_name": {
          "type": "text",
          "index": true,
          "analyzer": "ik_max_word",
          "fields": {
            "pinyin": {
              "type": "text",
              "store": false,
              "term_vector": "with_offsets",
              "analyzer": "pinyin_analyzer",
              "boost": 10
            }
          }
        },
        "operators": {
          "type": "keyword",
          "index": true,
          "fields": {
            "pinyin": {
              "type": "text",
              "store": false,
              "analyzer": "pinyin_analyzer",
              "boost": 10
            }
          }
        },
        "registered_money":{
          "type": "keyword",
          "ignore_above": 256
        },

...
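With the index created, the custom pinyin analyzer can be spot-checked through the _analyze API before loading any data (the sample text is purely illustrative):

```
GET /enterprise_credit_index/_analyze
{
  "analyzer": "pinyin_analyzer",
  "text": "刘德华"
}
```

With keep_full_pinyin, keep_first_letter and keep_original all enabled in the tokenizer above, the response should contain the full pinyin syllables, the first-letter abbreviation, and the original text as separate tokens.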

Create the external table in Hive:

create external table APP_json_result_external(
enterprise_name         string,
operators               string,

...

)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = '/enterprise_credit_index/enterprise_credit_type',
'es.mapping.id' = 'unified_social_credit_code',
'es.nodes'='10.10.10.10:9200,10.10.10.10:9201,10.10.10.10:9202',
'es.nodes.wan.only'='true',
'es.index.auto.create' = 'false',
'es.write.operation' = 'upsert',
'es.batch.write.refresh'='true',
'es.index.read.missing.as.empty'='false');

Import the data:

insert overwrite table APP_json_result_external select XXX, XXX, XXX from tableName;
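Once the import succeeds, a quick multi_match query across the IK-analyzed field and its pinyin sub-field confirms that both analyzers indexed as expected (the query string here is illustrative):

```
GET /enterprise_credit_index/_search
{
  "query": {
    "multi_match": {
      "query": "ldh",
      "fields": ["enterprise_name", "enterprise_name.pinyin"]
    }
  }
}
```

Because the pinyin sub-field carries boost: 10 in the mapping, pinyin matches will score correspondingly higher.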

That's all, have fun!

 
