Incrementally updating a Hive table into Elasticsearch

Contents

1. Background

2. Overall approach

3. SQL

 (1) Rows to delete incrementally

 (2) Rows to add incrementally

4. Elasticsearch design

 


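1. Background

     The Hive table holds a large volume of business data, on the order of tens of millions to hundreds of millions of rows, and part of that data changes every day. Pushing a full refresh into Elasticsearch every day puts heavy JVM pressure on the ES cluster nodes and hurts the cluster's availability. The data therefore needs to be updated incrementally to reduce the load on the ES cluster.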

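2. Overall approach

     First, define a primary key that changes whenever a row's content changes and stays the same when the content is unchanged (an MD5 hash of the content can serve as such a unique key). Next, use set addition and subtraction in SQL (LEFT/RIGHT JOIN) to find the rows that must be deleted and the rows that must be added each day. Finally, when syncing to ES, add one extra field (the current implementation adds is_valid, see the SQL below; is_valid=0 marks invalid data and is_valid=1 marks valid data). This field is a flag that tells the business whether a document synced to ES may be used.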
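As a minimal sketch of the content-derived key described above, the helper below hashes a row's fields with MD5. The function name make_main_id, the sample field values, and the '|' separator are assumptions for illustration; the original does not say which columns feed the hash. It relies on the md5sum utility from GNU coreutils.

```shell
#!/bin/sh
# Hypothetical helper: derive a content-based key from a row's fields.
# Joining with '|' is an assumption; use any separator that cannot occur in the data.
make_main_id() {
    # "$*" joins all arguments with the first character of IFS.
    # The subshell keeps the IFS change local to this one command.
    ( IFS='|'; printf '%s' "$*" ) | md5sum | cut -d' ' -f1
}

# Same content gives the same key; changed content gives a different key.
make_main_id "100012" "color" "red"
make_main_id "100012" "color" "blue"
```

Because the key depends only on content, an unchanged row keeps its key from one day to the next, and an edited row shows up as one old key that disappears plus one new key that appears.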
 

3. SQL

 (1) Rows to delete incrementally

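#!/bin/sh
# version_now is the newer partition (t); version_pre is the previous one (t-1).
# Note: date -d is GNU date syntax.
version_now=$(date -d"-2 day" +%Y-%m-%d)
version_pre=$(date -d"-3 day" +%Y-%m-%d)

# Rebuild the temporary table of rows to delete on every run.
hive -e "DROP TABLE IF EXISTS app.tmp_xz_jimi3_sku_description_delete"

# Anti-join: keep rows of version_pre whose main_id is absent from version_now,
# flagged with is_valid = 0.
hive -e "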

CREATE TABLE app.tmp_xz_jimi3_sku_description_delete AS
SELECT
	main_id,
	item_sku,
	item_main_sku,
	a.bot_id,
	vender_id,
	category3,
	category3_id,
	entity_type,
	entity_value,
	entity_source,
	brand_code,
	brand_en,
	is_valid,
	version
FROM
	(
		SELECT
			tmp_pre.main_id,
			item_sku,
			item_main_sku,
			bot_id,
			vender_id,
			category3,
			category3_id,
			entity_type,
			entity_value,
			entity_source,
			brand_code,
			brand_en,
			is_valid,
			'${version_now}' AS version
		FROM
			(
				SELECT
					main_id,
					item_sku,
					item_main_sku,
					bot_id,
					vender_id,
					category3,
					category3_id,
					entity_type,
					entity_value,
					entity_source,
					brand_code,
					brand_en,
					0 AS is_valid
				FROM
					app.sku_description_corpus
				WHERE
					dt = '${version_pre}'
			)
			tmp_pre
		LEFT JOIN
			(
				SELECT main_id FROM app.sku_description_corpus WHERE dt = '${version_now}'
			)
			tmp_now
		ON
			tmp_now.main_id = tmp_pre.main_id
		WHERE
			tmp_now.main_id IS NULL
	)
	a

"

 

 (2) Rows to add incrementally

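#!/bin/sh
# version_now is the newer partition (t); version_pre is the previous one (t-1).
# Note: date -d is GNU date syntax.
version_now=$(date -d"-2 day" +%Y-%m-%d)
version_pre=$(date -d"-3 day" +%Y-%m-%d)

# Rebuild the temporary table of rows to add on every run.
hive -e "DROP TABLE IF EXISTS app.tmp_xz_jimi3_sku_description_add"
# Anti-join: keep rows of version_now whose main_id is absent from version_pre,
# flagged with is_valid = 1.
hive -e "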
 
CREATE TABLE app.tmp_xz_jimi3_sku_description_add AS
SELECT
	tmp_now.main_id,
	item_sku,
	item_main_sku,
	bot_id,
	vender_id,
	category3,
	category3_id,
	entity_type,
	entity_value,
	entity_source,
	brand_code,
	brand_en,
	is_valid,
	'${version_now}' as version
FROM
	(
		SELECT
			main_id,
			item_sku,
			item_main_sku,
			bot_id,
			vender_id,
			category3,
			category3_id,
			entity_type,
			entity_value,
			entity_source,
			brand_code,
			brand_en ,
			1 AS is_valid
		FROM
			app.sku_description_corpus
		WHERE
			dt = '${version_now}'
	)
	tmp_now
LEFT JOIN
	(
		SELECT main_id FROM app.sku_description_corpus WHERE dt = '${version_pre}'
	)
	tmp_pre
ON
	tmp_now.main_id = tmp_pre.main_id
WHERE
	tmp_pre.main_id IS NULL
"
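The two anti-joins above amount to set differences on main_id. As a way to sanity-check that logic on toy data outside Hive, the sketch below performs the same subtraction with sort and comm. The function names and the sample ids are made up for illustration and are not part of the original job.

```shell
#!/bin/sh
# Hypothetical helpers; each argument is a file holding one main_id per line.

# Print the ids that occur in file $1 but not in file $2.
_diff_ids() {
    a=$(mktemp)
    b=$(mktemp)
    # comm requires sorted input; LC_ALL=C keeps sort and comm on the same ordering.
    LC_ALL=C sort -u "$1" > "$a"
    LC_ALL=C sort -u "$2" > "$b"
    # -23 suppresses lines unique to the second file and lines common to both.
    LC_ALL=C comm -23 "$a" "$b"
    rm -f "$a" "$b"
}

# ids in t-1 ($1) that are missing from t ($2): rows to flag is_valid=0
ids_to_delete() { _diff_ids "$1" "$2"; }

# ids in t ($2) that are missing from t-1 ($1): rows to add with is_valid=1
ids_to_add() { _diff_ids "$2" "$1"; }

# Toy data: k1 disappears between the two days, k4 appears.
pre=$(mktemp)
now=$(mktemp)
printf '%s\n' k1 k2 k3 > "$pre"
printf '%s\n' k2 k3 k4 > "$now"
ids_to_delete "$pre" "$now"   # prints k1
ids_to_add "$pre" "$now"      # prints k4
rm -f "$pre" "$now"
```

Ids present on both days (k2 and k3 here) appear in neither result, which is why unchanged rows are never resent to ES.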

 

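4. Elasticsearch design

     The ES index is designed as follows. When rows from the Hive table are written into ES, the Hive main_id is used as the ES _id, which guarantees that ES holds only one document per main_id. For deletions, a row that must be removed at t has the same main_id it had at t-1, so syncing it to ES (now with is_valid=0) overwrites the document written at t-1.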

{
  "settings": {
    "index": {
      "number_of_shards": "8",
      "number_of_replicas": "1"
    }
  },
  "mappings": {
    "xz_jimi3_sku_describe_info": {
      "dynamic": "false",
      "_all": {
        "enabled": false
      },
      "properties": {
        "id": {
          "type": "keyword"
        },
        "itemSku": {
          "type": "keyword"
        },
        "itemMainSku": {
          "type": "keyword"
        },
        "botId": {
          "type": "keyword"
        },
        "venderId": {
          "type": "keyword"
        },
        "cate3Id": {
          "type": "keyword"
        },
        "cate3Name": {
          "type": "keyword"
        },
        "entityType": {
          "type": "keyword"
        },
        "entityValue": {
          "type": "text",
          "analyzer": "whitespace"
        },
        "entitySource": {
          "type": "keyword"
        },
        "brandEn": {
          "type": "keyword"
        },
        "brandId": {
          "type": "keyword"
        },
        "valid": {
          "type": "keyword"
        },
        "dt": {
          "type": "keyword"
        }
      }
    }
  }
}
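Since main_id becomes the document _id, indexing a row again replaces any existing document with the same _id. The sketch below shows how one row might be turned into the two lines (action line, then source line) that an Elasticsearch Bulk API request expects. The helper name bulk_lines is hypothetical, only three of the mapped fields are included, and mapping is_valid to valid and version to dt is an assumption based on the field names. The sample id and date are illustrative.

```shell
#!/bin/sh
# Hypothetical helper: emit the bulk action line and the source line for one row.
# Arguments: main_id, is_valid (0 or 1), version date.
# Assumes the values contain no characters that need JSON escaping.
bulk_lines() {
    # Action line: "index" creates the document or replaces one with the same _id.
    printf '{"index":{"_id":"%s"}}\n' "$1"
    # Source line: a subset of the mapped fields, all stored as keyword strings.
    printf '{"id":"%s","valid":"%s","dt":"%s"}\n' "$1" "$2" "$3"
}

# A row from the delete table: same _id as at t-1, now flagged invalid.
bulk_lines "5d41402abc4b2a76b9719d911017c592" 0 "2019-01-01"
```

The resulting lines would then be posted to the _bulk endpoint of the target index; the host, index name, and client used for that step depend on the deployment and are not given in the original.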

 

 

 
