Contents
1. Background
2. Overall approach
3. SQL
(1) Incrementally deleted data
(2) Incrementally added data
4. Elasticsearch design
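1. Background
The Hive table holds a large volume of business data, on the order of tens of millions to hundreds of millions of rows, and part of that data changes every day. Reloading the full table into Elasticsearch daily would put heavy JVM pressure on the ES cluster nodes and hurt the cluster's availability. The data therefore needs to be updated incrementally to reduce the load on the ES cluster.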
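2. Overall approach
First, define a primary key that changes whenever the content of a row changes and stays the same when the data is unchanged. (A unique key can be generated with MD5.) Next, use set addition and subtraction in SQL (LEFT/RIGHT JOIN) to find the rows that have to be incrementally deleted and incrementally added each day. Finally, when syncing to ES, add one extra field (the current implementation adds is_valid, see the SQL below; is_valid=0 marks invalid data, is_valid=1 marks valid data). This field is a flag that tells whether a document synced to ES may be used by the business.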
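The key requirement above can be illustrated with a minimal Python sketch. The text does not say which columns feed the MD5 hash, so the `KEY_FIELDS` list, the `make_main_id` helper, and the sample rows are assumptions made only for illustration.

```python
import hashlib

# Assumption: these content columns define a row's identity.
# The source only states that the key can be produced with MD5.
KEY_FIELDS = ("item_sku", "bot_id", "entity_type", "entity_value")


def make_main_id(row: dict) -> str:
    """Return an MD5 hex digest over the chosen content fields.

    Identical content always gives the same key; changing any of the
    fields gives a different key.
    """
    # The separator keeps ("ab", "c") and ("a", "bc") from producing
    # the same string before hashing.
    payload = "\x01".join(str(row.get(f, "")) for f in KEY_FIELDS)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


# Hypothetical rows: same SKU, but the entity value changed overnight.
yesterday = {"item_sku": "1001", "bot_id": "7",
             "entity_type": "color", "entity_value": "red"}
today = dict(yesterday, entity_value="blue")

print(make_main_id(yesterday) == make_main_id(dict(yesterday)))  # True
print(make_main_id(yesterday) == make_main_id(today))            # False
```

Because a changed row gets a new key, a content change shows up as one key that disappears and one key that appears, which is what the two queries in the next section detect.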
3. SQL
(1) Incrementally deleted data
#!/bin/sh
# Partition dates: version_now is the snapshot from two days ago,
# version_pre the snapshot from three days ago (GNU date syntax).
version_now=$(date -d"-2 day" +%Y-%m-%d)
version_pre=$(date -d"-3 day" +%Y-%m-%d)

# Rebuild the temporary table on every run.
hive -e "DROP TABLE IF EXISTS app.tmp_xz_jimi3_sku_description_delete"

# Rows that exist in the previous snapshot but not in the current one
# (LEFT JOIN ... IS NULL). They are written with is_valid = 0.
hive -e "
CREATE TABLE app.tmp_xz_jimi3_sku_description_delete AS
SELECT
    main_id,
    item_sku,
    item_main_sku,
    a.bot_id,
    vender_id,
    category3,
    category3_id,
    entity_type,
    entity_value,
    entity_source,
    brand_code,
    brand_en,
    is_valid,
    version
FROM
(
    SELECT
        tmp_pre.main_id,
        item_sku,
        item_main_sku,
        bot_id,
        vender_id,
        category3,
        category3_id,
        entity_type,
        entity_value,
        entity_source,
        brand_code,
        brand_en,
        is_valid,
        '${version_now}' AS version
    FROM
    (
        SELECT
            main_id,
            item_sku,
            item_main_sku,
            bot_id,
            vender_id,
            category3,
            category3_id,
            entity_type,
            entity_value,
            entity_source,
            brand_code,
            brand_en,
            0 AS is_valid
        FROM
            app.sku_description_corpus
        WHERE
            dt = '${version_pre}'
    ) tmp_pre
    LEFT JOIN
    (
        SELECT main_id FROM app.sku_description_corpus WHERE dt = '${version_now}'
    ) tmp_now
    ON
        tmp_now.main_id = tmp_pre.main_id
    WHERE
        tmp_now.main_id IS NULL
) a
"
(2) Incrementally added data
#!/bin/sh
# Partition dates: version_now is the snapshot from two days ago,
# version_pre the snapshot from three days ago (GNU date syntax).
version_now=$(date -d"-2 day" +%Y-%m-%d)
version_pre=$(date -d"-3 day" +%Y-%m-%d)

# Rebuild the temporary table on every run.
hive -e "DROP TABLE IF EXISTS app.tmp_xz_jimi3_sku_description_add"

# Rows that exist in the current snapshot but not in the previous one
# (LEFT JOIN ... IS NULL). They are written with is_valid = 1.
hive -e "
CREATE TABLE app.tmp_xz_jimi3_sku_description_add AS
SELECT
    tmp_now.main_id,
    item_sku,
    item_main_sku,
    bot_id,
    vender_id,
    category3,
    category3_id,
    entity_type,
    entity_value,
    entity_source,
    brand_code,
    brand_en,
    is_valid,
    '${version_now}' AS version
FROM
(
    SELECT
        main_id,
        item_sku,
        item_main_sku,
        bot_id,
        vender_id,
        category3,
        category3_id,
        entity_type,
        entity_value,
        entity_source,
        brand_code,
        brand_en,
        1 AS is_valid
    FROM
        app.sku_description_corpus
    WHERE
        dt = '${version_now}'
) tmp_now
LEFT JOIN
(
    SELECT main_id FROM app.sku_description_corpus WHERE dt = '${version_pre}'
) tmp_pre
ON
    tmp_now.main_id = tmp_pre.main_id
WHERE
    tmp_pre.main_id IS NULL
"
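The two queries are mirror-image anti-joins over consecutive snapshots. The following Python sketch reproduces that logic on in-memory data to make the semantics easier to check; the `anti_join` and `daily_delta` helpers, the sample keys, and the sample date are hypothetical and not part of the original job.

```python
def anti_join(left: dict, right: dict) -> list:
    """Return the rows of `left` whose main_id is absent from `right`.

    Mirrors `left LEFT JOIN right ON ... WHERE right.main_id IS NULL`.
    """
    return [row for main_id, row in left.items() if main_id not in right]


def daily_delta(pre: dict, now: dict, version: str):
    """Return (to_delete, to_add), tagged like the two Hive temp tables."""
    # Gone since the previous snapshot: flag as invalid.
    to_delete = [dict(r, is_valid=0, version=version) for r in anti_join(pre, now)]
    # New in the current snapshot: flag as valid.
    to_add = [dict(r, is_valid=1, version=version) for r in anti_join(now, pre)]
    return to_delete, to_add


# Hypothetical snapshots keyed by main_id.
pre = {"k1": {"main_id": "k1", "entity_value": "red"},
       "k2": {"main_id": "k2", "entity_value": "cotton"}}
now = {"k2": {"main_id": "k2", "entity_value": "cotton"},
       "k3": {"main_id": "k3", "entity_value": "blue"}}

to_delete, to_add = daily_delta(pre, now, "2024-01-02")
print([r["main_id"] for r in to_delete])  # ['k1']
print([r["main_id"] for r in to_add])     # ['k3']
```

The unchanged row `k2` appears in neither list, so it never has to be sent to ES again.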
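4. Elasticsearch design
The ES index is designed as follows. When Hive rows are inserted into ES, the Hive main_id is used as the ES _id, which guarantees that each main_id has only one document in ES. For deletions, a row that has to be removed on day t relative to day t-1 has the same main_id it had on day t-1, so syncing it to ES again (now with is_valid=0) overwrites the document written on day t-1.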
{
  "settings": {
    "index": {
      "number_of_shards": "8",
      "number_of_replicas": "1"
    }
  },
  "mappings": {
    "xz_jimi3_sku_describe_info": {
      "dynamic": "false",
      "_all": {
        "enabled": false
      },
      "properties": {
        "id": {
          "type": "keyword"
        },
        "itemSku": {
          "type": "keyword"
        },
        "itemMainSku": {
          "type": "keyword"
        },
        "botId": {
          "type": "keyword"
        },
        "venderId": {
          "type": "keyword"
        },
        "cate3Id": {
          "type": "keyword"
        },
        "cate3Name": {
          "type": "keyword"
        },
        "entityType": {
          "type": "keyword"
        },
        "entityValue": {
          "type": "text",
          "analyzer": "whitespace"
        },
        "entitySource": {
          "type": "keyword"
        },
        "brandEn": {
          "type": "keyword"
        },
        "brandId": {
          "type": "keyword"
        },
        "valid": {
          "type": "keyword"
        },
        "dt": {
          "type": "keyword"
        }
      }
    }
  }
}
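As a rough illustration of how rows from the two temp tables could be written with main_id as the document _id, the sketch below builds a request body in the newline-delimited format of the Elasticsearch _bulk API. The original does not show the sync step, so the index name, the Hive-column-to-ES-field mapping (guessed from the field names), and the filter query are all assumptions.

```python
import json

# Assumption: Hive column -> ES field mapping, inferred from names only.
FIELD_MAP = {
    "main_id": "id",
    "item_sku": "itemSku",
    "item_main_sku": "itemMainSku",
    "bot_id": "botId",
    "vender_id": "venderId",
    "category3_id": "cate3Id",
    "category3": "cate3Name",
    "entity_type": "entityType",
    "entity_value": "entityValue",
    "entity_source": "entitySource",
    "brand_en": "brandEn",
    "brand_code": "brandId",
    "is_valid": "valid",
    "version": "dt",
}

INDEX_NAME = "sku_describe_index"        # hypothetical index name
DOC_TYPE = "xz_jimi3_sku_describe_info"  # type name taken from the mapping


def to_bulk_lines(rows) -> str:
    """Build an NDJSON _bulk body that uses main_id as the document _id."""
    lines = []
    for row in rows:
        # "index" replaces any existing document with the same _id, so an
        # is_valid=0 row overwrites the is_valid=1 copy from an earlier day.
        action = {"index": {"_index": INDEX_NAME, "_type": DOC_TYPE,
                            "_id": row["main_id"]}}
        # Values are stored as strings because the fields are keyword/text.
        source = {es: str(row[hive]) for hive, es in FIELD_MAP.items() if hive in row}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source, ensure_ascii=False))
    return "\n".join(lines) + "\n"


# A filter that readers of the index could apply to skip invalid documents.
VALID_ONLY_QUERY = {"query": {"bool": {"filter": [{"term": {"valid": "1"}}]}}}

rows = [{"main_id": "k1", "item_sku": "1001", "is_valid": 0, "version": "2024-01-02"}]
print(to_bulk_lines(rows))
```

Overwriting by _id keeps the daily write volume proportional to the number of changed rows, while the `valid` flag lets queries ignore documents that were logically deleted.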