ElasticSearch初探

      最近在做数据漏斗的需求,就是统计ADX调用到我们DSP系统,我们DSP系统内部,对流量的过滤情况,做一个可视化界面,供PM查看。因为存储采用的ES,之前没有在项目中使用过,这两天看官方文档恶补了下,说实话,英文的官方文档真的是最好的学习资料,写语句时遇到了问题,去论坛发帖,一个外国友人回复了,因为我在提问时没有把代码格式化,以至于可读性非常差,后来改了,可读性增强,这样对别人也是一种尊重,别人浏览到这个帖子后,也能很清晰地看懂。很喜欢这种感觉,以后以官方文档为资料,不过英语真的是硬伤,需要多学习。外国友人给的建议

需求:统计参与竞价的PV数,条件是开始时间和结束时间,adxid,adlist不能为空。指标是对sidcount求和。其中对我来说难点在adlist不能为空这个条件。

下面我们看一下索引的mapping

{
  "lm_dsp_data_search_funnel-2018.08.10": {
    "mappings": {
      "doc": {
        "properties": {
          "@timestamp": {
            "type": "date"
          },
          "@version": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "abtest": {
            "type": "keyword"
          },
          "adlist": {
            "type": "nested",
            "properties": {
              "promotionid": {
                "type": "integer"
              },
              "sn": {
                "type": "integer"
              }
            }
          },
          "adxid": {
            "type": "keyword"
          },
          "age": {
            "type": "integer"
          },
          "deviceidmd5": {
            "type": "keyword"
          },
          "displocal1": {
            "type": "integer"
          },
          "displocal2": {
            "type": "integer"
          },
          "escount": {
            "type": "long"
          },
          "gender": {
            "type": "integer"
          },
          "interest": {
            "type": "integer"
          },
          "mediaid": {
            "type": "integer"
          },
          "os": {
            "type": "integer"
          },
          "sid": {
            "type": "keyword"
          },
          "sidcount": {
            "type": "integer"
          },
          "slotid": {
            "type": "integer"
          },
          "stime": {
            "type": "long"
          },
          "styleids": {
            "type": "integer"
          },
          "tags": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "userip": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

可以发现 adlist 采用了 nested 的方式

查询全部时,返回数据的一个结构,如下图所示:

 

 

 

 

 

 

 

ElasticSearch初探_第1张图片

最开始想使用es的script实现,写了下面的语句:

GET lm_dsp_data_search_funnel-2018.08.10/_search
{
  "size": 10, 
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "adxid": "u-2cbc34ryk3v43nkdq9g"
          }
        },
        {
          "range": {
            "stime": {
              "gte": 1533859200000,
              "lte": 1533873600000
            }
          }
        }
      ],
      "filter": {
        "script": {
        "script": "doc['adlist'].values.length > 0"
          }
      }
    }
  }, 
  "aggs": {
    "sn_count": {
      "sum": {
        "field": "sidcount"
      }
    }
  }
}

上面的语句报下面的错误:

{
  "error": {
    "root_cause": [
      {
        "type": "script_exception",
        "reason": "runtime error",
        "script_stack": [
          "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:81)",
          "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:39)",
          "doc['adlist'].values.length > 0",
          "    ^---- HERE"
        ],
        "script": "doc['adlist'].values.length > 0",
        "lang": "painless"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "lm_dsp_data_search_funnel-2018.08.10",
        "node": "eMpHZINKRxeM1YsKNDcjng",
        "reason": {
          "type": "script_exception",
          "reason": "runtime error",
          "script_stack": [
            "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:81)",
            "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:39)",
            "doc['adlist'].values.length > 0",
            "    ^---- HERE"
          ],
          "script": "doc['adlist'].values.length > 0",
          "lang": "painless",
          "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "No field found for [adlist] in mapping with types []"
          }
        }
      }
    ]
  },
  "status": 500
}

从网上查资料,也没有解决,写脚本这个技能后面需要再学习一下。后来转换了想法,用别的方式实现。

方法一:

GET lm_dsp_data_search_funnel-2018.08.10/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "adxid": {
              "value": "u-2cbc34ryk3v43nkdq9g"
            }
          }
        },
        {
          "range": {
            "stime": {
              "gte": 1533859200000,
              "lte": 1533898800000
            }
          }
        },
        {
          "nested" : {
            "path" : "adlist",
            "score_mode" : "none",
            "query" : {
              "exists":{"field":"adlist"}
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "count": {
      "sum": {
        "field": "sidcount"
      }
    }
  }
}

上面的会过滤掉{"adlist":[]}这样的数据,当拿不准的时候,看官方文档:

 

 

ElasticSearch初探_第2张图片

方式二:

GET lm_dsp_data_search_funnel-2018.08.10/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "adxid": {
              "value": "u-2cbc34ryk3v43nkdq9g"
            }
          }
        },
        {
          "range": {
            "stime": {
              "from": 1533891139904,
              "to": null,
              "include_lower": false,
              "include_upper": false,
              "boost": 1
            }
          }
        },
        {
          "nested" : {
            "path" : "adlist",
            "score_mode" : "none",
            "query" : {
                "bool" : {
                    "must" : [
                      { "range" : {"adlist.promotionid" : {"gte" : "0"}} }
                    ]
                }
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "count": {
      "sum": {
        "field": "sidcount"
      }
    }
  }
}

上面的两种方法都能实现需求:

{
  "took": 42,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 397105,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "count": {
      "value": 397105
    }
  }
}

这个需求的时间花在了梳理需求,日志埋点,把日志收集到es中,统计数据,还有就是熟悉es的查询语句,今天是周五,下周一是开始的最后一天,需要快速把语句融入到代码中,这只是其中一个20分之一的需求,任务重大,不过当梳理清楚需求后,知道怎么做,下面的工作就会轻松些,很喜欢这种在工作中学习的感觉。

你可能感兴趣的:(框架类)