Comparing the Druid compact task and index task

    Druid provides a variety of ingestion tasks, among them the compact task and the index task. Below, the two tasks are compared in terms of use cases, advantages, and drawbacks.

   (1)compact task

       The compact task merges all segments within the specified interval. Its spec looks like this:

{
    "type": "compact",
    "id": <task_id>,
    "dataSource": <dataSource>,
    "interval": <interval of the segments to be compacted>,
    "dimensions": <custom dimensionsSpec>,
    "tuningConfig": <index task tuningConfig>,
    "context": <task context>
}

    Its main purpose is to merge small segments: all segments in the specified interval are compacted, and the size of the resulting segments can be controlled via targetPartitionSize in the tuningConfig. We mainly use it to periodically compact historical segments at a daily granularity, which reduces the number of segments and the storage footprint and improves query performance.
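    For instance, a daily compaction could be submitted to the Overlord with a spec like the following (the dataSource, interval, and tuning values are only illustrative and reuse those from the index task example further below):

{
    "type": "compact",
    "dataSource": "wikipedia",
    "interval": "2013-08-31/2013-09-01",
    "tuningConfig": {
        "type": "index",
        "targetPartitionSize": 5000000,
        "maxRowsInMemory": 75000
    }
}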

   When a compact task runs, it is internally converted into an index task. In my tests the dimensions configuration of the compact task did not take effect; the task inherits the dataSource's dimensionSpec and metricSpec, so dimensions and metrics cannot be adjusted flexibly. Also, rollup is only applied if all segments in the interval have already been rolled up, and the rollup granularity cannot be changed. Whether segments are rolled up, along with other segment metadata, can be checked with a segmentMetadata query.
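   A minimal sketch of such a segmentMetadata query, sent to the Broker, which reports the rollup status, query granularity, and aggregators of the segments in an interval (the dataSource and interval are illustrative):

{
    "queryType": "segmentMetadata",
    "dataSource": "wikipedia",
    "intervals": ["2013-08-31/2013-09-01"],
    "analysisTypes": ["rollup", "queryGranularity", "aggregators"]
}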

   To work around these limitations of the compact task, the index task can be used instead.

 (2)index task 

     The index task is a simplified version of the Hadoop index task. It is likewise mainly used for processing historical data and is suited to relatively small data sets. An example spec:

{
  "type" : "index",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "wikipedia",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "timestampSpec" : {
            "column" : "timestamp",
            "format" : "auto"
          },
          "dimensionsSpec" : {
            "dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
            "dimensionExclusions" : [],
            "spatialDimensions" : []
          }
        }
      },
      "metricsSpec" : [
        {
          "type" : "count",
          "name" : "count"
        },
        {
          "type" : "doubleSum",
          "name" : "added",
          "fieldName" : "added"
        },
        {
          "type" : "doubleSum",
          "name" : "deleted",
          "fieldName" : "deleted"
        },
        {
          "type" : "doubleSum",
          "name" : "delta",
          "fieldName" : "delta"
        }
      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "DAY",
        "queryGranularity" : "NONE",
        "intervals" : [ "2013-08-31/2013-09-01" ]
      }
    },
    "ioConfig" : {
      "type" : "index",
      "firehose" : {
        "type" : "local",
        "baseDir" : "examples/indexing/",
        "filter" : "wikipedia_data.json"
       }
    },
    "tuningConfig" : {
      "type" : "index",
      "targetPartitionSize" : 5000000,
      "maxRowsInMemory" : 75000
    }
  }
}

  Its advantages are:

  1: The dimensionsSpec can be specified freely, so dimensions can be chosen and redundant dimensions dropped.

  2: The metricsSpec can be specified freely, so metrics can be computed as needed.

  3: Data can be re-aggregated with a different queryGranularity.

  4: The segmentGranularity period can be set.

  5: Segment size can also be controlled via targetPartitionSize, merging small segments (a sketch of an ioConfig for re-indexing existing segments follows this list).
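 To apply these options to data that is already stored in Druid rather than to raw input files, the index task's ioConfig can use the ingestSegment firehose, which reads the existing segments of a given interval as its input. A minimal sketch of such an ioConfig (the dataSource and interval are illustrative):

    "ioConfig" : {
      "type" : "index",
      "firehose" : {
        "type" : "ingestSegment",
        "dataSource" : "wikipedia",
        "interval" : "2013-08-31/2013-09-01"
      }
    }

 The rest of the spec (dataSchema and tuningConfig) stays the same as in the example above, with the desired dimensionsSpec, metricsSpec, and granularitySpec.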

Based on the comparison above, the index task is more flexible than the compact task, so the index task is recommended.
