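Druid provides a variety of ingestion tasks, including the compact task and the index task. The following compares the two in terms of use cases, strengths, and weaknesses.
(1) compact task
Merges all segments within the specified interval. The task spec has the following form: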
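{
  "type": "compact",
  "id": <task_id>,
  "dataSource": <task_datasource>,
  "interval": <interval to specify segments to be merged>,
  "dimensions": <custom dimensionsSpec>,
  "tuningConfig": <index task tuningConfig>,
  "context": <task context>
}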
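Its main purpose is to merge small segments: all segments in the specified interval are merged, and the number of resulting segments can be controlled through targetPartitionSize in tuningConfig. We mainly use it to merge historical segments on a regular, per-day basis, which reduces the number of segments and the storage footprint and improves query performance.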
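As an illustration of the daily merge described above, a minimal sketch of a filled-in compact task follows. The datasource name, interval, and targetPartitionSize value are assumptions chosen for the example, not values from our production setup; id and context are left out on the assumption that they are optional.

```json
{
  "type": "compact",
  "dataSource": "wikipedia",
  "interval": "2013-08-31/2013-09-01",
  "tuningConfig": {
    "type": "index",
    "targetPartitionSize": 5000000
  }
}
```

Submitting one such task per day, with the interval set to the day being merged, is one way to schedule the periodic compaction.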
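When a compact task runs, it is converted internally into an index task. In my testing, the dimensions setting of the compact task had no effect: the task inherits the dimensionsSpec and metricsSpec of the datasource, so dimensions and metrics cannot be configured flexibly. In addition, rollup takes effect only if all segments in the interval were themselves rolled up, and the rollup granularity cannot be changed. Segment metadata can be retrieved with a segmentMetadata query.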
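To check the rollup state of the segments in an interval before compacting, a segmentMetadata query can be sent to the Broker. The following is a minimal sketch; the datasource name and interval are assumptions for the example, and analysisTypes is limited to the fields relevant here.

```json
{
  "queryType": "segmentMetadata",
  "dataSource": "wikipedia",
  "intervals": ["2013-08-31/2013-09-01"],
  "analysisTypes": ["aggregators", "queryGranularity", "rollup"],
  "merge": false
}
```

With merge set to false the response should contain one entry per segment, so segments that were not rolled up, or that use a different queryGranularity, can be spotted individually.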
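To work around these limitations of the compact task, an index task can be used instead.
(2) index task
The index task is a simplified version of the Hadoop index task (index_hadoop). It is also mainly used for processing historical data and is suited to smaller data sets. An example: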
{
  "type" : "index",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "wikipedia",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "timestampSpec" : {
            "column" : "timestamp",
            "format" : "auto"
          },
          "dimensionsSpec" : {
            "dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
            "dimensionExclusions" : [],
            "spatialDimensions" : []
          }
        }
      },
      "metricsSpec" : [
        {
          "type" : "count",
          "name" : "count"
        },
        {
          "type" : "doubleSum",
          "name" : "added",
          "fieldName" : "added"
        },
        {
          "type" : "doubleSum",
          "name" : "deleted",
          "fieldName" : "deleted"
        },
        {
          "type" : "doubleSum",
          "name" : "delta",
          "fieldName" : "delta"
        }
      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "DAY",
        "queryGranularity" : "NONE",
        "intervals" : [ "2013-08-31/2013-09-01" ]
      }
    },
    "ioConfig" : {
      "type" : "index",
      "firehose" : {
        "type" : "local",
        "baseDir" : "examples/indexing/",
        "filter" : "wikipedia_data.json"
      }
    },
    "tuningConfig" : {
      "type" : "index",
      "targetPartitionSize" : 5000000,
      "maxRowsInMemory" : 75000
    }
  }
}
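Its advantages are:
1: dimensionsSpec can be specified freely, so dimensions can be chosen flexibly and unneeded ones dropped.
2: metricsSpec can be specified freely, so metrics can be aggregated flexibly.
3: Rollup (pre-aggregation) can be redone with a new queryGranularity.
4: The segmentGranularity period can be set.
5: Segment size can also be set through targetPartitionSize, which merges small segments.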
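To use an index task in place of compaction, the input has to come from existing segments rather than a local file. The fragment below is a sketch under assumptions: it relies on the ingestSegment firehose, which was not part of the original example, and it shows only the parts of the spec that change (granularitySpec and ioConfig). The HOUR queryGranularity is an arbitrary illustration of re-rolling data at a coarser granularity; the remaining fields (parser, metricsSpec, tuningConfig) are omitted.

```json
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "wikipedia",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "HOUR",
        "intervals": ["2013-08-31/2013-09-01"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "ingestSegment",
        "dataSource": "wikipedia",
        "interval": "2013-08-31/2013-09-01"
      }
    }
  }
}
```

One point to watch when re-indexing data that has already been rolled up: a metric that was originally a count generally needs to be re-aggregated as a longSum over the existing count column, otherwise it would count rolled-up rows rather than original events.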
Based on the comparison above, the index task is more flexible than the compact task, so the index task is recommended.