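First set up the following nodes and start their services: coordinator, historical, overlord, and middleManager.
Prerequisites: MySQL (its configuration is described at http://my.oschina.net/u/2460844/blog/637334), an HDFS cluster, and ZooKeeper (a standalone instance is enough).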
1. _common configuration:
druid.extensions.coordinates=["io.druid.extensions:druid-examples","io.druid.extensions:druid-kafka-eight","io.druid.extensions:mysql-metadata-storage","io.druid.extensions:druid-hdfs-storage"]
druid.extensions.localRepository=extensions-repo
druid.zk.service.host=druid01:2181
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc:mysql://druid01:3306/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=diurd1234
druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://vm1.cci/tmp/druid/localStorage
druid.cache.type=local
druid.cache.sizeInBytes=10000000
druid.selectors.indexing.serviceName=overlord
druid.selectors.coordinator.serviceName=coordinator
druid.emitter=logging
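A quick way to catch an omission in the common configuration before starting any node is to parse the properties file and confirm that the connection settings are present. The sketch below is only illustrative: `parse_properties`, `missing_keys`, and the `REQUIRED_KEYS` list are hypothetical helpers written for this post, not part of Druid, and the list covers only the ZooKeeper, metadata storage, and deep storage keys shown above.

```python
# Hypothetical pre-flight check for the common properties shown above.
REQUIRED_KEYS = [
    "druid.zk.service.host",
    "druid.metadata.storage.type",
    "druid.metadata.storage.connector.connectURI",
    "druid.storage.type",
    "druid.storage.storageDirectory",
]


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_keys(props, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not props.get(key)]


SAMPLE = """\
druid.zk.service.host=druid01:2181
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc:mysql://druid01:3306/druid
druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://vm1.cci/tmp/druid/localStorage
"""

if __name__ == "__main__":
    print(missing_keys(parse_properties(SAMPLE)))
```

To check a real file, read it with `open(path).read()` and pass the text to `parse_properties`.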
2. coordinator configuration:
druid.host=druid01
druid.port=8081
druid.service=coordinator
druid.coordinator.startDelay=PT5M
3. historical configuration:
druid.host=druid02
druid.port=8082
druid.service=druid/historical
druid.historical.cache.useCache=true
druid.historical.cache.populateCache=true
druid.processing.buffer.sizeBytes=100000000
druid.processing.numThreads=3
druid.server.http.numThreads=5
druid.server.maxSize=300000000000
druid.segmentCache.locations=[{"path": "/tmp/druid/indexCache", "maxSize": 300000000000}]
druid.monitoring.monitors=["io.druid.server.metrics.HistoricalMetricsMonitor", "com.metamx.metrics.JvmMonitor"]
4. overlord configuration:
druid.host=druid03
druid.port=8090
druid.service=overlord
druid.indexer.autoscale.doAutoscale=true
druid.indexer.autoscale.strategy=ec2
druid.indexer.autoscale.workerIdleTimeout=PT90m
druid.indexer.autoscale.terminatePeriod=PT5M
druid.indexer.autoscale.workerVersion=0
druid.indexer.logs.type=local
druid.indexer.logs.directory=/tmp/druid/indexlog
druid.indexer.runner.type=remote
druid.indexer.runner.minWorkerVersion=0
# Store all task state in the metadata storage
druid.indexer.storage.type=metadata
#druid.indexer.fork.property.druid.processing.numThreads=1
#druid.indexer.fork.property.druid.computation.buffer.size=100000000
5. middleManager configuration:
druid.host=druid04
druid.port=8091
druid.service=druid/middlemanager
druid.indexer.logs.type=local
druid.indexer.logs.directory=/tmp/druid/indexlog
druid.indexer.fork.property.druid.processing.numThreads=5
druid.indexer.fork.property.druid.computation.buffer.size=100000000
# Resources for peons
druid.indexer.runner.javaOpts=-server -Xmx3g
druid.indexer.task.baseTaskDir=/tmp/persistent/task/
6. Start each node. If a node fails to start, the most likely cause is insufficient memory; adjust the Java runtime options as needed.
7. Data to import: wikipedia_data.csv and wikipedia_data.json
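Once the nodes are started, one way to confirm that each of them is up is to request its `/status` HTTP endpoint. The following is a minimal sketch, assuming the hosts and ports from the configurations above; `status_url` and `check_node` are hypothetical helper names, and a node that answers may still be misconfigured in other ways.

```python
import json
import urllib.request

# Hosts and ports taken from the node configurations in this post.
NODES = {
    "coordinator": ("druid01", 8081),
    "historical": ("druid02", 8082),
    "overlord": ("druid03", 8090),
    "middleManager": ("druid04", 8091),
}


def status_url(host, port):
    """Build the URL of a node's status endpoint."""
    return "http://%s:%d/status" % (host, port)


def check_node(host, port, timeout=5):
    """Return True if the node answers /status with parseable JSON."""
    try:
        with urllib.request.urlopen(status_url(host, port), timeout=timeout) as resp:
            json.loads(resp.read().decode("utf-8"))
            return True
    except (OSError, ValueError):
        # OSError covers connection errors and timeouts; ValueError covers bad JSON.
        return False


if __name__ == "__main__":
    for name, (host, port) in NODES.items():
        print(name, "up" if check_node(host, port) else "DOWN")
```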
---wikipedia_data.json:
{"timestamp": "2013-08-31T01:02:33Z", "page": "Gypsy Danger", "language" : "en", "user" : "nuclear", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143}
{"timestamp": "2013-08-31T03:32:45Z", "page": "Striker Eureka", "language" : "en", "user" : "speed", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Australia", "country":"Australia", "region":"Cantebury", "city":"Syndey", "added": 459, "deleted": 129, "delta": 330}
{"timestamp": "2013-08-31T07:11:21Z", "page": "Cherno Alpha", "language" : "ru", "user" : "masterYi", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"article", "continent":"Asia", "country":"Russia", "region":"Oblast", "city":"Moscow", "added": 123, "deleted": 12, "delta": 111}
{"timestamp": "2013-08-31T11:58:39Z", "page": "Crimson Typhoon", "language" : "zh", "user" : "triplets", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"China", "region":"Shanxi", "city":"Taiyuan", "added": 905, "deleted": 5, "delta": 900}
{"timestamp": "2013-08-31T12:41:27Z", "page": "Coyote Tango", "language" : "ja", "user" : "cancer", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"Japan", "region":"Kanto", "city":"Tokyo", "added": 1, "deleted": 10, "delta": -9}
---wikipedia_data.csv:
2013-08-31T01:02:33Z, Gypsy Danger, en, nuclear, true, true, false, false, article, North America, United States, Bay Area, San Francisco, 57, 200, -143
2013-08-31T01:02:33Z, Gypsy Danger, en, nuclear, true, true, false, false, article, North America, United States, Bay Area, San Francisc, 57, 200, -143
2013-08-31T01:02:33Z, Gypsy Danger, en, nuclear, true, true, false, false, article, North America, United States, Bay Area, San Francis, 57, 200, -143
2013-08-31T01:02:33Z, Gypsy Danger, en, nuclear, true, true, false, false, article, North America, United States, Bay Area, San Franci, 57, 200, -143
2013-08-31T01:02:33Z, Gypsy Danger, en, nuclear, true, true, false, false, article, North America, United States, Bay Area, San Franc, 57, 200, -143
2013-08-31T01:02:33Z, Gypsy Danger, en, nuclear, true, true, false, false, article, North America, United States, Bay Area, San Fran, 57, 200, -143
2013-08-31T01:02:33Z, Gypsy Danger, en, nuclear, true, true, false, false, article, North America, United States, Bay Area, San Fra, 57, 200, -143
2013-08-31T01:02:33Z, Gypsy Danger, en, nuclear, true, true, false, false, article, North America, United States, Bay Area, San Fr, 57, 200, -143
2013-08-31T01:02:33Z, Gypsy Danger, en, nuclear, true, true, false, false, article, North America, United States, Bay Area, San F, 57, 200, -143
2013-08-31T01:02:33Z, Gypsy Danger, en, nuclear, true, true, false, false, article, North America, United States, Bay Area, Sa , 57, 200, -143
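The two files carry the same fields; the CSV simply lists the values in a fixed column order, which the CSV task files later in this post declare in their "columns" entry. The sketch below shows one way to derive CSV rows from the JSON records in that order. `json_lines_to_csv` is a hypothetical helper, and it writes no space after each comma, unlike the sample CSV above.

```python
import csv
import io
import json

# Column order copied from the "columns" list used by the CSV task files.
COLUMNS = [
    "timestamp", "page", "language", "user", "unpatrolled", "newPage",
    "robot", "anonymous", "namespace", "continent", "country", "region",
    "city", "added", "deleted", "delta",
]


def json_lines_to_csv(json_text):
    """Convert newline-delimited JSON records to CSV text in COLUMNS order."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in json_text.splitlines():
        if line.strip():
            record = json.loads(line)
            writer.writerow([record[column] for column in COLUMNS])
    return out.getvalue()


# First record of wikipedia_data.json.
SAMPLE = (
    '{"timestamp": "2013-08-31T01:02:33Z", "page": "Gypsy Danger", '
    '"language": "en", "user": "nuclear", "unpatrolled": "true", '
    '"newPage": "true", "robot": "false", "anonymous": "false", '
    '"namespace": "article", "continent": "North America", '
    '"country": "United States", "region": "Bay Area", '
    '"city": "San Francisco", "added": 57, "deleted": 200, "delta": -143}'
)

if __name__ == "__main__":
    print(json_lines_to_csv(SAMPLE), end="")
```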
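8. Note: if the data is imported from local disk, the data file must be stored on the middleManager node; otherwise the task cannot find the file after it is submitted. If the data is imported from HDFS, you only need to put the file into HDFS first. The overlord node here is druid03 (you can substitute its IP address).
9. On any node that can reach druid03, create a JSON index task file:
9.1 Task JSON for importing a locally stored JSON file
First save wikipedia_data.json in the Druid directory on the middleManager node (for example /root/druid-0.8.3).
Name the task file wikipedia_index_local_json_task.json: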
{
  "type" : "index",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "wikipedia",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "timestampSpec" : { "column" : "timestamp", "format" : "auto" },
          "dimensionsSpec" : {
            "dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
            "dimensionExclusions" : [],
            "spatialDimensions" : []
          }
        }
      },
      "metricsSpec" : [
        { "type" : "count", "name" : "count" },
        { "type" : "doubleSum", "name" : "added", "fieldName" : "added" },
        { "type" : "doubleSum", "name" : "deleted", "fieldName" : "deleted" },
        { "type" : "doubleSum", "name" : "delta", "fieldName" : "delta" }
      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "DAY",
        "queryGranularity" : "NONE",
        "intervals" : [ "2013-08-31/2013-09-01" ]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": { "type": "local", "baseDir": "./", "filter": "wikipedia_data.json" }
    },
    "tuningConfig": { "type": "index", "targetPartitionSize": 0, "rowFlushBoundary": 0 }
  }
}
9.2 Submit the task. As mentioned above, the overlord node runs on druid03, so submit the task to druid03:
curl -X 'POST' -H 'Content-Type:application/json' -d @wikipedia_index_local_json_task.json druid03:8090/druid/indexer/v1/task
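The same submission can be scripted. The sketch below mirrors the curl command using only the Python standard library; `build_submit_request` and `submit_task` are hypothetical helper names, and the response body is returned as-is without assuming its shape. Parsing the file as JSON before sending catches syntax errors locally rather than on the overlord.

```python
import json
import urllib.request

# Overlord endpoint used by the curl command in this post.
OVERLORD_TASK_URL = "http://druid03:8090/druid/indexer/v1/task"


def build_submit_request(spec_json, url=OVERLORD_TASK_URL):
    """Build a POST request carrying the task spec; raises ValueError on bad JSON."""
    json.loads(spec_json)  # fail early if the task file is not valid JSON
    return urllib.request.Request(
        url,
        data=spec_json.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_task(path, url=OVERLORD_TASK_URL):
    """Read a task file, submit it to the overlord, and return the raw response text."""
    with open(path, encoding="utf-8") as task_file:
        request = build_submit_request(task_file.read(), url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(submit_task("wikipedia_index_local_json_task.json"))
```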
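The log on the overlord node shows how the task is progressing. Messages like the following indicate that the task succeeded (this sample comes from an index_hadoop task; your task IDs will differ):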
2016-03-29T17:35:11,385 INFO [forking-task-runner-1] io.druid.indexing.overlord.ForkingTaskRunner - Logging task index_hadoop_NN_2016-03-29T17:35:11.510+08:00 output to: /tmp/persistent/task/index_hadoop_NN_2016-03-29T17:35:11.510+08:00/log
2016-03-29T17:42:15,263 INFO [forking-task-runner-1] io.druid.indexing.overlord.ForkingTaskRunner - Process exited with status[0] for task: index_hadoop_NN_2016-03-29T17:35:11.510+08:00
2016-03-29T17:42:15,265 INFO [forking-task-runner-1] io.druid.indexing.common.tasklogs.FileTaskLogs - Wrote task log to: /tmp/druid/indexlog/index_hadoop_NN_2016-03-29T17:35:11.510+08:00.log
2016-03-29T17:42:15,267 INFO [forking-task-runner-1] io.druid.indexing.overlord.ForkingTaskRunner - Removing task directory: /tmp/persistent/task/index_hadoop_NN_2016-03-29T17:35:11.510+08:00
2016-03-29T17:42:15,284 INFO [WorkerTaskMonitor-1] io.druid.indexing.worker.WorkerTaskMonitor - Job's finished. Completed [index_hadoop_NN_2016-03-29T17:35:11.510+08:00] with status [SUCCESS]
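When several tasks are running, scanning the log by eye gets tedious. A small sketch that extracts the final status per task from the "Completed [...] with status [...]" lines is shown below. The regular expression is based only on the sample log in this post, so treat the pattern as an assumption about the log format; `task_statuses` is a hypothetical helper name.

```python
import re

# Matches lines such as:
#   ... Job's finished. Completed [<task id>] with status [SUCCESS]
# The closing bracket is optional so that a truncated line still matches.
COMPLETED = re.compile(
    r"Completed \[(?P<task>[^\]]+)\] with status \[(?P<status>[A-Z]+)\]?"
)


def task_statuses(log_text):
    """Map each completed task ID found in the log text to its reported status."""
    return {m.group("task"): m.group("status") for m in COMPLETED.finditer(log_text)}


SAMPLE_LINE = (
    "2016-03-29T17:42:15,284 INFO [WorkerTaskMonitor-1] "
    "io.druid.indexing.worker.WorkerTaskMonitor - Job's finished. "
    "Completed [index_hadoop_NN_2016-03-29T17:35:11.510+08:00] with status [SUCCESS]"
)

if __name__ == "__main__":
    for task_id, status in task_statuses(SAMPLE_LINE).items():
        print(task_id, status)
```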
9.3 Example task file for importing local CSV data. wikipedia_data.csv must first be saved in the Druid directory on the middleManager node (for example /root/druid-0.8.3).
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "wikipedia",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format" : "csv",
          "timestampSpec" : { "column" : "timestamp" },
          "columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"],
          "dimensionsSpec" : {
            "dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"]
          }
        }
      },
      "metricsSpec": [
        { "type": "count", "name": "count" },
        { "type": "doubleSum", "name": "added", "fieldName": "added" },
        { "type": "doubleSum", "name": "deleted", "fieldName": "deleted" },
        { "type": "doubleSum", "name": "delta", "fieldName": "delta" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2013-08-31/2013-09-01"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": { "type": "local", "baseDir": "./", "filter": "wikipedia_data.csv" }
    },
    "tuningConfig": { "type": "index", "targetPartitionSize": 0, "rowFlushBoundary": 0 }
  }
}
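9.4 Importing a JSON file from HDFS. First put wikipedia_data.json into HDFS and note its directory, then give that path in the task file. The HDFS path must include the hostname or IP address of the HDFS namenode; here vm1.cci stands in for the namenode's IP address. Pay attention to how this task file differs from the local-import task files: those differences decide whether the import succeeds.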
{
  "type" : "index_hadoop",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "wikipedia",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "timestampSpec" : { "column" : "timestamp", "format" : "auto" },
          "dimensionsSpec" : {
            "dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
            "dimensionExclusions" : [],
            "spatialDimensions" : []
          }
        }
      },
      "metricsSpec" : [
        { "type" : "count", "name" : "count" },
        { "type" : "doubleSum", "name" : "added", "fieldName" : "added" },
        { "type" : "doubleSum", "name" : "deleted", "fieldName" : "deleted" },
        { "type" : "doubleSum", "name" : "delta", "fieldName" : "delta" }
      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "DAY",
        "queryGranularity" : "NONE",
        "intervals" : [ "2013-08-31/2013-09-01" ]
      }
    },
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : { "type" : "static", "paths" : "hdfs://vm1.cci/tmp/druid/datasource/wikipedia_data.json" }
    },
    "tuningConfig" : { "type": "hadoop" }
  }
}
9.5 Importing a CSV file from HDFS. The task file is as follows:
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "wikipedia",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format" : "csv",
          "timestampSpec" : { "column" : "timestamp" },
          "columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"],
          "dimensionsSpec" : {
            "dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"]
          }
        }
      },
      "metricsSpec": [
        { "type": "count", "name": "count" },
        { "type": "doubleSum", "name": "added", "fieldName": "added" },
        { "type": "doubleSum", "name": "deleted", "fieldName": "deleted" },
        { "type": "doubleSum", "name": "delta", "fieldName": "delta" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2013-08-31/2013-09-01"]
      }
    },
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : { "type" : "static", "paths" : "hdfs://vm1.cci/tmp/druid/datasource/wikipedia_data.csv" }
    },
    "tuningConfig" : { "type": "hadoop" }
  }
}
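For a given data format, the local and HDFS task files share the same dataSchema and differ in three places: the top-level type, the ioConfig, and the tuningConfig. The sketch below makes that explicit by generating both variants from one schema. It assumes the pairing used in 9.3 and 9.4 ("index" with a local firehose, "index_hadoop" with a static inputSpec); `local_task`, `hdfs_task`, and `differing_sections` are hypothetical helper names.

```python
def local_task(data_schema, base_dir, file_name):
    """Task that reads a file from the middleManager's local disk."""
    return {
        "type": "index",
        "spec": {
            "dataSchema": data_schema,
            "ioConfig": {
                "type": "index",
                "firehose": {"type": "local", "baseDir": base_dir, "filter": file_name},
            },
            "tuningConfig": {"type": "index", "targetPartitionSize": 0, "rowFlushBoundary": 0},
        },
    }


def hdfs_task(data_schema, path):
    """Task that reads a file from HDFS through the Hadoop indexer."""
    return {
        "type": "index_hadoop",
        "spec": {
            "dataSchema": data_schema,
            "ioConfig": {"type": "hadoop", "inputSpec": {"type": "static", "paths": path}},
            "tuningConfig": {"type": "hadoop"},
        },
    }


def differing_sections(task_a, task_b):
    """List the parts of two task specs whose values differ."""
    diffs = ["type"] if task_a["type"] != task_b["type"] else []
    for key in sorted(set(task_a["spec"]) | set(task_b["spec"])):
        if task_a["spec"].get(key) != task_b["spec"].get(key):
            diffs.append(key)
    return diffs


if __name__ == "__main__":
    schema = {"dataSource": "wikipedia"}  # abbreviated; use the full dataSchema in practice
    local = local_task(schema, "./", "wikipedia_data.json")
    hdfs = hdfs_task(schema, "hdfs://vm1.cci/tmp/druid/datasource/wikipedia_data.json")
    print(differing_sections(local, hdfs))
```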
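Summary: Druid has a huge number of configurable options, and an oversight in any one of them can make a task fail. Four examples are given here, and it is worth working out exactly how they differ. Beginners will almost inevitably stumble at this step.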