Druid Quickstart

Hardware and software requirements:

  • Java 8 or higher
  • Linux, Mac OS X, or other Unix-like OS (Windows is not supported)
  • 8G of RAM
  • 2 vCPUs
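
To quickly confirm your environment meets these requirements, you can run a few standard commands (nproc and free are Linux-specific; on Mac OS X use sysctl -n hw.ncpu and sysctl hw.memsize instead):

java -version   # should report version 1.8 or newer
nproc           # number of available vCPUs
free -h         # installed RAM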

1. Download and extract the package

curl -O http://static.druid.io/artifacts/releases/druid-0.12.3-bin.tar.gz
tar -xzf druid-0.12.3-bin.tar.gz
cd druid-0.12.3

The archive contains the following directories:

  • LICENSE - the license file.
  • bin/ - scripts for the quickstart.
  • conf/* - configuration templates for a clustered setup.
  • conf-quickstart/* - configuration files for the quickstart.
  • extensions/* - all Druid extensions.
  • hadoop-dependencies/* - Druid Hadoop dependencies.
  • lib/* - all core Druid packages.
  • quickstart/* - files used by the quickstart.

2. Download the tutorial examples

curl -O http://druid.io/docs/0.12.3/tutorials/tutorial-examples.tar.gz
tar zxvf tutorial-examples.tar.gz

3. Start ZooKeeper

curl http://mirror.bit.edu.cn/apache/zookeeper/stable/zookeeper-3.4.12.tar.gz -o zookeeper-3.4.12.tar.gz
tar -xzf zookeeper-3.4.12.tar.gz
cd zookeeper-3.4.12
cp conf/zoo_sample.cfg conf/zoo.cfg
./bin/zkServer.sh start
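
Before moving on, you can confirm that ZooKeeper is up; zkServer.sh has a status subcommand, and ZooKeeper answers the four-letter ruok command on its client port (2181 by default):

./bin/zkServer.sh status        # should report a server running in standalone mode
echo ruok | nc localhost 2181   # prints "imok" if the server is healthy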

4. Start the Druid services

From the druid-0.12.3 directory, run the following command:

bin/init

The init script performs some initialization work; its contents are as follows:

#!/bin/bash -eu

gzip -c -d quickstart/wikiticker-2015-09-12-sampled.json.gz > "quickstart/wikiticker-2015-09-12-sampled.json"

LOG_DIR=var

mkdir log
mkdir -p $LOG_DIR/tmp;
mkdir -p $LOG_DIR/druid/indexing-logs;
mkdir -p $LOG_DIR/druid/segments;
mkdir -p $LOG_DIR/druid/segment-cache;
mkdir -p $LOG_DIR/druid/task;
mkdir -p $LOG_DIR/druid/hadoop-tmp;
mkdir -p $LOG_DIR/druid/pids;
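
After bin/init completes, you can verify that the directory layout created by the script is in place:

ls var/druid
# expected: hadoop-tmp  indexing-logs  pids  segment-cache  segments  task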

Start the Druid processes in separate terminal windows. This tutorial runs all Druid processes on a single machine; even in a large distributed production cluster, some Druid processes can still be co-located on the same host.

java `cat examples/conf/druid/coordinator/jvm.config | xargs` -cp "examples/conf/druid/_common:examples/conf/druid/_common/hadoop-xml:examples/conf/druid/coordinator:lib/*" io.druid.cli.Main server coordinator
java `cat examples/conf/druid/overlord/jvm.config | xargs` -cp "examples/conf/druid/_common:examples/conf/druid/_common/hadoop-xml:examples/conf/druid/overlord:lib/*" io.druid.cli.Main server overlord
java `cat examples/conf/druid/historical/jvm.config | xargs` -cp "examples/conf/druid/_common:examples/conf/druid/_common/hadoop-xml:examples/conf/druid/historical:lib/*" io.druid.cli.Main server historical
java `cat examples/conf/druid/middleManager/jvm.config | xargs` -cp "examples/conf/druid/_common:examples/conf/druid/_common/hadoop-xml:examples/conf/druid/middleManager:lib/*" io.druid.cli.Main server middleManager
java `cat examples/conf/druid/broker/jvm.config | xargs` -cp "examples/conf/druid/_common:examples/conf/druid/_common/hadoop-xml:examples/conf/druid/broker:lib/*" io.druid.cli.Main server broker
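
If you would rather not juggle five terminal windows, a minimal convenience sketch is to launch all five processes in the background from a single shell, redirecting each one's output to a log file (the var/sv log directory name here is an assumption, not part of the release):

mkdir -p var/sv
for svc in coordinator overlord historical middleManager broker; do
  # same JVM flags and classpath as the per-terminal commands above
  nohup java `cat examples/conf/druid/$svc/jvm.config | xargs` \
    -cp "examples/conf/druid/_common:examples/conf/druid/_common/hadoop-xml:examples/conf/druid/$svc:lib/*" \
    io.druid.cli.Main server $svc > var/sv/$svc.log 2>&1 &
done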

Each jvm.config file holds the runtime arguments for its Java process; cat examples/conf/druid/coordinator/jvm.config outputs the following:

-server
-Xms256m
-Xmx256m
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
-Djava.io.tmpdir=var/tmp
-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
-Dderby.stream.error.file=var/druid/derby.log

The commands above are run in separate terminal windows and start the coordinator, overlord, historical, middleManager, and broker processes respectively.
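
Each process exposes a /status endpoint on its port, so you can verify they all came up. The ports below are Druid's defaults (coordinator 8081, broker 8082, historical 8083, overlord 8090, middleManager 8091); adjust them if you changed the runtime configuration:

for port in 8081 8082 8083 8090 8091; do
  curl -s http://localhost:$port/status > /dev/null && echo "port $port OK"
done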

5. Reset Druid

All persistent state, such as the cluster metadata store and segments for the services, is kept in the druid-0.12.3/var directory.
To stop the services, CTRL-C the running Java processes. If you then want to restart the services from a clean state, delete the log and var directories, rerun the init script, then stop ZooKeeper and delete the ZooKeeper data directory /tmp/zookeeper.

From the druid-0.12.3 directory:

rm -rf log
rm -rf var
bin/init

If you followed the Loading stream data from Kafka tutorial, you need to stop Kafka before stopping ZooKeeper and delete the Kafka log directory /tmp/kafka-logs.

Ctrl-C to stop the Kafka broker, then delete the log directory:

rm -rf /tmp/kafka-logs

Now stop ZooKeeper and clean up its state. From the zookeeper-3.4.12 directory:

./bin/zkServer.sh stop
rm -rf /tmp/zookeeper

Once the Druid and ZooKeeper state has been cleaned up, restart ZooKeeper and then the Druid services.

6. The dataset

The data loading tutorials below use a data file that ships under the Druid package root at quickstart/wikiticker-2015-09-12-sampled.json.gz. It contains Wikipedia page edit events from 2015-09-12, stored as one JSON object per line in a text file.
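
Because the file is newline-delimited JSON, you can inspect it with standard command-line tools before loading it (python is used here only for pretty-printing and is assumed to be installed):

# pretty-print the first event
gzip -cd quickstart/wikiticker-2015-09-12-sampled.json.gz | head -n 1 | python -m json.tool
# count the events in the sample
gzip -cd quickstart/wikiticker-2015-09-12-sampled.json.gz | wc -l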

The data contains the following columns:

  • added
  • channel
  • cityName
  • comment
  • countryIsoCode
  • countryName
  • deleted
  • delta
  • isAnonymous
  • isMinor
  • isNew
  • isRobot
  • isUnpatrolled
  • metroCode
  • namespace
  • page
  • regionIsoCode
  • regionName
  • user

Here is a sample record:

{
  "timestamp":"2015-09-12T20:03:45.018Z",
  "channel":"#en.wikipedia",
  "namespace":"Main"
  "page":"Spider-Man's powers and equipment",
  "user":"foobar",
  "comment":"/* Artificial web-shooters */",
  "cityName":"New York",
  "regionName":"New York",
  "regionIsoCode":"NY",
  "countryName":"United States",
  "countryIsoCode":"US",
  "isAnonymous":false,
  "isNew":false,
  "isMinor":false,
  "isRobot":false,
  "isUnpatrolled":false,
  "added":99,
  "delta":99,
  "deleted":0,
}

7. Load data from a file

1) Prepare the data and define the ingestion task

A data load is initiated by submitting an ingestion task spec to the Druid overlord. For this tutorial, we'll be loading the sample Wikipedia page edits data.
examples/wikipedia-index.json defines an ingestion task that reads the data in quickstart/wikiticker-2015-09-12-sampled.json.gz:

{
  "type" : "index",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "wikipedia",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "dimensionsSpec" : {
            "dimensions" : [
              "channel",
              "cityName",
              "comment",
              "countryIsoCode",
              "countryName",
              "isAnonymous",
              "isMinor",
              "isNew",
              "isRobot",
              "isUnpatrolled",
              "metroCode",
              "namespace",
              "page",
              "regionIsoCode",
              "regionName",
              "user",
              { "name": "added", "type": "long" },
              { "name": "deleted", "type": "long" },
              { "name": "delta", "type": "long" }
            ]
          },
          "timestampSpec": {
            "column": "time",
            "format": "iso"
          }
        }
      },
      "metricsSpec" : [],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none",
        "intervals" : ["2015-09-12/2015-09-13"],
        "rollup" : false
      }
    },
    "ioConfig" : {
      "type" : "index",
      "firehose" : {
        "type" : "local",
        "baseDir" : "quickstart/",
        "filter" : "wikiticker-2015-09-12-sampled.json.gz"
      },
      "appendToExisting" : false
    },
    "tuningConfig" : {
      "type" : "index",
      "targetPartitionSize" : 5000000,
      "maxRowsInMemory" : 25000,
      "forceExtendableShardSpecs" : true
    }
  }
}

This spec creates a datasource named "wikipedia".
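
Before submitting, it can be worth checking that the spec file is well-formed JSON, since the overlord will reject a malformed spec:

python -m json.tool examples/wikipedia-index.json > /dev/null && echo "spec is valid JSON"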

2) Load the batch data

From the druid-0.12.3 directory, submit the ingestion task via POST:

curl -X 'POST' -H 'Content-Type:application/json' -d @examples/wikipedia-index.json http://localhost:8090/druid/indexer/v1/task

If the task is submitted successfully, the console prints the task ID:

{"task":"index_wikipedia_2018-06-09T21:30:32.802Z"}

You can check the status of your submitted ingestion task in the overlord console at http://localhost:8090/console.html. Refresh the console periodically; when the task succeeds, you will see its status change to "SUCCESS".
Once the ingestion task finishes, the data is loaded by the historical node and becomes queryable within a minute or two. You can monitor the loading progress in the coordinator console at http://localhost:8081/#/; the "wikipedia" datasource is "fully available" once it is marked with a blue circle.
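
Both checks can also be scripted against the HTTP APIs instead of the web consoles; a small sketch, substituting the task ID returned by your own submission:

# poll the status of the ingestion task
curl http://localhost:8090/druid/indexer/v1/task/index_wikipedia_2018-06-09T21:30:32.802Z/status

# list the datasources the coordinator knows about; "wikipedia" should appear once loaded
curl http://localhost:8081/druid/coordinator/v1/datasources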
