Hello Samza is not so easy

Why is it not so easy to say hello? Because over the course of the walkthrough you not only wait close to an hour for YARN, Kafka, and ZooKeeper to download, you also run into two snags that keep the commands from completing. I will point both of them out, using the original tutorial text as the backbone.

Hello Samza

The hello-samza project is a stand-alone project designed to help you run your first Samza job.

Get the Code

You'll need to check out and publish Samza, since it's not available in a Maven repository right now.

git clone http://git-wip-us.apache.org/repos/asf/incubator-samza.git
cd incubator-samza
./gradlew -PscalaVersion=2.8.1 clean publishToMavenLocal
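
If you want to make sure the publish actually landed before moving on, you can peek into your local Maven repository. This is just a sanity check, assuming Gradle's default ~/.m2/repository location and the org.apache.samza group id used by the build:

ls ~/.m2/repository/org/apache/samza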

Next, check out the hello-samza project.

git clone git://github.com/linkedin/hello-samza.git

This project contains everything you'll need to run your first Samza jobs.

Start a Grid

 
A Samza grid usually comprises three different systems: YARN, Kafka, and ZooKeeper. The hello-samza project comes with a script called "grid" to help you set up these systems. Start by running:
 
Here is the first landmine: before running this command, you need to change the download URLs on lines 18 and 20 of the grid script to addresses that are actually reachable, for example:
DOWNLOAD_KAFKA=https://dist.apache.org/repos/dist/release/kafka/0.8.0/kafka_2.8.0-0.8.0.tar.gz
DOWNLOAD_ZOOKEEPER=http://apache.mirrors.pair.com/zookeeper/zookeeper-3.4.5/zookeeper-3.4.5.tar.gz
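
If you would rather not edit the script by hand, two sed one-liners can rewrite the variables in place. This is only a sketch: it assumes the DOWNLOAD_KAFKA and DOWNLOAD_ZOOKEEPER assignments each sit on their own line in bin/grid, and it uses GNU sed's -i (on Mac OS X, use sed -i '' instead):

sed -i 's|^DOWNLOAD_KAFKA=.*|DOWNLOAD_KAFKA=https://dist.apache.org/repos/dist/release/kafka/0.8.0/kafka_2.8.0-0.8.0.tar.gz|' bin/grid
sed -i 's|^DOWNLOAD_ZOOKEEPER=.*|DOWNLOAD_ZOOKEEPER=http://apache.mirrors.pair.com/zookeeper/zookeeper-3.4.5/zookeeper-3.4.5.tar.gz|' bin/grid
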
Once the URLs are updated, you can run the command below.
 
bin/grid

This command will download, install, and start ZooKeeper, Kafka, and YARN. All package files will be put in a sub-directory called "deploy" inside hello-samza's root folder.

If you get a complaint that JAVA_HOME is not set, then you'll need to set it. This can be done on Mac OSX by running:

export JAVA_HOME=$(/usr/libexec/java_home)

Once the grid command completes, you can verify that YARN is up and running by going to http://localhost:8088. This is the YARN UI.
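
If you are on a headless box, or just want something scriptable, you can also poke the ResourceManager address with curl; a 2xx or 3xx status code means the UI is answering. This assumes only the default localhost:8088 address mentioned above:

curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8088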

Build a Samza Job Package

Before you can run a Samza job, you need to build a package for it. This package is what YARN uses to deploy your jobs on the grid.

mvn clean package
mkdir -p deploy/samza
tar -xvf ./samza-job-package/target/samza-job-package-0.7.0-dist.tar.gz -C deploy/samza
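
It is worth a quick look at what was unpacked, since the commands below refer to files inside deploy/samza. The exact contents depend on the samza-job-package build, but you should roughly see bin/, config/, and lib/ directories:

ls deploy/samza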

Run a Samza Job

 
After you've built your Samza package, you can start a job on the grid using the run-job.sh script.
 
The second snag: before running this command, change the 6667 on line 35 of $PWD/deploy/samza/config/wikipedia-feed.properties to 6665. Port 6667 may be unreachable, in which case you will never see any data come through Kafka. With that change in place, you can run the script below without worry :)
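
As with the grid script, sed can make this edit for you. A sketch, assuming 6667 appears only in that one broker address inside the file, and GNU sed's -i (on Mac OS X, sed -i ''):

sed -i 's/6667/6665/' deploy/samza/config/wikipedia-feed.properties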
 
deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-feed.properties

The job will consume a feed of real-time edits from Wikipedia, and produce them to a Kafka topic called "wikipedia-raw". Give the job a minute to startup, and then tail the Kafka topic:

deploy/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic wikipedia-raw
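
One thing to keep in mind: by default the console consumer only prints messages that arrive after it connects, so if you attach late and the screen stays empty, replaying the topic from the start confirms whether data is there. This relies on the standard --from-beginning flag of the Kafka 0.8 console consumer:

deploy/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic wikipedia-raw --from-beginning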

Pretty neat, right? Now, check out the YARN UI again (http://localhost:8088). This time around, you'll see your Samza job is running!

Generate Wikipedia Statistics

Let's calculate some statistics based on the messages in the wikipedia-raw topic. Start two more jobs:

deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-parser.properties
deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-stats.properties

The first job (wikipedia-parser) parses the messages in wikipedia-raw, and extracts information about the size of the edit, who made the change, etc. You can take a look at its output with:

deploy/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic wikipedia-edits

The last job (wikipedia-stats) reads messages from the wikipedia-edits topic, and calculates counts, every ten seconds, for all edits that were made during that window. It outputs these counts to the wikipedia-stats topic.

deploy/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic wikipedia-stats

The messages in the stats topic look like this:

{"is-talk":2,"bytes-added":5276,"edits":13,"unique-titles":13}
{"is-bot-edit":1,"is-talk":3,"bytes-added":4211,"edits":30,"unique-titles":30,"is-unpatrolled":1,"is-new":2,"is-minor":7}
{"bytes-added":3180,"edits":19,"unique-titles":19,"is-unpatrolled":1,"is-new":1,"is-minor":3}
{"bytes-added":2218,"edits":18,"unique-titles":18,"is-unpatrolled":2,"is-new":2,"is-minor":3}

If you check the YARN UI again, you'll see that all three jobs are now listed.
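
If you prefer the terminal to the web UI, the same list is available from YARN's command-line client. A sketch, assuming the grid script unpacked YARN under deploy/yarn:

deploy/yarn/bin/yarn application -list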

Shutdown

After you're done, you can clean everything up using the same grid script.

bin/grid stop yarn
bin/grid stop kafka
bin/grid stop zookeeper
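
To double-check that everything really went down, the JDK's jps tool lists the Java processes still running; after a clean shutdown you should no longer see the ResourceManager, NodeManager, Kafka, or QuorumPeerMain processes the grid started (names are indicative and can vary by version):

jps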

Congratulations! You've now setup a local grid that includes YARN, Kafka, and ZooKeeper, and run a Samza job on it. Next up, check out the Background and API Overview pages.
