The hello-samza project is a stand-alone project designed to help you run your first Samza job.
You'll need to check out and publish Samza to your local Maven repository, since it's not yet available in a public Maven repository.
git clone http://git-wip-us.apache.org/repos/asf/incubator-samza.git
cd incubator-samza
./gradlew -PscalaVersion=2.8.1 clean publishToMavenLocal
Next, check out the hello-samza project.
git clone git://github.com/linkedin/hello-samza.git
This project contains everything you'll need to run your first Samza jobs. It also includes a script called "grid" that sets up a local cluster for you; start by running:
bin/grid
This command will download, install, and start ZooKeeper, Kafka, and YARN. All package files will be put in a sub-directory called "deploy" inside hello-samza's root folder.
If you get a complaint that JAVA_HOME is not set, then you'll need to set it. On Mac OS X, this can be done by running:
export JAVA_HOME=$(/usr/libexec/java_home)
Once the grid command completes, you can verify that YARN is up and running by going to http://localhost:8088. This is the YARN UI.
Before you can run a Samza job, you need to build a package for it. This package is what YARN uses to deploy your jobs on the grid.
mvn clean package
mkdir -p deploy/samza
tar -xvf ./samza-job-package/target/samza-job-package-0.7.0-dist.tar.gz -C deploy/samza
After the package is built and extracted, start your first job with the run-job.sh script:
deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-feed.properties
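The --config-path flag points run-job.sh at a plain Java properties file that describes the job; the real files live in deploy/samza/config. As a rough illustration only (the key names are standard Samza configuration, but the values below are placeholders rather than the actual contents of wikipedia-feed.properties), such a file looks something like:

# Illustrative sketch only; see deploy/samza/config/wikipedia-feed.properties for the real values.
# Run the job on YARN, and give it a name to show in the YARN UI.
job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
job.name=wikipedia-feed
# Placeholder: the StreamTask implementation to run, and the <system>.<stream> it consumes.
task.class=samza.examples.wikipedia.task.YourStreamTask
task.inputs=your-system.your-input-stream
# Placeholder: where YARN can find the job package you built above.
yarn.package.path=file:///path/to/samza-job-package-0.7.0-dist.tar.gz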
The job will consume a feed of real-time edits from Wikipedia and produce them to a Kafka topic called "wikipedia-raw". Give the job a minute to start up, and then tail the Kafka topic:
deploy/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic wikipedia-raw
Pretty neat, right? Now, check out the YARN UI again (http://localhost:8088). This time around, you'll see your Samza job is running!
Let's calculate some statistics based on the messages in the wikipedia-raw topic. Start two more jobs:
deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-parser.properties
deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-stats.properties
The first job (wikipedia-parser) parses the messages in wikipedia-raw, and extracts information about the size of the edit, who made the change, etc. You can take a look at its output with:
deploy/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic wikipedia-edits
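If you're curious what a job like the parser looks like inside, a Samza job is just a Java class that implements StreamTask; Samza calls process() once per incoming message. The sketch below is illustrative only (the class and field names are made up, not the actual hello-samza source), but it shows the shape of the API:

import java.util.HashMap;
import java.util.Map;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class ExampleParserTask implements StreamTask {
  // Where parsed edits get written; "kafka" is the system name, "wikipedia-edits" the topic.
  private static final SystemStream OUTPUT = new SystemStream("kafka", "wikipedia-edits");

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
    // Each raw Wikipedia edit arrives as one message; pull out the fields we care about.
    Object raw = envelope.getMessage();
    Map<String, Object> parsed = new HashMap<String, Object>();
    parsed.put("raw", raw);
    // ... extract bytes-added, title, flags, and so on from the raw edit ...
    collector.send(new OutgoingMessageEnvelope(OUTPUT, parsed));
  }
}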
The last job (wikipedia-stats) reads messages from the wikipedia-edits topic and, every ten seconds, calculates counts for all edits that were made during that window. It outputs these counts to the wikipedia-stats topic, which you can tail in the same way:
deploy/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic wikipedia-stats
The messages in the stats topic look like this:
{"is-talk":2,"bytes-added":5276,"edits":13,"unique-titles":13}
{"is-bot-edit":1,"is-talk":3,"bytes-added":4211,"edits":30,"unique-titles":30,"is-unpatrolled":1,"is-new":2,"is-minor":7}
{"bytes-added":3180,"edits":19,"unique-titles":19,"is-unpatrolled":1,"is-new":1,"is-minor":3}
{"bytes-added":2218,"edits":18,"unique-titles":18,"is-unpatrolled":2,"is-new":2,"is-minor":3}
If you check the YARN UI again, you'll see that all three jobs are now listed.
After you're done, you can clean everything up using the same grid script:
bin/grid stop yarn
bin/grid stop kafka
bin/grid stop zookeeper
Congratulations! You've now set up a local grid that includes YARN, Kafka, and ZooKeeper, and run Samza jobs on it. Next up, check out the Background and API Overview pages.