Fixing Azkaban Plugin Loading Errors

Reposted from: https://secdiary.com/uncategorized/linkedin-azkaban-workflow-management-on-hdp2/

Okay guys, fire up your HDP2 sandbox for a different way of managing your Apache Pig jobs in Hadoop. I’ve been looking at designing my jobs through Oozie in Hue, and kind of figured it wasn’t worth the while if someone else had done it better. It was really annoying not being able to export workflows from Oozie in Hue (even though importing stuff is very easy).

This time it’s an interesting company which relies heavily on Hadoop coming to the rescue: LinkedIn, and their Azkaban project.

This post is based on Azkaban 2.1 and is meant to get you started quickly. Refer to the projects' main pages on GitHub IO for more up-to-date information. I’ll try to link you to the relevant parts of their guide along the way.

Basic Setup

Azkaban bases its operation on a MySQL server. Quick setup (install the package from the shell, then run the SQL statements in the mysql client):

yum install mysql-server
CREATE DATABASE azkaban;
CREATE USER 'azkaban'@'localhost' IDENTIFIED BY 'hadoop';
GRANT SELECT,INSERT,UPDATE,DELETE ON azkaban.* to 'azkaban'@'localhost' WITH GRANT OPTION;

Also, add max_allowed_packet=1024M to the [mysqld] section of /etc/my.cnf. You can then restart the MySQL service so the config change takes effect.

service mysqld restart
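If you would rather script the my.cnf change than edit the file by hand, something like the sketch below could work. set_mysqld_option is a hypothetical helper (it is not part of MySQL or Azkaban), it assumes the file already has a [mysqld] section, and it only recognises keys written without spaces around the equals sign. The demo runs against a scratch file, so nothing under /etc is touched.

```shell
#!/bin/sh
# Hypothetical helper (not part of MySQL or Azkaban): set key=value inside
# the [mysqld] section of a my.cnf-style file. Assumes the section exists.
set_mysqld_option() {
    file=$1; key=$2; value=$3
    tmp=$(mktemp)
    awk -v key="$key" -v value="$value" '
        # Track whether the current line sits inside [mysqld].
        /^\[/ { in_section = ($0 == "[mysqld]") }
        # Drop an existing "key=..." line inside [mysqld].
        in_section && index($0, key "=") == 1 { next }
        { print }
        # Write the new value right after the section header.
        $0 == "[mysqld]" { print key "=" value }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Demo on a scratch file rather than the real /etc/my.cnf.
demo_cnf=$(mktemp)
printf '[mysqld]\nmax_allowed_packet=16M\nuser=mysql\n\n[mysqld_safe]\npid-file=/var/run/mysqld/mysqld.pid\n' > "$demo_cnf"
set_mysqld_option "$demo_cnf" max_allowed_packet 1024M
cat "$demo_cnf"
```

Against the real file this would be called as set_mysqld_option /etc/my.cnf max_allowed_packet 1024M, followed by the service restart.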

Simply set up the database and populate it:

cd /usr/local/src
wget https://s3.amazonaws.com/azkaban2/azkaban2/2.1/azkaban-sql-script-2.1.tar.gz
tar -xvzf azkaban-sql-script-2.1.tar.gz
cd azkaban-2.1/
mysql -D azkaban < create-all-sql-2.1.sql

Now that we have Azkaban’s back-end database up and running, we can set up the web interface. If extracted in the same folder as the SQL scripts, it’ll place itself neatly inside. I don’t really know why they decided to ship these separately.

cd /usr/local/src
wget https://s3.amazonaws.com/azkaban2/azkaban2/2.1/azkaban-web-server-2.1.tar.gz
tar -xvzf azkaban-web-server-2.1.tar.gz

First generate a keystore for the SSL connectors.

cd azkaban-2.1
keytool -keystore keystore -alias jetty -genkey -keyalg RSA

Also, you need the MySQL JDBC connector (Connector/J). On the current sandbox it will be something like this:

wget http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.27.zip
unzip mysql-connector-java-5.1.27.zip
cp mysql-connector-java-5.1.27/mysql-connector-java-5.1.27-bin.jar extlib/

You can now configure conf/azkaban.properties accordingly (set the username, password and database). The users are defined in conf/azkaban-users.xml.
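A minimal sketch of scripting those edits follows. set_prop is a hypothetical helper that replaces a key in a Java properties file or appends it when absent. The key names mysql.database, mysql.user and mysql.password are assumptions about what the stock Azkaban 2.x file uses, so check them against your own conf/azkaban.properties first. The values match the database and user created earlier.

```shell
#!/bin/sh
# Hypothetical helper: replace "key=..." in a properties file, or append the
# key if it is not there yet. Only matches keys at the start of a line.
set_prop() {
    file=$1; key=$2; value=$3
    tmp=$(mktemp)
    awk -v key="$key" -v value="$value" '
        index($0, key "=") == 1 { print key "=" value; found = 1; next }
        { print }
        END { if (!found) print key "=" value }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Demo on a scratch file; the key names below are assumptions, see above.
demo_conf=$(mktemp)
printf 'database.type=mysql\nmysql.user=someone\n' > "$demo_conf"
set_prop "$demo_conf" mysql.database azkaban
set_prop "$demo_conf" mysql.user azkaban
set_prop "$demo_conf" mysql.password hadoop
cat "$demo_conf"
```

Writing through a temporary file and then copying the content back keeps the original file's ownership and permissions intact.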

When starting the web server for the first time I got an error asking for a temporary storage area set by $tmpdir. Curious, I did a little research and found a Google Groups discussion on the subject, stating:

If you use azkaban in large scale systems, it is important that tmpdir is managed/configured carefully, because a lot of programs use that space. Defaulting it to /tmp could cause operation troubles.

So for our setup we’ll just use /tmp, but keep it in mind. Anyway, bin/azkaban-web-start.sh ended up looking like this:

HADOOP_HOME=/etc/hadoop/conf/
azkaban_dir=$(dirname $(readlink -f $0))/..
base_dir=$azkaban_dir
tmpdir=/tmp

You can now start the web server by running bin/azkaban-web-start.sh, and go to https://:8443. Default username and password can be found in conf/azkaban-users.xml (azkaban/azkaban).

You’ve now set up the front-end of the job scheduler. Next you need to set up what LinkedIn calls the executor back-end. It’s really just what makes the things you configure in the front-end execute in the back-end. It’s fairly straightforward as well, same procedure:

cd /usr/local/src
wget https://s3.amazonaws.com/azkaban2/azkaban2/2.1/azkaban-executor-server-2.1.tar.gz
tar -xvzf azkaban-executor-server-2.1.tar.gz

Once again I reconfigured the start script to read:

HADOOP_HOME=/etc/hadoop/conf/
azkaban_dir=$(dirname $(readlink -f $0))/..
base_dir=$azkaban_dir
tmpdir=/tmp
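The azkaban_dir line derives the install directory from the start script's own location. The same idea can be sketched as a function and tried out on a throwaway directory. resolve_base_dir and the demo paths are hypothetical names, and the sketch assumes a readlink that supports -f (as in GNU coreutils).

```shell
#!/bin/sh
# Hypothetical helper: print the parent of the directory that holds a script,
# following symlinks. Assumes readlink supports -f (GNU coreutils).
resolve_base_dir() {
    script_dir=$(dirname "$(readlink -f "$1")")
    (cd "$script_dir/.." && pwd -P)
}

# Demo with a throwaway layout shaped like <install>/bin/<script>.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/azkaban-demo/bin"
touch "$demo_root/azkaban-demo/bin/start.sh"
resolve_base_dir "$demo_root/azkaban-demo/bin/start.sh"
```

The dirname step matters: readlink -f returns the path of the script file itself, so it has to be reduced to its directory before /.. is appended.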

Before you start it you will want to at least explore some of the plugins. We’ll drop them in the same way:

cd /usr/local/src
echo "azkaban.jobtype.plugin.dir=plugins/jobtypes" >> azkaban-2.1/conf/azkaban.properties
mkdir -p azkaban-2.1/plugins/viewer
wget https://s3.amazonaws.com/azkaban2/azkaban-plugins/azkaban-jobtype-2.1.tar.gz
tar -xvzf azkaban-jobtype-2.1.tar.gz
mv azkaban-jobtype-2.1 azkaban-2.1/plugins/jobtypes

That’s the setup. You can now start the executor: bin/azkaban-executor-start.sh. It might print an error, so verify that it is up and running: ps -ef | grep AzkabanExec. You should also remove the job type plugins you don’t use (for me that was Hive, for instance).

I also got this error:

Plugin jobtypes failed to load. azkaban.jobtype.JobTypeManagerException: Failed to get jobtype propertiesCould not find variable substitution for variable 'hadoop.home' in key 'jobtype.classpath'.

For that I had to modify the jobtype.classpath= in e.g. plugins/jobtypes/pig-0.11.0/private.properties, and replace the hadoop.home variable like this:

jobtype.classpath=${hadoop.home}/conf,${hadoop.home}/lib/*,lib/*

To:

jobtype.classpath=/etc/hadoop/conf,/etc/hadoop/lib/*,lib/*
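The same replacement can be scripted when several job types need it. The sketch below uses a hypothetical helper, substitute_hadoop_home, which swaps every literal ${hadoop.home} in a file for a fixed path. It assumes the path contains no "|", "&" or backslash characters, since those are special to sed here.

```shell
#!/bin/sh
# Hypothetical helper: replace every literal ${hadoop.home} in a properties
# file with a fixed path. Assumes the path has no "|", "&" or backslashes.
substitute_hadoop_home() {
    file=$1; hadoop_home=$2
    tmp=$(mktemp)
    # \$ keeps the shell from expanding the reference; sed sees it literally.
    sed "s|\${hadoop\.home}|$hadoop_home|g" "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Demo on a scratch copy of the classpath line.
demo_props=$(mktemp)
printf '%s\n' 'jobtype.classpath=${hadoop.home}/conf,${hadoop.home}/lib/*,lib/*' > "$demo_props"
substitute_hadoop_home "$demo_props" /etc/hadoop
cat "$demo_props"
```

To cover every job type, one could loop over plugins/jobtypes/*/private.properties and call the helper on each file.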

I also removed all the Pig versions that I didn’t use. That made it tick.

I later noticed that there is a file, plugins/jobtypes/common.properties, where you can set all these properties.

 

The Sweet Part

You are now set up and ready to go. Let’s try setting up a Pig job. This is one thing I really like about Azkaban, by the way: you create jobs offline. Very simple and straightforward.

It’s a little boring, but we’ll just use the default word count example. There are four files in the archive: the job configuration, the Pig word count jar, the data used by Pig and a README.

What you really need to focus on here is the wordcountpig.job.

type=pig
pig.script=src/wordcountpig.pig
user.to.proxy=azkaban
HDFSRoot=/tmp
param.inDataLocal=res/rpfarewell
param.inData=${HDFSRoot}/wordcountpigin
param.outData=${HDFSRoot}/wordcountpigout
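The ${HDFSRoot} references in the job file are filled in from the HDFSRoot line above them. To see what the parameters will look like before uploading, a rough preview could be scripted as below. preview_job is a hypothetical helper and not Azkaban's own resolver: it only expands references to keys defined earlier in the same file, which is an assumption that happens to fit this example.

```shell
#!/bin/sh
# Rough preview of ${name} substitution in a .job file. This is NOT Azkaban's
# resolver: it only expands references to keys defined earlier in the file.
preview_job() {
    awk '
        {
            eq = index($0, "=")
            if (eq == 0) { print; next }      # not a key=value line
            key = substr($0, 1, eq - 1)
            val = substr($0, eq + 1)
            for (k in props) {
                ref = "${" k "}"
                out = ""
                # Replace each occurrence of ref, scanning left to right.
                while ((i = index(val, ref)) > 0) {
                    out = out substr(val, 1, i - 1) props[k]
                    val = substr(val, i + length(ref))
                }
                val = out val
            }
            props[key] = val
            print key "=" val
        }
    ' "$1"
}

# Demo with the relevant lines of the job file on a scratch copy.
demo_job=$(mktemp)
printf '%s\n' 'type=pig' 'HDFSRoot=/tmp' \
    'param.inData=${HDFSRoot}/wordcountpigin' \
    'param.outData=${HDFSRoot}/wordcountpigout' > "$demo_job"
preview_job "$demo_job"
```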

When you are done inspecting the zip archive, compress it again (zip -r pig-wc.zip pig-wc/). You are now ready to upload and execute it. Also, upload the file in the res directory to /tmp/wordcountpigin (hdfs dfs -put /tmp/rp /tmp/wordcountpigin).
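Before zipping, it can save an upload round trip to confirm that the script named in the job file is actually in the project. The sketch below is a hypothetical pre-upload check. It assumes that pig.script is resolved relative to the directory holding the .job file, which matches the layout of this example but is not confirmed against Azkaban's documentation.

```shell
#!/bin/sh
# Hypothetical pre-upload check: confirm that the pig.script named in a .job
# file exists. Assumption: the path is relative to the .job file's directory.
check_pig_script() {
    job_file=$1
    job_dir=$(dirname "$job_file")
    script=$(sed -n 's/^pig\.script=//p' "$job_file" | head -n 1)
    if [ -z "$script" ]; then
        echo "no pig.script entry in $job_file" >&2
        return 2
    fi
    if [ ! -f "$job_dir/$script" ]; then
        echo "missing: $job_dir/$script" >&2
        return 1
    fi
    echo "ok: $script"
}

# Demo on a throwaway project shaped like the word count example.
demo_project=$(mktemp -d)
mkdir -p "$demo_project/src"
touch "$demo_project/src/wordcountpig.pig"
printf 'type=pig\npig.script=src/wordcountpig.pig\n' > "$demo_project/wordcountpig.job"
check_pig_script "$demo_project/wordcountpig.job"
```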

Create a test project

Upload the inspected archive

 

Hover over the workflow to execute or review it

If you enter the workflow and execute it you should get something like this:

Successfully executed workflow

I had a lot of problems with dependencies not resolving (such as Log4j, which wasn’t on the classpath), so I chose to use the Pig jar that bundles Hadoop. I also used Pig 0.12, which worked well by just overwriting the 0.11 job type.

./bin/azkaban-executor-shutdown.sh
cd plugins/jobtypes/pig-0.11.0/
mv pig-0.11.0-withouthadoop.jar pig-0.11.0-withouthadoop.jar.original
wget ftp://apache.uib.no/pub/apache/pig/pig-0.12.0/pig-0.12.0.tar.gz
tar -xvzf pig-0.12.0.tar.gz
cp pig-0.12.0/pig-0.12.0.jar pig-0.11.0-withouthadoop.jar
rm -r pig-0.12.0

That should take you to this:

Job ran successfully, or else you’ll just see annoying red dots instead

So, that’s enough to get you started. You will probably want to have a look at scheduling and parameter substitution next.

Discussion and Conclusions

Azkaban is an excellent tool, balancing a graphical user interface and technical requirements in a much better way than Oozie. That said, I noticed during setup that the installation process could be much more streamlined, and the plugin system isn’t very consistent, from a beginner’s view at least. Once I had figured it out, though, Azkaban worked very well and connected me almost directly to the terminal.

The strongest benefit of Azkaban, I think, is the offline job editing and upload. I hope you enjoy it!
