Tags (space-separated): Big Data
"Big data collaboration frameworks" is a collective term for several auxiliary frameworks in the Hadoop 2.x ecosystem, mainly:
1. Sqoop, a data transfer tool
2. Flume, a log/file collection framework
3. Oozie, a job scheduling framework
4. Hue, a web UI for big data
Job scheduling frameworks
1. Linux Crontab (see the sketch after this list)
2. Azkaban – https://azkaban.github.io/
3. Oozie – http://oozie.apache.org/ (powerful, but with a steep learning curve)
   workflow scheduling
   coordinator scheduling (triggered by time and by data availability)
   bundle (a batch of coordinators)
4. Zeus – https://github.com/michael8335/zeus2
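As a baseline for comparison, a crontab entry is a single line; a minimal sketch that runs a hypothetical ETL script every day at 02:00 (the script path is a placeholder):
crontab -e
# m h dom mon dow command
0 2 * * * /opt/scripts/daily_etl.sh >> /var/log/daily_etl.log 2>&1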
Oozie overview
Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions.
Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.
Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts).
Oozie is a scalable, reliable and extensible system.
1. An open-source framework built on a workflow engine, contributed to Apache by Cloudera; it provides scheduling and coordination for Hadoop MapReduce and Pig jobs. Oozie must be deployed into a Java Servlet container to run.
2. Oozie workflow definitions use hPDL, a process-definition language similar to the jPDL provided by JBoss jBPM; workflows are defined as XML files. As in most workflow systems, there are nodes with different roles, such as branching, concurrency, and joining.
3. Oozie defines control flow nodes (Control Flow Nodes) and action nodes (Action Nodes). Control flow nodes define the start and end of a workflow and control its execution path (decision, fork, join, etc.); action nodes cover Hadoop map-reduce, the Hadoop file system, Pig, SSH, HTTP, eMail, and Oozie sub-workflows.
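To make the control flow nodes concrete, here is a minimal hPDL sketch with a fork and a join. It is illustrative only: the two fs actions and their /tmp paths are placeholders, and ${nameNode} would be supplied by job.properties as in the real examples below.
<workflow-app xmlns="uri:oozie:workflow:0.2" name="fork-join-demo-wf">
    <start to="forking"/>
    <fork name="forking">
        <path start="task-a"/>
        <path start="task-b"/>
    </fork>
    <action name="task-a">
        <fs>
            <mkdir path="${nameNode}/tmp/oozie-demo-a"/>
        </fs>
        <ok to="joining"/>
        <error to="fail"/>
    </action>
    <action name="task-b">
        <fs>
            <mkdir path="${nameNode}/tmp/oozie-demo-b"/>
        </fs>
        <ok to="joining"/>
        <error to="fail"/>
    </action>
    <join name="joining" to="end"/>
    <kill name="fail">
        <message>fork-join demo failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>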
tar -zxf oozie-4.0.0-cdh5.3.6.tar.gz -C /opt/app
sudo adduser oozie
sudo passwd oozie
Add proxy-user settings for the account that runs Oozie (hadoop001 here) to Hadoop's core-site.xml:
<property>
    <name>hadoop.proxyuser.hadoop001.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.hadoop001.groups</name>
    <value>*</value>
</property>
Suppose the Hadoop cluster runs under account A, and Oozie is installed on some node under account B, which belongs to group C. The proxy settings then mean: account A has permission to submit jobs on that node on behalf of group C.
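If the cluster is already running, the new proxy-user settings can be reloaded without a full restart; a sketch, run from the Hadoop home directory as the cluster account:
bin/hdfs dfsadmin -refreshSuperUserGroupsConfiguration
bin/yarn rmadmin -refreshSuperUserGroupsConfiguration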
tar -zxf oozie-hadooplibs-4.0.0-cdh5.3.6.tar.gz -C ../
mkdir libext
# from the extracted hadooplib directory, copy its jars into libext
cp -r ./* /opt/app/oozie-4.0.0-cdh5.3.6/libext/
cp ext-2.2.zip /opt/app/oozie-4.0.0-cdh5.3.6/libext
$ bin/oozie-setup.sh prepare-war
$ bin/ooziedb.sh create -sqlfile oozie.sql -run
$ bin/oozie-setup.sh sharelib create \
-fs hdfs://xingyunfei001.com.cn \
-locallib oozie-sharelib-4.0.0-cdh5.3.6-yarn.tar.gz
To start Oozie as a daemon process, run:
$ bin/oozied.sh start
Open the Oozie web console:
http://xingyunfei001.com.cn:11000/oozie/
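You can also confirm from the CLI that the server is up (it should print "System mode: NORMAL"):
$ bin/oozie admin -oozie http://localhost:11000/oozie -status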
3. Run the test examples
tar -zxf oozie-examples.tar.gz
[hadoop001@xingyunfei001 hadoop_2.5.0_cdh]$ bin/hdfs dfs -put /opt/app/oozie-4.0.0-cdh5.3.6/examples examples
Edit examples/apps/map-reduce/job.properties:
nameNode=hdfs://xingyunfei001.com.cn:8020
jobTracker=xingyunfei001.com.cn:8032
queueName=default
examplesRoot=examples
oozie.wf.application.path=${nameNode}/user/hadoop001/${examplesRoot}/apps/map-reduce/workflow.xml
outputDir=map-reduce
Point Oozie at the Hadoop configuration directory in oozie-site.xml:
<property>
<name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
<value>*=/opt/app/hadoop_2.5.0_cdh/etc/hadoop</value>
<description>
Comma separated AUTHORITY=HADOOP_CONF_DIR, where AUTHORITY is the HOST:PORT of
the Hadoop service (JobTracker, HDFS). The wildcard '*' configuration is
used when there is no exact match for an authority. The HADOOP_CONF_DIR contains
the relevant Hadoop *-site.xml files. If the path is relative is looked within
the Oozie configuration directory; though the path can be absolute (i.e. to point
to Hadoop client conf/ directories in the local filesystem.
</description>
</property>
$ bin/oozied.sh stop
$ bin/oozied.sh start
$ bin/oozie job -oozie http://localhost:11000/oozie -config examples/apps/map-reduce/job.properties -run
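The -run command prints a job ID; poll the job and fetch its log with (the ID below is a placeholder):
$ bin/oozie job -oozie http://localhost:11000/oozie -info <job-id>
$ bin/oozie job -oozie http://localhost:11000/oozie -log <job-id>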
– Summary
1. job.properties: its properties point to the HDFS location of workflow.xml (the properties file itself stays local)
2. workflow.xml (on HDFS):
*start
*action
–mapreduce/shell
–ok
–error
*kill
*end
3. lib: dependency jars (on HDFS)
1. Create the project (a custom wordcount workflow)
mkdir wordcount
cd wordcount
mkdir lib
cp hadoop-mapreduce-examples-2.5.0-cdh5.3.6.jar /opt/app/oozie-4.0.0-cdh5.3.6/examples/apps/wordcount/lib/
cp job.properties /opt/app/oozie-4.0.0-cdh5.3.6/examples/apps/wordcount/
cp workflow.xml /opt/app/oozie-4.0.0-cdh5.3.6/examples/apps/wordcount/
2. Create the input/output directories on HDFS and test-run wordcount directly
bin/hdfs dfs -mkdir ooziedir
bin/hdfs dfs -mkdir ooziedir/input
bin/hdfs dfs -mkdir ooziedir/output
bin/hdfs dfs -put /opt/app/hadoop_2.5.0_cdh/etc/hadoop/core-site.xml ooziedir/input/
bin/yarn jar /opt/app/hadoop_2.5.0_cdh/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0-cdh5.3.6.jar wordcount /user/hadoop001/ooziedir/input/core-site.xml /user/hadoop001/ooziedir/output/out001
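It is worth checking this plain MapReduce run before handing the jar to Oozie (assumes the job above finished successfully):
bin/hdfs dfs -text /user/hadoop001/ooziedir/output/out001/part*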
3. Edit the job.properties file
nameNode=hdfs://xingyunfei001.com.cn:8020
jobTracker=xingyunfei001.com.cn:8032
queueName=default
examplesRoot=examples
oozie.wf.application.path=${nameNode}/user/hadoop001/${examplesRoot}/apps/wordcount/workflow.xml
inputFile=/user/hadoop001/ooziedir/input/core-site.xml
outputDir=/user/hadoop001/ooziedir/output/outwordcount
4. Edit the workflow.xml file
<workflow-app xmlns="uri:oozie:workflow:0.2" name="map-wordcount-wf">
<start to="mr-node"/>
<action name="mr-node">
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${nameNode}/${outputDir}"/>
</prepare>
<configuration>
<!--hadoop new api-->
<property>
<name>mapred.mapper.new-api</name>
<value>true</value>
</property>
<property>
<name>mapred.reducer.new-api</name>
<value>true</value>
</property>
<!--queue-->
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
<!--1,input-->
<property>
<name>mapred.input.dir</name>
<value>${inputFile}</value>
</property>
<!--2,mapper-->
<property>
<name>mapreduce.job.map.class</name>
<value>org.apache.hadoop.examples.WordCount$TokenizerMapper</value>
</property>
<property>
<name>mapreduce.map.output.key.class</name>
<value>org.apache.hadoop.io.Text</value>
</property>
<property>
<name>mapreduce.map.output.value.class</name>
<value>org.apache.hadoop.io.IntWritable</value>
</property>
<!--3,reducer-->
<property>
<name>mapreduce.job.reduce.class</name>
<value>org.apache.hadoop.examples.WordCount$IntSumReducer</value>
</property>
<property>
<name>mapreduce.job.output.key.class</name>
<value>org.apache.hadoop.io.Text</value>
</property>
<property>
<name>mapreduce.job.output.value.class</name>
<value>org.apache.hadoop.io.IntWritable</value>
</property>
<!--4,output-->
<property>
<name>mapred.output.dir</name>
<value>${outputDir}</value>
</property>
</configuration>
</map-reduce>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
5. Upload the local project to HDFS
bin/hdfs dfs -put /opt/app/oozie-4.0.0-cdh5.3.6/examples/apps/wordcount /user/hadoop001/examples/apps/
6. Submit the job to Oozie
[hadoop001@xingyunfei001 oozie-4.0.0-cdh5.3.6]$ bin/oozie job -oozie http://localhost:11000/oozie -config examples/apps/wordcount/job.properties -run
bin/hdfs dfs -text /user/hadoop001/ooziedir/output/outwordcount/part*
Summary:
1. Keep node names to at most 20 characters; they must match [a-zA-Z][-_a-zA-Z0-9]
1. Create a local workspace (for scheduling a shell script that runs a Hive query)
cp -r hive shellhive
2. Create the shell script (hive-select.sh)
#!/bin/bash
/opt/app/hive_0.13.1_cdh/bin/hive -e "select id,url,ip from db_track.track_log limit 10;"
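A quick local sanity check of the script before scheduling it (assumes Hive is installed at the path above and the db_track.track_log table exists):
chmod +x hive-select.sh
./hive-select.sh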
3. Edit the workflow.xml file (${EXEC} and ${shellFile} are defined in job.properties)
<workflow-app xmlns='uri:oozie:workflow:0.4' name='shell-select-wf'>
<start to='shell-node' />
<action name='shell-node'>
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>${EXEC}</exec>
<file>${shellFile}#${EXEC}</file>
</shell>
<ok to="end" />
<error to="fail" />
</action>
<kill name="fail">
<message>Script failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name='end' />
</workflow-app>
4. Upload the working directory to the target directory on HDFS
bin/hdfs dfs -put /opt/app/oozie-4.0.0-cdh5.3.6/examples/apps/shellhive /user/hadoop001/examples/apps/
5. Run the Oozie job
bin/oozie job -oozie http://localhost:11000/oozie -config examples/apps/shellhive/job.properties -run
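The query result goes to the shell action's stdout, which ends up in the YARN container logs of the launcher job; with log aggregation enabled it can be retrieved as follows (the application ID is a placeholder):
bin/yarn logs -applicationId <application-id>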
--System time zone
[root@xingyunfei001 hadoop001]# cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
[root@xingyunfei001 hadoop001]# date -R
--Wed, 30 Mar 2016 22:36:08 +0800
--Change the Oozie time zone
--Edit the oozie-site.xml configuration file
<property>
<name>oozie.processing.timezone</name>
<value>GMT+0800</value>
</property>
--Restart Oozie for the change to take effect
1. Create the working directory (for a cron-style coordinator job)
[hadoop001@xingyunfei001 apps]$ mkdir cron-test
[hadoop001@xingyunfei001 cron]$ cp coordinator.xml ../cron-test
[hadoop001@xingyunfei001 shellhive]$ cp hive-select.sh ../cron-test
[hadoop001@xingyunfei001 shellhive]$ cp job.properties ../cron-test
[hadoop001@xingyunfei001 shellhive]$ cp workflow.xml ../cron-test
2. Edit the coordinator.xml configuration file
<coordinator-app name="cron-coord" frequency="${coord:minutes(12)}" start="${start}" end="${end}" timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
<action>
<workflow>
<app-path>${workflowAppUri}</app-path>
<configuration>
<property>
<name>jobTracker</name>
<value>${jobTracker}</value>
</property>
<property>
<name>nameNode</name>
<value>${nameNode}</value>
</property>
<property>
<name>queueName</name>
<value>${queueName}</value>
</property>
<property>
<name>EXEC</name>
<value>${EXEC}</value>
</property>
</configuration>
</workflow>
</action>
</coordinator-app>
3. Edit the job.properties file
nameNode=hdfs://xingyunfei001.com.cn:8020
jobTracker=xingyunfei001.com.cn:8032
queueName=default
examplesRoot=user/hadoop001/examples/apps
oozie.coord.application.path=${nameNode}/${examplesRoot}/cron-test
start=2016-03-31T00:40+0800
end=2016-03-31T00:59+0800
workflowAppUri=${nameNode}/${examplesRoot}/cron-test
EXEC=hive-select.sh
4. Edit the workflow.xml file
<workflow-app xmlns="uri:oozie:workflow:0.4" name="emp-select-wf">
<start to="shell-node"/>
<action name="shell-node">
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapreduce.job.queuename</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>${EXEC}</exec>
<file>${nameNode}/${examplesRoot}/cron-test/${EXEC}#${EXEC}</file>
</shell>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
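Before uploading, the workflow definition can be checked against the hPDL schema locally (a sketch, run from the Oozie home directory):
bin/oozie validate examples/apps/cron-test/workflow.xml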
5. Upload the local project to HDFS
bin/hdfs dfs -put /opt/app/oozie-4.0.0-cdh5.3.6/examples/apps/cron-test /user/hadoop001/examples/apps/
6. Submit the scheduled (coordinator) job
$ bin/oozie job -oozie http://localhost:11000/oozie -config examples/apps/cron-test/job.properties -run
If the current system time is already past the configured start time, the missed run is executed immediately when the job is submitted; subsequent runs are then computed from the start time and frequency set in the configuration.
The job.properties adjusted with a new start/end window:
nameNode=hdfs://xingyunfei001.com.cn:8020
jobTracker=xingyunfei001.com.cn:8032
queueName=default
examplesRoot=user/hadoop001/examples/apps
oozie.coord.application.path=${nameNode}/${examplesRoot}/cron-test
start=2016-03-31T18:25+0800
end=2016-03-31T18:40+0800
workflowAppUri=${nameNode}/${examplesRoot}/cron-test
EXEC=hive-select.sh
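Once running, the coordinator can be managed from the same CLI (the coordinator job ID, ending in "-C", is printed at submission; a placeholder is used below):
bin/oozie job -oozie http://localhost:11000/oozie -info <coord-job-id>
bin/oozie job -oozie http://localhost:11000/oozie -suspend <coord-job-id>
bin/oozie job -oozie http://localhost:11000/oozie -resume <coord-job-id>
bin/oozie job -oozie http://localhost:11000/oozie -kill <coord-job-id>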