Oracle JDK
A system that already runs Elasticsearch should have a JDK configured; JDK 7 is recommended.
Scala distribution
Spark runs on Scala, and Scala is also the official language for writing Spark analytics programs; Scala 2.11 is recommended.
Hadoop distribution
Hadoop YARN provides resource management for Spark jobs and HDFS provides the storage; Apache Hadoop 2.7.3 is recommended.
Spark distribution
Used for the distributed computation itself; Apache Spark 2.0.2 is recommended.
ES-Hadoop plugin (only needed if you want to interact with Elasticsearch)
es-hadoop is the connector that integrates Hadoop/Spark with Elasticsearch; es-hadoop 5.1.1 is recommended.
mkdir /usr/local/java && cd /usr/local/java
wget "http://download.oracle.com/otn/java/jdk/7u76-b13/jdk-7u76-linux-x64.tar.gz"
tar -zxf jdk-7u76-linux-x64.tar.gz && rm -f jdk-7u76-linux-x64.tar.gz
Add the following variables to /etc/profile:
export JAVA_HOME=/usr/local/java/jdk1.7.0_76
export JRE_HOME=/usr/local/java/jdk1.7.0_76/jre
export PATH=$PATH:/usr/local/java/jdk1.7.0_76/bin
export CLASSPATH=./:/usr/local/java/jdk1.7.0_76/lib:/usr/local/java/jdk1.7.0_76/jre/lib
source /etc/profile
mkdir /usr/local/scala && cd /usr/local/scala/
wget "http://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.tgz"
tar -zxf scala-2.11.8.tgz && rm -f scala-2.11.8.tgz
echo "export SCALA_HOME=/usr/local/scala/scala-2.11.8" >> /etc/profile
echo "export PATH=$SCALA_HOME/bin:$PATH" >> /etc/profile
source /etc/profile
groupadd hadoop              # create a hadoop group
useradd hadoop -g hadoop     # create a hadoop user and add it to the hadoop group
vim /etc/sudoers             # edit sudoers to grant the hadoop user sudo rights
hadoop ALL=(ALL) ALL         # append this line at the end of sudoers
Change each machine's hostname so that the nodes are easy to tell apart.
Suppose there are three machines, one used as the master node and two used as slave nodes:
192.168.1.100 master
192.168.1.101 slave01
192.168.1.102 slave02
After renaming the hosts to master, slave01, and slave02 respectively, configure /etc/hosts on each of them:
echo "192.168.1.100 master" >> /etc/hosts
echo "192.168.1.101 slave01" >> /etc/hosts
echo "192.168.1.102 slave02" >> /etc/hosts
Configure passwordless SSH login
The Hadoop cluster requires the namenode (master node), as the hadoop user, to log in without a password to itself and to every datanode (slave node).
This is done by distributing the master node's RSA key to the SSH configuration directory of each slave node; a minimal sketch follows.
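A minimal sketch of the usual approach, run as the hadoop user on master (hostnames as configured above; the exact steps may differ in your environment):
su hadoop
ssh-keygen -t rsa              # accept the defaults and leave the passphrase empty
ssh-copy-id hadoop@master      # also authorize passwordless login to the master itself
ssh-copy-id hadoop@slave01
ssh-copy-id hadoop@slave02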
Download Hadoop 2.7.3
mkdir /usr/local/hadoop && cd /usr/local/hadoop
wget "https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz"
tar -zxf hadoop-2.7.3.tar.gz && rm -f hadoop-2.7.3.tar.gz
mkdir -p /usr/local/hadoop/hdfs/data
mkdir -p /usr/local/hadoop/hdfs/name
mkdir -p /usr/local/hadoop/tmp
chown -R hadoop:hadoop /usr/local/hadoop
cd hadoop-2.7.3 && su hadoop
echo "export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.3" >> /etc/profile
echo "export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin" >> /etc/profile
source /etc/profile
vim etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/local/java/jdk1.7.0_76
vim etc/hadoop/yarn-env.sh
export JAVA_HOME=/usr/local/java/jdk1.7.0_76
vim etc/hadoop/slaves    # list the datanode hostnames in the slaves file; adjust to your environment
slave01
slave02
vim etc/hadoop/core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
        <description>HDFS URI: filesystem://namenode-host:port</description>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/local/hadoop/tmp</value>
        <description>local temporary directory for Hadoop on the namenode</description>
    </property>
</configuration>
vim etc/hadoop/hdfs-site.xml
<configuration>
    <property>
        <name>dfs.name.dir</name>
        <value>/usr/local/hadoop/hdfs/name</value>
        <description>where the namenode stores the HDFS namespace metadata</description>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>/usr/local/hadoop/hdfs/data</value>
        <description>physical location of data blocks on each datanode</description>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
        <description>replication factor; the default is 3, and it should not exceed the number of datanodes</description>
    </property>
</configuration>
vim etc/hadoop/yarn-site.xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
    </property>
</configuration>
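A fresh Hadoop 2.7.3 tree usually ships only mapred-site.xml.template, so create the file from the template before editing it:
cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml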
vim etc/hadoop/mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
After all configuration files have been modified, copy the /usr/local/hadoop/ directory to the same location on each datanode.
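For example, with scp from the master (assuming the hadoop user may write to /usr/local on the slaves; otherwise copy to a temporary directory and move it into place with sudo):
scp -r /usr/local/hadoop hadoop@slave01:/usr/local/
scp -r /usr/local/hadoop hadoop@slave02:/usr/local/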
cd /usr/local/hadoop/hadoop-2.7.3 && su hadoop
bin/hdfs namenode -format
sbin/start-dfs.sh
sbin/start-yarn.sh
Once Hadoop is up, the HDFS status is available at http://master:50070/ and the YARN application/task status at http://master:8088/.
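You can also check with jps that the expected daemons are running; with the configuration above this should look roughly like:
jps    # on master: NameNode, SecondaryNameNode, ResourceManager
       # on each slave: DataNode, NodeManager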
mkdir /usr/local/spark/ && cd /usr/local/spark
wget "http://d3kbcqa49mib13.cloudfront.net/spark-2.0.2-bin-hadoop2.7.tgz"
tar -zxf spark-2.0.2-bin-hadoop2.7.tgz && rm -f spark-2.0.2-bin-hadoop2.7.tgz
chown -R hadoop:hadoop /usr/local/spark
cd spark-2.0.2-bin-hadoop2.7
echo "export SPARK_HOME=/usr/local/spark/spark-2.0.2-bin-hadoop2.7" >> /etc/profile
echo "export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin" >> /etc/profile
source /etc/profile
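The exports below belong in Spark's per-node environment file conf/spark-env.sh; assuming the standard Spark layout, create it from the shipped template and add them there:
cd conf && cp spark-env.sh.template spark-env.sh && vim spark-env.sh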
export JAVA_HOME=/usr/local/java/jdk1.7.0_76
export SCALA_HOME=/usr/local/scala/scala-2.11.8
export SPARK_HOME=/usr/local/spark/spark-2.0.2-bin-hadoop2.7
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.3
export SPARK_MASTER_HOST=master
export HADOOP_CONF_DIR=/usr/local/hadoop/hadoop-2.7.3/etc/hadoop/
export SPARK_HISTORY_OPTS="-Dspark.history.retainedApplications=3 -Dspark.history.fs.logDirectory=hdfs://master:9000/sparklogs"
export LD_LIBRARY_PATH=${HADOOP_HOME}/lib/native/:$LD_LIBRARY_PATH
vim spark-defaults.conf
spark.eventLog.enabled true
# the Spark jars that Spark on YARN depends on, kept on HDFS so they are not re-uploaded on every submission
spark.yarn.jars hdfs:///sparkjars/*
spark.eventLog.dir hdfs://master:9000/sparklogs
vim slaves
slave01
slave02
# Under $SPARK_HOME
# create the Spark log directory on HDFS: the directory configured for the history server above
$ hdfs dfs -mkdir /sparklogs
# copy the Spark jars to HDFS so they do not have to be uploaded again for every job
$ hdfs dfs -mkdir /sparkjars
$ cd /usr/local/spark/spark-2.0.2-bin-hadoop2.7/ && hdfs dfs -put jars/* /sparkjars/
After the Spark configuration is finished, copy the /usr/local/spark directory to the same location on the other nodes and set up the environment variables there as well.
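Again scp works from the master, with the same write-permission caveat as for the Hadoop directory:
scp -r /usr/local/spark hadoop@slave01:/usr/local/
scp -r /usr/local/spark hadoop@slave02:/usr/local/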
To start the Spark cluster, run on the master node:
cd /usr/local/spark/spark-2.0.2-bin-hadoop2.7/ && ./sbin/start-all.sh
Verify the installation with one of the bundled examples
hadoop@master:/usr/local/spark/spark-2.0.2-bin-hadoop2.7# bin/spark-submit --class org.apache.spark.examples.JavaSparkPi \
    --master spark://master:7077 examples/jars/spark-examples_2.11-2.0.2.jar
16/12/26 15:41:13 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
[Stage 0:> (0 + 0) / 2]16/12/26 15:41:20 WARN TaskSetManager: Stage 0 contains a task of very large size (981 KB). The maximum recommended task size is 100 KB.
Pi is roughly 3.13608
Once Spark is running, its current status can be viewed at http://master:8080/.
Download ES-Hadoop
mkdir /usr/local/es-hadoop && cd /usr/local/es-hadoop
wget "http://download.elastic.co/hadoop/elasticsearch-hadoop-5.1.1.zip"
unzip elasticsearch-hadoop-5.1.1.zip && rm -f elasticsearch-hadoop-5.1.1.zip
cp elasticsearch-hadoop-5.1.1/dist/elasticsearch-hadoop-5.1.1.jar /usr/local/spark/spark-2.0.2-bin-hadoop2.7/jars/
Access and manipulate Elasticsearch from Spark
hadoop@master:/usr/local/spark/spark-2.0.2-bin-hadoop2.7# ./bin/spark-submit your_spark_es_script.py
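your_spark_es_script.py is not shown here; the following is a minimal sketch of what such a script could look like, assuming Elasticsearch is reachable at 192.168.1.100:9200 and using a hypothetical index demo/docs (both are placeholders to replace). It relies on the elasticsearch-hadoop jar that was copied into Spark's jars/ directory above.

# your_spark_es_script.py -- minimal Spark <-> Elasticsearch sketch (placeholder host and index)
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spark-es-demo")
         .config("es.nodes", "192.168.1.100")   # assumption: Elasticsearch host
         .config("es.port", "9200")
         .getOrCreate())

# write a small DataFrame into the (hypothetical) index demo/docs
df = spark.createDataFrame([(1, "hello"), (2, "world")], ["id", "text"])
(df.write
   .format("org.elasticsearch.spark.sql")
   .option("es.resource", "demo/docs")
   .mode("append")
   .save())

# read the index back as a DataFrame and show the result
es_df = (spark.read
         .format("org.elasticsearch.spark.sql")
         .option("es.resource", "demo/docs")
         .load())
es_df.show()

spark.stop()

If the elasticsearch-hadoop jar was not copied into jars/, pass it to spark-submit explicitly instead, e.g. with --jars /usr/local/es-hadoop/elasticsearch-hadoop-5.1.1/dist/elasticsearch-hadoop-5.1.1.jar.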