OpenStack has become the first choice of many cloud vendors for building private clouds, and many academic institutions also use it to build small test environments for student experiments. This post shares the process of setting up Hadoop 2.2.0 on OpenStack virtual machines.
1. VM Environment Preparation
OpenStack version: Folsom
a. Launch three test virtual machines running Ubuntu-12.04.2-x86_64.
b. Configure IP addresses. In the Folsom release the network uses FlatDHCP mode, so the VMs obtain fixed IPs in the 10.0.x.x range; the /etc/hosts file therefore has to be configured inside each VM.
# vim /etc/hosts
127.0.0.1   localhost localhost.localdomain
10.0.0.225  hdp-server-01
10.0.1.19   hdp-server-02
10.0.1.17   hdp-server-03
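To confirm the hostname mappings work, a quick check from any of the hosts (using the hostnames defined above):

$ ping -c 1 hdp-server-02
$ ping -c 1 hdp-server-03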
c. As root, create a user named yarn on every machine, using the same password:

# useradd -m -s /bin/bash yarn
# passwd yarn
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully

d. Set up passwordless ssh access between the machines.
# On every machine:
$ su yarn
$ cd ~
$ ssh-keygen -t rsa
$ cat .ssh/id_rsa.pub >> .ssh/authorized_keys
# Use "ssh localhost" to test passwordless access.
# Then append the contents of each machine's .ssh/authorized_keys to the .ssh/authorized_keys files of the other machines.

e. Using the yarn account, ssh between the machines via the hostnames in /etc/hosts and verify that no password is required.
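As an alternative to copying authorized_keys by hand, Ubuntu's openssh client ships an ssh-copy-id helper; a sketch of distributing the key, to be repeated from each host:

$ ssh-copy-id yarn@hdp-server-02
$ ssh-copy-id yarn@hdp-server-03
$ ssh hdp-server-02    # should log in without prompting for a password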
Because the VMs run a 64-bit operating system, the release downloaded from the official site cannot be used directly (its bundled native libraries are 32-bit builds); Hadoop must be compiled by hand. The build process follows.
2. Compiling Hadoop 2.2.0
a. Configure the JDK environment variables, assuming the JDK is installed at /usr/local/java/jdk1.7.0_45.
$ su yarn
$ vim ~/.bashrc
# Append the following:
export JAVA_HOME=/usr/local/java/jdk1.7.0_45
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
$ source ~/.bashrc   # apply the settings
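A quick way to confirm the JDK is now picked up (the paths and version assume the jdk1.7.0_45 install above):

$ which java      # should print /usr/local/java/jdk1.7.0_45/bin/java
$ java -version   # should report version 1.7.0_45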
b. Install the build dependencies:

$ sudo apt-get install g++ autoconf automake libtool cmake zlib1g-dev pkg-config libssl-dev
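The next step assumes the protobuf 2.5.0 source tree is already unpacked under $HOME. If it is not, a sketch of fetching it (the Google Code URL is where the 2.5.0 release was hosted at the time of writing and may no longer resolve):

$ cd ~
$ wget https://protobuf.googlecode.com/files/protobuf-2.5.0.tar.gz
$ tar -xzvf protobuf-2.5.0.tar.gz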
c. Build and install protobuf 2.5.0 (the Hadoop build requires protoc 2.5.0):

$ cd $HOME/protobuf-2.5.0
$ ./configure --prefix=/usr
$ sudo make
$ sudo make check
$ sudo make install
$ protoc --version
libprotoc 2.5.0
d. Install Maven:

$ sudo apt-get install maven
e. Unpack the Hadoop source and build:

$ cd ~
$ tar -xzvf hadoop-2.2.0-src.tar.gz
$ cd hadoop-2.2.0-src/
$ mvn package -Pdist,native -DskipTests -Dtar
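If the build succeeds, the packaged distribution ends up under hadoop-dist/target/ in the source tree; unpack it to $HOME so it matches the paths used in the rest of this post:

$ cp ~/hadoop-2.2.0-src/hadoop-dist/target/hadoop-2.2.0.tar.gz ~
$ cd ~
$ tar -xzvf hadoop-2.2.0.tar.gz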
Verify the build result:
yarn@hdp-server-01:~$ $HOME/hadoop-2.2.0/bin/hadoop version
Hadoop 2.2.0
Subversion Unknown -r Unknown
Compiled by yarn on 2013-11-05T06:41Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
This command was run using /home/yarn/hadoop-2.2.0/share/hadoop/common/hadoop-common-2.2.0.jar
yarn@hdp-server-01:~$ file $HOME/hadoop-2.2.0/lib/native/*
/home/yarn/hadoop-2.2.0/lib/native/libhadoop.a:        current ar archive
/home/yarn/hadoop-2.2.0/lib/native/libhadooppipes.a:   current ar archive
/home/yarn/hadoop-2.2.0/lib/native/libhadoop.so:       ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=0xaa74c9d23bfe750f160412e4465b14c88cf1c650, not stripped
/home/yarn/hadoop-2.2.0/lib/native/libhadoop.so.1.0.0: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=0xaa74c9d23bfe750f160412e4465b14c88cf1c650, not stripped
/home/yarn/hadoop-2.2.0/lib/native/libhadooputils.a:   current ar archive
/home/yarn/hadoop-2.2.0/lib/native/libhdfs.a:          current ar archive
/home/yarn/hadoop-2.2.0/lib/native/libhdfs.so:         ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=0x89671252f3c5fb7034425e80c9d31ea67da75c4d, not stripped
/home/yarn/hadoop-2.2.0/lib/native/libhdfs.so.0.0.0:   ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=0x89671252f3c5fb7034425e80c9d31ea67da75c4d, not stripped
3. Installing and Configuring Hadoop 2.2.0
Assume the node roles are assigned as follows:
hdp-server-01: resourcemanager, nodemanager, proxyserver, historyserver, datanode, namenode
hdp-server-02: datanode, nodemanager
hdp-server-03: datanode, nodemanager
a. Prepare directories:
$ mkdir -p ~/yarn_data/tmp
$ mkdir -p ~/yarn_data/mapred
b. Configure environment variables (append to ~/.bashrc):
# hadoop env
export HADOOP_HOME="$HOME/hadoop-2.2.0"
export HADOOP_PREFIX="$HADOOP_HOME/"
export YARN_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME="$HADOOP_HOME"
export HADOOP_COMMON_HOME="$HADOOP_HOME"
export HADOOP_HDFS_HOME="$HADOOP_HOME"
export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop/"
export YARN_CONF_DIR=$HADOOP_CONF_DIR
export PATH="$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH"
c. Patch hadoop-config.sh. This step follows http://www.cnblogs.com/lucius/p/3435296.html, where the author points out a bug in the script; after applying the change my cluster ran correctly.
$ cd $YARN_HOME/libexec/
$ vim hadoop-config.sh
# Change line 96 to:
export HADOOP_SLAVES="${HADOOP_CONF_DIR}/$1"
# Save and quit vim.
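The HADOOP_SLAVES variable set above points at the slaves file in the configuration directory, which the cluster start scripts read; for the role assignment in this post it should list every node running a datanode/nodemanager, one hostname per line:

$ vim $HADOOP_CONF_DIR/slaves
hdp-server-01
hdp-server-02
hdp-server-03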
d. Edit the configuration files under $YARN_HOME/etc/hadoop/:

<!-- $YARN_HOME/etc/hadoop/core-site.xml -->
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hdp-server-01:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/yarn/yarn_data/tmp/hadoop-grid</value>
  </property>
</configuration>
<!-- $YARN_HOME/etc/hadoop/hdfs-site.xml -->
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
<!-- $YARN_HOME/etc/hadoop/yarn-site.xml -->
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>hdp-server-01:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>hdp-server-01:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>hdp-server-01:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>hdp-server-01:8030</value>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/home/yarn/yarn_data/mapred/nodemanager</value>
    <final>true</final>
  </property>
  <property>
    <name>yarn.web-proxy.address</name>
    <value>hdp-server-01:8888</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
<!-- $YARN_HOME/etc/hadoop/mapred-site.xml -->
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

The configuration is now complete. Copy the $HOME/hadoop-2.2.0 and $HOME/yarn_data directories to the same locations on the other machines, making sure the copied files are owned by yarn.
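A minimal way to push the two directories out, relying on the passwordless ssh set up earlier:

$ scp -r $HOME/hadoop-2.2.0 yarn@hdp-server-02:~/
$ scp -r $HOME/yarn_data yarn@hdp-server-02:~/
$ scp -r $HOME/hadoop-2.2.0 yarn@hdp-server-03:~/
$ scp -r $HOME/yarn_data yarn@hdp-server-03:~/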
e. Format HDFS:
$ hdfs namenode -format
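If formatting succeeds, the output should end with a message reporting that the storage directory (under /home/yarn/yarn_data/tmp/hadoop-grid) has been successfully formatted.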
f. Start the services. Which daemons to start on each VM depends on the roles assigned above.

@hdp-server-01
$ cd $YARN_HOME
$ sbin/hadoop-daemon.sh --script hdfs start namenode    # start namenode
$ sbin/hadoop-daemon.sh --script hdfs start datanode    # start datanode
$ sbin/yarn-daemon.sh start nodemanager                 # start nodemanager
$ sbin/yarn-daemon.sh start resourcemanager             # start resourcemanager
$ sbin/yarn-daemon.sh start proxyserver                 # start the web app proxy
$ sbin/mr-jobhistory-daemon.sh start historyserver      # start the job history server

Check with jps:
$ jps
8770 ResourceManager
11609 Jps
8644 NodeManager
9071 JobHistoryServer
8479 NameNode
9000 WebAppProxyServer
8552 DataNode

@hdp-server-02, @hdp-server-03
$ cd $YARN_HOME
$ sbin/yarn-daemon.sh start nodemanager                 # start nodemanager
$ sbin/hadoop-daemon.sh --script hdfs start datanode    # start datanode

Check with jps:
$ jps
6691 NodeManager
9089 Jps
6787 DataNode
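Beyond jps, you can confirm from hdp-server-01 that all three datanodes registered with the namenode, and check the web UIs (the ports below are the Hadoop 2.2.0 defaults):

$ bin/hdfs dfsadmin -report    # should report 3 live datanodes

NameNode web UI: http://hdp-server-01:50070
ResourceManager web UI: http://hdp-server-01:8088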
g. Run the bundled pi example to verify the cluster:

$ cd $YARN_HOME
$ bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 10 1000
Number of Maps  = 10
Samples per Map = 1000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
13/12/22 17:50:42 INFO client.RMProxy: Connecting to ResourceManager at hdp-server-01/10.0.0.225:8032
13/12/22 17:50:43 INFO input.FileInputFormat: Total input paths to process : 10
13/12/22 17:50:43 INFO mapreduce.JobSubmitter: number of splits:10
13/12/22 17:50:43 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
13/12/22 17:50:43 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
13/12/22 17:50:43 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
13/12/22 17:50:43 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
13/12/22 17:50:43 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
13/12/22 17:50:43 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
13/12/22 17:50:43 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
13/12/22 17:50:43 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
13/12/22 17:50:43 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class
13/12/22 17:50:43 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
13/12/22 17:50:43 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
13/12/22 17:50:43 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
13/12/22 17:50:43 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
13/12/22 17:50:43 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
13/12/22 17:50:43 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
13/12/22 17:50:43 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
13/12/22 17:50:44 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1387700249346_0004
13/12/22 17:50:44 INFO impl.YarnClientImpl: Submitted application application_1387700249346_0004 to ResourceManager at hdp-server-01/10.0.0.225:8032
13/12/22 17:50:44 INFO mapreduce.Job: The url to track the job: http://hdp-server-01:8888/proxy/application_1387700249346_0004/
13/12/22 17:50:44 INFO mapreduce.Job: Running job: job_1387700249346_0004
13/12/22 17:50:53 INFO mapreduce.Job: Job job_1387700249346_0004 running in uber mode : false
13/12/22 17:50:53 INFO mapreduce.Job:  map 0% reduce 0%
13/12/22 17:51:03 INFO mapreduce.Job:  map 40% reduce 0%
13/12/22 17:51:13 INFO mapreduce.Job:  map 90% reduce 0%
13/12/22 17:51:14 INFO mapreduce.Job:  map 100% reduce 0%
13/12/22 17:51:15 INFO mapreduce.Job:  map 100% reduce 100%
13/12/22 17:51:16 INFO mapreduce.Job: Job job_1387700249346_0004 completed successfully
13/12/22 17:51:16 INFO mapreduce.Job: Counters: 43
    File System Counters
        FILE: Number of bytes read=226
        FILE: Number of bytes written=878638
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=2680
        HDFS: Number of bytes written=215
        HDFS: Number of read operations=43
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=3
    Job Counters
        Launched map tasks=10
        Launched reduce tasks=1
        Data-local map tasks=10
        Total time spent by all maps in occupied slots (ms)=142127
        Total time spent by all reduces in occupied slots (ms)=8333
    Map-Reduce Framework
        Map input records=10
        Map output records=20
        Map output bytes=180
        Map output materialized bytes=280
        Input split bytes=1500
        Combine input records=0
        Combine output records=0
        Reduce input groups=2
        Reduce shuffle bytes=280
        Reduce input records=20
        Reduce output records=0
        Spilled Records=40
        Shuffled Maps =10
        Failed Shuffles=0
        Merged Map outputs=10
        GC time elapsed (ms)=2606
        CPU time spent (ms)=11090
        Physical memory (bytes) snapshot=2605563904
        Virtual memory (bytes) snapshot=11336945664
        Total committed heap usage (bytes)=2184183808
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=1180
    File Output Format Counters
        Bytes Written=97
Job Finished in 34.098 seconds
Estimated value of Pi is 3.14080000000000000000
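As a further smoke test of HDFS itself, you can round-trip a small file (/tmp/hello.txt below is an arbitrary example):

$ cd $YARN_HOME
$ echo hello > /tmp/hello.txt
$ bin/hdfs dfs -mkdir /test
$ bin/hdfs dfs -put /tmp/hello.txt /test/
$ bin/hdfs dfs -cat /test/hello.txt
hello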
4. Notes

1. Setting up Hadoop in VMs launched on OpenStack is not much different from doing so on physical machines, but pay attention to the IP addresses the VMs obtain: the floating IPs assigned by OpenStack usually cannot be used, because a floating IP is configured by nova-network for NAT forwarding and the VM itself is not aware of that address.
2. It is best for all VMs in the cluster to run the same operating system, so the compiled binaries can be reused. Because in the Hadoop 2.2.0 framework the configuration files are identical on every node, you can launch a single VM, install and configure it, and snapshot it as an image; additional nodes can then be booted from that image, differing only in which services they start.
Reference: http://www.cnblogs.com/lucius/p/3435296.html