Hadoop 2.2.0 cluster setup on Linux

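Apache Hadoop 2.2.0 is the new generation of Hadoop. It lifts the Hadoop 1.x limit of at most about 4000 machines per cluster and effectively addresses the OOM (out of memory) problems that were common before. Its new computing framework, YARN, has been called the operating system of Hadoop: it is compatible with the existing MapReduce model and can also support other parallel computing models.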

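Suppose we want to build a two-node Hadoop 2.2.0 cluster. One node, with hostname master, acts as both master and slave and runs the namenode, datanode, secondarynamenode, resourcemanager and nodemanager daemons; the other node, named slave1, acts as a slave and runs the datanode and nodemanager daemons.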

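1. Get the Hadoop binary or source package from http://mirrors.cnnic.cn/apache/hadoop/common/hadoop-2.2.0/ : use hadoop-2.2.0.tar.gz or hadoop-2.2.0-src.tar.gz

2. Create a user with the same name on every machine, for example hduser, and install Java (1.6 or 1.7)

Unpack the package, for example into the directory /home/hduser/hadoop-2.2.0


If you want to compile from source, follow steps 3, 4 and 5 below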

----------------for compile source file-----------------------

3. Download protobuf 2.5.0: https://code.google.com/p/protobuf/downloads/list , and download the latest Maven: http://maven.apache.org/download.cgi

Build protobuf 2.5.0:

  tar -xvf protobuf-2.5.0.tar.gz
  cd protobuf-2.5.0
  ./configure --prefix=/opt/protoc/
  make && make install
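
The Hadoop 2.2.0 build checks the protoc version and expects 2.5.0, so it is worth confirming which protoc is picked up before starting Maven. A minimal sketch, assuming the --prefix=/opt/protoc/ location from the configure step (so the binary would be /opt/protoc/bin/protoc); the function name is_protoc_250 is made up for illustration:

```shell
# Assumption: protobuf was installed with --prefix=/opt/protoc/, so the
# compiler is expected in /opt/protoc/bin.
export PATH="/opt/protoc/bin:$PATH"

# is_protoc_250 "<version string>": succeed only for "libprotoc 2.5.0",
# which is what "protoc --version" prints for protobuf 2.5.0.
is_protoc_250() {
    [ "$1" = "libprotoc 2.5.0" ]
}

if command -v protoc >/dev/null 2>&1; then
    if is_protoc_250 "$(protoc --version)"; then
        echo "protoc 2.5.0 found"
    else
        echo "unexpected protoc version: $(protoc --version)"
    fi
else
    echo "protoc not found on PATH"
fi
```

The export line would normally also go into the build user's shell profile so that mvn sees the same PATH.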

4. Install the required packages

On RPM-based Linux:

  yum install gcc
  yum install gcc-c++
  yum install make
  yum install cmake
  yum install openssl-devel
  yum install ncurses-devel

On Debian-based Linux:

  sudo apt-get install gcc
  sudo apt-get install g++
  sudo apt-get install make
  sudo apt-get install cmake
  sudo apt-get install libssl-dev
  sudo apt-get install libncurses5-dev

5. Build the Hadoop 2.2.0 source:

mvn clean install -DskipTests

mvn package -Pdist,native -DskipTests -Dtar 



6. If you already have a built package (for example hadoop-2.2.0.tar.gz), install and configure it as follows.

Log in to the master machine as hduser:

6.1 Install ssh

For example on Ubuntu Linux:

$ sudo apt-get install ssh
$ sudo apt-get install rsync

Now check that you can ssh to the localhost without a passphrase:
$ ssh localhost

If you cannot ssh to localhost without a passphrase, execute the following commands:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa 
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

To be able to ssh from master to the slaves without a passphrase, copy the authorized_keys file to each slave:
$ scp ~/.ssh/authorized_keys slave1:/home/hduser/.ssh/
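
If ssh still asks for a password after the keys are in place, overly permissive modes on ~/.ssh or authorized_keys are a common cause, because sshd's StrictModes check rejects them. A sketch of tightening the permissions, run here against a scratch directory so nothing real is touched; the function name fix_ssh_perms is illustrative:

```shell
# fix_ssh_perms DIR: restrict an .ssh directory and its authorized_keys
# file to the owning user.
fix_ssh_perms() {
    ssh_dir=$1
    chmod 700 "$ssh_dir"
    if [ -f "$ssh_dir/authorized_keys" ]; then
        chmod 600 "$ssh_dir/authorized_keys"
    fi
}

# Demonstration on a scratch directory with deliberately loose modes.
demo_ssh_dir=$(mktemp -d)
touch "$demo_ssh_dir/authorized_keys"
chmod 755 "$demo_ssh_dir"
chmod 644 "$demo_ssh_dir/authorized_keys"

fix_ssh_perms "$demo_ssh_dir"
ls -ld "$demo_ssh_dir" "$demo_ssh_dir/authorized_keys"
```

On the real machines this would be run as hduser on master and on slave1 as fix_ssh_perms "$HOME/.ssh".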

6.2 Set JAVA_HOME in hadoop-env.sh and yarn-env.sh in hadoop_home/etc/hadoop
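
One way to set JAVA_HOME in both files without opening an editor is sketched below. It works on scratch copies; the JDK path /usr/lib/jvm/java-7-openjdk-amd64 is only an assumed example, the sample file contents are stand-ins for the real scripts, and the function name set_java_home is illustrative. It relies on GNU sed's -i option.

```shell
# set_java_home FILE JAVA_HOME: replace an existing uncommented
# "export JAVA_HOME=..." line in FILE, or append one if there is none.
set_java_home() {
    env_file=$1
    java_home=$2
    if grep -q '^export JAVA_HOME=' "$env_file"; then
        sed -i "s|^export JAVA_HOME=.*|export JAVA_HOME=$java_home|" "$env_file"
    else
        printf 'export JAVA_HOME=%s\n' "$java_home" >> "$env_file"
    fi
}

# Scratch copies standing in for hadoop_home/etc/hadoop/*.sh.
demo_conf=$(mktemp -d)
printf '%s\n' 'export JAVA_HOME=${JAVA_HOME}' > "$demo_conf/hadoop-env.sh"
printf '%s\n' '# export JAVA_HOME=/some/commented/path' > "$demo_conf/yarn-env.sh"

# Assumed JDK location; adjust to the JDK actually installed.
for f in hadoop-env.sh yarn-env.sh; do
    set_java_home "$demo_conf/$f" /usr/lib/jvm/java-7-openjdk-amd64
done

grep -h '^export JAVA_HOME=' "$demo_conf/hadoop-env.sh" "$demo_conf/yarn-env.sh"
```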

6.3 Edit core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml in hadoop_home/etc/hadoop

A sample core-site.xml:

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hduser/temp</value>
    </property>
</configuration>

A sample hdfs-site.xml :

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/hduser/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/hduser/dfs/data</value>
    </property>
</configuration>


A sample mapred-site.xml :

<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.staging-dir</name>
        <value>/home/hduser/temp/hadoop-yarn/staging</value>
    </property>
</configuration>

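A sample yarn-site.xml (hadoop_home stands for the Hadoop installation directory, for example /home/hduser/hadoop-2.2.0, and has to be replaced with the real path):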

<configuration>
<!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
    </property>
    <property>
        <description>CLASSPATH for YARN applications. A comma-separated list of CLASSPATH entries</description>
        <name>yarn.application.classpath</name>
        <value>
            hadoop_home/etc/hadoop,
            hadoop_home/share/hadoop/common/*,
            hadoop_home/share/hadoop/common/lib/*,
            hadoop_home/share/hadoop/hdfs/*,
            hadoop_home/share/hadoop/hdfs/lib/*,
            hadoop_home/share/hadoop/mapreduce/*,
            hadoop_home/share/hadoop/mapreduce/lib/*,
            hadoop_home/share/hadoop/yarn/*,
            hadoop_home/share/hadoop/yarn/lib/*
        </value>
    </property>
</configuration>

6.4 Edit the slaves file in hadoop_home/etc/hadoop so that it contains the following

master

slave1
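
Every name listed in slaves has to resolve from the master, typically through /etc/hosts or DNS, because the start scripts ssh to each of them. A small sketch that writes the file and reports names that do not resolve; it writes to a scratch directory rather than the real hadoop_home/etc/hadoop, and the function name report_unresolved is illustrative:

```shell
# report_unresolved FILE: print each host in FILE that getent cannot resolve.
report_unresolved() {
    while IFS= read -r host; do
        # Skip blank lines.
        [ -z "$host" ] && continue
        if ! getent hosts "$host" >/dev/null 2>&1; then
            echo "cannot resolve: $host"
        fi
    done < "$1"
}

# Scratch location for the demonstration; on the cluster the target would
# be the slaves file under hadoop_home/etc/hadoop.
demo_dir=$(mktemp -d)
printf '%s\n' master slave1 > "$demo_dir/slaves"

cat "$demo_dir/slaves"
report_unresolved "$demo_dir/slaves"
```

On a machine where master and slave1 are not defined, both names are reported; on a correctly prepared node the second command prints nothing.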

Once the above is done, on the master machine, as hduser, use scp to copy the hadoop-2.2.0 directory and its contents to the same path on the other machines (-r is required to copy a directory):

scp -r /home/hduser/hadoop-2.2.0 slave1:/home/hduser/
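
With more than one slave, the copy can be driven by the slaves file. The sketch below only prints the scp commands (a dry run) so they can be reviewed before being executed; the function name print_copy_commands and the scratch slaves file are illustrative, and -r is used because the source is a directory:

```shell
# print_copy_commands SLAVES_FILE SRC_DIR LOCAL_HOST
# Print one "scp -r" command per host in SLAVES_FILE, skipping blank lines
# and LOCAL_HOST (the machine the copy starts from).
print_copy_commands() {
    slaves_file=$1
    src_dir=$2
    local_host=$3
    while IFS= read -r host; do
        [ -z "$host" ] && continue
        [ "$host" = "$local_host" ] && continue
        echo "scp -r $src_dir $host:$(dirname "$src_dir")/"
    done < "$slaves_file"
}

# Scratch slaves file with the two hosts used in this guide.
demo_slaves=$(mktemp)
printf '%s\n' master slave1 > "$demo_slaves"

print_copy_commands "$demo_slaves" /home/hduser/hadoop-2.2.0 master
# prints: scp -r /home/hduser/hadoop-2.2.0 slave1:/home/hduser/
```

Piping the printed lines to sh would perform the actual copy once they look right.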


7. Format HDFS (normally done only once, unless HDFS has failed). Run the following commands in order

  cd /home/hduser/hadoop-2.2.0/bin/
  ./hdfs namenode -format
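
Because formatting again wipes the existing NameNode metadata, it can help to check first whether the name directory already holds a formatted file system. A rough sketch, assuming the dfs.namenode.name.dir value /home/hduser/dfs/name from the sample hdfs-site.xml and that a formatted name directory contains a current/VERSION file; the function name namenode_formatted is illustrative:

```shell
# namenode_formatted DIR: succeed if DIR looks like a formatted name
# directory, i.e. it contains current/VERSION.
namenode_formatted() {
    [ -f "$1/current/VERSION" ]
}

# Assumption: same path as dfs.namenode.name.dir in the sample hdfs-site.xml.
name_dir=/home/hduser/dfs/name

if namenode_formatted "$name_dir"; then
    echo "$name_dir is already formatted; skip hdfs namenode -format"
else
    echo "$name_dir is not formatted yet; run ./hdfs namenode -format"
fi
```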

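8. Start and stop the Hadoop cluster (this can be done repeatedly; normally the cluster is left running once started, otherwise the information about application runs is lost)

  [hduser@master bin]$ cd ../sbin/
  [hduser@master sbin]$ ./start-all.sh

To stop the cluster, run ./stop-all.sh from the same sbin directory.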

9. Verify:

HDFS web UI: http://master:50070

RM (ResourceManager) UI: http://master:8088
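
The daemons can also be checked from the command line with jps, which ships with the JDK. The sketch below compares jps-style output against the daemon list given for master at the top of this guide; the sample text stands in for real jps output (the process IDs are invented), and the function name missing_daemons is illustrative:

```shell
# missing_daemons NAME...: read jps output on stdin and print every NAME
# that does not appear as a process name (second column).
missing_daemons() {
    jps_output=$(cat)
    for daemon in "$@"; do
        if ! printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            echo "$daemon"
        fi
    done
}

# Invented sample in the "PID Name" format that jps prints; NodeManager is
# deliberately left out.
sample_jps='2101 NameNode
2230 DataNode
2411 SecondaryNameNode
2570 ResourceManager
2802 Jps'

printf '%s\n' "$sample_jps" |
    missing_daemons NameNode DataNode SecondaryNameNode ResourceManager NodeManager
# prints: NodeManager
```

On master the real check would be jps | missing_daemons NameNode DataNode SecondaryNameNode ResourceManager NodeManager, and on slave1 jps | missing_daemons DataNode NodeManager; no output means nothing is missing.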


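10. Run the wordcount example

1) Create the input directory:

    hdfs dfs -mkdir -p /user/yarn/wordcount/input

Do not create /user/yarn/wordcount/output beforehand: the job creates it itself and fails if the directory already exists.

2) Create a text file test.txt with the following content: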

hello world

hello hadoop!

Then upload the file to HDFS with  hdfs dfs -put test.txt /user/yarn/wordcount/input

3) From the bin directory of the Hadoop installation, run:

yarn jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount    /user/yarn/wordcount/input     /user/yarn/wordcount/output

On success, the corresponding application in the RM UI should show a FinalStatus of SUCCEEDED, and part-r-00000 should appear under /user/yarn/wordcount/output
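
To know what part-r-00000 should contain, the same count can be reproduced locally. This is only a sketch of the expected result, not the MapReduce job itself: it assumes, as the bundled wordcount example does, that words are split on whitespace and that the output is one word, a tab and the count per line, sorted by word. The function name local_wordcount is illustrative.

```shell
# local_wordcount: read text on stdin, print "word<TAB>count" per distinct
# whitespace-separated word, sorted in byte order.
local_wordcount() {
    awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
         END { for (w in count) printf "%s\t%d\n", w, count[w] }' |
        LC_ALL=C sort
}

# Same content as test.txt above.
printf 'hello world\nhello hadoop!\n' | local_wordcount
# prints:
# hadoop!	1
# hello	2
# world	1
```

The real job output can be compared against this with hdfs dfs -cat /user/yarn/wordcount/output/part-r-00000.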


