Building a Hadoop Cluster In-House at a Small or Mid-Sized Company

This article describes my experience from a year ago installing a 3-node Hadoop cluster on the company's VMware ESX hosts.
It also covers setting up a bind9 DNS server and using apt-cacher-ng to speed up package downloads for the cluster; I may attach a tarball of the master node's /etc directory as well.
Let's begin.
Why Ubuntu for the operating system?
Pros: few components in the default install, small disk footprint, relatively popular.
Cons: somewhat unconventional configuration, and relatively few enterprise certifications.
1. Operating system
Ubuntu 10.04 LTS, amd64 edition
2. Installation: use the same OS user name on every node
ubtusr / ubtusr
3. Hostnames: plan them carefully in advance
4. Assign static IPs
Install the guest drivers for the virtualization platform (VMware Tools)
5. Under /etc/apt, edit the package sources sources.list and cloudera.list, update the hosts file, and add the repository key cloudera.key
Install apt-cacher-ng on the update-server node to cache packages locally, and point the source URLs at that server, e.g.:
deb http://dwubt01:3142/archive.cloudera.com/debian lucid-cdh3 contrib
deb-src http://dwubt01:3142/archive.cloudera.com/debian lucid-cdh3 contrib
Downloaded packages are cached under /var/cache/apt-cacher-ng; you can back up that directory on a schedule, restore it later, and then trigger a cache rescan/cleanup from
http://dwubt01:3142
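A rough sketch of the caching setup (package name as shipped in Ubuntu 10.04; the /backup path is only an example):
$ sudo apt-get install apt-cacher-ng          # on the caching node, dwubt01 in this setup
$ sudo tar czf /backup/apt-cacher-ng-$(date +%F).tar.gz /var/cache/apt-cacher-ng   # periodic cache backup
After restoring a backup, open http://dwubt01:3142 and run the cache rescan/cleanup from the maintenance page.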
 
6. Install OpenSSH, vim, and Hadoop; install sun-java6 (OpenJDK will not work)

# Replace RELEASE with your Ubuntu codename (lucid for 10.04):
$ sudo add-apt-repository "deb http://archive.canonical.com/ RELEASE partner"
$ sudo apt-get update
$ sudo apt-get install sun-java6-jdk
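A quick sanity check after the install (the JVM path below is the usual one for sun-java6-jdk on Ubuntu 10.04, so treat it as an assumption and verify on your system):
$ java -version                                # should report the Sun 1.6 JVM
$ sudo update-alternatives --display java      # confirm java-6-sun is the selected alternative
$ echo 'export JAVA_HOME=/usr/lib/jvm/java-6-sun' >> ~/.bashrc   # optional; hadoop-env.sh can also set JAVA_HOME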
 
7. Passwordless SSH login
$ ssh-keygen -t rsa
(press Enter at the passphrase prompt to leave it empty)
$ cd ~/.ssh/
$ ll
$ scp id_rsa.pub dwubt02:/home/ubtusr/.ssh/dwubt01key
ubtusr@dwubt02's password:
$ ssh ubtusr@dwubt02
$ cd .ssh
$ cat dwubt01key >> authorized_keys
After this, dwubt01 can log in to dwubt02 without a password.
In general, once the server's authorized_keys contains the client's public key, the client can reach the server without a password.
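If ssh-copy-id is available (it ships with the openssh-client package), the scp/cat steps above collapse into a single command; a sketch:
$ ssh-copy-id ubtusr@dwubt02     # appends ~/.ssh/id_rsa.pub to dwubt02's authorized_keys
Repeat for dwubt03 and dwubt04.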
8. Set up DNS with bind9, so that host names insulate the cluster from IP address changes
$ sudo apt-get install bind9
$ cd /etc/bind
In named.conf.options, add: forwarders { 10.10.10.3; };
In named.conf.local, add:
zone "dwcld"{
         type master;
         file "/etc/bind/db.dwcld";
};
zone "dwubt01" { type master; notify no; file "/etc/bind/db.dwubt01"; };
zone "dwubt02" { type master; notify no; file "/etc/bind/db.dwubt02"; };
zone "dwubt03" { type master; notify no; file "/etc/bind/db.dwubt03"; };
zone "dwubt04" { type master; notify no; file "/etc/bind/db.dwubt04"; };
zone "dwwin01" { type master; notify no; file "/etc/bind/db.dwwin01"; };
Each standalone zone has a corresponding file, e.g. db.dwubt01:
$TTL 86400       ; one day
@       IN      SOA     ns0.example.net.      hostmaster.example.net. (
                        2002061000       ; serial number YYYYMMDDNN
                        28800   ; refresh  8 hours
                        7200    ; retry    2 hours
                        864000  ; expire  10 days
                        86400 ) ; min ttl  1 day
                NS      ns0.example.net.
                NS      ns1.example.net.
                   A       10.10.10.243
*                IN      A       10.10.10.243
And db.dwcld, the zone that holds the host names for the whole cluster:
;
; BIND data file for local loopback interface
;
$TTL 604800
@     IN      SOA  dwcld. root.localhost. (
                                  2              ; Serial
                             604800            ; Refresh
                              86400            ; Retry
                            2419200            ; Expire
                             604800 )          ; Negative Cache TTL
;
@     IN      NS     localhost.
@     IN      A       127.0.0.1
@     IN      AAAA        ::1
peterhost  IN      A       10.10.10.35
dwubt01            IN      A       10.10.10.243
dwubt02            IN      A       10.10.10.244
dwubt03            IN      A       10.10.10.245
dwubt04            IN      A       10.10.10.246
dwwin01            IN      A       10.10.10.247
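Before relying on the new zones, it helps to check and reload them; a sketch, assuming bind9 runs on dwubt01 (10.10.10.243):
$ sudo named-checkconf                                  # syntax-check the named.conf.* files
$ sudo named-checkzone dwcld /etc/bind/db.dwcld         # validate the cluster zone
$ sudo service bind9 restart
$ dig @127.0.0.1 dwubt02.dwcld                          # should return 10.10.10.244
Each node then needs "nameserver 10.10.10.243" in its /etc/resolv.conf so it resolves through this server.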
 
9. Install the NameNode and SecondaryNameNode
Copy over the data directories and start the daemons.
10. Todo: install a cluster/parallel shell to make administration easier
 
Disable IPv6 (kernel level)
sudo tee -a /etc/sysctl.conf << EOF
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
EOF
sudo sysctl -p
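To confirm the setting took effect (1 means IPv6 is disabled):
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6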
 

Disable IPv6 for the Hadoop JVMs
vi /etc/hadoop-0.20/conf/hadoop-env.sh
and add:
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
(:w !sudo tee %   saves the file with sudo from inside vim)
 
Make sure that in /etc/hosts the local host name dwubt01 maps to the IP used by the cluster, not a loopback address.
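A minimal /etc/hosts for dwubt01, using the addresses from the zone files above (the Ubuntu installer maps the host name to 127.0.1.1 by default, which can make the Hadoop daemons bind to the loopback address, so that line is removed here):
127.0.0.1       localhost
10.10.10.243    dwubt01
10.10.10.244    dwubt02
10.10.10.245    dwubt03
10.10.10.246    dwubt04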
 
 
Installing CDH3 on Ubuntu Systems
https://ccp.cloudera.com/display/CDHDOC/CDH3+Installation
If you are installing CDH3 on an Ubuntu system, you can download the Cloudera packages using apt. If you want to upgrade, see Upgrading to CDH3. For a list of supported operating systems, see Supported Operating Systems for CDH3.
   
Note
If you want to create your own apt repository, create a mirror of the CDH Debian directory and then create an apt repository from the mirror.
To install CDH3 on an Ubuntu system:
1. Add a new APT source: create the file /etc/apt/sources.list.d/cloudera.list with the following contents:
deb http://archive.cloudera.com/debian <RELEASE>-cdh3 contrib
deb-src http://archive.cloudera.com/debian <RELEASE>-cdh3 contrib
where:
<RELEASE> is your OS release codename, which you can find by running
$ lsb_release -c
For example, on Ubuntu lucid, use lucid-cdh3.
   
Note
To install a different version of CDH on an Ubuntu system, specify the version number you want in the <RELEASE>-cdh3 section of the deb command. For example, to install CDH3 Beta 4 for Ubuntu maverick, use maverick-cdh3b4 in the command above.
2. (Optional) Add a repository key. Add the Cloudera Public GPG Key to your repository by executing the following command:
$ curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
This key enables you to verify that you are downloading genuine packages.
3. Update the APT package index, then search for the available Hadoop packages:
$ sudo apt-get update
Find the Hadoop packages by using your favorite APT package manager, such as apt-get, aptitude, or dselect. For example:
$ apt-cache search hadoop
4. Install the core package:
$ sudo apt-get install hadoop-0.20
   
Note
In prior versions of CDH, the hadoop-0.20 package contained all of the service scripts in /etc/init.d/. In CDH3, the hadoop-0.20 package does not contain any service scripts; instead, those scripts are contained in the hadoop-0.20-<daemon> packages. Do the following step to install the daemon packages.
Install each type of daemon package on the appropriate machine, i.e. install the corresponding Hadoop component on each server. For example, install the NameNode package on your NameNode machine:
$ sudo apt-get install hadoop-0.20-<daemon type>
where <daemon type> is one of the following
(typically the master runs namenode / secondarynamenode / jobtracker and the data nodes run datanode / tasktracker; see the sketch after this list):
namenode
datanode
secondarynamenode
jobtracker
tasktracker
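One possible split matching the role assignment above (a sketch; the host names are the ones used in this article):
$ sudo apt-get install hadoop-0.20-namenode hadoop-0.20-secondarynamenode hadoop-0.20-jobtracker   # on dwubt01
$ sudo apt-get install hadoop-0.20-datanode hadoop-0.20-tasktracker                                 # on dwubt02..dwubt04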
 
CDH3 Deployment on a Cluster
https://ccp.cloudera.com/display/CDHDOC/CDH3+Deployment+on+a+Cluster
 
To unleash the full power of Hadoop, you must deploy it on a group of machines. The CDH3 packages are designed to make cluster deployment easier.
Before you begin deploying CDH3 on a cluster, it's useful to review how Hadoop configurations work using the alternatives framework. The alternatives system is a set of commands used to manage symlinks to choose between multiple directories that fulfill the same purpose. CDH3 uses alternatives to manage your Hadoop configuration so that you can easily switch between cluster configurations. (For more information on alternatives, see the update-alternatives(8) man page on Ubuntu systems or the alternatives(8) man page on Red Hat systems.)
If you followed the deployment instructions in previous sections of this guide, you have two configurations: standalone and pseudo-distributed. To verify this, run the following commands.
To list alternative Hadoop configurations on Ubuntu and SUSE systems:
$ sudo update-alternatives --display hadoop-0.20-conf
hadoop-0.20-conf - status is auto.
link currently points to /etc/hadoop-0.20/conf.pseudo
/etc/hadoop-0.20/conf.empty - priority 10
/etc/hadoop-0.20/conf.pseudo - priority 30
Current `best' version is /etc/hadoop-0.20/conf.pseudo.
To list alternative Hadoop configurations on Red Hat systems:
$ sudo alternatives --display hadoop-0.20-conf
hadoop-0.20-conf - status is auto.
link currently points to /etc/hadoop-0.20/conf.pseudo
/etc/hadoop-0.20/conf.empty - priority 10
/etc/hadoop-0.20/conf.pseudo - priority 30
Current `best' version is /etc/hadoop-0.20/conf.pseudo.
Customizing the Configuration without Using a Configuration Package
In the standalone and pseudo-distributed modes, the packages managed the configuration information for you. For example, when you installed the hadoop-0.20-conf-pseudo package, it added itself to the list of alternative hadoop-0.20-conf configurations with a priority of 30. This informed alternatives that the pseudo-distributed configuration should take precedence over the standalone configuration. You can also customize the Hadoop configuration without using a configuration package.
To customize the Hadoop configuration without using a configuration package:
    Copy the default configuration to your custom directory:
    $ sudo cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.prd_cluster
You can call this configuration anything you like; in this example, it's called prd_cluster. Edit the configuration files in /etc/hadoop-0.20/conf.prd_cluster to suit your deployment.
    Activate the new configuration by telling alternatives that it has a higher priority than the others (here, 50); the alternatives framework then automatically uses the highest-priority configuration in the group.
    On Ubuntu and SUSE systems:
    $ sudo update-alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.prd_cluster 50
    On Red Hat systems:
    $ sudo alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.prd_cluster 50
    The symlink /etc/hadoop-0.20/conf now points to /etc/hadoop-0.20/conf.prd_cluster.
    You can verify this by running --display again.
    On Ubuntu:
    $ sudo update-alternatives --display hadoop-0.20-conf
    hadoop-0.20-conf - status is auto.
    link currently points to /etc/hadoop-0.20/conf.prd_cluster
    /etc/hadoop-0.20/conf.empty - priority 10
    /etc/hadoop-0.20/conf.pseudo - priority 30
    /etc/hadoop-0.20/conf.prd_cluster - priority 50
    Current `best' version is /etc/hadoop-0.20/conf.prd_cluster.
On Red Hat:
$ sudo alternatives --display hadoop-0.20-conf
hadoop-0.20-conf - status is auto.
link currently points to /etc/hadoop-0.20/conf.prd_cluster
/etc/hadoop-0.20/conf.empty - priority 10
/etc/hadoop-0.20/conf.pseudo - priority 30
/etc/hadoop-0.20/conf.prd_cluster - priority 50
Current `best' version is /etc/hadoop-0.20/conf.prd_cluster.
The alternatives system now knows of three Hadoop configurations and has chosen your new conf.prd_cluster configuration because it has the highest priority.
   
If you install two configurations with the same priority, alternatives will choose the one installed last.
    Removing a configuration: you can remove your configuration from the list of alternatives.
    To remove an alternative configuration on Ubuntu and SUSE systems:
    $ sudo update-alternatives --remove hadoop-0.20-conf /etc/hadoop-0.20/conf.prd_cluster
To remove an alternative configuration on Red Hat systems:
$ sudo alternatives --remove hadoop-0.20-conf /etc/hadoop-0.20/conf.prd_cluster
    Forcing a particular configuration: you can tell alternatives to manually choose a configuration without regard to priority.
    To manually set the configuration on Ubuntu and SUSE systems:
    $ sudo update-alternatives --set hadoop-0.20-conf /etc/hadoop-0.20/conf.prd_cluster
To manually set the configuration on Red Hat systems:
$ sudo alternatives --set hadoop-0.20-conf /etc/hadoop-0.20/conf.prd_cluster
Deploying your Custom Configuration to your Entire Cluster
To deploy your custom configuration to your entire cluster:
    Push the /etc/hadoop-0.20/conf.prd_cluster directory to all nodes in your cluster (see the sketch after this list).
    Set the alternatives rules on all nodes to activate your configuration.
    Restart the daemons on all nodes in your cluster using the service scripts so that the new configuration files are read.
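A sketch of pushing and activating the configuration from dwubt01 (host names are this article's; it assumes the passwordless SSH set up in step 7 and passwordless sudo on the nodes):
for h in dwubt02 dwubt03 dwubt04; do
  scp -r /etc/hadoop-0.20/conf.prd_cluster $h:/tmp/
  ssh $h "sudo cp -r /tmp/conf.prd_cluster /etc/hadoop-0.20/ && \
          sudo update-alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.prd_cluster 50 && \
          sudo service hadoop-0.20-datanode restart && \
          sudo service hadoop-0.20-tasktracker restart"
done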
Format the NameNode
Before starting the NameNode for the first time you need to format the file system.
Make sure you format the NameNode as the hdfs user; be sure to run the command under that Linux account.
$ sudo -u hdfs hadoop namenode -format
Configuring Local Storage Directories for Use by HDFS and MapReduce
When moving from a pseudo-distributed cluster to a full cluster deployment, you will need to specify, create, and assign the correct permissions to the local directories where you want the HDFS and MapReduce daemons to store data.
You specify the directories by configuring the following three properties; two properties are in the hdfs-site.xml file, and one property is in the mapred-site.xml file:

dfs.name.dir (hdfs-site.xml, on the NameNode)
This property specifies the directories where the NameNode stores its metadata and edit logs. This data is critical to the whole cluster; Cloudera recommends that you specify at least two directories, one of which is located on an NFS mount point.

dfs.data.dir (hdfs-site.xml, on each DataNode)
This property specifies the directories where the DataNode stores blocks. Cloudera recommends that you configure the disks on the DataNode in a JBOD (just a bunch of disks, no RAID needed) configuration, mounted at /data/1/ through /data/N, and configure dfs.data.dir to specify /data/1/dfs/dn through /data/N/dfs/dn/.

mapred.local.dir (mapred-site.xml, on each TaskTracker)
This property specifies the directories where the TaskTracker will store temporary data and intermediate map output files while running MapReduce jobs. Cloudera recommends that this property specify a directory on each of the JBOD mount points (again, no RAID needed); for example, /data/1/mapred/local through /data/N/mapred/local.
Here is an example configuration:
hdfs-site.xml:
<property>
 <name>dfs.name.dir</name>
 <value>/data/1/dfs/name,/nfsmount/dfs/name</value>
</property>
<property>
 <name>dfs.data.dir</name>
 <value>/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn</value>
</property>
mapred-site.xml:
<property>
 <name>mapred.local.dir</name>
 <value>/data/1/mapred/local,/data/2/mapred/local,/data/3/mapred/local</value>
</property>
After specifying these directories in the mapred-site.xml and hdfs-site.xml files, you must create the directories and assign the correct file permissions to them on each node in your cluster.
In the following instructions, local path examples are used to represent Hadoop parameters. Change the path examples to match your configuration.
Local directories:
    The dfs.name.dir parameter is represented by the /data/1/dfs/nn and /data/2/dfs/nn path examples.
    The dfs.data.dir parameter is represented by the /data/1/dfs/dn, /data/2/dfs/dn, /data/3/dfs/dn, and /data/4/dfs/dn path examples.
    The mapred.local.dir parameter is represented by the /data/1/mapred/local, /data/2/mapred/local, /data/3/mapred/local, and /data/4/mapred/local path examples.
To configure the local storage directories for use by HDFS and MapReduce on each server:
1. Create the dfs.name.dir local directories:
   $ sudo mkdir -p /data/1/dfs/nn /data/2/dfs/nn
2. Create the dfs.data.dir local directories:
   $ sudo mkdir -p /data/1/dfs/dn /data/2/dfs/dn /data/3/dfs/dn /data/4/dfs/dn
3. Create the mapred.local.dir local directories:
   $ sudo mkdir -p /data/1/mapred/local /data/2/mapred/local /data/3/mapred/local /data/4/mapred/local
4. Configure the owner of the dfs.name.dir and dfs.data.dir directories to be the hdfs user:
   $ sudo chown -R hdfs:hadoop /data/1/dfs/nn /data/2/dfs/nn /data/1/dfs/dn /data/2/dfs/dn /data/3/dfs/dn /data/4/dfs/dn
5. Configure the owner of the mapred.local.dir directories to be the mapred user:
   $ sudo chown -R mapred:hadoop /data/1/mapred/local /data/2/mapred/local /data/3/mapred/local /data/4/mapred/local
Here is a summary of the correct owner and permissions of the local directories:
Directory         | Owner          | Permissions (see Footnote 1)
dfs.name.dir      | hdfs:hadoop    | drwx------
dfs.data.dir      | hdfs:hadoop    | drwx------
mapred.local.dir  | mapred:hadoop  | drwxr-xr-x
Footnote:
1 In CDH3, the Hadoop daemons automatically set the correct permissions for you if you configure the directory ownership correctly as shown in the table above.
   
Note
If you specified non-existing folders for the dfs.data.dir property in the conf/hdfs-site.xml file, CDH3 will shut down. (In previous releases, CDH3 silently ignored non-existing folders for dfs.data.dir.)
Starting HDFS
To start HDFS, run:
$ sudo service hadoop-0.20-namenode start
$ sudo service hadoop-0.20-secondarynamenode start
$ sudo service hadoop-0.20-datanode start
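Once the three daemons are up, a quick sanity check (run as the hdfs user, as elsewhere in this guide):
$ sudo -u hdfs hadoop dfsadmin -report    # every DataNode should show up as a live node
$ sudo -u hdfs hadoop fs -ls /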
Creating and Configuring the mapred.system.dir Directory in HDFS
    After you start HDFS and before you start the JobTracker, you must also create the HDFS directory specified by the mapred.system.dir parameter and configure it to be owned by the mapred user. The mapred.system.dir parameter is represented by the following /mapred/system path example.
The directory is controlled by mapred.system.dir in conf/mapred-site.xml (or indirectly by hadoop.tmp.dir in conf/core-site.xml); it lives on HDFS and must be owned by mapred:hadoop. In this deployment:
$ sudo -u hdfs hadoop fs -mkdir /tmp/mapred/system
$ sudo -u hdfs hadoop fs -chown mapred:hadoop /tmp/mapred
Here is a summary of the correct owner and permissions of the mapred.system.dir directory in HDFS:
Directory         | Owner          | Permissions
mapred.system.dir | mapred:hadoop  | (see Footnote 1)
Footnote:
1 MapReduce sets the permissions for the mapred.system.dir directory when starting up, assuming the user mapred owns that directory; the JobTracker applies the correct permissions at startup.
    Add the path for the mapred.system.dir directory to the conf/mapred-site.xml file.
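A minimal conf/mapred-site.xml entry matching the /tmp/mapred/system directory created above; the mapred.job.tracker host and port are assumptions for this cluster (dwubt01 as master, CDH's customary port 8021), so adjust them to your layout:
<property>
  <name>mapred.system.dir</name>
  <value>/tmp/mapred/system</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>dwubt01:8021</value>
</property>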
Starting MapReduce
To start MapReduce:
$ sudo service hadoop-0.20-tasktracker start
$ sudo service hadoop-0.20-jobtracker start
Configuring the Hadoop Daemons to Start at Boot Time
To start the Hadoop daemons at boot time and on restarts, enable their init scripts using the chkconfig tool:
$ sudo chkconfig hadoop-0.20-namenode on
$ sudo chkconfig hadoop-0.20-jobtracker on
$ sudo chkconfig hadoop-0.20-secondarynamenode on
$ sudo chkconfig hadoop-0.20-tasktracker on
$ sudo chkconfig hadoop-0.20-datanode on
On Ubuntu systems, you can install the sysv-rc-conf package to get the chkconfig command or use update-rc.d:
$ sudo update-rc.d hadoop-0.20-namenode defaults
$ sudo update-rc.d hadoop-0.20-jobtracker defaults
$ sudo update-rc.d hadoop-0.20-secondarynamenode defaults
$ sudo update-rc.d hadoop-0.20-tasktracker defaults
$ sudo update-rc.d hadoop-0.20-datanode defaults
Note that you must run the commands on the correct server, according to your role definitions.
Miscellaneous
Browsers that access the Hadoop web UIs need the cluster host names in their local hosts file; add:
10.10.10.243    dwubt01
10.10.10.244    dwubt02
10.10.10.245    dwubt03
10.10.10.246    dwubt04

Use update-rc.d to control whether the tasktracker (and the other daemons) start at boot.
 
