Environment:
Four virtual machines:
hadoop-master
hadoop-slave1
hadoop-slave2
hadoop-slave3
===========================1.Hadoop==========================================================================
=================Pseudo-distributed setup on Linux==============================================
1.Download Hadoop and the JDK
http://mirror.esocc.com/apache/hadoop/common
This example uses: hadoop-1.0.4.tar.gz
2.Installation
2.1Install the JDK
Option 1: tar package
1.Download the JDK (I chose the tar package):
www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html
2.Extract it:
tar -zxvf jdk-7u15-linux-x64.tar.gz -C /usr/local
3.Configure the JDK environment variables
#vi /etc/profile
export JAVA_HOME=/usr/local/jdk1.7.0_15
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JAVA_HOME/jre/lib/rt.jar
:wq
#source /etc/profile
4.Run #java -version
5.Write a test class
Option 2: bin package
chmod +x jdk-6u27-linux-x64.bin
./jdk-6u27-linux-x64.bin
mv jdk1.6.0_27/ /usr/local/
Configure the JDK environment variables
#vi /etc/profile
export JAVA_HOME=/usr/local/jdk1.6.0_27
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JAVA_HOME/jre/lib/rt.jar
:wq
#source /etc/profile
Run #java -version
3.Extract and configure Hadoop
tar zxvf hadoop-**.tar.gz
mv hadoop-** /usr/local/
cd /usr/local/hadoop-**/conf
3.1.Edit hadoop-env.sh
vi hadoop-env.sh
Uncomment JAVA_HOME and point it at the installed JDK:
export JAVA_HOME=/usr/local/jdk1.6.0_27
3.2.Edit core-site.xml
The core configuration file; it sets the address and port of Hadoop's HDFS (the NameNode)
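A minimal sketch of what core-site.xml could contain for this step, written from the shell. fs.default.name is the Hadoop 1.x property for the HDFS address. Port 9000 and the CONF_DIR variable are assumptions for illustration, not values given in these notes.

```shell
#!/bin/sh
# CONF_DIR is a hypothetical variable: point it at the real conf directory,
# e.g. /usr/local/hadoop-1.0.4/conf. It defaults to ./conf so the sketch is safe to try.
CONF_DIR="${CONF_DIR:-./conf}"
mkdir -p "$CONF_DIR"

cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <!-- HDFS (NameNode) address; port 9000 is an assumed, commonly used value -->
    <name>fs.default.name</name>
    <value>hdfs://hadoop-master:9000</value>
  </property>
</configuration>
EOF
```

Whatever host and port are chosen here must be used consistently by every other file that refers to HDFS.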
3.3.Edit hdfs-site.xml
Set the data storage directory and the number of replicas
mkdir -p /data/hadoop/data
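A possible hdfs-site.xml for this step, as a sketch. dfs.data.dir and dfs.replication are the Hadoop 1.x property names. The replication value of 1 is an assumption that suits a single node (a cluster would normally use a higher value), and CONF_DIR is a hypothetical variable.

```shell
#!/bin/sh
# CONF_DIR is a hypothetical variable; set it to the real Hadoop conf directory.
CONF_DIR="${CONF_DIR:-./conf}"
mkdir -p "$CONF_DIR"

cat > "$CONF_DIR/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <!-- where each datanode stores its blocks: the directory created above -->
    <name>dfs.data.dir</name>
    <value>/data/hadoop/data</value>
  </property>
  <property>
    <!-- number of copies kept of each block; 1 is an assumed single-node value -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
```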
3.4.Edit mapred-site.xml
The MapReduce configuration file; it sets the JobTracker address and port
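A sketch of a matching mapred-site.xml. mapred.job.tracker is the Hadoop 1.x property for the JobTracker address. Port 9001 and the CONF_DIR variable are assumptions.

```shell
#!/bin/sh
# CONF_DIR is a hypothetical variable; set it to the real Hadoop conf directory.
CONF_DIR="${CONF_DIR:-./conf}"
mkdir -p "$CONF_DIR"

cat > "$CONF_DIR/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <!-- JobTracker host:port (no hdfs:// prefix); 9001 is an assumed value -->
    <name>mapred.job.tracker</name>
    <value>hadoop-master:9001</value>
  </property>
</configuration>
EOF
```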
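4.Configure passwordless SSH login
cd /root
A key pair can be generated with either rsa or dsa; each produces two files (a private key and a public key). rsa is recommended
ssh-keygen -t rsa
Keep pressing Enter until the pair is generated, then either append the public key to authorized_keys or overwrite that file with it
Append
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Overwrite
cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
Then test it
ssh hadoop-master
The first connection asks whether to continue connecting; type yes
You are now in root's home directory in a new login shell, separate from the shell you started in
5.Format Hadoop's file system, HDFS
/usr/local/hadoop-1.0.4/bin/hadoop namenode -format
6.Start Hadoop
/usr/local/hadoop-1.0.4/bin/start-all.sh
If necessary, HDFS and MapReduce can be started separately
start-dfs.sh and start-mapred.sh
7.Verify
Open in a browser
http://hadoop-master:50030 MapReduce web page
http://hadoop-master:50070 HDFS web page
If the pages cannot be reached from the host machine, check the firewall rules for these ports and that the hosts file maps the hostname to the correct IP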
============================================================================================
============================================================================================
=================Fully distributed setup on Linux==============================================
============================================================================================
============================================================================================
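Steps 1 and 2 are exactly the same as in the pseudo-distributed setup
3.On every node, edit /etc/hosts and add the following mappings (a DNS server can be used instead of hosts entries if you prefer):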
192.168.152.162 hadoop-master
192.168.152.163 hadoop-slave1
192.168.152.164 hadoop-slave2
192.168.152.165 hadoop-slave3
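Since these entries have to be added on every node, it can help to do it with a small script that is safe to run more than once. The sketch below works on a demo file by default; add_host and HOSTS_FILE are hypothetical names, and for real use HOSTS_FILE would be /etc/hosts (as root).

```shell
#!/bin/sh
# add_host FILE IP NAME: append "IP NAME" to FILE unless NAME is already mapped
# on a non-comment line.
add_host() {
  file=$1 ip=$2 name=$3
  if awk -v n="$name" '!/^[ \t]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
                       END { exit !found }' "$file" 2>/dev/null; then
    echo "$name already present in $file"
  else
    echo "$ip $name" >> "$file"
  fi
}

HOSTS_FILE="${HOSTS_FILE:-./hosts.demo}"   # hypothetical default; use /etc/hosts for real
touch "$HOSTS_FILE"
add_host "$HOSTS_FILE" 192.168.152.162 hadoop-master
add_host "$HOSTS_FILE" 192.168.152.163 hadoop-slave1
add_host "$HOSTS_FILE" 192.168.152.164 hadoop-slave2
add_host "$HOSTS_FILE" 192.168.152.165 hadoop-slave3
```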
4.Create the hadoop user
useradd hadoop
passwd hadoop
mkdir -p /data/hadoop/data
mkdir -p /home/hadoop/dhfs/tmp
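5.Passwordless SSH configuration
First log in as the hadoop user and go to its home directory, then generate a key pair following the steps above
su hadoop
cd
ssh-keygen -t rsa
Then choose whether to append or overwrite
Append
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Overwrite
cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
Then test it
ssh hadoop-master
Merge the authorized_keys files of all nodes into a single file, and replace the original authorized_keys on every node with it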
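The merge step can be sketched as below, assuming each node's public key (or authorized_keys file) has first been collected into one local directory. merge_keys and the ./keys directory are hypothetical names for illustration.

```shell
#!/bin/sh
# merge_keys KEY_DIR OUT_FILE: concatenate every file in KEY_DIR into OUT_FILE,
# dropping duplicate lines, and restrict the permissions as sshd expects.
merge_keys() {
  key_dir=$1 out=$2
  cat "$key_dir"/* | sort -u > "$out"
  chmod 600 "$out"
}

# Hypothetical usage, after copying each node's id_rsa.pub into ./keys
# (for example as keys/hadoop-slave1.pub):
#   merge_keys ./keys ./authorized_keys
#   for h in hadoop-master hadoop-slave1 hadoop-slave2 hadoop-slave3; do
#     scp ./authorized_keys "$h":.ssh/authorized_keys
#   done
```

Sorting with -u keeps one copy of each key even when the same key was collected from several places.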
=================The 5 steps above must be carried out on every node=================================
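6.Extract and configure Hadoop on the master
tar zxvf hadoop-**.tar.gz
mv hadoop-** /usr/local/
cd /usr/local/hadoop-**/conf
6.1.Edit hadoop-env.sh
vi hadoop-env.sh
Uncomment JAVA_HOME and point it at the installed JDK:
export JAVA_HOME=/usr/local/jdk1.6.0_27
6.2.Edit core-site.xml
The core configuration file; it sets the address and port of Hadoop's HDFS (the NameNode)
6.3.Edit hdfs-site.xml
Set the data storage directory and the number of replicas
mkdir -p /data/hadoop/data
6.4.Edit mapred-site.xml
The MapReduce configuration file; it sets the JobTracker address and port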
6.5Edit masters and slaves
vi /usr/local/hadoop-**/conf/masters
hadoop-master
vi /usr/local/hadoop-**/conf/slaves
hadoop-slave1
hadoop-slave2
hadoop-slave3
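6.6Copy Hadoop to every node
Before copying, give ownership of the hadoop folder to the hadoop user
chown -R hadoop:hadoop /usr/local/hadoop-1.0.4
Also grant the hadoop user write permission on the target directory on each destination server, or simply move the hadoop folder into the hadoop user's home directory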
scp -r hadoop-1.0.4/ hadoop-slave1:/home/hadoop
scp -r hadoop-1.0.4/ hadoop-slave2:/home/hadoop
scp -r hadoop-1.0.4/ hadoop-slave3:/home/hadoop
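With more than a few slaves, the copy can be driven from the slaves file instead of one scp line per host. A sketch, where copy_to_slaves and the DRY_RUN switch are hypothetical names introduced here:

```shell
#!/bin/sh
# copy_to_slaves SLAVES_FILE SRC DEST: scp SRC to DEST on every host listed in
# SLAVES_FILE. With DRY_RUN=1 it only prints the commands it would run.
copy_to_slaves() {
  slaves=$1 src=$2 dest=$3
  while read -r host; do
    [ -n "$host" ] || continue                 # skip blank lines
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "scp -r $src $host:$dest"
    else
      # stdin is redirected so scp cannot swallow the rest of the host list
      scp -r "$src" "$host:$dest" < /dev/null
    fi
  done < "$slaves"
}

# Hypothetical usage:
#   DRY_RUN=1 copy_to_slaves /usr/local/hadoop-1.0.4/conf/slaves hadoop-1.0.4/ /home/hadoop
```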
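======The following steps are the same as in the pseudo-distributed setup==================
7.Format Hadoop's file system, HDFS (run on the master node only)
/home/hadoop/hadoop-1.0.4/bin/hadoop namenode -format
8.Start Hadoop
/home/hadoop/hadoop-1.0.4/bin/start-all.sh
If necessary, HDFS and MapReduce can be started separately
start-dfs.sh and start-mapred.sh
9.Verify
Open in a browser
http://hadoop-master:50030 MapReduce web page
http://hadoop-master:50070 HDFS web page
If the pages cannot be reached from the host machine, check the firewall rules for these ports and that the hosts file maps the hostname to the correct IP
10.Check the daemons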
/usr/local/jdk1.6.0_27/bin/jps
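The jps listing can be checked automatically. The sketch below assumes the usual Hadoop 1.x layout for this cluster (NameNode, SecondaryNameNode and JobTracker on the master; DataNode and TaskTracker on each slave). check_daemons is a hypothetical helper.

```shell
#!/bin/sh
# check_daemons NAME...: read `jps` output on stdin ("<pid> <name>" per line)
# and report each named daemon. Returns non-zero if any of them is missing.
check_daemons() {
  listing=$(cat)
  missing=0
  for name in "$@"; do
    if printf '%s\n' "$listing" | awk -v n="$name" '$2 == n { ok = 1 } END { exit !ok }'; then
      echo "$name: running"
    else
      echo "$name: NOT running"
      missing=1
    fi
  done
  return "$missing"
}

# On the master:  jps | check_daemons NameNode SecondaryNameNode JobTracker
# On a slave:     jps | check_daemons DataNode TaskTracker
```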
==========hello world test===========================================
cd
mkdir input
cd input/
echo "hello world" >test1.txt
echo "hello hadoop" >test2.txt
Copy the input files into Hadoop's HDFS
/home/hadoop/hadoop-1.0.4/bin/hadoop dfs -put /home/hadoop/input/test1.txt .
/home/hadoop/hadoop-1.0.4/bin/hadoop dfs -put /home/hadoop/input/test2.txt .
Check that the copy succeeded
/home/hadoop/hadoop-1.0.4/bin/hadoop dfs -ls .
Run the word count example
/home/hadoop/hadoop-1.0.4/bin/hadoop jar /home/hadoop/hadoop-1.0.4/hadoop-examples-1.0.4.jar wordcount . out
View the directory, the file layout, and the word count results
/home/hadoop/hadoop-1.0.4/bin/hadoop dfs -ls .
/home/hadoop/hadoop-1.0.4/bin/hadoop dfs -ls ./out
/home/hadoop/hadoop-1.0.4/bin/hadoop dfs -cat ./out/*
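To know what the job output should look like, the same count can be reproduced locally without Hadoop. This is only a local stand-in for comparison, not how the wordcount example is implemented; local_wordcount is a hypothetical helper.

```shell
#!/bin/sh
# local_wordcount FILE...: count whitespace-separated words across the files
# and print "word<TAB>count", sorted by word.
local_wordcount() {
  cat "$@" | awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
                  END { for (w in count) print w "\t" count[w] }' | LC_ALL=C sort
}

demo_dir=$(mktemp -d)
echo "hello world"  > "$demo_dir/test1.txt"
echo "hello hadoop" > "$demo_dir/test2.txt"
local_wordcount "$demo_dir/test1.txt" "$demo_dir/test2.txt"
# prints (tab-separated):
#   hadoop  1
#   hello   2
#   world   1
```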
To see where the data is written on the operating system, run on a datanode:
ls -lR /data/hadoop/data/
/trash: recycle bin
===========================================================================
1.org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/hadoop/in could only be replicated to 0 nodes, instead of 1
/home/hadoop/hadoop-1.0.4/bin/hadoop dfsadmin -report
Check whether capacity has been allocated to the nodes
Cause: Configured Capacity is 0, i.e. the datanode has no capacity allocated
Change the value of hadoop.tmp.dir in Hadoop's conf/core-site.xml
2.ERROR namenode.NameNode: java.io.IOException: Cannot create directory
chown -R hadoop:hadoop /home/hadoop/dhfs/
chown -R hadoop:hadoop /data/hadoop/
3.org.apache.hadoop.ipc.RPC server
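This is usually caused by localhost and the hostname both resolving to 127.0.0.1.
Map the hostname to the machine's actual IP address to fix it
When you run into problems, read the log files
4.Permission denied (publickey,gssapi-with-mic,password) at startup
Check that the current user is the user the SSH keys were set up for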
===========================================================================
====================2.HBase=================================================================================
Version used in this test: 0.92.2
======2.Basic HBase operations=================================
export HADOOP_HOME=/home/hadoop/hadoop-1.0.4
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
export HBASE_HOME=/home/hadoop/hbase-0.92.2/
export HBASE_CONF_DIR=$HBASE_HOME/conf
export PATH=$HBASE_HOME/bin:$HBASE_HOME/conf:$PATH
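1.Installing HBase
Download the HBase package and extract it; then configure conf/hbase-site.xml
Standalone:
Pseudo-distributed:
Fully distributed:
1)Configure hbase-site.xml
2)Configure conf/regionservers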
hadoop-master
hadoop-slave1
hadoop-slave2
hadoop-slave3
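3)ZooKeeper configuration
In hbase-env.sh,
HBASE_MANAGES_ZK defaults to true, meaning ZooKeeper is started together with HBase;
when it is set to false, you must start ZooKeeper yourself
4)Copy to every node, then configure the environment variables on each node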
scp -r hbase-0.92.2/ hadoop-slave1:/home/hadoop
scp -r hbase-0.92.2/ hadoop-slave2:/home/hadoop
scp -r hbase-0.92.2/ hadoop-slave3:/home/hadoop
5)Set the JAVA_HOME environment variable in hbase-env.sh
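For the fully distributed case, hbase-site.xml might look like the sketch below. hbase.rootdir, hbase.cluster.distributed and hbase.zookeeper.quorum are standard HBase properties. The HDFS port 9000 and the choice of three quorum hosts are assumptions; the port must match whatever fs.default.name uses in Hadoop's core-site.xml.

```shell
#!/bin/sh
# Writes to $HBASE_CONF_DIR if it is set, otherwise to ./hbase-conf (a
# hypothetical default so the sketch is safe to try).
CONF_DIR="${HBASE_CONF_DIR:-./hbase-conf}"
mkdir -p "$CONF_DIR"

cat > "$CONF_DIR/hbase-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <!-- HBase data directory in HDFS; host and port are assumed values -->
    <name>hbase.rootdir</name>
    <value>hdfs://hadoop-master:9000/hbase</value>
  </property>
  <property>
    <!-- true = fully distributed mode -->
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <!-- hosts running ZooKeeper; this list of three is an assumption -->
    <name>hbase.zookeeper.quorum</name>
    <value>hadoop-master,hadoop-slave1,hadoop-slave2</value>
  </property>
</configuration>
EOF
```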
2.Running HBase
Startup order: HDFS->ZooKeeper->HBase
Standalone: start-hbase.sh
Pseudo-distributed: start-hbase.sh
Fully distributed: start-hbase.sh
========================3.ZooKeeper=============================================================================
=======2.Installing and configuring ZooKeeper============================================
2.1Installing ZooKeeper
2.1.1Standalone installation
1)Download
2)Install
export ZOOKEEPER_HOME=$HADOOP_HOME/zookeeper-3.4.5
export PATH=$PATH:$ZOOKEEPER_HOME/bin:$ZOOKEEPER_HOME/conf
3)Create a zoo.cfg file under $ZOOKEEPER_HOME/conf with the following content:
#The number of milliseconds of each tick
tickTime = 2000
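#the directory where the snapshot is stored
#(must be an absolute path: zoo.cfg does not expand variables such as $ZOOKEEPER_HOME)
dataDir = /home/hadoop/hadoop-1.0.4/zookeeper-3.4.5/data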
#the port at which the clients will connect
clientPort = 2181
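2.1.2Installing ZooKeeper on a cluster
Install ZooKeeper on an odd number of servers, configuring each as in standalone mode; the file that needs to change is zoo.cfg, as follows: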
#The number of milliseconds of each tick
tickTime = 2000
#The number of ticks that the initial
#synchronization phase can take
initLimit = 10
#The number of ticks that can pass between
#sending a request and getting an acknowledgement
syncLimit = 5
#the directory where the snapshot is stored
dataDir = /home/sid/Downloads/hadoop-1.0.4/zookeeper-3.4.5/data
#the port at which the clients will connect
clientPort = 2181
#the location of the log file
dataLogDir = /home/sid/Downloads/hadoop-1.0.4/zookeeper-3.4.5/log
server.1 = hadoop-master:2887:3887
server.2 = hadoop-slave1:2888:3888
server.3 = hadoop-slave2:2889:3889
server.4 = hadoop-slave3:2889:3889
Then run the copy commands:
scp -r zookeeper-3.4.5/ hadoop-slave1:/home/hadoop/hadoop-1.0.4
scp -r zookeeper-3.4.5/ hadoop-slave2:/home/hadoop/hadoop-1.0.4
scp -r zookeeper-3.4.5/ hadoop-slave3:/home/hadoop/hadoop-1.0.4
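On each server, create a file named myid under dataDir containing that server's own server id (on the master, for example, write 1); the id must be unique within the cluster
In the server.N lines, the first port is the one followers use to connect to the leader, and the second is used for leader election.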
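The id for each machine can be read back out of zoo.cfg instead of being typed by hand, which avoids a mismatch between myid and the server.N lines. A sketch, where myid_for is a hypothetical helper and the paths in the usage comment are placeholders:

```shell
#!/bin/sh
# myid_for ZOO_CFG HOSTNAME: print the N of the "server.N = HOSTNAME:port:port"
# line in ZOO_CFG. Returns non-zero if HOSTNAME is not listed.
myid_for() {
  awk -F'[=:]' -v h="$2" '
    /^[ \t]*server\.[0-9]+[ \t]*=/ {
      id = $1;   sub(/^[ \t]*server\./, "", id); gsub(/[ \t]/, "", id)
      host = $2; gsub(/[ \t]/, "", host)
      if (host == h) { print id; found = 1; exit }
    }
    END { exit (found ? 0 : 1) }' "$1"
}

# Hypothetical usage on each server (DATA_DIR is the dataDir from zoo.cfg):
#   myid_for "$ZOOKEEPER_HOME/conf/zoo.cfg" "$(hostname)" > "$DATA_DIR/myid"
```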
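2.1.3Installing ZooKeeper in pseudo-cluster mode (several instances on one machine)
Proceed as in cluster mode, but create three files under $ZOOKEEPER_HOME/conf: zoo1.cfg, zoo2.cfg and zoo3.cfg; in each one change the following, replacing num with 1, 2 or 3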
dataDir = /home/sid/Downloads/hadoop-1.0.4/zookeeper-3.4.5/data_num
dataLogDir = /home/sid/Downloads/hadoop-1.0.4/zookeeper-3.4.5/log_num
clientPort = 218num
server.1 = localhost:2887:3887
server.2 = localhost:2888:3888
server.3 = localhost:2889:3889
Then create a myid file in each data_num directory and write the matching num into it
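The three config files and myid files differ only in num, so they can be generated in a loop. A sketch under the assumption that everything lives below one base directory; ZK_BASE is a hypothetical variable, and the tickTime, initLimit and syncLimit values are copied from the cluster example.

```shell
#!/bin/sh
# Generate zoo1.cfg..zoo3.cfg plus the matching data/log directories and myid
# files for three ZooKeeper instances on one machine.
ZK_BASE="${ZK_BASE:-$PWD/zookeeper-demo}"   # hypothetical; use the real install path
mkdir -p "$ZK_BASE/conf"

for num in 1 2 3; do
  mkdir -p "$ZK_BASE/data_$num" "$ZK_BASE/log_$num"
  cat > "$ZK_BASE/conf/zoo$num.cfg" <<EOF
tickTime = 2000
initLimit = 10
syncLimit = 5
dataDir = $ZK_BASE/data_$num
dataLogDir = $ZK_BASE/log_$num
clientPort = 218$num
server.1 = localhost:2887:3887
server.2 = localhost:2888:3888
server.3 = localhost:2889:3889
EOF
  echo "$num" > "$ZK_BASE/data_$num/myid"   # must match the N in server.N
done
```

Every port on the server.N lines has to be distinct here because all three instances share one host.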
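2.2Configuring ZooKeeper
2.2.1Minimum configuration:
tickTime, dataDir, clientPort
2.2.2Advanced configuration:
dataLogDir: where the transaction log is written
maxClientCnxns: limits the number of concurrent connections a single client (identified by IP address) may open to one ZooKeeper server
minSessionTimeout
maxSessionTimeout
2.2.3Cluster configuration:
initLimit: the time allowed for a follower to connect to the leader and finish its initial sync, expressed as a multiple of tickTime
syncLimit: the maximum time allowed between a request and its acknowledgement when the leader and a follower exchange messages, also expressed in ticks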
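Because both limits are counted in ticks, the real timeouts follow from multiplying by tickTime. Worked through for the values used in the cluster zoo.cfg above (limit_ms is a hypothetical helper):

```shell
#!/bin/sh
# limit_ms TICK_MS TICKS: convert a limit given in ticks to milliseconds.
limit_ms() {
  echo $(( $1 * $2 ))
}

tickTime=2000
echo "initLimit = 10 ticks -> $(limit_ms "$tickTime" 10) ms"   # 20000 ms
echo "syncLimit = 5 ticks -> $(limit_ms "$tickTime" 5) ms"     # 10000 ms
```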
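3.Running ZooKeeper
3.1.Standalone: zkServer.sh start
3.2.Cluster mode: run on every ZooKeeper server: zkServer.sh start
3.3.Pseudo-cluster:
zkServer.sh start zoo1.cfg
zkServer.sh start zoo2.cfg
zkServer.sh start zoo3.cfg
Integrating hadoop, hbase and zookeeper
1.Install hadoop and start it
2.Configure zookeeper and start it
3.Configure hbase (following the fully distributed setup)
Configure hbase-site.xml
Then run start-hbase.sh
In a browser, open http://localhost:60010/
Check whether Zookeeper Quorum appears in the list; if it does, the integration succeeded
Changes for pseudo-distributed mode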
================================
======================4.Chukwa===============================================================================
========4.Setting up a Chukwa cluster=========================================
1.Installation:
export CHUKWA_HOME=$HADOOP_HOME/chukwa-incubating-0.5.0
export CHUKWA_CONF_DIR=$CHUKWA_HOME/etc/chukwa
export PATH=$CHUKWA_HOME/bin:$CHUKWA_HOME/sbin:$CHUKWA_CONF_DIR:$PATH
2.Hadoop and HBase cluster configuration
For installing hadoop and hbase, see the notes above. Then do the following
First copy the Chukwa files into hadoop:
mv $HADOOP_HOME/conf/log4j.properties $HADOOP_HOME/conf/log4j.properties.bak
mv $HADOOP_HOME/conf/hadoop-metrics2.properties $HADOOP_HOME/conf/hadoop-metrics2.properties.bak
cp $CHUKWA_CONF_DIR/hadoop-log4j.properties $HADOOP_HOME/conf/log4j.properties
cp $CHUKWA_CONF_DIR/hadoop-metrics2.properties $HADOOP_HOME/conf/hadoop-metrics2.properties
cp $CHUKWA_HOME/share/chukwa/chukwa-0.5.0-client.jar $HADOOP_HOME/lib
cp $CHUKWA_HOME/share/chukwa/lib/json-simple-1.1.jar $HADOOP_HOME/lib
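Once this is done, start the Hadoop cluster and then set up HBase: the tables Chukwa stores its data in must be created in HBase. The schema is already provided and only needs to be loaded through hbase shell, as follows: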
bin/hbase shell < $CHUKWA_CONF_DIR/hbase.schema
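3.Collector configuration
First configure $CHUKWA_CONF_DIR/chukwa-env.sh. This file holds Chukwa's environment variables; most scripts read the key global Chukwa settings from it.
Set JAVA_HOME, then set the two variables below it (uncomment them if they are commented out)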
export JAVA_HOME=/usr/local/jdk1.6.0_27
export HADOOP_CONF_DIR=/home/hadoop/hadoop-1.0.4/conf
export HBASE_CONF_DIR=/home/hadoop/hbase-0.92.2/conf
When several machines run as collectors, edit the $CHUKWA_CONF_DIR/collectors file; its format is the same as hadoop's slaves file
hadoop-master
hadoop-slave1
hadoop-slave2
hadoop-slave3
The $CHUKWA_CONF_DIR/initial_adaptors file mainly controls which logs Chukwa monitors, and how and how often it monitors them. The default configuration is sufficient, as follows
add sigar.SystemMetrics SystemMetrics 60 0
add SocketAdaptor HadoopMetrics 9095 0
add SocketAdaptor Hadoop 9096 0
add SocketAdaptor ChukwaMetrics 9097 0
add SocketAdaptor JobSummary 9098 0
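$CHUKWA_CONF_DIR/chukwa-collector-conf.xml holds Chukwa's basic configuration. Use this file to specify the location of HDFS.
A further property specifies the sink data location; /chukwa/logs/ is its path in HDFS. By default the Collector listens on port 8080, which can be changed; every Agent sends its messages to that port.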
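The properties themselves are not preserved in these notes. As a hedged sketch, the fragment below uses the property names as I recall them from the Chukwa 0.5 collector configuration (writer.hdfs.filesystem, chukwaCollector.outputDir, chukwaCollector.http.port); check them against the chukwa-collector-conf.xml shipped with your release. The HDFS URL assumes the NameNode is at hadoop-master:9000. The script writes to a separate fragment file (a hypothetical name) so the real configuration is not overwritten.

```shell
#!/bin/sh
# FRAGMENT is a hypothetical output file; merge its properties by hand into
# $CHUKWA_CONF_DIR/chukwa-collector-conf.xml.
FRAGMENT="${FRAGMENT:-./chukwa-collector-conf.fragment.xml}"

cat > "$FRAGMENT" <<'EOF'
<property>
  <!-- HDFS the collector writes to; host and port are assumed values -->
  <name>writer.hdfs.filesystem</name>
  <value>hdfs://hadoop-master:9000/</value>
</property>
<property>
  <!-- sink data location inside HDFS, as named in the notes -->
  <name>chukwaCollector.outputDir</name>
  <value>/chukwa/logs/</value>
</property>
<property>
  <!-- port the collector listens on; 8080 is the default named in the notes -->
  <name>chukwaCollector.http.port</name>
  <value>8080</value>
</property>
EOF
```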
4.Agent configuration
Agents are configured through the $CHUKWA_CONF_DIR/agents file, which is similar to collectors:
hadoop-master
hadoop-slave1
hadoop-slave2
hadoop-slave3
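In addition, $CHUKWA_CONF_DIR/chukwa-agent-conf.xml holds the agent's basic configuration. Its most important property is the cluster name, which identifies the monitored nodes; the value is stored in every collected chunk so that different clusters can be told apart. For example, to set the cluster name: cluster="chukwa"
Another, optional, property is chukwaAgent.checkpoint.dir, the directory where Chukwa periodically checkpoints its running Adaptors. It must not be shared and must be a local directory, not a network file system directory.
5.Data analysis with Pig
Pig can be used for data analysis, which requires some extra environment variables. For pig to read the data chukwa has collected, it has to connect to HBase and Hadoop: first make sure pig is installed correctly, then add Hadoop and HBase to pig's classpath: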
export PIG_CLASSPATH=$HADOOP_CONF_DIR:$HBASE_CONF_DIR
Next, build a jar file from HBASE_CONF_DIR:
jar cf $CHUKWA_HOME/hbase-env.jar $HBASE_CONF_DIR
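Create the analysis script job that runs periodically (adjust the jar file names to the versions actually installed):
pig -Dpig.additional.jars=${HBASE_HOME}/hbase-0.90.4.jar:${ZOOKEEPER_HOME}/zookeeper-3.3.2.jar:${PIG_HOME}/pig-0.10.0.jar:${CHUKWA_HOME}/hbase-env.jar ${CHUKWA_HOME}/share/chukwa/script/pig/ClusterSummary.pig
6.Copy to every node, then configure the environment variables on each node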
scp -r chukwa-incubating-0.5.0 hadoop-slave1:/home/hadoop/hadoop-1.0.4
scp -r chukwa-incubating-0.5.0 hadoop-slave2:/home/hadoop/hadoop-1.0.4
scp -r chukwa-incubating-0.5.0 hadoop-slave3:/home/hadoop/hadoop-1.0.4
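Running Chukwa
Before starting chukwa, start Hadoop and HBase, then start the collector and the agent separately
1.collector:
Start: ./bin/chukwa collector
Stop: ./sbin/stop-collectors.sh
2.agent
Start: ./bin/chukwa agent
sbin/start-agents.sh
3.Start HICC
Start: ./bin/chukwa hicc
Once started it can be reached in a browser at: http://
The default port is 4080;
The default user name and password are both: admin
If needed, edit jetty.xml under /WEB-INF/ inside the $CHUKWA_HOME/webapps/hicc.war file
4.Chukwa startup sequence:
1)Start Hadoop and HBase
2)Start Chukwa: sbin/start-chukwa.sh
3)Start HICC: bin/chukwa hicc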
=================================================
cat: /root/chukwa/chukwa-incubating-0.5.0/bin/share/chukwa/VERSION: No such file or directory
/root/chukwa/chukwa-incubating-0.5.0/bin/chukwa: line 170: /root/java/jdk-1.6.0_20/bin/java: No such file or directory
/root/chukwa/chukwa-incubating-0.5.0/bin/chukwa: line 170: exec: /root/java/jdk-1.6.0_20/bin/java: cannot execute: No such file or directory
Fix 1:
Copy the share folder from /root/chukwa/chukwa-incubating-0.5.0/ into ./bin,
which resolves the problem
Fix 2:
Open $CHUKWA_HOME/libexec/chukwa-config.sh in gedit
and change lines 30-31 from
# the root of the Chukwa installation
export CHUKWA_HOME=`pwd -P ${CHUKWA_LIBEXEC}/..`
to:
# the root of the Chukwa installation
export CHUKWA_HOME=/root/chukwa/chukwa-incubating-0.5.0
where /root/chukwa/chukwa-incubating-0.5.0 is the actual chukwa installation path