I. Installing the virtual machine and preliminary setup
1. Install CentOS 6.0 in VMware; use NAT for the network mode.
2. Add the hadoop user to sudoers
su root
Enter the root password; once accepted you are switched to the root user.
Then run:
chmod u+w /etc/sudoers
vi /etc/sudoers
After the line "root ALL=(ALL:ALL) ALL" add a line:
hadoop ALL=(ALL:ALL) ALL
This allows the hadoop user to run any command via sudo.
Save the file.
chmod u-w /etc/sudoers
This changes the permissions of the sudoers file back to 440, i.e. even root normally has read-only access. When sudo runs it checks whether this file's permissions are 440; if they are not, sudo refuses to work, so be sure to restore 440 after editing.
3. Use WinSCP to upload jdk-6u24-linux-i586.bin, hadoop-0.20.2.tar.gz, hive-0.11.0.tar.gz, pig-0.5.0.tar.gz and zookeeper-3.4.3.tar.gz into /usr/java, /usr/hadoop, /usr, /usr and /usr respectively.
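If the target directories do not exist yet, create them first (a small preparatory step, not spelled out in the original list):
sudo mkdir -p /usr/java /usr/hadoop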
II. SSH setup
1. The Master (NameNode/JobTracker) acts as the client. To get passwordless public-key authentication when connecting to a Slave (DataNode/TaskTracker), generate a key pair (one public key and one private key) on the Master and copy the public key to every Slave. When the Master connects to a Slave over SSH, the Slave generates a random number, encrypts it with the Master's public key and sends it back; the Master decrypts it with its private key and returns the result, and once the Slave confirms the decrypted value is correct it allows the Master to connect. This is a standard public-key authentication exchange and requires no password typed by the user. The essential step is copying the Master's public key to the Slaves.
2. Generate the key pair on the Master (logged in as hadoop1)
ssh-keygen -t rsa -P ''
This command generates a passphrase-less key pair; just press Enter when asked for the save path to accept the default. The resulting keys, id_rsa and id_rsa.pub, are stored under "/home/hadoop/.ssh" by default.
Check that "/home/hadoop/" contains a ".ssh" folder and that the two newly generated keys are inside it.
Then, on the Master node, append id_rsa.pub to the authorized keys:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
3. Before verifying, two things need to be done. First, fix the permissions of "authorized_keys" (this matters: insecure permissions will stop public-key authentication from working). Second, as root, edit "/etc/ssh/sshd_config" so that passwordless login is enabled.
1) Fix the permissions of "authorized_keys"
chmod 600 ~/.ssh/authorized_keys
Note: without this step you will still be prompted for a password during verification; tracking down this cause cost nearly half a day.
2) Adjust the SSH configuration
Log in as root and edit the following settings in "/etc/ssh/sshd_config":
vim /etc/ssh/sshd_config
RSAAuthentication yes            # enable RSA authentication
PubkeyAuthentication yes         # enable public/private key authentication
AuthorizedKeysFile .ssh/authorized_keys    # path to the public key file (the one generated above)
Remember to restart the SSH service afterwards so the changes take effect:
service sshd restart
Log out of root and verify as the regular hadoop1 user:
ssh localhost
III. Installing the JDK (JDK 1.6)
(1) # cd /usr/java
# sudo chmod 777 jdk-6u24-linux-i586.bin
This gives the current user execute permission on jdk-6u24-linux-i586.bin.
(2) # sudo ./jdk-6u24-linux-i586.bin
Run jdk-6u24-linux-i586.bin. The JDK license agreement is displayed; page through it with the space bar, and when asked whether you agree, type "yes". The JDK is then extracted into the current directory, with the progress shown on screen.
Once extraction finishes, a new directory named "jdk1.6.0_24" appears under /usr/java; the JDK is now installed on CentOS.
(3) Log in as hadoop1, go to the home directory /home/hadoop1 and run "vi .bashrc". Add the lines below to set the user's personal environment variables; this does not affect the system-wide environment.
# set java environment
export JAVA_HOME=/usr/java/jdk1.6.0_24
export JRE_HOME=/usr/java/jdk1.6.0_24/jre
export CLASSPATH=.:$JAVA_HOME/lib:$JAVA_HOME/jre/lib
export PATH=$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$PATH
After adding the lines above in vi, save and exit, then run the following commands to apply the settings:
chmod +x /home/hadoop/.bashrc    (add execute permission)
source /home/hadoop/.bashrc
After the configuration, run java -version; if output like the following appears, the Java environment is set up correctly.
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b04)
Java HotSpot(TM) 64-Bit Server VM (build 20.8-b03, mixed mode)
IV. Installing Hadoop (log in as root, or adjust the read/write permissions)
1. sudo chmod 777 hadoop-0.20.2.tar.gz
sudo tar zxvf hadoop-0.20.2.tar.gz    (extract the Hadoop archive)
2. Change the owner and group of the Hadoop tree to the user who will run it:
sudo chown -R hadoop:hadoop /usr/hadoop
This lets Hadoop keep the datanode and namenode storage files it creates at run time.
sudo chmod -R a+w /usr/hadoop
This makes the Hadoop directory writable by the current user.
3. Configure hadoop-env.sh
Command: sudo vi hadoop-env.sh
Add:
# set java environment
export JAVA_HOME=/usr/java/jdk1.6.0_24
Save and exit.
4. Configure core-site.xml
[hadoop1@master conf]# vi core-site.xml
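A minimal pseudo-distributed core-site.xml looks roughly like this (a sketch: the namenode address reuses the master IP from this setup, port 9000 is only the conventional choice, and hadoop.tmp.dir matches the tmp directory referred to in the troubleshooting section at the end):
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.131.131:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/hadoop/hadoop-0.20.2/tmp</value>
  </property>
</configuration>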
5. Configure hdfs-site.xml
[hadoop1@master conf]# vi hdfs-site.xml
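For a single-node install one copy of each block is enough, so hdfs-site.xml can be as simple as this (a sketch):
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>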
6. Configure mapred-site.xml
[hadoop@vm10110041 conf]$ sudo vi mapred-site.xml
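A matching mapred-site.xml points the JobTracker at the same host (a sketch; port 9001 is only the conventional choice):
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>192.168.131.131:9001</value>
  </property>
</configuration>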
7. Configure the masters and slaves files
[hadoop@master conf]# vi masters
192.168.131.131
[hadoop@master conf]# sudo vi slaves
192.168.131.131
Note: in pseudo-distributed mode the namenode (master) and the datanode (slave) are the same machine, so both files contain the same IP.
8. Edit the hosts file
[hadoop@master ~]# sudo vi /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1       localhost
192.168.131.131 master
192.168.131.131 slave
9. Update PATH
Edit your own environment variables:
vi /home/hadoop/.bashrc
The line currently reads: export PATH=$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$PATH
Add the Hadoop bin directory in front of $PATH (not after it).
After the change it reads:
export PATH=$JAVA_HOME/bin:$JAVA_HOME/jre/bin:/usr/hadoop/hadoop-0.20.2/bin:$PATH
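Then reload the profile so the updated PATH takes effect in the current shell (the same command used again in later sections):
source /home/hadoop/.bashrc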
10. Start Hadoop (the namenode must be formatted on the first run)
The start-up sequence and the process check are sketched below.
Check the running processes with jps; every daemon must be present.
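A sketch of the usual sequence, using the install paths above (the format step is only for the very first start):
cd /usr/hadoop/hadoop-0.20.2
bin/hadoop namenode -format
bin/start-all.sh
jps
For this pseudo-distributed setup, jps should list NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker (plus Jps itself); if any of them is missing, see the notes in section X.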
V. Configuring and using Hadoop from Eclipse on Windows
1. Extract hadoop-0.20.2.tar.gz to a local disk on Windows and copy the hadoop-0.20.2-eclipse-plugin jar from its /contrib/eclipse-plugin directory into Eclipse's plugins directory.
2. In Eclipse, use Open Perspective to add the Map/Reduce perspective.
3. Switch to the Map/Reduce perspective; a Map/Reduce Locations view appears at the bottom right. Click it to add a location.
Under Advanced parameters, change the first value of hadoop.job.ugi to hadoop (if the option is missing, restart Eclipse and look again), and set mapred.system.dir to /hadoop/mapred/system.
4. DFS Locations is now visible in Eclipse; you can browse the folders in DFS and create, modify and delete files.
1. In Eclipse, open Window -> Preferences -> Hadoop Map/Reduce and select the path where hadoop-0.20.2.tar.gz was extracted on the Windows disk.
2. Create a new project of type Map/Reduce Project and name it WordCount.
3. In the project, create a package wordCount, and inside it create a Mapper class WordCountMapper, a Reducer class WordCountReducer and a driver class WordCount.
4. WordCountMapper.class
package wordCount;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(WritableComparable key, Writable value,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line.toLowerCase());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one);
        }
    }

    // found myself having to add this for Eclipse to be happy...
    // it matches the definition of the map() function better than what the hadoop example
    // does... Oh well...
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        // split each line into lowercase tokens and emit (word, 1) for every token
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line.toLowerCase());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one);
        }
    }
}
5. WordCountReducer.class
package wordCount;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        // sum up all the counts emitted for this word
        int sum = 0;
        while (values.hasNext()) {
            IntWritable value = values.next();
            sum += value.get(); // process value
        }
        output.collect(key, new IntWritable(sum));
    }
}
6. WordCount.class
package wordCount;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {

    public static void main(String[] args) {
        JobClient client = new JobClient();
        JobConf conf = new JobConf(WordCount.class);

        // specify output types
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // specify input and output dirs
        // FileInputPath.addInputPath(conf, new Path("input"));
        // FileOutputPath.addOutputPath(conf, new Path("output"));
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // make sure the "input" directory exists in the DFS area
        // make sure the "output" directory does NOT exist in the DFS area
        FileInputFormat.addInputPath(conf, new Path("input"));
        FileOutputFormat.setOutputPath(conf, new Path("output"));

        // specify a mapper
        conf.setMapperClass(WordCountMapper.class);

        // specify a reducer
        conf.setReducerClass(WordCountReducer.class);
        conf.setCombinerClass(WordCountReducer.class);

        client.setConf(conf);
        try {
            JobClient.runJob(conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
7. Under the usr/hadoop directory create two folders named input and output, and upload a few .txt files (foo.txt and sss.txt) into the input folder.
8. Choose Run As -> Run on Hadoop.
9. Refresh and open the output folder to see the word-count results; the result can also be read from the command line as sketched below.
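A hedged command-line check (part-00000 is the default name of the single reducer's output; the relative path resolves under the hadoop user's HDFS home when the job ran as that user):
bin/hadoop fs -cat output/part-00000
Each output line has the form word<TAB>count.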
VI. Installing and configuring Hive (Derby metastore)
1. Create a folder hive under /usr and upload hive-0.11.0.tar.gz into it with WinSCP.
2. cd /usr/hive
sudo chmod 777 hive-0.11.0.tar.gz
sudo tar zxvf hive-0.11.0.tar.gz
Then create a symbolic link:
ln -s hive-0.11.0 hive
Next, edit
vi /home/hadoop1/.bashrc
• and add the environment variables:
• export HIVE_HOME=/usr/hive/hive-0.11.0
• export PATH=…:$HIVE_HOME/bin:$PATH
3. • Go to the hive/conf directory.
• Create hive-env.sh from hive-env.sh.template:
• cp hive-env.sh.template hive-env.sh
• Edit hive-env.sh:
• point it at the Hive configuration directory
• export HIVE_CONF_DIR=/usr/hive/hive-0.11.0/conf
• and at the Hadoop installation
• HADOOP_HOME=/usr/hadoop/hadoop-0.20.2
4. Make two copies of conf/hive-default.xml.template, one named hive-default.xml (keeps the default settings) and one named hive-site.xml (for site-specific settings that override the defaults).
5. sudo chown -R hadoop:hadoop /usr/hive
6. With Hadoop running normally, start Hive.
7. Create a table, for example as shown below.
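An illustrative table (any simple schema will do; this mirrors the example used in the MySQL section later):
hive> CREATE TABLE test(id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
hive> SHOW TABLES;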
8. Exit Hive:
exit;
VII. Installing and configuring Hive (MySQL metastore)
1. Switch to the root user:
# yum -y install mysql-server
2. Start the MySQL service
[root@localhost ~]# chkconfig mysqld on        ← make the MySQL service start automatically at boot
[root@localhost ~]# chkconfig --list mysqld    ← confirm it is enabled
mysqld 0:off 1:off 2:on 3:on 4:on 5:on 6:off   ← it is OK if run levels 2-5 are "on"
[root@localhost ~]# /etc/rc.d/init.d/mysqld start   ← start the MySQL service
Set the root password (here: root):
[root@localhost ~]# mysql -u root              ← log in to the MySQL server as root
mysql> set password for root@localhost=password('enter root password here');
Create the hive database: create database hive;
Create a hive user that can connect from localhost and grant it the required privileges: grant all on *.* to hive@localhost identified by 'hive';
3. In Hive's conf directory, edit the configuration file hive-site.xml; the required changes are sketched below.
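A minimal hive-site.xml for the MySQL metastore looks roughly like this (a sketch; the database name, user and password are the ones created in step 2, and the URL assumes MySQL on the local machine's default port 3306):
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
  </property>
</configuration>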
4. Copy the MySQL JDBC driver jar into Hive's lib directory (I used mysql-connector-java-5.0.8-bin.jar, found inside the archive downloaded from http://downloads.mysql.com/archives/mysql-connector-java-5.0/mysql-connector-java-5.0.8.tar.gz).
5. Start the Hive shell and run
show tables;
If no error is reported, Hive with the standalone metastore is installed successfully.
Now check what the metadata looks like.
Create a table in Hive:
CREATE TABLE my(id INT, name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
show tables;
select name from my;
Then log in to MySQL with the hive account we just created and inspect the metadata.
mysql> use hive
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> show tables;
+-----------------+
| Tables_in_hive  |
+-----------------+
| BUCKETING_COLS  |
| COLUMNS         |
| DATABASE_PARAMS |
| DBS             |
| PARTITION_KEYS  |
| SDS             |
| SD_PARAMS       |
| SEQUENCE_TABLE  |
| SERDES          |
| SERDE_PARAMS    |
| SORT_COLS       |
| TABLE_PARAMS    |
| TBLS            |
+-----------------+
13 rows in set (0.00 sec)
mysql> select * from TBLS;
+--------+-------------+-------+------------------+--------+-----------+-------+----------+---------------+--------------------+--------------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER  | RETENTION | SD_ID | TBL_NAME | TBL_TYPE      | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT |
+--------+-------------+-------+------------------+--------+-----------+-------+----------+---------------+--------------------+--------------------+
|      1 |  1319445990 |     1 |                0 | hadoop |         0 |     1 | my       | MANAGED_TABLE | NULL               | NULL               |
+--------+-------------+-------+------------------+--------+-----------+-------+----------+---------------+--------------------+--------------------+
1 row in set (0.00 sec)
The metadata for the Hive table can be seen in TBLS.
6. Example application (JDBC)
• Disable the firewall:
# chkconfig --level 35 iptables off
(note that --level is written with two hyphens; reboot afterwards)
When developing a Hive program over JDBC, Hive's remote service interface must be started first. Start it with:
hive --service hiveserver
1). Test data (/usr)
Contents of userinfo.txt (fields on each line are separated by a tab):
1 xiapi
2 xiaoxue
3 qingqing
2). In Eclipse, create a Java project HiveJdbcClient with a package HiveJdbcClient and a class HiveJdbcClient. Right-click the project, open Properties, and under Libraries add all the jars in /usr/hive/hive-0.11.0/lib plus hadoop-0.20.2-core.jar from /usr/hadoop/hadoop-0.20.2.
3). Program code
package HiveJdbcClient;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

import org.apache.log4j.Logger;

public class HiveJdbcClient {
    private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";
    private static String url = "jdbc:hive://192.168.131.131:10000/default";
    private static String user = "";
    private static String password = "";
    private static String sql = "";
    private static ResultSet res;
    private static final Logger log = Logger.getLogger(HiveJdbcClient.class);

    public static void main(String[] args) {
        try {
            Class.forName(driverName);
            Connection conn = DriverManager.getConnection(url, user, password);
            Statement stmt = conn.createStatement();

            // name of the table to create
            String tableName = "testHiveDriverTable";
            sql = "drop table " + tableName;
            stmt.executeQuery(sql);
            sql = "create table " + tableName + " (key int, value string) row format delimited fields terminated by '\t'";
            stmt.executeQuery(sql);

            // run "show tables"
            sql = "show tables '" + tableName + "'";
            System.out.println("Running: " + sql);
            res = stmt.executeQuery(sql);
            System.out.println("Result of 'show tables':");
            if (res.next()) {
                System.out.println(res.getString(1));
            }

            // run "describe table"
            sql = "describe " + tableName;
            System.out.println("Running: " + sql);
            res = stmt.executeQuery(sql);
            System.out.println("Result of 'describe table':");
            while (res.next()) {
                System.out.println(res.getString(1) + "\t" + res.getString(2));
            }

            // run "load data into table"
            String filepath = "/usr/userinfo.txt";
            sql = "load data local inpath '" + filepath + "' into table " + tableName;
            System.out.println("Running: " + sql);
            res = stmt.executeQuery(sql);

            // run "select * query"
            sql = "select * from " + tableName;
            System.out.println("Running: " + sql);
            res = stmt.executeQuery(sql);
            System.out.println("Result of 'select * query':");
            while (res.next()) {
                System.out.println(res.getInt(1) + "\t" + res.getString(2));
            }

            // run a regular hive query
            sql = "select count(1) from " + tableName;
            System.out.println("Running: " + sql);
            res = stmt.executeQuery(sql);
            System.out.println("Result of 'regular hive query':");
            while (res.next()) {
                System.out.println(res.getString(1));
            }

            conn.close();
            conn = null;
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
            log.error(driverName + " not found!", e);
            System.exit(1);
        } catch (SQLException e) {
            e.printStackTrace();
            log.error("Connection error!", e);
            System.exit(1);
        }
    }
}
Results (in the Eclipse console):
Running: show tables 'testHiveDriverTable'
Result of 'show tables':
testhivedrivertable
Running: describe testHiveDriverTable
Result of 'describe table':
key	int
value	string
Running: load data local inpath '/usr/userinfo.txt' into table testHiveDriverTable
Running: select * from testHiveDriverTable
Result of 'select * query':
1	xiapi
2	xiaoxue
3	qingqing
Running: select count(1) from testHiveDriverTable
Result of 'regular hive query':
3
The corresponding output also appears in the CentOS terminal.
VIII. Installing and configuring Pig
1. cd /usr
sudo chmod 777 pig-0.5.0.tar.gz
sudo tar zxvf pig-0.5.0.tar.gz    (extract the archive)
2. vim /home/hadoop/.bashrc
Add the following lines:
export PIG_HOME=/usr/pig-0.5.0
export PIG_HADOOP_VERSION=20
export PIG_CLASSPATH=/usr/hadoop/hadoop-0.20.2/conf
export PATH=···:$PIG_HOME/bin:$PATH
3. source /home/hadoop/.bashrc
4. Run pig from the shell prompt (%); if the Grunt shell comes up, the configuration succeeded. A quick check is sketched below.
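A quick smoke test from the Grunt shell (a hedged sketch; 'input' is the HDFS directory used for the WordCount example in section V, and any small text file in it will do):
grunt> A = LOAD 'input' USING PigStorage();
grunt> DUMP A;
grunt> quit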
IX. Installing and configuring ZooKeeper
1. cd /usr
sudo chmod 777 zookeeper-3.4.3.tar.gz
sudo tar zxvf zookeeper-3.4.3.tar.gz    (extract the archive)
chown -R hadoop:hadoop zookeeper-3.4.3
2. vim /home/hadoop/.bashrc
Add the following lines:
export ZOOKEEPER_HOME=/usr/zookeeper-3.4.3
export CLASSPATH=···:$ZOOKEEPER_HOME/lib
export PATH=···:$ZOOKEEPER_HOME/bin:$PATH
source /home/hadoop/.bashrc
3. cd /usr/zookeeper-3.4.3/conf
Rename zoo_sample.cfg to zoo.cfg; its contents are:
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/usr/zookeeper-3.4.3/data
# the port at which the clients will connect
clientPort=2181
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1
4. cd /usr/zookeeper-3.4.3
bin/zkServer.sh start    (starts ZooKeeper)
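To confirm the server is up, check its status or connect with the bundled client (both scripts ship in the same bin directory):
bin/zkServer.sh status
bin/zkCli.sh -server 127.0.0.1:2181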
X. Common problems
1.
2.
Problem:
When working with DFS from Eclipse, the connection succeeds but the tree cannot be expanded to show any contents.
Analysis and solution:
This usually means the namenode or datanode is not running, or started and then stopped on its own. Check with the jps command inside the virtual machine. If the namenode is not running, run stop-all.sh and then start-all.sh again; if jps then looks fine but the daemons stop again after a while, reformat the namenode. If the datanode is not running, go to /usr/hadoop/hadoop-0.20.2/tmp, delete the data directory, and restart Hadoop; that resolves it.
3. Various problems may crop up during installation. If you follow the steps above exactly you should not hit errors (other than problems 1 and 2); if you do, check whether you skipped a step or a setting.