Hadoop Fully Distributed Installation Tutorial
Table of Contents
I. Software versions
II. Installation tutorial
    1. Installing VMWare
    2. Installing Ubuntu
    3. Installing VMware Tools
    4. Creating users
    5. Host configuration
    6. Passwordless SSH configuration
    7. Java environment configuration
    8. Hadoop cluster installation
III. Running the wordcount program
I. Software versions
Hadoop version: hadoop-2.6.0.tar.gz
VMWare version: VMware-workstation-full-11.0.0-2305329
Ubuntu version: ubuntu-14.04.1-desktop-i386 (other versions also work)
JDK version: jdk-6u45-linux-i586.bin
The exact versions of the last three items are not critical; note, however, that if you use HBase 1.0.0 you need JDK 1.8 or later.
II. Installation tutorial
VMWare Workstation is a piece of software that, once installed, lets you create virtual machines; you install an operating system inside each virtual machine and applications on top of that virtual system, and everything behaves just like operating a real computer.
Please download the software directly from the official VMWare website:
http://www.vmware.com/cn/products/workstation/workstation-evaluation
If this link changes because the official site is reorganized, search for VMWare in a search engine to find the download page; downloading from unofficial sites is not recommended.
The trial version can be used for 30 days after installation.
Open VMWare and click Create a New Virtual Machine.
Choose Typical.
Click Browse.
Select the Ubuntu ISO image.
For now, create only two virtual machines, and name them Ubuntu1 and Ubuntu2. You may use your own naming convention, but then many of the configuration files in the later steps must be changed accordingly, which causes extra work.
Also remember the password you set; it will be needed frequently later.
Ubuntu will show that a CD has been inserted into the virtual CD drive.
Double-click to open the CD and copy VMwareTools-9.6.1-1378637.tar.gz from it to the desktop; copying works the same way as in Windows.
Right-click the archive and choose Extract Here.
Open a terminal in Ubuntu from the menu and run:
cd Desktop/vmware-tools-distrib/
sudo ./vmware-install.pl
Enter your password when prompted, press Enter through the remaining questions, and reboot the system.
Before sudo was written (around 1980), ordinary users typically administered a system by using su to switch to the superuser. One drawback of su is that the superuser's password has to be shared.
sudo lets an ordinary user obtain elevated privileges without knowing the superuser's password. The superuser first records, in a special file (usually /etc/sudoers), the user's name, the specific commands the user may run, and the user or group identity under which they may be run; this authorizes the user, who is then called a "sudoer". When that user needs elevated privileges, they prefix the command with "sudo"; sudo asks for the user's own password (to confirm that the person at the terminal really is that user) and then runs the command with superuser privileges. For a period afterwards (5 minutes by default, configurable in /etc/sudoers), sudo does not ask for the password again.
Because the superuser's password is not required, some Unix-like systems, such as Ubuntu and Mac OS X, even use sudo so that an ordinary user account replaces the superuser as the administrative account.
Note: after Ubuntu is installed, the root account is locked by default; it cannot log in and you cannot su to root.
Allowing su to root is very simple: set a password for root (for example with sudo passwd root).
Note: after installing Ubuntu, update the package sources:
cd /etc/apt
sudo apt-get update
This makes installing software later much easier.
Create the hadoop group: sudo addgroup hadoop
Create the hduser user: sudo adduser --ingroup hadoop hduser
Note: give hduser the same password as your main user.
Give the hadoop user sudo privileges: sudo gedit /etc/sudoers, and below the line root ALL=(ALL) ALL add:
hduser ALL=(ALL) ALL
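After this edit, the relevant part of /etc/sudoers should look like the following excerpt:
root    ALL=(ALL) ALL
hduser  ALL=(ALL) ALL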
If the command reports an error, change into the directory and edit the file from there.
After the configuration, reboot the machine: sudo reboot
Then log in as the hduser user.
The Hadoop cluster consists of two machines providing one Master and two Slaves: the virtual machine Ubuntu1 acts as both Master and Slave, while Ubuntu2 acts only as a Slave.
Configure the hostname: on Ubuntu, change the machine name with sudo gedit /etc/hostname and set it to Ubuntu1. After rebooting, run the hostname command to check that the new name has taken effect.
At this point you can create the second machine by cloning the first one (shut it down first, then use the VMware menu: Virtual Machine -> Manage -> Clone).
Note: change the hostname of the clone to Ubuntu2.
Configure the hosts file: find the IP addresses of Ubuntu1 and Ubuntu2 with ifconfig;
open the hosts file with sudo gedit /etc/hosts and add the following lines:
192.168.xxx.xxx Ubuntu1
192.168.xxx.xxx Ubuntu2
Note: use the actual IP addresses of your own machines here.
On Ubuntu1 run: ping Ubuntu2. If the ping succeeds, the configuration is correct.
Optional: configure the SSH server so that SSH connections to the Linux guests are established quickly.
1. Edit the SSH server configuration: sudo vi /etc/ssh/sshd_config
2. Append the following to the end of sshd_config:
Ciphers aes128-cbc,aes192-cbc,aes256-cbc,aes128-ctr,aes192-ctr,aes256-ctr,3des-cbc,arcfour128,arcfour256,arcfour,blowfish-cbc,cast128-cbc
MACs hmac-md5,hmac-sha1,umac-64@openssh.com,hmac-ripemd160,hmac-sha1-96,hmac-md5-96
KexAlgorithms diffie-hellman-group1-sha1,diffie-hellman-group14-sha1,diffie-hellman-group-exchange-sha1,diffie-hellman-group-exchange-sha256,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,curve25519-sha256@libssh.org
3. Restart the sshd service; connections should then work normally:
sudo /etc/init.d/ssh restart
Install the SSH server (the SSH client is installed by default): sudo apt-get install openssh-server
On Ubuntu1, generate a key pair: ssh-keygen -t rsa -P ""
Check that the directory /home/hduser/.ssh now contains id_rsa and id_rsa.pub.
Append the public key to authorized_keys: cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
(or, from inside the .ssh directory: cat id_rsa.pub >> authorized_keys)
Test passwordless login locally: ssh localhost
To log in to Ubuntu2 without a password, run on Ubuntu1: ssh-copy-id Ubuntu2, then check that /home/hduser/.ssh on Ubuntu2 contains authorized_keys.
On Ubuntu1 run: ssh Ubuntu2. The first login asks for a password; later logins do not.
To let Ubuntu2 log in to Ubuntu1 without a password as well, repeat the same steps on Ubuntu2.
Note: if passwordless login does not work, the cause is very often file or directory permissions. SSH requires restrictive permissions on the key files: chmod 700 ~/.ssh and chmod 600 ~/.ssh/authorized_keys.
root@Ubuntu1:/home/hduser#
root@Ubuntu1:/home/hduser# ssh Ubuntu2    -- you cannot connect to Ubuntu2 as the root user; root has no authorization there
root@ubuntu2's password:
Permission denied, please try again.
Make the /opt directory writable: sudo chmod 777 /opt
Place the Java archive in /opt and, as root, run: sudo ./jdk-6u45-linux-i586.bin
Configure the JDK environment variables: sudo gedit /etc/profile, then copy the following into the file and save it:
# java
export JAVA_HOME=/opt/jdk1.6.0_45
export JRE_HOME=$JAVA_HOME/jre
export CLASSPATH=$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH
Make the configuration take effect: source /etc/profile
Run: java -version. If the Java version number is printed, the installation succeeded.
Alternatively, install OpenJDK 7 from the Ubuntu repositories: sudo apt-get install openjdk-7-jdk
(see http://openjdk.java.net/install/)
8.1 Installation
Place the Hadoop archive hadoop-2.6.0.tar.gz in the /home/hduser directory, extract it there, and rename the extracted directory to hadoop (for example: tar -zxvf hadoop-2.6.0.tar.gz && mv hadoop-2.6.0 hadoop). Then configure the Hadoop environment variables: run sudo gedit /etc/profile and copy the following into the file:
#hadoop
export HADOOP_HOME=/home/hduser/hadoop
export PATH=$HADOOP_HOME/bin:$PATH
Run: source /etc/profile
Note: perform all of the steps above on both Ubuntu1 and Ubuntu2.
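As a quick check that the variables are in effect (this check is not part of the original steps), the hadoop command should now be on the PATH:
hduser@Ubuntu1:~$ hadoop version
This should print Hadoop 2.6.0 followed by build information.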
8.2 Configuration
Seven configuration files are involved, all under /home/hduser/hadoop/etc/hadoop; they can be edited with gedit.
(1) Change into the Hadoop configuration directory:
cd /home/hduser/hadoop/etc/hadoop/
(2) Configure hadoop-env.sh --> set JAVA_HOME
gedit hadoop-env.sh
Add the following (point JAVA_HOME at the JDK you actually installed):
# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386/
#JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
(3) Configure yarn-env.sh --> set JAVA_HOME
Add the following (again, use the path of the JDK you installed):
# some Java parameters
export JAVA_HOME=/opt/jdk1.6.0_45
(4) Configure the slaves file --> add the slave nodes
(delete the original localhost entry)
Add the following:
Ubuntu1
Ubuntu2
(5) Configure core-site.xml --> add the Hadoop core configuration
(the HDFS port is 9000; the temporary directory is file:/home/hduser/hadoop/tmp)
Add the following content:
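A minimal core-site.xml sketch consistent with the port and temporary directory noted above (the NameNode runs on Ubuntu1):
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://Ubuntu1:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/home/hduser/hadoop/tmp</value>
    </property>
</configuration>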
(6) Configure hdfs-site.xml --> add the HDFS configuration
(NameNode and DataNode ports and directory locations)
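A minimal hdfs-site.xml sketch for this two-DataNode cluster; the name/data directory locations under /home/hduser/hadoop and the secondary NameNode port 9001 are assumptions, while the replication factor of 2 matches the file listings shown later:
<configuration>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>Ubuntu1:9001</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/hduser/hadoop/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/hduser/hadoop/dfs/data</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
</configuration>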
(7) Configure mapred-site.xml --> add the MapReduce configuration
(use the YARN framework; set the jobhistory address and its web address)
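If mapred-site.xml does not exist yet, copy it from the template first: cp mapred-site.xml.template mapred-site.xml. The following sketch matches the hints above; the jobhistory ports are the usual defaults and are assumptions here:
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>Ubuntu1:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>Ubuntu1:19888</value>
    </property>
</configuration>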
(8) Configure yarn-site.xml --> enable the YARN services
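A minimal yarn-site.xml sketch; the ResourceManager address and web port match those that appear in the job logs later (8032 and 8088), and the shuffle auxiliary service is required for MapReduce on YARN:
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>Ubuntu1</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>Ubuntu1:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>Ubuntu1:8088</value>
    </property>
</configuration>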
(9) Copy the configured /home/hduser/hadoop/etc/hadoop folder from Ubuntu1 to the corresponding location on Ubuntu2 (delete the original etc/hadoop folder on Ubuntu2 first):
scp -r /home/hduser/hadoop/etc/hadoop/ hduser@Ubuntu2:/home/hduser/hadoop/etc/
8.3 Verification
Now verify that the Hadoop configuration is correct.
(1) Format the NameNode (strictly, this only needs to be done on the NameNode host, Ubuntu1):
hduser@Ubuntu1:~$ cd hadoop
hduser@Ubuntu1:~/hadoop$ ./bin/hdfs namenode -format
hduser@Ubuntu2:~$ cd hadoop
hduser@Ubuntu2:~/hadoop$ ./bin/hdfs namenode -format
(2) Start HDFS:
hduser@Ubuntu1:~/hadoop$ ./sbin/start-dfs.sh
15/04/27 04:18:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [Ubuntu1]
Ubuntu1: starting namenode, logging to /home/hduser/hadoop/logs/hadoop-hduser-namenode-Ubuntu1.out
Ubuntu1: starting datanode, logging to /home/hduser/hadoop/logs/hadoop-hduser-datanode-Ubuntu1.out
Ubuntu2: starting datanode, logging to /home/hduser/hadoop/logs/hadoop-hduser-datanode-Ubuntu2.out
Starting secondary namenodes [Ubuntu1]
Ubuntu1: starting secondarynamenode, logging to /home/hduser/hadoop/logs/hadoop-hduser-secondarynamenode-Ubuntu1.out
15/04/27 04:19:07 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Check the Java processes with jps (the Java Virtual Machine Process Status Tool):
hduser@Ubuntu1:~/hadoop$ jps
8008 NameNode
8443 Jps
8158 DataNode
8314 SecondaryNameNode
If jps shows that the NameNode process is not running, stop the services, reformat the NameNode (hadoop namenode -format) and start the daemons again (start-all.sh); the NameNode process should then be running. (If you reformat after the DataNodes have already stored data, also clear the DataNode data directories, otherwise the DataNodes will fail to start because of a cluster ID mismatch.)
(3) Stop HDFS:
hduser@Ubuntu1:~/hadoop$ ./sbin/stop-dfs.sh
Stopping namenodes on [Ubuntu1]
Ubuntu1: stopping namenode
Ubuntu1: stopping datanode
Ubuntu2: stopping datanode
Stopping secondary namenodes [Ubuntu1]
Ubuntu1: stopping secondarynamenode
Check the Java processes:
hduser@Ubuntu1:~/hadoop$ jps
8850 Jps
(4) Start YARN:
hduser@Ubuntu1:~/hadoop$ ./sbin/start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/hduser/hadoop/logs/yarn-hduser-resourcemanager-Ubuntu1.out
Ubuntu2: starting nodemanager, logging to /home/hduser/hadoop/logs/yarn-hduser-nodemanager-Ubuntu2.out
Ubuntu1: starting nodemanager, logging to /home/hduser/hadoop/logs/yarn-hduser-nodemanager-Ubuntu1.out
Check the Java processes:
hduser@Ubuntu1:~/hadoop$ jps
8911 ResourceManager
9247 Jps
9034 NodeManager
(5) Stop YARN:
hduser@Ubuntu1:~/hadoop$ ./sbin/stop-yarn.sh
stopping yarn daemons
stopping resourcemanager
Ubuntu1: stopping nodemanager
Ubuntu2: stopping nodemanager
no proxyserver to stop
Check the Java processes:
hduser@Ubuntu1:~/hadoop$ jps
9542 Jps
(6) Check the cluster status.
First start the cluster: ./sbin/start-dfs.sh
hduser@Ubuntu1:~/hadoop$ ./bin/hdfs dfsadmin -report
Configured Capacity: 39891361792 (37.15 GB)
Present Capacity: 28707627008 (26.74 GB)
DFS Remaining: 28707569664 (26.74 GB)
DFS Used: 57344 (56 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Live datanodes (2):
Name: 192.168.159.132:50010 (Ubuntu2)
Hostname: Ubuntu2
Decommission Status : Normal
Configured Capacity: 19945680896 (18.58 GB)
DFS Used: 28672 (28 KB)
Non DFS Used: 5575745536 (5.19 GB)
DFS Remaining: 14369906688 (13.38 GB)
DFS Used%: 0.00%
DFS Remaining%: 72.05%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Mon Apr 27 04:26:09 PDT 2015
Name: 192.168.159.131:50010 (Ubuntu1)
Hostname: Ubuntu1
Decommission Status : Normal
Configured Capacity: 19945680896 (18.58 GB)
DFS Used: 28672 (28 KB)
Non DFS Used: 5607989248 (5.22 GB)
DFS Remaining: 14337662976 (13.35 GB)
DFS Used%: 0.00%
DFS Remaining%: 71.88%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Mon Apr 27 04:26:08 PDT 2015
(7) View HDFS in a browser: http://Ubuntu1:50070/
III. Running the wordcount program
(1) Create the file directory:
hduser@Ubuntu1:~$ mkdir file
(2) Create file1.txt and file2.txt in the file directory and add some content (this can be done in the graphical interface).
Fill them with the following content:
file1.txt: Hello world hi HADOOP
file2.txt: Hello hadoop hi CHINA
After creating them, check the contents:
hduser@Ubuntu1:~$ cat file/file1.txt
Hello world hi HADOOP
hduser@Ubuntu1:~$ cat file/file2.txt
Hello hadoop hi CHINA
(3) Create the /input2 directory in HDFS:
hduser@Ubuntu1:~/hadoop$ ./bin/hadoop fs -mkdir /input2
(equivalently: ./bin/hdfs dfs -mkdir -p /input2)
(4) Copy file1.txt and file2.txt into the HDFS directory /input2:
hduser@Ubuntu1:~/hadoop$ ./bin/hadoop fs -put ../file/file*.txt /input2
(equivalently: ./bin/hdfs dfs -put ../file/file*.txt /input2)
(5) Check that file1.txt and file2.txt are now in HDFS:
hduser@Ubuntu1:~/hadoop$ ./bin/hadoop fs -ls /input2/
Found 2 items
-rw-r--r-- 2 hduser supergroup 21 2015-04-27 05:54 /input2/file1.txt
-rw-r--r-- 2 hduser supergroup 24 2015-04-27 05:54 /input2/file2.txt
(6) Run the wordcount program.
Start HDFS and YARN first (note that the examples jar file name must match your Hadoop version):
hduser@Ubuntu1:~/hadoop$ ./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.5.jar wordcount /input2 /output2/wordcount1
15/04/27 05:57:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/04/27 05:57:17 INFO client.RMProxy: Connecting to ResourceManager at Ubuntu1/192.168.159.131:8032
15/04/27 05:57:19 INFO input.FileInputFormat: Total input paths to process: 2
15/04/27 05:57:19 INFO mapreduce.JobSubmitter: number of splits:2
15/04/27 05:57:19 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1430138907536_0001
15/04/27 05:57:20 INFO impl.YarnClientImpl: Submitted application application_1430138907536_0001
15/04/27 05:57:20 INFO mapreduce.Job: The url to track the job: http://Ubuntu1:8088/proxy/application_1430138907536_0001/
15/04/27 05:57:20 INFO mapreduce.Job: Running job: job_1430138907536_0001
15/04/27 05:57:32 INFO mapreduce.Job: Job job_1430138907536_0001 running in uber mode : false
15/04/27 05:57:32 INFO mapreduce.Job: map 0% reduce 0%
15/04/27 05:57:43 INFO mapreduce.Job: map 100% reduce 0%
15/04/27 05:57:58 INFO mapreduce.Job: map 100% reduce 100%
15/04/27 05:57:59 INFO mapreduce.Job: Job job_1430138907536_0001 completed successfully
15/04/27 05:57:59 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=84
        FILE: Number of bytes written=317849
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=247
        HDFS: Number of bytes written=37
        HDFS: Number of read operations=9
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=2
        Launched reduce tasks=1
        Data-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=16813
        Total time spent by all reduces in occupied slots (ms)=12443
        Total time spent by all map tasks (ms)=16813
        Total time spent by all reduce tasks (ms)=12443
        Total vcore-seconds taken by all map tasks=16813
        Total vcore-seconds taken by all reduce tasks=12443
        Total megabyte-seconds taken by all map tasks=17216512
        Total megabyte-seconds taken by all reduce tasks=12741632
    Map-Reduce Framework
        Map input records=2
        Map output records=8
        Map output bytes=75
        Map output materialized bytes=90
        Input split bytes=202
        Combine input records=8
        Combine output records=7
        Reduce input groups=5
        Reduce shuffle bytes=90
        Reduce input records=7
        Reduce output records=5
        Spilled Records=14
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=622
        CPU time spent (ms)=2000
        Physical memory (bytes) snapshot=390164480
        Virtual memory (bytes) snapshot=1179254784
        Total committed heap usage (bytes)=257892352
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=45
    File Output Format Counters
        Bytes Written=37
(7) View the result:
hduser@Ubuntu1:~/hadoop$ ./bin/hdfs dfs -cat /output2/wordcount1/*
CHINA 1
Hello 2
hadoop 2
hi 2
world 1
——————————————
If you see the output above, you have successfully installed Hadoop!
Setting up the Eclipse development environment:
1. Download Eclipse.
2. You need the Hadoop Eclipse plugin; the definitive source is
https://github.com/winghc/hadoop2x-eclipse-plugin (download and build it).
You can also use a pre-built plugin if one is provided.
3. Copy the built plugin jar into the Eclipse plugins directory and restart Eclipse.
4. Configure the Hadoop installation directory:
Window -> Preferences -> Hadoop Map/Reduce -> Hadoop installation directory
5. Configure the Map/Reduce perspective and view:
Window -> Open Perspective -> Other -> Map/Reduce -> click "OK"
Window -> Show View -> Other -> Map/Reduce Locations -> click "OK"
6. In the "Map/Reduce Locations" tab, click the elephant icon (or right-click in the empty area) and choose "New Hadoop location…"; fill in the settings in the "New Hadoop location…" dialog.
The MR Master and DFS Master settings must match the values in mapred-site.xml, core-site.xml and the other configuration files.
7. Open the Project Explorer to browse the HDFS file system.
8. Create a new Map/Reduce project (the Hadoop services must be running first):
File -> New -> Project -> Map/Reduce Project -> Next
Write the WordCount class:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // key and value are the input key/value pair; context collects the output
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // like map(), but the values arrive as an Iterable<IntWritable>
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result); // e.g. (world, 2)
        }
    }
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count"); // job name and configuration
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);   // map function
        job.setCombinerClass(IntSumReducer.class);   // combiner
        job.setReducerClass(IntSumReducer.class);    // reduce function
        job.setOutputKeyClass(Text.class);           // output key type
        job.setOutputValueClass(IntWritable.class);  // output value type
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
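To run this class from Eclipse against the cluster configured above, pass the input and output directories as program arguments (Run As -> Run Configurations… -> Arguments tab). The paths below are only an illustration and assume the core-site.xml sketch given earlier:
hdfs://Ubuntu1:9000/input2 hdfs://Ubuntu1:9000/output2/wordcount2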
Inverted index of music play records
MapReduce program development
1. The task:
We have a set of music play records, each line listing a user and the song they played:
tom LittleApple
jack YesterdayOnceMore
Rose MyHeartWillGoOn
jack LittleApple
John MyHeartWillGoOn
kissinger LittleApple
kissinger YesterdayOnceMore
2. The required output:
A text file containing an inverted index, like this:
LittleApple tom| jack| kissinger
YesterdayOnceMore jack|kissinger
MyHeartWillGoOn Rose|John
3. The algorithm:
Split the source file line by line. In the mapper, use the song name (e.g. LittleApple) as the key and the user name (e.g. tom) as the value; in the reducer, all records with the same song are gathered together and written out as one line of the inverted index.
tom LittleApple
jack YesterdayOnceMore
Rose MyHeartWillGoOn
For these lines the Map function emits the key/value pairs:
< LittleApple, tom >
< YesterdayOnceMore, jack >
< MyHeartWillGoOn, Rose >
The Reduce function then receives, for each song, all of the users who played it, for example the key LittleApple with the values
tom
jack
kissinger
and joins them into a single output line.
The final result written to HDFS is:
LittleApple tom| jack| kissinger
YesterdayOnceMore jack|kissinger
MyHeartWillGoOn Rose|John
4. The inverted index source program, with comments:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Test_1 extends Configured implements Tool
{
    enum Counter
    {
        LINESKIP, // lines that could not be parsed
    }

    public static class Map extends Mapper<LongWritable, Text, Text, Text>
    {
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
        {
            String line = value.toString(); // read one line of the source data as a string
            try
            {
                // split the line on the space, e.g. "tom LittleApple"
                String[] lineSplit = line.split(" ");
                String anum = lineSplit[0]; // anum is the user, e.g. tom
                String bnum = lineSplit[1]; // bnum is the song, e.g. LittleApple
                context.write(new Text(bnum), new Text(anum));
                // the key/value pair written to the context is <song, user>
            }
            catch (java.lang.ArrayIndexOutOfBoundsException e) // guard against malformed lines
            {
                context.getCounter(Counter.LINESKIP).increment(1);
                return;
            }
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text>
    {
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException
        {
            String valueString;
            String out = "";
            for (Text value : values)
            {
                valueString = value.toString();
                out += valueString + "|"; // users who played the same song are joined with |
                //System.out.println("Reduce: key=" + key + " value=" + value);
            }
            context.write(key, new Text(out));
        }
    }

    @Override
    public int run(String[] args) throws Exception
    {
        Configuration conf = this.getConf();
        Job job = new Job(conf, "Test_1"); // job name
        job.setJarByClass(Test_1.class); // the class containing the job
        FileInputFormat.addInputPath(job, new Path(args[0])); // input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path
        job.setMapperClass(Map.class); // use the Map class above for the map tasks
        job.setReducerClass(Reduce.class); // use the Reduce class above for the reduce tasks
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class); // output key type
        job.setOutputValueClass(Text.class); // output value type
        job.waitForCompletion(true);
        return job.isSuccessful() ? 0 : 1;
    }

    public static void main(String[] args) throws Exception
    {
        // run the job
        int res = ToolRunner.run(new Configuration(), new Test_1(), args);
        System.exit(res);
    }
}
5. Remember to set the input and output paths.
The program can be run directly from Eclipse, or packaged as a jar and run on the cluster, as sketched below.
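A sketch of running the packaged job on the cluster, assuming the jar was exported from Eclipse as Test_1.jar and the play records were uploaded to an HDFS directory /music_input (both names are only illustrative; omit the class name if it is already set in the jar manifest):
hduser@Ubuntu1:~/hadoop$ ./bin/hadoop jar Test_1.jar Test_1 /music_input /music_output
hduser@Ubuntu1:~/hadoop$ ./bin/hdfs dfs -cat /music_output/*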