Developing MapReduce programs generally requires a JDK, Eclipse, and a Hadoop cluster. Plenty of blog posts already cover this ground, but I still wanted to organize and record the whole process properly.
I. Setting up a Hadoop cluster and MR development environment on Windows 7
Software to install, with versions:
OS: Windows 7
Shell support: Cygwin
JDK: 1.6.0_38
Hadoop: 0.20.2
Eclipse: Juno Service Release 1
Software installation and environment variable setup:
1) Installing Cygwin
Download the latest installer from the official site: http://cygwin.com/setup.exe
During setup, make sure to install the openssh and openssl packages.
2) Configuring Cygwin
Set the Cygwin environment variables:
Add D:\cygwin\bin;D:\cygwin\usr\sbin;D:\cygwin\usr\i686-pc-cygwin\bin to the PATH variable.
3) Passwordless SSH setup
wuliufu@wuliufu-PC ~
$ ssh-host-config
*** Info: Generating /etc/ssh_host_key
*** Info: Generating /etc/ssh_host_rsa_key
*** Info: Generating /etc/ssh_host_dsa_key
*** Info: Generating /etc/ssh_host_ecdsa_key
*** Info: Creating default /etc/ssh_config file
*** Info: Creating default /etc/sshd_config file
*** Info: Privilege separation is set to yes by default since OpenSSH 3.3.
*** Info: However, this requires a non-privileged account called 'sshd'.
*** Info: For more info on privilege separation read /usr/share/doc/openssh/README.privsep.
*** Query: Should privilege separation be used? (yes/no) no
*** Info: Updating /etc/sshd_config file
*** Info: Sshd service is already installed.
*** Info: Host configuration finished. Have fun!
Open the Services panel:
Control Panel\All Control Panel Items\Administrative Tools\Services
You should find the CYGWIN sshd service there; start it.
Note: on Windows 7 the sshd service may refuse to start, with a message along the lines of "the service started and then stopped". If that happens, try the following:
Right-click CYGWIN sshd -> Properties -> Log On -> This account -> Browse -> Advanced -> select Administrator and click OK, then enter the account password under "This account".
If the Administrator account is disabled, enable it via Control Panel\All Control Panel Items\Administrative Tools\Local Security Policy -> Local Policies -> Security Options,
then on the right select "Accounts: Administrator account status" and set it to Enabled.
Then restart sshd. If it still won't start, try rerunning ssh-host-config and answer the prompts as follows:
wuliufu@wuliufu-PC ~
$ ssh-host-config
*** Query: Overwrite existing /etc/ssh_config file? (yes/no) yes
*** Info: Creating default /etc/ssh_config file
*** Query: Overwrite existing /etc/sshd_config file? (yes/no) yes
*** Info: Creating default /etc/sshd_config file
*** Info: Privilege separation is set to yes by default since OpenSSH 3.3.
*** Info: However, this requires a non-privileged account called 'sshd'.
*** Info: For more info on privilege separation read /usr/share/doc/openssh/README.privsep.
*** Query: Should privilege separation be used? (yes/no) yes
*** Info: Note that creating a new user requires that the current account have
*** Info: Administrator privileges. Should this script attempt to create a
*** Query: new local account 'sshd'? (yes/no) yes
*** Info: Updating /etc/sshd_config file
*** Info: Sshd service is already installed.
*** Info: Host configuration finished. Have fun!
Then restart sshd once more; at this point it started successfully for me.
Now configure passwordless SSH login:
$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/wuliufu/.ssh/id_rsa):
Created directory '/home/wuliufu/.ssh'.
Your identification has been saved in /home/wuliufu/.ssh/id_rsa.
Your public key has been saved in /home/wuliufu/.ssh/id_rsa.pub.
The key fingerprint is:
1c:c7:f2:e1:11:76:0f:a8:66:44:f3:30:4b:98:08:86 wuliufu@wuliufu-PC
The key's randomart image is:
+--[ RSA 2048]----+
| .o. . +* o.o    |
|E. . o..O.o o    |
| .+.* .          |
| .+* o           |
|  oS o           |
|                 |
|                 |
|                 |
|                 |
+-----------------+
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
ECDSA key fingerprint is be:be:31:a7:83:28:66:82:f7:25:33:4c:98:79:4d:47.
Are you sure you want to continue connecting (yes/no)? yes
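If ssh localhost still prompts for a password at this point, the usual suspect is file permissions on the key files. A minimal fix worth trying (assuming the default key locations used above; this is a common remedy, not a guaranteed one) is:
$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys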
4) Installing the JDK and Eclipse and setting their environment variables (omitted)
5) Installing Hadoop
Download: http://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
Put the archive in the D:\cygwin\home\wuliufu directory.
wuliufu@wuliufu-PC ~
$ tar -zxvf hadoop-0.20.2.tar.gz
$ ln -s ~/hadoop-0.20.2 ~/hadoop
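As a quick sanity check that the unpacked tree is usable, you can ask Hadoop for its version. The first line of the banner should look roughly like the following (the exact output is from memory and may differ slightly):
$ ~/hadoop/bin/hadoop version
Hadoop 0.20.2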
6) Configuring Hadoop
To start with, set just a few core properties, as shown below; see the documentation for the rest.
$ cd ~/hadoop/conf

vi hadoop-env.sh
# Set the JDK and Hadoop home by adding variable assignments like these:
export JAVA_HOME="/cygdrive/d/Program Files/Java/jdk1.6.0_38"
export HADOOP_HOME=/home/wuliufu/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$JAVA_HOME/bin

vi core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/wuliufu/hadoop/hadoop-root</value>
  </property>
</configuration>

vi hdfs-site.xml
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/wuliufu/hadoop/data/dfs/name</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.namenode.data.dir</name>
    <value>/home/wuliufu/hadoop/data/dfs/data</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>

vi mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
7) Formatting and starting Hadoop
1. Format the namenode
$ hadoop namenode -format
cygwin warning:
  MS-DOS style path detected: D:\cygwin\home\wuliufu\hadoop-0.20.2/build/native
  Preferred POSIX equivalent is: /home/wuliufu/hadoop-0.20.2/build/native
  CYGWIN environment variable option "nodosfilewarning" turns off this warning.
  Consult the user's guide for more details about POSIX paths:
    http://cygwin.com/cygwin-ug-net/using.html#using-pathnames
13/04/23 22:42:47 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = wuliufu-PC/192.168.1.100
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
Re-format filesystem in \home\wuliufu\hadoop\hadoop-root\dfs\name ? (Y or N) y
Format aborted in \home\wuliufu\hadoop\hadoop-root\dfs\name
13/04/23 22:42:55 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at wuliufu-PC/192.168.1.100
************************************************************/
Note the "Format aborted" line: the "Re-format filesystem ... (Y or N)" prompt only accepts an uppercase Y, and a lowercase y aborts the format, as happened in the transcript above. Run the command again and answer Y.
2. Start the daemons
$ cd hadoop/bin
$ ./start-all.sh
starting namenode, logging to /home/wuliufu/hadoop/logs/hadoop-wuliufu-namenode-wuliufu-PC.out
localhost: starting datanode, logging to /home/wuliufu/hadoop/logs/hadoop-wuliufu-datanode-wuliufu-PC.out
localhost: starting secondarynamenode, logging to /home/wuliufu/hadoop/logs/hadoop-wuliufu-secondarynamenode-wuliufu-PC.out
starting jobtracker, logging to /home/wuliufu/hadoop/logs/hadoop-wuliufu-jobtracker-wuliufu-PC.out
localhost: starting tasktracker, logging to /home/wuliufu/hadoop/logs/hadoop-wuliufu-tasktracker-wuliufu-PC.out
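To confirm all five daemons really came up, jps from the JDK should list them, something like the following (the PIDs here are made up as examples; yours will differ):
$ jps
2340 NameNode
2876 DataNode
3104 SecondaryNameNode
3360 JobTracker
3624 TaskTracker
5210 Jps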
Building the Eclipse plugin (applies to Eclipse SDK 3.3+):
Run the build under Cygwin; for details see:
http://wliufu.iteye.com/blog/1851164
I have also attached the plugin I built against the Eclipse version above.
For building against cdh3u4, see: http://yzyzero.iteye.com/blog/1845396
Configuring the Eclipse Hadoop MapReduce environment
1. Copy the plugin built in the previous step, hadoop-0.20.2-eclipse-plugin.jar, into Eclipse's plugins directory and restart Eclipse.
Open Window->Preferences, click Hadoop Map/Reduce on the left, and on the right set the Hadoop installation directory, e.g.:
D:\cygwin\home\wuliufu\hadoop-0.20.2
2. Click Window->Show View->Other, search for "map", select Map/Reduce Locations, and click OK.
The Map/Reduce Locations view now appears.
In the top-right corner of that view there is a blue elephant icon; click it to create a new location.
Fill in the connection details; the parameters must match the Hadoop configuration above.
The Map/Reduce Master host and port correspond to the mapred.job.tracker value in mapred-site.xml (here: localhost and 9001).
The DFS Master host and port correspond to the fs.default.name value in core-site.xml (here: localhost and 9000).
Then confirm and close the dialog.
Click Open Perspective in Eclipse's top-right corner and switch to the Map/Reduce perspective.
The DFS Locations tree now shows up in the left-hand pane.
If you can browse it, the plugin is connecting to the Hadoop cluster correctly.
Now for a simple MR program.
In Eclipse, click File->New->Other, select Map/Reduce Project, and give it a name, say wordcount.
Copy WordCount.java from the Hadoop 0.20.2 sources into the project (D:\cygwin\home\wuliufu\hadoop-0.20.2\src\examples\org\apache\hadoop\examples\WordCount.java). For reference, a sketch of it follows.
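The bundled example is essentially the classic new-API WordCount below (a sketch reproduced from the 0.20.2 examples; if anything differs, trust the file in your source tree):

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  // Mapper: emit (word, 1) for every token in the input.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as combiner): sum the counts for each word.
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}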
Back in Cygwin, create a file word.txt (any passage of English text will do; see the attachment), then upload it to HDFS:
wuliufu@wuliufu-PC ~
$ hadoop fs -ls /
Found 2 items
drwxr-xr-x   - wuliufu-pc\wuliufu supergroup          0 2013-04-23 22:44 /home
drwxr-xr-x   - wuliufu-pc\wuliufu supergroup          0 2013-04-24 00:42 /tmp

wuliufu@wuliufu-PC ~
$ hadoop fs -copyFromLocal ./word.txt /tmp/
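You can also double-check the upload from the command line (same paths as above):
$ hadoop fs -ls /tmp
$ hadoop fs -cat /tmp/word.txt | head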
Back in Eclipse, right-click the Hadoop location (the elephant icon) under DFS Locations on the left and refresh; the uploaded file should now be visible.
Next, get ready to run WordCount.
It needs two arguments: the input path and the output directory.
Right-click the class -> Run Configurations... and set the program arguments there.
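Based on the paths used in this walkthrough (the word.txt uploaded above and the output location that appears in the job log below), the two program arguments would be:
hdfs://localhost:9000/tmp/word.txt hdfs://localhost:9000/tmp/out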
Then right-click -> Run As -> Run on Hadoop.
The console prints a log similar to the following:
13/04/24 00:49:45 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
13/04/24 00:49:46 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
13/04/24 00:49:47 INFO input.FileInputFormat: Total input paths to process : 1
13/04/24 00:49:48 INFO mapred.JobClient: Running job: job_local_0001
13/04/24 00:49:48 INFO input.FileInputFormat: Total input paths to process : 1
13/04/24 00:49:48 INFO mapred.MapTask: io.sort.mb = 100
13/04/24 00:49:49 INFO mapred.MapTask: data buffer = 79691776/99614720
13/04/24 00:49:49 INFO mapred.MapTask: record buffer = 262144/327680
13/04/24 00:49:49 INFO mapred.JobClient:  map 0% reduce 0%
13/04/24 00:49:49 INFO mapred.MapTask: Starting flush of map output
13/04/24 00:49:49 INFO mapred.MapTask: Finished spill 0
13/04/24 00:49:49 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
13/04/24 00:49:49 INFO mapred.LocalJobRunner:
13/04/24 00:49:49 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
13/04/24 00:49:49 INFO mapred.LocalJobRunner:
13/04/24 00:49:49 INFO mapred.Merger: Merging 1 sorted segments
13/04/24 00:49:49 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 870 bytes
13/04/24 00:49:49 INFO mapred.LocalJobRunner:
13/04/24 00:49:50 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
13/04/24 00:49:50 INFO mapred.LocalJobRunner:
13/04/24 00:49:50 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now
13/04/24 00:49:50 INFO mapred.JobClient:  map 100% reduce 0%
13/04/24 00:49:50 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://localhost:9000/tmp/out
13/04/24 00:49:50 INFO mapred.LocalJobRunner: reduce > reduce
13/04/24 00:49:50 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
13/04/24 00:49:51 INFO mapred.JobClient:  map 100% reduce 100%
13/04/24 00:49:51 INFO mapred.JobClient: Job complete: job_local_0001
13/04/24 00:49:51 INFO mapred.JobClient: Counters: 14
13/04/24 00:49:51 INFO mapred.JobClient:   FileSystemCounters
13/04/24 00:49:51 INFO mapred.JobClient:     FILE_BYTES_READ=34718
13/04/24 00:49:51 INFO mapred.JobClient:     HDFS_BYTES_READ=1108
13/04/24 00:49:51 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=70010
13/04/24 00:49:51 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=604
13/04/24 00:49:51 INFO mapred.JobClient:   Map-Reduce Framework
13/04/24 00:49:51 INFO mapred.JobClient:     Reduce input groups=66
13/04/24 00:49:51 INFO mapred.JobClient:     Combine output records=66
13/04/24 00:49:51 INFO mapred.JobClient:     Map input records=1
13/04/24 00:49:51 INFO mapred.JobClient:     Reduce shuffle bytes=0
13/04/24 00:49:51 INFO mapred.JobClient:     Reduce output records=66
13/04/24 00:49:51 INFO mapred.JobClient:     Spilled Records=132
13/04/24 00:49:51 INFO mapred.JobClient:     Map output bytes=903
13/04/24 00:49:51 INFO mapred.JobClient:     Combine input records=87
13/04/24 00:49:51 INFO mapred.JobClient:     Map output records=87
13/04/24 00:49:51 INFO mapred.JobClient:     Reduce input records=66
Refresh DFS Locations once more.
Open part-r-00000 under /tmp/out; this file holds the final result.
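The same result can also be read directly from Cygwin (the actual counts depend on your word.txt):
$ hadoop fs -cat /tmp/out/part-r-00000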
That wraps up the basic workflow.
Off to bed now......