Maven + Eclipse + Hadoop: My First WordCount

I had always written Hadoop programs in Eclipse on Windows before. This time I decided to do it on Ubuntu, using Maven to create and manage the project.

Maven is a very convenient project-management tool: by declaring dependencies in the POM it automatically pulls in all the library files the project needs. It also lets the Hadoop program run directly from the Eclipse Console instead of having to export a jar and run it on the command line, which makes debugging much easier.

First, some notes on creating the Maven project.

Eclipse ships with an embedded Maven. If you would rather create the project from the command line, I strongly recommend http://www.cnblogs.com/yjmyzz/p/3495762.html, which is written in great detail.

Here I will record the process of creating it directly in Eclipse.

First, the Maven plugin in Eclipse: as shown below, it is configured under Window -> Preferences -> Maven -> Installations.

Eclipse already includes an Embedded version; you can also add a Maven installation of your own by clicking Add on the right.

[Figure 1]

Next, create a new Maven project.

[Figure 2]

For the archetype I chose the default quickstart (maven-archetype-quickstart) from the catalog.


[Figure 3]

  • groupId can be thought of as the base package: comparing the left and right screenshots, a yjj package appears under both src/main/java and src/test/java.
  • artifactId can be thought of as the project name; a package with the same name is also generated under the yjj package.
  • The version can stay at its default. packaging specifies the type of artifact to build (jar here), not its file name; the jar Maven produces is named from the artifactId and version, i.e. maven-mahout-0.0.1-SNAPSHOT.jar.

[Figure 4]

The resulting project structure is shown on the left of the figure above.

Maven projects follow these common conventions:

  • src/main/java holds the source code
  • src/test/java holds the unit-test code
  • target holds the compiled classes and the packaged output files

In the figure there also seems to be an extra src node; you can ignore it, because src/main/java and src/test/java actually live inside that src directory, which becomes obvious if you look at the maven-mahout folder on disk.


So how do you run a main method in this project?

If you simply right-click a class containing a main method and choose Run As -> Java Application, Eclipse may complain that it cannot find the class.

So what should you do instead?

Right-click the project name, then choose:

[Figure 5]

The following dialog pops up:

[Figure 6]

At the bottom, select the class you want to run; here I pick the App class that the project generated automatically. Once that is done, you can launch this class again later straight from the Run button's drop-down.

[Figure 7]

The console shows the output successfully!

[Figure 8]
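For reference, the App class generated by the quickstart archetype looks roughly like this; the "Hello World!" shown in the console above comes from its main method:

package yjj.maven_mahout;

/**
 * Hello world! (auto-generated by maven-archetype-quickstart)
 */
public class App {
    public static void main(String[] args) {
        System.out.println("Hello World!");
    }
}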

Writing the WordCount program

First, add the Hadoop dependency in pom.xml.

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>yjj</groupId>
  <artifactId>maven-mahout</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>

  <name>maven-mahout</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <!-- the original tag name was lost when the post was captured; 0.9 is presumably a Mahout version property -->
    <mahout.version>0.9</mahout.version>
  </properties>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>1.2.1</version>
    </dependency>
  </dependencies>
</project>

This adds hadoop-core, using version 1.2.1.
Then run the Maven install goal (network access is needed); it downloads the required libraries automatically.

[Figure 9]

The Console shows BUILD SUCCESS. Refresh the project, and hadoop-core-1.2.1.jar now appears under Maven Dependencies!

[Figure 10]

After adding this, a small error marker (×) appeared on the project. Following the fix described at http://www.cnblogs.com/yjmyzz/p/3495762.html:

I first went into the project directory on the command line and ran mvn clean compile, which succeeded, and then did the following:

Right-click the project -> Maven -> Update Project

and the error marker was gone!

[Figure 11]


WordCount.java

package yjj.maven_mahout;


import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Mapper: tokenize each input line and emit (word, 1) for every token.
    public static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sum up the counts for each word.
    // Note: with the new (org.apache.hadoop.mapreduce) API the values parameter must be
    // Iterable, not Iterator, otherwise this method does not override Reducer.reduce and
    // the job silently falls back to the identity reducer.
    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {

        String inputStr = "hdfs://127.0.0.1:9000/user/root/word.txt";
        String outputStr = "hdfs://127.0.0.1:9000/user/root/result";

        Configuration conf = new Configuration();
        // Add the cluster configuration files before constructing the Job,
        // because Job copies the Configuration when it is created; adding them
        // afterwards has no effect on the job.
        conf.addResource("classpath:/hadoop/core-site.xml");
        conf.addResource("classpath:/hadoop/hdfs-site.xml");
        conf.addResource("classpath:/hadoop/mapred-site.xml");

        Job job = new Job(conf, "JobName");
        job.setJarByClass(WordCount.class);
        job.setNumReduceTasks(4);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(inputStr));
        FileOutputFormat.setOutputPath(job, new Path(outputStr));
        job.waitForCompletion(true);
    }

}

A couple of things in main deserve a note: the input and output paths, and the three Hadoop configuration files. The addResource calls must come before the Job is constructed, because Job copies the Configuration at construction time.
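One practical note for re-running the job from Eclipse: FileOutputFormat refuses to start if the output directory already exists. A small sketch of my own (not part of the original code; it additionally needs imports of java.net.URI and org.apache.hadoop.fs.FileSystem) that clears the directory right after the Configuration is set up:

// My addition, not in the original post: remove a previous run's output directory,
// otherwise the job fails with "output directory ... already exists".
FileSystem fs = FileSystem.get(URI.create(outputStr), conf);
if (fs.exists(new Path(outputStr))) {
    fs.delete(new Path(outputStr), true);   // true = delete recursively
}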

First start Hadoop, then upload the input file into HDFS.

Annoyingly, every time I have to run source /etc/profile to reload the environment before the hadoop command can be used directly.

I brought Hadoop up with start-all.sh and got a "Connection closed" error. What is that about?

It turns out the start scripts log in to the local machine over ssh, so ssh access to localhost has to work!

After sorting that out, starting Hadoop again is fine.

Check jps: everything that should be running is up! Well, not quite: where did the NameNode go?!

[Figure 12]

[Figure 13]


jps shows no NameNode

Someone at http://bbs.csdn.net/topics/390428450 pointed out:

By default the NameNode keeps its temporary files under /tmp, but /tmp is cleared on shutdown, so when the master is started again the files no longer match and the NameNode fails to start.

Specify the temporary-file location in core-site.xml and then reformat HDFS; that fixes it for good:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/usr/grid/hadoop1.7.0_17/hadoop_${user.name}</value>
</property>

Any path in value is fine as long as it is not under /tmp.

[Figure 14]

After the change, reformat the NameNode:

root@ubuntu:/usr/local/hadoop-1.2.1# hadoop namenode -format
15/07/20 04:09:26 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ubuntu/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.2.1
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1503152; compiled by 'mattf' on Mon Jul 22 15:23:09 PDT 2013
STARTUP_MSG:   java = 1.8.0_40
************************************************************/
15/07/20 04:09:26 INFO util.GSet: Computing capacity for map BlocksMap
15/07/20 04:09:26 INFO util.GSet: VM type       = 32-bit
15/07/20 04:09:26 INFO util.GSet: 2.0% max memory = 1013645312
15/07/20 04:09:26 INFO util.GSet: capacity      = 2^22 = 4194304 entries
15/07/20 04:09:26 INFO util.GSet: recommended=4194304, actual=4194304
15/07/20 04:09:27 INFO namenode.FSNamesystem: fsOwner=root
15/07/20 04:09:27 INFO namenode.FSNamesystem: supergroup=supergroup
15/07/20 04:09:27 INFO namenode.FSNamesystem: isPermissionEnabled=false
15/07/20 04:09:27 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
15/07/20 04:09:27 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
15/07/20 04:09:27 INFO namenode.FSEditLog: dfs.namenode.edits.toleration.length = 0
15/07/20 04:09:27 INFO namenode.NameNode: Caching file names occuring more than 10 times 
15/07/20 04:09:27 INFO common.Storage: Image file /usr/local/hadoop-1.2.1/hadoop_root/dfs/name/current/fsimage of size 110 bytes saved in 0 seconds.
15/07/20 04:09:28 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/usr/local/hadoop-1.2.1/hadoop_root/dfs/name/current/edits
15/07/20 04:09:28 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/usr/local/hadoop-1.2.1/hadoop_root/dfs/name/current/edits
15/07/20 04:09:28 INFO common.Storage: Storage directory /usr/local/hadoop-1.2.1/hadoop_root/dfs/name has been successfully formatted.
15/07/20 04:09:28 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/

Now jps shows the NameNode!
root@ubuntu:/usr/local/hadoop-1.2.1# jps
7027 Jps
4150 DataNode
3710 TaskTracker
3496 JobTracker
6105 NameNode
3426 SecondaryNameNode
root@ubuntu:/usr/local/hadoop-1.2.1# chmod -R 777 tmp
root@ubuntu:/usr/local/hadoop-1.2.1# rm -R tmp


Then upload the file.

Annoyingly, another problem came up:

root@ubuntu:/usr/local/hadoop-1.2.1# hadoop dfs -copyFromLocal /home/user/opencv.txt opencv-word.txt
15/07/20 04:11:04 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/root/opencv-word.txt could only be replicated to 0 nodes, instead of 1

15/07/20 04:11:04 WARN hdfs.DFSClient: Error Recovery for null bad datanode[0] nodes == null
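This error usually means the NameNode cannot see any live DataNodes. Before digging into the logs, here is a quick diagnostic sketch of my own (not from the post; it uses the Hadoop 1.x HDFS client classes as I understand them, and the class name DataNodeCheck is made up for illustration):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class DataNodeCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://127.0.0.1:9000/"), new Configuration());
        if (fs instanceof DistributedFileSystem) {
            // Ask the NameNode which DataNodes are currently registered with it;
            // "replicated to 0 nodes" usually means this list is empty.
            DatanodeInfo[] nodes = ((DistributedFileSystem) fs).getDataNodeStats();
            System.out.println("DataNodes seen by the NameNode: " + nodes.length);
            for (DatanodeInfo node : nodes) {
                System.out.println("  " + node.getName());
            }
        }
    }
}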


Let's check the logs to see what went wrong this time.

There is a logs directory under the Hadoop installation directory; since the error above points at the DataNode, let's look at hadoop-root-datanode-ubuntu.log.

[Figure 15]


************************************************************/
2015-07-20 02:12:59,116 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2015-07-20 02:12:59,138 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source MetricsSystem,sub=Stats registered.
2015-07-20 02:12:59,144 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2015-07-20 02:12:59,144 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics system started
2015-07-20 02:12:59,463 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi registered.
2015-07-20 02:12:59,644 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists!
2015-07-20 02:12:59,771 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2015-07-20 02:13:01,683 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2015-07-20 02:13:02,685 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)


Hadoop error: INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1

Cause: by default Hadoop puts some temporary files under /tmp; after a system restart the contents of /tmp are cleared, hence the error.
Fix: add the following to conf/core-site.xml (conf/hadoop-site.xml for version 0.19.2):
  
<property>
  <name>hadoop.tmp.dir</name>
  <value>/var/log/hadoop/tmp</value>
  <description>A base for other temporary directories</description>
</property>

But that is exactly what I just did! Why is it failing again?

Fine, restart and reformat once more.

[Figure 16]

[Figure 17]


There is also a permissions problem.

On the command line I can operate as the root user, but that is not the case when running from Eclipse, where the following error appears:

org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE, inode="user":hadoop:supergroup:rwxr-xr-x

So, on the master node edit hdfs-site.xml and add the following to turn permission checking off:

<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>
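Switching permissions off entirely is acceptable for a local test cluster. An alternative I did not take here would be to open up just the job's working directory from a user that owns it, roughly like this (my own sketch, not from the post; it also needs java.net.URI and org.apache.hadoop.fs.permission.FsPermission):

// Sketch only (my addition): as the HDFS superuser / directory owner, grant everyone
// write access to /user/root instead of disabling dfs.permissions globally.
FileSystem fs = FileSystem.get(URI.create("hdfs://127.0.0.1:9000/"), new Configuration());
fs.setPermission(new Path("/user/root"), new FsPermission((short) 0777));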

The file-path question

As shown below, how exactly should these two paths be written?

    	
String inputStr = "hdfs://127.0.0.1:9000/user/root/word.txt";
String outputStr = "hdfs://127.0.0.1:9000/user/root/result";

First, look at the three configuration files; mapred-site.xml and core-site.xml both involve addresses.

I wrote them as follows; the HDFS addresses in the configuration files all use localhost.

[Figure 18]

[Figure 19]

Then run ifconfig and have a look:

[Figure 20]

It shows two addresses, the machine's LAN IP and the local loopback address, and honestly I was confused about which one to put in the URI.

Experimenting showed that it should be the second one, so the prefix is

hdfs://127.0.0.1:9000/

which is the root of HDFS.

After that, just append the file's path inside HDFS to this root.
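To double-check the URI, here is a small sketch of my own (not from the post; the class name HdfsPathCheck is just for illustration) that lists /user/root through the FileSystem API. If it prints word.txt, the hard-coded paths in main will resolve:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPathCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same address as in WordCount.main; change it if your fs.default.name differs.
        FileSystem fs = FileSystem.get(URI.create("hdfs://127.0.0.1:9000/"), conf);
        for (FileStatus status : fs.listStatus(new Path("/user/root"))) {
            System.out.println(status.getPath());
        }
    }
}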

Results

Run it in Eclipse and the Console prints the corresponding output.

INFO: Total input paths to process : 1
Jul 20, 2015 4:44:04 AM org.apache.hadoop.io.compress.snappy.LoadSnappy
WARNING: Snappy native library not loaded
Jul 20, 2015 4:44:05 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local1284391682_0001
Jul 20, 2015 4:44:05 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
INFO: Waiting for map tasks
Jul 20, 2015 4:44:05 AM org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable run
INFO: Starting task: attempt_local1284391682_0001_m_000000_0
Jul 20, 2015 4:44:06 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 0% reduce 0%

Jul 20, 2015 4:44:12 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
INFO: Map task executor complete.
Jul 20, 2015 4:44:12 AM org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@5289d2
Jul 20, 2015 4:44:12 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
Jul 20, 2015 4:44:12 AM org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Merging 1 sorted segments
Jul 20, 2015 4:44:13 AM org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Down to the last merge-pass, with 1 segments left of total size: 75936 bytes
Jul 20, 2015 4:44:13 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
Jul 20, 2015 4:44:13 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 100% reduce 0%
Jul 20, 2015 4:44:16 AM org.apache.hadoop.mapred.Counters log
INFO:     Reduce shuffle bytes=0
Jul 20, 2015 4:44:16 AM org.apache.hadoop.mapred.Counters log
INFO:     Physical memory (bytes) snapshot=0
Jul 20, 2015 4:44:16 AM org.apache.hadoop.mapred.Counters log
INFO:     Reduce input groups=2522
Jul 20, 2015 4:44:16 AM org.apache.hadoop.mapred.Counters log
INFO:     Combine output records=0
Jul 20, 2015 4:44:16 AM org.apache.hadoop.mapred.Counters log
INFO:     Reduce output records=5987
Jul 20, 2015 4:44:16 AM org.apache.hadoop.mapred.Counters log
INFO:     Map output records=5987
Jul 20, 2015 4:44:16 AM org.apache.hadoop.mapred.Counters log
INFO:     Combine input records=0
Jul 20, 2015 4:44:16 AM org.apache.hadoop.mapred.Counters log
INFO:     CPU time spent (ms)=0
Jul 20, 2015 4:44:16 AM org.apache.hadoop.mapred.Counters log
INFO:     Total committed heap usage (bytes)=321527808

Let's look inside HDFS to see the results.

[Figure 21]

[Figure 22]
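If you prefer not to browse HDFS through the web UI, the reducer output can also be read back from a small client program, roughly like this (my own sketch, not from the post; PrintWordCountResult is a made-up name, and the result path matches outputStr above):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PrintWordCountResult {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://127.0.0.1:9000/"), new Configuration());
        // With setNumReduceTasks(4) there should be part-r-00000 .. part-r-00003.
        for (FileStatus status : fs.listStatus(new Path("/user/root/result"))) {
            if (!status.getPath().getName().startsWith("part-")) {
                continue;   // skip _SUCCESS and _logs
            }
            BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(status.getPath())));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);   // each line is "word<TAB>count"
            }
            reader.close();
        }
    }
}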

