This tutorial comes from Lesson 5 of the Dataguru (炼数成金) course. I could not find the code anywhere online, so I typed it out by hand and wrote down the whole process step by step.
If you are also a beginner, I suggest you finish the following preparation first.
1 Hadoop is installed and runs successfully in pseudo-distributed mode (standalone mode is fine too); jps shows NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker.
It seems pseudo-distributed Hadoop only starts successfully when run with sudo/root privileges, though I still don't know why...
2 The Eclipse plugin has been built from your own Hadoop distribution.
It is best to compile the plugin yourself, because even the plugin bundled with a release like Hadoop 0.20.2 may not match the latest Eclipse and will fail to connect.
Even after a successful build, you may still hit the following problems when accessing HDFS from Eclipse:
a Connecting to HDFS fails with "access denied". This is most likely a permissions problem; you can either turn off permission checking in hdfs-site.xml (see the sketch after this list) or raise your own user's privileges.
b "Call to localhost failed": in my case this was simply because Hadoop had not been started, i.e. the jps daemons were not running. I had assumed the Eclipse plugin could start Hadoop by itself, but it cannot.
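A minimal sketch of the permission switch mentioned in point a, assuming a 0.20.x-era Hadoop where the property is named dfs.permissions (newer releases rename it dfs.permissions.enabled); restart the NameNode after changing it:

<!-- hdfs-site.xml: disable HDFS permission checking (development only) -->
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>

This is only reasonable on a local single-user setup; on anything shared, fix the user and directory permissions instead.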
3 Write the program.
The Dataguru instructor uses a very simple example; almost anything other than Hadoop would solve it more simply, but you know why we are doing it on Hadoop anyway...
The example: given a text log, extract the date and the hardware MAC address from each line.
Apr 23 11:49:54 hostapd: wlan0 STA wg:7d:2w:aj:82
Apr 23 11:49:54 hostapd: wlan0 STA 13:7d:2w:zy:82
Apr 23 11:49:52 hostapd: wlan0 STA cc:ur:2w:wj:82
Apr 23 11:49:52 hostapd: wlan0 STA ee:7d:2s:9e:82
Apr 23 11:42:54 hostapd: wlan0 STA 13:7d:2w:9e:ty
Apr 23 11:49:54 hostapd: wlan0 STA 33:7d:2g:he:82
Apr 23 11:49:57 hostapd: wlan0 STA 74:7f:ww:9e:82
and extract it into
23 wg:7d:2w:aj:82
23 13:7d:2w:zy:82
23 cc:ur:2w:wj:82
23 ee:7d:2s:9e:82
23 13:7d:2w:9e:ty
23 33:7d:2g:he:82
23 74:7f:ww:9e:82
All of the extraction can be done in the map phase; no reduce logic is needed. A quick standalone sketch of the parsing follows; the full job is in section 4.
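Before looking at the full job, here is a minimal standalone sketch of the parsing itself, using one of the sample lines above (the class and variable names are mine, not from the course):

public class SplitDemo {
    public static void main(String[] args) {
        String line = "Apr 23 11:49:54 hostapd: wlan0 STA wg:7d:2w:aj:82";
        // split on single spaces: [0]=month, [1]=day, [2]=time,
        // [3]=hostapd:, [4]=wlan0, [5]=STA, [6]=MAC address
        String[] fields = line.split(" ");
        System.out.println(fields[1] + " " + fields[6]); // prints "23 wg:7d:2w:aj:82"
    }
}

The MapReduce job below does exactly this split inside the map() method, line by line.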
4 The code
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class extract extends Configured implements Tool {

    enum Counter {
        LINESKIP; // counts lines that could not be parsed
    }

    // LongWritable/Text are the input key/value types: byte offset of the line, and the line text
    public static class Map extends Mapper<LongWritable, Text, NullWritable, Text> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            try {
                String[] linesplit = line.split(" ");
                String month = linesplit[0];
                String time = linesplit[1]; // actually the day of month in the sample data
                String mac = linesplit[6];
                Text out = new Text(month + " " + time + " " + mac);
                // Normally you write context.write(key, value), which is output as "key\tvalue".
                // If you want everything in one field without the tab separator, build the whole
                // record into `out` and emit it as the value with a NullWritable (empty) key.
                context.write(NullWritable.get(), out);
                // Make sure the types declared on the Mapper, the types you actually emit here,
                // and the job's setOutputKeyClass/setOutputValueClass all agree.
            } catch (ArrayIndexOutOfBoundsException e) {
                // malformed line: count it and skip
                context.getCounter(Counter.LINESKIP).increment(1);
            }
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Job job = new Job(conf, "extract");            // job name
        job.setJarByClass(extract.class);              // class containing the job

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path

        job.setMapperClass(Map.class);                 // use Map as the map task implementation
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(NullWritable.class);     // output key type
        job.setOutputValueClass(Text.class);           // output value type

        job.waitForCompletion(true);
        return job.isSuccessful() ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // run the job through ToolRunner so -D options etc. are picked up from the command line
        int res = ToolRunner.run(new Configuration(), new extract(), args);
        System.exit(res);
    }
}
Before running, check that:
1 Hadoop itself is configured correctly.
2 You can browse HDFS from Eclipse without problems.
3 The file to be processed has been uploaded to HDFS.
4 The input and output directory arguments are set before running (the input directory must already exist, but do not create the output directory; the job will create it and write the output files itself). A minimal command sketch follows this list.
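A rough sketch of the upload and the run arguments, assuming the output path from the log below (the input directory and local file name are my own placeholders; adjust them to your setup):

hadoop fs -mkdir /MACextract/input              # hypothetical input directory
hadoop fs -put hostapd.log /MACextract/input    # hypothetical local log file

In Eclipse, the program arguments (Run As > Run Configurations > Arguments) would then look something like:

hdfs://localhost:9000/MACextract/input hdfs://localhost:9000/MACextract/output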
Log of a successful run:
14/01/17 03:43:45 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/01/17 03:43:45 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
14/01/17 03:43:45 INFO input.FileInputFormat: Total input paths to process : 1
14/01/17 03:43:45 INFO mapred.JobClient: Running job: job_local_0001
14/01/17 03:43:45 INFO input.FileInputFormat: Total input paths to process : 1
14/01/17 03:43:46 INFO mapred.MapTask: io.sort.mb = 100
14/01/17 03:43:46 INFO mapred.MapTask: data buffer = 79691776/99614720
14/01/17 03:43:46 INFO mapred.MapTask: record buffer = 262144/327680
14/01/17 03:43:46 INFO mapred.MapTask: Starting flush of map output
14/01/17 03:43:46 INFO mapred.MapTask: Finished spill 0
14/01/17 03:43:46 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
14/01/17 03:43:46 INFO mapred.LocalJobRunner:
14/01/17 03:43:46 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
14/01/17 03:43:46 INFO mapred.LocalJobRunner:
14/01/17 03:43:46 INFO mapred.Merger: Merging 1 sorted segments
14/01/17 03:43:46 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 170 bytes
14/01/17 03:43:46 INFO mapred.LocalJobRunner:
14/01/17 03:43:46 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
14/01/17 03:43:46 INFO mapred.LocalJobRunner:
14/01/17 03:43:46 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now
14/01/17 03:43:46 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://localhost:9000/MACextract/output
14/01/17 03:43:46 INFO mapred.LocalJobRunner: reduce > reduce
14/01/17 03:43:46 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
14/01/17 03:43:46 INFO mapred.JobClient: map 100% reduce 100%
14/01/17 03:43:46 INFO mapred.JobClient: Job complete: job_local_0001
14/01/17 03:43:46 INFO mapred.JobClient: Counters: 14
14/01/17 03:43:46 INFO mapred.JobClient:   FileSystemCounters
14/01/17 03:43:46 INFO mapred.JobClient:     FILE_BYTES_READ=33470
14/01/17 03:43:46 INFO mapred.JobClient:     HDFS_BYTES_READ=700
14/01/17 03:43:46 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=67578
14/01/17 03:43:46 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=154
14/01/17 03:43:46 INFO mapred.JobClient:   Map-Reduce Framework
14/01/17 03:43:46 INFO mapred.JobClient:     Reduce input groups=1
14/01/17 03:43:46 INFO mapred.JobClient:     Combine output records=0
14/01/17 03:43:46 INFO mapred.JobClient:     Map input records=7
14/01/17 03:43:46 INFO mapred.JobClient:     Reduce shuffle bytes=0
14/01/17 03:43:46 INFO mapred.JobClient:     Reduce output records=7
14/01/17 03:43:46 INFO mapred.JobClient:     Spilled Records=14
14/01/17 03:43:46 INFO mapred.JobClient:     Map output bytes=154
14/01/17 03:43:46 INFO mapred.JobClient:     Combine input records=0
14/01/17 03:43:46 INFO mapred.JobClient:     Map output records=7
14/01/17 03:43:46 INFO mapred.JobClient:     Reduce input records=7
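To check the result, you can list and cat the output directory shown in the log (part-r-00000 is the usual name of the single reducer's output file with the new API; the exact name may differ on other versions):

hadoop fs -ls hdfs://localhost:9000/MACextract/output
hadoop fs -cat hdfs://localhost:9000/MACextract/output/part-r-00000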