Hadoop multi-file output: A deep dive into MultipleOutputFormat and MultipleOutputs (Part 1)

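So far, every MapReduce job we have seen writes a single set of files. In some situations, however, it is more convenient to write several sets of output files, or to split one dataset into several; for example, separating the entries in one log by business line and handing each set to the team that owns it.
Anyone who has used the old API will know that it includes org.apache.hadoop.mapred.lib.MultipleOutputFormat and org.apache.hadoop.mapred.lib.MultipleOutputs. The documentation describes MultipleOutputFormat as follows (the description of MultipleOutputs comes later):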

  MultipleOutputFormat allowing to write the output data to different output files.

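MultipleOutputFormat writes similar records to the same dataset. Before writing each record, it calls generateFileNameForKeyValue to determine the name of the file to write to. Typically we extend MultipleTextOutputFormat and override generateFileNameForKeyValue so that it returns a file name for each output key/value pair. The default implementation of generateFileNameForKeyValue is: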

protected String generateFileNameForKeyValue(K key, V value, String name) {
     return name;
}

It simply returns the default name. We can override this method in our own class to define our own output path, for example:

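public static class PartitionFormat
        extends MultipleTextOutputFormat<NullWritable, Text> {

    @Override
    protected String generateFileNameForKeyValue(
            NullWritable key,
            Text value,
            String name) {
        // Split on commas; the limit -1 keeps trailing empty fields.
        String[] split = value.toString().split(",", -1);
        // The fifth field is the quoted country code, e.g. "VN"; strip the quotes.
        String country = split[4].substring(1, 3);
        // Use the country code as a subdirectory of the job output path.
        return country + "/" + name;
    }
}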
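The file-name rule can be tried outside Hadoop. The following is a minimal plain-Java sketch of the same derivation, with one addition that is not in the original: a guard for lines that have fewer than five fields or a too-short fifth field, which would otherwise throw an exception in substring. The class name CountryPath, the method fileNameFor, and the fallback directory UNKNOWN are hypothetical names chosen for illustration.

```java
public class CountryPath {
    // Mirrors the rule in PartitionFormat: fifth field, quotes stripped, used as a directory.
    // The UNKNOWN fallback is an assumption added here; the original has no such guard.
    static String fileNameFor(String line, String name) {
        String[] fields = line.split(",", -1);
        if (fields.length > 4 && fields[4].length() >= 3) {
            return fields[4].substring(1, 3) + "/" + name;
        }
        return "UNKNOWN/" + name;
    }

    public static void main(String[] args) {
        String line = "3430490,1969,3350,1965,\"VN\",\"\",597185,6";
        // prints VN/part-00001
        System.out.println(fileNameFor(line, "part-00001"));
    }
}
```

Whether malformed lines should be routed to a separate directory or dropped depends on the job; the sketch only shows where such a check would go.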

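This way, records with the same country are written to the file called name inside the same directory, where name is the default leaf file name passed in by the framework. The complete example follows: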

package com.wyp;
 
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
 
import java.io.IOException;
 
/**
  * User: http://www.iteblog.com/
  * Date: 13-11-26
  * Time: 10:02 AM
  */
public class OutputTest {
     public static class MapClass extends MapReduceBase
             implements Mapper<LongWritable, Text, NullWritable, Text> {
 
         @Override
         public void map(LongWritable key, Text value,
                         OutputCollector<NullWritable, Text> output,
                         Reporter reporter) throws IOException {
             output.collect(NullWritable.get(), value);
         }
     }
 
     public static class PartitionFormat
             extends MultipleTextOutputFormat<NullWritable, Text> {
         // Same as above; omitted here
     }
 
     public static void main(String[] args) throws IOException {
         Configuration conf = new Configuration();
         // Parse generic options (-D, -fs, ...) into conf first:
         // JobConf copies conf, so options parsed afterwards would not reach the job.
         String[] remainingArgs =
                 new GenericOptionsParser(conf, args).getRemainingArgs();
         JobConf job = new JobConf(conf, OutputTest.class);

         if (remainingArgs.length != 2) {
             System.err.println("Usage: OutputTest <input path> <output path>");
             System.exit(1);
         }

         Path in = new Path(remainingArgs[0]);
         Path out = new Path(remainingArgs[1]);

         FileInputFormat.setInputPaths(job, in);
         FileOutputFormat.setOutputPath(job, out);

         job.setJobName("Output");
         job.setMapperClass(MapClass.class);

         job.setInputFormat(TextInputFormat.class);
         job.setOutputFormat(PartitionFormat.class);
         job.setOutputKeyClass(NullWritable.class);
         job.setOutputValueClass(Text.class);

         // Map-only job: each map task writes its output directly.
         job.setNumReduceTasks(0);
         JobClient.runJob(job);
     }
 
}

Package the program above into a jar file (how to package it is not covered here) and run it on Hadoop 2.2.0 (the test data can be downloaded here: http://pan.baidu.com/s/1td8xN):

/home/q/hadoop-2.2.0/bin/hadoop jar                       \
       /export1/tmp/wyp/OutputText.jar com.wyp.OutputTest \
       /home/wyp/apat63_99.txt                            \
       /home/wyp/out

After the job finishes, look at the /home/wyp/out directory to see the results:

[wyp@l-datalog5.data.cn1 ~]$ /home/q/hadoop-2.2.0/bin/hadoop fs \
                                      -ls /home/wyp/out
............................. many entries omitted here .............................
drwxr-xr-x   - wyp  supergroup     0 2013-11-26 14:25 /home/wyp/out/VE
drwxr-xr-x   - wyp  supergroup     0 2013-11-26 14:25 /home/wyp/out/VG
drwxr-xr-x   - wyp  supergroup     0 2013-11-26 14:25 /home/wyp/out/VN
drwxr-xr-x   - wyp  supergroup     0 2013-11-26 14:25 /home/wyp/out/VU
drwxr-xr-x   - wyp  supergroup     0 2013-11-26 14:25 /home/wyp/out/YE
............................. many entries omitted here .............................
-rw-r--r--   3 wyp  supergroup     0 2013-11-26 14:25 /home/wyp/out/_SUCCESS

[wyp@l-datalog5.data.cn1 ~]$ /home/q/hadoop-2.2.0/bin/hadoop fs \
                                      -ls /home/wyp/out/VN
Found 2 items
-rw-r--r-- 3 wyp supergroup  148 2013-11-26 14:25 /home/wyp/out/VN/part-00000
-rw-r--r-- 3 wyp supergroup  566 2013-11-26 14:25 /home/wyp/out/VN/part-00001

[wyp@l-datalog5.data.cn1 ~]$ /home/q/hadoop-2.2.0/bin/hadoop fs \
                                      -cat /home/wyp/out/VN/part-00001
3430490,1969,3350,1965,"VN","",597185,6,,73,4,43,,0,,,,,,,,,
3630470,1971,4379,1970,"VN","",,1,,244,5,55,,4,,0.375,,22.5,,,,,
3654325,1972,4477,1969,"VN","",,1,,554,1,14,,0,,,,,,,,,
3665081,1972,4526,1970,"VN","",,1,,373,6,66,,1,,0,,3,,,,,
3772710,1973,5072,1972,"VN","",,1,,4,6,65,,1,,0,,8,,,,,
3821853,1974,5296,1971,"VN","",,1,,33,6,69,,1,,0,,23,,,,,
3824277,1974,5310,1970,"VN","",347650,3,,562,1,14,,2,,0.5,,9,,,,0,0
3918104,1975,5793,1972,"VN","",,1,2,4,6,65,5,0,0.4,,0,,18.2,,,,
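The VN directory holds two part files because the job has no reducers: each map task writes its own part-NNNNN file, and generateFileNameForKeyValue only prepends the country directory to that leaf name. The plain-Java sketch below simulates this layout without Hadoop. It assumes the job ran with two map tasks, which the listing suggests but the original does not state; the class MapOnlyLayout, its methods, and the shortened sample records are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapOnlyLayout {
    // Leaf name a map task would get, e.g. task 1 -> "part-00001".
    static String leafName(int taskId) {
        return String.format("part-%05d", taskId);
    }

    // Same rule as PartitionFormat: country code from the fifth field, used as a directory.
    static String outputPath(String line, int taskId) {
        String country = line.split(",", -1)[4].substring(1, 3);
        return country + "/" + leafName(taskId);
    }

    // splits.get(i) holds the records read by hypothetical map task i.
    static Map<String, List<String>> layout(List<List<String>> splits) {
        Map<String, List<String>> files = new TreeMap<>();
        for (int task = 0; task < splits.size(); task++) {
            for (String line : splits.get(task)) {
                files.computeIfAbsent(outputPath(line, task), k -> new ArrayList<>())
                     .add(line);
            }
        }
        return files;
    }

    public static void main(String[] args) {
        // Shortened, made-up records; only the fifth field matters here.
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("1,1969,1,1965,\"VN\",\"\"", "2,1969,1,1965,\"YE\",\"\""),
                Arrays.asList("3,1971,1,1970,\"VN\",\"\""));
        // Prints VN/part-00000, VN/part-00001 and YE/part-00000, one per line.
        layout(splits).keySet().forEach(System.out::println);
    }
}
```

A country that appears in the input of several map tasks therefore ends up with several part files in its directory, one per task that saw it.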

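As the results above show, all records with the same country are written to the same directory. MultipleOutputFormat makes it easy to take full control of file and directory names. As you may have noticed, the program above splits the data by row; if we need to split by column, MultipleOutputFormat cannot help. This is where MultipleOutputs comes in. MultipleOutputs has existed since very early versions, so let's first look at how the official documentation describes it: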

  MultipleOutputs creates multiple OutputCollectors. Each OutputCollector can have its own OutputFormat and types for the key/value pair. Your MapReduce program will decide what to output to each OutputCollector.
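To make the row-versus-column distinction concrete, here is a plain-Java sketch that does not use Hadoop or the MultipleOutputs API. A row split sends each whole record to exactly one output; a column split sends a different projection of every record to each named output. The class RowVsColumnSplit, the output names chrono and geo, and the choice of columns are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RowVsColumnSplit {
    // Row split: the whole record goes to one output, chosen by its country field.
    static Map<String, List<String>> rowSplit(List<String> lines) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (String line : lines) {
            String[] f = line.split(",", -1);
            String country = f[4].substring(1, 3);
            out.computeIfAbsent(country, k -> new ArrayList<>()).add(line);
        }
        return out;
    }

    // Column split: every record contributes a different projection to each output.
    static Map<String, List<String>> columnSplit(List<String> lines) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        out.put("chrono", new ArrayList<>());
        out.put("geo", new ArrayList<>());
        for (String line : lines) {
            String[] f = line.split(",", -1);
            // Hypothetical projections: the first four fields vs. id plus country and state.
            out.get("chrono").add(f[0] + "," + f[1] + "," + f[2] + "," + f[3]);
            out.get("geo").add(f[0] + "," + f[4] + "," + f[5]);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("3430490,1969,3350,1965,\"VN\",\"\",597185,6");
        System.out.println(rowSplit(lines).keySet());        // prints [VN]
        System.out.println(columnSplit(lines).get("chrono")); // prints [3430490,1969,3350,1965]
    }
}
```

In the row split the number of outputs depends on the data; in the column split the set of outputs is fixed up front and each one may carry its own record layout, which matches the description of per-collector formats and types quoted above.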
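Because this article is fairly long, it has been split into two parts; for the second part, see "Hadoop multi-file output: A deep dive into MultipleOutputFormat and MultipleOutputs (Part 2)" on this blog. Apologies for any inconvenience.
When reposting, please credit the source: reposted from 过往记忆 (http://www.iteblog.com/)
Permalink: Hadoop multi-file output: A deep dive into MultipleOutputFormat and MultipleOutputs (Part 1) (http://www.iteblog.com/archives/842)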
