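So far, every MapReduce job we have seen writes a single set of output files. In some situations, however, it is more convenient to write several sets of files, or to split one data set into several. A typical case is separating the entries in one log by business line and handing each part to the team that owns it.
Anyone who has used the old API will know that it provides org.apache.hadoop.mapred.lib.MultipleOutputFormat and org.apache.hadoop.mapred.lib.MultipleOutputs. The documentation describes MultipleOutputFormat as follows (MultipleOutputs is covered later):
MultipleOutputFormat can write similar records to the same data set. Before writing each record, MultipleOutputFormat calls the generateFileNameForKeyValue method to determine the name of the file to write to. Usually we extend the MultipleTextOutputFormat class and override generateFileNameForKeyValue to return the file name for each output key/value pair. The default implementation of generateFileNameForKeyValue is: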

    protected String generateFileNameForKeyValue(K key, V value, String name) {
        return name;
    }

It simply returns the default name. We can override this method in our own class to define our own output path, for example:

    public static class PartitionFormat
            extends MultipleTextOutputFormat<NullWritable, Text> {

        @Override
        protected String generateFileNameForKeyValue(
                NullWritable key,
                Text value,
                String name) {
            // A limit of -1 keeps trailing empty fields in the split result.
            String[] split = value.toString().split(",", -1);
            // Field 4 is the quoted country code, e.g. "VN";
            // substring(1, 3) strips the surrounding quotes.
            String country = split[4].substring(1, 3);
            // Write the record to <country>/<default name>.
            return country + "/" + name;
        }
    }

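To see which path the override produces without running a job, the same string handling can be tried on its own. The following is only a minimal sketch: the class name CountryPathSketch and the method fileNameFor are made up for illustration and are not part of Hadoop, the sample record is one of the lines from the test data shown later in this post, and "part-00001" stands in for the default name that Hadoop would pass in.

```java
public class CountryPathSketch {

    // Hypothetical helper that mirrors the body of generateFileNameForKeyValue,
    // with no Hadoop dependency so that it can be run standalone.
    public static String fileNameFor(String record, String name) {
        // A limit of -1 keeps trailing empty fields in the split result.
        String[] split = record.split(",", -1);
        // Field 4 holds the quoted country code, e.g. "VN";
        // substring(1, 3) drops the surrounding quotes.
        String country = split[4].substring(1, 3);
        return country + "/" + name;
    }

    public static void main(String[] args) {
        String record =
                "3430490,1969,3350,1965,\"VN\",\"\",597185,6,,73,4,43,,0,,,,,,,,,";
        // "part-00001" is an assumed default name for this illustration.
        System.out.println(fileNameFor(record, "part-00001")); // prints VN/part-00001
    }
}
```

Note that this logic assumes every record has at least five comma-separated fields and that the fifth is a quoted two-letter code. A record with fewer fields would throw an ArrayIndexOutOfBoundsException, so real input may need a guard.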
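This way, records with the same country are written to the file called name inside the same directory, with one directory per country. The complete example is: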

    package com.wyp;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;

    import java.io.IOException;

    /**
     * User: http://www.iteblog.com/
     * Date: 13-11-26
     * Time: 10:02 AM
     */
    public class OutputTest {

        // Identity-style mapper: drops the input offset key and passes
        // each line through unchanged.
        public static class MapClass extends MapReduceBase
                implements Mapper<LongWritable, Text, NullWritable, Text> {

            @Override
            public void map(LongWritable key, Text value,
                            OutputCollector<NullWritable, Text> output,
                            Reporter reporter) throws IOException {
                output.collect(NullWritable.get(), value);
            }
        }

        public static class PartitionFormat
                extends MultipleTextOutputFormat<NullWritable, Text> {
            // Same as above, omitted here
        }

        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            // Parse the generic options before creating the JobConf: JobConf
            // copies conf, so options parsed afterwards would not reach the job.
            String[] remainingArgs =
                    new GenericOptionsParser(conf, args).getRemainingArgs();

            if (remainingArgs.length != 2) {
                System.err.println("Error!");
                System.exit(1);
            }

            JobConf job = new JobConf(conf, OutputTest.class);

            Path in = new Path(remainingArgs[0]);
            Path out = new Path(remainingArgs[1]);
            FileInputFormat.setInputPaths(job, in);
            FileOutputFormat.setOutputPath(job, out);

            job.setJobName("Output");
            job.setMapperClass(MapClass.class);
            job.setInputFormat(TextInputFormat.class);
            job.setOutputFormat(PartitionFormat.class);
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);

            // Map-only job: the mapper output goes straight to the output format.
            job.setNumReduceTasks(0);
            JobClient.runJob(job);
        }
    }

Package the program above into a jar file (how to do that is not covered here) and run it on Hadoop 2.2.0 (the test data can be downloaded from http://pan.baidu.com/s/1td8xN):

    /home/q/hadoop-2.2.0/bin/hadoop jar \
        /export1/tmp/wyp/OutputText.jar com.wyp.OutputTest \
        /home/wyp/apat63_99.txt \
        /home/wyp/out

Once the program has finished, look in the /home/wyp/out directory to see the results:

    [wyp@l-datalog5.data.cn1 ~]$ /home/q/hadoop-2.2.0/bin/hadoop fs \
        -ls /home/wyp/out
    ............................. (many entries omitted) .............................
    drwxr-xr-x   - wyp supergroup          0 2013-11-26 14:25 /home/wyp/out/VE
    drwxr-xr-x   - wyp supergroup          0 2013-11-26 14:25 /home/wyp/out/VG
    drwxr-xr-x   - wyp supergroup          0 2013-11-26 14:25 /home/wyp/out/VN
    drwxr-xr-x   - wyp supergroup          0 2013-11-26 14:25 /home/wyp/out/VU
    drwxr-xr-x   - wyp supergroup          0 2013-11-26 14:25 /home/wyp/out/YE
    ............................. (many entries omitted) .............................
    -rw-r--r--   3 wyp supergroup          0 2013-11-26 14:25 /home/wyp/out/_SUCCESS

    [wyp@l-datalog5.data.cn1 ~]$ /home/q/hadoop-2.2.0/bin/hadoop fs \
        -ls /home/wyp/out/VN
    Found 2 items
    -rw-r--r--   3 wyp supergroup        148 2013-11-26 14:25 /home/wyp/out/VN/part-00000
    -rw-r--r--   3 wyp supergroup        566 2013-11-26 14:25 /home/wyp/out/VN/part-00001

    [wyp@l-datalog5.data.cn1 ~]$ /home/q/hadoop-2.2.0/bin/hadoop fs \
        -cat /home/wyp/out/VN/part-00001
    3430490,1969,3350,1965,"VN","",597185,6,,73,4,43,,0,,,,,,,,,
    3630470,1971,4379,1970,"VN","",,1,,244,5,55,,4,,0.375,,22.5,,,,,
    3654325,1972,4477,1969,"VN","",,1,,554,1,14,,0,,,,,,,,,
    3665081,1972,4526,1970,"VN","",,1,,373,6,66,,1,,0,,3,,,,,
    3772710,1973,5072,1972,"VN","",,1,,4,6,65,,1,,0,,8,,,,,
    3821853,1974,5296,1971,"VN","",,1,,33,6,69,,1,,0,,23,,,,,
    3824277,1974,5310,1970,"VN","",347650,3,,562,1,14,,2,,0.5,,9,,,,0,0
    3918104,1975,5793,1972,"VN","",,1,2,4,6,65,5,0,0.4,,0,,18.2,,,,

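As the results show, all records with the same country end up in the same directory. MultipleOutputFormat is convenient when you need full control over file and directory names. As you have also seen, the program above splits the data by row; if we want to split it by column instead, MultipleOutputFormat cannot help. This is where MultipleOutputs comes in. MultipleOutputs has existed since very early versions, so let's first look at how the official documentation describes it: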