map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
The source code shows why the types take this form:
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  public class Context extends MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    // ...
  }
  protected void map(KEYIN key, VALUEIN value, Context context)
      throws IOException, InterruptedException {
    // ...
  }
}
public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  public class Context extends ReduceContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    // ...
  }
  protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
      throws IOException, InterruptedException {
    // ...
  }
}
The context object collects the output key-value pairs. Its write method has this signature:
public void write(KEYOUT key, VALUEOUT value)
    throws IOException, InterruptedException
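If a combiner is used, it has the same form as the reduce function (it is an implementation of Reducer), except that its output types are the intermediate key and value types (K2 and V2), so that its output can feed the reduce function: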
map: (K1, V1) → list(K2, V2)
combiner: (K2, list(V2)) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
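The three signatures above can be mirrored in plain Java to make the type flow concrete. This is only a sketch and uses no Hadoop classes: the names MapFn, ReduceFn, groupAndApply and the word-count data are invented for illustration, and the in-memory grouping merely stands in for the shuffle.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TypeFlowSketch {
  // map: (K1, V1) -> list(K2, V2)
  interface MapFn<K1, V1, K2, V2> {
    List<Map.Entry<K2, V2>> map(K1 key, V1 value);
  }

  // reduce: (K2, list(V2)) -> list(K3, V3); a combiner has the same shape with K3 = K2 and V3 = V2
  interface ReduceFn<K2, V2, K3, V3> {
    List<Map.Entry<K3, V3>> reduce(K2 key, List<V2> values);
  }

  // Groups pairs by key in sorted order, then applies fn to each group (hypothetical stand-in for the shuffle).
  static <K2 extends Comparable<K2>, V2, K3, V3> List<Map.Entry<K3, V3>> groupAndApply(
      List<Map.Entry<K2, V2>> pairs, ReduceFn<K2, V2, K3, V3> fn) {
    TreeMap<K2, List<V2>> groups = new TreeMap<>();
    for (Map.Entry<K2, V2> pair : pairs) {
      groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
    }
    List<Map.Entry<K3, V3>> out = new ArrayList<>();
    for (Map.Entry<K2, List<V2>> group : groups.entrySet()) {
      out.addAll(fn.reduce(group.getKey(), group.getValue()));
    }
    return out;
  }

  static int sum(List<Integer> counts) {
    int total = 0;
    for (int c : counts) {
      total += c;
    }
    return total;
  }

  static List<Map.Entry<String, String>> run(List<String> lines) {
    // K1 = Long, V1 = String, K2 = String, V2 = Integer
    MapFn<Long, String, String, Integer> mapper = (k1, text) -> {
      List<Map.Entry<String, Integer>> out = new ArrayList<>();
      for (String word : text.split(" ")) {
        out.add(new SimpleEntry<>(word, 1));
      }
      return out;
    };
    // combiner: (K2, list(V2)) -> list(K2, V2)
    ReduceFn<String, Integer, String, Integer> combiner = (word, counts) -> {
      List<Map.Entry<String, Integer>> out = new ArrayList<>();
      out.add(new SimpleEntry<>(word, sum(counts)));
      return out;
    };
    // reducer: K3 = String, V3 = String
    ReduceFn<String, Integer, String, String> reducer = (word, counts) -> {
      List<Map.Entry<String, String>> out = new ArrayList<>();
      out.add(new SimpleEntry<>(word, "count=" + sum(counts)));
      return out;
    };
    List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
    long nextKey = 0;
    for (String line : lines) {
      // The combiner runs on the output of each map call before the "shuffle".
      intermediate.addAll(groupAndApply(mapper.map(nextKey++, line), combiner));
    }
    return groupAndApply(intermediate, reducer);
  }

  public static void main(String[] args) {
    System.out.println(run(Arrays.asList("a b a", "b c")));
  }
}
```

Because the combiner's output types equal its input types, it can be applied zero, one, or several times without changing what the reducer has to accept.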
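If a partitioner is used:
partition: (K2, V2) → integer (in practice the partition is usually determined by the key alone and the value is ignored)
Like the combiner, the partitioner operates on the intermediate key and value types (K2 and V2). It returns the partition index, so records with the same key are gathered into the same partition: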
public abstract class Partitioner<KEY, VALUE> {
  public abstract int getPartition(KEY key, VALUE value, int numPartitions);
}
One implementation:
public class HashPartitioner<K, V> extends Partitioner<K, V> {
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
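The bitwise AND with Integer.MAX_VALUE in getPartition clears the sign bit before the modulo is taken. A minimal sketch of why that matters, using plain Java objects instead of Writable keys (the class name HashPartitionSketch and the sample keys are made up for illustration):

```java
class HashPartitionSketch {
  // Same arithmetic as HashPartitioner.getPartition, applied to any Java object.
  static int partition(Object key, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

  public static void main(String[] args) {
    int reducers = 3;
    Object[] keys = {"a", "b", 42, -7, Integer.MIN_VALUE};
    for (Object key : keys) {
      int naive = key.hashCode() % reducers;   // negative whenever hashCode() is negative
      int masked = partition(key, reducers);   // always in the range [0, reducers)
      System.out.println(key + ": naive=" + naive + ", masked=" + masked);
    }
  }
}
```

In Java the % operator keeps the sign of the dividend, so a negative hash code would give a negative and therefore invalid partition index. Math.abs is not a safe substitute, because Math.abs(Integer.MIN_VALUE) is still negative.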
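The input types are determined by the input format. For example, TextInputFormat produces keys of type LongWritable and values of type Text. The other types are set explicitly by calling methods on Job (JobConf in the old API). If they are not set explicitly, the intermediate types default to the (final) output types, which in turn default to LongWritable and Text. So if K2 and K3 are the same type, there is no need to call setMapOutputKeyClass(), because the map output key type falls back to the type set with setOutputKeyClass(). The same holds for V2 and V3.
It may seem strange that the input and output types of each step cannot be inferred from the initial input types. The reason is that Java generics have limitations: type erasure means the type information is not always available at runtime, so Hadoop has to be told explicitly. This also means that a MapReduce job can be configured with incompatible input and output types, because the configuration cannot be checked at compile time. The settings that must be compatible with the MapReduce types are listed in the tables below. Type conflicts are detected only when the job actually runs, so it is wise to run a test job on a small amount of data first to flush out any type incompatibilities.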
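The effect of type erasure can be seen without Hadoop at all. The sketch below is plain Java with made-up names (ErasureSketch, Holder). It shows that two differently parameterized lists share one runtime class, and that a type becomes visible at runtime only when a Class object is passed in explicitly, which is the role played by setters such as setMapOutputKeyClass().

```java
import java.util.ArrayList;
import java.util.List;

class ErasureSketch {
  // A generic class cannot discover its own type argument at runtime,
  // so the caller hands over a Class token explicitly.
  static class Holder<T> {
    private final Class<T> type;

    Holder(Class<T> type) {
      this.type = type;
    }

    Class<T> type() {
      return type;
    }
  }

  // After erasure both lists are plain ArrayList instances.
  static boolean sameRuntimeClass() {
    List<String> strings = new ArrayList<>();
    List<Long> longs = new ArrayList<>();
    return strings.getClass() == longs.getClass();
  }

  public static void main(String[] args) {
    System.out.println(sameRuntimeClass());
    Holder<Long> keyType = new Holder<>(Long.class);
    System.out.println(keyType.type().getName());
  }
}
```

Nothing ties the Class token to the generic parameters of a separately configured mapper or reducer class, which is why a mismatch shows up only when the job runs.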
Table 8-1. Configuration of MapReduce types in the new API
Input types: K1, V1. Intermediate types: K2, V2. Output types: K3, V3.

| Property | Job setter method | K1 | V1 | K2 | V2 | K3 | V3 |
|---|---|---|---|---|---|---|---|
| Properties for configuring types: | | | | | | | |
| mapreduce.job.inputformat.class | setInputFormatClass() | • | • | | | | |
| mapreduce.map.output.key.class | setMapOutputKeyClass() | | | • | | | |
| mapreduce.map.output.value.class | setMapOutputValueClass() | | | | • | | |
| mapreduce.job.output.key.class | setOutputKeyClass() | | | | | • | |
| mapreduce.job.output.value.class | setOutputValueClass() | | | | | | • |
| Properties that must be consistent with the types: | | | | | | | |
| mapreduce.job.map.class | setMapperClass() | • | • | • | • | | |
| mapreduce.job.combine.class | setCombinerClass() | | | • | • | | |
| mapreduce.job.partitioner.class | setPartitionerClass() | | | • | • | | |
| mapreduce.job.output.key.comparator.class | setSortComparatorClass() | | | • | | | |
| mapreduce.job.output.group.comparator.class | setGroupingComparatorClass() | | | • | | | |
| mapreduce.job.reduce.class | setReducerClass() | | | • | • | • | • |
| mapreduce.job.outputformat.class | setOutputFormatClass() | | | | | • | • |
Table 8-2. Configuration of MapReduce types in the old API
Input types: K1, V1. Intermediate types: K2, V2. Output types: K3, V3.

| Property | JobConf setter method | K1 | V1 | K2 | V2 | K3 | V3 |
|---|---|---|---|---|---|---|---|
| Properties for configuring types: | | | | | | | |
| mapred.input.format.class | setInputFormat() | • | • | | | | |
| mapred.mapoutput.key.class | setMapOutputKeyClass() | | | • | | | |
| mapred.mapoutput.value.class | setMapOutputValueClass() | | | | • | | |
| mapred.output.key.class | setOutputKeyClass() | | | | | • | |
| mapred.output.value.class | setOutputValueClass() | | | | | | • |
| Properties that must be consistent with the types: | | | | | | | |
| mapred.mapper.class | setMapperClass() | • | • | • | • | | |
| mapred.map.runner.class | setMapRunnerClass() | • | • | • | • | | |
| mapred.combiner.class | setCombinerClass() | | | • | • | | |
| mapred.partitioner.class | setPartitionerClass() | | | • | • | | |
| mapred.output.key.comparator.class | setOutputKeyComparatorClass() | | | • | | | |
| mapred.output.value.groupfn.class | setOutputValueGroupingComparator() | | | • | | | |
| mapred.reducer.class | setReducerClass() | | | • | • | • | • |
| mapred.output.format.class | setOutputFormat() | | | | | • | • |
A minimal Hadoop MapReduce program:
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MinimalMapReduce extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.printf("Usage: %s [generic options] <input> <output>\n",
          getClass().getSimpleName());
      ToolRunner.printGenericCommandUsage(System.err);
      return -1;
    }

    // Only the jar and the input and output paths are set; everything else is left at its default.
    Job job = new Job(getConf());
    job.setJarByClass(getClass());
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MinimalMapReduce(), args);
    System.exit(exitCode);
  }
}
Run it with:
hadoop MinimalMapReduce "input/ncdc/all/190{1,2}.gz" output
Output:
0→0029029070999991901010106004+64333+023450FM-12+000599999V0202701N01591...
0→0035029070999991902010106004+64333+023450FM-12+000599999V0201401N01181...
135→0029029070999991901010113004+64333+023450FM-12+000599999V0202901N00821...
141→0035029070999991902010113004+64333+023450FM-12+000599999V0201401N01181...
270→0029029070999991901010120004+64333+023450FM-12+000599999V0209991C00001...
282→0035029070999991902010120004+64333+023450FM-12+000599999V0201401N01391...
This minimal MapReduce job, which relies entirely on the defaults, is equivalent to the following program:
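import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MinimalMapReduceWithDefaults extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // JobBuilder is a helper class that is not shown in this section.
    Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (job == null) {
      return -1;
    }

    job.setInputFormatClass(TextInputFormat.class);

    job.setMapperClass(Mapper.class);

    job.setMapOutputKeyClass(LongWritable.class);
    job.setMapOutputValueClass(Text.class);

    job.setPartitionerClass(HashPartitioner.class);

    job.setNumReduceTasks(1);
    job.setReducerClass(Reducer.class);

    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    job.setOutputFormatClass(TextOutputFormat.class);

    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MinimalMapReduceWithDefaults(), args);
    System.exit(exitCode);
  }
}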
So the defaults that MapReduce uses are Mapper.class, HashPartitioner.class, and Reducer.class.
The default mapper simply writes each input key and value to the output unchanged:
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  protected void map(KEYIN key, VALUEIN value, Context context)
      throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }
}
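The default partitioner is HashPartitioner, which hashes the key to choose a partition. Since there is only one reducer by default, there is only one partition here: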
class HashPartitioner<K, V> extends Partitioner<K, V> {
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
The default reducer writes out every value it receives:
public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
      throws IOException, InterruptedException {
    for (VALUEIN value : values) {
      context.write((KEYOUT) key, (VALUEOUT) value);
    }
  }
}
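None of these stages transforms the data: the map passes through each line's byte offset and content, the partitioner hashes the key, and a single reducer writes out everything it receives. That is why the output looks the way it does above.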
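The interleaving of the two input files in that output comes from the sort on the byte-offset key that happens between map and reduce. A rough in-memory imitation of the identity job, in plain Java with invented names and tiny made-up "files" (it assumes one-byte newlines and ignores compression, splits, and Writable types):

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

class IdentityJobSketch {
  // Imitates a line-oriented input format: key = byte offset of the line in its file, value = the line.
  static List<Map.Entry<Long, String>> readRecords(String fileContents) {
    List<Map.Entry<Long, String>> records = new ArrayList<>();
    long offset = 0;
    for (String line : fileContents.split("\n")) {
      records.add(new SimpleEntry<>(offset, line));
      offset += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for the newline
    }
    return records;
  }

  // Identity map, one partition, sort by key, identity reduce.
  static List<String> run(List<String> files) {
    List<Map.Entry<Long, String>> shuffled = new ArrayList<>();
    for (String file : files) {
      shuffled.addAll(readRecords(file)); // identity map: every record is emitted unchanged
    }
    shuffled.sort(Map.Entry.comparingByKey()); // map output is sorted by key before the reduce
    List<String> out = new ArrayList<>();
    for (Map.Entry<Long, String> record : shuffled) {
      out.add(record.getKey() + "\t" + record.getValue()); // identity reduce, tab-separated text
    }
    return out;
  }

  public static void main(String[] args) {
    for (String line : run(Arrays.asList("aaa\nbb\nc", "dd\neeee\nf"))) {
      System.out.println(line);
    }
  }
}
```

Records from both files that start at the same offset end up next to each other, which matches the pattern of keys 0, 0, 135, 141, ... in the real output. In this sketch ties keep file order only because List.sort is stable; a real job makes no such promise about the order of values that share a key.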
Like the Java API, Hadoop Streaming also has a minimal MapReduce job:
% hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input input/ncdc/sample.txt \
  -output output \
  -mapper /bin/cat
This is equivalent to the following command:
% hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input input/ncdc/sample.txt \
  -output output \
  -inputformat org.apache.hadoop.mapred.TextInputFormat \
  -mapper /bin/cat \
  -partitioner org.apache.hadoop.mapred.lib.HashPartitioner \
  -numReduceTasks 1 \
  -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
  -outputformat org.apache.hadoop.mapred.TextOutputFormat \
  -io text
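Keys and values in Streaming
How does a text stream indicate where a record's key ends and its value begins?
A Streaming program can change the separator that is used to divide the key from the value when input is passed to the mapper. The default is a tab character, but if the keys or values themselves contain tabs, it is best to change the separator to some other character.
Similarly, when the map and reduce processes write their output, the separator can be configured. Furthermore, the key need not be just the first field of a record: it can be made up of the first n fields (set with stream.num.map.output.key.fields or stream.num.reduce.output.key.fields), with the remaining fields forming the value. For example, if a record is a,b,c with a comma as the separator and n is 2, the key is a,b and the value is c.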
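The first-n-fields rule can be sketched in a few lines of plain Java. This is not Streaming's own code: splitKeyValue is a hypothetical helper, and treating a line with fewer than n separators as an all-key record with an empty value is an assumption made here for illustration.

```java
class StreamingKeySplitSketch {
  // Splits one line into {key, value}: the key is the first numKeyFields fields,
  // the value is everything after the separator that follows them.
  static String[] splitKeyValue(String line, String separator, int numKeyFields) {
    if (numKeyFields < 1) {
      throw new IllegalArgumentException("numKeyFields must be at least 1");
    }
    int from = 0;
    int pos = -1;
    for (int i = 0; i < numKeyFields; i++) {
      pos = line.indexOf(separator, from);
      if (pos < 0) {
        // Assumption: with too few separators the whole line becomes the key.
        return new String[] {line, ""};
      }
      from = pos + separator.length();
    }
    return new String[] {line.substring(0, pos), line.substring(from)};
  }

  public static void main(String[] args) {
    String[] kv = splitKeyValue("a,b,c", ",", 2);
    System.out.println("key=" + kv[0] + " value=" + kv[1]);
  }
}
```

With n set to 1, the same record would split into the key a and the value b,c.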
Streaming separators:
Table 8-3. Streaming separator properties

| Property name | Type | Default value | Description |
|---|---|---|---|
| stream.map.input.field.separator | String | \t | The separator to use when passing the input key and value strings to the stream map process as a stream of bytes |
| stream.map.output.field.separator | String | \t | The separator to use when splitting the output from the stream map process into key and value strings for the map output |
| stream.num.map.output.key.fields | int | 1 | The number of fields separated by stream.map.output.field.separator to treat as the map output key |
| stream.reduce.input.field.separator | String | \t | The separator to use when passing the input key and value strings to the stream reduce process as a stream of bytes |
| stream.reduce.output.field.separator | String | \t | The separator to use when splitting the output from the stream reduce process into key and value strings for the final reduce output |
| stream.num.reduce.output.key.fields | int | 1 | The number of fields separated by stream.reduce.output.field.separator to treat as the reduce output key |
These separators apply where the map and reduce tasks exchange records with the Streaming processes over standard input and standard output.