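A file is made up of many lines, and those lines form a list. Python provides three functions that are very useful for processing lists: map, reduce and filter. Using them in text processing can make the code more concise and easier to read.
The map and reduce discussed here are Python built-in functions (reduce is a built-in in Python 2; Python 3 moved it to the functools module). They have nothing to do with Hadoop's map and reduce functions, although they serve similar purposes: map usually does preprocessing, and reduce usually does aggregation.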
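Using map, reduce and filter in text processing
Below is the content of a text file (in the file itself the fields are separated by tabs). Column 1 is the ID and column 4 is the weight. The goal is to keep only the rows whose ID is even, double the weight of those rows, and finally return the sum of the resulting weights.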
ID | Key | Value | Weight |
1 | name1 | value1 | 11 |
2 | name2 | value2 | 12 |
3 | name3 | value3 | 13 |
4 | name4 | value4 | 14 |
5 | name5 | value5 | 15 |
6 | name6 | value6 | 16 |
7 | name7 | value7 | 17 |
8 | name8 | value8 | 18 |
9 | name9 | value9 | 19 |
10 | name10 | value10 | 20 |
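To try the example, the table above can be written to a tab-separated file. The following is a minimal sketch, assuming the file is called test_data and that fields are separated by a single tab; the helper names make_sample_lines and write_sample_file are hypothetical and not part of the original article.

```python
import os
import tempfile


def make_sample_lines(count=10):
    """Build rows shaped like the table: ID, key, value, weight (weight = ID + 10)."""
    return ["\t".join([str(i), "name%d" % i, "value%d" % i, str(i + 10)])
            for i in range(1, count + 1)]


def write_sample_file(path, count=10):
    """Write the sample rows to path, one tab-separated row per line."""
    with open(path, "w") as fp:
        for line in make_sample_lines(count):
            fp.write(line + "\n")


if __name__ == "__main__":
    # Written to a temporary directory here; to feed a script that reads
    # "test_data", write the file into the working directory instead.
    sample_path = os.path.join(tempfile.mkdtemp(), "test_data")
    write_sample_file(sample_path)
    with open(sample_path) as fp:
        print(fp.read())
```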
The code using filter, map and reduce is as follows:
```python
# coding=utf8
'''
Created on 2013-12-15
@author: www.crazyant.net
'''
from __future__ import print_function

import pprint
from functools import reduce  # reduce is a built-in only on Python 2


def read_file(file_path):
    '''
    Read the file line by line, split each line on tabs
    and yield the list of fields.
    '''
    with open(file_path, "r") as fp:
        for line in fp:
            # Strip the trailing newline before splitting
            fields = line.rstrip("\n").split("\t")
            yield fields
    # No explicit fp.close() is needed: the with block closes the file


def is_even_lines(fields):
    '''
    Return True if the number in the first column of the row is even.
    '''
    return int(fields[0]) % 2 == 0


def double_weights(fields):
    '''
    Double the value of the weight field (the last column) of a row.
    '''
    fields[-1] = int(fields[-1]) * 2
    return fields


def sum_weights(sum_value, fields):
    '''
    Add the weight of this row to the running total sum_value
    and return the new total.
    '''
    sum_value += int(fields[-1])
    return sum_value


if __name__ == "__main__":
    # Read all rows of the file
    file_lines = [x for x in read_file("test_data")]
    print('Original rows in the file:')
    pprint.pprint(file_lines)
    print('----')

    # Keep only the rows whose ID is even
    even_lines = list(filter(is_even_lines, file_lines))
    print('Rows with an even ID:')
    pprint.pprint(even_lines)
    print('----')

    # Double the weight of every remaining row
    double_weights_lines = list(map(double_weights, even_lines))
    print('Rows after doubling the weight:')
    pprint.pprint(double_weights_lines)
    print('----')

    # Sum all the weights.
    # Each element passed to sum_weights is a list, so the accumulator
    # needs an explicit initial value; 0 is used here.
    sum_val = reduce(sum_weights, double_weights_lines, 0)
    print('Sum of the weights:')
    print(sum_val)
```
Output:
```
Original rows in the file:
[['1', 'name1', 'value1', '11'],
 ['2', 'name2', 'value2', '12'],
 ['3', 'name3', 'value3', '13'],
 ['4', 'name4', 'value4', '14'],
 ['5', 'name5', 'value5', '15'],
 ['6', 'name6', 'value6', '16'],
 ['7', 'name7', 'value7', '17'],
 ['8', 'name8', 'value8', '18'],
 ['9', 'name9', 'value9', '19'],
 ['10', 'name10', 'value10', '20']]
----
Rows with an even ID:
[['2', 'name2', 'value2', '12'],
 ['4', 'name4', 'value4', '14'],
 ['6', 'name6', 'value6', '16'],
 ['8', 'name8', 'value8', '18'],
 ['10', 'name10', 'value10', '20']]
----
Rows after doubling the weight:
[['2', 'name2', 'value2', 24],
 ['4', 'name4', 'value4', 28],
 ['6', 'name6', 'value6', 32],
 ['8', 'name8', 'value8', 36],
 ['10', 'name10', 'value10', 40]]
----
Sum of the weights:
160
```
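The total of 160 can also be checked without a data file. The sketch below assumes in-memory rows shaped like the sample table (the rows list and the lambda expressions are illustrative and not taken from the original script) and chains filter, map and reduce directly. The last lines show an equivalent form using a generator expression and the built-in sum().

```python
from functools import reduce  # needed on Python 3; harmless on Python 2.6+

# Sample rows mirroring the table: ID, key, value, weight (all strings).
rows = [[str(i), "name%d" % i, "value%d" % i, str(i + 10)] for i in range(1, 11)]

# Keep rows with an even ID, double their weights, then add the weights up.
even_rows = filter(lambda fields: int(fields[0]) % 2 == 0, rows)
doubled = map(lambda fields: int(fields[-1]) * 2, even_rows)
total = reduce(lambda acc, weight: acc + weight, doubled, 0)
print(total)  # 160

# The same computation with a generator expression and sum().
total_with_sum = sum(int(f[-1]) * 2 for f in rows if int(f[0]) % 2 == 0)
print(total_with_sum)  # 160
```

The even IDs carry the weights 12, 14, 16, 18 and 20, which add up to 80, so doubling gives 160.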
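Characteristics of map, reduce and filter
- filter: takes a list as its argument and returns a list of the elements that satisfy a condition; similar to where a=1 in SQL
- map: takes a list as its argument, processes each element, and returns a list of the processed elements; similar to select a*2 in SQL
- reduce: takes a list as its argument and aggregates it into a single value (running total, sum, average and so on); similar to select sum(a), avg(b) in SQL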
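The SQL analogies above can be sketched on a plain list of numbers. This is only an illustration: the list a and the variable names are made up for the example, and list() is wrapped around filter and map so that the snippet behaves the same on Python 2 and Python 3.

```python
from functools import reduce

a = [1, 2, 1, 3]

# filter: like "where a = 1"
where_a_is_1 = list(filter(lambda x: x == 1, a))
print(where_a_is_1)      # [1, 1]

# map: like "select a * 2"
select_a_times_2 = list(map(lambda x: x * 2, a))
print(select_a_times_2)  # [2, 4, 2, 6]

# reduce: like "select sum(a)"
select_sum_a = reduce(lambda acc, x: acc + x, a, 0)
print(select_sum_a)      # 7
```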
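Official documentation for these functions
The descriptions below come from the official Python 2 documentation; in Python 3, map and filter return iterators instead of lists.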
map(function, iterable, …)
Apply function to every item of iterable and return a list of the results. If additional iterable arguments are passed, function must take that many arguments and is applied to the items from all iterables in parallel. If one iterable is shorter than another it is assumed to be extended with None items. If function is None, the identity function is assumed; if there are multiple arguments, map() returns a list consisting of tuples containing the corresponding items from all iterables (a kind of transpose operation). The iterable arguments may be a sequence or any iterable object; the result is always a list.
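As a small illustration of passing several iterables to map, the sketch below pairs up two lists; the lists ids and weights are invented for the example. Note that the padding with None described above is Python 2 behavior, while Python 3 stops at the shortest iterable, so the last result shown is the Python 3 one.

```python
ids = [1, 2, 3]
weights = [11, 12, 13]

# The function receives one item from each iterable on every call.
pairs = list(map(lambda i, w: "%d:%d" % (i, w), ids, weights))
print(pairs)  # ['1:11', '2:12', '3:13']

# With iterables of different lengths, Python 3 stops at the shortest one.
short = list(map(lambda i, w: (i, w), [1, 2, 3], [10]))
print(short)  # [(1, 10)] on Python 3
```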
reduce(function, iterable[, initializer])
Apply function of two arguments cumulatively to the items of iterable, from left to right, so as to reduce the iterable to a single value. For example, reduce(lambda x, y: x+y, [1, 2, 3, 4, 5]) calculates ((((1+2)+3)+4)+5). The left argument, x, is the accumulated value and the right argument, y, is the update value from the iterable. If the optional initializer is present, it is placed before the items of the iterable in the calculation, and serves as a default when the iterable is empty. If initializer is not given and iterable contains only one item, the first item is returned. Roughly equivalent to:
```python
def reduce(function, iterable, initializer=None):
    it = iter(iterable)
    if initializer is None:
        try:
            initializer = next(it)
        except StopIteration:
            raise TypeError('reduce() of empty sequence with no initial value')
    accum_value = initializer
    for x in it:
        accum_value = function(accum_value, x)
    return accum_value
```
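The effect of the initializer can be seen in a short sketch. The add function and the variable names are illustrative only; the import from functools is required on Python 3 and also works on Python 2.6 and later.

```python
from functools import reduce


def add(x, y):
    return x + y


without_init = reduce(add, [1, 2, 3, 4, 5])    # ((((1+2)+3)+4)+5)
with_init = reduce(add, [1, 2, 3, 4, 5], 100)  # the initializer comes first
empty_with_init = reduce(add, [], 0)           # the initializer is the default

print(without_init)     # 15
print(with_init)        # 115
print(empty_with_init)  # 0

# An empty iterable without an initializer raises TypeError.
try:
    reduce(add, [])
except TypeError as exc:
    print("TypeError: %s" % exc)
```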
filter(function, iterable)
Construct a list from those elements of iterable for which function returns true. iterable may be either a sequence, a container which supports iteration, or an iterator. If iterable is a string or a tuple, the result also has that type; otherwise it is always a list. If function is None, the identity function is assumed, that is, all elements of iterable that are false are removed.
Note that filter(function, iterable) is equivalent to [item for item in iterable if function(item)] if function is not None and [item for item in iterable if item] if function is None.
See itertools.ifilter() and itertools.ifilterfalse() for iterator versions of this function, including a variation that filters for elements where the function returns false.
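The equivalence with list comprehensions can be checked directly. The sketch below uses made-up sample data and Python 3 names: itertools.filterfalse is the Python 3 counterpart of the itertools.ifilterfalse mentioned above, and list() is applied because Python 3's filter returns an iterator.

```python
from itertools import filterfalse  # Python 2 calls this itertools.ifilterfalse


def is_even(n):
    return n % 2 == 0


numbers = list(range(1, 11))

# filter(function, iterable) versus the equivalent list comprehension
evens = list(filter(is_even, numbers))
evens_lc = [item for item in numbers if is_even(item)]
print(evens == evens_lc)  # True

# filter(None, iterable) drops every element that is false
mixed = [0, 1, "", "a", None, [], [2]]
truthy = list(filter(None, mixed))
print(truthy)  # [1, 'a', [2]]

# filterfalse keeps the elements for which the function returns false
odds = list(filterfalse(is_even, numbers))
print(odds)  # [1, 3, 5, 7, 9]
```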
References:
http://docs.python.org/2/library/functions.html
http://www.oschina.net/code/snippet_111708_16145
Please credit the source when reposting: http://www.crazyant.net/1390.html