fruits.txt
apple
banana
canary melon
grap
lemon
orange
pineapple
strawberry
# Assumes `sc` is an existing SparkContext (for example, the one the pyspark shell provides)
fruits = sc.textFile('/Users/huangluyu/data/fruits.txt')
numFruitsByLength = fruits.map(lambda fruit: (len(fruit), 1)).reduceByKey(lambda x, y: x + y)
print(numFruitsByLength.take(10))
This counts how many fruits share the same name length. Result:
[(6, 2), (12, 1), (4, 1), (10, 1), (5, 2), (9, 1)]
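Internal execution order:
apple -> (5, 1)
banana -> (6, 1)
canary melon -> (12, 1)
grap -> (4, 1)
lemon -> (5, 1) (key 5 already exists, so it is merged into the earlier entry)
orange -> (6, 1) (key 6 already exists, so it is merged into the earlier entry)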
…
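The order of the final output appears to be decided internally: reduceByKey makes no ordering guarantee, so call sortByKey() when a stable order is needed.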
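The merge steps above can be sketched in plain Python, without Spark. This is only an illustration of the per-key merging: the helper name `reduce_by_key` is hypothetical, and the fruit list is typed in by hand rather than read from the file. Real Spark splits the data into partitions and merges within and across them, which is one reason its output order differs from the arrival order.

```python
def reduce_by_key(pairs, func):
    """Hypothetical helper: merge values that share a key, in arrival order."""
    merged = {}
    for key, value in pairs:
        if key in merged:
            # Key seen before: fold the new value into the earlier entry.
            merged[key] = func(merged[key], value)
        else:
            merged[key] = value
    return list(merged.items())

# Contents of fruits.txt, one entry per line (copied by hand).
fruits = ["apple", "banana", "canary melon", "grap",
          "lemon", "orange", "pineapple", "strawberry"]

# Same idea as fruits.map(lambda fruit: (len(fruit), 1))
pairs = [(len(fruit), 1) for fruit in fruits]
num_fruits_by_length = reduce_by_key(pairs, lambda x, y: x + y)

# Sorting gives a stable order, much like sortByKey() would in Spark.
print(sorted(num_fruits_by_length))
# [(4, 1), (5, 2), (6, 2), (9, 1), (10, 1), (12, 1)]
```

The sorted pairs match the Spark result shown earlier; only the order differs.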
For the following examples I changed the dataset:
fruits.txt
apple
apple
apple
banana
canary melon
grap
lemon
orange
lemon
pineapple
strawberry
Split each line into words, pair each word with 1, and count the word frequencies:
from operator import add
lines = sc.textFile('/Users/huangluyu/data/fruits.txt')
counts = lines.flatMap(lambda x: x.split()) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(add)
print(counts.sortByKey().take(20))
[('apple', 3), ('banana', 1), ('canary', 1), ('grap', 1), ('lemon', 2), ('melon', 1), ('orange', 1), ('pineapple', 1), ('strawberry', 1)]
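As a cross-check, the same word count can be reproduced in plain Python with `collections.Counter`. This is a sketch that stands in for the Spark pipeline, not a description of how Spark runs it; the `lines` list is the modified dataset typed in by hand.

```python
from collections import Counter

# Contents of the modified fruits.txt, one entry per line (copied by hand).
lines = ["apple", "apple", "apple", "banana", "canary melon", "grap",
         "lemon", "orange", "lemon", "pineapple", "strawberry"]

# Counterpart of flatMap(lambda x: x.split()): split every line on
# whitespace and flatten the pieces into a single list of words.
words = [word for line in lines for word in line.split()]

# Counterpart of map(lambda x: (x, 1)).reduceByKey(add), then sortByKey().
result = sorted(Counter(words).items())
print(result)
```

Note that "canary melon" becomes two separate words, which is why both 'canary' and 'melon' appear in the output.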
Keep only the words with more than 6 letters:
from operator import add
lines = sc.textFile('/Users/huangluyu/data/fruits.txt')
counts = lines.flatMap(lambda x: x.split()) \
              .filter(lambda x: len(x) > 6) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(add)
print(counts.sortByKey().collect())
[('pineapple', 1), ('strawberry', 1)]
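The filter could be applied either before or after counting. The plain-Python sketch below (not Spark; all variable names are illustrative) shows that both orders give the same pairs. Filtering first, as the snippet above does, simply means fewer records reach the counting step.

```python
from collections import Counter

# Contents of the modified fruits.txt, one entry per line (copied by hand).
lines = ["apple", "apple", "apple", "banana", "canary melon", "grap",
         "lemon", "orange", "lemon", "pineapple", "strawberry"]
words = [word for line in lines for word in line.split()]

# Filter first, then count (mirrors the order used in the Spark snippet).
filter_then_count = sorted(Counter(w for w in words if len(w) > 6).items())

# Count first, then filter: same pairs, but every word is counted.
count_then_filter = sorted(
    (w, n) for w, n in Counter(words).items() if len(w) > 6
)

print(filter_then_count)
# [('pineapple', 1), ('strawberry', 1)]
```

Words with exactly 6 letters (banana, canary, orange) are dropped because the condition is strictly greater than 6.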