Common RDD Operations in PySpark

Setup:

import pyspark
from pyspark import SparkContext
from pyspark import SparkConf

conf = SparkConf().setAppName("lg").setMaster('local[4]')    # local[4] means run locally with 4 cores
sc=SparkContext.getOrCreate(conf)

1. parallelize and collect

The parallelize function converts a Python list into an RDD; collect() returns the RDD's contents as a Python list.

words = sc.parallelize(
    ["scala",
     "java",
     "spark",
     "hadoop",
     "spark",
     "akka",
     "spark vs hadoop",
     "pyspark",
     "pyspark and spark"
     ])
print(words)
print(words.collect())
ParallelCollectionRDD[139] at parallelize at PythonRDD.scala:184
['scala', 'java', 'spark', 'hadoop', 'spark', 'akka', 'spark vs hadoop', 'pyspark', 'pyspark and spark']
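
As a minimal sketch (using the same sc as above), parallelize also accepts an optional second argument, numSlices, which controls how many partitions the resulting RDD gets:

nums10 = sc.parallelize(range(10), 3)   # ask for 3 partitions explicitly
print(nums10.getNumPartitions())        # 3
print(nums10.collect())                 # [0, 1, 2, ..., 9]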

2. Two ways to create an RDD: parallelize and textFile

The first way is the parallelize method shown above; the second is to read a file directly with textFile. Note that if you pass a directory path, all files in that directory are read (an error is raised if the directory itself contains subdirectories).

path = 'G:\\pyspark\\rddText.txt'  
rdd = sc.textFile(path)
rdd.collect()
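
As a small sketch reusing the path above, textFile also takes an optional minPartitions argument that sets a lower bound on the number of partitions used when reading the file:

rdd = sc.textFile(path, minPartitions=4)   # request at least 4 partitions
print(rdd.getNumPartitions())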

3. Setting and inspecting partitions: repartition, defaultParallelism, and glom

The global default number of partitions can be set via SparkContext.defaultParallelism; the number of partitions of a specific RDD can be changed with repartition().

Calling glom() before collect() groups the collected results by partition.

SparkContext.defaultParallelism=5
print(sc.parallelize([0, 2, 3, 4, 6]).glom().collect())
SparkContext.defaultParallelism=8
print(sc.parallelize([0, 2, 3, 4, 6]).glom().collect())
rdd = sc.parallelize([0, 2, 3, 4, 6])
rdd.repartition(2).glom().collect()
[[0], [2], [3], [4], [6]]
[[], [0], [], [2], [3], [], [4], [6]]
[[2, 4], [0, 3, 6]]

Note: changing SparkContext.defaultParallelism only affects RDDs created afterwards; RDDs that already exist keep their original partitioning.
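
A minimal sketch illustrating this (assuming the same sc; the exact counts depend on the defaults set above): an RDD created before the change keeps its partition count, while one created afterwards picks up the new default.

rdd_before = sc.parallelize([1, 2, 3])
SparkContext.defaultParallelism = 3
rdd_after = sc.parallelize([1, 2, 3])
print(rdd_before.getNumPartitions())   # whatever the default was when rdd_before was created
print(rdd_after.getNumPartitions())    # 3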


4. count, countByValue, and countByKey

count() returns the number of elements in the RDD as an int. countByValue() returns a dict mapping each distinct value in the RDD to its number of occurrences. countByKey() is intended for key-value (pair) RDDs; when the elements are plain strings, as in this example, the first character of each string is treated as its key, so it counts elements by their first letter (it raises an error if the elements are ints).

counts = words.count()
print("Number of elements in RDD -> %i" % counts)
print("Number of every elements in RDD -> %s" % words.countByKey())
print("Number of every elements in RDD -> %s" % words.countByValue())
Number of elements in RDD -> 9
Count of each key -> defaultdict(<class 'int'>, {'s': 4, 'j': 1, 'h': 1, 'a': 1, 'p': 2})
Count of each value -> defaultdict(<class 'int'>, {'scala': 1, 'java': 1, 'spark': 2, 'hadoop': 1, 'akka': 1, 'spark vs hadoop': 1, 'pyspark': 1, 'pyspark and spark': 1})
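
For comparison, a small sketch of countByKey on an actual key-value RDD (the pairs below are made up for illustration):

pairs = sc.parallelize([("spark", 1), ("hadoop", 1), ("spark", 2)])
print(pairs.countByKey())   # counts per key, e.g. defaultdict(<class 'int'>, {'spark': 2, 'hadoop': 1})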

5. filter

filter(func) returns a new RDD containing only the elements for which func returns True. The filtering is applied element by element within each partition, so the partition structure is preserved (visible via glom() below).

words_filter = words.filter(lambda x: 'spark' in x)
filtered = words_filter.glom().collect()
print("Fitered RDD -> %s" % (filtered))
Fitered RDD -> [[], ['spark'], ['spark'], ['spark vs hadoop', 'pyspark', 'pyspark and spark']]

6. map and flatMap

map() returns a new RDD by applying a function to every element of the RDD. The difference between flatMap() and map(): flatMap() flattens the iterable returned for each element, so the result is an RDD of the individual items rather than an RDD of lists (or tuples).

words_map = words.map(lambda x: (x, len(x)))
mapping = words_map.collect()
print("Key value pair -> %s" % (mapping))
words.flatMap(lambda x: (x, len(x))).collect()
Key value pair -> [('scala', 5), ('java', 4), ('spark', 5), ('hadoop', 6), ('spark', 5), ('akka', 4), ('spark vs hadoop', 15), ('pyspark', 7), ('pyspark and spark', 17)]

['scala', 5, 'java', 4, 'spark', 5, 'hadoop', 6, 'spark', 5, 'akka', 4, 'spark vs hadoop', 15, 'pyspark', 7, 'pyspark and spark', 17]
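
A more typical use of flatMap() is splitting each element into words, which makes the flattening easy to see (a minimal sketch on the words RDD defined above):

print(words.map(lambda x: x.split(" ")).collect())      # an RDD of lists, one list per element
print(words.flatMap(lambda x: x.split(" ")).collect())  # a flat RDD of the individual words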

7. reduce and fold

reduce(func) applies the specified commutative and associative binary operation to the elements of the RDD and returns the result.

Given a list of integers [x1, x2, x3] and an add operation, reduce works as follows: start with sum = x1, then add sum and x2 to get sum = x1 + x2, and finally add sum and x3 to get sum = x1 + x2 + x3.

The difference between fold and reduce: fold takes one extra argument, a zero value. In the example below, nums.fold(1, add) starts each partition from the zero value 1 and uses 1 again as the initial value when the per-partition results are merged, which is why its result is larger than that of reduce.

def add(a,b):
    c = a + b
    print(str(a) + ' + ' + str(b) + ' = ' + str(c))
    return c
nums = sc.parallelize([1, 2, 3, 4, 5])
adding = nums.reduce(add)
print("Adding all the elements -> %i" % (adding))
adding2 = nums.fold(1,add)   # the zero value 1 is used as the initial accumulator in each partition and again when merging the partition results
print("Adding all the elements -> %i" % (adding2))
1 + 2 = 3
3 + 3 = 6
6 + 4 = 10
10 + 5 = 15
Adding all the elements -> 15
1 + 1 = 2
2 + 2 = 4
4 + 1 = 5
5 + 3 = 8
8 + 4 = 12
12 + 1 = 13
13 + 5 = 18
18 + 6 = 24
Adding all the elements -> 24
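
A small sketch of the arithmetic behind the fold result (assuming the same nums RDD): the zero value is added once per partition and once more in the final merge, which is why the total exceeds 15.

n = nums.getNumPartitions()
print(sum([1, 2, 3, 4, 5]) + (n + 1) * 1)   # reproduces the value returned by nums.fold(1, add)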

8. distinct

distinct() returns a new RDD containing only the distinct elements of the source RDD, i.e. it removes duplicate values.
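
A minimal sketch of distinct() using the words RDD defined earlier ('spark' appears twice in the original data, so one copy is dropped):

print(words.distinct().collect())
print(words.distinct().count())   # 8 distinct values out of 9 elements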

9. union, intersection, subtract, and cartesian between RDDs

These correspond to set union, intersection, difference, and Cartesian product.

rdd1 = sc.parallelize(["spark","hadoop","hive","spark"])
rdd2 = sc.parallelize(["spark","hadoop","hbase","hadoop"])
rdd3 = rdd1.union(rdd2)
rdd3.collect()

['spark', 'hadoop', 'hive', 'spark', 'spark', 'hadoop', 'hbase', 'hadoop']
rdd3 = rdd1.intersection(rdd2)
rdd3.collect()

['spark', 'hadoop']
rdd3 = rdd1.subtract(rdd2)
rdd3.collect()

['hive']
rdd3 = rdd1.cartesian(rdd2)
rdd3.collect()

[('spark', 'spark'),
 ('spark', 'hadoop'),
 ('spark', 'hbase'),
 ('spark', 'hadoop'),
 ('hadoop', 'spark'),
 ('hadoop', 'hadoop'),
 ('hadoop', 'hbase'),
 ('hadoop', 'hadoop'),
 ('hive', 'spark'),
 ('hive', 'hadoop'),
 ('hive', 'hbase'),
 ('hive', 'hadoop'),
 ('spark', 'spark'),
 ('spark', 'hadoop'),
 ('spark', 'hbase'),
 ('spark', 'hadoop')]
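
Note that union() simply concatenates the two RDDs and keeps duplicates; chaining distinct() (section 8) gives set-union semantics. A small sketch:

print(rdd1.union(rdd2).distinct().collect())   # e.g. ['spark', 'hadoop', 'hive', 'hbase'], in some order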

10. top, take, and takeOrdered

These actions return a Python list directly, so no collect() is needed. take(n) does not sort; it returns the first n elements of the RDD in their existing order. top(n) returns the n largest elements (sorted descending by default), and takeOrdered(n) returns the n smallest elements (sorted ascending by default).

rdd1 = sc.parallelize(["spark","hadoop","hive","spark","kafka"])
print(rdd1.top(3))
print(rdd1.take(3))
print(rdd1.takeOrdered(3))

['spark', 'spark', 'kafka']
['spark', 'hadoop', 'hive']
['hadoop', 'hive', 'kafka']
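
Both top() and takeOrdered() also accept an optional key function for custom ordering; a small sketch on the same rdd1:

print(rdd1.top(3, key=len))          # the 3 longest strings
print(rdd1.takeOrdered(3, key=len))  # the 3 shortest strings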

11. join

Inner join: join(other, numPartitions=None) returns an RDD of (key, (value_x, value_y)) pairs for every key that appears in both RDDs, with one pair per matching combination of values.

x = sc.parallelize([("spark", 1), ("hadoop", 4), ("hive", 3)])
y = sc.parallelize([("spark", 2), ("hadoop", 5), ("hbase", 6)])
joined = x.join(y)
final = joined.collect()
print( "Join RDD -> %s" % (final))

Join RDD -> [('spark', (1, 2)), ('hadoop', (4, 5))]

Outer joins: left / right / full outer join (leftOuterJoin, rightOuterJoin, and fullOuterJoin). Keys that are missing from one side are paired with None.

left_joined = x.leftOuterJoin(y)
print("Left Outer Join RDD -> %s" % (left_joined.collect()))
right_joined = x.rightOuterJoin(y)
print("Right Outer Join RDD -> %s" % (right_joined.collect()))
full_joined = x.fullOuterJoin(y)
print("Full Outer Join RDD -> %s" % (full_joined.collect()))

Left Outer Join RDD -> [('spark', (1, 2)), ('hive', (3, None)), ('hadoop', (4, 5))]
Right Outer Join RDD -> [('spark', (1, 2)), ('hbase', (None, 6)), ('hadoop', (4, 5))]
Full Outer Join RDD -> [('spark', (1, 2)), ('hive', (3, None)), ('hbase', (None, 6)), ('hadoop', (4, 5))]

 

12. aggregate

aggregate(zeroValue, seqOp, combOp): the first function (seqOp) is applied within each partition, and the second function (combOp) merges the per-partition results; the zero value is used as the initial accumulator in both stages.

def add2(a,b):
    c = a + b
    print(str(a) + " add " + str(b) + ' = ' + str(c))
    return c
def mul(a,b):
    c = a*b
    print(str(a) + " mul " + str(b) + ' = ' + str(c))
    return c
print(nums.glom().collect())

# seqOp add2 sums the values in each partition starting from 2, turning the partitions into [[3], [4], [5], [11]]
# combOp mul then multiplies these results together starting from 2: 2*3 = 6, 6*4 = 24, 24*5 = 120, 120*11 = 1320
print(nums.aggregate(2, add2, mul))
# seqOp mul multiplies the values in each partition starting from 2, giving [[2], [4], [6], [40]], where 40 = 2*4*5
# combOp add2 then sums these results starting from 2: 2+2 = 4, 4+4 = 8, 8+6 = 14, 14+40 = 54
print(nums.aggregate(2, mul, add2))

[[1], [2], [3], [4, 5]]
2 mul 3 = 6
6 mul 4 = 24
24 mul 5 = 120
120 mul 11 = 1320
1320
2 add 2 = 4
4 add 4 = 8
8 add 6 = 14
14 add 40 = 54
54
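
A classic sketch of aggregate() where the zero value has a different type from the elements: computing a (sum, count) pair in one pass to get the mean (assuming the same nums RDD as above).

seq_op = lambda acc, x: (acc[0] + x, acc[1] + 1)    # fold one element into the (sum, count) accumulator within a partition
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])   # merge (sum, count) accumulators across partitions
total, count = nums.aggregate((0, 0), seq_op, comb_op)
print(total / count)   # 3.0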

 
