RDD and DataFrame are the two data structures most frequently used in Spark. Comparing them briefly: DataFrame code is usually faster than equivalent RDD code, because Spark can optimize DataFrame queries, and for structured data it is also more concise, since a DataFrame corresponds directly to a table structure.
The RDD is Spark's primary user-facing API. At its core, an RDD is a distributed collection of data elements, partitioned across the nodes of the cluster, and it exposes a low-level API of parallel transformations and actions.
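To make that "low-level API" concrete, here is a minimal, hedged sketch (the numbers are made up for illustration): an RDD is built from a local Python collection and manipulated with map, filter, and reduce.

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.master('local').getOrCreate()
    sc = spark.sparkContext

    # Build an RDD from a local collection and apply low-level transformations
    nums = sc.parallelize([1, -8, 7, 13, 18, 23])
    squared = nums.filter(lambda x: x > 0).map(lambda x: x * x)
    print(squared.collect())                    # results of the transformations
    print(squared.reduce(lambda a, b: a + b))   # an action that aggregates the values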
In general, RDDs are convenient in the following cases:
1. You need low-level transformations, actions, and fine-grained control over the dataset;
2. The data is unstructured, for example streams of text or other files (see the sketch below).
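As an illustration of case 2 (a sketch only; the path /tmp/raw_logs.txt is hypothetical), raw text lines carry no schema, so they are naturally handled with RDD operations such as flatMap and reduceByKey:

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.master('local').getOrCreate()
    sc = spark.sparkContext

    # Unstructured text: each element is a raw line with no fixed columns
    lines = sc.textFile("/tmp/raw_logs.txt")    # hypothetical input path
    word_counts = (lines.flatMap(lambda line: line.split())
                        .map(lambda word: (word, 1))
                        .reduceByKey(lambda a, b: a + b))
    print(word_counts.take(10))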
Below is an example that converts a DataFrame to an RDD, performs a low-level transformation, and then converts the result back to a DataFrame:
from pyspark.sql import SparkSession
from pyspark.sql.types import *

def picewise_func(x):
    # Map a numeric value onto a bucket index (a simple piecewise function)
    if x < 0:
        return 0
    elif 0 <= x <= 5:
        return 1
    elif 6 <= x <= 10:
        return 2
    elif 11 <= x <= 15:
        return 3
    elif 16 <= x <= 20:
        return 4
    else:
        return 5

if __name__ == "__main__":
    spark = SparkSession.builder.master('local').enableHiveSupport().getOrCreate()

    valuesB = [('A', 1), ('B', -8), ('C', 7), ('D', 13), ('E', 18), ('F', 23)]
    TableB = spark.createDataFrame(valuesB, ['name', 'id'])
    TableB.show()

    # DataFrame -> RDD: apply the piecewise function to each row
    rdd2 = TableB.rdd
    rdd1 = rdd2.map(lambda x: (x[0] + "_rdd", picewise_func(x[1])))
    for element in rdd1.collect():
        print(element)

    # RDD -> DataFrame with an explicit schema
    schema1 = StructType([StructField('col1', StringType(), True),
                          StructField('col2', IntegerType(), True)])
    TableC = spark.createDataFrame(rdd1, schema=schema1)
    TableC.show()
As the code shows, the DataFrame's rdd attribute gives direct access to the underlying RDD; the transformation (here, a piecewise function) is then implemented through the RDD's map with a lambda, and finally createDataFrame builds a new DataFrame from the result.
Output:
+----+---+
|name| id|
+----+---+
| A| 1|
| B| -8|
| C| 7|
| D| 13|
| E| 18|
| F| 23|
+----+---+
('A_rdd', 1)
('B_rdd', 0)
('C_rdd', 2)
('D_rdd', 3)
('E_rdd', 4)
('F_rdd', 5)
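For comparison, and to back up the earlier claim that DataFrame code is more concise for structured data, the same bucketing can be expressed directly against the DataFrame API with when/otherwise, without dropping to the RDD level. This is a hedged sketch that continues the script above (TableB has columns name and id; the name TableC_df is made up):

from pyspark.sql.functions import when, col, concat, lit

# Same piecewise bucketing, written as a single column expression
TableC_df = TableB.select(
    concat(col('name'), lit('_df')).alias('col1'),
    when(col('id') < 0, 0)
    .when(col('id') <= 5, 1)
    .when(col('id') <= 10, 2)
    .when(col('id') <= 15, 3)
    .when(col('id') <= 20, 4)
    .otherwise(5).alias('col2'))
TableC_df.show()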
As you can see, RDDs make it easy to implement all kinds of transformations. Here is another example that computes the sum of two columns:
from pyspark.sql import SparkSession
from pyspark.sql.types import *

def picewise_func(x):
    # Same piecewise bucketing as in the previous example
    if x < 0:
        return 0
    elif 0 <= x <= 5:
        return 1
    elif 6 <= x <= 10:
        return 2
    elif 11 <= x <= 15:
        return 3
    elif 16 <= x <= 20:
        return 4
    else:
        return 5

if __name__ == "__main__":
    spark = SparkSession.builder.master('local').enableHiveSupport().getOrCreate()

    valuesB = [('A', 1, 1), ('B', -8, 2), ('C', 7, 3), ('D', 13, 4), ('E', 18, 5), ('F', 23, 6)]
    TableB = spark.createDataFrame(valuesB, ['name', 'id1', 'id2'])
    print('********TableB********')
    TableB.show()

    rdd2 = TableB.rdd

    # Bucket the first numeric column, as before
    rdd1 = rdd2.map(lambda x: (x[0] + "_rdd", picewise_func(x[1])))
    print('*********rdd1**********')
    for element in rdd1.collect():
        print(element)

    schema1 = StructType([StructField('col1', StringType(), True),
                          StructField('col2', IntegerType(), True)])
    TableC = spark.createDataFrame(rdd1, schema=schema1)
    print('*********TableC********')
    TableC.show()

    # Sum the two numeric columns in the lambda, then convert back to a DataFrame
    rdd3 = rdd2.map(lambda x: (x[0] + "_rdd1", x[1] + x[2]))
    print('*********rdd3**********')
    for element in rdd3.collect():
        print(element)
    TableD = spark.createDataFrame(rdd3, schema=schema1)
    TableD.show()
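The two-column sum also has a one-line DataFrame equivalent. This is again only a sketch continuing the script above (TableD_df and id_sum are made-up names):

from pyspark.sql.functions import col

# Column arithmetic directly on the DataFrame, no RDD round trip needed
TableD_df = TableB.select(col('name'), (col('id1') + col('id2')).alias('id_sum'))
TableD_df.show()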
Below is a more involved example that processes all numeric columns of each row in one batch:
from pyspark.sql import SparkSession
from pyspark.sql.types import *

def picewise_func(xx):
    # Apply the piecewise bucketing to every numeric field of a row,
    # keeping the (renamed) name column as the first element
    tmp_list = []
    tmp_list.append(xx[0] + "_rdd")
    for x in xx[1:]:
        if x < 0:
            tmp_list.append(0)
        elif 0 <= x <= 5:
            tmp_list.append(1)
        elif 6 <= x <= 10:
            tmp_list.append(2)
        elif 11 <= x <= 15:
            tmp_list.append(3)
        elif 16 <= x <= 20:
            tmp_list.append(4)
        else:
            tmp_list.append(5)
    return tmp_list

if __name__ == "__main__":
    spark = SparkSession.builder.master('local').enableHiveSupport().getOrCreate()

    valuesB = [('A', 1, 2), ('B', -8, -7), ('C', 7, 8), ('D', 13, 14), ('E', 18, 19), ('F', 23, 24)]
    TableB = spark.createDataFrame(valuesB, ['name', 'id1', 'id2'])
    print('********TableB********')
    TableB.show()

    # Transform every column of every row in one pass
    rdd2 = TableB.rdd
    rdd1 = rdd2.map(lambda x: picewise_func(x))
    print('*********rdd1**********')
    for element in rdd1.collect():
        print(element)

    schema1 = StructType([StructField('col1', StringType(), True),
                          StructField('col2', IntegerType(), True),
                          StructField('col3', IntegerType(), True)])
    TableC = spark.createDataFrame(rdd1, schema=schema1)
    print('*********TableC********')
    TableC.show()
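If the same per-column bucketing is needed without leaving the DataFrame API, one hedged alternative is to build the when expression in a loop over the numeric columns. This sketch continues the script above (bucket, num_cols, and TableC_df are made-up names):

from pyspark.sql.functions import when, col, concat, lit

def bucket(c):
    # Same piecewise mapping as picewise_func, expressed as a Column expression
    return (when(c < 0, 0)
            .when(c <= 5, 1)
            .when(c <= 10, 2)
            .when(c <= 15, 3)
            .when(c <= 20, 4)
            .otherwise(5))

num_cols = ['id1', 'id2']   # in a real job this list could be derived from TableB.columns
TableC_df = TableB.select(
    concat(col('name'), lit('_rdd')).alias('col1'),
    *[bucket(col(c)).alias('col%d' % (i + 2)) for i, c in enumerate(num_cols)])
TableC_df.show()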