Converting between RDD and DataFrame in PySpark: using the RDD to process a DataFrame (piecewise bucketing and other transformations)

RDD and DataFrame are the two data structures most frequently used in Spark. Comparing them briefly: the DataFrame is generally faster than the RDD, and for structured data the DataFrame code is more concise, because a DataFrame corresponds directly to a table structure.
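
As a quick illustration of the conciseness point, here is a minimal sketch (the column names and sample rows are made up for this illustration) of the same row filter written against the DataFrame API and against the underlying RDD:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').getOrCreate()
df = spark.createDataFrame([('A', 1), ('B', -8), ('C', 7)], ['name', 'id'])

# DataFrame API: declarative column expressions, optimized by Catalyst
df.filter(df.id > 0).show()

# Equivalent RDD code: per-row lambda on Row objects, no optimizer involved
print(df.rdd.filter(lambda row: row['id'] > 0).collect())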

The RDD was Spark's primary user-facing API. At its core, an RDD is a distributed collection of data elements partitioned across the nodes of a cluster, and it provides a low-level API for parallel transformations and actions.

Generally speaking, an RDD is the more convenient choice when:

1. You need low-level transformations, actions, and control over the dataset;

2. The data is unstructured, such as streams of text read from raw files (see the sketch below).
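
For case 2, a minimal sketch (the file path here is hypothetical) that reads raw text as an RDD and applies low-level transformations to it:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').getOrCreate()
sc = spark.sparkContext

# Hypothetical path: unstructured log lines with no fixed schema
lines = sc.textFile('/tmp/example_logs.txt')

# Element-by-element control over parsing and aggregation
word_counts = (lines.flatMap(lambda line: line.split())
                    .map(lambda w: (w, 1))
                    .reduceByKey(lambda a, b: a + b))
print(word_counts.take(10))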

Below is an example that converts a DataFrame to an RDD, applies a low-level transformation, and then converts the result back to a DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

def picewise_func(x):
    """Map a numeric value to a bucket id (piecewise / segmentation function)."""
    if x < 0:
        return 0
    elif 0 <= x <= 5:
        return 1
    elif 6 <= x <= 10:
        return 2
    elif 11 <= x <= 15:
        return 3
    elif 16 <= x <= 20:
        return 4
    else:
        return 5

if __name__ == "__main__":
    spark = SparkSession.builder.master('local').enableHiveSupport().getOrCreate()
    valuesB = [('A', 1), ('B', -8), ('C', 7), ('D', 13), ('E', 18), ('F', 23)]
    TableB = spark.createDataFrame(valuesB, ['name', 'id'])
    TableB.show()
    # DataFrame -> RDD: the .rdd attribute exposes the underlying RDD of Row objects
    rdd2 = TableB.rdd
    # Low-level transformation: rename each row and bucket its numeric value
    rdd1 = rdd2.map(lambda x: (x[0] + "_rdd", picewise_func(x[1])))
    for element in rdd1.collect():
        print(element)
    # RDD -> DataFrame: supply an explicit schema
    schema1 = StructType([StructField('col1', StringType(), True),
                          StructField('col2', IntegerType(), True)])
    TableC = spark.createDataFrame(rdd1, schema=schema1)
    TableC.show()

As you can see, the DataFrame's rdd attribute gives direct access to the underlying RDD; the transformation (here a piecewise bucketing function) is then implemented through the RDD's map with a lambda, and finally createDataFrame builds a new DataFrame from the result.
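
For comparison, the same piecewise mapping can stay entirely inside the DataFrame API by wrapping the function as a UDF. This is only a sketch meant to be appended to the script above, so it assumes spark, TableB and picewise_func are already defined; functions.udf, concat and lit are standard PySpark:

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Wrap the same Python function as a column-level UDF
picewise_udf = F.udf(picewise_func, IntegerType())
TableC2 = TableB.select(
    F.concat(F.col('name'), F.lit('_udf')).alias('col1'),
    picewise_udf(F.col('id')).alias('col2'))
TableC2.show()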

Output:

+----+---+
|name| id|
+----+---+
|   A|  1|
|   B| -8|
|   C|  7|
|   D| 13|
|   E| 18|
|   F| 23|
+----+---+

('A_rdd', 1)
('B_rdd', 0)
('C_rdd', 2)
('D_rdd', 3)
('E_rdd', 4)
('F_rdd', 5)

As you can see, the RDD makes it easy to implement all kinds of transformations. Here is another example that sums two arbitrary columns:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

def picewise_func(x):
    """Map a numeric value to a bucket id (piecewise / segmentation function)."""
    if x < 0:
        return 0
    elif 0 <= x <= 5:
        return 1
    elif 6 <= x <= 10:
        return 2
    elif 11 <= x <= 15:
        return 3
    elif 16 <= x <= 20:
        return 4
    else:
        return 5

if __name__ == "__main__":
    spark = SparkSession.builder.master('local').enableHiveSupport().getOrCreate()
    valuesB = [('A', 1, 1), ('B', -8, 2), ('C', 7, 3), ('D', 13, 4), ('E', 18, 5), ('F', 23, 6)]
    # Three values per row, so three column names are needed
    TableB = spark.createDataFrame(valuesB, ['name', 'id1', 'id2'])
    print('********TableB********')
    TableB.show()
    rdd2 = TableB.rdd
    # Bucket the first numeric column, as in the previous example
    rdd1 = rdd2.map(lambda x: (x[0] + "_rdd", picewise_func(x[1])))
    print('*********rdd1**********')
    for element in rdd1.collect():
        print(element)
    schema1 = StructType([StructField('col1', StringType(), True),
                          StructField('col2', IntegerType(), True)])
    TableC = spark.createDataFrame(rdd1, schema=schema1)
    print('*********TableC********')
    TableC.show()
    # Sum of any two columns: here id1 + id2
    rdd3 = rdd2.map(lambda x: (x[0] + "_rdd1", x[1] + x[2]))
    print('*********rdd3**********')
    for element in rdd3.collect():
        print(element)
    # The same (string, integer) schema fits the summed result
    TableD = spark.createDataFrame(rdd3, schema=schema1)
    TableD.show()
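
The two-column sum itself is simpler to express with DataFrame column arithmetic; the detour through the RDD is mainly worthwhile for logic that column expressions cannot express. A sketch, again assuming the TableB with columns name, id1 and id2 from the script above:

from pyspark.sql import functions as F

TableD2 = TableB.select(
    F.concat(F.col('name'), F.lit('_df')).alias('col1'),
    (F.col('id1') + F.col('id2')).alias('col2'))
TableD2.show()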

Finally, a more complex example that buckets several columns in one batch:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

def picewise_func(xx):
    """Bucket every numeric field of a row; the first field is the row name."""
    tmp_list = [xx[0] + "_rdd"]
    for x in xx[1:]:
        if x < 0:
            tmp_list.append(0)
        elif 0 <= x <= 5:
            tmp_list.append(1)
        elif 6 <= x <= 10:
            tmp_list.append(2)
        elif 11 <= x <= 15:
            tmp_list.append(3)
        elif 16 <= x <= 20:
            tmp_list.append(4)
        else:
            tmp_list.append(5)
    return tmp_list

if __name__ == "__main__":
    spark = SparkSession.builder.master('local').enableHiveSupport().getOrCreate()
    valuesB = [('A', 1, 2), ('B', -8, -7), ('C', 7, 8), ('D', 13, 14), ('E', 18, 19), ('F', 23, 24)]
    TableB = spark.createDataFrame(valuesB, ['name', 'id1', 'id2'])
    print('********TableB********')
    TableB.show()
    rdd2 = TableB.rdd
    # Apply the batch piecewise function to the whole row at once
    rdd1 = rdd2.map(picewise_func)
    print('*********rdd1**********')
    for element in rdd1.collect():
        print(element)
    schema1 = StructType([StructField('col1', StringType(), True),
                          StructField('col2', IntegerType(), True),
                          StructField('col3', IntegerType(), True)])
    TableC = spark.createDataFrame(rdd1, schema=schema1)
    print('*********TableC********')
    TableC.show()
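
One more note: TableB.rdd yields pyspark.sql.Row objects, so the lambda can also access columns by name instead of position, which keeps such batch functions easier to maintain when the schema changes. A sketch, assuming the TableB defined in the last example:

# Name-based access on Row objects instead of positional indexing
rdd4 = TableB.rdd.map(lambda row: (row['name'] + '_rdd', row['id1'] + row['id2']))
for element in rdd4.collect():
    print(element)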
