When processing data with Spark SQL, you may need to split one column into several columns, or to merge several columns into one. This post records the methods I have found so far for splitting and merging DataFrame columns.
from pyspark.sql import SparkSession

# create a local SparkSession
spark = SparkSession.builder \
    .master("local") \
    .appName("dataframe_split") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
sc = spark.sparkContext

# read the source CSV from HDFS, inferring column types from the data
df = spark.read.csv('hdfs://master:9000/dataset/dataframe_split.csv', inferSchema=True, header=True)
df.show(3)
The original data looks like this: the score column holds a string of space-separated values.
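If the HDFS file is not at hand, an equivalent DataFrame can be built in memory. This is only a sketch of the shape the code below assumes, a score column holding four space-separated values; the name column and the concrete values are made up for illustration.

# hypothetical stand-in for the CSV: only the layout of the score column matters below
df = spark.createDataFrame(
    [("A", "90 80 85 70"),
     ("B", "60 75 88 92"),
     ("C", "77 66 55 44")],
    ["name", "score"])
df.show(3)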
from pyspark.sql.functions import split, explode, concat, concat_ws

# split the score column on spaces; the new column s is an array of strings
df_split = df.withColumn("s", split(df['score'], " "))
df_split.show()
The ordering is based first on the partition index and then on the order of items within each partition, so the first item in the first partition gets index 0 and the last item in the last partition gets the largest index. This method needs to trigger a Spark job when the RDD contains more than one partition. This is how zipWithIndex, used below to pair the generated column names with their positions, assigns indices.
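As a quick illustration of zipWithIndex, a minimal sketch on a throwaway RDD:

# each element is paired with its position across all partitions
sc.parallelize(["a", "b", "c"], 2).zipWithIndex().collect()
# [('a', 0), ('b', 1), ('c', 2)]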
# the number of new columns is determined from the first row
first_row = df.first()
numAttrs = len(first_row['score'].split(" "))
print("number of new columns:", numAttrs)

# generate a column name for each position and pair it with its index
attrs = sc.parallelize(["score_" + str(i) for i in range(numAttrs)]).zipWithIndex().collect()
print("column names:", attrs)

# pull each array element out into its own column
for name, index in attrs:
    df_split = df_split.withColumn(name, df_split['s'].getItem(index))
df_split.show()
# explode: each element of the array produced by split becomes its own row,
# i.e. one column is split into multiple rows
df_explode = df.withColumn("e", explode(split(df['score'], " ")))
df_explode.show()
There are two functions for merging columns: concat(), which joins values without a separator, and concat_ws(), which inserts a separator.
# merge the four score columns back into one column, without a separator
df_concat = df_split.withColumn("score_concat",
                                concat(df_split['score_0'], df_split['score_1'],
                                       df_split['score_2'], df_split['score_3']))
df_concat.show()
# merge the four score columns with '-' as the separator
df_ws = df_split.withColumn("score_concat",
                            concat_ws('-', df_split['score_0'], df_split['score_1'],
                                      df_split['score_2'], df_split['score_3']))
df_ws.show()
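One practical difference between the two functions, shown with a minimal sketch using made-up literal values: concat() returns null as soon as any input is null, while concat_ws() simply skips null inputs.

from pyspark.sql.functions import lit
spark.range(1).select(
    concat(lit("a"), lit(None).cast("string"), lit("b")).alias("with_concat"),    # -> null
    concat_ws("-", lit("a"), lit(None).cast("string"), lit("b")).alias("with_ws")  # -> a-b
).show()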
# DataFrame layout: one row per user-movie rating: userID, movieID, rating
df = spark.sparkContext.parallelize([[15, 399, 2],
                                     [15, 1401, 5],
                                     [15, 1608, 4],
                                     [15, 20, 4],
                                     [18, 100, 3],
                                     [18, 1401, 3],
                                     [18, 399, 1]]) \
    .toDF(["userID", "movieID", "rating"])
# pivot: turn multiple rows into multiple columns (one column per distinct movieID; missing ratings filled with -1)
resultDF = df.groupBy("userID").pivot("movieID").sum("rating").na.fill(-1)
# result
resultDF.show()
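For the sample ratings above, the pivoted result should look roughly like this (row order may vary):

# +------+---+---+---+----+----+
# |userID| 20|100|399|1401|1608|
# +------+---+---+---+----+----+
# |    15|  4| -1|  2|   5|   4|
# |    18| -1|  3|  1|   3|  -1|
# +------+---+---+---+----+----+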