PySpark DataFrame and RDD usage examples

Official PySpark RDD API reference: http://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.html#pyspark.RDD

1. Reading data

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
schema = ('user_id', 'item_id', 'click_lis', 'scores')
data = [('u1', 'i1', ['i1', 'i3', 'i2'], '0.6'), ('u1', 'i2', ['i1', 'i3', 'i2'], '0.7'), ('u1', 'i4', ['i1', 'i3', 'i2'], '0.2'), ('u1', 'i5', ['i1', 'i3', 'i2'], '0.3'),
        ('u2', 'i1', ['i4', 'i3', 'i2'], '0.6'), ('u2', 'i6', ['i4', 'i3', 'i2'], '0.7'), ('u2', 'i4', ['i4', 'i3', 'i2'], '0.2'), ('u2', 'i5', ['i4', 'i3', 'i2'], '0.3')]
df1 = spark.createDataFrame(data, schema)
df1.show()

+-------+-------+------------+------+
|user_id|item_id|   click_lis|scores|
+-------+-------+------------+------+
|     u1|     i1|[i1, i3, i2]|   0.6|
|     u1|     i2|[i1, i3, i2]|   0.7|
|     u1|     i4|[i1, i3, i2]|   0.2|
|     u1|     i5|[i1, i3, i2]|   0.3|
|     u2|     i1|[i4, i3, i2]|   0.6|
|     u2|     i6|[i4, i3, i2]|   0.7|
|     u2|     i4|[i4, i3, i2]|   0.2|
|     u2|     i5|[i4, i3, i2]|   0.3|
+-------+-------+------------+------+

2. Operations across columns

# Combine the item_id and scores fields into a single (item, score) pair
df1 = df1.rdd.map(lambda line: [line.user_id, line.click_lis, (line.item_id, line.scores)]).toDF(['user_id', 'click_lis', 'item2scrore'])
df1.show()
df1.show()
+-------+------------+-----------+
|user_id|   click_lis|item2scrore|
+-------+------------+-----------+
|     u1|[i1, i3, i2]|  [i1, 0.6]|
|     u1|[i1, i3, i2]|  [i2, 0.7]|
|     u1|[i1, i3, i2]|  [i4, 0.2]|
|     u1|[i1, i3, i2]|  [i5, 0.3]|
|     u2|[i4, i3, i2]|  [i1, 0.6]|
|     u2|[i4, i3, i2]|  [i6, 0.7]|
|     u2|[i4, i3, i2]|  [i4, 0.2]|
|     u2|[i4, i3, i2]|  [i5, 0.3]|
+-------+------------+-----------+
Many operations combine two columns. For example, if two columns both hold list values and you need their intersection, you can use list(set(line[1]) & set(line[2])) inside the map. Note that the result must be converted back to a list, because PySpark does not support the set type; if the intersection is empty, the value is an empty list [] rather than null.
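A minimal sketch of this intersection pattern (the DataFrame pairs and its columns list_a and list_b are hypothetical, not part of the example above; spark is the session created in section 1):

pairs = spark.createDataFrame(
    [('u1', ['i1', 'i2', 'i3'], ['i2', 'i3', 'i5']),
     ('u2', ['i4'], ['i1'])],
    ('user_id', 'list_a', 'list_b'))

# Intersect the two list columns row by row; convert back to a list because
# a Python set cannot be stored in a DataFrame column. An empty intersection
# (user u2 here) ends up as [] rather than null.
inter = pairs.rdd.map(
    lambda line: (line.user_id, list(set(line.list_a) & set(line.list_b)))
).toDF(['user_id', 'common_items'])
inter.show()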

# Group by, using user_id as the key (the key must be hashable, so it cannot be a list - otherwise you get "unhashable type: 'list'"); a DataFrame-API alternative is sketched below the output
df2 = df1.rdd.map(lambda line: (line.user_id, [line.item2scrore])).reduceByKey(lambda x, y: x + y).toDF(['user_id', 'item2scrore'])  # x + y concatenates the value lists, merging all the [line.item2scrore] of the same user
df2.show()

+-------+--------------------+
|user_id|         item2scrore|
+-------+--------------------+
|     u1|[[i1, 0.6], [i2, ...|
|     u2|[[i1, 0.6], [i6, ...|
+-------+--------------------+
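For comparison, the same per-user aggregation as the reduceByKey step above can also be written with the DataFrame API; a sketch (note that the element order produced by collect_list is not guaranteed):

from pyspark.sql import functions as F

# One row per user_id, with all (item, score) pairs gathered into an array column.
df2_alt = df1.groupBy('user_id').agg(F.collect_list('item2scrore').alias('item2scrore'))
df2_alt.show(truncate=False)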

3. Join

Joining directly on RDDs is awkward; it is usually easier to convert to DataFrames and join there.

result = df2.join(df3, ['user_id'], 'left')  # the list can hold multiple join keys, e.g. ['user_id', 'col1', ...]
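df3 is not defined above; a minimal sketch with a hypothetical per-user table shows the shape of the call:

# Hypothetical right-hand table keyed by user_id (not part of the original example).
df3 = spark.createDataFrame([('u1', 25), ('u2', 31)], ('user_id', 'age'))

# A left join keeps every row of df2 and fills df3's columns with null where there is no match.
result = df2.join(df3, ['user_id'], 'left')
result.show(truncate=False)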

4. Computing the mean of a column

mean_val = rdd.map(lambda line: line[0]).mean()  # mean() returns a plain Python number, not an RDD

Or with the DataFrame API (F is pyspark.sql.functions):

acc = result.select(F.mean('acc').alias('acc')).collect()[0].acc
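A self-contained sketch with made-up numbers; mean() needs numeric values, so the string scores are cast first:

from pyspark.sql import functions as F

scores = spark.createDataFrame([('i1', '0.6'), ('i2', '0.7'), ('i4', '0.2')], ('item_id', 'scores'))

# RDD route: extract the column as floats, then call mean() directly on the RDD.
rdd_mean = scores.rdd.map(lambda line: float(line.scores)).mean()

# DataFrame route: cast the string column to double and aggregate.
df_mean = scores.select(F.mean(F.col('scores').cast('double')).alias('m')).collect()[0].m

print(rdd_mean, df_mean)  # both are 0.5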

5. Merging two DataFrames with the same number of rows

merge = df1.rdd.zip(df2.rdd)  # zip requires both RDDs to have the same number of partitions and the same number of elements per partition

df = merge.map(lambda line: line[0] + line[1]).toDF()  # line[0] holds all columns of df1 as a Row (a tuple subclass); pass a list of column names to toDF() for explicit names

line[1] is the same for df2. Note: when a DataFrame is zipped with a "series" (a single column of plain values), line[1] is not a tuple.
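A minimal sketch with two small equal-length DataFrames (dfa and dfb are hypothetical names, not from the example above):

dfa = spark.createDataFrame([('u1', 1), ('u2', 2)], ('user_id', 'x'))
dfb = spark.createDataFrame([(0.6,), (0.7,)], ('y',))

# zip only works when both RDDs have the same number of partitions and the same
# number of elements per partition; two local DataFrames of equal size created
# the same way satisfy this. Row is a tuple subclass, so line[0] + line[1]
# simply concatenates the two rows.
merged = dfa.rdd.zip(dfb.rdd).map(lambda line: line[0] + line[1]).toDF(['user_id', 'x', 'y'])
merged.show()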

6. Filtering

rdd = rdd.filter(lambda line: line[0] != '0')  # keep only the rows whose first field is not '0'

7. Removing duplicates

rdd=rdd.distinct()

8. Expanding list values into rows

x = sc.parallelize([("a", ["x", "y", "z"]), ("b", ["p", "r"])])
def f(x): return x
print(x.flatMapValues(f).collect())
# [('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')]  -- 'a' and 'b' are the keys
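The DataFrame-API counterpart is F.explode; a sketch applied to the click_lis array column from the earlier example:

from pyspark.sql import functions as F

# Each element of the click_lis array becomes its own row, keyed by user_id.
df1.select('user_id', F.explode('click_lis').alias('clicked_item')).show()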

9. Checking whether a column is entirely null

# Keep only the non-null values of the column; if nothing is left, the column is entirely null
col_rdd = rdd.map(lambda line: line[0]).filter(lambda v: v is not None)

print(col_rdd.isEmpty())

10. Sorting

rdd = rdd.sortBy(lambda line: line[0], ascending=True)  # sort ascending by the first field
