PySpark RDD usage examples. Official API docs: http://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.html#pyspark.RDD
1. Reading data
schema = ('user_id', 'item_id', 'click_lis', 'scores')
data = [('u1', 'i1', ['i1', 'i3', 'i2'], '0.6'), ('u1', 'i2', ['i1', 'i3', 'i2'], '0.7'),
        ('u1', 'i4', ['i1', 'i3', 'i2'], '0.2'), ('u1', 'i5', ['i1', 'i3', 'i2'], '0.3'),
        ('u2', 'i1', ['i4', 'i3', 'i2'], '0.6'), ('u2', 'i6', ['i4', 'i3', 'i2'], '0.7'),
        ('u2', 'i4', ['i4', 'i3', 'i2'], '0.2'), ('u2', 'i5', ['i4', 'i3', 'i2'], '0.3')]
df1 = spark.createDataFrame(data, schema)
df1.show()
+-------+-------+------------+------+
|user_id|item_id| click_lis|scores|
+-------+-------+------------+------+
| u1| i1|[i1, i3, i2]| 0.6|
| u1| i2|[i1, i3, i2]| 0.7|
| u1| i4|[i1, i3, i2]| 0.2|
| u1| i5|[i1, i3, i2]| 0.3|
| u2| i1|[i4, i3, i2]| 0.6|
| u2| i6|[i4, i3, i2]| 0.7|
| u2| i4|[i4, i3, i2]| 0.2|
| u2| i5|[i4, i3, i2]| 0.3|
+-------+-------+------------+------+
2. Interactions between columns
# Combine the item_id and scores fields into a single column
df1 = df1.rdd.map(lambda line: [line.user_id, line.click_lis, (line.item_id, line.scores)]) \
             .toDF(['user_id', 'click_lis', 'item2scrore'])
df1.show()
+-------+------------+-----------+
|user_id| click_lis|item2scrore|
+-------+------------+-----------+
| u1|[i1, i3, i2]| [i1, 0.6]|
| u1|[i1, i3, i2]| [i2, 0.7]|
| u1|[i1, i3, i2]| [i4, 0.2]|
| u1|[i1, i3, i2]| [i5, 0.3]|
| u2|[i4, i3, i2]| [i1, 0.6]|
| u2|[i4, i3, i2]| [i6, 0.7]|
| u2|[i4, i3, i2]| [i4, 0.2]|
| u2|[i4, i3, i2]| [i5, 0.3]|
+-------+------------+-----------+
There are many ways two columns can interact. For example, if both columns hold list values and you want their intersection, you can use list(set(line[1]) & set(line[2])) inside the map (note: convert the result back to a list, because PySpark does not support the set type; if the intersection is empty, the value is an empty list [] rather than null).
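A minimal sketch of the intersection case (df_lists and its columns list_a/list_b are made up for illustration):
df_lists = spark.createDataFrame(
    [('u1', ['i1', 'i2'], ['i2', 'i3']), ('u2', ['i4'], ['i5'])],
    ('user_id', 'list_a', 'list_b'))
df_common = df_lists.rdd.map(
    lambda line: [line.user_id, list(set(line.list_a) & set(line.list_b))]
).toDF(['user_id', 'common'])
df_common.show()  # u1 -> [i2]; u2 -> [] (empty list, not null)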
# Group by with user_id as the key (note: the user_id value must not be a list or tuple, otherwise you get unhashable type: 'list')
df2 = df1.rdd.map(lambda line: (line.user_id, [line.item2scrore])) \
             .reduceByKey(lambda x, y: x + y) \
             .toDF(['user_id', 'item2scrore'])
# x + y concatenates the value lists, i.e. all the [line.item2scrore] entries under the same user are merged
df2.show()
+-------+--------------------+
|user_id| item2scrore|
+-------+--------------------+
| u1|[[i1, 0.6], [i2, ...|
| u2|[[i1, 0.6], [i6, ...|
+-------+--------------------+
3. join
Joining at the RDD level is awkward; it is easier to convert to DataFrames and join there.
result = df2.join(df3, ['user_id'], 'left')  # the list can also hold multiple join keys, e.g. ['user_id', 'col1', ...]
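For instance, with a hypothetical df3 holding one extra attribute per user (df3 and its gender column are made up for illustration):
df3 = spark.createDataFrame([('u1', 'F'), ('u3', 'M')], ('user_id', 'gender'))
result = df2.join(df3, ['user_id'], 'left')
result.show()  # u1 gets gender 'F'; u2 has no match in df3, so its gender is null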
4. Computing the mean of a column
mean_value = rdd.map(lambda line: line[0]).mean()
# or with a DataFrame (F is pyspark.sql.functions):
acc = result.select(F.mean('acc').alias('acc')).collect()[0].acc
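Applied to the sample data from step 1 (rebuilt here as df_scores because df1 was overwritten in step 2; scores was created as a string, so cast it to a number first):
from pyspark.sql import functions as F
df_scores = spark.createDataFrame(data, schema)  # same data and schema as in step 1
print(df_scores.rdd.map(lambda line: float(line.scores)).mean())  # RDD version: 0.45
print(df_scores.select(F.mean(F.col('scores').cast('double')).alias('scores')).collect()[0].scores)  # DataFrame version: 0.45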
5. Merging two DataFrames with the same number of rows into one
merge = df1.rdd.zip(df2.rdd)
df = merge.map(lambda line: line[0] + line[1]).toDF()  # or pass a list of column names to toDF
# line[0] holds all the columns of df1 (a Row, which behaves like a tuple); line[1] is the same for df2.
# Note: when a DataFrame is zipped with a single column (a series), line[1] is not a tuple.
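A minimal sketch on the step-2 df1. zip requires both RDDs to have the same number of partitions and the same number of rows per partition, so the second RDD here is derived row-for-row from df1 itself (the score column pulled out of item2scrore is just for illustration):
rdd_scores = df1.rdd.map(lambda line: (float(line.item2scrore[1]),))  # one-field tuple per row
merge = df1.rdd.zip(rdd_scores)
df_merged = merge.map(lambda line: line[0] + line[1]) \
                 .toDF(['user_id', 'click_lis', 'item2scrore', 'score'])
df_merged.show()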
6. Filtering
rdd = rdd.filter(lambda line: line[0] != '0')  # keep only the rows whose first field is not '0'
7. Removing duplicate rows
rdd=rdd.distinct()
8. Turning a list column into rows
x = sc.parallelize([("a", ["x", "y", "z"]), ("b", ["p", "r"])])
def f(x): return x
x.flatMapValues(f).collect()
# [('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')]  # 'a' and 'b' are the keys
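The DataFrame counterpart of this column-to-rows operation is F.explode from pyspark.sql.functions, shown here on the same data:
from pyspark.sql import functions as F
df_kv = spark.createDataFrame([('a', ['x', 'y', 'z']), ('b', ['p', 'r'])], ('key', 'vals'))
df_kv.select('key', F.explode('vals').alias('val')).show()  # one row per (key, value) pair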
9. Checking whether a column is entirely null
# keep only the non-null values of the column; if nothing is left, every value in the column is null
non_null = rdd.map(lambda line: line[0]).filter(lambda v: v is not None)
print(non_null.isEmpty())
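A quick check on hypothetical data (df_null and its maybe_null column are made up for illustration; the DDL-string schema is needed because a column containing only nulls cannot be type-inferred):
df_null = spark.createDataFrame([('u1', None), ('u2', None)], 'user_id string, maybe_null string')
non_null = df_null.rdd.map(lambda line: line.maybe_null).filter(lambda v: v is not None)
print(non_null.isEmpty())  # True, because every value in maybe_null is null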
10. Sorting
rdd=rdd.sortBy(lambda line:line[0],ascending=True)
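For example, sorting the step-2 rows by score with the highest first (item2scrore[1] is the score string, so cast it to float for a numeric sort):
df_sorted = df1.rdd.sortBy(lambda line: float(line.item2scrore[1]), ascending=False) \
                   .toDF(['user_id', 'click_lis', 'item2scrore'])
df_sorted.show()  # the 0.7-score rows come first, the 0.2-score rows last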