1. map() usage examples
PySpark map() Transformation - Spark By {Examples}
1.1 map vs. foreach: similarities and differences
PySpark foreach() Usage with Examples - Spark By {Examples}
1.2 map vs. apply: similarities and differences
PySpark apply Function to Column - Spark By {Examples}
1.3 map vs. transform: similarities and differences
PySpark transform() Function with Example - Spark By {Examples}
2. flatMap() usage examples
PySpark flatMap() Transformation - Spark By {Examples}
1. map() usage examples
Syntax:
map(f, preservesPartitioning=False)
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]") \
    .appName("SparkByExamples.com").getOrCreate()

# 1. On an RDD
data = ["Project","Gutenberg’s","Alice’s","Adventures",
        "in","Wonderland","Project","Gutenberg’s","Adventures",
        "in","Wonderland","Project","Gutenberg’s"]
rdd = spark.sparkContext.parallelize(data)

# map each word to a (word, 1) pair
rdd2 = rdd.map(lambda x: (x, 1))
for element in rdd2.collect():
    print(element)
# 2. On a DataFrame
data = [('James','Smith','M',30),
        ('Anna','Rose','F',41),
        ('Robert','Williams','M',62)]
columns = ["firstname","lastname","gender","salary"]
df = spark.createDataFrame(data=data, schema=columns)
df.show()
+---------+--------+------+------+
|firstname|lastname|gender|salary|
+---------+--------+------+------+
| James| Smith| M| 30|
| Anna| Rose| F| 41|
| Robert|Williams| M| 62|
+---------+--------+------+------+
# A PySpark DataFrame has no map() method; convert to an RDD first
# 2.1 Referring to columns by index
rdd2 = df.rdd.map(lambda x:
    (x[0] + "," + x[1], x[2], x[3] * 2)
)
df2 = rdd2.toDF(["name", "gender", "new_salary"])
df2.show()
+---------------+------+----------+
| name|gender|new_salary|
+---------------+------+----------+
| James,Smith| M| 60|
| Anna,Rose| F| 82|
|Robert,Williams| M| 124|
+---------------+------+----------+
# 2.2 Referring to columns by name
rdd2 = df.rdd.map(lambda x:
    (x["firstname"] + "," + x["lastname"], x["gender"], x["salary"] * 2)
)
# ...or equivalently via attribute access
rdd2 = df.rdd.map(lambda x:
    (x.firstname + "," + x.lastname, x.gender, x.salary * 2)
)
# 2.3 By calling a function
def func1(x):
    firstName = x.firstname
    lastName = x.lastname
    name = firstName + "," + lastName
    gender = x.gender.lower()
    salary = x.salary * 2
    return (name, gender, salary)

rdd2 = df.rdd.map(lambda x: func1(x))
1.1 map vs. foreach: similarities and differences
1.2 map vs. apply: similarities and differences
1.3 map vs. transform: similarities and differences
2. flatMap() usage examples
Syntax:
flatMap(f, preservesPartitioning=False)
# 1. Applying flatMap to an RDD
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = ["Project Gutenberg’s",
        "Alice’s Adventures in Wonderland",
        "Project Gutenberg’s",
        "Adventures in Wonderland",
        "Project Gutenberg’s"]
rdd = spark.sparkContext.parallelize(data)

# flatMap: split each line into words, then flatten into one RDD of words
rdd2 = rdd.flatMap(lambda x: x.split(" "))
for element in rdd2.collect():
    print(element)
# 2. Applying flatMap to a DataFrame (df_tmp is assumed to already exist,
#    with a comma-separated window_type_str column)
cols_tmp = ["user_id", "cate_cd", "shop_id", "sku_id", "window_type"]
df = df_tmp.rdd.flatMap(
    lambda x: [(x.user_id, x.cate_cd, x.shop_id, x.sku_id, window_type)
               for window_type in x.window_type_str.split(",")]
).toDF(cols_tmp)
# 3. Using explode from pyspark.sql.functions for the same effect
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('pyspark-by-examples').getOrCreate()
arrayData = [
    ('James',['Java','Scala'],{'hair':'black','eye':'brown'}),
    ('Michael',['Spark','Java',None],{'hair':'brown','eye':None}),
    ('Robert',['CSharp',''],{'hair':'red','eye':''}),
    ('Washington',None,None),
    ('Jefferson',['1','2'],{})]
df = spark.createDataFrame(data=arrayData, schema=['name','knownLanguages','properties'])
from pyspark.sql.functions import explode
df2 = df.select(df.name,explode(df.knownLanguages))
df2.printSchema()
df2.show()