Recently, while working with Spark, I found that pyspark, Spark's Python API, is quite pleasant to use.
You can almost treat it like pandas. As everyone knows, pandas is very strong at data processing; performance aside, it ships with a wealth of built-in methods that are extremely convenient and cut development time dramatically.
Below is a quick walkthrough of how to use it.
First, we initialize a SparkSession and enable pandas support (Pandas with Apache Arrow);
then we create a simple Spark DataFrame object, df.
import pandas as pd
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import *

# Hive warehouse directory; adjust to your environment
warehouse_location = "spark-warehouse"

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()

# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))
@pandas_udf("double")
def multiply_func(a: pd.Series, b: pd.Series) -> pd.Series:
    return a * b

df.select(multiply_func(col("id"), col("v"))).show()
The snippet above multiplies two columns. Inside the UDF, each column arrives as a pandas Series, and that is the key point: every pandas trick is available to you.
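For instance, any Series method can be applied directly. As a minimal sketch of my own (clip_v is a hypothetical name, not part of the original example), here is a UDF that clips 'v' into the range [2, 5]:

@pandas_udf("double")
def clip_v(v: pd.Series) -> pd.Series:
    # Series.clip caps values below 2.0 and above 5.0
    return v.clip(lower=2.0, upper=5.0)

df.select(clip_v(col("v"))).show()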
@pandas_udf("double")
def mean_udf(v: pd.Series) -> float:
    return v.mean()

# Compute the mean of column 'v'
df.select(mean_udf(df['v'])).show()
# +-----------+
# |mean_udf(v)|
# +-----------+
# | 4.2|
# +-----------+
# Group by 'id' and compute the mean of 'v'
df.groupby("id").agg(mean_udf(df['v'])).show()
# +---+-----------+
# | id|mean_udf(v)|
# +---+-----------+
# | 1| 1.5|
# | 2| 6.0|
# +---+-----------+
# Group by 'id', compute the mean of 'v', and attach it as a new column while keeping df's shape unchanged
w = Window \
    .partitionBy('id') \
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df.withColumn('mean_v', mean_udf(df['v']).over(w)).show()
# +---+----+------+
# | id| v|mean_v|
# +---+----+------+
# | 1| 1.0| 1.5|
# | 1| 2.0| 1.5|
# | 2| 3.0| 6.0|
# | 2| 5.0| 6.0|
# | 2|10.0| 6.0|
# +---+----+------+
With this kind of aggregation (a Series-to-scalar pandas UDF), the unit of processing is a column (pd.Series): you either pass a single column, as in mean_udf(df['v']), or several individual columns, as in mean_udf(df['v'], df['id']). The return value must therefore be a single scalar such as an int, float, or str, not an iterable.
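To illustrate the multi-column case, here is a minimal sketch of my own (weighted_mean_udf is a hypothetical example, not from the original) that aggregates two Series into a single scalar per group:

@pandas_udf("double")
def weighted_mean_udf(v: pd.Series, w: pd.Series) -> float:
    # Two input columns, one scalar result per group
    return float((v * w).sum() / w.sum())

df.groupby("id").agg(weighted_mean_udf(df['v'], df['id'])).show()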
def subtract_mean(pdf):
    # pdf is a pandas.DataFrame
    # Since we group by id, each call receives the rows for one id
    v = pdf.v
    pdf['v1'] = pdf['v'] - v.mean()
    pdf['v2'] = pdf['v'] + v.mean()
    return pdf

df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double, v1 double, v2 double").show()
# +---+----+----+----+
# | id|   v|  v1|  v2|
# +---+----+----+----+
# |  1| 1.0|-0.5| 2.5|
# |  1| 2.0| 0.5| 3.5|
# |  2| 3.0|-3.0| 9.0|
# |  2| 5.0|-1.0|11.0|
# |  2|10.0| 4.0|16.0|
# +---+----+----+----+
This kind of aggregation is different from the one above: the unit of processing is a whole pandas DataFrame, which gives you far more freedom.
Here you can return an arbitrary number of rows and even add new columns.
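For example, the returned row count does not have to match the input. The sketch below is my own illustration (top_row is a hypothetical helper, not from the original) that keeps only the row with the largest 'v' in each group:

def top_row(pdf):
    # Return a single-row DataFrame per id group
    return pdf.nlargest(1, "v")

df.groupby("id").applyInPandas(top_row, schema="id long, v double").show()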
df1 = spark.createDataFrame(
    [(20000101, 1, 1.0), (20000101, 2, 2.0), (20000102, 1, 3.0), (20000102, 2, 4.0)],
    ("time", "id", "v1"))

df2 = spark.createDataFrame(
    [(20000101, 1, "x"), (20000101, 2, "y")],
    ("time", "id", "v2"))

def asof_join(l, r):
    # l and r are pandas.DataFrames
    # Grouping is by id, so l and r hold the df1 and df2 rows
    # for the same id, respectively
    return pd.merge_asof(l, r, on="time", by="id")

df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas(
    asof_join, schema="time int, id int, v1 double, v2 string").show()
# +--------+---+---+---+
# | time| id| v1| v2|
# +--------+---+---+---+
# |20000101| 1|1.0| x|
# |20000102| 1|3.0| x|
# |20000101| 2|2.0| y|
# |20000102| 2|4.0| y|
# +--------+---+---+---+
df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))

def filter_func(iterator):
    # The iterator yields pandas DataFrames, one per Arrow record batch
    # (with this tiny df, each batch happens to contain a single row)
    for pdf in iterator:
        # Append a copy of the batch with id and age multiplied by 10
        # (DataFrame.append was removed in pandas 2.x, so use pd.concat)
        scaled = pdf.assign(id=pdf["id"] * 10, age=pdf["age"] * 10)
        yield pd.concat([pdf, scaled], ignore_index=True)

df.mapInPandas(filter_func, schema=df.schema).show()
# +---+---+
# | id|age|
# +---+---+
# |  1| 21|
# | 10|210|
# |  2| 30|
# | 20|300|
# +---+---+
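Since mapInPandas streams batches, it is also handy for plain row filtering. The sketch below is my own illustration (keep_adults is a hypothetical helper, not from the original) that keeps only rows with age of at least 25:

def keep_adults(iterator):
    for pdf in iterator:
        # Ordinary pandas boolean indexing on each batch
        yield pdf[pdf["age"] >= 25]

df.mapInPandas(keep_adults, schema=df.schema).show()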