The type hint can be expressed as pandas.Series, ... -> pandas.Series.
Used with pandas_udf(), a function carrying these type hints becomes a Pandas UDF that takes one or more pandas.Series and outputs one pandas.Series; the output should always be the same length as the input. Internally, PySpark executes the Pandas UDF by splitting the columns into batches, calling the function on each batch as a subset of the data, and then concatenating the results together.
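As a side note, the size of those internal batches is controlled by the standard Arrow setting spark.sql.execution.arrow.maxRecordsPerBatch. A minimal sketch, assuming an existing session (the app name and the value 5000 are purely illustrative; the default is 10000):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-batch-demo").getOrCreate()
# Rows are shipped to the Pandas UDF in Arrow record batches; smaller
# batches lower peak memory at the cost of more function invocations
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "5000")

The full example below creates a Series-to-Series Pandas UDF in two ways: by wrapping a plain function with pandas_udf(), and with the decorator syntax.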
import os
from pyspark.sql.types import LongType
from pyspark.sql import SparkSession
import pandas as pd
from pyspark.sql.functions import pandas_udf,col
os.environ['SPARK_HOME'] = '/export/server/spark'
PYSPARK_PYTHON = '/root/anaconda3/envs/pyspark_env/bin/python'
os.environ['PYSPARK_PYTHON'] = PYSPARK_PYTHON
os.environ['PYSPARK_DRIVER_PYTHON'] = PYSPARK_PYTHON
if __name__ == '__main__':
    spark = SparkSession\
        .builder\
        .appName("test")\
        .getOrCreate()
    sc = spark.sparkContext
    # Approach 1: wrap a plain Python function with pandas_udf()
    def multiply_func(a: pd.Series, b: pd.Series) -> pd.Series:
        return a * b
    multiply = pandas_udf(multiply_func, returnType=LongType())
    # The original function still runs locally on plain pandas data
    x = pd.Series([1, 2, 3])
    print(multiply_func(x, x))
    # Run the vectorized UDF on a Spark DataFrame
    df = spark.createDataFrame(pd.DataFrame(x, columns=['x']))
    df.select(multiply(col("x"), col("x"))).show()
    print("=" * 100)
    # Approach 2: declare the Pandas UDF with the decorator syntax
    @pandas_udf(LongType())
    def multiply_func1(a: pd.Series, b: pd.Series) -> pd.Series:
        return a * b
    # Note: call the decorated UDF (multiply_func1), and alias the output
    # column instead of renaming the auto-generated column name
    df.select(multiply_func1(col("x"), col("x")).alias("xxx")).show()
    spark.stop()
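The type hint can also be expressed as Iterator[pandas.Series] -> Iterator[pandas.Series]: the function takes an iterator of batches and yields an iterator of batches. This is useful when the UDF needs expensive one-time initialization that should happen once per executor rather than once per batch; the total length of the output must still match the total length of the input.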
import os
from typing import Iterator
from pyspark.sql import SparkSession
import pandas as pd
from pyspark.sql.functions import pandas_udf
os.environ['SPARK_HOME'] = '/export/server/spark'
PYSPARK_PYTHON = '/root/anaconda3/envs/pyspark_env/bin/python'
os.environ['PYSPARK_PYTHON'] = PYSPARK_PYTHON
os.environ['PYSPARK_DRIVER_PYTHON'] = PYSPARK_PYTHON
if __name__ == '__main__':
    spark = SparkSession\
        .builder\
        .appName("test")\
        .getOrCreate()
    sc = spark.sparkContext
    pdf = pd.DataFrame([1, 2, 3], columns=["x"])
    df = spark.createDataFrame(pdf)
    # The UDF consumes an iterator of pd.Series batches and yields one
    # output batch per input batch
    @pandas_udf("long")
    def plus_one(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
        for x in iterator:
            yield x + 1
    df.select(plus_one("x")).show()
    spark.stop()
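With multiple input columns, the type hint becomes Iterator[Tuple[pandas.Series, ...]] -> Iterator[pandas.Series]: each element drawn from the iterator is a tuple holding one batch of each input column.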
import os
from typing import Iterator,Tuple
from pyspark.sql import SparkSession
import pandas as pd
from pyspark.sql.functions import pandas_udf
os.environ['SPARK_HOME'] = '/export/server/spark'
PYSPARK_PYTHON = '/root/anaconda3/envs/pyspark_env/bin/python'
os.environ['PYSPARK_PYTHON'] = PYSPARK_PYTHON
os.environ['PYSPARK_DRIVER_PYTHON'] = PYSPARK_PYTHON
if __name__ == '__main__':
    spark = SparkSession.builder\
        .appName("test")\
        .getOrCreate()
    sc = spark.sparkContext
    pdf = pd.DataFrame([1, 2, 3], columns=["x"])
    df = spark.createDataFrame(pdf)
    @pandas_udf("long")
    def multiply_two_cols(iterator: Iterator[Tuple[pd.Series, pd.Series]]) -> Iterator[pd.Series]:
        for a, b in iterator:
            yield a * b
    df.select(multiply_two_cols("x", "x")).show()
    spark.stop()
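The type hint can also be expressed as pandas.Series, ... -> Any, where Any is a scalar type such as float. This creates a Pandas UDF that behaves like a PySpark aggregate function, so it can be used with select, with GroupedData.agg, and over a window, as the following example shows.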
import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql import Window
import pandas as pd
os.environ['SPARK_HOME'] = '/export/server/spark'
PYSPARK_PYTHON = '/root/anaconda3/envs/pyspark_env/bin/python'
os.environ['PYSPARK_PYTHON'] = PYSPARK_PYTHON
os.environ['PYSPARK_DRIVER_PYTHON'] = PYSPARK_PYTHON
if __name__ == '__main__':
    spark = SparkSession.builder\
        .appName('test')\
        .getOrCreate()
    sc = spark.sparkContext
    df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))
    # Series-to-scalar UDF: reduces a column of values to a single value
    @pandas_udf("double")
    def mean_udf(v: pd.Series) -> float:
        return v.mean()
    # As an aggregation over the whole column
    df.select(mean_udf(df['v'])).show()
    # As a grouped aggregation
    df.groupby("id").agg(mean_udf(df['v'])).show()
    # As a window aggregation over an unbounded window per id
    w = Window\
        .partitionBy("id")\
        .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
    df.withColumn('mean_v', mean_udf(df['v']).over(w)).show()
    spark.stop()
Grouped map operations with Pandas instances are supported by DataFrame.groupby().applyInPandas(). It requires a Python function that takes a pandas.DataFrame and returns another pandas.DataFrame; each group of the Spark DataFrame is passed to that function as a pandas.DataFrame.
This API implements the "split-apply-combine" pattern, which consists of three steps:
1. Split the data into groups with DataFrame.groupBy().
2. Apply a function to each group; the input and output of the function are both pandas.DataFrame.
3. Combine the results into a new PySpark DataFrame.
To use DataFrame.groupBy().applyInPandas(), the user needs to define the following:
1. A Python function that defines the computation for each group.
2. A StructType object or a string that defines the schema of the output DataFrame.
Note that all data for a group is loaded into memory before the function is applied. This can lead to out-of-memory exceptions, especially when group sizes are skewed. The maxRecordsPerBatch configuration does not apply to groups, so it is up to the user to ensure that the grouped data fits in the available memory.
import os
from pyspark.sql import SparkSession
os.environ['SPARK_HOME'] = '/export/server/spark'
PYSPARK_PYTHON = '/root/anaconda3/envs/pyspark_env/bin/python'
os.environ['PYSPARK_PYTHON'] = PYSPARK_PYTHON
os.environ['PYSPARK_DRIVER_PYTHON'] = PYSPARK_PYTHON
if __name__ == '__main__':
    spark = SparkSession\
        .builder\
        .appName('test')\
        .getOrCreate()
    sc = spark.sparkContext
    df = spark.createDataFrame(
        [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
        ("id", "v")
    )
    # For each group, center v by subtracting the group mean
    def subtract_mean(pdf):
        v = pdf.v
        return pdf.assign(v=v - v.mean())
    df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double").show()
    spark.stop()
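The function passed to applyInPandas may also take two arguments, in which case the first is a tuple holding the grouping key(s) and the second is the pandas.DataFrame for that group. A minimal sketch that would run inside the script above, before spark.stop() (mean_by_key and the mean_v column are illustrative names):

import pandas as pd

# Sketch: with a two-argument function, key is a tuple of the grouping
# values for the current group, e.g. (1,) or (2,) here
def mean_by_key(key, pdf):
    return pd.DataFrame([(key[0], pdf.v.mean())], columns=["id", "mean_v"])

df.groupby("id").applyInPandas(mean_by_key, schema="id long, mean_v double").show()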