Custom aggregation functions in PySpark and pandas

1. Custom aggregation function in PySpark

import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

# Sample data: several action counts per label
list_data = {
    'label_id': ['001', '001', '002', '001', '001', '002', '004', '001', '001'],
    'action_num': [3, 4, 5, 1, 2, 34, 5, 9, 2],
}
df1 = pd.DataFrame(list_data)

pyspark_name = 'test'
spark = SparkSession.builder.appName(pyspark_name).enableHiveSupport().getOrCreate()
df = spark.createDataFrame(df1)
df.show()

# Plain Python UDF: counts the elements of the collected list
@F.udf(IntegerType())
def mycount(value):
    d = 0
    for i in value:
        d = d + 1
    return d

# Collect each group's values into a list column, then apply the UDF to it
groupdf = (
    df.groupby('label_id')
      .agg(F.collect_list('action_num').alias('data'))
      .withColumn('new_data', mycount('data'))
)
groupdf.show()

Output:

[Image: the groupdf.show() result]
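On recent Spark versions, the same count can also be written as a grouped-aggregate pandas UDF, which skips the intermediate collect_list step. A minimal sketch, assuming Spark 3.x with pyarrow installed; mycount_agg is an illustrative name, not from the original code:

import pandas as pd
from pyspark.sql.functions import pandas_udf

# Grouped-aggregate pandas UDF: each group's values arrive as one pandas Series
@pandas_udf('int')
def mycount_agg(v: pd.Series) -> int:
    return len(v)

df.groupby('label_id').agg(mycount_agg('action_num').alias('new_data')).show()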

2. Custom aggregation function in pandas

import pandas as pd

# Same counting logic as a plain Python function
def mycount(value):
    d = 0
    for i in value:
        d = d + 1
    return d

# Collect each group's values into a list, then count the list's elements
df2 = df1.groupby('label_id')['action_num'].apply(lambda x: x.tolist()).reset_index()
df2['new_data'] = df2['action_num'].apply(mycount)
df2

Output:

[Image: the df2 result]
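The intermediate list column is not strictly required in pandas either: groupby().agg() accepts a custom callable and passes it each group's values as a Series. A minimal sketch of the same count, where df3 is an illustrative name:

# Pass the custom function directly to agg; it receives each group's Series
df3 = df1.groupby('label_id')['action_num'].agg(mycount).reset_index(name='new_data')
df3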
