Common PySpark Operations

Creating DataFrames

1. Converting an RDD to a DataFrame

First, create an RDD object:

from pyspark.sql import SparkSession
columns = ["language","users_count"]
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
rdd = spark.sparkContext.parallelize(data)

1.1 Converting with the toDF() function

df = rdd.toDF()
df.show()

As shown above, the DataFrame has no meaningful column names yet (for an RDD of tuples they default to _1, _2, ...); column names can be passed to toDF():

columns = ["language","users_count"]
df = rdd.toDF(columns)
df.show()

1.2 Creating a DataFrame with SparkSession

dfFromRDD2 = spark.createDataFrame(rdd).toDF(*columns)
dfFromRDD2.show()

In summary, calling toDF() on the RDD is the simplest way to obtain a DataFrame, while createDataFrame() is more flexible and accepts several input types; the cases below describe them in detail.

2. Creating a DataFrame from a list

2.1 Using the createDataFrame() method

df = spark.createDataFrame(data).toDF(*columns)
df.show()

Judging from the parameter documentation of createDataFrame(), the function accepts the following input types:
• an RDD
• a list
• a pandas.DataFrame (a sketch of this case follows the list below)
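
A minimal sketch of the pandas.DataFrame path, assuming pandas is installed; it reuses the data and columns defined above:

import pandas as pd

pandasInput = pd.DataFrame(data, columns=columns)
dfFromPandasDF = spark.createDataFrame(pandasInput)
dfFromPandasDF.show()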

2.2 Creating from Row objects

Row is a PySpark data type that records each row as key-value pairs.

from pyspark.sql import Row
rowData = map(lambda x: Row(*x), data) 
df = spark.createDataFrame(rowData,columns)
df.show()

2.3 Creating with a StructType schema

The advantage of this approach is that the data type of each column can be specified explicitly.

from pyspark.sql.types import StructType,StructField, StringType, IntegerType
data2 = [("James","","Smith","36636","M",3000),
    ("Michael","Rose","","40288","M",4000),
    ("Robert","","Williams","42114","M",4000),
    ("Maria","Anne","Jones","39192","F",4000),
    ("Jen","Mary","Brown","","F",-1)
  ]
schema = StructType([ \
    StructField("firstname",StringType(),True), \
    StructField("middlename",StringType(),True), \
    StructField("lastname",StringType(),True), \
    StructField("id", StringType(), True), \
    StructField("gender", StringType(), True), \
    StructField("salary", IntegerType(), True) \
  ])
 
df = spark.createDataFrame(data=data2,schema=schema)
df.printSchema()
df.show(truncate=False)

3. Creating a DataFrame by reading from a data source

Data source here covers two cases:
1) Data files such as JSON, CSV, XML, or plain text; the data is read in as a DataFrame by default, for example:

df2 = spark.read.csv("/src/resources/file.csv")
df2 = spark.read.text("/src/resources/file.txt")
df2 = spark.read.json("/src/resources/file.json")
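
For CSV files it is common to pass read options such as header and inferSchema; a minimal sketch, assuming a hypothetical file path:

df2 = spark.read \
    .option("header", True) \
    .option("inferSchema", True) \
    .csv("/src/resources/file.csv")
df2.printSchema()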

2) Databases: read.jdbc() reads a table from a MySQL (or any other JDBC) database, again returning a DataFrame:

df_data = spark.read.jdbc(url=url,table=table_name,properties=prop)
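
The url, table_name, and prop variables above are placeholders; a sketch of what they might look like for a local MySQL instance (hypothetical connection values, and the MySQL JDBC driver must be on the Spark classpath):

url = "jdbc:mysql://localhost:3306/testdb"        # hypothetical database URL
table_name = "employees"                          # hypothetical table name
prop = {"user": "root", "password": "secret", "driver": "com.mysql.cj.jdbc.Driver"}

df_data = spark.read.jdbc(url=url, table=table_name, properties=prop)
df_data.show()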

4. Converting a DataFrame to pandas

1. Calling the toPandas() method
This converts a PySpark DataFrame into a pandas DataFrame. Note that toPandas() collects the whole dataset to the driver, so it should only be used on data that fits in driver memory.

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [("James","","Smith","36636","M",60000),
        ("Michael","Rose","","40288","M",70000),
        ("Robert","","Williams","42114","",400000),
        ("Maria","Anne","Jones","39192","F",500000),
        ("Jen","Mary","Brown","","F",0)]
columns = ["first_name","middle_name","last_name","dob","gender","salary"]
pysparkDF = spark.createDataFrame(data = data, schema = columns)
pysparkDF.printSchema()
pysparkDF.show(truncate=False)
pandasDF = pysparkDF.toPandas()
print(pandasDF)
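
As a side note, toPandas() can be accelerated with Apache Arrow; a sketch, assuming pyarrow is installed and a Spark 3.x version where this configuration key applies:

# Enable Arrow-based columnar transfer before converting
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
pandasDF = pysparkDF.toPandas()
print(pandasDF)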

2. Spark Nested Struct DataFrame to Pandas

from pyspark.sql.types import StructType, StructField, StringType,IntegerType
dataStruct = [(("James","","Smith"),"36636","M","3000"), \
      (("Michael","Rose",""),"40288","M","4000"), \
      (("Robert","","Williams"),"42114","M","4000"), \
      (("Maria","Anne","Jones"),"39192","F","4000"), \
      (("Jen","Mary","Brown"),"","F","-1") \
]

Define the nested schema, build the DataFrame, and convert it to pandas:

schemaStruct = StructType([
        StructField('name', StructType([
             StructField('firstname', StringType(), True),
             StructField('middlename', StringType(), True),
             StructField('lastname', StringType(), True)
             ])),
          StructField('dob', StringType(), True),
         StructField('gender', StringType(), True),
         StructField('salary', StringType(), True)
         ])
df = spark.createDataFrame(data=dataStruct, schema = schemaStruct)
df.printSchema()
pandasDF2 = df.toPandas()
print(pandasDF2)

StructType and StructField
As in the DataFrame-creation examples above, StructType and StructField are the PySpark objects to use when we need to define column names, column data types, and whether a column may be null.
• StructField defines a column's name, data type, and nullability
• StructType is a collection of StructFields
1. Creating a DataFrame

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
spark = SparkSession.builder.master("local[1]") \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()
data = [("James","","Smith","36636","M",3000),
    ("Michael","Rose","","40288","M",4000),
    ("Robert","","Williams","42114","M",4000),
    ("Maria","Anne","Jones","39192","F",4000),
    ("Jen","Mary","Brown","","F",-1)
  ]
schema = StructType([ \
    StructField("firstname",StringType(),True), \
    StructField("middlename",StringType(),True), \
    StructField("lastname",StringType(),True), \
    StructField("id", StringType(), True), \
    StructField("gender", StringType(), True), \
    StructField("salary", IntegerType(), True) \
  ])
 
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()
df.show(truncate=False)

2. Defining a nested StructType

structureData = [
    (("James","","Smith"),"36636","M",3100),
    (("Michael","Rose",""),"40288","M",4300),
    (("Robert","","Williams"),"42114","M",1400),
    (("Maria","Anne","Jones"),"39192","F",5500),
    (("Jen","Mary","Brown"),"","F",-1)
  ]
structureSchema = StructType([
        StructField('name', StructType([
             StructField('firstname', StringType(), True),
             StructField('middlename', StringType(), True),
             StructField('lastname', StringType(), True)
             ])),
         StructField('id', StringType(), True),
         StructField('gender', StringType(), True),
         StructField('salary', IntegerType(), True)
         ])
df2 = spark.createDataFrame(data=structureData,schema=structureSchema)
df2.printSchema()
df2.show(truncate=False)

Complete example

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType, IntegerType,ArrayType,MapType
from pyspark.sql.functions import col,struct,when
spark = SparkSession.builder.master("local[1]") \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()
data = [("James","","Smith","36636","M",3000),
    ("Michael","Rose","","40288","M",4000),
    ("Robert","","Williams","42114","M",4000),
    ("Maria","Anne","Jones","39192","F",4000),
    ("Jen","Mary","Brown","","F",-1)
  ]
schema = StructType([ 
    StructField("firstname",StringType(),True), 
    StructField("middlename",StringType(),True), 
    StructField("lastname",StringType(),True), 
    StructField("id", StringType(), True), 
    StructField("gender", StringType(), True), 
    StructField("salary", IntegerType(), True) 
  ])
 
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()
df.show(truncate=False)

Nested schema:

structureData = [
    (("James","","Smith"),"36636","M",3100),
    (("Michael","Rose",""),"40288","M",4300),
    (("Robert","","Williams"),"42114","M",1400),
    (("Maria","Anne","Jones"),"39192","F",5500),
    (("Jen","Mary","Brown"),"","F",-1)
  ]
structureSchema = StructType([
        StructField('name', StructType([
             StructField('firstname', StringType(), True),
             StructField('middlename', StringType(), True),
             StructField('lastname', StringType(), True)
             ])),
         StructField('id', StringType(), True),
         StructField('gender', StringType(), True),
         StructField('salary', IntegerType(), True)
         ])
df2 = spark.createDataFrame(data=structureData,schema=structureSchema)
df2.printSchema()
df2.show(truncate=False)

Create a new DataFrame from df2 that groups id, gender, and salary into an OtherInfo struct and adds a derived Salary_Grade field:

updatedDF = df2.withColumn("OtherInfo", 
    struct(col("id").alias("identifier"),
    col("gender").alias("gender"),
    col("salary").alias("salary"),
    when(col("salary").cast(IntegerType()) < 2000,"Low")
      .when(col("salary").cast(IntegerType()) < 4000,"Medium")
      .otherwise("High").alias("Salary_Grade")
  )).drop("id","gender","salary")
updatedDF.printSchema()
updatedDF.show(truncate=False)

ArrayType and MapType

""" Array & Map"""
arrayStructureSchema = StructType([
    StructField('name', StructType([
       StructField('firstname', StringType(), True),
       StructField('middlename', StringType(), True),
       StructField('lastname', StringType(), True)
       ])),
       StructField('hobbies', ArrayType(StringType()), True),
       StructField('properties', MapType(StringType(),StringType()), True)
    ])
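
The original snippet defines this schema without exercising it; a minimal sketch of matching data (hypothetical values) that could be used with it:

arrayMapData = [
    (("James", "", "Smith"), ["Cricket", "Movies"], {"hair": "black", "eye": "brown"}),
    (("Anna", "Rose", ""), ["Tennis", "Reading"], {"hair": "brown", "eye": "black"})
]
dfArrayMap = spark.createDataFrame(data=arrayMapData, schema=arrayStructureSchema)
dfArrayMap.printSchema()
dfArrayMap.show(truncate=False)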

Row
The Row class represents a single record of a DataFrame.
1. Creating Row objects
A Row can be accessed by index, like a Python tuple, or by attribute name.

from pyspark.sql import Row
row=Row("James",40)
print(row[0] +","+str(row[1]))
row=Row(name="Alice", age=11)
print(row.name) 

2. Building an RDD from Row objects

from pyspark.sql import SparkSession, Row
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [Row(name="James,,Smith",lang=["Java","Scala","C++"],state="CA"), 
    Row(name="Michael,Rose,",lang=["Spark","Java","C++"],state="NJ"),
    Row(name="Robert,,Williams",lang=["CSharp","VB"],state="NV")]
rdd=spark.sparkContext.parallelize(data)
print(rdd.collect())

Iterate over the RDD elements:

dataSet = rdd.collect()
for r in dataSet:
    print(r.name,r.lang)

Create a DataFrame:

df = spark.createDataFrame(rdd)
df.show()

3. Complete code

from pyspark.sql import SparkSession, Row
row=Row("James",40)
print(row[0] +","+str(row[1]))
row2=Row(name="Alice", age=11)
print(row2.name)
Person = Row("name", "age")
p1=Person("James", 40)
p2=Person("Alice", 35)
print(p1.name +","+p2.name)
#PySpark Example
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [Row(name="James,,Smith",lang=["Java","Scala","C++"],state="CA"), 
    Row(name="Michael,Rose,",lang=["Spark","Java","C++"],state="NJ"),
    Row(name="Robert,,Williams",lang=["CSharp","VB"],state="NV")]
#RDD Example 1
rdd=spark.sparkContext.parallelize(data)
collData=rdd.collect()
print(collData)
for row in collData:
    print(row.name + "," +str(row.lang))
# RDD Example 2
Person=Row("name","lang","state")
data = [Person("James,,Smith",["Java","Scala","C++"],"CA"), 
    Person("Michael,Rose,",["Spark","Java","C++"],"NJ"),
    Person("Robert,,Williams",["CSharp","VB"],"NV")]
rdd=spark.sparkContext.parallelize(data)
collData=rdd.collect()
print(collData)
for person in collData:
    print(person.name + "," +str(person.lang))
# DataFrame Example 1
columns = ["name","languagesAtSchool","currentState"]
df=spark.createDataFrame(data)
df.printSchema()
df.show()
collData=df.collect()
print(collData)
for row in collData:
    print(row.name + "," +str(row.lang))
# DataFrame Example 2
columns = ["name","languagesAtSchool","currentState"]
df=spark.createDataFrame(data).toDF(*columns)
df.printSchema()

Basic DataFrame Operations

1. select()

The select() function selects one or more columns of a DataFrame and returns a new DataFrame.

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [("James","Smith","USA","CA"),
    ("Michael","Rose","USA","NY"),
    ("Robert","Williams","USA","CA"),
    ("Maria","Jones","USA","FL")
  ]
columns = ["firstname","lastname","country","state"]
df = spark.createDataFrame(data = data, schema = columns)
df.show(truncate=False)

• Select a single column

df.select("firstname").show()

• Select multiple columns

df.select("firstname","lastname").show()

• Selecting nested columns

data = [
        (("James",None,"Smith"),"OH","M"),
        (("Anna","Rose",""),"NY","F"),
        (("Julia","","Williams"),"OH","F"),
        (("Maria","Anne","Jones"),"NY","M"),
        (("Jen","Mary","Brown"),"NY","M"),
        (("Mike","Mary","Williams"),"OH","M")
        ]
from pyspark.sql.types import StructType,StructField, StringType        
schema = StructType([
    StructField('name', StructType([
         StructField('firstname', StringType(), True),
         StructField('middlename', StringType(), True),
         StructField('lastname', StringType(), True)
         ])),
     StructField('state', StringType(), True),
     StructField('gender', StringType(), True)
     ])
df2 = spark.createDataFrame(data = data, schema = schema)
df2.printSchema()
df2.show(truncate=False) # shows all columns

Select specific fields of the nested column:

df2.select("name.firstname","name.lastname").show(truncate=False)

Select all fields of the nested column:

df2.select("name.*").show(truncate=False)

2. collect()

collect() gathers all rows of the DataFrame to the driver, so it should only be used on small datasets; calling it on a large DataFrame can cause an out-of-memory error.

df2.collect()
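
When only a sample of rows is needed, take() or limit() avoid pulling the entire dataset to the driver; a small sketch:

firstTwo = df2.take(2)           # list of the first 2 Row objects
sample = df2.limit(3).collect()  # collect only 3 rows
print(firstTwo)
print(sample)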

3. withColumn()

withColumn() updates an existing column or adds a new one, returning a new DataFrame.

from pyspark.sql.functions import col, lit
data = [('James','','Smith','1991-04-01','M',3000),
  ('Michael','Rose','','2000-05-19','M',4000),
  ('Robert','','Williams','1978-09-05','M',4000),
  ('Maria','Anne','Jones','1967-12-01','F',4000),
  ('Jen','Mary','Brown','1980-02-17','F',-1)
]
columns = ["firstname","middlename","lastname","dob","gender","salary"]
df = spark.createDataFrame(data=data, schema = columns)

• Change a column's data type

df2 = df.withColumn("salary",col("salary").cast("Integer"))
df2.printSchema()

• Update column values

df3 = df.withColumn("salary",col("salary")*100)
df3.printSchema()

• Add a new column

df4 = df.withColumn("CopiedColumn",col("salary")* -1)
df4.printSchema()

• Add a constant value with lit()

df5 = df.withColumn("Country", lit("USA"))
df5.printSchema()

• Rename a column

df.withColumnRenamed("gender","sex").show(truncate=False)

• Drop a column

df4.drop("CopiedColumn") \
.show(truncate=False) 

4. where() & filter()

where() and filter() are equivalent; both filter the rows of a DataFrame by column values.

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType, IntegerType, ArrayType
from pyspark.sql.functions import col,array_contains
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
arrayStructureData = [
        (("James","","Smith"),["Java","Scala","C++"],"OH","M"),
        (("Anna","Rose",""),["Spark","Java","C++"],"NY","F"),
        (("Julia","","Williams"),["CSharp","VB"],"OH","F"),
        (("Maria","Anne","Jones"),["CSharp","VB"],"NY","M"),
        (("Jen","Mary","Brown"),["CSharp","VB"],"NY","M"),
        (("Mike","Mary","Williams"),["Python","VB"],"OH","M")
        ]
        
arrayStructureSchema = StructType([
        StructField('name', StructType([
             StructField('firstname', StringType(), True),
             StructField('middlename', StringType(), True),
             StructField('lastname', StringType(), True)
             ])),
         StructField('languages', ArrayType(StringType()), True),
         StructField('state', StringType(), True),
         StructField('gender', StringType(), True)
         ])
df = spark.createDataFrame(data = arrayStructureData, schema = arrayStructureSchema)
df.printSchema()
df.show(truncate=False)
df.filter(df.state == "OH") \
    .show(truncate=False)
df.filter(col("state") == "OH") \
    .show(truncate=False)    
    
df.filter("gender  == 'M'") \
    .show(truncate=False)    
df.filter( (df.state  == "OH") & (df.gender  == "M") ) \
    .show(truncate=False)        
df.filter(array_contains(df.languages,"Java")) \
    .show(truncate=False)        
df.filter(df.name.lastname == "Williams") \
    .show(truncate=False) 

5. dropDuplicates() & distinct()

Both are deduplication functions: distinct() removes rows that are duplicates across all columns, while dropDuplicates() can also deduplicate on a specified subset of columns.

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [("James", "Sales", 3000), \
    ("Michael", "Sales", 4600), \
    ("Robert", "Sales", 4100), \
    ("Maria", "Finance", 3000), \
    ("James", "Sales", 3000), \
    ("Scott", "Finance", 3300), \
    ("Jen", "Finance", 3900), \
    ("Jeff", "Marketing", 3000), \
    ("Kumar", "Marketing", 2000), \
    ("Saif", "Sales", 4100) \
  ]
columns= ["employee_name", "department", "salary"]
df = spark.createDataFrame(data = data, schema = columns)
df.printSchema()
df.show(truncate=False)

• Deduplicate on all columns

# Deduplicate across all columns; returns a new DataFrame
distinctDF = df.distinct()
print("Distinct count: "+str(distinctDF.count()))
distinctDF.show(truncate=False)
df2 = df.dropDuplicates()
print("Distinct count: "+str(df2.count()))
df2.show(truncate=False)

• Deduplicate on specified columns

# Deduplicate on a subset of columns; returns a new DataFrame
dropDisDF = df.dropDuplicates(["department","salary"])
print("Distinct count of department salary : "+str(dropDisDF.count()))
dropDisDF.show(truncate=False)

6. orderBy() & sort()

Both orderBy() and sort() sort by the specified columns, ascending by default. Note that the examples below assume a DataFrame that also has state, age, and bonus columns (like the one in the groupBy section); the df built in section 5 only has employee_name, department, and salary.

df.sort("department","state").show(truncate=False)
df.sort(col("department"),col("state")).show(truncate=False)
df.orderBy("department","state").show(truncate=False)
df.orderBy(col("department"),col("state")).show(truncate=False)

• Specify the sort direction

df.sort(df.department.asc(),df.state.asc()).show(truncate=False)
df.sort(col("department").asc(),col("state").asc()).show(truncate=False)
df.orderBy(col("department").asc(),col("state").asc()).show(truncate=False)
df.sort(df.department.desc(),df.state.desc()).show(truncate=False)
df.sort(col("department").desc(),col("state").desc()).show(truncate=False)
df.orderBy(col("department").desc(),col("state").desc()).show(truncate=False)

• Sort via SQL: first create a temporary view, then write a SQL statement with ORDER BY

df.createOrReplaceTempView("EMP")
spark.sql("select employee_name,department,state,salary,age,bonus from EMP ORDER BY department asc").show(truncate=False)

7. groupBy()

groupBy() is usually used together with aggregation functions.

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col,sum,avg,max
from pyspark.sql.functions import sum,avg,max,min,mean,count
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
simpleData = [("James","Sales","NY",90000,34,10000),
    ("Michael","Sales","NY",86000,56,20000),
    ("Robert","Sales","CA",81000,30,23000),
    ("Maria","Finance","CA",90000,24,23000),
    ("Raman","Finance","CA",99000,40,24000),
    ("Scott","Finance","NY",83000,36,19000),
    ("Jen","Finance","NY",79000,53,15000),
    ("Jeff","Marketing","CA",80000,25,18000),
    ("Kumar","Marketing","NY",91000,50,21000)
  ]
schema = ["employee_name","department","state","salary","age","bonus"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.printSchema()
df.show(truncate=False)
## Basic aggregate functions: sum, count, max, min, etc.
df.groupBy("department").sum("salary").show(truncate=False)
df.groupBy("department").count().show(truncate=False)
df.groupBy("department","state") \
    .sum("salary","bonus") \
   .show(truncate=False)
# agg() can compute several aggregations at once
df.groupBy("department") \
    .agg(sum("salary").alias("sum_salary"), \
         avg("salary").alias("avg_salary"), \
         sum("bonus").alias("sum_bonus"), \
         max("bonus").alias("max_bonus") \
     ) \
    .show(truncate=False)
# Filter on the aggregated values
df.groupBy("department") \
    .agg(sum("salary").alias("sum_salary"), \
      avg("salary").alias("avg_salary"), \
      sum("bonus").alias("sum_bonus"), \
      max("bonus").alias("max_bonus")) \
    .where(col("sum_bonus") >= 50000) \
    .show(truncate=False)

8. join()

The third argument of join() selects the join type; the main types are inner, left (left_outer), right (right_outer), full (full_outer), left_semi, and left_anti. For example:

df = df1.join(df2,df1.key_id == df2.key_id,'inner')
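
df1 and df2 above are placeholders; a self-contained sketch using two small hypothetical DataFrames:

empData = [(1, "James", 10), (2, "Maria", 20), (3, "Robert", 30)]
deptData = [(10, "Sales"), (20, "Finance")]
empDF = spark.createDataFrame(empData, ["emp_id", "name", "dept_id"])
deptDF = spark.createDataFrame(deptData, ["dept_id", "dept_name"])

# Inner join keeps only rows with matching dept_id values
empDF.join(deptDF, empDF.dept_id == deptDF.dept_id, "inner").show()
# Left join keeps all employees, with nulls where no department matches
empDF.join(deptDF, empDF.dept_id == deptDF.dept_id, "left").show()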

Advanced Operations

1.1 Custom UDFs

1) First create a DataFrame

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
columns = ["Seqno","Name"]
data = [("1", "john jones"),
    ("2", "tracey smith"),
    ("3", "amy sanders")]
df = spark.createDataFrame(data=data,schema=columns)
df.show(truncate=False)

2) Define a Python function that capitalizes the first letter of each word in the Name column

def convertCase(str):
    resStr=""
    arr = str.split(" ")
    for x in arr:
        # Capitalize the first letter of each word (note: a trailing space is appended)
        resStr= resStr + x[0:1].upper() + x[1:len(x)] + " "
    return resStr 

3) Wrap the custom convertCase function as a UDF

from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType
udf1 = udf(convertCase, StringType())

4) Apply the UDF to the DataFrame

df.select(col("Seqno"), \
    udf1(col("Name")).alias("Name") ) \
   .show(truncate=False)

1.2 Registering the UDF for use in SQL

Sometimes a custom UDF needs to be callable from SQL; it can be registered as follows:

""" Using UDF on SQL """
spark.udf.register("udf1", convertCase,StringType())
df.createOrReplaceTempView("NAME_TABLE")
spark.sql("select Seqno, udf1(Name) as Name from NAME_TABLE") \
     .show(truncate=False)

1.3 The decorator form is more convenient

@udf(returnType=StringType()) 
def convertCase(str):
    resStr=""
    arr = str.split(" ")
    for x in arr:
        resStr= resStr + x[0:1].upper() + x[1:len(x)] + " "
    return resStr 
df.withColumn("Cureated Name", convertCase(col("Name"))).show(truncate=False)

SQL Functions

1.1 Aggregate functions

DataFrames ship with many standard aggregate functions, which cover most data-analysis needs. A few examples follow, using the Seqno/Name DataFrame from the UDF section above.

approx_count_distinct: returns the approximate number of distinct values in the aggregated column

from pyspark.sql.functions import approx_count_distinct, collect_list, collect_set
df.groupBy('Seqno').agg(approx_count_distinct('Name')).show()

collect_list: returns all values of the aggregated column, including duplicates

df.groupBy('Seqno').agg(collect_list('Name')).show()

collect_set: returns all values of the aggregated column with duplicates removed
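
A sketch mirroring the collect_list example above (collect_set was imported together with the other functions):

df.groupBy('Seqno').agg(collect_set('Name')).show()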

1.2 Window functions

PySpark provides window functions similar to those in SQL. The main ones are:
• Ranking functions: row_number(), rank(), dense_rank(), percent_rank(), ntile()
• Analytic functions: cume_dist(), lag(), lead()
• Aggregate functions: sum(), first(), last(), max(), min(), mean()

Window function usage notes:

  • row_number: returns a sequential number within each sorted partition, starting at 1
  • rank: returns the rank within each sorted partition, starting at 1; equal values share the same rank and gaps follow, e.g. 1, 1, 3, 3, 5
  • percent_rank: the relative rank of each row within its partition, computed as (rank - 1) / (rows in partition - 1), ranging from 0 to 1
  • dense_rank: like rank, but without gaps, so equal values share the same rank and the sequence stays consecutive, e.g. 1, 1, 2, 2, 3
  • ntile: distributes the ordered rows of each partition as evenly as possible into n buckets and returns the bucket number of each row; if the rows do not divide evenly, the lower-numbered buckets receive one extra row, so bucket sizes differ by at most 1
  • cume_dist: the cumulative distribution of values within the partition, e.g. 0.4, 0.4, 0.8, 1.0
  • lag: returns the value from a preceding row in the partition (the column shifted down by the given offset)
  • lead: returns the value from a following row in the partition (the column shifted up by the given offset)

A window function operates on partitioned (grouped) data and returns a value for every row.

Partitions are usually defined with Window.partitionBy(); for ranking functions such as row_number and rank, each partition additionally needs an orderBy() sort. The complete code below walks through the original DataFrame and then row_number, rank, dense_rank, percent_rank, ntile, cume_dist, lag, and lead in turn.

Complete code

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
simpleData = (("James", "Sales", 3000), \
    ("Michael", "Sales", 4600),  \
    ("Robert", "Sales", 4100),   \
    ("Maria", "Finance", 3000),  \
    ("James", "Sales", 3000),    \
    ("Scott", "Finance", 3300),  \
    ("Jen", "Finance", 3900),    \
    ("Jeff", "Marketing", 3000), \
    ("Kumar", "Marketing", 2000),\
    ("Saif", "Sales", 4100) \
  )
 
columns= ["employee_name", "department", "salary"]
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
windowSpec  = Window.partitionBy("department").orderBy("salary")
df.withColumn("row_number",row_number().over(windowSpec)) \
    .show(truncate=False)
from pyspark.sql.functions import rank
df.withColumn("rank",rank().over(windowSpec)) \
    .show()
from pyspark.sql.functions import dense_rank
df.withColumn("dense_rank",dense_rank().over(windowSpec)) \
    .show()
from pyspark.sql.functions import percent_rank
df.withColumn("percent_rank",percent_rank().over(windowSpec)) \
    .show()
    
from pyspark.sql.functions import ntile
df.withColumn("ntile",ntile(2).over(windowSpec)) \
    .show()
from pyspark.sql.functions import cume_dist    
df.withColumn("cume_dist",cume_dist().over(windowSpec)) \
   .show()
from pyspark.sql.functions import lag    
df.withColumn("lag",lag("salary",2).over(windowSpec)) \
      .show()
from pyspark.sql.functions import lead    
df.withColumn("lead",lead("salary",2).over(windowSpec)) \
    .show()
    
windowSpecAgg  = Window.partitionBy("department")
from pyspark.sql.functions import col,avg,sum,min,max,row_number 
df.withColumn("row",row_number().over(windowSpec)) \
  .withColumn("avg", avg(col("salary")).over(windowSpecAgg)) \
  .withColumn("sum", sum(col("salary")).over(windowSpecAgg)) \
  .withColumn("min", min(col("salary")).over(windowSpecAgg)) \
  .withColumn("max", max(col("salary")).over(windowSpecAgg)) \
  .where(col("row")==1).select("department","avg","sum","min","max") \
  .show()

