pySpark DataFrames常用操作指南
前1, 2步是环境数据集操作,如果只想看常用操作请跳到3
1. 运行环境配置
欲善其功,必先利其器,通常来说光一个spark安装就可以出一个教程,在你看完安装教程填完坑后估计就不想继续看下去了..., 这里使用了google colab作为运行环境简化了安装的麻烦,前提是你需要一个梯子。具体操作你需要注册一个google账号让后登录google colab新建一个jupyter notebook,如下图:
具体代码附上,将下面的代码直接粘贴在colab notebook上一路运行下去即可
# 下载必要的数据,安装必要的包
# 这里使用了spark3.0.0版本(目前来说是最新的)
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz
!tar xf spark-3.0.0-bin-hadoop2.7.tgz
!pip install -q findspark
!pip install pyspark
# 2.配置环境变量
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop2.7"
# 启动spark
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext
这时我们在新窗口中输入sc就可以显示spark的版本
SparkContext
Spark UI
Versionv3.0.0
Masterlocal[*]
AppNamepyspark-shell
2. 数据集
这里使用kaggle新冠病毒数据集,我将数据集下载下来后传到了google drive,colab可以直接读取google drive数据(点击装载Google云端硬盘),不过你也可以手动上传(点击上传按钮)。
我已经把数据上传到了百度云, 下载后解压之后就可以了
链接:https://pan.baidu.com/s/1b8nqJ2fyoF6BjjkZqTuccA 密码:7yd2
3.基本功能
读取数据
我们可以先使用spark.read.load命令将文件加载到数据集中。
该命令默认读取parquet格式文件,这是spark的默认文件格式,但是您可以添加参数format以使用它读取.csv文件
注意: 这里spark可以读取gzip压缩的csv文件,与读取csv文件没有区别
# 这里读取COVID数据集中Case.csv这个数据
path = '/content/drive/My Drive/spark/COVID/Case.csv.gz'
# 读法一
cases = spark.read.load(path, format="csv", sep=",", inferSchema="true", header="true")
#读法二
cases = (spark.read
.option('header', 'true')
.option('inferschema', 'true')
.csv(path))
cases.show(2) # 显示前两列
+-------+--------+----------+-----+-------------------+---------+---------+----------+
|case_id|province| city|group| infection_case|confirmed| latitude| longitude|
+-------+--------+----------+-----+-------------------+---------+---------+----------+
|1000001| Seoul|Yongsan-gu| true| Itaewon Clubs| 133|37.538621|126.992652|
|1000002| Seoul| Guro-gu| true|Guro-gu Call Center| 99|37.508163|126.884387|
+-------+--------+----------+-----+-------------------+---------+---------+----------+
转化为pandas
# 取前10列转化为pandas
cases.limit(10).toPandas()
修改列名
有时我们想更改spark dataframe中列的名称。我们只需使用以下命令更改一列即可完成此操作:
# withColumnRenamed('旧列名‘,'新列名’)
cases = cases.withColumnRenamed("infection_case", "infection_source")
或者你想修改所有列名
cases2 = cases.toDF(*['case_id2', 'province2', 'city2', 'group2', 'infection_case2', 'confirmed2',
'latitude2', 'longitude2'])
选择列
我们可以使用select关键字选择列的子集
cases = cases.select('province','city','infection_case','confirmed')
cases.show()
+--------+----------+-------------------+---------+
|province| city| infection_case|confirmed|
+--------+----------+-------------------+---------+
| Seoul|Yongsan-gu| Itaewon Clubs| 133|
| Seoul| Guro-gu|Guro-gu Call Center| 99|
+--------+----------+-------------------+---------+
排序
这里排序需要显示赋值,cases.sort("confirmed")不会改变cases,必须
cases = cases.sort("confirmed")
# 升序排列
cases = cases.sort("confirmed")
cases.show(10)
+-----------------+---------------+--------------------+---------+
| province| city| infection_case|confirmed|
+-----------------+---------------+--------------------+---------+
| Seoul|Yeongdeungpo-gu|Yeongdeungpo Lear...| null|
| Seoul| Jung-gu| KB Life Insurance| null|
| Seoul|form other city|Guri Collective I...| null|
| Gwangju| -| etc| 0|
| Jeju-do| -|contact with patient| 0|
| Gangwon-do| -|contact with patient| 0|
| Busan|from other city|Cheongdo Daenam H...| 1|
| Gwangju| -|contact with patient| 1|
|Chungcheongnam-do| -|contact with patient| 1|
| Jeollabuk-do|from other city| Shincheonji Church| 1|
+-----------------+---------------+--------------------+---------+
相反, 降序排列需要用到import pyspark.sql.functions
# 降序
import pyspark.sql.functions as F
cases = cases.sort(F.desc("confirmed"))
cases.show(10)
+----------------+---------------+--------------------+---------+
| province| city| infection_case|confirmed|
+----------------+---------------+--------------------+---------+
| Daegu| Nam-gu| Shincheonji Church| 4510|
| Daegu| -|contact with patient| 920|
| Daegu| -| etc| 737|
|Gyeongsangbuk-do|from other city| Shincheonji Church| 566|
| Seoul| -| overseas inflow| 346|
| Gyeonggi-do| -| overseas inflow| 252|
| Daegu| Dalseong-gun|Second Mi-Ju Hosp...| 196|
|Gyeongsangbuk-do| -|contact with patient| 190|
| Seoul| Yongsan-gu| Itaewon Clubs| 133|
|Gyeongsangbuk-do| -| etc| 132|
+----------------+---------------+--------------------+---------+
类型转换
尽管我们在这个数据集中没有遇到这种情况,但可能会出现Pyspark将double读取为integer或string的情况,在这种情况下,您可以使用cast函数来转换类型。
下面将confirmed列和city列重新定义数据类型
from pyspark.sql.types import DoubleType, IntegerType, StringType
cases = cases.withColumn('confirmed', F.col('confirmed').cast(IntegerType()))
cases = cases.withColumn('city', F.col('city').cast(StringType()))
过滤
我们可以使用AND(&), OR(|) and NOT(~) 条件使用多个条件过滤数据
这里做了三个操作:
confirmed > 10
province=='Daegu'
city不包括['Nam-gu', 'Dalseong-gun']这两个城市
cases.filter((cases.confirmed>10) &
(cases.province=='Daegu') &
(~cases.city.isin(['Nam-gu', 'Dalseong-gun']))).show()
+--------+-------+--------------------+---------+
|province| city| infection_case|confirmed|
+--------+-------+--------------------+---------+
| Daegu| -|contact with patient| 920|
| Daegu| -| etc| 737|
| Daegu| Seo-gu|Hansarang Convale...| 128|
| Daegu|Dong-gu| Fatima Hospital| 37|
| Daegu| -| overseas inflow| 32|
+--------+-------+--------------------+---------+
分组
我们也可以将groupBy函数与spark dataframe一起使用。与pandas groupBy基本相同,只是需要导入pyspark.sql.functions函数. 以下是可用于此功能模块的功能列表。
from pyspark.sql import functions as F
cases.groupBy(["province","city"]).agg(F.sum("confirmed") ,F.max("confirmed")).show(2)
+----------------+--------+--------------+--------------+
| province| city|sum(confirmed)|max(confirmed)|
+----------------+--------+--------------+--------------+
|Gyeongsangnam-do|Jinju-si| 10| 10|
| Seoul| Guro-gu| 140| 99|
+----------------+--------+--------------+--------------+
如果不喜欢新列名,可以在agg命令本身中使用alias关键字重命名列。
cases.groupBy(["province","city"]).agg(
F.sum("confirmed").alias("TotalConfirmed"),\
F.max("confirmed").alias("MaxFromOneConfirmedCase")\
).show(2)
+----------------+--------+--------------+-----------------------+
| province| city|TotalConfirmed|MaxFromOneConfirmedCase|
+----------------+--------+--------------+-----------------------+
|Gyeongsangnam-do|Jinju-si| 10| 10|
| Seoul| Guro-gu| 140| 99|
+----------------+--------+--------------+-----------------------+
拼接
首先,我们需要引入COVID数据集中另一个文件Region.csv。我们将使用包含地区信息的地区文件,如小学人数、老年人口比率等。
regions = spark.read.load("/content/drive/My Drive/spark/COVID/Region.csv.gz",
format="csv",
sep=",",
inferSchema="true",
header="true")
regions.limit(10).show()
+-----+--------+------------+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+
| code|province| city| latitude| longitude|elementary_school_count|kindergarten_count|university_count|academy_ratio|elderly_population_ratio|elderly_alone_ratio|nursing_home_count|
+-----+--------+------------+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+
|10000| Seoul| Seoul|37.566953|126.977977| 607| 830| 48| 1.44| 15.38| 5.8| 22739|
|10010| Seoul| Gangnam-gu|37.518421|127.047222| 33| 38| 0| 4.18| 13.17| 4.3| 3088|
|10020| Seoul| Gangdong-gu|37.530492|127.123837| 27| 32| 0| 1.54| 14.55| 5.4| 1023|
|10030| Seoul| Gangbuk-gu|37.639938|127.025508| 14| 21| 0| 0.67| 19.49| 8.5| 628|
|10040| Seoul| Gangseo-gu|37.551166|126.849506| 36| 56| 1| 1.17| 14.39| 5.7| 1080|
|10050| Seoul| Gwanak-gu| 37.47829|126.951502| 22| 33| 1| 0.89| 15.12| 4.9| 909|
|10060| Seoul| Gwangjin-gu|37.538712|127.082366| 22| 33| 3| 1.16| 13.75| 4.8| 723|
|10070| Seoul| Guro-gu|37.495632| 126.88765| 26| 34| 3| 1.0| 16.21| 5.7| 741|
|10080| Seoul|Geumcheon-gu|37.456852|126.895229| 18| 19| 0| 0.96| 16.15| 6.7| 475|
|10090| Seoul| Nowon-gu|37.654259|127.056294| 42| 66| 6| 1.39| 15.4| 7.4| 952|
+-----+--------+------------+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+
接下来使用join方法合并拼接两个数据集
# 以cases的 'province' 和 'city' 拼接
cases = cases.join(regions, ['province','city'],how='left')
cases.show()
+--------+----------+-------------------+---------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+
|province| city| infection_case|confirmed| code| latitude| longitude|elementary_school_count|kindergarten_count|university_count|academy_ratio|elderly_population_ratio|elderly_alone_ratio|nursing_home_count| code| latitude| longitude|elementary_school_count|kindergarten_count|university_count|academy_ratio|elderly_population_ratio|elderly_alone_ratio|nursing_home_count|
+--------+----------+-------------------+---------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+
| Seoul|Yongsan-gu| Itaewon Clubs| 133|10210|37.532768|126.990021| 15| 13| 1| 0.68| 16.87| 6.5| 435|10210|37.532768|126.990021| 15| 13| 1| 0.68| 16.87| 6.5| 435|
| Seoul| Guro-gu|Guro-gu Call Center| 99|10070|37.495632| 126.88765| 26| 34| 3| 1.0| 16.21| 5.7| 741|10070|37.495632| 126.88765| 26| 34| 3| 1.0| 16.21| 5.7| 741|
+--------+----------+-------------------+---------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+
有时,您可能会遇到这样的情况:您需要将一个非常大的表与一个非常小的表连接起来。该场景还可能涉及增加数据库的大小,如下面的示例所示:
在Spark中,这样的操作非常多,在Spark中,您可能希望将多个操作应用于particular key。但是,假设大表中每个键的数据都很大,它将涉及大量的数据移动。有时会导致应用程序本身崩溃。在这样大的表上进行连接时(假设另一个表很小),可以做的一个小优化是在执行连接时将小表广播(broadcast)到每个计算机/节点
from pyspark.sql.functions import broadcast
cases = cases.join(broadcast(regions), ['province','city'],how='left')
4.使用sql
如果需要,还可以对spark dataframe使用SQL。让我们试着在cases表上运行一些SQL。
我们首先将cases dataframe注册到一个临时表cases_表中,我们可以在该表上运行SQL操作。如您所见,SQL select语句的结果再次是一个spark dataframe。
注意: registerTempTable 方法已经在 spark 2.0.0+ 启用了取而代之的是createOrReplaceTempView
dataframe的操作通常会使用sql语句完成,下面有四个创建表的方法
#df.createGlobalTempView("tempViewName") 创建一个全局临时表,生命周期为程序的生命周期 **使用的时候 global_temp.tempViewName
#df.createOrReplaceGlobalTempView("tempViewName") 创建或者替换一个全局临时表,生命周期为程序的生命周期
#df.createOrReplaceTempView("tempViewName") 创建一个临时表,生命周期为当前SparkSession的生命周期
#df.createTempView("tempViewName") 创建或者替换一个临时表,生命周期为当前SparkSession的生命周期
# 这里使用createOrReplaceTempView创建一个临时表
cases.createOrReplaceTempView('cases_table')
newDF = spark.sql('select * from cases_table where confirmed>100')
newDF.show()
+-----------------+---------------+--------------------+---------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+
| province| city| infection_case|confirmed| code| latitude| longitude|elementary_school_count|kindergarten_count|university_count|academy_ratio|elderly_population_ratio|elderly_alone_ratio|nursing_home_count| code| latitude| longitude|elementary_school_count|kindergarten_count|university_count|academy_ratio|elderly_population_ratio|elderly_alone_ratio|nursing_home_count|
+-----------------+---------------+--------------------+---------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+
| Seoul| Yongsan-gu| Itaewon Clubs| 133|10210|37.532768|126.990021| 15| 13| 1| 0.68| 16.87| 6.5| 435|10210|37.532768|126.990021| 15| 13| 1| 0.68| 16.87| 6.5| 435|
| Seoul| -| overseas inflow| 346| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null|
| Daegu| Nam-gu| Shincheonji Church| 4510|12010| 35.8463|128.597723| 11| 15| 2| 0.85| 22.49| 10.4| 345|12010| 35.8463|128.597723| 11| 15| 2| 0.85| 22.49| 10.4| 345|
| Daegu| Dalseong-gun|Second Mi-Ju Hosp...| 196|12030| 35.77475|128.431314| 32| 47| 1| 1.51| 12.11| 5.4| 361|12030| 35.77475|128.431314| 32| 47| 1| 1.51| 12.11| 5.4| 361|
| Daegu| Seo-gu|Hansarang Convale...| 128|12060|35.871993|128.559182| 17| 23| 0| 0.83| 21.29| 10.1| 374|12060|35.871993|128.559182| 17| 23| 0| 0.83| 21.29| 10.1| 374|
| Daegu| -|contact with patient| 920| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null|
| Daegu| -| etc| 737| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null|
| Gyeonggi-do| -| overseas inflow| 252| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null|
|Chungcheongnam-do| Cheonan-si|gym facility in C...| 103|41120| 36.81498|127.113868| 75| 112| 6| 1.91| 10.42| 4.5| 1069|41120| 36.81498|127.113868| 75| 112| 6| 1.91| 10.42| 4.5| 1069|
| Gyeongsangbuk-do|from other city| Shincheonji Church| 566| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null|
| Gyeongsangbuk-do| Cheongdo-gun|Cheongdo Daenam H...| 120|60200|35.647361|128.734382| 11| 14| 0| 0.63| 36.55| 21.0| 85|60200|35.647361|128.734382| 11| 14| 0| 0.63| 36.55| 21.0| 85|
| Gyeongsangbuk-do| -|contact with patient| 190| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null|
| Gyeongsangbuk-do| -| etc| 132| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null|
+-----------------+---------------+--------------------+---------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+
上面展示了一个最小的例子,但是你可以使用非常复杂的SQL查询,包括GROUP BY、HAVING和ORDER BY子句,以及上面查询中的别名。
5.创建新列
有许多方法可用于在PySpark dataframe中创建列。我们可以使用.withcolumn和PySpark SQL函数创建一个新列。
(1).使用内置函数functions
import pyspark.sql.functions as F
casesWithNewConfirmed = cases.withColumn("NewConfirmed", 100 + F.col("confirmed"))
casesWithNewConfirmed.show(2)
+--------+----------+-------------------+---------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+------------+
|province| city| infection_case|confirmed| code| latitude| longitude|elementary_school_count|kindergarten_count|university_count|academy_ratio|elderly_population_ratio|elderly_alone_ratio|nursing_home_count| code| latitude| longitude|elementary_school_count|kindergarten_count|university_count|academy_ratio|elderly_population_ratio|elderly_alone_ratio|nursing_home_count|NewConfirmed|
+--------+----------+-------------------+---------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+------------+
| Seoul|Yongsan-gu| Itaewon Clubs| 133|10210|37.532768|126.990021| 15| 13| 1| 0.68| 16.87| 6.5| 435|10210|37.532768|126.990021| 15| 13| 1| 0.68| 16.87| 6.5| 435| 233|
| Seoul| Guro-gu|Guro-gu Call Center| 99|10070|37.495632| 126.88765| 26| 34| 3| 1.0| 16.21| 5.7| 741|10070|37.495632| 126.88765| 26| 34| 3| 1.0| 16.21| 5.7| 741| 199|
+--------+----------+-------------------+---------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+------------+
=================过几天填坑=======================