Common PySpark DataFrame Operations

A guide to common operations on PySpark DataFrames

Sections 1 and 2 cover environment setup and the dataset; if you only want the common operations, skip ahead to section 3.

1. Setting up the runtime environment

To do a good job, one must first sharpen one's tools. A Spark installation alone could fill a whole tutorial, and after working through every pitfall in it you would probably have no energy left for the rest... So this guide uses Google Colab as the runtime to sidestep the installation hassle (note that, depending on your region, you may need a proxy to reach Google services). Register a Google account, log in to Google Colab, and create a new Jupyter notebook, as shown below:

Installing Spark on Colab

The full code is below; paste it into a Colab notebook and run the cells in order.

# Download the required files and install the required packages
# Spark 3.0.0 is used here (the latest release at the time of writing)
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz
!tar xf spark-3.0.0-bin-hadoop2.7.tgz
!pip install -q findspark
!pip install pyspark
# Configure environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop2.7"
# Start Spark
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

Now, entering sc in a new cell displays the Spark version:

SparkContext
Spark UI
Version: v3.0.0
Master: local[*]
AppName: pyspark-shell

2. The dataset

The COVID dataset

This guide uses the Kaggle COVID-19 dataset. I downloaded it and put it on Google Drive; Colab can read Google Drive directly (click "Mount Google Drive"), or you can upload the files manually (click the upload button).
I have also mirrored the data on Baidu Cloud; just download and extract it:

Link: https://pan.baidu.com/s/1b8nqJ2fyoF6BjjkZqTuccA  Password: 7yd2
Uploading the data

3. Basic operations

Reading data

We can load a file into a DataFrame with the spark.read.load command.
By default this command reads Parquet, Spark's default file format, but you can add the format parameter to read .csv files.
Note: Spark can read gzip-compressed CSV files directly; it works exactly like reading a plain CSV.

# Read Case.csv from the COVID dataset
path = '/content/drive/My Drive/spark/COVID/Case.csv.gz'
# Option 1
cases = spark.read.load(path, format="csv", sep=",", inferSchema="true", header="true")
# Option 2
cases = (spark.read
         .option('header', 'true')
         .option('inferschema', 'true')
         .csv(path))

cases.show(2)  # show the first 2 rows
+-------+--------+----------+-----+-------------------+---------+---------+----------+
|case_id|province|      city|group|     infection_case|confirmed| latitude| longitude|
+-------+--------+----------+-----+-------------------+---------+---------+----------+
|1000001|   Seoul|Yongsan-gu| true|      Itaewon Clubs|      133|37.538621|126.992652|
|1000002|   Seoul|   Guro-gu| true|Guro-gu Call Center|       99|37.508163|126.884387|
+-------+--------+----------+-----+-------------------+---------+---------+----------+
Converting to pandas
# take the first 10 rows and convert to pandas
cases.limit(10).toPandas()
Renaming columns

Sometimes we want to rename columns in a Spark DataFrame. Renaming a single column takes one command:

# withColumnRenamed('old_name', 'new_name')
cases = cases.withColumnRenamed("infection_case", "infection_source")

Or, to rename all columns at once:

cases2 = cases.toDF(*['case_id2', 'province2', 'city2', 'group2', 'infection_case2', 'confirmed2',
       'latitude2', 'longitude2'])
Selecting columns

We can select a subset of columns with the select keyword:

cases = cases.select('province','city','infection_case','confirmed')
cases.show()
+--------+----------+-------------------+---------+
|province|      city|     infection_case|confirmed|
+--------+----------+-------------------+---------+
|   Seoul|Yongsan-gu|      Itaewon Clubs|      133|
|   Seoul|   Guro-gu|Guro-gu Call Center|       99|
+--------+----------+-------------------+---------+
Sorting

Note that sorting must be assigned back explicitly: cases.sort("confirmed") on its own does not change cases; you must write
cases = cases.sort("confirmed")

# ascending order
cases = cases.sort("confirmed")
cases.show(10)
+-----------------+---------------+--------------------+---------+
|         province|           city|      infection_case|confirmed|
+-----------------+---------------+--------------------+---------+
|            Seoul|Yeongdeungpo-gu|Yeongdeungpo Lear...|     null|
|            Seoul|        Jung-gu|   KB Life Insurance|     null|
|            Seoul|form other city|Guri Collective I...|     null|
|          Gwangju|              -|                 etc|        0|
|          Jeju-do|              -|contact with patient|        0|
|       Gangwon-do|              -|contact with patient|        0|
|            Busan|from other city|Cheongdo Daenam H...|        1|
|          Gwangju|              -|contact with patient|        1|
|Chungcheongnam-do|              -|contact with patient|        1|
|     Jeollabuk-do|from other city|  Shincheonji Church|        1|
+-----------------+---------------+--------------------+---------+

Descending order, by contrast, uses pyspark.sql.functions:

# descending order
import pyspark.sql.functions as F
cases = cases.sort(F.desc("confirmed"))
cases.show(10)
+----------------+---------------+--------------------+---------+
|        province|           city|      infection_case|confirmed|
+----------------+---------------+--------------------+---------+
|           Daegu|         Nam-gu|  Shincheonji Church|     4510|
|           Daegu|              -|contact with patient|      920|
|           Daegu|              -|                 etc|      737|
|Gyeongsangbuk-do|from other city|  Shincheonji Church|      566|
|           Seoul|              -|     overseas inflow|      346|
|     Gyeonggi-do|              -|     overseas inflow|      252|
|           Daegu|   Dalseong-gun|Second Mi-Ju Hosp...|      196|
|Gyeongsangbuk-do|              -|contact with patient|      190|
|           Seoul|     Yongsan-gu|       Itaewon Clubs|      133|
|Gyeongsangbuk-do|              -|                 etc|      132|
+----------------+---------------+--------------------+---------+
Type casting

Although it does not happen with this dataset, PySpark may read a double as an integer or a string; in that case you can use the cast function to convert the type.
Below, the confirmed and city columns are re-typed:

from pyspark.sql.types import DoubleType, IntegerType, StringType
cases = cases.withColumn('confirmed', F.col('confirmed').cast(IntegerType()))
cases = cases.withColumn('city', F.col('city').cast(StringType()))

Filtering

We can filter data on multiple conditions using AND (&), OR (|), and NOT (~).
The example below applies three conditions:
confirmed > 10
province == 'Daegu'
city is not one of ['Nam-gu', 'Dalseong-gun']

cases.filter((cases.confirmed>10) & 
             (cases.province=='Daegu') & 
             (~cases.city.isin(['Nam-gu', 'Dalseong-gun']))).show()
+--------+-------+--------------------+---------+
|province|   city|      infection_case|confirmed|
+--------+-------+--------------------+---------+
|   Daegu|      -|contact with patient|      920|
|   Daegu|      -|                 etc|      737|
|   Daegu| Seo-gu|Hansarang Convale...|      128|
|   Daegu|Dong-gu|     Fatima Hospital|       37|
|   Daegu|      -|     overseas inflow|       32|
+--------+-------+--------------------+---------+
Grouping

We can also use groupBy on a Spark DataFrame. It works much like pandas groupBy, except that the aggregation functions must be imported from pyspark.sql.functions; that module holds the full list of functions available here.

from pyspark.sql import functions as F
cases.groupBy(["province","city"]).agg(F.sum("confirmed") ,F.max("confirmed")).show(2)
+----------------+--------+--------------+--------------+
|        province|    city|sum(confirmed)|max(confirmed)|
+----------------+--------+--------------+--------------+
|Gyeongsangnam-do|Jinju-si|            10|            10|
|           Seoul| Guro-gu|           140|            99|
+----------------+--------+--------------+--------------+

If you don't like the generated column names, you can rename them inside the agg call itself with the alias keyword.

cases.groupBy(["province","city"]).agg(
    F.sum("confirmed").alias("TotalConfirmed"),\
    F.max("confirmed").alias("MaxFromOneConfirmedCase")\
    ).show(2)
+----------------+--------+--------------+-----------------------+
|        province|    city|TotalConfirmed|MaxFromOneConfirmedCase|
+----------------+--------+--------------+-----------------------+
|Gyeongsangnam-do|Jinju-si|            10|                     10|
|           Seoul| Guro-gu|           140|                     99|
+----------------+--------+--------------+-----------------------+

Joining

First we need another file from the COVID dataset, Region.csv. This region file contains per-region information such as the number of elementary schools and the elderly-population ratio.

regions = spark.read.load("/content/drive/My Drive/spark/COVID/Region.csv.gz", 
                          format="csv", 
                          sep=",", 
                          inferSchema="true", 
                          header="true")
regions.limit(10).show()
+-----+--------+------------+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+
| code|province|        city| latitude| longitude|elementary_school_count|kindergarten_count|university_count|academy_ratio|elderly_population_ratio|elderly_alone_ratio|nursing_home_count|
+-----+--------+------------+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+
|10000|   Seoul|       Seoul|37.566953|126.977977|                    607|               830|              48|         1.44|                   15.38|                5.8|             22739|
|10010|   Seoul|  Gangnam-gu|37.518421|127.047222|                     33|                38|               0|         4.18|                   13.17|                4.3|              3088|
|10020|   Seoul| Gangdong-gu|37.530492|127.123837|                     27|                32|               0|         1.54|                   14.55|                5.4|              1023|
|10030|   Seoul|  Gangbuk-gu|37.639938|127.025508|                     14|                21|               0|         0.67|                   19.49|                8.5|               628|
|10040|   Seoul|  Gangseo-gu|37.551166|126.849506|                     36|                56|               1|         1.17|                   14.39|                5.7|              1080|
|10050|   Seoul|   Gwanak-gu| 37.47829|126.951502|                     22|                33|               1|         0.89|                   15.12|                4.9|               909|
|10060|   Seoul| Gwangjin-gu|37.538712|127.082366|                     22|                33|               3|         1.16|                   13.75|                4.8|               723|
|10070|   Seoul|     Guro-gu|37.495632| 126.88765|                     26|                34|               3|          1.0|                   16.21|                5.7|               741|
|10080|   Seoul|Geumcheon-gu|37.456852|126.895229|                     18|                19|               0|         0.96|                   16.15|                6.7|               475|
|10090|   Seoul|    Nowon-gu|37.654259|127.056294|                     42|                66|               6|         1.39|                    15.4|                7.4|               952|
+-----+--------+------------+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+

Next, merge the two DataFrames with the join method:

# join on the 'province' and 'city' columns of cases
cases = cases.join(regions, ['province','city'],how='left')
cases.show()
+--------+----------+-------------------+---------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+
|province|      city|     infection_case|confirmed| code| latitude| longitude|elementary_school_count|kindergarten_count|university_count|academy_ratio|elderly_population_ratio|elderly_alone_ratio|nursing_home_count| code| latitude| longitude|elementary_school_count|kindergarten_count|university_count|academy_ratio|elderly_population_ratio|elderly_alone_ratio|nursing_home_count|
+--------+----------+-------------------+---------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+
|   Seoul|Yongsan-gu|      Itaewon Clubs|      133|10210|37.532768|126.990021|                     15|                13|               1|         0.68|                   16.87|                6.5|               435|10210|37.532768|126.990021|                     15|                13|               1|         0.68|                   16.87|                6.5|               435|
|   Seoul|   Guro-gu|Guro-gu Call Center|       99|10070|37.495632| 126.88765|                     26|                34|               3|          1.0|                   16.21|                5.7|               741|10070|37.495632| 126.88765|                     26|                34|               3|          1.0|                   16.21|                5.7|               741|
+--------+----------+-------------------+---------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+

Sometimes you will need to join a very large table with a very small one, a scenario that tends to become more common as the data grows, as in the example below:

Broadcast

Operations like this are very common in Spark: you want to apply several operations to a particular key. If each key in the large table carries a lot of data, the join involves massive data movement across the cluster, sometimes enough to crash the application. When joining such a large table with a small one, a useful optimization is to broadcast the small table to every machine/node before performing the join:

from pyspark.sql.functions import broadcast
cases = cases.join(broadcast(regions), ['province','city'],how='left')

4. Using SQL

If needed, you can also run SQL against a Spark DataFrame. Let's try some SQL on the cases table.
We first register the cases DataFrame as a temporary table, cases_table, and then run SQL operations on it. As you can see, the result of a SQL SELECT statement is again a Spark DataFrame.
Note: the registerTempTable method has been deprecated since Spark 2.0.0; use createOrReplaceTempView instead.

DataFrame operations are often carried out through SQL; there are four ways to create a view:
# df.createGlobalTempView("tempViewName")           create a global temp view, scoped to the application; query it as global_temp.tempViewName
# df.createOrReplaceGlobalTempView("tempViewName")  create or replace a global temp view, scoped to the application
# df.createOrReplaceTempView("tempViewName")        create or replace a temp view, scoped to the current SparkSession
# df.createTempView("tempViewName")                 create a temp view, scoped to the current SparkSession (fails if it already exists)
# here we create a temporary view with createOrReplaceTempView
cases.createOrReplaceTempView('cases_table')
newDF = spark.sql('select * from cases_table where confirmed>100')
newDF.show()
+-----------------+---------------+--------------------+---------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+
|         province|           city|      infection_case|confirmed| code| latitude| longitude|elementary_school_count|kindergarten_count|university_count|academy_ratio|elderly_population_ratio|elderly_alone_ratio|nursing_home_count| code| latitude| longitude|elementary_school_count|kindergarten_count|university_count|academy_ratio|elderly_population_ratio|elderly_alone_ratio|nursing_home_count|
+-----------------+---------------+--------------------+---------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+
|            Seoul|     Yongsan-gu|       Itaewon Clubs|      133|10210|37.532768|126.990021|                     15|                13|               1|         0.68|                   16.87|                6.5|               435|10210|37.532768|126.990021|                     15|                13|               1|         0.68|                   16.87|                6.5|               435|
|            Seoul|              -|     overseas inflow|      346| null|     null|      null|                   null|              null|            null|         null|                    null|               null|              null| null|     null|      null|                   null|              null|            null|         null|                    null|               null|              null|
|            Daegu|         Nam-gu|  Shincheonji Church|     4510|12010|  35.8463|128.597723|                     11|                15|               2|         0.85|                   22.49|               10.4|               345|12010|  35.8463|128.597723|                     11|                15|               2|         0.85|                   22.49|               10.4|               345|
|            Daegu|   Dalseong-gun|Second Mi-Ju Hosp...|      196|12030| 35.77475|128.431314|                     32|                47|               1|         1.51|                   12.11|                5.4|               361|12030| 35.77475|128.431314|                     32|                47|               1|         1.51|                   12.11|                5.4|               361|
|            Daegu|         Seo-gu|Hansarang Convale...|      128|12060|35.871993|128.559182|                     17|                23|               0|         0.83|                   21.29|               10.1|               374|12060|35.871993|128.559182|                     17|                23|               0|         0.83|                   21.29|               10.1|               374|
|            Daegu|              -|contact with patient|      920| null|     null|      null|                   null|              null|            null|         null|                    null|               null|              null| null|     null|      null|                   null|              null|            null|         null|                    null|               null|              null|
|            Daegu|              -|                 etc|      737| null|     null|      null|                   null|              null|            null|         null|                    null|               null|              null| null|     null|      null|                   null|              null|            null|         null|                    null|               null|              null|
|      Gyeonggi-do|              -|     overseas inflow|      252| null|     null|      null|                   null|              null|            null|         null|                    null|               null|              null| null|     null|      null|                   null|              null|            null|         null|                    null|               null|              null|
|Chungcheongnam-do|     Cheonan-si|gym facility in C...|      103|41120| 36.81498|127.113868|                     75|               112|               6|         1.91|                   10.42|                4.5|              1069|41120| 36.81498|127.113868|                     75|               112|               6|         1.91|                   10.42|                4.5|              1069|
| Gyeongsangbuk-do|from other city|  Shincheonji Church|      566| null|     null|      null|                   null|              null|            null|         null|                    null|               null|              null| null|     null|      null|                   null|              null|            null|         null|                    null|               null|              null|
| Gyeongsangbuk-do|   Cheongdo-gun|Cheongdo Daenam H...|      120|60200|35.647361|128.734382|                     11|                14|               0|         0.63|                   36.55|               21.0|                85|60200|35.647361|128.734382|                     11|                14|               0|         0.63|                   36.55|               21.0|                85|
| Gyeongsangbuk-do|              -|contact with patient|      190| null|     null|      null|                   null|              null|            null|         null|                    null|               null|              null| null|     null|      null|                   null|              null|            null|         null|                    null|               null|              null|
| Gyeongsangbuk-do|              -|                 etc|      132| null|     null|      null|                   null|              null|            null|         null|                    null|               null|              null| null|     null|      null|                   null|              null|            null|         null|                    null|               null|              null|
+-----------------+---------------+--------------------+---------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+

The example above is minimal, but you can run arbitrarily complex SQL queries, including GROUP BY, HAVING, and ORDER BY clauses, as well as aliases like those in the query above.

5. Creating new columns

There are many ways to create columns in a PySpark DataFrame. We can create a new column with the withColumn method combined with PySpark SQL functions.
(1) Using the built-in functions module

import pyspark.sql.functions as F
casesWithNewConfirmed = cases.withColumn("NewConfirmed", 100 + F.col("confirmed"))
casesWithNewConfirmed.show(2)
+--------+----------+-------------------+---------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+------------+
|province|      city|     infection_case|confirmed| code| latitude| longitude|elementary_school_count|kindergarten_count|university_count|academy_ratio|elderly_population_ratio|elderly_alone_ratio|nursing_home_count| code| latitude| longitude|elementary_school_count|kindergarten_count|university_count|academy_ratio|elderly_population_ratio|elderly_alone_ratio|nursing_home_count|NewConfirmed|
+--------+----------+-------------------+---------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+------------+
|   Seoul|Yongsan-gu|      Itaewon Clubs|      133|10210|37.532768|126.990021|                     15|                13|               1|         0.68|                   16.87|                6.5|               435|10210|37.532768|126.990021|                     15|                13|               1|         0.68|                   16.87|                6.5|               435|         233|
|   Seoul|   Guro-gu|Guro-gu Call Center|       99|10070|37.495632| 126.88765|                     26|                34|               3|          1.0|                   16.21|                5.7|               741|10070|37.495632| 126.88765|                     26|                34|               3|          1.0|                   16.21|                5.7|               741|         199|
+--------+----------+-------------------+---------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+-----+---------+----------+-----------------------+------------------+----------------+-------------+------------------------+-------------------+------------------+------------+

================= To be continued =================
