Original article: https://blog.csdn.net/qq_17685725/article/details/123069407
DataFrame programming steps:
step 1: import the required modules
step 2: create a SparkSession object
step 3: read the data source through the SparkSession object to produce a DataFrame object
step 4: apply Transformation operations to the DataFrame (in either of two ways)
option (1): via the methods provided by the DataFrame API
option (2): via Spark SQL
step 5: apply Action operations to the DataFrame (a minimal sketch tying these steps together follows this list)
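A minimal end-to-end sketch of the five steps above, using a made-up two-row dataset purely for illustration (the column names and values here are not part of the experiments that follow):
from pyspark.sql import SparkSession                      # step 1: import the required module

spark = SparkSession.builder.getOrCreate()                # step 2: create (or reuse) the SparkSession

data = [("Alice", 90), ("Bob", 75)]                       # hypothetical sample rows
df = spark.createDataFrame(data, ["name", "score"])       # step 3: generate a DataFrame from the data source

high = df.filter(df.score > 80)                           # step 4, option (1): Transformation via the DataFrame API
df.createOrReplaceTempView("t")
high_sql = spark.sql("select * from t where score > 80")  # step 4, option (2): Transformation via Spark SQL

high.show()                                               # step 5: Action - triggers execution and prints the result
high_sql.show()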
# The lines below were added to point PySpark at the local Python interpreter (uncomment if needed)
# import os
# os.environ["PYSPARK_PYTHON"] = "D:/develop/anaconda3/python.exe"
# os.environ["PYSPARK_DRIVER_PYTHON"] = "D:/develop/anaconda3/python.exe"
# !pip install widgetsnbextension
import findspark
findspark.init()                            # locate the local Spark installation and add it to sys.path
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()  # create (or reuse) the SparkSession entry point
Experiment 1: Given scores = [("Tom", 89, 80, 77), ("Mike", 68, 73, 90), ("Rose", 88, 65, 70), ("Lucy", 56, 75, 86)]
(1) Create a DataFrame object from scores and inspect the default column names and column types
scores = [("Tom", 89, 80, 77), ("Mike", 68, 73, 90), ("Rose", 88, 65, 70), ("Lucy", 56, 75, 86)]
df = spark.createDataFrame(scores)
print(df.columns,"\n")
df.printSchema()
['_1', '_2', '_3', '_4']
root
|-- _1: string (nullable = true)
|-- _2: long (nullable = true)
|-- _3: long (nullable = true)
|-- _4: long (nullable = true)
(2) Create a DataFrame object from scores with the column names name, hadoop_score, spark_score, nosql_score, then call show() to inspect the data
scores = [("Tom", 89, 80, 77), ("Mike", 68, 73, 90), ("Rose", 88, 65, 70), ("Lucy", 56, 75, 86)]
columns = ['name','hadoop_score', 'spark_score', 'nosql_score']
df1 = spark.createDataFrame(scores,columns)
df1.show()
+----+------------+-----------+-----------+
|name|hadoop_score|spark_score|nosql_score|
+----+------------+-----------+-----------+
| Tom| 89| 80| 77|
|Mike| 68| 73| 90|
|Rose| 88| 65| 70|
|Lucy| 56| 75| 86|
+----+------------+-----------+-----------+
(3) Check the number of partitions of the DataFrame from (2) and the data held in each partition (hint: first convert the DataFrame to an RDD)
scores_rdd = df1.rdd
scores_rdd.getNumPartitions()
12
scores_rdd.glom().map(len).collect()
[0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1]
scores_rdd.glom().collect()
[[],
[],
[Row(name='Tom', hadoop_score=89, spark_score=80, nosql_score=77)],
[],
[],
[Row(name='Mike', hadoop_score=68, spark_score=73, nosql_score=90)],
[],
[],
[Row(name='Rose', hadoop_score=88, spark_score=65, nosql_score=70)],
[],
[],
[Row(name='Lucy', hadoop_score=56, spark_score=75, nosql_score=86)]]
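The 12 partitions here simply reflect the default parallelism of this local Spark setup (usually the number of CPU cores), so the count and the placement of the four rows may differ on another machine. As a small sketch, the default can be checked and the data redistributed explicitly:
print(spark.sparkContext.defaultParallelism)   # default partition count for locally created DataFrames
df1_4 = df1.repartition(4)                     # redistribute the rows across 4 partitions
print(df1_4.rdd.getNumPartitions())            # 4
print(df1_4.rdd.glom().map(len).collect())     # rows held by each of the 4 partitions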
(4) Select the name and spark_score columns of the DataFrame from (2) to form a new DataFrame, then call show() to inspect the data (use three methods)
Method 1: using the DataFrame API: select()
df_select = df1.select('name','spark_score')
df_select.show()
+----+-----------+
|name|spark_score|
+----+-----------+
| Tom| 80|
|Mike| 73|
|Rose| 65|
|Lucy| 75|
+----+-----------+
Method 2: using the DataFrame API: drop()
df_select = df1.drop('hadoop_score','nosql_score')
df_select.show()
+----+-----------+
|name|spark_score|
+----+-----------+
| Tom| 80|
|Mike| 73|
|Rose| 65|
|Lucy| 75|
+----+-----------+
Method 3: using Spark SQL
df1.createOrReplaceTempView('df_views')   # register df1 as a temporary view (returns None, so nothing to assign)
df_select = spark.sql("select name,spark_score from df_views")
df_select.show()
+----+-----------+
|name|spark_score|
+----+-----------+
| Tom| 80|
|Mike| 73|
|Rose| 65|
|Lucy| 75|
+----+-----------+
(5) Rename the name column of the DataFrame from (2) to student_name (use two methods)
Method 1: using the DataFrame API: withColumnRenamed()
df1_rename = df1.withColumnRenamed('name','student_name')
df1_rename.show()
+------------+------------+-----------+-----------+
|student_name|hadoop_score|spark_score|nosql_score|
+------------+------------+-----------+-----------+
| Tom| 89| 80| 77|
| Mike| 68| 73| 90|
| Rose| 88| 65| 70|
| Lucy| 56| 75| 86|
+------------+------------+-----------+-----------+
Method 2: using Spark SQL
df1.createOrReplaceTempView('df_view')   # register df1 as a temporary view (returns None, so nothing to assign)
df_sql = spark.sql("select name as student_name, hadoop_score, spark_score, nosql_score from df_view")
df_sql.show()
+------------+------------+-----------+-----------+
|student_name|hadoop_score|spark_score|nosql_score|
+------------+------------+-----------+-----------+
| Tom| 89| 80| 77|
| Mike| 68| 73| 90|
| Rose| 88| 65| 70|
| Lucy| 56| 75| 86|
+------------+------------+-----------+-----------+
df1.createOrReplaceTempView('df_view')
df_sql = spark.sql("select name as student_name from df_view")   # variant: keep only the renamed column
df_sql.show()
+------------+
|student_name|
+------------+
| Tom|
| Mike|
| Rose|
| Lucy|
+------------+
df1.show()   # the original df1 is not modified by the renaming above
+----+------------+-----------+-----------+
|name|hadoop_score|spark_score|nosql_score|
+----+------------+-----------+-----------+
| Tom| 89| 80| 77|
|Mike| 68| 73| 90|
|Rose| 88| 65| 70|
|Lucy| 56| 75| 86|
+----+------------+-----------+-----------+
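For reference only (the exercise asks for two methods): toDF() is another common way to rename columns, supplying new names for all columns at once, e.g.:
df1.toDF('student_name', 'hadoop_score', 'spark_score', 'nosql_score').show()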
(6) From the three score columns of the DataFrame in (2), create the new columns total_scores and avg_scores holding the sum and the average of the three scores (keep two decimal places for the average) (use two methods)
Method 1: using the DataFrame API: withColumn()
df2 = df1.withColumn('total_scores',df1.hadoop_score + df1.spark_score + df1.nosql_score)
df2 = df2.withColumn('avg_scores',(df1.hadoop_score + df1.spark_score + df1.nosql_score)/3)
df2.show()
+----+------------+-----------+-----------+------------+-----------------+
|name|hadoop_score|spark_score|nosql_score|total_scores| avg_scores|
+----+------------+-----------+-----------+------------+-----------------+
| Tom| 89| 80| 77| 246| 82.0|
|Mike| 68| 73| 90| 231| 77.0|
|Rose| 88| 65| 70| 223|74.33333333333333|
|Lucy| 56| 75| 86| 217|72.33333333333333|
+----+------------+-----------+-----------+------------+-----------------+
df2.printSchema()
root
|-- name: string (nullable = true)
|-- hadoop_score: long (nullable = true)
|-- spark_score: long (nullable = true)
|-- nosql_score: long (nullable = true)
|-- total_scores: long (nullable = true)
|-- avg_scores: double (nullable = true)
from pyspark.sql.functions import col
from pyspark.sql.functions import round, format_number
# be sure to import round, format_number and col first, otherwise the expressions below raise errors
# df2 = df2.withColumn('avg_scores', round(col('avg_scores'), 2))
df2 = df2.withColumn('avg_scores', round('avg_scores', 2))
df2.show()
+----+------------+-----------+-----------+------------+----------+
|name|hadoop_score|spark_score|nosql_score|total_scores|avg_scores|
+----+------------+-----------+-----------+------------+----------+
| Tom| 89| 80| 77| 246| 82.0|
|Mike| 68| 73| 90| 231| 77.0|
|Rose| 88| 65| 70| 223| 74.33|
|Lucy| 56| 75| 86| 217| 72.33|
+----+------------+-----------+-----------+------------+----------+
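format_number was imported above but not used; as a side note, it is an alternative to round() that also keeps two decimal places but returns a string column rather than a numeric one (the avg_scores_str column name below is just for illustration):
df2.withColumn('avg_scores_str', format_number('avg_scores', 2)).printSchema()   # avg_scores_str: string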
Method 2: using Spark SQL
df1.createOrReplaceTempView('df3_view')
df3 = spark.sql("select name, hadoop_score, spark_score, nosql_score, (hadoop_score + spark_score + nosql_score) as total_scores, round((hadoop_score + spark_score + nosql_score)/3, 2) as avg_scores from df3_view")
df3.show()
+----+------------+-----------+-----------+------------+----------+
|name|hadoop_score|spark_score|nosql_score|total_scores|avg_scores|
+----+------------+-----------+-----------+------------+----------+
| Tom| 89| 80| 77| 246| 82.0|
|Mike| 68| 73| 90| 231| 77.0|
|Rose| 88| 65| 70| 223| 74.33|
|Lucy| 56| 75| 86| 217| 72.33|
+----+------------+-----------+-----------+------------+----------+
(7) Change the types of the three score columns of the DataFrame in (2) to integer (use two methods)
Method 1: using the DataFrame API: withColumn()
df1.printSchema()
root
|-- name: string (nullable = true)
|-- hadoop_score: long (nullable = true)
|-- spark_score: long (nullable = true)
|-- nosql_score: long (nullable = true)
from pyspark.sql.types import IntegerType
# be sure to import IntegerType before casting
df7 = df1.withColumn('hadoop_score', df1['hadoop_score'].cast(IntegerType()))
df7 = df7.withColumn('spark_score', df7['spark_score'].cast(IntegerType()))
df7 = df7.withColumn('nosql_score', df7['nosql_score'].cast(IntegerType()))
df7.printSchema()
root
|-- name: string (nullable = true)
|-- hadoop_score: integer (nullable = true)
|-- spark_score: integer (nullable = true)
|-- nosql_score: integer (nullable = true)
Method 2: using Spark SQL
df1.createOrReplaceTempView('df7_view')
df7_2 = spark.sql("select name, cast(hadoop_score as integer),cast(spark_score as integer),cast(nosql_score as integer) from df7_view")
df7_2.printSchema()
root
|-- name: string (nullable = true)
|-- hadoop_score: integer (nullable = true)
|-- spark_score: integer (nullable = true)
|-- nosql_score: integer (nullable = true)
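When several columns need the same cast, a short loop over withColumn() avoids repeating the expression; a sketch equivalent to Method 1 above, reusing the IntegerType import (df7_loop is just an illustrative name):
df7_loop = df1
for c in ['hadoop_score', 'spark_score', 'nosql_score']:
    df7_loop = df7_loop.withColumn(c, df7_loop[c].cast(IntegerType()))
df7_loop.printSchema()   # all three score columns are now integer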
(8) From the name column of the DataFrame in (2) together with the sum of the other three columns, create a new DataFrame whose columns are renamed to stuName and totalScores (use two methods)
Method 1: using the DataFrame API: selectExpr()
df8 = df2.selectExpr("name as stuName","total_scores as totalScores")
df8.show()
+-------+-----------+
|stuName|totalScores|
+-------+-----------+
| Tom| 246|
| Mike| 231|
| Rose| 223|
| Lucy| 217|
+-------+-----------+
Method 2: using Spark SQL
df2.createOrReplaceTempView("df8_view")
df8_2 = spark.sql("select name as stuName,total_scores as totalScores from df8_view")
df8_2.show()
+-------+-----------+
|stuName|totalScores|
+-------+-----------+
| Tom| 246|
| Mike| 231|
| Rose| 223|
| Lucy| 217|
+-------+-----------+
Experiment 2: Given Others\StudentData.csv
(1) Create a DataFrame object from Others\StudentData.csv (note whether it has a header row), then view the DataFrame data and the default column types
dfs = spark.read.csv(r'file:\D:\juniortwo\spark\Spark2023-02-20\Others\StudentData.csv',header = True, inferSchema = True)
dfs.show()
+---+------+----------------+------+------+-----+--------------------+
|age|gender| name|course| roll|marks| email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras| DB| 2984| 59|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud| 12899| 62|Margene Moores_Ma...|
| 28| Male| Celeste Lollis| PF| 21267| 45|Jeannetta Golden_...|
| 29|Female| Elenore Choy| DB| 32877| 29|Billi Clore_Mitzi...|
| 28| Male| Sheryll Towler| DSA| 41487| 41|Claude Panos_Judi...|
| 28| Male| Margene Moores| MVC| 52771| 32|Toshiko Hillyard_...|
| 28| Male| Neda Briski| OOP| 61973| 69|Alberta Freund_El...|
| 28|Female| Claude Panos| Cloud| 72409| 85|Sheryll Towler_Al...|
| 28| Male| Celeste Lollis| MVC| 81492| 64|Nicole Harwood_Cl...|
| 29| Male| Cordie Harnois| OOP| 92882| 51|Judie Chipps_Clem...|
| 29|Female| Kena Wild| DSA|102285| 35|Dustin Feagins_Ma...|
| 29| Male| Ernest Rossbach| DB|111449| 53|Maybell Duguay_Ab...|
| 28|Female| Latia Vanhoose| DB|122502| 27|Latia Vanhoose_Mi...|
| 29|Female| Latia Vanhoose| MVC|132110| 55|Eda Neathery_Nico...|
| 29| Male| Neda Briski| PF|141770| 42|Margene Moores_Mi...|
| 29|Female| Latia Vanhoose| DB|152159| 27|Claude Panos_Sant...|
| 29| Male| Loris Crossett| MVC|161771| 36|Mitzi Seldon_Jenn...|
| 29| Male| Annika Hoffman| OOP|171660| 22|Taryn Brownlee_Mi...|
| 29| Male| Santa Kerfien| PF|182129| 56|Judie Chipps_Tary...|
| 28|Female|Mickey Cortright| DB|192537| 62|Ernest Rossbach_M...|
+---+------+----------------+------+------+-----+--------------------+
only showing top 20 rows
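Note that the read above already sets inferSchema=True, which is really what sub-question (2) asks for; with header=True alone every column would default to string. A sketch of the default behaviour, assuming the same local path as above:
dfs_default = spark.read.csv(r'file:\D:\juniortwo\spark\Spark2023-02-20\Others\StudentData.csv', header=True)
dfs_default.printSchema()   # without inferSchema, every column is read as string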
(2) Create a DataFrame object from Others\StudentData.csv (note whether it has a header row), infer the column types when creating it, then view the inferred types
dfs.printSchema()
root
|-- age: integer (nullable = true)
|-- gender: string (nullable = true)
|-- name: string (nullable = true)
|-- course: string (nullable = true)
|-- roll: integer (nullable = true)
|-- marks: integer (nullable = true)
|-- email: string (nullable = true)
(3) Change the type of the roll column to string
from pyspark.sql.types import StringType
dfs3 = dfs.withColumn('roll',dfs['roll'].cast(StringType()))
dfs3.printSchema()
root
|-- age: integer (nullable = true)
|-- gender: string (nullable = true)
|-- name: string (nullable = true)
|-- course: string (nullable = true)
|-- roll: string (nullable = true)
|-- marks: integer (nullable = true)
|-- email: string (nullable = true)
Experiment 3: In the Week 5 lab, the genome-scores.csv file was uploaded to HDFS on the virtual machine
(1) Create a DataFrame from the genome-scores.csv file on HDFS (note the header) and infer the column types, then view the inferred types
df = spark.read.csv("hdfs://192.168.110.128:9000/input/genome-scores.csv", header=True, inferSchema=True)
df.printSchema()
root
|-- movieId: integer (nullable = true)
|-- tagId: integer (nullable = true)
|-- relevance: double (nullable = true)
(2) Use show() to view the first 10 rows of the DataFrame
df.show(10)
+-------+-----+--------------------+
|movieId|tagId| relevance|
+-------+-----+--------------------+
| 1| 1|0.029000000000000026|
| 1| 2|0.023749999999999993|
| 1| 3| 0.05425000000000002|
| 1| 4| 0.06874999999999998|
| 1| 5| 0.15999999999999998|
| 1| 6| 0.19524999999999998|
| 1| 7| 0.07600000000000001|
| 1| 8| 0.252|
| 1| 9| 0.22749999999999998|
| 1| 10| 0.02400000000000002|
+-------+-----+--------------------+
only showing top 10 rows
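As a follow-up, the round() technique from Experiment 1 (6) applies here as well, e.g. trimming the long relevance decimals for display:
from pyspark.sql.functions import round
df.withColumn('relevance', round('relevance', 2)).show(10)   # same data, relevance kept to two decimal places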