Lab Manual - Week 7: Spark DataFrame

Table of Contents

  • DataFrame
  • Lab Content
    • Lab 1
    • Lab 2
    • Lab 3

DataFrame

Official explanation:

  1. DataFrame = RDD[Person] - generics + Schema + SQL operations + optimizations
  2. Official wording: A DataFrame is a Dataset organized into named columns.
  3. In other words: a distributed dataset organized into columns, where each column has a name, a type, and values.

In plain terms:

  1. In Spark, a DataFrame is a distributed dataset built on top of RDDs. It is a special kind of RDD that works like a distributed table, similar to a two-dimensional table in a traditional database.
  2. The main difference between a DataFrame and an RDD is that the former carries schema metadata: every column of the two-dimensional table represented by the DataFrame has a name and a type.

Source link: https://blog.csdn.net/qq_17685725/article/details/123069407

DataFrame programming steps:

step 1: import the required modules

step 2: create a SparkSession object

step 3: read the data source through the SparkSession object to produce a DataFrame object

step 4: apply Transformation operations to the DataFrame (two ways)

way (1): methods provided by the DataFrame API

way (2): Spark SQL

step 5: apply Action operations to the DataFrame

# The following commented-out lines are optional, Windows-only setup: they point PySpark
# at a local Anaconda interpreter and install the notebook widgets extension if needed.
# import os
# os.environ["PYSPARK_PYTHON"] = "D:/develop/anaconda3/python.exe"
# os.environ["PYSPARK_DRIVER_PYTHON"] = "D:/develop/anaconda3/python.exe"
# !pip install widgetsnbextension
import findspark
findspark.init()                              # locate the local Spark installation
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()    # entry point for DataFrames and Spark SQL
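With the SparkSession available, steps 3 to 5 can be sketched on a small in-memory dataset. This is only an illustration; the demo data and names below (demo, demo_view, key, value) are made up and are not part of the lab data:

# step 3: create a DataFrame (here from in-memory data instead of an external file)
demo = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])

# step 4, way (1): Transformation via the DataFrame API
demo_api = demo.select("key", "value")

# step 4, way (2): Transformation via Spark SQL on a temporary view
demo.createOrReplaceTempView("demo_view")
demo_sql = spark.sql("select key, value from demo_view")

# step 5: Action operations actually trigger the computation
demo_api.show()
demo_sql.show()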

Lab Content

Lab 1

Lab 1: Given scores = [("Tom", 89, 80, 77), ("Mike", 68, 73, 90), ("Rose", 88, 65, 70), ("Lucy", 56, 75, 86)]

(1) Create a DataFrame from scores and view the default column names and column types

scores = [("Tom", 89, 80, 77), ("Mike", 68, 73, 90), ("Rose", 88, 65, 70), ("Lucy", 56, 75, 86)]
df = spark.createDataFrame(scores)  
print(df.columns,"\n")
df.printSchema()
['_1', '_2', '_3', '_4'] 

root
 |-- _1: string (nullable = true)
 |-- _2: long (nullable = true)
 |-- _3: long (nullable = true)
 |-- _4: long (nullable = true)

(2) Create a DataFrame from scores with the column names name, hadoop_score, spark_score, nosql_score, then call show() to view the data

scores = [("Tom", 89, 80, 77), ("Mike", 68, 73, 90), ("Rose", 88, 65, 70), ("Lucy", 56, 75, 86)]
columns = ['name','hadoop_score', 'spark_score', 'nosql_score']
df1 = spark.createDataFrame(scores,columns)
df1.show()
+----+------------+-----------+-----------+
|name|hadoop_score|spark_score|nosql_score|
+----+------------+-----------+-----------+
| Tom|          89|         80|         77|
|Mike|          68|         73|         90|
|Rose|          88|         65|         70|
|Lucy|          56|         75|         86|
+----+------------+-----------+-----------+
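Besides a plain list of column names, createDataFrame() also accepts an explicit StructType, which fixes both the names and the types up front instead of letting Spark infer them. A minimal sketch (df1_typed is a name introduced here for illustration):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("hadoop_score", IntegerType(), True),
    StructField("spark_score", IntegerType(), True),
    StructField("nosql_score", IntegerType(), True),
])
df1_typed = spark.createDataFrame(scores, schema)
df1_typed.printSchema()   # the score columns come out as integer instead of long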

(3) View the number of partitions of the DataFrame from (2) and the data on each partition (hint: convert the DataFrame to an RDD first)

scores_rdd = df1.rdd
scores_rdd.getNumPartitions()
12
scores_rdd.glom().map(len).collect()
[0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1]
scores_rdd.glom().collect()
[[],
 [],
 [Row(name='Tom', hadoop_score=89, spark_score=80, nosql_score=77)],
 [],
 [],
 [Row(name='Mike', hadoop_score=68, spark_score=73, nosql_score=90)],
 [],
 [],
 [Row(name='Rose', hadoop_score=88, spark_score=65, nosql_score=70)],
 [],
 [],
 [Row(name='Lucy', hadoop_score=56, spark_score=75, nosql_score=86)]]
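The 12 partitions here typically reflect the default local parallelism (for example the number of local cores), which is why most of them are empty for only 4 rows. If needed, coalesce() (or repartition()) returns a new DataFrame with a chosen partition count; a small sketch, where the exact row distribution may vary:

df1_small = df1.coalesce(2)                       # pack the rows into 2 partitions
print(df1_small.rdd.getNumPartitions())           # 2
print(df1_small.rdd.glom().map(len).collect())    # e.g. [2, 2]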

(4) Select the name and spark_score columns of the DataFrame from (2) to build a new DataFrame, and call show() to view the data (use three methods)

Method 1: DataFrame API: select()

df_select = df1.select('name','spark_score')
df_select.show()
+----+-----------+
|name|spark_score|
+----+-----------+
| Tom|         80|
|Mike|         73|
|Rose|         65|
|Lucy|         75|
+----+-----------+

Method 2: DataFrame API: drop()

df_select = df1.drop('hadoop_score','nosql_score')
df_select.show()
+----+-----------+
|name|spark_score|
+----+-----------+
| Tom|         80|
|Mike|         73|
|Rose|         65|
|Lucy|         75|
+----+-----------+

Method 3: Spark SQL

df1.createOrReplaceTempView('df_views')
df_select = spark.sql("select name,spark_score from df_views")
df_select.show()
+----+-----------+
|name|spark_score|
+----+-----------+
| Tom|         80|
|Mike|         73|
|Rose|         65|
|Lucy|         75|
+----+-----------+
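select() also accepts Column objects built with col(), which becomes useful once you want to transform or alias a column in the same step. A small sketch equivalent to the three methods above (df_select_col is a name introduced here):

from pyspark.sql.functions import col

df_select_col = df1.select(col("name"), col("spark_score"))
df_select_col.show()    # same two-column result as above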

(5) Rename the name column of the DataFrame from (2) to student_name (use two methods)

Method 1: DataFrame API: withColumnRenamed()

df1_rename = df1.withColumnRenamed('name','student_name')
df1_rename.show()
+------------+------------+-----------+-----------+
|student_name|hadoop_score|spark_score|nosql_score|
+------------+------------+-----------+-----------+
|         Tom|          89|         80|         77|
|        Mike|          68|         73|         90|
|        Rose|          88|         65|         70|
|        Lucy|          56|         75|         86|
+------------+------------+-----------+-----------+

Method 2: Spark SQL

df1.createOrReplaceTempView('df_view')
df_sql = spark.sql("select name as student_name, hadoop_score, spark_score, nosql_score from df_view")
df_sql.show()
+------------+------------+-----------+-----------+
|student_name|hadoop_score|spark_score|nosql_score|
+------------+------------+-----------+-----------+
|         Tom|          89|         80|         77|
|        Mike|          68|         73|         90|
|        Rose|          88|         65|         70|
|        Lucy|          56|         75|         86|
+------------+------------+-----------+-----------+
df1.createOrReplaceTempView('df_view')
df_sql = spark.sql("select name as student_name from df_view")
df_sql.show()
+------------+
|student_name|
+------------+
|         Tom|
|        Mike|
|        Rose|
|        Lucy|
+------------+
# note: the original df1 is unchanged, since renaming always returns a new DataFrame
df1.show()
+----+------------+-----------+-----------+
|name|hadoop_score|spark_score|nosql_score|
+----+------------+-----------+-----------+
| Tom|          89|         80|         77|
|Mike|          68|         73|         90|
|Rose|          88|         65|         70|
|Lucy|          56|         75|         86|
+----+------------+-----------+-----------+
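To rename more than one column, withColumnRenamed() calls can simply be chained, or toDF() can replace all column names positionally. A minimal sketch (the shortened names used here are just for illustration):

# chained renames: each call returns a new DataFrame
df_renamed = (df1
              .withColumnRenamed("name", "student_name")
              .withColumnRenamed("hadoop_score", "hadoop"))

# toDF() replaces every column name at once, by position
df_renamed2 = df1.toDF("student_name", "hadoop", "spark", "nosql")
df_renamed2.printSchema()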

(6) From the three score columns of the DataFrame in (2), create new columns total_scores and avg_scores holding the sum and the average of the three scores respectively (round the average to two decimal places) (use two methods)

Method 1: DataFrame API: withColumn()

df2 = df1.withColumn('total_scores',df1.hadoop_score + df1.spark_score + df1.nosql_score)
df2 = df2.withColumn('avg_scores',(df1.hadoop_score + df1.spark_score + df1.nosql_score)/3)
df2.show()
+----+------------+-----------+-----------+------------+-----------------+
|name|hadoop_score|spark_score|nosql_score|total_scores|       avg_scores|
+----+------------+-----------+-----------+------------+-----------------+
| Tom|          89|         80|         77|         246|             82.0|
|Mike|          68|         73|         90|         231|             77.0|
|Rose|          88|         65|         70|         223|74.33333333333333|
|Lucy|          56|         75|         86|         217|72.33333333333333|
+----+------------+-----------+-----------+------------+-----------------+
df2.printSchema()
root
 |-- name: string (nullable = true)
 |-- hadoop_score: long (nullable = true)
 |-- spark_score: long (nullable = true)
 |-- nosql_score: long (nullable = true)
 |-- total_scores: long (nullable = true)
 |-- avg_scores: double (nullable = true)
from pyspark.sql.functions import col
from pyspark.sql.functions import round, format_number
# Be sure to import round, format_number and col first, otherwise the calls below fail.
# Note that this round shadows Python's built-in round() in the current session.
# df2 = df2.withColumn('avg_scores', round(col('avg_scores'), 2))   # equivalent form using col()
df2 = df2.withColumn('avg_scores', round('avg_scores', 2))
df2.show()
+----+------------+-----------+-----------+------------+----------+
|name|hadoop_score|spark_score|nosql_score|total_scores|avg_scores|
+----+------------+-----------+-----------+------------+----------+
| Tom|          89|         80|         77|         246|      82.0|
|Mike|          68|         73|         90|         231|      77.0|
|Rose|          88|         65|         70|         223|     74.33|
|Lucy|          56|         75|         86|         217|     72.33|
+----+------------+-----------+-----------+------------+----------+

Method 2: Spark SQL

df1.createOrReplaceTempView('df3_view')
df3 = spark.sql("select name, hadoop_score, spark_score, nosql_score, (hadoop_score + spark_score + nosql_score) as total_scores, round((hadoop_score + spark_score + nosql_score)/3, 2) as avg_scores from df3_view")
df3.show()
+----+------------+-----------+-----------+------------+----------+
|name|hadoop_score|spark_score|nosql_score|total_scores|avg_scores|
+----+------------+-----------+-----------+------------+----------+
| Tom|          89|         80|         77|         246|      82.0|
|Mike|          68|         73|         90|         231|      77.0|
|Rose|          88|         65|         70|         223|     74.33|
|Lucy|          56|         75|         86|         217|     72.33|
+----+------------+-----------+-----------+------------+----------+
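As noted above, `from pyspark.sql.functions import round` shadows Python's built-in round(). A common alternative is to import the functions module under an alias such as F; a sketch of method 1 written that way (df2_alt and total are names introduced here):

import pyspark.sql.functions as F

total = df1.hadoop_score + df1.spark_score + df1.nosql_score
df2_alt = (df1
           .withColumn("total_scores", total)
           .withColumn("avg_scores", F.round(total / 3, 2)))
df2_alt.show()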

(7) Change the types of all three score columns of the DataFrame in (2) to integer (use two methods)

Method 1: DataFrame API: withColumn()

df1.printSchema()
root
 |-- name: string (nullable = true)
 |-- hadoop_score: long (nullable = true)
 |-- spark_score: long (nullable = true)
 |-- nosql_score: long (nullable = true)
from pyspark.sql.types import IntegerType
# Be sure to import IntegerType before casting
df7 = df1.withColumn('hadoop_score',df1['hadoop_score'].cast(IntegerType()))
df7 = df7.withColumn('spark_score',df1['spark_score'].cast(IntegerType()))
df7 = df7.withColumn('nosql_score',df1['nosql_score'].cast(IntegerType()))
df7.printSchema()
root
 |-- name: string (nullable = true)
 |-- hadoop_score: integer (nullable = true)
 |-- spark_score: integer (nullable = true)
 |-- nosql_score: integer (nullable = true)

Method 2: Spark SQL

df1.createOrReplaceTempView('df7_view')
df7_2 = spark.sql("select name, cast(hadoop_score as integer),cast(spark_score as integer),cast(nosql_score as integer) from df7_view")
df7_2.printSchema()
root
 |-- name: string (nullable = true)
 |-- hadoop_score: integer (nullable = true)
 |-- spark_score: integer (nullable = true)
 |-- nosql_score: integer (nullable = true)
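cast() also accepts the type name as a string, so the three conversions can be written as a short loop instead of one withColumn() call per column; a minimal sketch (df7_loop is a name introduced here):

df7_loop = df1
for c in ["hadoop_score", "spark_score", "nosql_score"]:
    df7_loop = df7_loop.withColumn(c, df7_loop[c].cast("integer"))
df7_loop.printSchema()    # all three score columns are now integer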

(8) From the name column of the DataFrame in (2) and the sum of the other three columns, create a new DataFrame with the columns renamed to stuName and totalScores (use two methods)

Method 1: DataFrame API: selectExpr()

# reuse df2 from (6), which already contains the total_scores column
df8 = df2.selectExpr("name as stuName","total_scores as totalScores")
df8.show()
+-------+-----------+
|stuName|totalScores|
+-------+-----------+
|    Tom|        246|
|   Mike|        231|
|   Rose|        223|
|   Lucy|        217|
+-------+-----------+

Method 2: Spark SQL

df2.createOrReplaceTempView("df8_view")
df8_2 = spark.sql("select name as stuName,total_scores as totalScores from df8_view")
df8_2.show()
+-------+-----------+
|stuName|totalScores|
+-------+-----------+
|    Tom|        246|
|   Mike|        231|
|   Rose|        223|
|   Lucy|        217|
+-------+-----------+

Lab 2

Lab 2: Given Others\StudentData.csv

(1) Create a DataFrame from Others\StudentData.csv (note whether it has a header row), then view the DataFrame data and the default column types

dfs = spark.read.csv(r'file:\D:\juniortwo\spark\Spark2023-02-20\Others\StudentData.csv',header = True, inferSchema = True)
dfs.show()
+---+------+----------------+------+------+-----+--------------------+
|age|gender|            name|course|  roll|marks|               email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB|  2984|   59|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   62|Margene Moores_Ma...|
| 28|  Male|  Celeste Lollis|    PF| 21267|   45|Jeannetta Golden_...|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|
| 28|  Male|  Margene Moores|   MVC| 52771|   32|Toshiko Hillyard_...|
| 28|  Male|     Neda Briski|   OOP| 61973|   69|Alberta Freund_El...|
| 28|Female|    Claude Panos| Cloud| 72409|   85|Sheryll Towler_Al...|
| 28|  Male|  Celeste Lollis|   MVC| 81492|   64|Nicole Harwood_Cl...|
| 29|  Male|  Cordie Harnois|   OOP| 92882|   51|Judie Chipps_Clem...|
| 29|Female|       Kena Wild|   DSA|102285|   35|Dustin Feagins_Ma...|
| 29|  Male| Ernest Rossbach|    DB|111449|   53|Maybell Duguay_Ab...|
| 28|Female|  Latia Vanhoose|    DB|122502|   27|Latia Vanhoose_Mi...|
| 29|Female|  Latia Vanhoose|   MVC|132110|   55|Eda Neathery_Nico...|
| 29|  Male|     Neda Briski|    PF|141770|   42|Margene Moores_Mi...|
| 29|Female|  Latia Vanhoose|    DB|152159|   27|Claude Panos_Sant...|
| 29|  Male|  Loris Crossett|   MVC|161771|   36|Mitzi Seldon_Jenn...|
| 29|  Male|  Annika Hoffman|   OOP|171660|   22|Taryn Brownlee_Mi...|
| 29|  Male|   Santa Kerfien|    PF|182129|   56|Judie Chipps_Tary...|
| 28|Female|Mickey Cortright|    DB|192537|   62|Ernest Rossbach_M...|
+---+------+----------------+------+------+-----+--------------------+
only showing top 20 rows
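The header=True option matters here: without it Spark treats the first line as data, names the columns _c0, _c1, ..., and, with inferSchema left at its default of False, reads every column as string. A quick sketch using the same path to see the difference (dfs_raw and csv_path are names introduced here):

csv_path = r'file:\D:\juniortwo\spark\Spark2023-02-20\Others\StudentData.csv'
dfs_raw = spark.read.csv(csv_path)    # no header, no schema inference
print(dfs_raw.columns)                # ['_c0', '_c1', ..., '_c6']
dfs_raw.printSchema()                 # every column is string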

(2) Create a DataFrame from Others\StudentData.csv (note whether it has a header row), inferring the column types while creating the DataFrame object, then view the inferred types

# inferSchema=True was already passed when dfs was created in (1), so the types below are the inferred ones
dfs.printSchema()
root
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- name: string (nullable = true)
 |-- course: string (nullable = true)
 |-- roll: integer (nullable = true)
 |-- marks: integer (nullable = true)
 |-- email: string (nullable = true)

(3) Change the type of the roll column to string

from pyspark.sql.types import StringType
dfs3 = dfs.withColumn('roll',dfs['roll'].cast(StringType()))
dfs3.printSchema()
root
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- name: string (nullable = true)
 |-- course: string (nullable = true)
 |-- roll: string (nullable = true)
 |-- marks: integer (nullable = true)
 |-- email: string (nullable = true)

Lab 3

Lab 3: In the Week 5 lab, the genome-scores.csv file was uploaded to HDFS on the virtual machine

(1) Create a DataFrame from genome-scores.csv on HDFS (note the header) and infer the column types at the same time. Then view the inferred types

df = spark.read.csv("hdfs://192.168.110.128:9000/input/genome-scores.csv", header=True, inferSchema=True)
df.printSchema()
root
 |-- movieId: integer (nullable = true)
 |-- tagId: integer (nullable = true)
 |-- relevance: double (nullable = true)
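For a large file on HDFS, inferSchema=True costs an extra pass over the data. When the layout is already known, an explicit schema can be passed instead; a sketch that mirrors the inferred types above (df_fixed and scores_schema are names introduced here):

from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

scores_schema = StructType([
    StructField("movieId", IntegerType(), True),
    StructField("tagId", IntegerType(), True),
    StructField("relevance", DoubleType(), True),
])
df_fixed = spark.read.csv("hdfs://192.168.110.128:9000/input/genome-scores.csv",
                          header=True, schema=scores_schema)
df_fixed.printSchema()    # same schema as above, without the inference pass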

(2) Use the show() method to view the first 10 rows of the DataFrame

df.show(10)
+-------+-----+--------------------+
|movieId|tagId|           relevance|
+-------+-----+--------------------+
|      1|    1|0.029000000000000026|
|      1|    2|0.023749999999999993|
|      1|    3| 0.05425000000000002|
|      1|    4| 0.06874999999999998|
|      1|    5| 0.15999999999999998|
|      1|    6| 0.19524999999999998|
|      1|    7| 0.07600000000000001|
|      1|    8|               0.252|
|      1|    9| 0.22749999999999998|
|      1|   10| 0.02400000000000002|
+-------+-----+--------------------+
only showing top 10 rows
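show() is one of the Action operations mentioned in step 5; count() is another. A tiny sketch (no output is reproduced here, since the row count depends on your copy of the file):

num_rows = df.count()    # Action: triggers a full job and returns the number of rows
print(num_rows)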
