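A DataFrame is a data structure that stores data as a two-dimensional table. In Spark SQL it is also a distributed dataset: it is partitioned and can be computed in parallel.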
Structure level:
Data level:
Similarities and differences between DataFrame and RDD:
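A DataFrame can be created from an RDD, because both are distributed datasets. The conversion only changes the internal storage structure into a two-dimensional table. The createDataFrame method of the SparkSession object converts an RDD into a DataFrame.
Code example: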
# Sample of the data
'''
Michael, 29
Andy, 30
Justin, 19
Bob, 23
Thomas, 35
Alice, 27
'''
# coding: utf8
from pyspark.sql import SparkSession

if __name__ == '__main__':
    ss = SparkSession.builder \
        .appName("test") \
        .master("local[*]") \
        .getOrCreate()
    sc = ss.sparkContext

    # Convert an RDD into a DataFrame
    rdd_file = sc.textFile("../Data/input/sql/people.txt")
    rdd_split = rdd_file.map(lambda line: line.split(",")) \
        .map(lambda x: (x[0], int(x[1])))

    # Build the DataFrame object
    # Argument 1: the RDD to convert
    # Argument 2: the column names, given as a list of strings in column order
    df = ss.createDataFrame(rdd_split, schema=['name', 'age'])

    # Print the table structure
    df.printSchema()

    # Print the table
    # Argument 1: how many rows to show; defaults to 20 if omitted
    # Argument 2: whether to truncate columns; if True (the default), values longer
    # than 20 characters are cut off and end in ...; if False, values are shown in full
    df.show(20, False)

    # Register the DataFrame as a temporary view so it can be queried with SQL
    df.createOrReplaceTempView("stu")
    ss.sql("SELECT * FROM stu WHERE age < 30").show()
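The parsing done by the two map calls can be checked in plain Python, without starting Spark. This is a minimal sketch, assuming the file lines look like the sample data shown above; `parse_line` is a hypothetical helper name that does not appear in the original script.

```python
# Hypothetical helper that mirrors the two map() steps of the script above
def parse_line(line):
    fields = line.split(",")          # "Michael, 29" -> ["Michael", " 29"]
    return fields[0], int(fields[1])  # int() tolerates the leading space


sample_lines = ["Michael, 29", "Andy, 30", "Justin, 19"]
rows = [parse_line(line) for line in sample_lines]
print(rows)
```

Each resulting tuple becomes one row of the DataFrame, with the tuple positions matched to the column names in order.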
# coding: utf8
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, IntegerType

if __name__ == '__main__':
    ss = SparkSession.builder \
        .appName("test") \
        .master("local[*]") \
        .getOrCreate()
    sc = ss.sparkContext

    # Convert an RDD into a DataFrame
    rdd_file = sc.textFile("../Data/input/sql/people.txt")
    rdd_split = rdd_file.map(lambda line: line.split(",")) \
        .map(lambda x: (x[0], int(x[1])))

    # Describe the table structure with a StructType object
    schema = StructType().add("name", StringType(), nullable=False) \
        .add("age", IntegerType(), nullable=True)

    # Convert the RDD into a DataFrame using the StructType object
    df = ss.createDataFrame(rdd_split, schema=schema)
    df.printSchema()
    df.show()
# coding: utf8
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, IntegerType

if __name__ == '__main__':
    ss = SparkSession.builder \
        .appName("test") \
        .master("local[*]") \
        .getOrCreate()
    sc = ss.sparkContext

    # Convert an RDD into a DataFrame
    rdd_file = sc.textFile("../Data/input/sql/people.txt")
    rdd_split = rdd_file.map(lambda line: line.split(",")) \
        .map(lambda x: (x[0], int(x[1])))

    # toDF, option 1: pass the column names directly
    df1 = rdd_split.toDF(["name", "age"])
    df1.printSchema()
    df1.show()

    # toDF, option 2: pass a StructType
    schema = StructType().add("name", StringType(), nullable=False) \
        .add("age", IntegerType(), nullable=True)
    df2 = rdd_split.toDF(schema=schema)
    df2.printSchema()
    df2.show()
# coding: utf8
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, IntegerType
import pandas as pd

if __name__ == '__main__':
    ss = SparkSession.builder \
        .appName("test") \
        .master("local[*]") \
        .getOrCreate()
    sc = ss.sparkContext

    # Build a Spark SQL DataFrame from a pandas DataFrame
    pdf = pd.DataFrame({
        "pos": [1, 2, 3, 4, 5],
        "id": ['MATUMBAMAN', 'miCKe', 'Zai', 'Boxi', 'iNSaNiA'],
        "age": [27, 23, 25, 24, 28]
    })
    df = ss.createDataFrame(pdf)
    df.printSchema()
    df.show()

    df.createOrReplaceTempView("Team_Liquid_Players")
    ss.sql("SELECT id FROM Team_Liquid_Players WHERE pos = 1").show()
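# General syntax of the unified read API (pseudocode: pick one format)
# .option(...) is optional
# .schema(...) takes a StructType or a DDL string, e.g. .schema("name STRING, age INT")
ss.read.format("text|csv|json|parquet|orc|avro|jdbc|......") \
    .option("K", "V") \
    .schema(StructType | String) \
    .load("path of the file to read; the local file system and HDFS are supported")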
# coding: utf8
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, IntegerType

if __name__ == '__main__':
    ss = SparkSession.builder \
        .appName("test") \
        .master("local[*]") \
        .getOrCreate()
    sc = ss.sparkContext

    # The text data source reads each whole line as a single column.
    # By default the column is named value and has type String;
    # the StructType below renames it to info.
    schema = StructType().add("info", StringType(), nullable=True)
    df = ss.read.format("text") \
        .schema(schema=schema) \
        .load("../Data/input/sql/people.txt")
    df.printSchema()
    df.show()
# coding: utf8
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, IntegerType

if __name__ == '__main__':
    ss = SparkSession.builder \
        .appName("test") \
        .master("local[*]") \
        .getOrCreate()
    sc = ss.sparkContext

    # For JSON, .schema is usually unnecessary: the file carries the column
    # names, and Spark infers the column types (strings and numbers) from the values
    df = ss.read.format("json").load("../Data/input/sql/people.json")
    df.printSchema()
    df.show()
# coding: utf8
from pyspark.sql import SparkSession

if __name__ == '__main__':
    ss = SparkSession.builder \
        .appName("test") \
        .master("local[*]") \
        .getOrCreate()
    sc = ss.sparkContext

    # sep: the field separator
    # header: whether the first line is a header row
    # encoding: the file encoding
    # schema: the column names and data types
    df = ss.read.format("csv") \
        .option("sep", ";") \
        .option("header", True) \
        .option("encoding", "utf-8") \
        .schema("name STRING, age INT, job STRING") \
        .load("../Data/input/sql/people.csv")
    df.printSchema()
    df.show()
# coding: utf8
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, IntegerType

if __name__ == '__main__':
    ss = SparkSession.builder \
        .appName("test") \
        .master("local[*]") \
        .getOrCreate()
    sc = ss.sparkContext

    # Parquet files carry their own schema, so a plain load is enough
    df = ss.read.format("parquet").load("../Data/input/sql/users.parquet")
    df.printSchema()
    df.show()
Parquet is a columnar storage file format commonly used in Spark, similar to ORC in Hive. Both are column-oriented formats.
Differences between Parquet and ordinary files:
Parquet compared with ORC: