Appendix: all attributes and methods of the SparkSQL DataFrame object (from the official docs)
Preface: at work, debugging programs in the interactive ${SPARK_HOME}/bin/pyspark environment is inconvenient. So instead, jupyter-lab + pyspark (the Python library, not the pyspark shell under the Spark installation directory) is used to connect to a YARN cluster for interactive distributed computation.
Environment: Jupyter (Python 3.9) + pyspark 3.1.3 + YARN (Hadoop 3.2.0)
So far there are two ways to use Jupyter in place of the ${SPARK_HOME}/bin/pyspark command-line interactive environment on Linux.
Option 1: # as described in the Spark docs
## Configure the driver-side Python as Jupyter, then start ${SPARK_HOME}/bin/pyspark
> export PYSPARK_DRIVER_PYTHON=/xx/anaconda3/bin/jupyter # directory containing the jupyter launcher
> export PYSPARK_DRIVER_PYTHON_OPTS=notebook # jupyter startup options, jupyter notebook mode
> export PYSPARK_DRIVER_PYTHON_OPTS='lab --allow-root' # jupyter-lab mode; pick one of the two
# Both settings can also be added directly to the ${SPARK_HOME}/bin/pyspark startup script
# Then start pyspark again
> ./bin/pyspark
Option 2: pyspark (the Python library) + Spark
## Important: the JupyterHub host is also a node of the Spark cluster, so pyspark (the library) can directly reuse the cluster's existing Spark configuration (executor count, memory, history-server settings, and so on)
# 1. Install a Python virtual environment and the pyspark library
# 1.1 Create the Python virtual environment for the Jupyter driver side (its version must match the one used by the executors, otherwise you get a "Python in worker has different version than that in driver" error)
conda create --name py3 python=x.x
# 1.2 Activate the py3 environment and generate the py3 Jupyter kernel
pip install pyspark==3.1.3 # the pyspark version must match the Spark version on the cluster
python -m ipykernel install --name <kernel name> --display-name <name shown in Jupyter> --prefix <kernel path prefix> (place it under the directory Jupyter reads kernels from, usually /x/anaconda3/share/jupyter/kernels/)
# 2. Create a Spark-on-YARN context with pyspark (switch Jupyter to the pyspark kernel created above)
import os
from pyspark import SparkContext,SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StringType,IntegerType,FloatType,ArrayType
import pyspark.sql.functions as F
os.environ['HADOOP_CONF_DIR'] = '/data/app/hadoop-3.2.0'
os.environ['JAVA_HOME'] = '/data/app/jdk1.8.0_333/'
os.environ['SPARK_HOME'] = '/data/app/spark-3.1.3-bin-hadoop3.2' # Spark home directory on the cluster
os.environ['PYSPARK_PYTHON']='./py3/bin/python' # the Python environment archive uploaded via the archives option; can also be an absolute path (requires the same Python environment on every node)
# On Windows, if you hit "Python worker failed to connect back.", set the Python environment variable below
# os.environ['PYSPARK_PYTHON'] = r'D:\apps\python3.9\python.exe'
conf = SparkConf()
env = [
('spark.app.name','appName'), # Spark application name
('spark.master','yarn'), # Spark on YARN
# ('spark.submit.pyFiles','./test.txt') # ship dependent .py files before the job starts; in client mode they may not be found
('spark.submit.deployMode','client'), # deploy mode
('spark.yarn.dist.archives','hdfs://ip:8020/py3.zip#py3'), # Python environment archive distributed to the executors
]
conf.setAll(env)
sc = SparkContext(conf=conf)
sc.addPyFile('./funcs.py') # (local_file_path/hdfs/url) ship a single .py file to the executors; files can even be added after the job has started
# spark = SparkSession.builder.config(conf=conf).getOrCreate()  # SparkSession object
import funcs # module of user-defined functions
rdd1 = sc.textFile('./test.txt')
rdd2 = rdd1.flatMap(funcs.udf)
rdd3 = rdd2.map(lambda x:(x,1))
rdd4 = rdd3.reduceByKey(lambda a,b:a+b)
print(rdd4.collect())
sc.stop()
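For DataFrame work, the same conf can be used to build a SparkSession instead of a bare SparkContext. A minimal sketch, reusing the conf defined above (the sample data is made up for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext  # the underlying SparkContext is still available
df = spark.createDataFrame([(1, 'a'), (2, 'b')], schema=['id', 'name'])
df.show()
spark.stop()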
spark-submit arguments
# spark-submit is the tool Spark uses to submit all kinds of jobs (Python, R, Java, Scala).
# In fact, the ${SPARK_HOME}/bin/pyspark interactive shell also submits to the resource manager through spark-submit under the hood.
Notes on the client and cluster deploy modes
Initially, the Spark Python environment was configured like this:
vim spark-defaults.conf
spark.yarn.dist.archives=hdfs://***/***/***/env/python_env.zip#python_env
spark.pyspark.driver.python=./python_env/bin/python # Python used by the pyspark program's own functions and classes on the driver
spark.pyspark.python=./python_env/bin/python
Submitting with spark-submit in client mode fails with "Cannot run program "./python_env/bin/python": error=2, No such file or directory", while a cluster-mode submission runs normally.
Reason:
--archives: archives the job depends on.
--py-files: Python files the job depends on.
Dependencies passed this way can hit the same "file not found" problem; the cause should be the same.
After the driver starts, the dependencies are automatically uploaded to the executors and unpacked into their working directories (the upload and unpack steps are visible in the YARN logs).
In client mode the driver runs inside the local spark-submit process, where the archives are never uploaded and unpacked, so the relative Python path cannot be found.
In cluster mode, YARN first starts a dedicated container on one of its nodes to run the driver, and the archives are uploaded and unpacked while that container starts.
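One possible workaround, sketched here and not taken from the original notes: in client mode, point the driver at a local absolute interpreter on the submitting host (the path below is a placeholder) and keep the relative archive path for the executors only.
from pyspark import SparkConf

conf = SparkConf()
conf.setAll([
    ('spark.master', 'yarn'),
    ('spark.submit.deployMode', 'client'),
    ('spark.yarn.dist.archives', 'hdfs://ip:8020/py3.zip#py3'),
    ('spark.pyspark.python', './py3/bin/python'),                          # executors: relative path inside the unpacked archive
    ('spark.pyspark.driver.python', '/xx/anaconda3/envs/py3/bin/python'),  # driver: a local absolute path (placeholder)
])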
RDD (Resilient Distributed Dataset) is Spark's most basic data abstraction: an immutable, partitioned collection whose elements can be computed in parallel.
Created by parallelizing a collection (local list object --> distributed RDD)
Or by reading a local data source or HDFS files
# SparkContext对象创建
conf = SparkConf().setAppName('Spark core').setMaster('local[10]')
sc = SparkContext(conf=conf)
sc
# 对象详情
'''
SparkContext
Spark UI
Version v3.3.0
Master local[10]
AppName Spark core
'''
## RDD对象创建
# 1. 通过本地list对象创建
rdd1 = sc.parallelize(c=[1,2,3,4,5],numSlices=3)
print(rdd1.glom().collect()) # 收集所有分区,汇总到driver端显示数据
print(rdd1.getNumPartitions()) # 获取RDD的分区数量
# 输出
'''
[[1], [2, 3], [4, 5]]
3
'''
# 2. 通过读取文件创建
# 2.1 sc.textFile api
rdd1 = sc.textFile('./data',minPartitions=None)# 参数2:最小分区数,一般不指定,spark有自己的合理划分,
print(rdd1.collect()) # 读取路径下所有文件,每一行认为是一条记录
print(rdd1.getNumPartitions())
# 输出
'''
['hellow world', 'hollow python', 'hollow java']
3
'''
# 2.2 sc.wholeTextFiles api
rdd1 = sc.wholeTextFiles('./data',minPartitions=None)
print(rdd1.collect()) # 读取路径下所用文件,每个元素内容为2元组,k:文件路径,v:对应文件里所有内容
print(rdd1.getNumPartitions()) # 通常用于许多小文件的需求(small files are preferred,as each file will be loaded fully in memory)
# 输出
'''
[('file:/data/jupyter_lab/zyp/pyspark学习/data/wordcount1.txt', 'hellow world\nhollow python\nhollow java')]
1
'''
# 1. map: 对RDD内的每个元素进行map操作
rdd1 = sc.parallelize(range(10),3)
print(rdd1.collect())
print(rdd1.map(lambda x:x+1).collect())
# 输出:
'''
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
'''
# 2. flatMap:对RDD执行map操作后,再进行解除嵌套的操作
rdd1 = sc.parallelize([[1,2],[3,4],[5,6]])
print(rdd1.collect())
print(rdd1.flatMap(lambda x:x).collect())
# 输出:
'''
[[1, 2], [3, 4], [5, 6]]
[1, 2, 3, 4, 5, 6]
'''
# 3 mapValues: 对k-v型RDD中value进行map操作
rdd1 = sc.parallelize([('a',1),('a',1),('b',1),('b',1),('b',1)])
rdd1.mapValues(lambda x:x+1).collect()
# 输出
'''
[('a',2),('a',2),('b',2),('b',2),('b',2)]
'''
# 4. reduceByKey :针对K-V型RDD,自动按照key进行分组,然后根据提供的聚合逻辑,完成组内数据(value)的聚合操作,返回聚合后的K-V值
rdd1 = sc.parallelize([('a',1),('a',1),('b',1),('b',1),('b',1)])
print(rdd1.reduceByKey(lambda a,b:a+b).collect())
# 输出
'''
[('b', 3), ('a', 2)]
'''
# 5. groupBy :将RDD的数据根据指定规则进行分组,返回k-v型RDD(v:可迭代对象)
rdd1 = sc.parallelize([('a',1),('a',1),('b',1),('b',1),('b',1)])
print(rdd1.groupBy(lambda x:x[0]).collect())
# Output (each value is an iterable ResultIterable object; its repr is abbreviated here)
'''
[('b', <pyspark.resultiterable.ResultIterable object at 0x...>), ('a', <pyspark.resultiterable.ResultIterable object at 0x...>)]
'''
# 6. groupByKey: 针对KV型rdd,自动按照key进行分组(groupBy算子则没有此限定)
rdd1 = sc.parallelize([('a',1),('a',1),('b',1),('b',1),('b',1)])
rdd1.groupByKey().collect()
# Output
'''
[('b', <pyspark.resultiterable.ResultIterable object at 0x...>),
 ('a', <pyspark.resultiterable.ResultIterable object at 0x...>)]
'''
# 7. filter: 按给定规则对rdd中的数据进行过滤(和python filter高阶函数用法一致)
rdd1 = sc.parallelize([('a',1),('a',1),('b',1),('b',1),('b',1)])
rdd1.filter(lambda x:True if x[0] == 'a' else False).collect()
# 输出
'''
[('a', 1), ('a', 1)]
'''
# 8. distinct:对RDD数据进行去重,返回新的RDD(k-v型数据也可以去重)
rdd1 = sc.parallelize([('a',1),('a',1),('b',1),('b',1),('b',1)])
rdd1.distinct().collect()
# 输出
'''
[('b', 1), ('a', 1)]
'''
# 9. union: 将2个rdd合并成1个rdd
rdd1 = sc.parallelize([('b',1),('b',1),('b',1)])
rdd2 = sc.parallelize([('a',1),('a',1)])
rdd2.union(rdd1).collect()
# 输出
'''
[('a', 1), ('a', 1), ('b', 1), ('b', 1), ('b', 1)]
'''
# 10. intersection: 求2个rdd的交集
rdd1 = sc.parallelize(range(10))
rdd2 = sc.parallelize(range(5))
rdd1.intersection(rdd2).collect()
# 输出
'''
[0, 1, 2, 3, 4]
'''
# 11. join: join two RDDs; the data must be K-V pairs (equivalent to an SQL inner join)
rdd1 = sc.parallelize([('name','张三'),('sex','男'),('age',19),('love','足球')])
rdd2 = sc.parallelize([('name','李四'),('sex','女'),('age',12)])
print(rdd1.join(rdd2).collect())
# 输出
'''
[('name', ('张三', '李四')), ('sex', ('男', '女')), ('age', (19, 12))]
'''
# 12. leftOuterJoin:左外连接 ;rightOuterJoin:右外连接
rdd1 = sc.parallelize([('name','张三'),('sex','男'),('age',19),('love','足球')])
rdd2 = sc.parallelize([('name','李四'),('sex','女'),('age',12)])
rdd1.leftOuterJoin(rdd2).collect()
# 输出
'''
[('name', ('张三', '李四')),
('sex', ('男', '女')),
('age', (19, 12)),
('love', ('足球', None))]
'''
# 13. glom: 将rdd的数据,加上嵌套,这个嵌套按照分区来进行
rdd1 = sc.parallelize(range(10),3)
rdd1.glom().collect()
# 输出
'''
[[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
'''
# 14 sortBy:对rdd数据按照指定规则进行排序
# 语法:rdd.sortBy(func,ascending=False,numPartitions=1) func: 排序规则定义;ascending:False降序;numPartitions:排序后的分区
rdd1 = sc.parallelize(range(10),3)
print(rdd1.glom().collect())
print(rdd1.sortBy(lambda x:x,ascending=False,numPartitions=2).glom().collect())
# 输出
'''
[[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
[[9, 8, 7, 6], [5, 4, 3, 2, 1, 0]]
'''
# 15. sortByKey:针对KV型rdd,按照key进行排序
# 语法:rdd.sortByKey(ascending=True,numPartitions=1,keyfunc) keyfunc:在排序前对key进行处理
rdd1 = sc.parallelize([('a',1),('d',1),('c',2),('b',1),('b',1),('b',1)])
rdd1.sortByKey().collect()
# 输出
'''
[('a', 1), ('b', 1), ('b', 1), ('b', 1), ('c', 2), ('d', 1)]
'''
# 1. collect: gather the data of every partition into the driver as a single list (careful: a large dataset can overwhelm driver memory)
rdd1 = sc.parallelize([1,2,3,4,5],2)
type(rdd1.glom().collect()) >> list
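When the data is too large to collect safely, rdd.toLocalIterator() is one alternative (a sketch, not from the original notes): it streams one partition at a time to the driver instead of materializing everything at once.
rdd1 = sc.parallelize(range(10), 2)
for x in rdd1.toLocalIterator():  # only one partition is held in the driver at a time
    print(x)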
# 2. count : 统计rdd有多少元素,返回一个数值
rdd1 = sc.parallelize([('a',1),('d',1),('c',2),('b',1),('b',1),('b',2)])
print(rdd1.count()) >> 6
# 3. countByKey: count how many times each key appears (for K-V rdds)
rdd1 = sc.parallelize([('a',1),('d',1),('c',2),('b',1),('b',1),('b',2)])
print(rdd1.countByKey())
# Output
'''
defaultdict(<class 'int'>, {'a': 1, 'd': 1, 'c': 1, 'b': 3})
'''
# 4. reduce:对rdd数据集按照规则进行聚集
rdd1 = sc.parallelize([1,2,3,4,5],2)
rdd1.reduce(lambda a,b :a+b) >> 15
# 5. fold: 和reduce一样,对rdd数据集进行聚合,只不过带有初始值(初始值作用在:分区内聚合,分区间聚合)
rdd1 = sc.parallelize([1,2,3,4,5],2)
rdd1.fold(zeroValue=10,op=lambda a,b:a+b) >> =10+(10+1+2)+(10+3+4+5)=45
# 6. first:取出rdd的第一个元素
rdd1 = sc.parallelize([1,2,3,4,5],2)
print(rdd1.first()) >> 1
# 7. take(N): 取出rdd的前N个元素,组成list返回
print(rdd1.take(3)) >> [1,2,3]
# 8. top(N): 对rdd数据集进行降序排序后,取出前n个
print(rdd1.top(3)) >> [5, 4, 3]
# 9. takeSample: 随机抽样RDD的数据
# 语法:takeSample(withReplacement:True/False(是否可以重复抽取),num:抽样数,seed: 随机种子)
rdd1 = sc.parallelize(range(100),5)
rdd1.takeSample(False,10)
# 输出
'''
[85, 17, 40, 80, 12, 63, 70, 96, 43, 33]
'''
# 10. takeOrdered: 对rdd进行排序取前n个(与top类似,但可以指定排序规则)
rdd1 = sc.parallelize((1,3,6,7,3,4,5,8,2,0))
rdd1.takeOrdered(num=3,key=lambda x:x) >> [0,1,2]
# 11. foreach: 对rdd每一个元素,执行相同操作,类似map,但是没有返回值
rdd1 = sc.parallelize(range(5),2)
rdd1.foreach(lambda x: print(x))
# 12. saveAsTextFile: 将rdd数据写入文本文件(支持本地,hdfs等)
rdd1.saveAsTextFile(path='./data/2222') # 路径指定不存在文件夹
## Note: foreach and saveAsTextFile do not send their results back to the driver; the operations run directly on the workers holding each partition
# 1. mapPartitions: 与map一样,只不过迭代的是一个个整体数据分区
rdd1 = sc.parallelize([1,2,3,4,5,6,7],3)
print(rdd1.collect())
def f(x): yield sorted(x,reverse=True)
print(rdd1.mapPartitions(f).glom().collect())
# 输出
'''
[1, 2, 3, 4, 5, 6, 7]
[[[2, 1]], [[4, 3]], [[7, 6, 5]]]
'''
# 2. foreachPartition:没有返回值的mapPartitions,且执行的数据结果不返回driver
rdd1 = sc.parallelize([1,2,3,4,5,6,7],3)
print(rdd1.collect())
rdd1.foreachPartition(lambda x:print(x))
# 输出
'''
[1, 2, 3, 4, 5, 6, 7]
'''
# 3. partitionBy: 对rdd进行自定义分区(K-V型数据)
rdd1 = sc.parallelize([1,2,3,4,5,6,7],3).map(lambda x:(x,x))
print(rdd1.glom().collect())
rdd1.partitionBy(2,lambda x: 0 if x>3 else 1).glom().collect() # 参数1:重新分区数目;参数2:每个元素分区编号
# 输出
'''
[[(1, 1), (2, 2)], [(3, 3), (4, 4)], [(5, 5), (6, 6), (7, 7)]]
[[(4, 4), (5, 5), (6, 6), (7, 7)], [(1, 1), (2, 2), (3, 3)]]
'''
# 4. repartition: change only the number of partitions (extra repartitioning adds shuffles, so prefer fewer partitions; usually left untouched)
rdd1 = sc.parallelize([1,2,3,4,5,6,7],3)
print(rdd1.glom().collect())
print(rdd1.repartition(2).glom().collect())
# 输出
'''
[[1, 2], [3, 4], [5, 6, 7]]
[[1, 2, 5, 6, 7], [3, 4]]
'''
# 5. coalesce: decrease (or, with shuffle=True, increase) the number of partitions
# rdd1.coalesce(numPartitions: new partition count, shuffle: True/False - whether increasing the partition count is allowed)
rdd1 = sc.parallelize([1,2,3,4,5,6,7],3)
print(rdd1.glom().collect())
print(rdd1.coalesce(2).glom().collect()) # repartition(n) is equivalent to coalesce(n, shuffle=True)
# 输出
'''
[[1, 2], [3, 4], [5, 6, 7]]
[[1, 2], [3, 4, 5, 6, 7]]
'''
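As noted above, coalesce can also grow the partition count, but only with shuffle=True; a quick sketch:
rdd1 = sc.parallelize([1,2,3,4,5,6,7],3)
print(rdd1.coalesce(5).getNumPartitions())                # still 3: without a shuffle, coalesce can only shrink
print(rdd1.coalesce(5, shuffle=True).getNumPartitions())  # 5: equivalent to repartition(5)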
# cache()/persist(): caching operators
## 1. Before using the cache
rdd1 = sc.parallelize([1])
# Declare an accumulator to count how many times the intermediate rdd is evaluated
value = sc.accumulator(0)
def f(x):
    global value # when a task reaches this point, it copies value from the driver
    value += 1
    return x
rdd2 = rdd1.map(f)
rdd2.count()
rdd3 = rdd2.map(lambda x:x+1)
rdd3.collect()
print(f'Without caching, rdd2 was evaluated {value} time(s) while computing rdd3')
# Output
'''
Without caching, rdd2 was evaluated 2 time(s) while computing rdd3
'''
## 2. With the cache
rdd1 = sc.parallelize([1])
# Declare an accumulator to count how many times the intermediate rdd is evaluated
value = sc.accumulator(0)
def f(x):
    global value # when a task reaches this point, it copies value from the driver
    value += 1
    return x
rdd2 = rdd1.map(f)
rdd2.cache()
rdd2.count() # cache is not an action operator; the count() here triggers the actual caching
rdd3 = rdd2.map(lambda x:x+1)
rdd3.collect()
print(f'With rdd2 cached, rdd2 was evaluated {value} time(s) while computing rdd3')
# Output
'''
With rdd2 cached, rdd2 was evaluated 1 time(s) while computing rdd3
'''
rdd2.unpersist() # clear the cache
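persist() behaves like cache() but lets the storage level be chosen explicitly; a minimal sketch, not from the original notes:
from pyspark import StorageLevel

rdd2 = sc.parallelize(range(5)).map(lambda x: x * 2)
rdd2.persist(StorageLevel.MEMORY_AND_DISK)  # spill to disk when memory is insufficient
rdd2.count()                                # an action triggers the actual caching
rdd2.unpersist()                            # release the cached data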
Broadcast variables (for scenarios where a local list object interacts with an RDD)
An Executor is a process, and resources are shared within a process. Wrapping the local list as a broadcast variable makes Spark send only one copy of the list to each executor; the multiple task threads inside an executor then share that copy, instead of each task requesting its own copy from the driver, which saves memory.
Accumulators (declare a "global" variable for distributed RDD computation)
acmlt = sc.accumulator(init_value) # define an accumulator; updates made by every executor are merged into it
## 1. Without an accumulator
rdd1 = sc.parallelize(range(5),5)
init_value = 0
# Increment init_value inside the tasks
def f(x):
    global init_value
    init_value += 1
    return x
rdd2 = rdd1.map(f)
print(rdd2.collect())
print(f'init_value without an accumulator: {init_value}') # the increments made inside each executor are never sent back to the driver's init_value
# Output
'''
[0, 1, 2, 3, 4]
init_value without an accumulator: 0
'''
## 2. With an accumulator
rdd1 = sc.parallelize(range(5),5)
init_value = 0
# Declare the accumulator
init_value = sc.accumulator(init_value)
# Increment init_value inside the tasks
def f(x):
    global init_value
    init_value += 1
    return x
rdd2 = rdd1.map(f)
print(rdd2.collect())
print(f'init_value with an accumulator: {init_value}') # the increments made inside every executor are shared through the accumulator
# Output
'''
[0, 1, 2, 3, 4]
init_value with an accumulator: 5
'''
## 1. Without a broadcast variable
local_list = dict([(1,'小明'),(2,'小红')])
rdd1 = sc.parallelize([(1,98),(2,99)],2)
# Replace ids with names
def f(x):
    name = ''
    if x[0] in local_list:
        name = local_list.get(x[0])
    return name,x[1]
rdd2 = rdd1.map(f)
print(rdd2.collect()) # the program also works without a broadcast variable, but every task has to fetch its own copy of local_list
# Output
'''
[('小明', 98), ('小红', 99)]
'''
## 2. With a broadcast variable
local_list = dict([(1,'小明'),(2,'小红')])
# Declare the broadcast variable
local_broadcast = sc.broadcast(local_list)
rdd1 = sc.parallelize([(1,98),(2,99)],2)
# Replace ids with names
def f(x):
    name = ''
    # use the broadcast variable
    if x[0] in local_broadcast.value:
        name = local_broadcast.value.get(x[0])
    return name,x[1]
rdd2 = rdd1.map(f)
print(rdd2.collect()) # with the broadcast variable, the task threads inside each executor share one copy of local_list
# Output
'''
[('小明', 98), ('小红', 99)]
'''
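When a broadcast variable is no longer needed it can be released explicitly; a short sketch:
local_broadcast.unpersist()  # drop the cached copies on the executors (re-broadcast on next use)
local_broadcast.destroy()    # release all resources; the broadcast variable cannot be used afterwards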
# 1. 导包
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StringType,IntegerType,FloatType,ArrayType
import pyspark.sql.functions as F # DataFrame 函数包 (F包中函数输入column对象,返回一个column对象)
import pandas as pd
import numpy as np
# 2. 添加 java 环境(使用python类库pyspark)
import os
os.environ['JAVA_HOME'] = '/data/app/jdk1.8.0_333/'
# 3.构建SparkSession对象
spark = SparkSession.builder.appName('test').getOrCreate()
## DataFrame 构建
# 1. 基于RDD进行构建
# 1.1 使用 spark.createDataFrame(rdd,schema=)创建
rdd = spark.sparkContext.textFile('./data/students_score.txt')
rdd = rdd.map(lambda x:x.split(',')).map(lambda x:[int(x[0]),x[1],int(x[2])])
print(rdd.collect())
'''[[11, '张三', 87], [22, '李四', 67], [33, '王五', 79]]'''
# 方式1:schema 只指定列名,类型靠推断,是否允许为空默认是True
df = spark.createDataFrame(data=rdd,schema=['id','name','score'])
df.show() # 默认展示前20行数据
df.printSchema() # 查看表结构
'''
+---+----+-----+
| id|name|score|
+---+----+-----+
| 11|张三| 87|
| 22|李四| 67|
| 33|王五| 79|
+---+----+-----+
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- score: long (nullable = true)
'''
# 方式2:schema 指定为 StructType表结构对象
schema = StructType()\
.add(field='id',data_type=IntegerType(),nullable=True)\
.add(field='name',data_type=StringType(),nullable=True)\
.add(field='score',data_type=IntegerType(),nullable=False)
df = spark.createDataFrame(data=rdd,schema=schema)
df.show() # 默认展示前20行数据
df.printSchema() # 查看表结构
'''
+---+----+-----+
| id|name|score|
+---+----+-----+
| 11|张三| 87|
| 22|李四| 67|
| 33|王五| 79|
+---+----+-----+
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- score: integer (nullable = false)
'''
# 1.2 rdd.toDF() 创建
rdd = spark.sparkContext.textFile('./data/students_score.txt')
rdd = rdd.map(lambda x:x.split(',')).map(lambda x:[int(x[0]),x[1],int(x[2])])
print(rdd.collect())
df = rdd.toDF(schema=['id','name','score']) # schema 同样可以只填列名list或structType对象
df.show() # 默认展示前20行数据
df.printSchema() # 查看表结构
'''
[[11, '张三', 87], [22, '李四', 67], [33, '王五', 79]]
+---+----+-----+
| id|name|score|
+---+----+-----+
| 11|张三| 87|
| 22|李四| 67|
| 33|王五| 79|
+---+----+-----+
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- score: long (nullable = true)
'''
# 2. 基于pandas df进行构建:将pandas的dataFrame对象转变为分布式的dataset
pd_data = pd.DataFrame({'id':[1,2,3],'name':['张三','李四','王五'],
'score':[65,35,89]})
df = spark.createDataFrame(pd_data)
df.printSchema()
df.show()
'''
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- score: long (nullable = true)
+---+----+-----+
| id|name|score|
+---+----+-----+
| 1|张三| 65|
| 2|李四| 35|
| 3|王五| 89|
+---+----+-----+
'''
# 3. Build a DataFrame by reading data files
# Option 1: the unified read API
# Usage:
''' sparksession.read.format('text|csv|json|parquet|orc|jdbc|...')\
.option('k','v')\ # read options, e.g. sep for csv, or the database connection parameters for jdbc
.schema(string|StructType object)\ # string form: "id INT,name STRING,score INT"
.load(localpath|hdfs)|.csv()'''
# Option 2: call the reader for a specific file type directly, e.g. sparksession.read.csv(path)
# 3.1 Reading a text file puts each line into a single column named value by default; use schema to rename it
df = spark.read.format('text')\
.schema("data_value STRING")\
.load('./test.txt')
print(f'读取txt文件方式1: ')
df.show()
df = spark.read.schema("data_value STRING").text('./test.txt',wholetext=False)
print(f'读取txt文件方式2: ')
df.show()
'''
读取txt文件方式1:
+-------------+
| data_value|
+-------------+
| hellow world|
|hellow python|
| hellow java|
+-------------+
读取txt文件方式2:
+-------------+
| data_value|
+-------------+
| hellow world|
|hellow python|
| hellow java|
+-------------+
'''
# 3.2 读取json文件,本身带有字段信息,可以不用写schema
df = spark.read.format('json')\
.load('./data/test_data/test_data/sql/people.json')
print(f'读取json文件方式1: ')
df.show()
df = spark.read.json('./data/test_data/test_data/sql/people.json')
print(f'读取json文件方式2: ')
df.show()
'''
读取json文件方式1:
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
读取json文件方式2:
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
'''
# 3.3 读取csv文件,表格数据,需指定分隔符,表头等参数
option_dict = {'sep':';','header':True,'encoding':'utf-8'}
df = spark.read.format('csv')\
.options(**option_dict)\
.load('./data/people.csv')
print(f'读取csv文件方式1: ')
df.show()
df = spark.read.csv('./data//people.csv',sep=';',header=True,encoding='utf-8')
print(f'读取csv文件方式2: ')
df.show()
'''
读取csv文件方式1:
+-----+----+---------+
| name| age| job|
+-----+----+---------+
|Jorge| 30|Developer|
| Bob| 32|Developer|
| Ani| 11|Developer|
+-----+----+---------+
读取csv文件方式2:
+-----+----+---------+
| name| age| job|
+-----+----+---------+
|Jorge| 30|Developer|
| Bob| 32|Developer|
| Ani| 11|Developer|
+-----+----+---------+
'''
# 3.4 Reading from a SQL table
df.createTempView('tt') # create a temporary view
df = spark.read.table(tableName='tt')
spark.catalog.dropTempView('tt') # drop the temporary view
print(f'读取sql数据表: ')
df.show()
'''
读取sql数据表:
+-----+----+---------+
| name| age| job|
+-----+----+---------+
|Jorge| 30|Developer|
| Bob| 32|Developer|
| Ani| 11|Developer|
+-----+----+---------+
'''
# 3.5 读取parquet数据:列式存储,内置schema,序列化存储体积小
df = spark.read.format('parquet')\
.load('./data/users.parquet')
print(f'读取parquet文件方式1: ')
df.show()
df = spark.read.parquet('./data/users.parquet')
print(f'读取parquet文件方式2: ')
df.show()
'''
读取parquet文件方式1:
+------+--------------+----------------+
| name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa| null| [3, 9, 15, 20]|
| Ben| red| []|
+------+--------------+----------------+
读取parquet文件方式2:
+------+--------------+----------------+
| name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa| null| [3, 9, 15, 20]|
| Ben| red| []|
+------+--------------+----------------+
'''
## DataFrame processing code styles
pd_data = pd.DataFrame({'id':[1,2,3],'name':['张三','李四','王五'],'score':[65,35,89]})
df = spark.createDataFrame(pd_data)
# 1. DSL (domain-specific language) style: the DataFrame-specific API
# 1.1 df.show(): print the dataframe. Parameters: n - number of rows to show, default 20;
# truncate - whether to truncate long values, default truncates to 20 characters
df.show(n=20,truncate=True)
'''
+---+----+-----+
| id|name|score|
+---+----+-----+
| 1|张三| 65|
| 2|李四| 35|
| 3|王五| 89|
+---+----+-----+
'''
# 1.2 df.printSchema(): 打印输出df 的表结构信息
df.printSchema()
'''
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- score: long (nullable = true)
'''
# 1.3 df.select(): 选择 df 中指定的列. 参数可以是 column对象、str、list[str]、list[column对象]
df.select('name').show()
df.select(df['name']).show() # df['name'] 返回 Column对象
'''
+----+
|name|
+----+
|张三|
|李四|
|王五|
+----+
+----+
|name|
+----+
|张三|
|李四|
|王五|
+----+
'''
# 1.4 df.filter()|df.where(): filter the rows of the df by a condition and return a new df; similar to pandas query()
df.filter('score > 60').show()
df.filter(df['score']>60).show()
df.where('score > 60').show()
df.where(df['score']>60).show()
'''
+---+----+-----+
| id|name|score|
+---+----+-----+
| 1|张三| 65|
| 3|王五| 89|
+---+----+-----+
'''
# 1.5 df.groupBy() 分组,返回GroupedData对象
pd_data = pd.DataFrame({'id':[1,2,3,4,5,6],
'name':['张三','李四','王五','张三','李四','王五'],
'score':[65,35,89,34,67,97]})
df = spark.createDataFrame(pd_data)
df.groupBy('name').sum().show()
'''
+----+-------+----------+
|name|sum(id)|sum(score)|
+----+-------+----------+
|张三| 5| 99|
|李四| 7| 102|
|王五| 9| 186|
+----+-------+----------+
'''
# 1.6 df.first() : 取出df第一行,返回Row对象
df = spark.read.schema('word STRING').text('./data/test_data/test_data/words.txt')
df.show()
print(df.first()['word']) # Row 对象没有show函数
'''
+------------+
| word|
+------------+
| hello spark|
|hello hadoop|
| hello flink|
+------------+
hello spark
'''
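A Row supports attribute-style access, key access, and conversion to a plain dict; a quick sketch using the first row above:
row = df.first()
print(row.word)       # attribute access
print(row['word'])    # key access
print(row.asDict())   # convert the Row to a Python dict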
# 1.7 df.limit(): return the first N rows of the df, like SQL LIMIT
df.limit(2).show()
'''
+------------+
|        word|
+------------+
| hello spark|
|hello hadoop|
+------------+
'''
# 1.8 F.split() :字符串切分函数
df.select(F.split(df['word'],' ')).show()
'''
+------------------+
|split(word, , -1)|
+------------------+
| [hello, spark]|
| [hello, hadoop]|
| [hello, flink]|
+------------------+
'''
# 1.9 F.explode() : 类似pandas的explode,字符串列表纵向扩展
df.select(F.explode(F.split(df['word'],' '))).show()
'''
+------+
| col|
+------+
| hello|
| spark|
| hello|
|hadoop|
| hello|
| flink|
+------+
'''
# 1.10 df.withColumn(): derive a column from existing ones; if the new column name already exists the column is replaced, otherwise a new column is added
df1 = df.withColumn(colName='word',col=F.explode(F.split(df['word'],' ')))
df1.groupBy(df1['word']).count().show()
'''
+------+-----+
| word|count|
+------+-----+
| hello| 3|
| spark| 1|
| flink| 1|
|hadoop| 1|
+------+-----+
'''
# 1.11 df.withColumnRenamed() : 修改列名
df1.groupBy(df1['word']).count().withColumnRenamed('count','cnt').show()
'''
+------+---+
| word|cnt|
+------+---+
| hello| 3|
| spark| 1|
| flink| 1|
|hadoop| 1|
+------+---+
'''
# 1.12 df.orderBy(): 排序
df1.groupBy(df1['word']).count().orderBy('count').show()
'''
+------+-----+
| word|count|
+------+-----+
| spark| 1|
|hadoop| 1|
| flink| 1|
| hello| 3|
+------+-----+
'''
# 1.13 F.min, F.max, F.round, F.avg; Column.alias(): give a column object an alias, like AS in SQL
df.groupBy('name').agg(F.min('score').alias('min_'),
F.max('score').alias('max_'),
F.round(F.avg('score')).alias('round_avg')).show()
'''
+----+------+----+---------+
|name| min_|max_|round_avg|
+----+------+----+---------+
|张三| 34.0|65.4| 50.0|
|李四| 35.2|67.0| 51.0|
|王五|89.034|97.0| 93.0|
+----+------+----+---------+
'''
# 2. SQL style: process the DataFrame data with SQL
df.createTempView('tt')
spark.sql('select name,sum(score) from tt group by name').show()
spark.catalog.dropTempView('tt')
'''
+----+----------+
|name|sum(score)|
+----+----------+
|张三| 99|
|李四| 102|
|王五| 186|
+----+----------+
'''
# 1. df.dropDuplicates(): deduplicate; with no arguments it deduplicates on whole rows, or you can deduplicate on specified columns
pd_data = pd.DataFrame({'name':['张三','李四','王五','张三','李四','王五']
,'score':[65,35,89,65,67,97]})
df = spark.createDataFrame(pd_data)
df.show()
df.dropDuplicates().show()
df.dropDuplicates(['name']).show()
'''
+----+-----+
|name|score|
+----+-----+
|张三| 65|
|李四| 35|
|王五| 89|
|张三| 65|
|李四| 67|
|王五| 97|
+----+-----+
+----+-----+
|name|score|
+----+-----+
|张三| 65|
|李四| 35|
|王五| 89|
|李四| 67|
|王五| 97|
+----+-----+
+----+-----+
|name|score|
+----+-----+
|张三| 65|
|李四| 35|
|王五| 89|
+----+-----+
'''
# 2. df.dropna(): basically the same as in pandas
import numpy as np
# df.dropna(): drop rows with missing values. Default how='any': drop a row if any column is null; how='all': drop only if every column in the row is null
pd_data = pd.DataFrame({'name':['张三','李四','王五','张三',None,None],'score':[65,35,np.nan,65,67,97]})
df = spark.createDataFrame(pd_data)
df.show()
print("dropna(how='any'):")
df.dropna().show()
print("dropna(how='all'):")
df.dropna(how='all').show()
# thresh=n: keep a row only if it has at least n non-null values; when thresh is set, how is ignored
# subset: the columns considered when dropping
print("dropna(thresh=1,subset=['name']):")
df.dropna(thresh=1,subset=['name']).show()
'''
+----+-----+
|name|score|
+----+-----+
|张三| 65.0|
|李四| 35.0|
|王五| NaN|
|张三| 65.0|
|null| 67.0|
|null| 97.0|
+----+-----+
dropna(how='any'):
+----+-----+
|name|score|
+----+-----+
|张三| 65.0|
|李四| 35.0|
|张三| 65.0|
+----+-----+
dropna(how='all'):
+----+-----+
|name|score|
+----+-----+
|张三| 65.0|
|李四| 35.0|
|王五| NaN|
|张三| 65.0|
|null| 67.0|
|null| 97.0|
+----+-----+
dropna(thresh=1,subset=['name']):
+----+-----+
|name|score|
+----+-----+
|张三| 65.0|
|李四| 35.0|
|王五| NaN|
|张三| 65.0|
+----+-----+
'''
# 3. df.fillna(): basically the same as in pandas
import numpy as np
pd_data = pd.DataFrame({'':range(6),'name':['张三','李四','王五','张三',None,None],'score':[65,35,np.nan,65,67,97]})
df = spark.createDataFrame(pd_data)
df.show()
# df.fillna(value=<fill value>, subset=[columns to fill]): fill missing values
print("Fill every null value:")
df.fillna(value='loss').show()
df.fillna(value='loss').printSchema() # the numeric column is not filled by a string value
print("Fill specified columns, each with its own value:")
df.fillna(value={'name':'无名','score':0},subset=['name','score']).show()
# Registering a DataFrame as a table
df.createTempView('tt') # register a temporary view (table)
df.createOrReplaceTempView('tt') # register a temporary view, replacing it if it already exists
df.createGlobalTempView('tt') # register a global temporary view, visible to every SparkSession in the application; queries must use the global_temp prefix
spark.catalog.dropTempView('tt') # drop the view explicitly, or it is dropped automatically after spark.stop()
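A global temporary view lives in the global_temp database and must be queried with that prefix; a short sketch (df2 and tt_global are illustrative names):
df2 = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'name'])
df2.createOrReplaceGlobalTempView('tt_global')
spark.sql('select * from global_temp.tt_global').show()               # note the global_temp prefix
spark.newSession().sql('select * from global_temp.tt_global').show()  # visible from another SparkSession in the same application
spark.catalog.dropGlobalTempView('tt_global')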
# Option 1: the unified write API:
# df.write.mode().format().option(K,V).save()
# mode: a mode string - append; overwrite; ignore (skip if data already exists); error (the default, fail if data already exists)
# format: a format string - text, csv, json, parquet (default), orc, avro, jdbc # note: text supports writing only a single column
# option: write options
# save: output path, local or HDFS
# Option 2: call the writer for a specific format directly, e.g. df.write.csv()
pd_data = pd.DataFrame({'id':[1,2,3,4,5,6],
'name':['张三','李四','王五','张三','李四','王五'],
'score':[65.4,35.2,89.034,34,67,97]})
df = spark.createDataFrame(pd_data)
df.show()
# 注意文件保存路径为文件夹所在路径
# 1. 写入csv文件
df.write.csv(path='/data/write_csv',
mode='overwrite',sep=',',header=True,encoding='utf-8')
# 2. 写入text文件,只能写入column对象
df.select(F.concat_ws(',',df['id'],df['name'],df['score']))\
.write.mode('overwrite').text('/data/write_text')
# 3. json写出
df.write.json(path='/data/pyspark学习/data/write_json'
,mode='overwrite',encoding='utf-8')
# 4. parquet写出(sparksql 默认保存方式,列式存储,有助于sparksql优化,列值裁剪操作)
df.write.mode('overwrite').save('/data/write_parquet')
# 5. 读取和写入mysql
# 5.1 将mysql驱动放到pyspark/jars下
options = {'user':'xxxx','password':'xxx'}
df.write.options(**options)\
.jdbc(url='jdbc:mysql://host_ip/database?useSSL=false&useUnicode=true'
,table='test_stu',mode='overwrite')
# 5.2 Read the MySQL table
spark.read.options(**options)\
.jdbc(url='jdbc:mysql://host_ip/database?useSSL=false&useUnicode=true'
,table='test_stu').show()
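Instead of copying the MySQL driver jar into pyspark/jars, the jar and driver class can also be supplied through configuration; a hedged sketch (the jar path and driver class depend on the Connector/J version actually used):
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('jdbc_demo') \
    .config('spark.jars', '/path/to/mysql-connector-java-8.0.xx.jar') \
    .getOrCreate()
spark.read.format('jdbc') \
    .option('url', 'jdbc:mysql://host_ip/database?useSSL=false&useUnicode=true') \
    .option('dbtable', 'test_stu') \
    .option('user', 'xxxx') \
    .option('password', 'xxx') \
    .option('driver', 'com.mysql.cj.jdbc.Driver') \
    .load().show()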
Definition method 1:
sparksession.udf.register()
A UDF registered this way can be used in both DSL and SQL style: the returned object is used in DSL style, and the name argument is used in SQL style.
Syntax:
udf = sparksession.udf.register(name,f,returnType)
Parameters:
name: the UDF name, used in SQL style
f: the Python function to register
returnType: the declared return type of the UDF
udf: the returned UDF object, used in DSL style
Definition method 2:
pyspark.sql.functions.udf
Usable only in DSL style
Syntax:
udf = F.udf(f,returnType)
Parameters:
f: the Python function to register
returnType: the declared return type of the UDF
udf: the returned UDF object, used in DSL style
# 1. Method 1: register the UDF via spark.udf.register (returning FloatType)
pd_data = pd.DataFrame({'id':[1,2,3,4,5,6],'name':['张三','李四','王五','张三','李四','王五'],'score':[65.4,35.2,89.034,34,67,97]})
df = spark.createDataFrame(pd_data)
df.show()
def num_r_10(x):
    return x*10
num_r_10_udf = spark.udf.register(name='sql_num_r_10',f=num_r_10,returnType=FloatType())
# DSL-style usage
df.select(num_r_10_udf(df['score'])).show()
# SQL-style usage
df.createTempView('tt')
spark.sql('select id,name,sql_num_r_10(score) from tt').show()
spark.catalog.dropTempView('tt')
'''
+---+----+------+
| id|name| score|
+---+----+------+
| 1|张三| 65.4|
| 2|李四| 35.2|
| 3|王五|89.034|
| 4|张三| 34.0|
| 5|李四| 67.0|
| 6|王五| 97.0|
+---+----+------+
+-------------------+
|sql_num_r_10(score)|
+-------------------+
| 654.0|
| 352.0|
| 890.34|
| 340.0|
| 670.0|
| 970.0|
+-------------------+
+---+----+-------------------+
| id|name|sql_num_r_10(score)|
+---+----+-------------------+
| 1|张三| 654.0|
| 2|李四| 352.0|
| 3|王五| 890.34|
| 4|张三| 340.0|
| 5|李四| 670.0|
| 6|王五| 970.0|
+---+----+-------------------+
'''
# 2. Method 2: declare the UDF with F.udf (returning an array type)
rdd1 = spark.sparkContext.parallelize([['hellow word'],['hellow python'],['hellow java']])
df = spark.createDataFrame(rdd1,schema='value STRING')
df.show()
def str_split_cnt(x):
    return [(i,'1') for i in x.split(' ')]
obj_udf = F.udf(f=str_split_cnt,returnType=ArrayType(elementType=ArrayType(StringType())))
df.select(obj_udf(df['value'])).show(truncate=False)
'''
+-------------+
| value|
+-------------+
| hellow word|
|hellow python|
| hellow java|
+-------------+
+--------------------------+
|str_split_cnt(value) |
+--------------------------+
|[[hellow,1],[word,1]] |
|[[hellow,1],[python,1]] |
|[[hellow,1],[java,1]] |
+--------------------------+
'''
# 3. Method 2: declare the UDF with F.udf (returning a struct/dict type)
rdd1 = spark.sparkContext.parallelize([['hellow word']
,['hellow python hellow']
,['hellow java']])
df = spark.createDataFrame(rdd1,schema='value STRING')
df.show()
def str_split_cnt(x):
    return {'name':'word_cnt','cnt_num':len(x.split(' '))}
obj_udf = F.udf(f=str_split_cnt,returnType=StructType()
.add(field='name',data_type=StringType(),nullable=True)
.add(field='cnt_num',data_type=IntegerType(),nullable=True)
)
df.select(obj_udf(df['value']).alias('value')).show(truncate=False)
'''
+--------------------+
| value|
+--------------------+
| hellow word|
|hellow python hellow|
| hellow java|
+--------------------+
+-------------+
|value |
+-------------+
|{word_cnt, 2}|
|{word_cnt, 3}|
|{word_cnt, 2}|
+-------------+
'''
Purpose:
As in plain SQL, a window function lets a row show both the pre-aggregation data and the aggregated result: the aggregate value is appended as an extra column on every row (the "window" opens a view onto the aggregated result for each row).
Types of window functions:
1. Aggregate window functions:
agg_func(field_name) over(partition by field_name)
2. Ranking window functions:
rank_func() over([partition by field_name1] order by field_name2 [desc])
3. NTILE (slicing) window functions:
ntile(n) over(partition by field_name1 order by field_name2 [desc])
# Window functions
pd_data = pd.DataFrame({'id':[1,2,3,4,5,6],'name':['张三','李四','王五','张三','李四','王五'],'score':[65.4,35.2,89.034,34,67,97]})
df = spark.createDataFrame(pd_data)
df.show()
df.createOrReplaceTempView('tt')
# Aggregate window function
spark.sql('select id,name,score,avg(score) over(partition by name)as avg_score from tt').show()
# Ranking window function
spark.sql('select id,name,score,row_number() over(partition by name order by score) as rank_score from tt').show()
# NTILE (slicing) window function
spark.sql('select id,name,score,ntile(3) over(order by score desc)as ntile_score from tt').show()
'''
+---+----+------+
| id|name| score|
+---+----+------+
| 1|张三| 65.4|
| 2|李四| 35.2|
| 3|王五|89.034|
| 4|张三| 34.0|
| 5|李四| 67.0|
| 6|王五| 97.0|
+---+----+------+
+---+----+------+---------+
| id|name| score|avg_score|
+---+----+------+---------+
| 1|张三| 65.4| 49.7|
| 4|张三| 34.0| 49.7|
| 2|李四| 35.2| 51.1|
| 5|李四| 67.0| 51.1|
| 3|王五|89.034| 93.017|
| 6|王五| 97.0| 93.017|
+---+----+------+---------+
+---+----+------+----------+
| id|name| score|rank_score|
+---+----+------+----------+
| 4|张三| 34.0| 1|
| 1|张三| 65.4| 2|
| 2|李四| 35.2| 1|
| 5|李四| 67.0| 2|
| 3|王五|89.034| 1|
| 6|王五| 97.0| 2|
+---+----+------+----------+
+---+----+------+-----------+
| id|name| score|ntile_score|
+---+----+------+-----------+
| 6|王五| 97.0| 1|
| 3|王五|89.034| 1|
| 5|李四| 67.0| 2|
| 1|张三| 65.4| 2|
| 2|李四| 35.2| 3|
| 4|张三| 34.0| 3|
+---+----+------+-----------+
'''
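The same window logic can be expressed in DSL style with pyspark.sql.window.Window; a short sketch using the df above:
from pyspark.sql.window import Window

w = Window.partitionBy('name').orderBy(F.desc('score'))
df.withColumn('rank_score', F.row_number().over(w)) \
  .withColumn('avg_score', F.avg('score').over(Window.partitionBy('name'))) \
  .show()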
Tuning the number of SparkSQL shuffle partitions
When a Spark SQL job triggers a shuffle, the default number of partitions is spark.sql.shuffle.partitions=200; in practice it should be set to a value that suits the cluster and data size, as sketched below.
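The value can be set when the session is built or changed at runtime; a minimal sketch:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('shuffle_demo') \
    .config('spark.sql.shuffle.partitions', '100') \
    .getOrCreate()
spark.conf.set('spark.sql.shuffle.partitions', '50')   # can also be adjusted at runtime
print(spark.conf.get('spark.sql.shuffle.partitions'))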
SparkSQL execution flow
SparkSQL execution is optimized automatically: RDD element types are opaque to Spark, whereas a DataFrame has a fixed two-dimensional table structure, so it can be optimized specifically.
Catalyst optimizer: optimizes the SQL logic before the RDD execution plan is generated; see the explain() sketch below.
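df.explain() prints the plans Catalyst produces, which is a quick way to inspect these optimizations; a small sketch with made-up data:
df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'name'])
df.filter(df['id'] > 1).select('name').explain(True)  # prints the parsed/analyzed/optimized logical plans and the physical plan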
Appendix: all attributes and methods of the SparkSQL DataFrame object (from the official docs)
Attribute | Official description | Notes |
---|---|---|
columns | Returns all column names as a list. | 返回df所有列名称 |
dtypes | Returns all column names and their data types as a list. | 返回df所有列名称和字段类型 |
isStreaming | Returns True if this DataFrame contains one or more sources that continuously return data as it arrives. | 判断数据源是否是流式数据 |
na | Returns a DataFrameNaFunctions for handling missing values. | |
rdd | Returns the content as an pyspark.RDD of Row. | |
schema | Returns the schema of this DataFrame as a pyspark.sql.types.StructType. | 返回dataframe整张表数据结构类型 |
sparkSession | Returns Spark session that created this DataFrame. | |
sql_ctx | ||
stat | Returns a DataFrameStatFunctions for statistic functions. | |
storageLevel | Get the DataFrame’s current storage level. | |
write | Interface for saving the content of the non-streaming DataFrame out into external storage. | |
writeStream | Interface for saving the content of the streaming DataFrame out into external storage. | |
Method | Official description | Notes |
---|---|---|
agg(*exprs) | Aggregate on the entire DataFrame without groups (shorthand for df.groupBy().agg()). | 在没有grouby的情况下聚合整个 DataFrame |
alias(alias) | Returns a new DataFrame with an alias set. | 给df所有列起别名 |
approxQuantile(col, probabilities, relativeError) | Calculates the approximate quantiles of numerical columns of a DataFrame. | 计算 DataFrame 的数值列的近似分位数。 |
cache() | Persists the DataFrame with the default storage level (MEMORY_AND_DISK). | 对df进行缓存(默认缓存级别:MEMORY_AND_DISK) |
checkpoint([eager]) | Returns a checkpointed version of this DataFrame. | |
coalesce(numPartitions) | Returns a new DataFrame that has exactly numPartitions partitions. | 对df进行分区修改 |
colRegex(colName) | Selects column based on the column name specified as a regex and returns it as Column. | 选择符合正则表达式的列 |
collect() | Returns all the records as a list of Row. | 将所有记录作为 Row 列表返回。 |
corr(col1, col2[, method]) | Calculates the correlation of two columns of a DataFrame as a double value. | 计算两列相关性 |
count() | Returns the number of rows in this DataFrame. | 返回此 DataFrame 中的行数。 |
cov(col1, col2) | Calculate the sample covariance for the given columns, specified by their names, as a double value. | 计算协方差 |
createGlobalTempView(name) | Creates a global temporary view with this DataFrame. | 使用此 DataFrame 创建一个全局临时视图。 |
createOrReplaceGlobalTempView(name) | Creates or replaces a global temporary view using the given name. | 使用给定名称创建或替换全局临时视图。 |
createOrReplaceTempView(name) | Creates or replaces a local temporary view with this DataFrame. | 使用此 DataFrame 创建或替换本地临时视图。 |
createTempView(name) | Creates a local temporary view with this DataFrame. | 使用此 DataFrame 创建一个本地临时视图。 |
crossJoin(other) | Returns the cartesian product with another DataFrame. | 返回带有另一个 DataFrame 的笛卡尔积。 |
crosstab(col1, col2) | Computes a pair-wise frequency table of the given columns. | 交叉表 |
cube(*cols) | Create a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregations on them. | 透视表 |
describe(*cols) | Computes basic statistics for numeric and string columns. | 显示字符串和数值列的基本信息 |
distinct() | Returns a new DataFrame containing the distinct rows in this DataFrame. | 去重 |
drop(*cols) | Returns a new DataFrame that drops the specified column. | 删除列 |
dropDuplicates([subset]) | Return a new DataFrame with duplicate rows removed, optionally only considering certain columns. | 返回删除重复行的新 DataFrame,可选择仅考虑某些列。 |
drop_duplicates([subset]) | drop_duplicates() is an alias for dropDuplicates(). | |
dropna([how, thresh, subset]) | Returns a new DataFrame omitting rows with null values. | 去空值 |
exceptAll(other) | Return a new DataFrame containing rows in this DataFrame but not in another DataFrame while preserving duplicates. | |
explain([extended, mode]) | Prints the (logical and physical) plans to the console for debugging purpose. | |
fillna(value[, subset]) | Replace null values, alias for na.fill(). | 空值填充 |
filter(condition) | Filters rows using the given condition. | 条件过滤 |
first() | Returns the first row as a Row. | 获取第一行 |
foreach(f) | Applies the f function to all Row of this DataFrame. | 将 f 函数应用于此 DataFrame 的所有行。 |
foreachPartition(f) | Applies the f function to each partition of this DataFrame. | 将 f 函数应用于此 DataFrame 的每个分区 |
freqItems(cols[, support]) | Finding frequent items for columns, possibly with false positives. | |
groupBy(*cols) | Groups the DataFrame using the specified columns, so we can run aggregation on them. | |
groupby(*cols) | groupby() is an alias for groupBy(). | 分组 |
head([n]) | Returns the first n rows. | 返回前n行 |
hint(name, *parameters) | Specifies some hint on the current DataFrame. | 指定当前 DataFrame 的一些提示。 |
inputFiles() | Returns a best-effort snapshot of the files that compose this DataFrame. | 快照 |
intersect(other) | Return a new DataFrame containing rows only in both this DataFrame and another DataFrame. | 求交集 |
intersectAll(other) | Return a new DataFrame containing rows in both this DataFrame and another DataFrame while preserving duplicates. | |
isEmpty() | Returns True if this DataFrame is empty. | 判断是否为空 |
isLocal() | Returns True if the collect() and take() methods can be run locally (without any Spark executors). | 判断driver是否可以容纳collect() |
join(other[, on, how]) | Joins with another DataFrame, using the given join expression. | 关联表 |
limit(num) | Limits the result count to the number specified. | 将结果计数限制为指定的数量。 |
localCheckpoint([eager]) | Returns a locally checkpointed version of this DataFrame. | |
mapInArrow(func, schema) | Maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs a PyArrow’s RecordBatch, and returns the result as a DataFrame. | |
mapInPandas(func, schema) | Maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs a pandas DataFrame, and returns the result as a DataFrame. | |
observe(observation, *exprs) | Observe (named) metrics through an Observation instance. | |
orderBy(*cols, **kwargs) | Returns a new DataFrame sorted by the specified column(s). | 排序 |
pandas_api([index_col]) | Converts the existing DataFrame into a pandas-on-Spark DataFrame. | 将现有 DataFrame 转换为 pandas-on-Spark DataFrame。 |
persist([storageLevel]) | Sets the storage level to persist the contents of the DataFrame across operations after the first time it is computed. | 设置存储级别以在第一次计算后跨操作保留 DataFrame 的内容 |
printSchema() | Prints out the schema in the tree format. | 以树格式打印出表格架构。 |
randomSplit(weights[, seed]) | Randomly splits this DataFrame with the provided weights. | 使用提供的权重随机拆分此 DataFrame。 |
registerTempTable(name) | Registers this DataFrame as a temporary table using the given name. | 使用给定名称将此 DataFrame 注册为临时表。 |
repartition(numPartitions, *cols) | Returns a new DataFrame partitioned by the given partitioning expressions. | 返回由给定分区表达式分区的新 DataFrame。 |
repartitionByRange(numPartitions, *cols) | Returns a new DataFrame partitioned by the given partitioning expressions. | 返回由给定分区表达式分区的新 DataFrame。 |
replace(to_replace[, value, subset]) | Returns a new DataFrame replacing a value with another value. | 替换操作,和pandas一样 |
rollup(*cols) | Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them. | |
sameSemantics(other) | Returns True when the logical query plans inside both DataFrames are equal and therefore return same results. | |
sample([withReplacement, fraction, seed]) | Returns a sampled subset of this DataFrame. | 返回此 DataFrame 的采样子集。 |
sampleBy(col, fractions[, seed]) | Returns a stratified sample without replacement based on the fraction given on each stratum. | 根据条件抽样 |
select(*cols) | Projects a set of expressions and returns a new DataFrame. | 按列名进行列选择 |
selectExpr(*expr) | Projects a set of SQL expressions and returns a new DataFrame. | 根据sql表达式选择部分数据 |
semanticHash() | Returns a hash code of the logical query plan against this DataFrame. | |
show([n, truncate, vertical]) | Prints the first n rows to the console. | 将前 n 行打印到控制台。 |
sort(*cols, **kwargs) | Returns a new DataFrame sorted by the specified column(s). | 排序 |
sortWithinPartitions(*cols, **kwargs) | Returns a new DataFrame with each partition sorted by the specified column(s). | 分区内排序 |
subtract(other) | Return a new DataFrame containing rows in this DataFrame but not in another DataFrame. | 求差集 |
summary(*statistics) | Computes specified statistics for numeric and string columns. | 计算数字和字符串列的指定统计信息。 |
tail(num) | Returns the last num rows as a list of Row. | 将最后 num 行作为 Row 列表返回。 |
take(num) | Returns the first num rows as a list of Row. | 将前 num 行作为 Row 的列表返回。 |
toDF(*cols) | Returns a new DataFrame that with new specified column names | 返回具有新指定列名的新 DataFrame |
toJSON([use_unicode]) | Converts a DataFrame into a RDD of string. | 将 DataFrame 转换为字符串类型RDD |
toLocalIterator([prefetchPartitions]) | Returns an iterator that contains all of the rows in this DataFrame. | 返回包含此 DataFrame 中所有行的迭代器。 |
toPandas() | Returns the contents of this DataFrame as Pandas pandas.DataFrame. | 将此 DataFrame 的内容作为 Pandas pandas.DataFrame 返回。 |
to_koalas([index_col]) | ||
to_pandas_on_spark([index_col]) | ||
transform(func, *args, **kwargs) | Returns a new DataFrame. | |
union(other) | Return a new DataFrame containing union of rows in this and another DataFrame. | Combine two DataFrames (duplicates are kept; use distinct() to deduplicate) |
unionAll(other) | Return a new DataFrame containing union of rows in this and another DataFrame. | Combine two DataFrames (duplicates are kept) |
unionByName(other[, allowMissingColumns]) | Returns a new DataFrame containing union of rows in this and another DataFrame. | |
unpersist([blocking]) | Marks the DataFrame as non-persistent, and remove all blocks for it from memory and disk. | 清理缓存 |
where(condition) | where() is an alias for filter(). | 过滤和filter一样 |
withColumn(colName, col) | Returns a new DataFrame by adding a column or replacing the existing column that has the same name. | 添加或替换列(或对某一列进行F操作) |
withColumnRenamed(existing, new) | Returns a new DataFrame by renaming an existing column. | 列名修改 |
withColumns(*colsMap) | Returns a new DataFrame by adding multiple columns or replacing the existing columns that has the same names. | 添加或替换多列 |
withMetadata(columnName, metadata) | Returns a new DataFrame by updating an existing column with metadata. | 通过使用元数据更新现有列来返回新的 DataFrame。 |
withWatermark(eventTime, delayThreshold) | Defines an event time watermark for this DataFrame. | 为此 DataFrame 定义事件时间水印。 |
writeTo(table) | Create a write configuration builder for v2 sources. |