Understanding partitionColumn, lowerBound, upperBound, and numPartitions in Spark SQL

When reading data through Spark SQL's JDBC source, the read can be split into chunks. As in the example below, you specify the read options partitionColumn, lowerBound, upperBound, and numPartitions. In short, the table is read in parallel.
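A minimal sketch of such a read (the JDBC URL, table name, and credentials below are placeholders):

```scala
// Partitioned JDBC read: Spark opens up to numPartitions connections,
// each fetching one range of the partition column.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/testdb") // placeholder URL
  .option("dbtable", "tablename")                      // placeholder table
  .option("user", "user")
  .option("password", "password")
  .option("partitionColumn", "id")
  .option("lowerBound", "0")
  .option("upperBound", "1000")
  .option("numPartitions", "10")
  .load()
```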

  • partitionColumn: the column to partition on; it must be numeric ("partitionColumn must be a numeric column from the table in question."). In my tests, besides integer types, float, double, and decimal columns also work.
  • lowerBound: the lower bound; must be an integer.
  • upperBound: the upper bound; must be an integer.
  • numPartitions: the maximum number of partitions; must be an integer. If it is 0 or negative, the actual number of partitions is 1. It is not necessarily the final partition count: when upperBound - lowerBound < numPartitions, the actual number of partitions is upperBound - lowerBound (see the snippet after this list).
  • The resulting partitions cover contiguous ranges. Inspecting the partition id of individual records may not look ordered, but after saving the RDD to files, the ordering is apparent.
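A quick way to observe the capping behavior (the connection details are placeholders, and the bound values are purely illustrative):

```scala
// With upperBound - lowerBound = 5 but numPartitions = 10, Spark caps
// the partition count at the range size, so this should print 5.
val small = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/testdb") // placeholder
  .option("dbtable", "tablename")
  .option("partitionColumn", "id")
  .option("lowerBound", "0")
  .option("upperBound", "5")
  .option("numPartitions", "10")
  .load()

println(small.rdd.getNumPartitions)
```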

The official Spark SQL documentation explains these four parameters as follows:

| Property Name | Meaning |
| --- | --- |
| partitionColumn, lowerBound, upperBound | These options must all be specified if any of them is specified. In addition, numPartitions must be specified. They describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric column from the table in question. Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in table. So all rows in the table will be partitioned and returned. This option applies only to reading. |
| numPartitions | The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling coalesce(numPartitions) before writing. |
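The write path mentioned in the numPartitions row can be sketched like this, given some DataFrame df (URL and table name are placeholders; this illustrates the documented coalesce-before-write behavior):

```scala
// If df has more than 4 partitions, Spark coalesces it to 4 before writing,
// so at most 4 concurrent JDBC connections are opened.
df.write
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/testdb") // placeholder
  .option("dbtable", "target_table")                   // placeholder
  .option("numPartitions", "4")
  .mode("append")
  .save()
```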

From this explanation, the partition column must be numeric, and the so-called parallel read simply opens multiple database connections, each reading one chunk. Also note:

Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in table. So all rows in the table will be partitioned and returned.

In other words, **these options do not filter the data: however many rows the query covers, that many rows are returned; lowerBound and upperBound filter nothing.** So what happens if lowerBound is set too high (the partition column may contain smaller values), or upperBound is set too low (the largest value in the partition column may exceed it)? How is the data read and returned then? The logic is as follows:

```
# Case 1:
if any of partitionColumn / lowerBound / upperBound / numPartitions is specified
   without the others -> error
# Case 2:
if numPartitions == 1 -> ignore the bounds, issue a single query, return one partition
# Case 3:
if numPartitions > 1 && lowerBound > upperBound -> error
# Case 4:
numPartitions = min(upperBound - lowerBound, numPartitions)
if numPartitions == 1 -> same as Case 2
else -> return numPartitions partitions, with
delta = (upperBound - lowerBound) / numPartitions

Partition 1:  partitionColumn < lowerBound + delta OR partitionColumn IS NULL
Partition 2:  partitionColumn >= lowerBound + delta AND partitionColumn < lowerBound + 2 * delta
...
Partition n:  partitionColumn >= lowerBound + (n - 1) * delta
```
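A simplified Scala sketch of this predicate-generation logic (modeled loosely on Spark's internal JDBCRelation.columnPartition; the function name and the exact stride arithmetic are simplifications, not Spark's source):

```scala
// Build one WHERE clause per partition from the four read options.
def partitionPredicates(
    column: String,
    lowerBound: Long,
    upperBound: Long,
    numPartitions: Int): Seq[String] = {
  // Cap by the size of the range, and never go below one partition.
  val capped = math.min(numPartitions.toLong, math.max(upperBound - lowerBound, 1L))
  val n = math.max(1L, capped).toInt
  val stride = (upperBound - lowerBound) / n
  (0 until n).map { i =>
    val lower = if (i == 0) None else Some(s"$column >= ${lowerBound + i * stride}")
    val upper = if (i == n - 1) None else Some(s"$column < ${lowerBound + (i + 1) * stride}")
    (lower, upper) match {
      case (None, Some(u))    => s"$u OR $column IS NULL" // first partition also collects NULLs
      case (Some(l), Some(u)) => s"$l AND $u"
      case (Some(l), None)    => l                        // last partition is open-ended upward
      case _                  => "1 = 1"                  // single partition: no filtering at all
    }
  }
}
```

For lowerBound = 0, upperBound = 1000, numPartitions = 10 and a column named id, this yields `id < 100 OR id IS NULL`, `id >= 100 AND id < 200`, ..., `id >= 900`.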
First, consider a tempting but incorrect reading of these options. For example, with:

```
lowerBound: 0
upperBound: 1000
numPartitions: 10
```

the stride equals 100, and one might assume the partitions correspond to the following queries:

```
SELECT * FROM table WHERE partitionColumn BETWEEN 0 AND 100    // inaccurate
SELECT * FROM table WHERE partitionColumn BETWEEN 100 AND 200
...
SELECT * FROM table WHERE partitionColumn BETWEEN 900 AND 1000 // inaccurate
```

If this were the case, the upper and lower bound values would filter the dataset being read. In fact, lowerBound and upperBound are used only to decide the partition stride, not to filter rows in the table, so all rows in the table are partitioned and returned; the first and last queries above are wrong because the first partition has no lower cutoff and the last has no upper cutoff.

Now consider how the read actually behaves:

Suppose a table is read into 10 partitions, id is the partition column, and its values range from 0 to 101, but lowerBound is set to 1 and upperBound to 100. With a stride of 10, the reads follow this pattern (the per-partition row counts can be verified with the snippet after this example):

```
Partition 1:  SELECT * FROM tablename WHERE id < 11 OR id IS NULL  // also catches id = 0, below lowerBound: nothing is dropped
Partition 2:  SELECT * FROM tablename WHERE id >= 11 AND id < 21
Partition 3:  SELECT * FROM tablename WHERE id >= 21 AND id < 31
……
Partition 10: SELECT * FROM tablename WHERE id >= 91               // also catches id = 101, above upperBound: nothing is dropped
```
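A hypothetical sanity check for this scenario, assuming the table was read into df as described and each id value from 0 to 101 appears exactly once:

```scala
import org.apache.spark.sql.functions.col

// No rows are filtered out: the total equals the full table's row count
// (102 here), and the two out-of-range rows (id = 0 and id = 101) remain.
val total = df.count()
val outOfRange = df.filter(col("id") < 1 || col("id") > 100).count() // expected: 2
println(s"total = $total, outside [lowerBound, upperBound] = $outOfRange")
```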

For instance, for upperBound = 500, lowerBound = 0 and numPartitions = 5 (the example from reference [1]), the partitions correspond to the following queries:

```
SELECT * FROM table WHERE partitionColumn < 100 OR partitionColumn IS NULL
SELECT * FROM table WHERE partitionColumn >= 100 AND partitionColumn < 200
SELECT * FROM table WHERE partitionColumn >= 200 AND partitionColumn < 300
SELECT * FROM table WHERE partitionColumn >= 300 AND partitionColumn < 400
SELECT * FROM table WHERE partitionColumn >= 400
```

Depending on the actual range of values of the partitionColumn, the result size of each partition will vary.
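To see this skew directly, one can count rows per physical partition (a standard Spark idiom, not specific to the JDBC source):

```scala
import org.apache.spark.sql.functions.spark_partition_id

// Group by the physical partition id to see how many rows each partition holds.
df.groupBy(spark_partition_id().alias("pid"))
  .count()
  .orderBy("pid")
  .show()
```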

For a verification experiment, see https://blog.csdn.net/wiborgite/article/details/84944596 (reference [2]).

This article is a summary based on:

[1] https://stackoverflow.com/questions/41085238/what-is-the-meaning-of-partitioncolumn-lowerbound-upperbound-numpartitions-pa#

[2] https://blog.csdn.net/wiborgite/article/details/84944596 (verification experiment)

[3] https://blog.csdn.net/a8131357leo/article/details/107002207
