Author: Zachary Ennenga
Background
At Airbnb, our offline data processing ecosystem contains many mission-critical, time-sensitive jobs — it is essential for us to maximize the stability and efficiency of our data pipeline infrastructure.
So, when a few months back, we encountered a recurring issue that caused significant outages of our data warehouse, it quickly became imperative that we understand and solve the root cause. We traced the outage back to a single job, and how it, unintentionally and unexpectedly, wrote millions of files to HDFS.
Thus, we began to investigate the various strategies that can be used to manage our Spark file count in order to maximize the stability and efficiency of our Data Engineering ecosystem.
A quick note on vocabulary
Throughout this post I will need to use the term “partitions” quite a lot, and both Hive and Spark use this to mean different things. For this reason, I will use the term sPartition to refer to a Spark Partition, and hPartition to refer to a Hive partition. I will use the term partition key to refer to the set of values that make up the partition identifier of any given hPartition.
How does this even happen?
As the term “ETL” implies, most Spark jobs can be described by 3 operations: Read input data, process with Spark, save output data. This means that while your actual data transformations are occurring largely in-memory, your jobs generally begin and end with a large amount of IO.
A common stack for Spark, one we use at Airbnb, is to use Hive tables stored on HDFS as your input and output datastore. Hive partitions are represented, effectively, as directories of files on a distributed file system. In theory, it might make sense to try to write as many files as possible. However, there is a cost.
HDFS does not handle large numbers of small files well. Each file has a 150 byte cost in NameNode memory, and HDFS has a limited number of overall IOPS. Spikes in file writes can absolutely take down, or otherwise render unusably slow, pieces of your HDFS infrastructure.
It may seem as though it would be hard to accidentally write a huge amount of files, but it really isn’t. If your use cases only involve writing a single hPartition at a time, there are a number of solutions to this issue. But in a large data engineering organization, these cases are not the only ones you’ll encounter.
At Airbnb, we have a number of cases where we write to multiple hPartitions, most commonly, backfills. A backfill is a recomputation of a table from some historical date to the current date, often to fix a bug or data quality issue.
When handling a large dataset, say, 500GB-1TB, that contains 365 days' worth of data, you may break your data into a few thousand sPartitions for processing, perhaps 2000–3000. While on the surface this naive approach may seem reasonable, using dynamic partitioning and writing your results to a Hive table partitioned by date will result in up to 1.1M files.
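To make the arithmetic concrete, below is a hedged sketch of such a naive backfill write; the SparkSession (spark), the input DataFrame (processed), and the output table name are illustrative stand-ins, not taken from the job in question:

// Hive dynamic partitioning, so each executor opens a file per partition key it sees.
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

processed                                // ~3,000 sPartitions after processing
  .write
  .insertInto("warehouse.events_daily")  // hypothetical Hive table partitioned by date

// Worst case: every sPartition contains every date.
// 3,000 sPartitions * 365 partition keys = 1,095,000 files, i.e., ~1.1M.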
Why does this happen?
Figure: Data distributed randomly into 3 sPartitions
Figure: Our goal — Data written neatly into 3 files

Let's assume you have a job with 3 sPartitions, and you want to write to 3 hPartitions.
What you want to have happen in this situation is 3 files written to HDFS, with all records present in a single file per partition key.
Figure: Reality — Data written not-so-neatly to HDFS

What will actually happen is you will generate 9 files, each with 1 record. When writing to a Hive table with dynamic partitioning, each sPartition is processed in parallel by your executors. When that sPartition is processed, each time an executor encounters a new partition key in a given sPartition, it opens a new file.
By default, Spark uses either a Hash or Round Robin partitioner on your data. Both of these, when applied to an arbitrary dataframe, can be assumed to distribute your rows relatively evenly, but randomly, throughout your sPartitions. This means without taking any specific action, you can generally expect to write approximately 1 file per sPartition, per unique partition key, hence our 1.1M result above.
How do you decide on your target file count?
Before we dig into the various ways to convince Spark to distribute our data in a way that’s amenable to efficient IO, we have to discuss what we’re even aiming for.
Ideally, your target file size should be approximately a multiple of your HDFS block size, 128MB by default.
In pure Hive pipelines, there are configurations provided to automatically collect results into reasonably sized files, nearly transparently from the perspective of the developer, such as hive.merge.smallfiles.avgsize, or hive.merge.size.per.task.
However, no such functionality exists in Spark. Instead, we must use our own heuristics to try to determine, given a dataset, how many files should be written.
Size-based calculations
In theory, this is the most straightforward approach — set a target size, estimate the size of your dataframe, and then divide.
However, files are written to disk, in many cases, with compression, and in a format that is significantly different than the format of your records stored in the Java heap. This means it’s far from trivial to estimate how large your records in memory will be when written to disk.
While you may be able to estimate the size of your data in memory using the SizeEstimator utility, and then apply some sort of estimated compression/file format factor, the SizeEstimator considers the internal overhead of dataframes/datasets in addition to the size of your data. Overall, this heuristic is unlikely to be accurate for this purpose.
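For illustration only, a size-based heuristic might look like the sketch below; df is the output DataFrame, and the compression/format factor is a made-up constant you would have to tune per table, which is exactly where this approach falls apart:

import org.apache.spark.util.SizeEstimator

val targetFileSizeBytes = 128L * 1024 * 1024   // roughly one HDFS block
val compressionFactor = 0.3                    // assumed on-disk vs. in-memory ratio; pure guesswork
val sampleRows = df.take(10000)                // small sample pulled to the driver
// SizeEstimator includes dataframe/dataset overhead, so this over-counts the raw data size.
val bytesPerRow = SizeEstimator.estimate(sampleRows) / sampleRows.length.toDouble
val estimatedDiskBytes = bytesPerRow * df.count() * compressionFactor
val fileCount = math.max(1, math.ceil(estimatedDiskBytes / targetFileSizeBytes).toInt)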
Row count-based calculations
A second method is to set a target row count, count the size of your dataset, and then perform division to estimate your target.
Your target row count can be determined in a number of ways, either by picking a static number for all datasets, or by determining the size of a single record on disk, and performing the necessary calculations. Which way is best will depend on your number of datasets, and their complexity.
Counting is fairly cheap, but requires a cache before the count to avoid recomputing your dataset. We’ll discuss the cost of caching later, so while this is viable, it is not necessarily free.
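A minimal sketch of the count-based heuristic, assuming df is the output DataFrame and targetRowsPerFile is a per-dataset constant you maintain:

val targetRowsPerFile = 1000000L   // illustrative; derive it from average record size on disk

val cached = df.cache()            // cache first so the count does not force a recompute later
val rowCount = cached.count()
val fileCount = math.max(1, math.ceil(rowCount.toDouble / targetRowsPerFile).toInt)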
Static file counts
The simplest solution is to just require engineers to, on a per-insert basis, tell Spark how many files, in total, it should be writing. This “heuristic” will not work on its own, as we need to give developers some other heuristic to get this number in the first place, but could be an optimization we can apply to skip an expensive calculation.
Evaluation
A hybrid is your best option here. Unknown datasets should be handled with a count-based heuristic to determine file count, while enabling developers to take the result determined by the count heuristic and encode it statically.
How do we get Spark to distribute our data in a reasonable way?
Even if we know how we want our files written to disk, we still have to get Spark to structure our sPartitions in a way that is amenable to actually generating those files.
Spark provides you a number of tools to determine how data is distributed throughout your sPartitions. However, there is a lot of hidden complexity in the various functions, and in some cases, they have implications that are not immediately obvious.
We will go through a number of these options that Spark provides, and various other techniques that we have leveraged at Airbnb to control Spark output file count.
Coalesce
Coalesce is a special version of repartition that only allows you to decrease the total sPartitions, but does not require a full shuffle, and is thus significantly faster than a repartition. It does this by, effectively, merging sPartitions.
Coalesce sounds useful in some cases, but has some problems.
First, coalesce has a behavior that makes it difficult for us to use. Take a pretty basic Spark application:
load().map(…).filter(…).save()
Let’s say you had a parallelism of 1000, but you only wanted to write 10 files at the end. You might think you could do:
load().map(…).filter(…).coalesce(10).save()
However, Spark’s will effectively push down the coalesce operation to as early a point as possible, so this will execute as:
但是,Spark将有效地将合并操作推到尽可能早的一点,因此将执行以下操作:
load().coalesce(10).map(…).filter(…).save()
The only workaround is to force an action between your transformations and your coalesce, like:
val df = load().map(…).filter(…).cache()
df.count()
df.coalesce(10).save()
The cache is required because otherwise, you’ll have to recompute your data, which can be very costly. However, caching is not free; if your dataset cannot fit into memory, or if you cannot spare the memory to store your data in memory effectively twice, then you must use disk caching, which has its own limitations and a significant performance penalty.
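If holding the dataset in memory twice is not an option, one hedge is a disk-only persist; a sketch of the same workaround under that assumption, keeping the placeholder load/map/filter/save calls used above:

import org.apache.spark.storage.StorageLevel

val df = load().map(…).filter(…).persist(StorageLevel.DISK_ONLY)  // spill to local disk instead of memory
df.count()          // force materialization so the coalesce cannot be pushed before the transformations
df.coalesce(10).save()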
In addition, as you will see later on, performing a shuffle is often necessary to achieve the results we want for more complicated datasets.
Evaluation
Coalesce only works for a specific subset of cases:
1. You can guarantee you are only writing to 1 hPartition
2. The target number of files is less than the number of sPartitions you're using to process your data
3. You can afford to cache or recompute your data
Simple Repartition
A simple repartition is a repartition whose only parameter is the target sPartition count, e.g., df.repartition(100). In this case, a round-robin partitioner is used, meaning the only guarantee is that the output data has roughly equally sized sPartitions.
Evaluation
A simple repartition can fix skewed data, where the sPartitions are wildly different sizes.
It is only useful for file count problems where:
- You can guarantee you are only writing to 1 hPartition
- The number of files you are writing is greater than your number of sPartitions and/or you cannot use coalesce for some other reason
Repartition by Columns
Repartition by columns takes in a target sPartition count, as well as a sequence of columns to repartition on — e.g., df.repartition(100, $”date”). This is useful for forcing Spark to distribute records with the same key to the same partition. In general, this is useful for a number of Spark operations, such as joins, but in theory, it could allow us to solve our problem as well.
Repartitioning by columns uses a HashPartitioner, which will assign records with the same value for the hash of their key to the same partition. In effect, it will do:
Figure: Beautiful!

Which, in theory, is exactly what we want!
Figure: Saying "in theory" is always inviting disaster.

However, this approach only works if each partition key can safely be written to one file. This is because no matter how many values have a certain hash value, they'll end up in the same partition.
Evaluation
Repartitioning by columns only works when you are writing to one or more small hPartitions. In any other case it is not useful, because you will always end up with 1 file per hPartition, which only works for the smallest of datasets.
Repartition by Columns with a Random Factor
We can modify repartition by columns by adding a constrained random factor:
df
.withColumn("rand", rand() % filesPerPartitionKey)
.repartition(100, $”key”, $"rand")
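One note on the snippet above (and on the similar ones later in this post): rand() in Spark returns a double in [0, 1), so the modulo reads as shorthand for picking a random bucket rather than something that bounds the value as written. A runnable form of the same idea, with illustrative column names and constants, might be:

import org.apache.spark.sql.functions.{col, floor, rand}

val filesPerPartitionKey = 5   // illustrative constant
val bucketed = df
  .withColumn("rand_bucket", floor(rand() * filesPerPartitionKey))  // integer bucket in [0, filesPerPartitionKey)
  .repartition(100, col("key"), col("rand_bucket"))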
In theory, this approach should lead to well sorted records, and files of fairly even size, as long as you meet the following conditions:
- hPartitions are all roughly the same size
- You know the target number of files per hPartition and can encode it at runtime
As we discussed earlier, determining the correct files-per-partition value is far from easy. However, the first condition is also far from trivial to meet:
In a backfill context, say, computing a year’s worth of data, day-to-day data volume changes are low, whereas month-to-month and year-to-year changes are high. Assuming a 5% month-over-month growth rate of a data source, we expect the data volume to increase 80% over the course of the year. With a 10% month-over-month growth rate, 313%.
Given these factors, it seems we will suffer performance problems and skew over the course of any period larger than a month or so, and cannot meaningfully claim that all hPartitions will require roughly the same file count.
That said, even if we can guarantee all those conditions are met, there is one other problem: Hashing Collisions.
Collisions
Let’s say you are processing 1 year’s worth of data (365 unique dates), with date as your only partition key.
If you need 5 files per partition, you might do something like:
df.withColumn(“rand”, rand() % 5).repartition(5*365, $”date”, $”rand”)
Under the hood, Scala will construct a key that contains both your date and your random factor, something like a tuple of (date, rand() % 5).
(See org.apache.spark.HashPartitioner, defined in Partitioner.scala, for reference)
class HashPartitioner(partitions: Int) extends Partitioner {
  def numPartitions: Int = partitions

  def getPartition(key: Any): Int = key match {
    case null => 0
    case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
  }
}
Effectively, all that’s being done is taking the hash of your key tuple, and then taking the (nonNegative) mod of it using the target number of sPartitions.
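To make that concrete, here is a tiny standalone sketch of the same bucketing rule (mirroring, not reusing, Spark's internal Utils.nonNegativeMod):

// Assign a key to one of numPartitions buckets, the way HashPartitioner does.
def bucketFor(key: Any, numPartitions: Int): Int = {
  val mod = key.hashCode % numPartitions
  if (mod < 0) mod + numPartitions else mod
}

// Whether two distinct (date, rand) keys collide depends only on whether their
// hashCodes are congruent modulo the sPartition count:
val sameBucket = bucketFor(("2018-01-01", 3), 5 * 365) == bucketFor(("2018-06-14", 1), 5 * 365)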
Let’s analyze how our records will actually be distributed in this case. I have written some code to perform the analysis over here, also available as a gist here.
The above script calculates 3 quantities:
Efficiency: The ratio of non-empty sPartitions (and thus, executors in use) to number of output files
Collision Rate: The percentage of sPartitions where the hash of (date, rand) collided
Severe Collision Rate: As above, but where the number of collisions on this key is 3 or greater
Collisions are significant because they mean our sPartitions contain multiple unique partition keys, whereas we only expected 1 per sPartition.
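The linked analysis code is not reproduced here, but the sketch below shows roughly the kind of simulation involved, reusing the bucketFor helper from above and assuming 365 dates, 5 random buckets per date, and 5 * 365 output sPartitions; exact numbers will differ from the author's gist:

val dates = (1 to 365).map(d => f"2018-$d%03d")   // synthetic partition keys
val buckets = 0 until 5
val sPartitions = 5 * 365                         // also the total output file count here

// Assign every (date, bucket) key to an sPartition the same way HashPartitioner would.
val assigned = for { d <- dates; b <- buckets } yield bucketFor((d, b), sPartitions)

val keysPerPartition = assigned.groupBy(identity).map { case (p, keys) => p -> keys.size }
val efficiency = keysPerPartition.size.toDouble / sPartitions                           // non-empty sPartitions / output files
val collisionRate = keysPerPartition.count(_._2 > 1).toDouble / keysPerPartition.size
val severeCollisionRate = keysPerPartition.count(_._2 >= 3).toDouble / keysPerPartition.size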
The results are pretty bad: We are using 63% of the executors we could be, and there is likely to be severe skew; close to half of our executors are processing 2, 3, or in some cases up to 8 times more data than we expect.
Now, there is a workaround — partition scaling.
In our previous examples, our number of output sPartitions is equal to our intended total file count. This causes hash collisions because of the same principles that surround the Birthday Problem — that is, if you're randomly assigning n objects to n slots, you can expect that there will be several slots with more than one object, and several empty slots. Thus, to fix this, you must decrease the ratio of objects to slots.
We do this by scaling our output partition count, by multiplying our output sPartition count by a large factor, something akin to:
df
.withColumn(“rand”, rand() % 5)
.repartition(5*365*SCALING_FACTOR, $”date”, $”rand”)
See here for the updated analysis code; however, to summarize:
As our scale factor approaches infinity, collisions fairly quickly approach 0, and efficiency gets closer to 100%.
However, this creates another problem, where a huge amount of the output sPartitions will be empty. While these empty sPartitions aren’t necessarily a deal breaker, they do carry some overhead, increase driver memory requirements, and make us more vulnerable to issues where, due to bugs, or unexpected complexity, our partition key space is unexpectedly large, and we end up writing millions of files again.
Default Parallelism and Scaling
A common approach here is to not set the partition count explicitly when using this approach, and rely on the fact that Spark defaults to your spark.default.parallelism value (and similar configurations) if you do not provide a partition count.
While, often, parallelism is naturally higher than the total output file count (thus implicitly providing a scaling factor greater than 1), this is not always true — I have observed many cases where developers do not tune parallelism correctly, resulting in cases where the desired output file count is actually greater than their default parallelism. The penalties for this are high:

Figure: Scale factors below 1 get bad, fast.
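A cheap guard against that failure mode, assuming a SparkSession named spark and borrowing the roughly 3x margin suggested at the end of this post, might look like:

// Relying on implicit scaling only helps if parallelism comfortably exceeds the intended file count.
val parallelism = spark.sparkContext.defaultParallelism   // spark.sql.shuffle.partitions plays the same role for the DataFrame API
val desiredFileCount = 5 * 365                            // illustrative
val implicitScaling = parallelism.toDouble / desiredFileCount

require(
  implicitScaling >= 3.0,
  s"Implicit scaling factor is only $implicitScaling; set the partition count explicitly instead."
)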
Evaluation
This is an efficient approach if you can meet a few guarantees:
- hPartitions will have roughly equal file counts
- We can determine what the average partition file count should be
- We know, roughly, the total number of unique partition keys, so we can correctly scale our dataset.
In the examples, we assumed many of these things could easily be known; primarily, the total number of output hPartitions, and the number of files desired per hPartition. However, I think it's rare that we can broadly ask developers to provide these numbers and keep them up to date.
This approach is not a bad one by any means, and will likely work for many use cases. That said, if you are not aware of its pitfalls, you can encounter difficult-to-diagnose performance problems. Because of this, and because of the requirement to maintain file-count-related constants, I feel it is not a suitable default.
For a true default, we need an approach that requires minimal information from developers, and works with any sort of input.
Naive Repartition by Range
Repartition by range is a special case of repartition. Rather than applying RoundRobin or Hash Partitioners, it uses a special one, called a Range Partitioner.
A range partitioner splits rows across sPartitions based on the ordering of some given key, however, it is not performing a global sort. The guarantees it makes are:
- All records with the same hash will end up in the same partition
- All sPartitions will have a "min" and "max" value associated with them, and all values between the "min" and "max" will be in that partition
- The "min" and "max" values will be determined by using sampling to detect key frequency and range, and partition bounds will be initially set based on these estimates.
- Partitions are not guaranteed to be totally equal in size, their equality is based on the accuracy of the sample, and thus, the predicted per-sPartition min and max values. Partitions will grow or shrink as necessary to guarantee the first two conditions.
To summarize, a range partitioning will cause Spark to create a number of "buckets" equal to the number of requested sPartitions. It will then map these buckets to "ranges" of the specified partition key. For example, if your partition key is date, a range could be (Min: "2018-01-01", Max: "2019-01-01"). Then, for each record, compare the record's partition key value to the bucket min/max values, and distribute them accordingly.
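In the DataFrame API this is exposed as repartitionByRange; a minimal example, with an illustrative partition count and column name:

import org.apache.spark.sql.functions.col

// Sample the date column, derive ~200 contiguous ranges, and route each record
// to the bucket whose min/max range contains its date.
val ranged = df.repartitionByRange(200, col("date"))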
While this is, overall, fairly efficient, the sampling required to determine bounds is not free. To sample, Spark has to compute your whole dataset, so caching your dataset may be necessary, or at least, beneficial. In addition, sample results are stored on the driver, so driver memory must be increased — roughly 4–6G at most in our tests — but this will depend on your record and dataset size.
Evaluation
Repartition by range seems to deliver what we need, in theory. However, the first guarantee — all records with the same hash will end up in the same partition — is a sticking point. For our purposes, this makes it the same as a simple repartition, but more expensive, and thus unusable as it stands.
Repartition by Range with Additional Columns
However, we should not give up on repartition by range without a fight.
We will make two improvements:
First, we will hash the columns that make up our partition key. We don't actually care about the relative sorting of our keys, or the raw values of our partition keys at all, only that partition keys are differentiated, and a hash guarantees that. This reduces the cost of sampling (as all samples are collected in driver memory) as well as the cost of comparing partition keys to min/max partition bounds.
Second, we will add a random key in addition to our hash. This will mean that, due to the hierarchical, key-based sort, we will effectively have all records for a given partition key present in multiple, sequential sPartitions.
Partitions won’t be perfectly even because of the random factors, but you get pretty close. 即使是由于随机因素,分区也不是完美的,但是您可以接近。An implementation of this would look something like:
这样的实现看起来像:
// Differentiate partition keys with a hash column, spread each key's records
// with a random column, then range-partition on (hash, rand).
val randDataframe = dataFrame.withColumn(
  "hash",
  hash(partitionColumns.map(new ColumnName(_)): _*)
).withColumn(
  "rand",
  rand()
)

randDataframe.repartitionByRange(
  fileCount,
  $"hash",
  $"rand"
).drop("hash", "rand")
On the surface, this looks similar to the repartition by columns + rand approach earlier, and you might suspect it has the same collision problems.
However, the hash calculation in the previous approach amounts to:
(date, rand() % files per hPartition).hashCode % sPartitions
Which ends up having a total number of unique hashes equal to your sPartition count.
Here, the hash calculation is simply:
(date, rand()).hashCode
Which has, effectively, infinite possible hashes.
The reason for this is that the hash here is used only to determine the uniqueness of our keys, whereas the hash function used in the previous example is a two-tier system designed to assign records to a specific, limited set of buckets.
Evaluation
评价
This solution works for all cases where we have multiple hPartition outputs, regardless of the count of output hPartitions/sPartitions, or the relative size.
该解决方案适用于所有具有多个hPartition输出的情况,而与输出hPartitions / sPartitions的数量或相对大小无关。
Earlier, we discussed that determining file count at the dataset level, rather than the key level, is the cheapest and easiest approach to perform generically, and thus, this approach requires no information to be provided by developers.
As long as the total file count provided to the function is reasonable, we can expect, at most, fileCount + count(distinct hash) files written to disk.
Bringing it all Together
So, to summarize, what should you be doing?
Determine your ideal file count
I recommend using count-based heuristics, and applying them to the entire dataset, rather than on a per-key basis. Encode these statically if you wish, to increase performance by skipping the count.
Apply a repartitioning scheme
Based on my above evaluations, I recommend using the following to decide what repartitioning scheme to use:
Use coalesce if:
- You're writing fewer files than your sPartition count
- You can bear to perform a cache and count operation before your coalesce
- You're writing exactly 1 hPartition
Use simple repartition if:
- You're writing exactly 1 hPartition
- You can't use coalesce
Use a simple repartition by columns if:
- You're writing multiple hPartitions, but each hPartition needs exactly 1 file
- Your hPartitions are roughly equally sized
Use a repartition by columns with a random factor if:
- Your hPartitions are roughly equally sized
- You feel comfortable maintaining a files-per-hPartition variable for this dataset
- You can estimate the number of output hPartitions at runtime, or you can guarantee your default parallelism will always be much larger (~3x) than your output file count for any dataset you're writing.
Use a repartition by range (with the hash/rand columns) in every other case.
For many use cases, caching is an acceptable cost, so I suspect this will boil down to:
Use coalesce if you’re writing to one hPartition.
Use repartition by columns with a random factor if you can provide the necessary file constants.
Use repartition by range in every other case.
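As a hedged end-to-end sketch of that default path (the constants, column handling, and output table are illustrative; the hash/rand trick mirrors the implementation shown earlier):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, hash, rand}

def writeWithControlledFileCount(
    df: DataFrame,
    partitionColumns: Seq[String],
    targetRowsPerFile: Long = 1000000L): Unit = {   // illustrative constant

  val cached = df.cache()                           // needed for the count-based heuristic
  val rowCount = cached.count()
  val fileCount = math.max(1, math.ceil(rowCount.toDouble / targetRowsPerFile).toInt)

  cached
    .withColumn("hash", hash(partitionColumns.map(col): _*))
    .withColumn("rand", rand())
    .repartitionByRange(fileCount, col("hash"), col("rand"))
    .drop("hash", "rand")
    .write
    .insertInto("warehouse.events_daily")           // hypothetical output table
}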
Looking Forward
Repartition by range works fairly well. However, it can be improved for this use case, by removing some of the guarantees and constraints it provides. We are experimenting with custom, more efficient versions of that repartition strategy specifically for managing your Spark file count. Stay tuned!
Interested in becoming an expert on repartitioning schemes? Want to solve the biggest of big data problems? Airbnb is hiring! Check out our open positions and apply!
Source: https://medium.com/airbnb-engineering/on-spark-hive-and-small-files-an-in-depth-look-at-spark-partitioning-strategies-a9a364f908