
022 HBase Compaction and Data Locality in Hadoop

1. HBase Compaction and Data Locality With Hadoop


In this Hadoop HBase tutorial on HBase Compaction and Data Locality with Hadoop, we will learn the whole concept of Minor and Major Compaction in HBase, the process by which HBase cleans itself, in detail. Also, we will look at Data Locality with Hadoop, because data locality is the solution to data not being available to the Mapper.
So, let's start with HBase Compaction and Data Locality in Hadoop.


[Figure: HBase Compaction and Data Locality in Hadoop]

2. What is HBase Compaction?


As we know, HBase is a distributed data store optimized for read performance. This optimal read performance requires one file per column family, but during heavy writes it is not always possible to keep a single file per column family. Hence, to reduce the maximum number of disk seeks needed for a read, HBase tries to combine all HFiles into one large HFile. This process is what we call compaction.
Do you know about HBase Architecture?
In other words, compaction in HBase is the process by which HBase cleans itself, and it comes in two types: Minor HBase Compaction and Major HBase Compaction.
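Although compactions normally run automatically, they can also be requested on demand through the HBase Java client. Below is a minimal, hedged sketch using the Admin API; the table name my_table is only a placeholder, and the snippet assumes a reachable cluster and the HBase client libraries on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CompactionTrigger {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Admin admin = connection.getAdmin()) {
      TableName table = TableName.valueOf("my_table"); // hypothetical table name
      admin.compact(table);      // request a (minor) compaction for every region of the table
      admin.majorCompact(table); // request a major compaction; runs asynchronously on the servers
    }
  }
}
```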


a. HBase Minor Compaction


The process of combining a configurable number of smaller HFiles into one large HFile is what we call minor compaction. It is quite important because, without it, reading a particular row can require many disk reads and reduce overall performance.
Here are the several steps involved in HBase Minor Compaction:


  1. It combines smaller HFiles and creates a bigger HFile.
  2. The new HFile still stores the deleted and expired cells along with it; minor compaction does not remove them.
  3. It increases the space available to store more data.
  4. It uses merge sorting (see the merge sketch below).

[Figure: HBase Compaction]
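To make the merge-sorting step concrete, here is a small, self-contained Java sketch (illustrative only, not HBase code): each sorted input list stands in for a sorted HFile, and a priority queue performs the k-way merge in a single pass, which is essentially how minor compaction produces one large, still-sorted HFile.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMergeSketch {

  // One cursor per sorted input "file".
  private static class Cursor {
    final List<String> keys;
    int pos = 0;
    Cursor(List<String> keys) { this.keys = keys; }
    String current() { return keys.get(pos); }
    boolean exhausted() { return pos >= keys.size(); }
  }

  public static List<String> merge(List<List<String>> sortedFiles) {
    PriorityQueue<Cursor> heap =
        new PriorityQueue<>(Comparator.comparing(Cursor::current));
    for (List<String> file : sortedFiles) {
      if (!file.isEmpty()) heap.add(new Cursor(file));
    }
    List<String> merged = new ArrayList<>();
    while (!heap.isEmpty()) {
      Cursor smallest = heap.poll();                  // cursor with the globally smallest key
      merged.add(smallest.current());
      smallest.pos++;
      if (!smallest.exhausted()) heap.add(smallest);  // re-insert with its new head key
    }
    return merged;                                    // one large, still-sorted "HFile"
  }

  public static void main(String[] args) {
    List<String> merged = merge(List.of(
        List.of("a", "d", "g"), List.of("b", "e"), List.of("c", "f", "h")));
    System.out.println(merged);  // [a, b, c, d, e, f, g, h]
  }
}
```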

b. HBase Major Compaction


In contrast, the process of combining all StoreFiles of a region into a single StoreFile is what we call HBase Major Compaction. It also removes deleted and expired versions. As a process, it merges all StoreFiles into a single StoreFile and runs every 24 hours. However, the region will split into new regions after compaction if the new, larger StoreFile is greater than a certain size (defined by a property).
Have a look at HBase Commands.
The HBase Major Compaction process works as follows:


  1. The data present per column family in one region is accumulated into one HFile.
  2. All deleted or expired cells are removed permanently during this process.
  3. It increases the read performance of the newly created HFile.
  4. It involves a large amount of I/O.
  5. It can cause network traffic congestion.
  6. The major compaction process is also known as the write amplification process.
  7. It must therefore be scheduled when network I/O load is at its minimum.

[Figure: HBase Major Compaction]



3. HBase Compaction Tuning


a. Short Description of HBase Compaction:


Now, to enhance the performance and stability of the HBase cluster, we can use some of the less well-known ("hidden") HBase compaction configuration settings described below.


b. Disabling Automatic Major Compactions in HBase


Generally, HBase users ask for full control of major compaction events. The way to get it is to set **hbase.hregion.majorcompaction** to 0, which disables periodic automatic major compactions in HBase.
However, this still does not offer 100% control of major compactions, because HBase can sometimes automatically promote minor compactions to major ones. Luckily, there is another configuration option that helps in this case.
Let's take a tour of HBase Operations.
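Once automatic major compactions are disabled (hbase.hregion.majorcompaction set to 0 in hbase-site.xml on the RegionServers), operators typically trigger them on their own schedule. The sketch below is one hedged way to do that from a client-side job; the table name my_table and the 24-hour period are illustrative assumptions, and a cron job invoking the HBase shell is also common.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class ScheduledMajorCompaction {
  public static void main(String[] args) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    // Request a major compaction of the (hypothetical) table once every 24 hours.
    scheduler.scheduleAtFixedRate(() -> {
      try (Connection connection =
               ConnectionFactory.createConnection(HBaseConfiguration.create());
           Admin admin = connection.getAdmin()) {
        admin.majorCompact(TableName.valueOf("my_table")); // asynchronous request
      } catch (Exception e) {
        e.printStackTrace();
      }
    }, 0, 24, TimeUnit.HOURS);
  }
}
```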


c. Maximum HBase Compaction Selection Size


Another option to control the compaction process in HBase is:
hbase.hstore.compaction.max.size (by default set to Long.MAX_VALUE)
In HBase 1.2+ we also have:
hbase.hstore.compaction.max.size.offpeak
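As a rough sketch of how these two properties might be used (the 10 GB cap is an assumed example value, not a recommendation): they are server-side settings, so in a real deployment they belong in hbase-site.xml on the RegionServers; the Java form below only illustrates the property names and values.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class CompactionMaxSizeConfig {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // StoreFiles larger than this are excluded from compaction selection (example cap: 10 GB).
    conf.setLong("hbase.hstore.compaction.max.size", 10L * 1024 * 1024 * 1024);
    // HBase 1.2+: a separate, usually more permissive cap for the off-peak window.
    conf.setLong("hbase.hstore.compaction.max.size.offpeak", Long.MAX_VALUE);
  }
}
```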


d. Off-peak Compactions in HBase


Further, we can use the off-peak configuration settings if our deployment has off-peak hours.
These HBase compaction configuration options must be set to enable off-peak compaction:
hbase.offpeak.start.hour = 0..23
hbase.offpeak.end.hour = 0..23
The compaction file ratio is 5.0 for off-peak hours (by default) and 1.2 for peak hours.
Both can be changed:
hbase.hstore.compaction.ratio
hbase.hstore.compaction.ratio.offpeak
The higher the file ratio value, the more aggressive (frequent) the compaction. So, for the majority of deployments, the default values are fine.
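A minimal sketch of the off-peak settings discussed above, assuming an off-peak window of 01:00-05:00 (the window itself is an assumption; the ratio values shown are the defaults). Like the previous snippet, these are server-side properties that belong in hbase-site.xml on the RegionServers; the Java form is only a compact way to show them.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class OffPeakCompactionConfig {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Define the off-peak window (hours are in the range 0..23).
    conf.setInt("hbase.offpeak.start.hour", 1);
    conf.setInt("hbase.offpeak.end.hour", 5);
    // File-selection ratios: a higher ratio means more aggressive (frequent) compaction.
    conf.setFloat("hbase.hstore.compaction.ratio", 1.2f);          // peak hours (default)
    conf.setFloat("hbase.hstore.compaction.ratio.offpeak", 5.0f);  // off-peak hours (default)
  }
}
```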



4. Data Locality in Hadoop


As we know, in Hadoop, datasets are stored in HDFS. Basically, a dataset is divided into blocks and stored among the data nodes of a Hadoop cluster. When a MapReduce job is executed against the dataset, the individual Mappers process those blocks (input splits). When the data is not available to a Mapper on the same node, it has to be copied over the network from the data node that holds it to the data node executing the Mapper task. This concept is what we call data locality in Hadoop.
You can learn more about Data Locality in Hadoop
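To see where this concept comes from, here is a short, hedged sketch: HDFS exposes, for every block of a file, the hosts that store a replica, and the MapReduce scheduler uses exactly this information to prefer a data-local Mapper, then an intra-rack one, and only then an inter-rack one. The path /data/input/events.txt is an illustrative assumption.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocality {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/data/input/events.txt"); // hypothetical input file
    FileStatus status = fs.getFileStatus(path);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

    for (BlockLocation block : blocks) {
      // Hosts holding a replica of this block; candidates for a data-local Mapper.
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
    }
  }
}
```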
In Hadoop, there are 3 categories of data locality:


[Figure: Data Locality in Hadoop]

1. Data Local Data Locality


Data-local data locality is when the data is located on the same node as the Mapper working on it. In this case, the data is in very close proximity to the computation. This is the most preferable option.


2. Intra-Rack Data Locality


However, because of resource constraints, it is not always possible to execute the Mapper on the same node as the data. In that case, the Mapper runs on another node within the same rack as the node that holds the data. This is what we call intra-rack data locality.


3. Inter-Rack Data Locality


Finally, there are cases where, because of resource constraints, we can achieve neither data-local nor intra-rack locality. We then have to execute the Mapper on a node in a different rack, and the data is copied across racks from the node that holds it to the node executing the Mapper. This is what we call inter-rack data locality, and it is the least preferable option.
Let's learn the features and principles of Hadoop.
So, this was all about HBase Compaction and Data Locality in Hadoop. Hope you like our explanation.


5. Conclusion: HBase Compaction


Hence, in this Hadoop HBase tutorial on HBase Compaction and Data Locality, we have seen the cleaning process of HBase, that is, HBase compaction. We have also seen, in detail, the solution to data not being available to the Mapper: Apache Hadoop data locality. Hope it helps! Please share your experience through comments on our HBase Compaction explanation.
See also –
HBase Performance Tuning
For reference:


https://data-flair.training/blogs/hbase-compaction
