13 Big Limitations of Hadoop & Solution To Hadoop Drawbacks
Although Hadoop is one of the most widely used big data tools, it has several limitations: it is not suited to small files, it cannot reliably handle live data, its processing speed is slow, and it is inefficient at iterative processing and caching.
In this tutorial on the limitations of Hadoop, we first look at what Hadoop is and at the features that made it so popular. We then go through 13 big disadvantages of Hadoop, which led to the rise of Apache Spark and Apache Flink, and learn about ways to overcome each drawback.
Hadoop – Introduction & Features
Let us start with what Hadoop is and which features make it so popular.
Hadoop is an open-source software framework for distributed storage and distributed processing of extremely large data sets. Important features of Hadoop are:
Apache Hadoop is an open-source project, which means anyone can modify its code to suit business requirements.
In Hadoop, data remains highly available and accessible despite hardware failure because multiple copies of the data are kept. If a machine or any other hardware crashes, we can access the data from another path.
Hadoop is highly scalable, as we can easily add new hardware to a node. Hadoop also provides horizontal scalability, which means we can add nodes to the cluster on the fly without any downtime.
Hadoop is **fault tolerant** because, by default, 3 replicas of each block are stored across the cluster. If any node goes down, the data on that node can easily be recovered from another node.
In Hadoop, data is stored reliably on the cluster despite machine failure, thanks to the replication of data across the cluster.
Hadoop runs on a cluster of commodity hardware, which is not very expensive.
Hadoop is easy to use, as the client does not need to deal with distributed computing; the framework takes care of it.
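To make the fault-tolerance point concrete, here is a minimal sketch in plain Python. It is a toy model, not HDFS code: the node names and the functions `place_blocks` and `readable_blocks` are made up for illustration, and the round-robin placement is an assumption (real HDFS placement is rack-aware). The only figure taken from the text is the default of 3 replicas per block.

```python
REPLICATION = 3  # default number of replicas per block, as stated above


def place_blocks(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Toy placement only (an assumption); real HDFS placement is rack-aware.
    """
    placement = {}
    for block_id in range(num_blocks):
        start = block_id % len(nodes)
        placement[block_id] = {
            nodes[(start + i) % len(nodes)] for i in range(replication)
        }
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the ids of blocks that still have at least one live replica."""
    failed = set(failed_nodes)
    return [block for block, holders in placement.items() if holders - failed]


nodes = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical cluster
placement = place_blocks(num_blocks=10, nodes=nodes)

# With 3 replicas on distinct nodes, losing any two nodes loses no block.
survivors = readable_blocks(placement, failed_nodes=["node2", "node4"])
print(len(survivors), "of", len(placement), "blocks still readable")
```

Only when all three nodes holding the same block fail at once does that block become unreadable, which is why replication gives both the availability and the reliability described above.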
Like every technology, Hadoop has drawbacks as well as strengths. Having seen its features and advantages above, let us now look at the limitations of Hadoop that brought Apache Spark and Apache Flink into the picture.
13 Big Limitations of Hadoop for Big Data Analytics
In this section we discuss the limitations of Hadoop along with their solutions:
1. Issue with Small Files
Hadoop is not suited to small data. The Hadoop Distributed File System (HDFS) cannot efficiently support random reads of small files because it is designed for high capacity.
Small files are a major problem in HDFS. A small file is one that is significantly smaller than the HDFS block size (128 MB by default). HDFS is designed to work with a small number of large files that hold large data sets, not with a large number of small files, so it cannot handle huge numbers of them well. If there are too many small files, the NameNode becomes overloaded, since it stores the namespace of HDFS.
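A rough back-of-the-envelope calculation shows why the NameNode suffers. The sketch below assumes about 150 bytes of NameNode memory per namespace object (one per file and one per block). That figure is a commonly quoted rule of thumb and not something this article states, so treat the absolute numbers as illustrative only; the function names are hypothetical.

```python
import math

BLOCK_SIZE = 128 * 1024**2  # default HDFS block size from the text (128 MB)
BYTES_PER_OBJECT = 150      # ASSUMPTION: rough rule of thumb, not from the article


def namenode_objects(num_files, file_size):
    """Count namespace objects: one per file plus one per block it occupies."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    return num_files * (1 + blocks_per_file)


def namenode_bytes(num_files, file_size):
    """Estimate NameNode memory for `num_files` files of `file_size` bytes."""
    return namenode_objects(num_files, file_size) * BYTES_PER_OBJECT


num_small = 10_000_000   # ten million small files
small_size = 100 * 1024  # 100 KB each
total_data = num_small * small_size

small = namenode_bytes(num_small, small_size)

# The same data merged into files of one full block each.
merged_files = math.ceil(total_data / BLOCK_SIZE)
large = namenode_bytes(merged_files, BLOCK_SIZE)

print(f"as small files: {small / 1024**2:.0f} MiB of NameNode memory")
print(f"after merging:  {large / 1024**2:.2f} MiB of NameNode memory")
```

Under these assumptions the same amount of data needs roughly a thousand times less NameNode memory once it is stored as a few thousand large files instead of millions of small ones.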
Solution-
The solution to the small-file issue is simple: merge the small files into bigger files, then copy the bigger files to HDFS.
**HAR files** (Hadoop Archives) were introduced to reduce the pressure that large numbers of files put on the NameNode’s memory. HAR files work by building a layered filesystem on top of HDFS. They are created with the Hadoop archive command, which runs a MapReduce job to pack the files being archived into a small number of HDFS files. Reading files through a HAR is no more efficient than reading them directly from HDFS; it is in fact slower, because each HAR file access requires reading two index files in addition to the data file.
**Sequence files** work very well in practice for the ‘small file problem’. Here we use the filename as the key and the file contents as the value. By writing a program that puts many small files (of, say, 100 KB each) into a single Sequence file, we can then process them in a streaming fashion. Because the Sequence file is splittable, MapReduce can break it into chunks and operate on each chunk independently.
Storing files in HBase is a very common design pattern for overcoming the small file problem with HDFS. We do not actually store millions of small files in HBase; rather, we add the binary content of each file to a cell.
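The idea behind the Sequence file approach (filename as key, file contents as value, all packed into one container that is read in a single pass) can be sketched in plain Python. This is not the real SequenceFile binary format or the Hadoop API; the record layout and the `pack` and `unpack` helpers are hypothetical and only illustrate the key-value packing.

```python
import io
import struct

HEADER = ">II"  # two big-endian unsigned ints: key length, value length


def pack(files, out):
    """Append each (name, content) pair to `out` as a length-prefixed record."""
    for name, content in files.items():
        key = name.encode("utf-8")
        out.write(struct.pack(HEADER, len(key), len(content)))
        out.write(key)
        out.write(content)


def unpack(stream):
    """Yield (name, content) pairs back in one streaming pass."""
    header_size = struct.calcsize(HEADER)
    while True:
        header = stream.read(header_size)
        if len(header) < header_size:
            return  # end of container
        key_len, value_len = struct.unpack(HEADER, header)
        name = stream.read(key_len).decode("utf-8")
        yield name, stream.read(value_len)


# Hypothetical small files; in practice these would be read from disk.
small_files = {
    "logs/a.txt": b"first small file",
    "logs/b.txt": b"second small file",
    "logs/c.txt": b"",
}

container = io.BytesIO()  # stands in for the single large file copied to HDFS
pack(small_files, container)
container.seek(0)

for name, content in unpack(container):
    print(name, len(content), "bytes")
```

The storage system now sees one file instead of three, while each original file can still be recovered by its name.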
2. Slow Processing Speed
In Hadoop, MapReduce processes large data sets with a parallel and distributed algorithm. There are two tasks to perform, Map and Reduce, and MapReduce needs a lot of time to complete them, which increases latency. Because the data is distributed and processed across the cluster, processing takes longer and speed drops.
Solution-
Spark overcomes this limitation of Hadoop through in-memory processing of data. In-memory processing is faster because no time is spent moving data and processes in and out of the disk. Spark can be up to 100 times faster than MapReduce when it processes data entirely in memory. We can also use Flink, which processes data faster than Spark because of its streaming architecture: Flink can be instructed to process only the parts of the data that have actually changed, which significantly improves job performance.
3. Support for Batch Processing only
Hadoop supports batch processing only. It does not process streamed data, so overall performance is slower. The MapReduce framework of Hadoop also does not make full use of the memory of the Hadoop cluster.
Solution-
Spark addresses these limitations and improves performance, but **Spark stream processing** is not as efficient as Flink because it uses micro-batch processing. Flink improves overall performance by providing a single run-time for both streaming and batch processing. Flink also uses native closed-loop iteration operators, which make machine learning and graph processing faster.
4. No Real-time Data Processing
Apache Hadoop is built for batch processing: it takes a huge amount of data as input, processes it, and produces the result. Batch processing is very efficient for high volumes of data, but depending on the size of the data and the computational power of the system, the output can be delayed significantly. Hadoop is therefore not suitable for real-time data processing.
Solution-
Apache Spark supports stream processing. Stream processing involves continuous input and output of data. It emphasizes the velocity of the data, which is processed within a small period of time.
Apache Flink provides a single run-time for streaming as well as batch processing, so one common run-time serves both data streaming applications and batch processing applications. Flink is a stream processing system that can process row after row in real time.
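The difference between processing records in small batches and processing them row after row can be illustrated with a toy latency model. The sketch below is plain Python, not Spark or Flink code. The 1-second batch interval and the 5 ms per-record cost are invented numbers, and the time a batch itself takes to process is ignored.

```python
def micro_batch_latencies(arrivals_ms, batch_interval_ms):
    """Each record waits until the end of the batch window it arrived in."""
    latencies = []
    for t in arrivals_ms:
        window_end = (t // batch_interval_ms + 1) * batch_interval_ms
        latencies.append(window_end - t)
    return latencies


def per_record_latencies(arrivals_ms, processing_ms):
    """Each record is handled as soon as it arrives."""
    return [processing_ms for _ in arrivals_ms]


arrivals = [0, 250, 500, 750, 1000, 1250]  # arrival times in milliseconds

# ASSUMPTIONS: 1000 ms batch interval and 5 ms per-record cost are made up.
batch = micro_batch_latencies(arrivals, batch_interval_ms=1000)
stream = per_record_latencies(arrivals, processing_ms=5)

print("micro-batch mean wait:", sum(batch) / len(batch), "ms")
print("per-record mean wait:", sum(stream) / len(stream), "ms")
```

In this model a record that arrives just after a window opens waits almost the whole interval before its result appears, which is the extra latency that batching adds compared with true record-at-a-time streaming.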
5. No Delta Iteration
Hadoop is not very efficient for iterative processing, as it does not support cyclic data flow (i.e. a chain of stages in which the output of each stage is the input to the next).
Solution-
We can use Apache Spark to overcome this limitation of Hadoop. Spark accesses data from RAM instead of disk, which dramatically improves the performance of iterative algorithms that access the same dataset repeatedly. Spark iterates over its data in batches, and each iteration is scheduled and executed separately.
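Why keeping the dataset in RAM helps iterative algorithms can be seen by counting how often the input is read. The following is a minimal sketch in plain Python, not Spark code: `CountingStore` is a made-up stand-in for disk-backed storage, and `refine` is an arbitrary iterative step chosen only for illustration.

```python
class CountingStore:
    """Stands in for disk-backed storage and counts full reads of the data."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._values)


def refine(estimate, data):
    """One iterative step: move the estimate halfway toward the data mean."""
    mean = sum(data) / len(data)
    return estimate + 0.5 * (mean - estimate)


def iterate_without_cache(store, iterations):
    """Re-read the input on every pass, as a chain of separate jobs would."""
    estimate = 0.0
    for _ in range(iterations):
        estimate = refine(estimate, store.load())
    return estimate


def iterate_with_cache(store, iterations):
    """Read the input once and keep it in memory for all passes."""
    data = store.load()
    estimate = 0.0
    for _ in range(iterations):
        estimate = refine(estimate, data)
    return estimate


disk_store = CountingStore([2.0, 4.0, 6.0])
cached_store = CountingStore([2.0, 4.0, 6.0])

result_disk = iterate_without_cache(disk_store, iterations=10)
result_cached = iterate_with_cache(cached_store, iterations=10)

print("reads without cache:", disk_store.reads)
print("reads with cache:", cached_store.reads)
```

Both variants compute the same result, but the cached one touches storage once instead of once per iteration, and the gap grows with the number of iterations.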
6. Latency
In Hadoop, the MapReduce framework is comparatively slow, since it is designed to support different formats and structures and huge volumes of data. In MapReduce, Map takes a set of data and converts it into another set of data in which individual elements are broken down into key-value pairs; Reduce then takes the output of Map as its input and processes it further. MapReduce requires a lot of time to perform these tasks, which increases latency.
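The Map and Reduce steps described above can be sketched as a single-process word count in plain Python. This only illustrates the key-value data flow; it does not use the Hadoop API, the function names are hypothetical, and a real job would run these phases across many machines with disk and network transfers in between.

```python
from collections import defaultdict


def map_phase(line):
    """Map: break one input line into (word, 1) key-value pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values that share a key into one result."""
    return key, sum(values)


lines = ["Hadoop stores data", "Hadoop processes data"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

print(counts)
```

Every phase has to finish and hand its output to the next one before a result appears, which is part of why latency adds up on large inputs.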
Solution-
Spark reduces this limitation of Hadoop. Apache Spark is yet another batch system, but it is relatively faster because it caches much of the input data in memory using RDDs (Resilient Distributed Datasets) and keeps intermediate data in memory as well. Flink’s data streaming achieves low latency and high throughput.
7. Not Easy to Use
In Hadoop, MapReduce developers need to hand-code each and every operation, which makes it very difficult to work with. MapReduce has no interactive mode, but adding tools such as Hive and Pig makes working with MapReduce a little easier for adopters.
Solution-
To address this drawback of Hadoop, we can use Spark. Spark has an interactive mode, so developers and users alike get immediate feedback on queries and other activities. Spark is easy to program because it has many high-level operators. Flink is also easy to use, as it has high-level operators too. In this way Spark resolves many limitations of Hadoop.
8. Security
Managing complex applications in Hadoop is challenging. If whoever manages the platform does not know how to secure it, your data can be at huge risk. In its default configuration Hadoop does not encrypt data at the storage and network levels, which is a major point of concern. Hadoop supports Kerberos authentication, which is hard to manage.
HDFS supports access control lists (ACLs) and a traditional file permissions model. However, third-party vendors have enabled organizations to leverage **Active Directory Kerberos** and LDAP for authentication.
Solution-
Spark provides a security bonus that helps overcome these limitations of Hadoop. If we run Spark on HDFS, it can use HDFS ACLs and file-level permissions. Additionally, Spark can run on YARN, which gives it the capability of using Kerberos authentication.
9. No Abstraction
Hadoop does not offer any type of abstraction, so MapReduce developers need to hand-code each and every operation, which makes it very difficult to work with.
Solution-
To overcome this drawback of Hadoop, we can use Spark, which provides the RDD abstraction for batch processing. Flink provides the DataSet abstraction.
10. Vulnerable by Nature
Hadoop is written almost entirely in Java, one of the most widely used languages. Java has therefore been heavily exploited by cyber criminals and, as a result, implicated in numerous security breaches.
11. No Caching
Hadoop is not efficient at caching. In Hadoop, MapReduce cannot cache intermediate data in memory for later use, which diminishes the performance of Hadoop.
Solution-
Spark and Flink overcome this limitation of Hadoop: both cache data in memory for further iterations, which enhances overall performance.
12. Lengthy Line of Code
Hadoop has 120,000 lines of code. More lines of code tend to mean more bugs, and the program takes more time to execute.
Solution-
Spark and Flink are written in Scala and Java, but the implementation is in Scala, so they have fewer lines of code than Hadoop. As a result, programs take less time to execute, which addresses Hadoop’s lengthy-code limitation.
13. Uncertainty
Hadoop only ensures that the data job completes; it cannot guarantee when the job will be complete.
Limitations of Hadoop and Its solutions – Summary
The limitations of Hadoop created the need for Spark and Flink, which make it friendlier to work with huge amounts of data. Spark provides in-memory processing of data and thus improves processing speed. Flink improves overall performance by providing a single run-time for both streaming and batch processing. Spark also provides a security bonus.
Now that the flaws of Hadoop have been exposed, will you continue to use it for your big data initiatives, or swap it for something else?
If you have any queries on the limitations of Hadoop, or any feedback, just drop a comment in the comment section and we will get back to you.
https://data-flair.training/blogs/13-limitations-of-hadoop