Concepts:
Data lakes and data warehouses are both widely used for storing big data, but the terms are not interchangeable. A data lake is a vast pool of raw data whose purpose is not yet defined. A data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose.
The two types of data storage are often confused, but they are far more different than they are alike. In fact, the only real similarity between them is their high-level purpose of storing data.
The distinction matters because they serve different purposes and need different kinds of expertise to be properly optimized. A data lake may work well for one company, while a data warehouse is a better fit for another.
Four key differences between a data lake and a data warehouse
A data lake and a data warehouse differ in several ways. The key differentiators are data structure, ideal users, processing methods, and the overall purpose of the data.
                   Data Lake                               Data Warehouse
Data structure     Raw                                     Processed
Purpose of data    Not yet determined                      Currently in use
Users              Data scientists                         Business professionals
Accessibility      Highly accessible and quick to update   More complicated and costly to change
Data structure: raw vs. processed
Raw data is data that has not yet been processed for a purpose. Perhaps the greatest difference between data lakes and data warehouses is the structure of the data they hold: data lakes primarily store raw, unprocessed data, while data warehouses store processed and refined data.
Because of this, data lakes typically require much larger storage capacity than data warehouses. Raw, unprocessed data is also malleable, can be quickly analyzed for any purpose, and is ideal for machine learning. The risk of keeping all that raw data, however, is that a data lake can become a data swamp if appropriate data quality and data governance measures are not in place.
Data warehouses, by storing only processed data, save on pricey storage space by not maintaining data that may never be used. Processed data can also be easily understood by a larger audience.
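As a rough illustration of the raw vs. processed distinction, the sketch below keeps a few hypothetical raw event records exactly as they arrived (the data lake side) and derives a cleaned, typed set of rows from them (the data warehouse side). The record fields, the `to_warehouse_row` helper, and the cleaning rules are all assumptions made for this example; they do not describe any particular product.

```python
import json

# Hypothetical raw events, stored as-is the way a data lake would keep them.
# Shapes differ from record to record and nothing has been validated.
RAW_EVENTS = [
    '{"customer": "a1", "amount": "19.99", "day": "2024-01-05"}',
    '{"customer": "b2", "amount": "5", "day": "2024-01-05", "note": "gift"}',
    '{"customer": "c3", "clicks": 7}',
    'not even json',
]


def to_warehouse_row(line):
    """Return a cleaned (customer, amount, day) tuple, or None if the record
    does not fit the purchases table this example assumes."""
    try:
        record = json.loads(line)
        return (record["customer"], float(record["amount"]), record["day"])
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, missing fields, or unexpected shapes are skipped.
        return None


# The "lake" keeps every record; the "warehouse" keeps only processed rows.
lake = list(RAW_EVENTS)
warehouse = [row for row in map(to_warehouse_row, lake) if row is not None]

print(len(lake), "raw records kept in the lake")
print(len(warehouse), "processed rows loaded into the warehouse")
```

The lake side holds more data, including records nobody can use yet, while the warehouse side is smaller and immediately readable, which mirrors the storage and audience trade-off described above.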
Purpose: undetermined vs. in use
The purpose of individual pieces of data in a data lake is not fixed. Raw data flows into a data lake, sometimes with a specific future use in mind and sometimes just to have on hand. As a result, data lakes are less organized and less filtered than data warehouses.
Processed data is raw data that has been put to a specific use. Since data warehouses house only processed data, all of the data in a data warehouse has been used for a specific purpose within the organization. This means that storage space is not wasted on data that may never be used.
Users: data scientists vs. business professionals
Data lakes are often difficult to navigate for those unfamiliar with unprocessed data. Raw, unstructured data usually requires a data scientist and specialized tools to understand it and translate it for any specific business use.
That said, there is growing momentum behind data preparation tools that provide self-service access to the information stored in data lakes.
Processed data is used in charts, spreadsheets, tables, and more, so that most, if not all, of the employees at a company can read it. Processed data, like that stored in data warehouses, requires only that the user be familiar with the topic represented.
Accessibility: flexible vs. secure
Accessibility and ease of use refer to the data repository as a whole, not to the data within it. Data lake architecture has no structure and is therefore easy to access and easy to change. Changes to the data can also be made quickly, since data lakes have very few limitations.
Data warehouses are, by design, more structured. One major benefit of data warehouse architecture is that the processing and structure of the data make the data itself easier to decipher. However, the limitations of that structure make data warehouses difficult and costly to manipulate.
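The flexibility trade-off can be sketched in a few lines. Here a plain Python list stands in for a data lake and an in-memory SQLite table stands in for a data warehouse; both stand-ins, the `purchases` table, and the `coupon` field are assumptions chosen only to make the contrast visible, not a description of how real lake or warehouse products are built.

```python
import sqlite3

# "Lake": an append-only list of raw records; any shape is accepted.
lake = []
lake.append({"customer": "a1", "amount": 19.99})
# A record with a new field is stored without any schema change.
lake.append({"customer": "b2", "amount": 5.0, "coupon": "SPRING"})

# "Warehouse": a table with a fixed schema (SQLite is only a stand-in here).
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE purchases (customer TEXT NOT NULL, amount REAL NOT NULL)"
)
warehouse.execute("INSERT INTO purchases VALUES (?, ?)", ("a1", 19.99))

try:
    # The schema has no coupon column, so this write is rejected.
    warehouse.execute(
        "INSERT INTO purchases (customer, amount, coupon) VALUES (?, ?, ?)",
        ("b2", 5.0, "SPRING"),
    )
    rejected = False
except sqlite3.OperationalError:
    rejected = True

# Accepting the new field requires an explicit schema change first.
warehouse.execute("ALTER TABLE purchases ADD COLUMN coupon TEXT")
warehouse.execute(
    "INSERT INTO purchases (customer, amount, coupon) VALUES (?, ?, ?)",
    ("b2", 5.0, "SPRING"),
)
row_count = warehouse.execute("SELECT COUNT(*) FROM purchases").fetchone()[0]

print("rejected before schema change:", rejected)
print("rows after schema change:", row_count)
```

The extra schema-change step is the cost of structure: it slows changes down, but it also guarantees that every row in the table has the shape its readers expect.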
Data lake vs. data warehouse: which is right for me?
Organizations often need both. Data lakes were born out of the need to harness big data and to benefit from raw, granular structured and unstructured data for machine learning, but there is still a need for data warehouses that support analytics by business users.
Healthcare: data lakes store unstructured information
Data warehouses have been used for many years in the healthcare industry, but they have never been hugely successful. Because much of the data in healthcare is unstructured (physicians' notes, clinical data, etc.) and real-time insights are needed, data warehouses are generally not an ideal model.
Data lakes allow for a combination of structured and unstructured data, which tends to be a better fit for healthcare companies.
Education: data lakes offer flexible solutions
In recent years, the value of big data in education reform has become enormously apparent. Data about student grades, attendance, and more can not only help failing students get back on track, but can also help predict potential issues before they occur. Flexible big data solutions have also helped educational institutions streamline billing, improve fundraising, and more.
Much of this data is vast and very raw, so institutions in education often benefit most from the flexibility of data lakes.
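To make the idea of spotting issues before they occur a little more concrete, here is a deliberately simple sketch that flags students from raw attendance and grade records. The student records, field names, and thresholds are all hypothetical, and a fixed rule like this is only a stand-in; a real early-warning system would more likely rely on a model trained on historical data.

```python
# Hypothetical raw student records; field names and values are made up
# for illustration only.
STUDENTS = [
    {"name": "Student A", "attendance": 0.96, "grades": [88, 91, 85]},
    {"name": "Student B", "attendance": 0.71, "grades": [74, 69, 62]},
    {"name": "Student C", "attendance": 0.90, "grades": [80, 72, 61]},
]


def grade_trend(grades):
    """Change from the first to the last recorded grade."""
    return grades[-1] - grades[0]


def at_risk(student, min_attendance=0.80, max_drop=-10):
    """Flag a student whose attendance is low or whose grades fell sharply.

    Both thresholds are arbitrary assumptions for this example.
    """
    return (
        student["attendance"] < min_attendance
        or grade_trend(student["grades"]) <= max_drop
    )


flagged = [s["name"] for s in STUDENTS if at_risk(s)]
print(flagged)
```

Because the records are kept raw, a new signal (for example, library visits) could be added to later records without reshaping the older ones, which is the kind of flexibility the paragraph above attributes to data lakes.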
Finance: data warehouses appeal to the masses
In finance, as in other business settings, a data warehouse is often the best storage model because it can be structured for access by the entire company rather than by a data scientist alone.
Big data has helped the financial services industry make big strides, and data warehouses have played a big part in them. The only reason a financial services company may be swayed away from such a model is that an alternative is more cost-effective, even if it is not as effective for other purposes.
Transportation: data lakes help make predictions
Much of the benefit of data lake insight lies in the ability to make predictions.
In the transportation industry, especially in supply chain management, the prediction capability that comes from flexible data in a data lake can have huge benefits, namely the cost savings realized by examining data from forms within the transport pipeline.
The importance of choosing a data lake or data warehouse
The “data lake vs. data warehouse” conversation has likely just begun, but the key differences in structure, process, users, and overall agility make each model unique. Depending on your company’s needs, developing the right data lake or data warehouse will be instrumental in growth.