Data Lake vs. Data Warehouse: Differences and Applications


Data lakes and data warehouses are both widely used for storing big data, but they are not interchangeable terms. A data lake is a vast pool of raw data, the purpose of which is not yet defined. A data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose.

The two types of data storage are often confused, but are much more different than they are alike. In fact, the only real similarity between them is their high-level purpose of storing data.

The distinction is important because the two serve different purposes and require different skill sets to be properly optimized. A data lake may work well for one company, while a data warehouse will be a better fit for another.




Four key differences between a data lake and a data warehouse


There are several differences between a data lake and a data warehouse. Data structure, ideal users, processing methods, and the overall purpose of the data are the key differentiators.


                    Data Lake                                Data Warehouse
Data Structure      Raw                                      Processed
Purpose of Data     Not yet determined                       Currently in use
Users               Data scientists                          Business professionals
Accessibility       Highly accessible and quick to update    More complicated and costly to change

Data structure: raw vs. processed


Raw data is data that has not yet been processed for a purpose. Perhaps the greatest difference between data lakes and data warehouses is the varying structure of raw vs. processed data. Data lakes primarily store raw, unprocessed data, while data warehouses store processed and refined data.

Because of this, data lakes typically require much larger storage capacity than data warehouses. Additionally, raw, unprocessed data is malleable, can be quickly analyzed for any purpose, and is ideal for machine learning. The risk of all that raw data, however, is that data lakes sometimes become data swamps when appropriate data quality and data governance measures are not in place.

Data warehouses, by storing only processed data, save on pricey storage space by not maintaining data that may never be used. Additionally, processed data can be easily understood by a larger audience.
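
The raw vs. processed distinction can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the sample events, the `sales` table, and the `load_warehouse` helper are hypothetical, with a plain list standing in for the lake and an in-memory SQLite database standing in for the warehouse. Real platforms differ considerably in scale and tooling.

```python
import json
import sqlite3

# Hypothetical raw events, kept exactly as they arrived (the "lake").
# The shapes differ and nothing has been cleaned or filtered yet.
RAW_EVENTS = [
    '{"customer": "ana", "amount": "19.99", "ts": "2024-01-05T10:00:00"}',
    '{"customer": "ben", "amount": 5, "note": "refund pending"}',
    '{"sensor": "t-17", "reading": 21.4}',
]


def load_warehouse(raw_events):
    """Store only records that fit the (assumed) sales schema, with typed values."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (customer TEXT NOT NULL, amount REAL NOT NULL)"
    )
    for line in raw_events:
        record = json.loads(line)
        # Filter: records without both fields serve no purpose for this table.
        if "customer" in record and "amount" in record:
            conn.execute(
                "INSERT INTO sales (customer, amount) VALUES (?, ?)",
                (record["customer"], float(record["amount"])),
            )
    conn.commit()
    return conn


conn = load_warehouse(RAW_EVENTS)
rows = conn.execute(
    "SELECT customer, amount FROM sales ORDER BY customer"
).fetchall()

print(len(RAW_EVENTS), "raw events stay in the lake")
print(len(rows), "processed rows reach the warehouse:", rows)
```

The lake keeps all three events, including the sensor reading that has no use yet. The warehouse keeps only the two sales rows, which is why it needs less storage and is easier for a wider audience to read.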




Purpose: undetermined vs in-use


The purpose of individual data pieces in a data lake is not fixed. Raw data flows into a data lake, sometimes with a specific future use in mind and sometimes just to have on hand. This means that data lakes have less organization and less filtration of data than their counterpart.

Processed data is raw data that has been put to a specific use. Since data warehouses only house processed data, all of the data in a data warehouse has been used for a specific purpose within the organization. This means that storage space is not wasted on data that may never be used.



Users: data scientists vs business professionals


Data lakes are often difficult for those unfamiliar with unprocessed data to navigate. Raw, unstructured data usually requires a data scientist and specialized tools to understand and translate it for any specific business use.

Alternatively, there is growing momentum behind data preparation tools that create self-service access to the information stored in data lakes.

Processed data is used in charts, spreadsheets, tables, and more, so that most, if not all, of the employees at a company can read it. Processed data, like that stored in data warehouses, only requires that the user be familiar with the topic represented.




Accessibility: flexible vs secure


Accessibility and ease of use refer to the data repository as a whole, not the data within it. Data lake architecture imposes little structure and is therefore easy to access and easy to change. Plus, any changes to the data can be made quickly, since data lakes have very few limitations.

Data warehouses are, by design, more structured. One major benefit of data warehouse architecture is that the processing and structure of data make the data itself easier to decipher. The trade-off is that the limitations of structure make data warehouses difficult and costly to manipulate.
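
As a rough illustration of this trade-off, the sketch below appends a record with a new field to a list standing in for a data lake, then tries the same record against a fixed-schema SQLite table standing in for a warehouse. The `sales` table, the `channel` field, and the `insert_sale` helper are hypothetical names chosen for the example, not part of any particular product.

```python
import sqlite3

# "Lake": an append-only list of raw records. Any shape is accepted.
lake = []
lake.append({"customer": "ana", "amount": 19.99})
# A new field ("channel") appears; the lake needs no preparation for it.
lake.append({"customer": "ben", "amount": 5.0, "channel": "mobile"})

# "Warehouse": a table whose schema is fixed up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT NOT NULL, amount REAL NOT NULL)")


def insert_sale(conn, record):
    """Insert one record, using its keys as column names.

    The keys are trusted here for brevity; real code should validate
    column names instead of formatting them into SQL.
    """
    columns = ", ".join(record)
    placeholders = ", ".join("?" for _ in record)
    conn.execute(
        f"INSERT INTO sales ({columns}) VALUES ({placeholders})",
        tuple(record.values()),
    )


insert_sale(conn, lake[0])

# The second record does not fit the schema, so the warehouse rejects it.
try:
    insert_sale(conn, lake[1])
    rejected = False
except sqlite3.OperationalError:
    rejected = True

# Accepting the new field requires an explicit schema change first.
conn.execute("ALTER TABLE sales ADD COLUMN channel TEXT")
insert_sale(conn, lake[1])

count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print("rejected before schema change:", rejected)
print("rows after schema change:", count)
```

In a production warehouse the schema change would also touch loading jobs, reports, and downstream queries, which is where most of the cost of changing structure comes from.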



Data lake vs data warehouse: which is right for me?


Organizations often need both. Data lakes were born out of the need to harness big data and benefit from the raw, granular structured and unstructured data for machine learning, but there is still a need to create data warehouses for analytics use by business users.


Healthcare: data lakes store unstructured information


Data warehouses have been used for many years in the healthcare industry, but they have never been hugely successful. Because of the unstructured nature of much of the data in healthcare (physicians' notes, clinical data, etc.) and the need for real-time insights, data warehouses are generally not an ideal model.

Data lakes allow for a combination of structured and unstructured data, which tends to be a better fit for healthcare companies.



Education: data lakes offer flexible solutions


In recent years, the value of big data in education reform has become enormously apparent. Data about student grades, attendance, and more can not only help failing students get back on track, but can actually help predict potential issues before they occur. Flexible big data solutions have also helped educational institutions streamline billing, improve fundraising, and more.

Much of this data is vast and very raw, so institutions in the education sphere often benefit most from the flexibility of data lakes.



Finance: data warehouses appeal to the masses


In finance, as well as other business settings, a data warehouse is often the best storage model because it can be structured for access by the entire company rather than a data scientist.

Big data has helped the financial services industry make big strides, and data warehouses have played a big part in those strides. The only reason a financial services company may be swayed away from such a model is that an alternative may be more cost-effective, though not as effective for other purposes.



Transportation: data lakes help make predictions


Much of the benefit of data lake insight lies in the ability to make predictions.

In the transportation industry, especially in supply chain management, the prediction capability that comes from flexible data in a data lake can have huge benefits, namely the cost-cutting benefits realized by examining data from forms within the transport pipeline.



The importance of choosing a data lake or data warehouse


The “data lake vs data warehouse” conversation has likely just begun, but the key differences in structure, process, users, and overall agility make each model unique. Depending on your company’s needs, developing the right data lake or data warehouse will be instrumental in growth.


