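1) Data scale. The most obvious difference between the "pond" and the "ocean" is scale. A "pond" is relatively small: even one formerly regarded as large, such as a VLDB (very large database), is still small next to the "ocean" of an XLDB (extremely large database). A "pond" typically handles data in units of MB, whereas the "ocean" often works in units of GB, or even TB and PB.
5) Processing tools. To catch the "fish" in a "pond", one kind of net, or a handful of kinds, is basically enough; this is the so-called "one size fits all" approach. In the "ocean", however, no single net can catch every kind of fish; in other words, "no size fits all".
Advances in information technology have created the conditions for generating and processing data: cloud computing; advances in networking, storage infrastructure, databases, and related technologies; the Internet of Things; RFID technology; video surveillance.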
Structured data will continue to be analyzed in an enterprise using structured access methods like Structured Query Language (SQL). However, the big data systems provide tools and structures for analyzing unstructured data.
New sources of data that contribute to the unstructured data are sensors, web logs, human-generated interaction data like click streams, tweets, Facebook chats, mobile text messages, e-mails, and so forth. The presence of this hybrid mix of data makes big data analysis complex, as decisions need to be made regarding whether all this data should be first merged and then analyzed or whether only an aggregated view from different sources has to be compared.
Unstructured data is analyzed using methods like natural language processing (NLP), data mining, master data management (MDM), and statistics.
Text analytics uses NoSQL databases to standardize the structure of the data so that it can be analyzed using query languages like Pig, Hive, and others.
The analysis and extraction processes take advantage of techniques that originated in linguistics, statistics, and numerical analysis.
Bloom Filter
parallel computing
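The notes only name the Bloom filter, so the following is a minimal sketch of the idea rather than a reference implementation: a fixed-size bit array plus several hash positions per item, which answers "possibly present" or "definitely absent" without storing the items themselves. The class name, the default sizes, and the use of SHA-256 with double hashing are illustrative assumptions, not something the source specifies.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter; sizes and hashing scheme are assumptions."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One bit per slot, packed into bytes.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True means possibly present.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
seen.add("user-42")
print(seen.might_contain("user-42"))  # an added item is always reported
```

An added item is never reported as absent, but an item that was never added can occasionally be reported as present; that false-positive rate is the price paid for the small, fixed memory footprint.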
Understand and prioritize the data from the garbage that is coming into the enterprise. Ninety percent of all the data is noise, and it is a daunting task to classify and filter the knowledge from the noise.
In the search for inexpensive methods of analysis, organizations have to compromise and balance against the confidentiality requirements of the data. (security)
Organizations struggle to determine how long this data has to be retained. This is a tricky question, as some data is useful for making long-term decisions, while other data is not relevant even a few hours after it has been generated and analyzed and insight has been obtained.
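One way to make the retention question concrete is to attach a retention period to each category of data and check records against it. The sketch below is only an illustration: the category names, the periods, and the helper is_expired are hypothetical and would have to come from an organization's own policy.

```python
from datetime import datetime, timedelta

# Hypothetical retention periods per data category (assumed values).
RETENTION = {
    "clickstream": timedelta(hours=6),          # short-lived operational data
    "sales_summary": timedelta(days=365 * 5),   # kept for long-term decisions
}


def is_expired(category: str, created_at: datetime, now: datetime) -> bool:
    """Return True when a record has outlived its category's retention period."""
    return now - created_at > RETENTION[category]


now = datetime(2024, 1, 2, 0, 0)
created = datetime(2024, 1, 1, 12, 0)  # record is 12 hours old
print(is_expired("clickstream", created, now))
print(is_expired("sales_summary", created, now))
```

Keeping the periods in one table makes the policy easy to review and change without touching the code that applies it.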
Availability of skills is a big challenge for CIOs. A higher level of proficiency in the data sciences is required to implement big data solutions today because the tools are not user-friendly yet. They still require computer science graduates to configure and operationalize a big data system.
Infrastructure as a Service (IaaS)
This includes the storage, servers, and network as the base, inexpensive commodities of the big data stack. This stack can be bare metal or virtual (cloud). The distributed file systems are part of this layer.
Platform as a Service (PaaS)
The NoSQL data stores and distributed caches that can be logically queried using query languages form the platform layer of big data. This layer provides the logical model for the raw, unstructured data stored in the files.
Data as a Service (DaaS)
The entire array of tools available for integrating with the PaaS layer using search engines, integration adapters, batch programs, and so on is housed in this layer. The APIs available at this layer can be consumed by all endpoint systems in an elastic-computing mode.
Big Data Business Functions as a Service (BFaaS)
Specific industries—like health, retail, ecommerce, energy, and banking—can build packaged applications that serve a specific business need and leverage the DaaS layer for cross-cutting data functions.