Routine disclaimer: The papers discussed here come from Google Scholar or other overseas paid paper repositories. As the blogger, I only read the papers, translate them, and share the knowledge; if there is any infringement, please contact me for removal, thank you. I also hope to learn together with everyone: if you have good papers, feel free to recommend them to me and I will translate and post them. You are also welcome to follow my paper-reading column at https://blog.csdn.net/column/details/23027.html
Data Mining with Big Data
Authors: Xindong Wu, Xingquan Zhu, Gong-Qing Wu, Wei Ding
This survey paper is 26 pages long and is organized into six sections:
1. Introduction
2. Big Data Characteristics: HACE Theorem
3. Data Mining Challenges with Big Data (in my view, the section most worth a close read)
4. Research Initiatives and Projects
5. Related Work
6. Conclusion
Abstract: Big Data concerns large-volume, complex, growing data sets with multiple, autonomous sources. With the fast development of networking, data storage, and the data collection capacity, Big Data is now rapidly expanding in all science and engineering domains, including physical, biological and bio-medical sciences. This article presents a HACE theorem that characterizes the features of the Big Data revolution, and proposes a Big Data processing model, from the data mining perspective. This data-driven model involves demand-driven aggregation of information sources, mining and analysis, user interest modeling, and security and privacy considerations. We analyze the challenging issues in the data-driven model and also in the Big Data revolution.
1. Introduction
I won't quote this section; as in most surveys, it roughly makes the following points:
1. Uses the example of Mo Yan to show how much attention Big Data is currently drawing internationally.
2. Uses examples to show that the volume of Big Data keeps growing larger and larger.
3. Argues, with examples, that as data volume keeps growing, extracting the useful data becomes critical, which leads into the concept of data mining.
2. Big Data Characteristics: HACE Theorem
HACE Theorem: Big Data starts with large-volume, heterogeneous, autonomous sources with distributed and decentralized control, and seeks to explore complex and evolving relationships among data.
Exploring the Big Data in this scenario is equivalent to aggregating heterogeneous information from different sources (blind men) to help draw the best possible picture to reveal the genuine gesture of the elephant in a real-time fashion. Indeed, this task is not as simple as asking each blind man to describe his feelings about the elephant and then getting an expert to draw one single picture with a combined view, given that each individual may speak a different language (heterogeneous and diverse information sources) and they may even have privacy concerns about the messages they deliberate in the information exchange process.
(The paper's retelling of the blind-men-and-an-elephant story is omitted here.)
2.1 Huge Data with Heterogeneous and Diverse Dimensionality
One of the fundamental characteristics of the Big Data is the huge volume of data represented by heterogeneous and diverse dimensionalities. This is because different information collectors use their own schemata for data recording, and the nature of different applications also results in diverse representations of the data.
For a DNA or genomic related test, microarray expression images and sequences are used to represent the genetic code information because this is the way that our current techniques acquire the data. Under such circumstances, the heterogeneous features refer to the different types of representations for the same individuals, and the diverse features refer to the variety of the features involved to represent each single observation.
(Another example from the biological sciences is omitted here.)
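To make the distinction concrete, here is a toy sketch (the field names and values are my own illustration, not the paper's) in which two collectors record the same patient under their own schemata, so the heterogeneous representations have to be reconciled before mining:

```python
# Two collectors record the same individual under their own schemata:
# heterogeneous features = different representation types for the same individual;
# diverse features = many kinds of features within a single observation.
clinic_record = {"patient_id": 17, "age": 54, "blood_pressure": (120, 80)}
genomics_record = {
    "patient_id": 17,
    "microarray_image": "scan_017.png",  # image-based representation
    "sequence": "ATCGGTAC",              # sequence-based representation
}

# Mining across sources first requires reconciling the two schemata,
# e.g., joining on the shared identifier:
merged = {**clinic_record, **genomics_record}
print(merged)
```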
2.2 Autonomous Sources with Distributed and Decentralized Control
Autonomous data sources with distributed and decentralized controls are a main characteristic of Big Data applications. Being autonomous, each data source is able to generate and collect information without involving (or relying on) any centralized control. This is similar to the World Wide Web (WWW) setting, where each web server provides a certain amount of information and each server is able to fully function without necessarily relying on other servers. On the other hand, the enormous volume of the data also makes an application vulnerable to attacks or malfunctions if the whole system has to rely on any centralized control unit. For major Big Data related applications, such as Google, Flickr, Facebook, and Walmart, a large number of server farms are deployed all over the world to ensure nonstop services and quick responses for local markets. Such autonomous sources are not only the result of technical design decisions, but also a consequence of the legislation and regulation rules in different countries/regions. For example, Walmart's Asian markets are inherently different from its North American markets in terms of seasonal promotions, top-selling items, and customer behaviors. More specifically, local government regulations also impact the wholesale management process and eventually result in distinct data representations and data warehouses for local markets.
2.3 Complex and Evolving Relationships
While the volume of the Big Data increases, so do the complexity and the relationships underneath the data. In the early stages of data-centralized information systems, the focus is on finding the best feature values to represent each observation. This is similar to using a number of data fields, such as age, gender, income, education background, etc., to characterize each individual. This type of sample-feature representation inherently treats each individual as an independent entity without considering their social connections, which are among the most important factors of human society. People form friend circles based on their common hobbies or on connections through biological relationships. Such social connections not only commonly exist in our daily activities, but are also very popular in virtual worlds. For example, major social network sites, such as Facebook or Twitter, are mainly characterized by social functions such as friend-connections and followers (in Twitter). The correlations between individuals inherently complicate the whole data representation and any reasoning process. In the sample-feature representation, individuals are regarded as similar if they share similar feature values, whereas in the sample-feature-relationship representation, two individuals can be linked together (through their social connections) even though they might share nothing in common in the feature domains at all. In a dynamic world, the features used to represent the individuals and the social ties used to represent our connections may also evolve with respect to temporal, spatial, and other factors. Such a complication is becoming part of the reality for Big Data applications, where the key is to take the complex (non-linear, many-to-many) data relationships, along with the evolving changes, into consideration, to discover useful patterns from Big Data collections.
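The contrast between the two representations can be sketched in a few lines of Python (the names, feature vectors, and social edge below are my own toy illustration): under sample-feature similarity, Alice and Bob look unrelated, yet the sample-feature-relationship view still links them directly:

```python
import math

# Sample-feature view: similarity comes from shared feature values.
features = {"alice": [0.9, 0.1, 0.0], "bob": [0.0, 0.2, 0.9]}

def norm(u):
    return math.sqrt(sum(a * a for a in u))

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

# Sample-feature-relationship view: an explicit social edge can link two
# individuals even when their feature vectors share almost nothing.
edges = {("alice", "bob")}

print(cosine(features["alice"], features["bob"]))  # low feature similarity
print(("alice", "bob") in edges)                   # yet directly connected
```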
3. Data Mining Challenges with Big Data
For an intelligent learning database system (Wu 2000) to handle Big Data, the essential key is to scale up to the exceptionally large volume of data and provide treatments for the characteristics featured by the aforementioned HACE theorem. Figure 2 shows a conceptual view of the Big Data processing framework, which includes three tiers from inside out with considerations on data accessing and computing (Tier I), data privacy and domain knowledge (Tier II), and Big Data mining algorithms (Tier III).
The challenges at Tier I focus on data accessing and actual computing procedures. Because Big Data are often stored at different locations and data volumes may continuously grow, an effective computing platform will have to take distributed large-scale data storage into consideration for computing. For example, while typical data mining algorithms require all data to be loaded into the main memory, this is becoming a clear technical barrier for Big Data because moving data across different locations is expensive (e.g., subject to intensive network communication and other IO costs), even if we do have a super large main memory to hold all data for computing.
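To make the memory barrier concrete, here is a minimal out-of-core sketch: it computes a column mean by streaming over a hypothetical local extract `data.csv` (the file name and column layout are my assumption) one row at a time, instead of loading the whole data set into main memory:

```python
import csv

def streaming_mean(path, column):
    # Running aggregates: only one row is ever held in memory at a time.
    total, count = 0.0, 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total += float(row[column])
            count += 1
    return total / count if count else float("nan")

# Hypothetical usage; a "data.csv" file with an "amount" column is assumed.
# print(streaming_mean("data.csv", "amount"))
```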
The challenges at Tier II center around semantics and domain knowledge for different Big Data applications. Such information can provide additional benefits to the mining process, as well as add technical barriers to the Big Data access (Tier I) and mining algorithms (Tier III).
At Tier III, the data mining challenges concentrate on algorithm designs in tackling the difficulties raised by the Big Data volumes, distributed data distributions, and complex and dynamic data characteristics. The circle at Tier III contains three stages. Firstly, sparse, heterogeneous, uncertain, incomplete, and multi-source data are preprocessed by data fusion techniques. Secondly, complex and dynamic data are mined after pre-processing. Thirdly, the global knowledge obtained by local learning and model fusion is tested, and relevant information is fed back to the pre-processing stage. The model and parameters are then adjusted according to the feedback. In the whole process, information sharing is not only a guarantee for the smooth development of each stage, but also a goal of Big Data processing.
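The three-stage circle and its feedback loop can be summarized as a schematic skeleton; every function body below is a placeholder of my own, not the paper's algorithm:

```python
def fuse_sources(raw_sources):      # stage 1: data fusion / pre-processing
    return raw_sources

def mine(data, params):             # stage 2: mining complex and dynamic data
    return {"patterns": [], "params": dict(params)}

def evaluate(model):                # stage 3: test the fused global knowledge
    return 0.5                      # dummy feedback score

params = {"threshold": 0.1}
raw_sources = []                    # placeholder multi-source input
for _ in range(3):
    data = fuse_sources(raw_sources)
    model = mine(data, params)
    feedback = evaluate(model)
    # Relevant information is fed back: adjust parameters before the next pass.
    if feedback < 0.7:
        params["threshold"] *= 0.5
```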
3.1 Tier I: Big Data Mining Platform
In typical data mining systems, the mining procedures require computationally intensive computing units for data analysis and comparisons. A computing platform is therefore needed to have efficient access to, at least, two types of resources: data and computing processors. For small-scale data mining tasks, a single desktop computer, which contains hard disk and CPU processors, is sufficient to fulfill the data mining goals. Indeed, many data mining algorithms are designed for this type of problem setting. For medium-scale data mining tasks, data are typically large (and possibly distributed) and cannot fit into main memory. Common solutions are to rely on parallel computing (Shafer et al. 1996; Luo et al. 2012) or collective mining (Chen et al. 2004) to sample and aggregate data from different sources and then use parallel computing programming (such as the Message Passing Interface) to carry out the mining process.
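As a rough illustration of the sample-and-aggregate idea (my own sketch, using Python's multiprocessing in place of the Message Passing Interface), each worker mines item frequencies in its own partition and the local results are then fused into a global view:

```python
from collections import Counter
from multiprocessing import Pool

def mine_partition(transactions):
    # Local mining: count item frequencies within one data partition.
    counts = Counter()
    for transaction in transactions:
        counts.update(transaction)
    return counts

if __name__ == "__main__":
    # Hypothetical partitions standing in for distributed data sources.
    partitions = [
        [("milk", "bread"), ("milk",)],
        [("bread", "eggs"), ("milk", "eggs")],
    ]
    with Pool() as pool:
        local_counts = pool.map(mine_partition, partitions)
    # Aggregate the local results, as in collective mining.
    global_counts = sum(local_counts, Counter())
    print(global_counts.most_common(3))
```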
Such a Big Data system, which blends both hardware and software components, is hardly available without key industrial stakeholders' support. In fact, for decades, companies have been making business decisions based on transactional data stored in relational databases. Big Data mining offers opportunities to go beyond relational databases and rely on less structured data: weblogs, social media, email, sensors, and photographs that can be mined for useful information. Major business intelligence companies, such as IBM, Oracle, and Teradata, have all featured their own products to help customers acquire and organize these diverse data sources and coordinate with customers' existing data to find new insights and capitalize on hidden relationships.
3.2 Tier II: Big Data Semantics and Application Knowledge
(Below, I skip the background and motivation and quote only the problems and the proposed solutions.)
3.2.1 Information Sharing and Data Privacy
To protect privacy, two common approaches are to (1) restrict access to the data, such as adding certification or access control to the data entries, so sensitive information is accessible by a limited group of users only, and (2) anonymize data fields such that sensitive information cannot be pinpointed to an individual record (Cormode and Srivastava 2009). For the first approach, the common challenge is to design secured certification or access control mechanisms such that sensitive information cannot be mishandled by unauthorized individuals. For data anonymization, the main objective is to inject randomness into the data to ensure a number of privacy goals.
Common anonymization approaches are to use suppression, generalization, perturbation, and permutation to generate an altered version of the data, which is, in fact, some uncertain data.
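A toy sketch of those four operations on hypothetical records (the fields, coarsening rules, and noise scale are my own choices, not a production scheme):

```python
import random

def anonymize(records, rng=random.Random(0)):
    # Each record is a hypothetical (name, age, zip_code, income) tuple.
    out = []
    for name, age, zip_code, income in records:
        decade = (age // 10) * 10
        out.append((
            "*",                            # suppression: drop the identifier
            f"{decade}-{decade + 9}",       # generalization: age -> decade range
            zip_code[:3] + "**",            # generalization: coarsen the zip code
            income + rng.gauss(0, 1000),    # perturbation: add random noise
        ))
    rng.shuffle(out)                        # permutation: break the record order
    return out

print(anonymize([("Alice", 34, "10027", 52000.0), ("Bob", 29, "94305", 61000.0)]))
```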
3.3 Tier III: Big Data Mining Algorithms
3.3.1 Local Learning and Model Fusion for Multiple Information Sources
As Big Data applications are featured with autonomous sources and decentralized controls, aggregating distributed data sources to a centralized site for mining is systematically prohibitive due to the potential transmission cost and privacy concerns.
At the data level, each local site can calculate the data statistics based on the local data sources and exchange the statistics between sites to achieve a global data distribution view. At the model or pattern level, each site can carry out local mining activities, with respect to the localized data, to discover local patterns. By exchanging patterns between multiple sources, new global patterns can be synthesized by aggregating patterns across all sites (Wu and Zhang 2003). At the knowledge level, model correlation analysis investigates the relevance between models generated from different data sources to determine how closely the data sources are correlated with each other, and how to form accurate decisions based on models built from autonomous sources.
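A minimal sketch of the data-level exchange (my own illustration): each site shares only summary statistics, never its raw records, yet the per-site summaries combine exactly into a global mean and variance:

```python
def local_stats(values):
    # Statistics a site can share without exposing raw records.
    n = len(values)
    s = sum(values)
    ss = sum(v * v for v in values)
    return n, s, ss

def fuse(stats):
    # Combine per-site statistics into a global mean and variance.
    n = sum(st[0] for st in stats)
    s = sum(st[1] for st in stats)
    ss = sum(st[2] for st in stats)
    mean = s / n
    return mean, ss / n - mean ** 2

site_a = [1.0, 2.0, 3.0]  # hypothetical local data at site A
site_b = [10.0, 12.0]     # hypothetical local data at site B
print(fuse([local_stats(site_a), local_stats(site_b)]))
```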
3.3.2 Mining from Sparse, Uncertain, and Incomplete Data
For most machine learning and data mining algorithms, high-dimensional sparse data significantly increase the difficulty, and reduce the reliability, of the models derived from the data. Common approaches are to employ dimension reduction or feature selection (Wu et al. 2012) to reduce the data dimensions, or to carefully include additional samples to alleviate the data sparsity, as in generic unsupervised learning methods in data mining.
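As a rough illustration of feature selection (a simple variance ranking of my own, standing in for the methods cited above), near-constant sparse columns carry little signal and are dropped before modeling:

```python
def variance(col):
    mean = sum(col) / len(col)
    return sum((v - mean) ** 2 for v in col) / len(col)

def select_features(rows, k):
    # Keep the k columns with the highest variance.
    cols = list(zip(*rows))
    ranked = sorted(range(len(cols)), key=lambda j: variance(cols[j]), reverse=True)
    keep = sorted(ranked[:k])
    return [[row[j] for j in keep] for row in rows], keep

# Hypothetical sparse data: the middle column is almost always zero.
rows = [[1.0, 0.0, 5.0], [2.0, 0.0, 1.0], [3.0, 0.0, 4.0], [4.0, 0.1, 2.0]]
reduced, kept = select_features(rows, 2)
print(kept, reduced)  # column 1 is dropped
```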
For uncertain data, the major challenge is that each data item is represented as some sample distributions but not as a single value, so most existing data mining algorithms cannot be directly applied. Common solutions are to take the data distributions into consideration to estimate model parameters.
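A minimal sketch of that idea (my own illustration): each uncertain item is a distribution rather than a point, and the distance between two items is estimated in expectation by Monte Carlo sampling:

```python
import random

rng = random.Random(0)

# Uncertain items: each observation is a (mean, std) normal distribution,
# not a single value.
item_a = (0.0, 1.0)
item_b = (3.0, 0.5)

def expected_distance(a, b, n=10000):
    # Monte Carlo estimate of E[|X - Y|] with X ~ N(a) and Y ~ N(b).
    total = 0.0
    for _ in range(n):
        total += abs(rng.gauss(*a) - rng.gauss(*b))
    return total / n

print(expected_distance(item_a, item_b))
```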
3.3.3Mining Complex and Dynamic Data(复杂动态的数据挖掘)
Complex heterogeneous data types: Currently, there is no acknowledged effective and efficient data model to handle Big Data.
Complex relationship networks in data: To deal with complex relationship networks, emerging research efforts have begun to address the issues of structure-and-evolution, crowds-and-interaction, and information-and-communication.
The emergence of Big Data has also spawned new computer architectures for real-time data-intensive processing, such as the open source project Apache Hadoop, which runs on high-performance clusters.
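To show the programming pattern behind such architectures, here is a single-process sketch of the map-shuffle-reduce flow in plain Python (an illustration of the pattern only, not Hadoop's actual API):

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit (word, 1) pairs from one line of text.
    return [(word, 1) for word in line.split()]

def reduce_phase(key, values):
    # Reduce: combine all counts for one key.
    return key, sum(values)

lines = ["big data mining", "data mining with big data"]  # hypothetical input
shuffled = defaultdict(list)
for line in lines:
    for key, value in map_phase(line):  # map
        shuffled[key].append(value)     # shuffle: group values by key
print(dict(reduce_phase(k, v) for k, v in shuffled.items()))
```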
In the context of Big Data, real-time processing for complex data is a very challenging task.
(In other words, mining data with complex relationships remains an open problem!)
I won't quote Section 4 and beyond; those parts mention some well-known Big Data projects and research initiatives at major companies from recent years, and there is not much else. I hope this rambling write-up is useful to everyone.