Mining Big Data using Weka 3

Source: http://www.cs.waikato.ac.nz/ml/weka/bigdata.html

A common misconception is that the Weka machine learning software cannot be applied to large datasets. This page has some information on how to handle big data with Weka (see also http://wiki.pentaho.com/display/DATAMINING/Handling+Large+Data+Sets+with+Weka). Note that the main problem lies in training models from large datasets, not in making predictions on them. Weka is being used to make predictions in real time in very demanding real-world applications, and this can be done with almost all Weka models once they have been built.
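
As a quick illustration of the prediction side, here is a minimal Java sketch that loads a previously trained, serialized model and classifies new instances one at a time. The file names model.model and new-data.arff are placeholders, and a nominal class attribute in the last position is assumed.

```java
import weka.classifiers.Classifier;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class PredictWithBuiltModel {
    public static void main(String[] args) throws Exception {
        // Load a classifier that was trained and serialized earlier
        // (placeholder file name).
        Classifier model = (Classifier) SerializationHelper.read("model.model");

        // Load the instances to classify; the class attribute is assumed
        // to be the last attribute (placeholder file name).
        Instances unlabeled = new DataSource("new-data.arff").getDataSet();
        unlabeled.setClassIndex(unlabeled.numAttributes() - 1);

        // Scoring one instance at a time is cheap compared to training.
        for (int i = 0; i < unlabeled.numInstances(); i++) {
            Instance inst = unlabeled.instance(i);
            double label = model.classifyInstance(inst);
            System.out.println(unlabeled.classAttribute().value((int) label));
        }
    }
}
```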

It is correct that it may be impossible to train models from large datasets using the Weka Explorer graphical user interface, even when the Java heap size has already been increased, because the Explorer always loads the entire dataset into the computer's main memory and also incurs significant overhead due to visualisation, etc.

When dealing with large datasets, it is best to either employ a command-line interface (CLI) to interact with Weka (e.g. a shell that comes with the computer's operating system or the SimpleCLI included in Weka), use Weka's Knowledge Flow graphical user interface (see http://blog.csdn.net/rav009/article/details/11486105 for an introduction), or write code directly in Java or a Java-based scripting language such as Groovy or Jython. Using these methods, it is possible to deal with larger datasets and even datasets that are too big to fit into main memory.
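
As a minimal sketch of the "write code directly in Java" option, the snippet below loads an ARFF file, trains a J48 decision tree and cross-validates it; data.arff is a placeholder file name. The same training run can be started from a shell with a command along the lines of java -Xmx4g -cp weka.jar weka.classifiers.trees.J48 -t data.arff.

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

import java.util.Random;

public class TrainFromCode {
    public static void main(String[] args) throws Exception {
        // Load the dataset (still fully in memory; see the incremental
        // approach further down for data that does not fit into RAM).
        Instances data = new DataSource("data.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Train a J48 decision tree with default settings.
        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree);

        // Quick 10-fold cross-validation as a sanity check.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```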

Most Weka classifiers require the entire dataset to be loaded into memory for training, but there are also schemes that can be trained in an incremental fashion, namely all classifiers implementing the weka.classifiers.UpdateableClassifier interface. These are limited in number; however, for Weka 3.7, there is a library that provides access to the MOA data stream software containing state-of-the-art algorithms for large datasets or data streams. Note also that non-incremental learning algorithms can be applied to large datasets by subsampling the data. Reservoir sampling is an incremental sampling method implemented in Weka that can be used for this purpose.
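
The incremental route looks roughly like the following sketch, which trains weka.classifiers.bayes.NaiveBayesUpdateable (one of the UpdateableClassifier implementations) by streaming an ARFF file one instance at a time, so the full dataset never has to be held in main memory; bigdata.arff is a placeholder file name.

```java
import weka.classifiers.bayes.NaiveBayesUpdateable;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

import java.io.File;

public class IncrementalTraining {
    public static void main(String[] args) throws Exception {
        // Read only the ARFF header; no data rows are loaded yet.
        ArffLoader loader = new ArffLoader();
        loader.setFile(new File("bigdata.arff"));
        Instances structure = loader.getStructure();
        structure.setClassIndex(structure.numAttributes() - 1);

        // Initialise the updateable classifier with the dataset structure.
        NaiveBayesUpdateable nb = new NaiveBayesUpdateable();
        nb.buildClassifier(structure);

        // Stream the instances through the classifier one at a time.
        Instance current;
        while ((current = loader.getNextInstance(structure)) != null) {
            nb.updateClassifier(current);
        }
        System.out.println(nb);
    }
}
```

For subsampling with non-incremental learners, reservoir sampling is available as the weka.filters.unsupervised.instance.ReservoirSample filter.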

Note: MOA (Massive Online Analysis) is the most popular open source framework for data stream mining. It includes a collection of machine learning algorithms (classification, regression, clustering, outlier detection, concept drift detection and recommender systems) and tools for evaluation. Related to the WEKA project, MOA is also written in Java, while scaling to more demanding problems.


Recent versions of Weka 3.7 also provide access to new packages for distributed data mining. The first new package is called distributedWekaBase. It provides base "map" and "reduce" tasks that are not tied to any specific distributed platform. The second, called distributedWekaHadoop, provides Hadoop-specific wrappers and jobs for these base tasks. In the future, there could be other wrappers, for example for Spark.
