A common misconception is that the Weka machine learning software cannot be applied to large datasets. This page has some information on how to handle big data with Weka. Note that the main problem lies in training models from large datasets, not in making predictions for them: Weka is used to make predictions in real time in very demanding real-world applications, and this can be done with almost all Weka models once they have been built.
It is true that it may be impossible to train models on large datasets using the Weka Explorer graphical user interface, even when the Java heap size has been increased, because the Explorer always loads the entire dataset into the computer's main memory and also incurs significant overhead for visualisation and related features.
When dealing with large datasets, it is best to use a command-line interface (CLI) to interact with Weka (e.g. a shell that comes with the computer's operating system, or the SimpleCLI included in Weka), to use Weka's Knowledge Flow graphical user interface (see also http://blog.csdn.net/rav009/article/details/11486105), or to write code directly in Java or in a Java-based scripting language such as Groovy or Jython. With these methods it is possible to deal with larger datasets, and even with datasets that are too big to fit into main memory.
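As a minimal sketch of the last of these options, the following Java snippet loads a dataset with Weka's converter utilities and trains a classifier programmatically, avoiding the Explorer's visualisation overhead. The file path, the choice of J48, and the assumption that the class is the last attribute are illustrative only.

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TrainFromCode {
        public static void main(String[] args) throws Exception {
            // Load the dataset via Weka's converter utilities.
            // (This still keeps the data in memory, but without any GUI overhead.)
            DataSource source = new DataSource("data/train.arff"); // hypothetical path
            Instances data = source.getDataSet();
            data.setClassIndex(data.numAttributes() - 1);          // assume class is the last attribute

            // Train a decision tree from code, using the default options.
            J48 tree = new J48();
            tree.buildClassifier(data);
            System.out.println(tree);
        }
    }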
Most Weka classifiers require the entire dataset to be loaded into memory for training, but there are also schemes that can be trained in an incremental fashion, namely all classifiers implementing the weka.classifiers.UpdateableClassifier interface. These are limited in number; however, for Weka 3.7 there is a library that provides access to the MOA data stream software, which contains state-of-the-art algorithms for large datasets and data streams. MOA (Massive Online Analysis) is the most popular open-source framework for data stream mining: it includes a collection of machine learning algorithms (classification, regression, clustering, outlier detection, concept drift detection and recommender systems) as well as tools for evaluation. Related to the WEKA project, MOA is also written in Java, while scaling to more demanding problems.
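As a sketch of the incremental route, the snippet below reads only the ARFF header up front and then streams instances through a NaiveBayesUpdateable model (one of the classifiers implementing weka.classifiers.UpdateableClassifier), so the full dataset never has to fit into main memory. The file path and the class attribute being last are assumptions.

    import java.io.File;

    import weka.classifiers.bayes.NaiveBayesUpdateable;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ArffLoader;

    public class IncrementalTraining {
        public static void main(String[] args) throws Exception {
            // Read only the header (attribute definitions), not the data itself.
            ArffLoader loader = new ArffLoader();
            loader.setFile(new File("data/huge.arff"));             // hypothetical path
            Instances structure = loader.getStructure();
            structure.setClassIndex(structure.numAttributes() - 1); // assume class is the last attribute

            // Initialise the updateable classifier with the header only ...
            NaiveBayesUpdateable nb = new NaiveBayesUpdateable();
            nb.buildClassifier(structure);

            // ... then feed it one instance at a time.
            Instance current;
            while ((current = loader.getNextInstance(structure)) != null) {
                nb.updateClassifier(current);
            }
            System.out.println(nb);
        }
    }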
Note also that non-incremental learning algorithms can be applied to large datasets by subsampling the data. Reservoir sampling is an incremental sampling method implemented in Weka that can be used for this purpose.
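A sketch of this subsampling approach, assuming Weka's ReservoirSample filter and its sample-size option: instances are pushed through the filter one at a time and, once the stream has been consumed, the filter yields a fixed-size uniform sample on which any batch learner can then be trained. The file path and sample size are placeholders.

    import java.io.File;

    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ArffLoader;
    import weka.filters.unsupervised.instance.ReservoirSample;

    public class SubsampleLargeData {
        public static void main(String[] args) throws Exception {
            // Stream the data from disk instead of loading it in one go.
            ArffLoader loader = new ArffLoader();
            loader.setFile(new File("data/huge.arff"));      // hypothetical path
            Instances structure = loader.getStructure();

            // Keep a uniform random sample of at most 10,000 instances.
            ReservoirSample reservoir = new ReservoirSample();
            reservoir.setSampleSize(10000);                  // illustrative size
            reservoir.setInputFormat(structure);

            Instance current;
            while ((current = loader.getNextInstance(structure)) != null) {
                reservoir.input(current);                    // the filter decides which instances to keep
            }
            reservoir.batchFinished();                       // the sample is now available

            // Collect the sampled instances; a non-incremental learner can be trained on them.
            Instances sample = reservoir.getOutputFormat();
            Instance sampled;
            while ((sampled = reservoir.output()) != null) {
                sample.add(sampled);
            }
            System.out.println("Kept " + sample.numInstances() + " of the streamed instances");
        }
    }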
Recent versions of Weka 3.7 also provide access to new packages for distributed data mining. The first new package is called distributedWekaBase. It provides base "map" and "reduce" tasks that are not tied to any specific distributed platform. The second, called distributedWekaHadoop, provides Hadoop-specific wrappers and jobs for these base tasks. In the future, there could be other wrappers - for example for Spark.