Incremental learning with sklearn

     How can sklearn be used for online learning?

https://scikit-learn.org/stable/modules/computing.html?highlight=incremental%20learning

Strategies to scale computationally: bigger data. The page describes out-of-core learning as needing three things:

 

  1. a way to stream instances

  2. a way to extract features from instances

  3. an incremental algorithm

  1. Turn the data into a stream (a way to stream instances): see "Incremental learning with sklearn" https://blog.csdn.net/whiterbear/article/details/53120004

    See also: "Large-scale machine learning with sklearn" http://wulc.me/2017/08/08/%E9%80%9A%E8%BF%87%20sklearn%20%E8%BF%9B%E8%A1%8C%E5%A4%A7%E8%A7%84%E6%A8%A1%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0/

    See also: https://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html

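import numpy as np


def iter_minibatches(data_stream, minibatch_size=1000):
    '''
    Generator.
    Given a file path (e.g. a large file), yield minibatch_size rows at a time (1,000 by default).
    Each batch is converted to numpy arrays and yielded as X, y.
    Assumed format: one sample per line, whitespace-separated numbers, label first.
    '''
    X = []
    y = []
    with open(data_stream, 'r') as f:  # text mode, so fields parse as str rather than bytes
        for line in f:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            y.append(float(fields[0]))
            X.append([float(v) for v in fields[1:]])  # convert the features to float

            if len(y) >= minibatch_size:
                yield np.array(X), np.array(y)  # convert to numpy arrays and hand back the batch
                X, y = [], []
    if y:
        yield np.array(X), np.array(y)  # final, possibly smaller, batch

filename = "t_new"

minibatch_test_iterators = iter_minibatches(filename, minibatch_size=10)
X_test, y_test = next(minibatch_test_iterators)  # take the first batch

for x, y in minibatch_test_iterators:
    print(x, y)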

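  2. Feature extraction (a way to extract features from instances)

   1. Offline feature extraction: extract the features with an offline method.

   2. Online feature monitoring: watch the features continuously, removing ones that have become invalid and adding newly emerging ones.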
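For text data, one way to keep feature extraction compatible with streaming is a stateless extractor. The sketch below uses sklearn's HashingVectorizer, which needs no fitting pass over the data, so every mini-batch maps into the same fixed-width feature space. The sample documents and the n_features value are made up for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Stateless: no vocabulary is learned, so no pass over the full data set is needed.
# n_features=2**10 is an arbitrary choice for this sketch.
vectorizer = HashingVectorizer(n_features=2**10, alternate_sign=False)

# Two hypothetical mini-batches of raw text arriving at different times.
batch_1 = ["cheap pills online", "meeting moved to friday"]
batch_2 = ["brand new word zyzzyva appears"]

# transform() can be called directly; there is nothing to fit.
X1 = vectorizer.transform(batch_1)
X2 = vectorizer.transform(batch_2)

print(X1.shape, X2.shape)
```

Because tokens are hashed into columns, a word never seen before still lands in an existing column, which is one cheap way to absorb newly emerging features. The trade-off is that hash collisions are possible and the columns cannot be mapped back to words.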

  

  3. Incremental learning algorithm (an incremental algorithm):

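   1. sklearn does not implement FTRL, so online-learning algorithms such as FTRL or FTRL_FM have to be implemented in Python. Among deep learning frameworks, tensorflow ships an FTRL optimizer; pytorch's torch.optim does not appear to include one, so there it takes a custom or third-party optimizer.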
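As a rough idea of what a hand-written FTRL looks like, here is a minimal dense sketch of the per-coordinate FTRL-Proximal update for binary logistic regression. The class name FTRLProximal, the hyperparameter defaults, and the synthetic data are my own choices for illustration and not part of sklearn or any other library; a production version would normally work on sparse features.

```python
import numpy as np


class FTRLProximal:
    """Minimal dense FTRL-Proximal for binary logistic regression (labels 0/1)."""

    def __init__(self, n_features, alpha=0.1, beta=1.0, l1=0.0, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = np.zeros(n_features)  # accumulated adjusted gradients
        self.n = np.zeros(n_features)  # accumulated squared gradients

    def weights(self):
        # Closed-form per-coordinate solution; |z| <= l1 gives exactly zero.
        shrunk = np.sign(self.z) * np.maximum(np.abs(self.z) - self.l1, 0.0)
        return -shrunk / ((self.beta + np.sqrt(self.n)) / self.alpha + self.l2)

    def predict_proba(self, x):
        # Works for a single example (1-D) or a batch (2-D).
        return 1.0 / (1.0 + np.exp(-np.clip(x @ self.weights(), -35.0, 35.0)))

    def update(self, x, y):
        """One online step on a single example x (1-D array) with label y."""
        w = self.weights()
        p = 1.0 / (1.0 + np.exp(-np.clip(x @ w, -35.0, 35.0)))
        g = (p - y) * x  # gradient of the log loss
        sigma = (np.sqrt(self.n + g * g) - np.sqrt(self.n)) / self.alpha
        self.z += g - sigma * w
        self.n += g * g
        return p


# Synthetic stream: the label depends on the first two features only.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0, 0.0])
X = rng.normal(size=(2000, 3))
y = (X @ true_w > 0).astype(float)

model = FTRLProximal(n_features=3, l1=0.01)
for xi, yi in zip(X, y):  # one example at a time, as in online learning
    model.update(xi, yi)

acc = np.mean((model.predict_proba(X) > 0.5) == y)
print(model.weights(), acc)
```

The model only keeps the two accumulators z and n; the weights are recomputed from them on demand, which is what lets the L1 term produce exact zeros.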

   2. Use sklearn's built-in learning algorithms: any sklearn model that implements the partial_fit method can learn incrementally (all estimators implementing the partial_fit API are candidates)

  • Classification

    • sklearn.naive_bayes.MultinomialNB

    • sklearn.naive_bayes.BernoulliNB

    • sklearn.linear_model.Perceptron

    • sklearn.linear_model.SGDClassifier

    • sklearn.linear_model.PassiveAggressiveClassifier

    • sklearn.neural_network.MLPClassifier

  • Regression

    • sklearn.linear_model.SGDRegressor

    • sklearn.linear_model.PassiveAggressiveRegressor

    • sklearn.neural_network.MLPRegressor

  • Clustering

    • sklearn.cluster.MiniBatchKMeans

    • sklearn.cluster.Birch

  • Decomposition / Feature Extraction

    • sklearn.decomposition.MiniBatchDictionaryLearning

    • sklearn.decomposition.IncrementalPCA

    • sklearn.decomposition.LatentDirichletAllocation

  • Preprocessing

    • sklearn.preprocessing.StandardScaler

    • sklearn.preprocessing.MinMaxScaler

    • sklearn.preprocessing.MaxAbsScaler
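To show how the partial_fit estimators listed above fit together, here is a small sketch that streams mini-batches through StandardScaler and SGDClassifier. The synthetic_batches generator is a hypothetical stand-in for a real data stream, and the batch sizes and labelling rule are arbitrary.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler


def synthetic_batches(n_batches=20, batch_size=100, seed=0):
    """Stand-in for a real data stream: yields (X, y) mini-batches."""
    rng = np.random.default_rng(seed)
    for _ in range(n_batches):
        X = rng.normal(loc=5.0, scale=2.0, size=(batch_size, 4))
        y = (X[:, 0] - X[:, 1] > 0).astype(int)
        yield X, y


batches = synthetic_batches()
X_test, y_test = next(batches)  # hold out the first batch for evaluation

scaler = StandardScaler()
clf = SGDClassifier(random_state=0)
all_classes = np.array([0, 1])  # partial_fit needs every class label up front

for X, y in batches:
    scaler.partial_fit(X)  # update the running mean and variance
    clf.partial_fit(scaler.transform(X), y, classes=all_classes)

accuracy = clf.score(scaler.transform(X_test), y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

For classifiers, the classes argument is required on the first partial_fit call because a single mini-batch may not contain every label. Note that the scaler's statistics keep changing while the stream runs, so early batches are scaled slightly differently from later ones.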
