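When solving real machine learning problems, we sometimes run into "big data": millions of records with thousands or tens of thousands of features, at which point the data on disk reaches the 10 GB range.
For a text classification problem you also have to extract text features, and loading all of the data into memory at that point takes far too much RAM. How can this be handled? 1. Reduce the dimensionality of the data? 2. Use streaming or stream-like processing? 3. Move to a bigger, high-memory machine, or use a Spark cluster.
This article introduces an incremental learning algorithm, PassiveAggressiveClassifier.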
Processing flow:
For classification problems, the first call to partial_fit must specify the full set of class labels through the classes parameter.
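As a minimal sketch of that requirement, the snippet below feeds synthetic data to partial_fit in batches of 100 rows. The toy data, the batch size, and the names all_classes and holdout_accuracy are assumptions made for illustration; they are not taken from the original project.

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

rng = np.random.RandomState(0)
# Synthetic toy data (an assumption): the label is 1 when the first feature is positive
X = rng.randn(600, 5)
y = (X[:, 0] > 0).astype(int)

# Every label that can ever appear must be known before the first partial_fit call
all_classes = np.array([0, 1])
clf = PassiveAggressiveClassifier(C=1.0, random_state=0)

# Train on the first 500 rows, 100 rows at a time
for start in range(0, 500, 100):
    X_batch = X[start:start + 100]
    y_batch = y[start:start + 100]
    # classes is required on the first call; passing the same array on later calls is allowed
    clf.partial_fit(X_batch, y_batch, classes=all_classes)

# Evaluate on the 100 rows that were never used for training
holdout_accuracy = clf.score(X[500:], y[500:])
print("hold-out accuracy: %.3f" % holdout_accuracy)
```

Passing classes on every call, as the training code in this article does, keeps the loop body uniform; only the first call strictly needs it.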
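import numpy as np
import pandas as pd
from sklearn.utils import shuffle


def iter_minibatches(filename, minibatch_size=1000):
    '''
    Generator.
    Given a file (for example one large CSV), yield minibatch_size rows at a time
    (1,000 by default), converted to NumPy arrays and returned as X, y.
    '''
    x = []
    y = []
    cur_line_num = 0
    # read_csv loads the whole file into memory here; only the batching below is incremental
    with open(filename, 'rb') as csvfile:
        reader = pd.read_csv(csvfile
                             # , encoding='gb18030'
                             )
    # Tokenize the product names (sjcl is the author's own helper, defined elsewhere)
    reader['HWMC'] = sjcl(list(reader['HWMC'].astype(str)))
    # Replace empty strings with NaN
    reader['HWMC'] = reader['HWMC'].apply(lambda s: np.nan if str(s) == '' else s)
    # df_null = df[df['HWMC'].isnull()]
    reader = reader[reader['HWMC'].notnull()]
    reader.index = np.arange(len(reader))
    reader = shuffle(reader)
    for line in reader.index:
        x.append(reader.HWMC[line])
        y.append(reader.U_CODE[line])  # convert the label to float here if needed
        cur_line_num += 1
        if cur_line_num >= minibatch_size:
            # Convert the batch to NumPy arrays and hand it to the caller
            x, y = np.array(x), np.array(y)
            yield x, y
            x, y = [], []
            cur_line_num = 0
    if x:
        # Yield the final, smaller batch instead of silently dropping it
        yield np.array(x), np.array(y)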
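The generator above still reads the entire CSV before it starts batching. A stream-like variant can be sketched with the chunksize argument of pandas.read_csv, which returns an iterator of DataFrames. The function name iter_csv_minibatches, its keyword arguments, and the in-memory demo data are hypothetical; tokenization is left out, and rows arrive in file order, so a file sorted by label would have to be shuffled on disk first.

```python
import io
import pandas as pd


def iter_csv_minibatches(source, minibatch_size=1000,
                         text_col='HWMC', label_col='U_CODE'):
    """Yield (X, y) NumPy arrays while holding only one chunk in memory."""
    chunks = pd.read_csv(source, chunksize=minibatch_size, dtype={text_col: str})
    for chunk in chunks:
        # Drop rows whose text field is missing or blank
        chunk = chunk[chunk[text_col].notnull()]
        chunk = chunk[chunk[text_col].str.strip() != '']
        if len(chunk) == 0:
            continue
        yield chunk[text_col].to_numpy(), chunk[label_col].to_numpy()


# Tiny in-memory stand-in for a large file (made-up rows)
demo_csv = io.StringIO(
    "HWMC,U_CODE\n"
    "steel pipe,101\n"
    ",102\n"
    "copper wire,103\n"
    "rice,201\n"
    "flour,201\n"
)
batches = list(iter_csv_minibatches(demo_csv, minibatch_size=2))
batch_sizes = [len(texts) for texts, _ in batches]
```

Because rows are filtered after each chunk is read, a batch can be smaller than minibatch_size; partial_fit accepts batches of any size.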
The training code follows. Do not copy it as-is; adapt the feature extraction to your own business requirements.
import pandas as pd
import numpy as np
import datetime
import gc
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn import metrics


# The def line is missing from the source; the name and arguments are inferred from the
# sp.fitby_linear_model(path, models, size) call and from the names used in the body.
def fitby_linear_model(filename, models, size):
    df_sc = pd.DataFrame([[0, 0, 0]], columns=['model', 'time', 'score'])
    num = 1  # never incremented, so every model is compared against the initial score of 0
    for model in models:
        MD = models[model]
        print("Collecting classes", datetime.datetime.now().strftime('%Y.%m.%d-%H:%M:%S'))
        all_classes = get_classes(filename)
        minibatch_train_iterators = iter_minibatches(filename, size)
        # The first mini-batch is held out as the test set
        x_test, y_test = next(minibatch_train_iterators)
        print("Training started", datetime.datetime.now().strftime('%Y.%m.%d-%H:%M:%S'))
        for i, (X_train, y_train) in enumerate(minibatch_train_iterators):
            print("{} time".format(i))  # current iteration
            # Use partial_fit and pass classes, which is required on the first call
            MD.partial_fit(get_hv(X_train), y_train, classes=all_classes)
        result = MD.predict(get_hv(x_test))
        print(model, "score: %.4g" % metrics.accuracy_score(y_test, result))  # accuracy on the held-out batch
        df_sc.loc[num] = {'model': model, 'time': datetime.datetime.now().strftime('%Y.%m.%d-%H:%M:%S'), 'score': MD.score(get_hv(x_test), y_test)}
        if df_sc.score[num] > df_sc.score[num - 1]:
            print("Training finished, saving model", datetime.datetime.now().strftime('%Y.%m.%d-%H:%M:%S'))
            # Save the model
            joblib.dump(MD, "/root/lizheng/model/model_learn1%s.pkl.gz" % model, compress=('gzip', 3))
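The training code calls get_classes and get_hv, whose definitions are not shown in this article. The sketch below is one plausible shape for them, under two loud assumptions: that get_hv wraps a stateless HashingVectorizer (so no vocabulary has to be fitted on the full data set), and that get_classes scans only the label column in chunks. The n_features value, the chunk size, and the demo rows are illustrative choices, not the author's settings.

```python
import io
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer

# Assumption: a hashing vectorizer, which needs no fit and therefore suits mini-batches
_vectorizer = HashingVectorizer(n_features=2 ** 18, alternate_sign=False)


def get_hv(texts):
    """Hypothetical stand-in: map an iterable of strings to a sparse feature matrix."""
    return _vectorizer.transform(texts)


def get_classes(source, label_col='U_CODE', chunksize=100000):
    """Hypothetical stand-in: collect every distinct label, reading one column in chunks."""
    labels = set()
    for chunk in pd.read_csv(source, usecols=[label_col], chunksize=chunksize):
        labels.update(chunk[label_col].unique())
    return np.array(sorted(labels))


# Made-up rows standing in for the real CSV
demo_csv = io.StringIO("HWMC,U_CODE\nsteel pipe,101\nrice,201\nflour,201\n")
all_classes = get_classes(demo_csv, chunksize=2)
features = get_hv(["steel pipe", "rice"])
```

With a fitted vectorizer such as TfidfVectorizer the vocabulary would have to be learned from the whole corpus first, which is the step that mini-batch training is meant to avoid.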
from sklearn.linear_model import PassiveAggressiveClassifier
import sys
# sys.path.append("D:/PDM/SPBM")
sys.path.append("/root/lizheng")

# Candidate models; the commented-out entries are earlier trials (trailing numbers are the author's notes)
models_learn = {
    # 'pa1-0.6': PassiveAggressiveClassifier(C=0.6, max_iter=100000, loss='squared_hinge', average=True, n_jobs=-1, random_state=1),  # 71.4
    # 'pa1-0.7': PassiveAggressiveClassifier(C=0.7, max_iter=10000, loss='squared_hinge', average=True, n_jobs=-1, random_state=1),
    # 'pa1-0.8': PassiveAggressiveClassifier(C=0.8, max_iter=10000, loss='squared_hinge', average=True, n_jobs=-1, random_state=1),
    # 'pa1-0.9': PassiveAggressiveClassifier(C=0.9, max_iter=10000, loss='squared_hinge', average=True, n_jobs=-1, random_state=1),
    # 'pa1-1': PassiveAggressiveClassifier(C=1, max_iter=100000, loss='squared_hinge', average=True, n_jobs=-1, random_state=1),  # 71.6
    'pa4-1': PassiveAggressiveClassifier(C=2, max_iter=10000, loss='hinge', average=True, n_jobs=-1, random_state=1)
}
# sp is the author's own module that contains fitby_linear_model; its import is not shown in the source
sp.fitby_linear_model('/root/lizheng/fcqspbm_1214a.csv', models_learn, 1000000)
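Once training has written a compressed model file, it can be loaded back with joblib.load and used for prediction. The round trip below is a small sketch on made-up data, written to a temporary directory; the file name model_demo.pkl.gz and the toy arrays are assumptions, and the compress setting mirrors the one used in the training code.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

# Made-up two-class data, only to have a fitted model to save
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.2], [0.1, 1.0]])
y = np.array([0, 1, 0, 1])
clf = PassiveAggressiveClassifier(random_state=1)
clf.partial_fit(X, y, classes=np.array([0, 1]))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_demo.pkl.gz")
    # Same gzip compression tuple as in the training code
    joblib.dump(clf, path, compress=('gzip', 3))
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
same_predictions = bool((restored.predict(X) == clf.predict(X)).all())
```

A restored model can also keep learning: calling partial_fit on it continues from the saved weights, as long as the features are produced the same way as during training.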
sklearn.linear_model.PassiveAggressiveClassifier(C=1.0, fit_intercept=True, max_iter=None, tol=None, shuffle=True, verbose=0, loss='hinge', n_jobs=1, random_state=None, warm_start=False, class_weight=None, average=False, n_iter=None)
Passive Aggressive Classifier
Parameters:
    C : float
    fit_intercept : bool, default=True
    max_iter : int, optional
    tol : float or None, optional
    shuffle : bool, default=True
    verbose : integer, optional
    loss : string, optional
    n_jobs : integer, optional
    random_state : int, RandomState instance or None, optional, default=None
    warm_start : bool, optional
    class_weight : dict, {class_label: weight} or "balanced" or None, optional
    average : bool or int, optional
    n_iter : int, optional
Attributes:
    coef_ : array, shape = [1, n_features] if n_classes == 2 else [n_classes, n_features]
    intercept_ : array, shape = [1] if n_classes == 2 else [n_classes]
    n_iter_ : int
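To give the C and loss parameters some intuition, here is a NumPy sketch of a single binary Passive-Aggressive update. In scikit-learn, loss='hinge' corresponds to the PA-I variant and loss='squared_hinge' to PA-II. The function pa_update is a hypothetical teaching aid, not scikit-learn's implementation: it ignores the intercept, averaging, and the multi-class case.

```python
import numpy as np


def pa_update(w, x, y, C=1.0, loss='hinge'):
    """One binary Passive-Aggressive step; y must be -1 or +1."""
    hinge = max(0.0, 1.0 - y * np.dot(w, x))
    if hinge == 0.0:
        # Passive: the example is already classified with enough margin
        return w
    sq_norm = np.dot(x, x)
    if loss == 'hinge':
        # PA-I: the step size is capped by C
        tau = min(C, hinge / sq_norm)
    else:
        # PA-II: C softens the step size instead of capping it
        tau = hinge / (sq_norm + 1.0 / (2.0 * C))
    # Aggressive: move w just far enough toward satisfying the margin
    return w + tau * y * x


w0 = np.zeros(2)
x = np.array([1.0, 2.0])
w1 = pa_update(w0, x, y=1, C=1.0, loss='hinge')    # uncapped step
w_capped = pa_update(w0, x, y=1, C=0.1, loss='hinge')  # step limited by a small C
w2 = pa_update(w1, x, y=1, C=1.0, loss='hinge')    # same example again
```

A smaller C makes each correction more cautious, which helps with noisy labels; a larger C lets the model fix each mistake almost completely in one step.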