Scikit-Learn: the most popular and widely used library for machine learning in Python.
Notes based on the official scikit-learn (sklearn) documentation (Chinese edition).
X_train, y_train, X_test, y_test = getData()  # training and test sets
model = somemodel()                           # define the model
model.fit(X_train, y_train)                   # fit the model
predictions = model.predict(X_test)           # predict on the test set
model.get_params()                            # get the model's parameters
score = model.score(X_test, y_test)           # score the model
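The generic workflow above can be made concrete. A minimal runnable sketch, using `load_iris` and `LogisticRegression` as stand-ins for the hypothetical `getData()` and `somemodel()`:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a toy dataset and split it (stands in for getData())
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000)  # define the model
model.fit(X_train, y_train)                # fit on the training set
predictions = model.predict(X_test)        # predict on the test set
params = model.get_params()                # hyperparameters as a dict
score = model.score(X_test, y_test)        # mean accuracy on the test set
```

Every estimator follows this same fit/predict/score interface, so swapping in another model only changes the constructor line.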
Model Fitting
# Supervised learning
>>> lr.fit(X, y)
>>> svc.fit(X_train, y_train)
# Unsupervised Learning
>>> k_means.fit(X_train)
>>> pca_model = pca.fit_transform(X_train)
Prediction
# Supervised Estimators
>>> y_pred = svc.predict(np.random.random((2,5)))
>>> y_pred = lr.predict(X_test)
>>> y_pred = knn.predict_proba(X_test)
# Unsupervised Estimators
>>> y_pred = k_means.predict(X_test)
import sklearn.datasets
Loader | Task type |
---|---|
datasets.load_iris() | classification |
datasets.load_boston() | regression |
datasets.load_digits() | classification (digit images) |
datasets.make_classification() creates a synthetic classification dataset;
datasets.make_regression() creates a synthetic regression dataset.
Parameters:
n_samples: number of samples
n_features: number of features
n_classes: number of classes
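A short sketch of both ways to get data, loading a bundled dataset and generating a synthetic one:

```python
from sklearn.datasets import load_iris, make_classification

iris = load_iris()  # bundled classification dataset (150 samples, 4 features)

# Synthetic binary classification problem
X, y = make_classification(
    n_samples=100,   # number of samples
    n_features=5,    # number of features
    n_classes=2,     # number of classes
    random_state=0,
)
```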
sklearn.linear_model | Linear models |
---|---|
LinearRegression | ordinary least squares |
Ridge | ridge regression |
Lasso | linear regression with sparse coefficient estimates |
MultiTaskLasso | multi-task Lasso regression |
LogisticRegression | logistic regression |
SGDRegressor | stochastic gradient descent regression |
SGDClassifier | stochastic gradient descent classification |
ElasticNetCV | elastic-net regression with built-in cross-validation |
MultiTaskElasticNet | multi-task elastic-net regression |
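A quick sketch contrasting ordinary least squares with ridge regression on a perfectly linear toy dataset (y = 2x + 1); `alpha` is the L2 regularization strength, which shrinks the ridge coefficient slightly below the OLS one:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1  # exact linear relationship

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # alpha controls L2 shrinkage
```

OLS recovers the slope 2 and intercept 1 exactly; ridge trades a small amount of bias (a slightly smaller slope) for lower variance.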
sklearn.svm | Support vector machines |
---|---|
SVC | support vector classification |
SVR | support vector regression |
sklearn.neighbors | Nearest neighbors |
---|---|
NearestNeighbors | unsupervised nearest neighbors |
KNeighborsClassifier | k-nearest neighbors classification |
RadiusNeighborsClassifier | fixed-radius neighbors classification |
KNeighborsRegressor | k-nearest neighbors regression |
RadiusNeighborsRegressor | fixed-radius neighbors regression |
NearestCentroid | nearest-centroid classification |
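A minimal k-NN sketch on iris, also showing `predict_proba` (mentioned in the Prediction section above), which returns one probability column per class:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)  # vote among the 5 nearest points
knn.fit(X, y)
proba = knn.predict_proba(X[:3])  # class-probability estimates, shape (3, 3)
acc = knn.score(X, y)             # accuracy on the training data
```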
sklearn.gaussian_process | Gaussian processes |
---|---|
GaussianProcessRegressor | Gaussian process regression (GPR) |
GaussianProcessClassifier | Gaussian process classification (GPC) |
Kernel | Gaussian process kernels |
sklearn.tree | Decision trees |
---|---|
DecisionTreeClassifier | decision tree classification |
DecisionTreeRegressor | decision tree regression |
sklearn.kernel_ridge | Kernel ridge regression |
---|---|
KernelRidge | kernel ridge regression |
sklearn.isotonic | Isotonic regression |
---|---|
IsotonicRegression | isotonic (monotonic) regression |
sklearn.multiclass | Multiclass and multilabel algorithms |
---|---|
OneVsRestClassifier / OneVsOneClassifier | one-vs-rest and one-vs-one meta-estimators |
sklearn.naive_bayes | Naive Bayes classifiers |
---|---|
GaussianNB | Gaussian naive Bayes |
MultinomialNB | multinomial naive Bayes |
BernoulliNB | Bernoulli naive Bayes |
sklearn.ensemble | Ensemble methods |
---|---|
BaggingClassifier | bagging classification |
BaggingRegressor | bagging regression |
RandomForestClassifier | random forest classification |
RandomForestRegressor | random forest regression |
ExtraTreesClassifier | extremely randomized trees classification |
ExtraTreesRegressor | extremely randomized trees regression |
AdaBoostClassifier | AdaBoost classification |
AdaBoostRegressor | AdaBoost regression |
GradientBoostingClassifier | gradient tree boosting classification |
GradientBoostingRegressor | gradient tree boosting regression |
VotingClassifier | voting classifier |
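A short random-forest sketch; `feature_importances_` is the per-feature importance vector ensemble tree models expose, normalized to sum to 1:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(X, y)
importances = rf.feature_importances_  # one importance score per feature
acc = rf.score(X, y)
```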
sklearn.neural_network | Neural networks |
---|---|
MLPClassifier | multi-layer perceptron (MLP) classification |
MLPRegressor | multi-layer perceptron (MLP) regression |
sklearn.mixture | Gaussian mixture models |
---|---|
GaussianMixture | Gaussian mixture |
BayesianGaussianMixture | variational Bayesian Gaussian mixture |
sklearn.cluster | Clustering |
---|---|
KMeans | k-means clustering |
AffinityPropagation | affinity propagation clustering |
MeanShift | mean-shift clustering |
SpectralClustering | spectral clustering |
AgglomerativeClustering | hierarchical (agglomerative) clustering |
DBSCAN | density-based clustering |
Birch | BIRCH clustering |

sklearn.cluster.bicluster | Biclustering |
---|---|
SpectralCoclustering | spectral co-clustering |
SpectralBiclustering | spectral biclustering |
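A minimal k-means sketch on two well-separated synthetic blobs; note the unsupervised pattern from the Model Fitting section above (`fit` takes no labels, `predict` assigns cluster indices):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated 2-D blobs, 20 points each
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + 10])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.predict(X)  # cluster index (0 or 1) for each sample
```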
sklearn.decomposition | Matrix decomposition |
---|---|
PCA | exact PCA with a probabilistic interpretation |
IncrementalPCA | incremental PCA |
KernelPCA | kernel PCA |
SparsePCA | sparse principal component analysis |
MiniBatchSparsePCA | mini-batch sparse PCA |
TruncatedSVD | truncated singular value decomposition |
SparseCoder | sparse coding with a precomputed dictionary |
DictionaryLearning | generic dictionary learning |
MiniBatchDictionaryLearning | mini-batch dictionary learning |
FactorAnalysis | factor analysis (Gaussian latent variables) |
FastICA | independent component analysis, ICA (non-Gaussian latent variables) |
NMF | non-negative matrix factorization (NMF / NNMF) |
LatentDirichletAllocation | latent Dirichlet allocation (LDA) |
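A minimal PCA sketch, projecting the 4-dimensional iris data down to 2 dimensions; `explained_variance_ratio_` reports the fraction of total variance captured by each component, in decreasing order:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)               # keep the top 2 principal components
X_2d = pca.fit_transform(X)             # project 4-D data to 2-D
ratio = pca.explained_variance_ratio_   # variance captured per component
```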
sklearn.manifold | Manifold learning (nonlinear dimensionality reduction) |
---|---|
Isomap | isometric mapping |
LocallyLinearEmbedding | locally linear embedding |
SpectralEmbedding | spectral embedding |
MDS | multidimensional scaling |
TSNE | t-distributed stochastic neighbor embedding (t-SNE) |
import sklearn.model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=None, train_size=None)
Parameter test_size: float (fraction of samples, default 0.25) or int (number of samples)
scores = model_selection.cross_val_score(model, X, y,
                                         scoring=None, cv=None,
                                         n_jobs=1, verbose=0, fit_params=None)
scores.mean()
Returns an ndarray with one score per fold.
scoring: scoring method
cv (cross-validation): number of folds
n_jobs: number of parallel jobs
predicted = cross_val_predict(model, X, y, cv=None)
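A runnable cross-validation sketch: `cross_val_score` returns one score per fold, while `cross_val_predict` returns an out-of-fold prediction for every sample.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, cross_val_predict

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)       # 5 fold scores
predicted = cross_val_predict(model, X, y, cv=5)  # one prediction per sample
mean_score = scores.mean()
```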
estimator.get_params() # get the estimator's parameters
# Grid search
GridSearchCV(model, param_grid, scoring=None, verbose=0)
# param_grid: dict or list of dicts
# verbose: log verbosity (int): 0 = no training output, 1 = occasional output, >1 = output for every sub-model
# Randomized search
RandomizedSearchCV()
Scoring parameters: see sklearn.metrics
Examples:
>>> from sklearn import datasets, svm
>>> from sklearn.model_selection import GridSearchCV
>>> iris = datasets.load_iris()
>>> parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}  # hyperparameter grid
>>> svc = svm.SVC()
>>> clf = GridSearchCV(svc, parameters)
>>> clf.fit(iris.data, iris.target)
# Attributes:
>>> clf.best_params_     # best parameters
>>> clf.best_estimator_  # best estimator
>>> clf.best_score_      # best score
>>> param_range = np.logspace(-6, 0, 5)
# Validation curve
>>> train_scores, valid_scores = validation_curve(
        model, X, y,
        param_name, param_range=param_range,  # parameter name and range to sweep
        cv=None, scoring=None)
# Scores
>>> train_scores_mean = np.mean(train_scores, axis=1)
>>> valid_scores_mean = np.mean(valid_scores, axis=1)
# Visualization
>>> plt.plot(param_range, train_scores_mean, color='r', label='training')
>>> plt.plot(param_range, valid_scores_mean, color='g', label='cross-validation')
>>> plt.show()
train_sizes,train_scores,valid_scores = learning_curve(model, X, y, train_sizes, cv=None, scoring=None)
import sklearn.metrics
Function | Description |
---|---|
accuracy_score | classification: accuracy |
log_loss | classification: log (cross-entropy) loss |
roc_auc_score | classification: AUC (area under the ROC curve) |
confusion_matrix | classification: confusion matrix |
mean_absolute_error | regression: mean absolute error |
mean_squared_error | regression: mean squared error |
r2_score | regression: R² |
label_ranking_loss | ranking metric |
mutual_info_score | clustering: mutual information |
adjusted_rand_score | clustering: adjusted Rand index |
homogeneity_score | clustering: homogeneity |
v_measure_score | clustering: V-measure |
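A quick sketch of a few of these metrics on tiny hand-made labels, so the numbers can be verified by eye:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error

y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]

acc = accuracy_score(y_true, y_pred)   # 3 of 4 correct -> 0.75
cm = confusion_matrix(y_true, y_pred)  # rows: true class, cols: predicted class
mse = mean_squared_error([1.0, 2.0], [1.0, 4.0])  # (0^2 + 2^2) / 2 -> 2.0
```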
# pickle
>>> # model is an already-fitted estimator
>>> import pickle
>>> with open('model.pickle', 'wb') as f:
...     pickle.dump(model, f)   # save
>>> with open('model.pickle', 'rb') as f:
...     clf = pickle.load(f)    # load
# joblib
>>> from sklearn.externals import joblib
>>> joblib.dump(model, 'filename.pkl')   # save
>>> model = joblib.load('filename.pkl')  # load
sklearn.feature_selection | Feature selection |
---|---|
VarianceThreshold | removes low-variance features |
SelectKBest | keeps only the K highest-scoring features (univariate) |
SelectPercentile | keeps only the user-specified top percentage of highest-scoring features |
GenericUnivariateSelect | univariate feature selection with a configurable strategy; the strategy can itself be tuned via hyperparameter search |
RFECV | recursive feature elimination with cross-validation |
Scoring (test) functions:
For regression: f_regression, mutual_info_regression
For classification: chi2, f_classif, mutual_info_classif
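A minimal `SelectKBest` sketch using the `chi2` test on iris (whose features are non-negative, as chi2 requires), keeping the 2 best features:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
selector = SelectKBest(chi2, k=2)     # keep the 2 highest-scoring features
X_new = selector.fit_transform(X, y)
mask = selector.get_support()         # boolean mask over the original features
```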
import sklearn.preprocessing
X_scaled = preprocessing.scale(X_train)  # standardize: zero mean, unit variance
# MinMaxScaler
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_test_minmax = min_max_scaler.transform(X_test)
# X_std =(X-X.min)/(X.max-X.min)
# X_scaled = X_std * (max - min) + min
# MaxAbsScaler
max_abs_scaler = preprocessing.MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
X_test_maxabs = max_abs_scaler.transform(X_test)
# X_scaled = X / max(abs(X))
# RobustScaler (robust to outliers: uses median and IQR)
robust_scaler = preprocessing.RobustScaler()
X_train_robust = robust_scaler.fit_transform(X_train)
X_test_robust = robust_scaler.transform(X_test)
# QuantileTransformer (maps features to a uniform distribution)
quantile_transformer = preprocessing.QuantileTransformer(random_state=0)
X_train_trans = quantile_transformer.fit_transform(X_train)
X_test_trans = quantile_transformer.transform(X_test)
# normalize: scale each sample to unit norm
X_normalized = preprocessing.normalize(X, norm='l2')
# Binarizer: threshold features to 0/1
binarizer = preprocessing.Binarizer(threshold=0.0)
binarizer.transform(X)
# OneHotEncoder: encode categorical integer features as one-hot vectors
enc = preprocessing.OneHotEncoder(n_values='auto', categorical_features='all')
enc.fit(X_train)
enc.transform(X_test).toarray()
# Imputer: fill in missing values
imp = preprocessing.Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit(X_train)
imp.transform(X_test)
Parameter strategy: {'mean', 'median', 'most_frequent'}
X = [X1, X2]
poly = preprocessing.PolynomialFeatures(degree=2, interaction_only=False)
poly.fit_transform(X)
Parameters:
degree=2: features are expanded from [X1, X2] to [1, X1, X2, X1², X1·X2, X2²]
interaction_only=True: keep only interaction terms, [1, X1, X2, X1·X2]
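A concrete check of the expansion above for a single sample [2, 3]:

```python
from sklearn.preprocessing import PolynomialFeatures

# Full degree-2 expansion: columns are 1, X1, X2, X1^2, X1*X2, X2^2
poly = PolynomialFeatures(degree=2)
out = poly.fit_transform([[2, 3]])  # -> [[1, 2, 3, 4, 6, 9]]

# Interaction terms only: columns are 1, X1, X2, X1*X2
inter = PolynomialFeatures(degree=2, interaction_only=True)
out_inter = inter.fit_transform([[2, 3]])  # -> [[1, 2, 3, 6]]
```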
empirical_covariance | empirical (maximum-likelihood) covariance estimate
Feature extraction (sklearn.feature_extraction): extracts features in a format supported by machine learning algorithms, e.g. from text and images.
CalibratedClassifierCV | probability calibration of classifiers