[Python Third-Party Packages] scikit-learn

The difference between KFold and StratifiedKFold

class sklearn.model_selection.StratifiedKFold(n_splits=3, shuffle=False, random_state=None)
Stratified K-Folds cross-validator. Provides train/test indices to split data in train/test sets. This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.

In other words, StratifiedKFold splits the data into folds while preserving the relative proportion of each class label.

Let's look at some code:

import sklearn.model_selection as skmodsel

print('Suppose we have 100 samples; X is the feature matrix and y the class labels')
X = [[i] for i in range(100)]
y = ['a'] * 30 + ['b'] * 30 + ['c'] * 30 + ['d'] * 10

print('Splitting with KFold')
K_folds = skmodsel.KFold(n_splits=10)
for train_indices, test_indices in K_folds.split(X):
    print('test set:', test_indices)
    # Count how many samples of each class end up in the training set
    d = {}
    for i in train_indices:
        d[y[i]] = d.setdefault(y[i], 0) + 1
    print(d)

K_strafold = skmodsel.StratifiedKFold(n_splits=10)
print('Splitting with StratifiedKFold')
# Note: StratifiedKFold.split needs y as well, since it stratifies on the labels
for train_indices, test_indices in K_strafold.split(X, y):
    print('test set:', test_indices)
    d = {}
    for i in train_indices:
        d[y[i]] = d.setdefault(y[i], 0) + 1
    print(d)

The output:

Suppose we have 100 samples; X is the feature matrix and y the class labels
Splitting with KFold
test set: [0 1 2 3 4 5 6 7 8 9]
{'a': 20, 'b': 30, 'c': 30, 'd': 10}
test set: [10 11 12 13 14 15 16 17 18 19]
{'a': 20, 'b': 30, 'c': 30, 'd': 10}
test set: [20 21 22 23 24 25 26 27 28 29]
{'a': 20, 'b': 30, 'c': 30, 'd': 10}
test set: [30 31 32 33 34 35 36 37 38 39]
{'a': 30, 'b': 20, 'c': 30, 'd': 10}
test set: [40 41 42 43 44 45 46 47 48 49]
{'a': 30, 'b': 20, 'c': 30, 'd': 10}
test set: [50 51 52 53 54 55 56 57 58 59]
{'a': 30, 'b': 20, 'c': 30, 'd': 10}
test set: [60 61 62 63 64 65 66 67 68 69]
{'a': 30, 'b': 30, 'c': 20, 'd': 10}
test set: [70 71 72 73 74 75 76 77 78 79]
{'a': 30, 'b': 30, 'c': 20, 'd': 10}
test set: [80 81 82 83 84 85 86 87 88 89]
{'a': 30, 'b': 30, 'c': 20, 'd': 10}
test set: [90 91 92 93 94 95 96 97 98 99]
{'a': 30, 'b': 30, 'c': 30}
Splitting with StratifiedKFold
test set: [ 0  1  2 30 31 32 60 61 62 90]
{'a': 27, 'b': 27, 'c': 27, 'd': 9}
test set: [ 3  4  5 33 34 35 63 64 65 91]
{'a': 27, 'b': 27, 'c': 27, 'd': 9}
test set: [ 6  7  8 36 37 38 66 67 68 92]
{'a': 27, 'b': 27, 'c': 27, 'd': 9}
test set: [ 9 10 11 39 40 41 69 70 71 93]
{'a': 27, 'b': 27, 'c': 27, 'd': 9}
test set: [12 13 14 42 43 44 72 73 74 94]
{'a': 27, 'b': 27, 'c': 27, 'd': 9}
test set: [15 16 17 45 46 47 75 76 77 95]
{'a': 27, 'b': 27, 'c': 27, 'd': 9}
test set: [18 19 20 48 49 50 78 79 80 96]
{'a': 27, 'b': 27, 'c': 27, 'd': 9}
test set: [21 22 23 51 52 53 81 82 83 97]
{'a': 27, 'b': 27, 'c': 27, 'd': 9}
test set: [24 25 26 54 55 56 84 85 86 98]
{'a': 27, 'b': 27, 'c': 27, 'd': 9}
test set: [27 28 29 57 58 59 87 88 89 99]
{'a': 27, 'b': 27, 'c': 27, 'd': 9}

We asked for ten folds, i.e., the samples are split into ten parts and each part serves exactly once as the test set.
The result is clear. With 100 samples, 30 each with labels a, b, and c plus 10 with label d, the class ratio is 3:3:3:1.
Plain KFold simply slices the data in order: samples 0-9 form the first fold, samples 10-19 the second, and so on, ignoring the class distribution inside each fold.
StratifiedKFold, by contrast, takes the class distribution into account when splitting, so every fold keeps the same class proportions as the full dataset, as the sketch below also shows with shuffling enabled.
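A minimal follow-up sketch, reusing X, y, and skmodsel from the snippet above: even with shuffle=True (the random_state value here is an arbitrary choice), StratifiedKFold keeps the 3:3:3:1 ratio in every test fold; only which samples land in which fold changes.

from collections import Counter

shuffled = skmodsel.StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_indices, test_indices in shuffled.split(X, y):
    # Each 10-sample test fold should hold 3 a's, 3 b's, 3 c's and 1 d
    print(Counter(y[i] for i in test_indices))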

SVM's decision_function and predict_proba

By default, sklearn's SVC does not expose a usable predict_proba method, i.e., out of the box you cannot get a posterior probability for a prediction on a given sample. Instead, SVMs provide their own decision_function method:

import numpy as np
from sklearn import svm

X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])
svm1 = svm.SVC(kernel='linear', probability=False, random_state=0)
svm1.fit(X, y)
print(svm1.decision_function(X))
# output
# [-1.  -1.5  1.   1.5]

# The multi-class case
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1], [0, 0], [0, 0.1]])
y = np.array([1, 1, 2, 2, 3, 3])
svm1 = svm.SVC(kernel='linear', probability=True, random_state=0)
svm1.fit(X, y)
print(svm1.decision_function(X))
# output
# [[ 2.18181818 -0.35863636  1.17681818]
#  [ 2.31818182 -0.495       1.17681818]
#  [-0.36363636  2.16863636  1.195     ]
#  [-0.5         2.305       1.195     ]
#  [ 0.90909091 -0.095       2.18590909]
#  [-0.10454545  0.91772727  2.18681818]]

As you can see, for a two-class problem decision_function returns a one-dimensional array, while for a multi-class problem the output shape depends on the decision_function_shape parameter (see the SVM API docs and the sketch below).
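To make the dependence on decision_function_shape visible, here is a small sketch with four classes (the toy data is made up for illustration): 'ovo' yields one column per pair of classes, n_classes*(n_classes-1)/2 = 6, while 'ovr' yields one column per class, 4.

import numpy as np
from sklearn import svm

X4 = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1], [0, 0], [0, 0.1], [3, 3], [3, 4]])
y4 = np.array([1, 1, 2, 2, 3, 3, 4, 4])
clf_ovo = svm.SVC(kernel='linear', decision_function_shape='ovo').fit(X4, y4)
clf_ovr = svm.SVC(kernel='linear', decision_function_shape='ovr').fit(X4, y4)
print(clf_ovo.decision_function(X4).shape)  # (8, 6): one score per class pair
print(clf_ovr.decision_function(X4).shape)  # (8, 4): one score per class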
Moreover, in the two-class case above, a negative decision_function value means the sample is assigned to the class with the smaller label, here class 1 (checked below).
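A quick check of that sign convention (a minimal sketch that refits the binary example, since svm1 was reassigned above): thresholding decision_function at zero should reproduce predict exactly.

import numpy as np
from sklearn import svm

Xb = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
yb = np.array([1, 1, 2, 2])
clf_b = svm.SVC(kernel='linear').fit(Xb, yb)
scores = clf_b.decision_function(Xb)
# Negative score -> clf_b.classes_[0] (the smaller label), positive -> clf_b.classes_[1]
print(clf_b.classes_[(scores > 0).astype(int)])  # [1 1 2 2]
print(clf_b.predict(Xb))                         # [1 1 2 2]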

If you instead construct the SVC with probability=True, the model also gains a predict_proba method:

import numpy as np
from sklearn import svm

X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])
clf = svm.SVC(kernel='linear', probability=True, random_state=0)  # don't shadow the svm module
clf.fit(X, y)
print(clf.predict_proba(X))
# output
# [[0.71303947 0.28696053]
#  [0.79661758 0.20338242]
#  [0.28696001 0.71303999]
#  [0.20335122 0.79664878]]

predict_proba's output is computed from decision_function's output using Platt scaling; see the probA_ and probB_ attributes in the SVM API docs. A rough sketch of the idea follows.
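A minimal sketch of the binary case, reusing clf and X from the snippet above. probA_ and probB_ hold the fitted sigmoid parameters, but libsvm estimates them by internal cross-validation and its class ordering can flip the sign, so treat this as an illustration of the formula rather than an exact reproduction of predict_proba.

# Platt scaling squashes the raw margin f(x) through a sigmoid:
#   P = 1 / (1 + exp(A * f(x) + B))
f = clf.decision_function(X)
A, B = clf.probA_[0], clf.probB_[0]
print(1.0 / (1.0 + np.exp(A * f + B)))  # should roughly track one column of predict_proba(X)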

PCA

explained_variance_ holds the eigenvalues of the sample covariance matrix, one per retained component (verified in the sketch after the example).

import numpy as np
from sklearn.decomposition import PCA

X = np.array([(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
              (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)])
pca = PCA(n_components=1)
pca.fit(X)

print(pca.components_)           # the leading principal direction (eigenvector)
print(pca.explained_variance_)   # the corresponding eigenvalue

# [[-0.6778734  -0.73517866]]
# [1.28402771]
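A quick check of that claim: explained_variance_ should match the eigenvalues of the sample covariance matrix, since np.cov uses the same n-1 normalization as PCA.

cov = np.cov(X.T)                  # 2x2 sample covariance matrix
eigvals = np.linalg.eigvalsh(cov)  # eigenvalues in ascending order
print(eigvals[::-1])               # descending: the leading value matches
                                   # pca.explained_variance_, i.e. ~1.28402771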
