Preface
- Don't evaluate your classifier with accuracy (or the error rate). Instead, use metrics such as:
  - The Area Under the ROC curve (AUC)
  - F1 Score
  - Cohen's Kappa
- Predict probabilities rather than hard class labels
- Tune the decision threshold
- Always train on the natural class distribution
- Use stratified splits, e.g. sklearn.model_selection.StratifiedKFold (the older sklearn.cross_validation module has been removed)
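The "predict probabilities, then tune the threshold" advice can be sketched in plain Python (toy data and helper names below are hypothetical, just for illustration): sweep candidate thresholds over the predicted positive-class probabilities and keep the one that maximizes F1, instead of the default 0.5 cut-off.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 score of the positive class when predicting 1 iff prob >= threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(y_true, y_prob, grid=None):
    """Return the threshold in `grid` with the highest F1 score."""
    grid = grid or [i / 100 for i in range(1, 100)]
    return max(grid, key=lambda t: f1_at_threshold(y_true, y_prob, t))

# Toy imbalanced example: 8 negatives, 2 positives with modest scores.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.05, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.35, 0.4, 0.45]
print(best_threshold(y_true, y_prob))  # a threshold well below 0.5 wins here
```

With a well-calibrated but imbalanced model the positive class often never reaches probability 0.5, so the default cut-off predicts all-negative; tuning the threshold recovers the positives.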
1. Over-sampling
- Over-sampling automatically generates new samples for the minority classes.
- RandomOverSampler generates new samples by randomly re-sampling, with replacement, the currently available samples.
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
... n_redundant=0, n_repeated=0, n_classes=3,
... n_clusters_per_class=1,
... weights=[0.01, 0.05, 0.94],
... class_sep=0.8, random_state=0)
>>> from imblearn.over_sampling import RandomOverSampler
>>> ros = RandomOverSampler(random_state=0)
>>> X_resampled, y_resampled = ros.fit_sample(X, y)
>>> from collections import Counter
>>> print(sorted(Counter(y_resampled).items()))
[(0, 4674), (1, 4674), (2, 4674)]
- From random over-sampling to SMOTE and ADASYN
from imblearn.over_sampling import SMOTE, ADASYN
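Unlike random over-sampling, SMOTE does not duplicate points: it interpolates between a minority sample and one of its minority nearest neighbors. The core step can be sketched in a few lines of numpy (hypothetical minority data; for real use go through imblearn's SMOTE/ADASYN classes):

```python
import numpy as np

def smote_sample(X_min, k=2, rng=None):
    """Generate one SMOTE-style synthetic point from minority samples X_min.

    Picks a random minority sample x_i, one of its k nearest minority
    neighbors x_nn, and returns x_i + lam * (x_nn - x_i), lam in [0, 1).
    """
    rng = rng or np.random.default_rng(0)
    i = rng.integers(len(X_min))
    # distances from x_i to every other minority sample
    d = np.linalg.norm(X_min - X_min[i], axis=1)
    d[i] = np.inf                      # exclude x_i itself
    nn = np.argsort(d)[:k]             # indices of the k nearest neighbors
    j = rng.choice(nn)
    lam = rng.random()                 # random interpolation factor
    return X_min[i] + lam * (X_min[j] - X_min[i])

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(smote_sample(X_min))  # lies on a segment between two minority samples
```

ADASYN follows the same interpolation idea but draws more synthetic points near minority samples that are hard to learn (those surrounded by majority neighbors).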
2. Under-sampling
- ClusterCentroids under-samples with K-means: each over-represented class is reduced to the centroids of K-means clusters, so the retained samples are newly generated centroids rather than original points.
>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
... n_redundant=0, n_repeated=0, n_classes=3,
... n_clusters_per_class=1,
... weights=[0.01, 0.05, 0.94],
... class_sep=0.8, random_state=0)
>>> print(sorted(Counter(y).items()))
[(0, 64), (1, 262), (2, 4674)]
>>> from imblearn.under_sampling import ClusterCentroids
>>> cc = ClusterCentroids(random_state=0)
>>> X_resampled, y_resampled = cc.fit_sample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(0, 64), (1, 64), (2, 64)]
- RandomUnderSampler is a fast way to balance the data by randomly selecting a subset of samples from the over-represented classes.
>>> from imblearn.under_sampling import RandomUnderSampler
>>> rus = RandomUnderSampler(random_state=0)
>>> X_resampled, y_resampled = rus.fit_sample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(0, 64), (1, 64), (2, 64)]
3. Combining over-sampling and under-sampling
>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
... n_redundant=0, n_repeated=0, n_classes=3,
... n_clusters_per_class=1,
... weights=[0.01, 0.05, 0.94],
... class_sep=0.8, random_state=0)
>>> print(sorted(Counter(y).items()))
[(0, 64), (1, 262), (2, 4674)]
>>> from imblearn.combine import SMOTEENN
>>> smote_enn = SMOTEENN(random_state=0)
>>> X_resampled, y_resampled = smote_enn.fit_sample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(0, 4060), (1, 4381), (2, 3502)]
>>> from imblearn.combine import SMOTETomek
>>> smote_tomek = SMOTETomek(random_state=0)
>>> X_resampled, y_resampled = smote_tomek.fit_sample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(0, 4499), (1, 4566), (2, 4413)]
3.1 Overview of the sampling techniques
- Under-sampling techniques
  Tomek links: this algorithm finds pairs of samples whose class membership is ambiguous. The idea is to clean the boundary between the majority and minority classes, making the minority region more distinct.
- Synthesizing new samples
  SMOTE: an over-sampling method that creates new instances of the under-represented classes. In general, it picks one of a minority sample's nearest minority neighbors and interpolates between the two with a random factor.
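The Tomek-link idea above can be sketched directly (toy 1-D data; for real use see imblearn.under_sampling.TomekLinks): a pair of samples from different classes form a Tomek link exactly when each is the other's nearest neighbor.

```python
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j) that form Tomek links.

    (i, j) is a Tomek link when i and j are mutual nearest neighbors
    and belong to different classes.
    """
    X = np.asarray(X, dtype=float)
    # pairwise distance matrix with the diagonal masked out
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)              # nearest neighbor of each sample
    links = []
    for i, j in enumerate(nn):
        if i < j and nn[j] == i and y[i] != y[j]:
            links.append((i, j))
    return links

# Two well-separated clusters plus one majority point inside the minority zone
X = [[0.0], [0.2], [5.0], [5.2], [0.3]]
y = [0, 0, 1, 1, 1]
print(tomek_links(X, y))  # → [(1, 4)]
```

Removing the majority-class member of each link (or both members) cleans the class boundary, which is how Tomek-link under-sampling "purifies" the minority region.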
4. Ensemble sampling
- EasyEnsemble builds several balanced subsets by repeatedly random under-sampling the original dataset.
>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
... n_redundant=0, n_repeated=0, n_classes=3,
... n_clusters_per_class=1,
... weights=[0.01, 0.05, 0.94],
... class_sep=0.8, random_state=0)
>>> print(sorted(Counter(y).items()))
[(0, 64), (1, 262), (2, 4674)]
>>> from imblearn.ensemble import EasyEnsemble
>>> ee = EasyEnsemble(random_state=0, n_subsets=10)
>>> X_resampled, y_resampled = ee.fit_sample(X, y)
>>> print(X_resampled.shape)
(10, 192, 2)
>>> print(sorted(Counter(y_resampled[0]).items()))
[(0, 64), (1, 64), (2, 64)]
- BalanceCascade, instead of selecting at random, preferentially keeps samples that an estimator misclassifies.
>>> from imblearn.ensemble import BalanceCascade
>>> from sklearn.linear_model import LogisticRegression
>>> bc = BalanceCascade(random_state=0,
... estimator=LogisticRegression(random_state=0),
... n_max_subset=4)
>>> X_resampled, y_resampled = bc.fit_sample(X, y)
>>> print(X_resampled.shape)
(4, 192, 2)
>>> print(sorted(Counter(y_resampled[0]).items()))
[(0, 64), (1, 64), (2, 64)]
- sklearn's ensemble methods do not resample the data themselves. BalancedBaggingClassifier fills that gap: it is equivalent to combining EasyEnsemble-style under-sampling with an ensemble method (bagging).
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.metrics import confusion_matrix
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> from imblearn.ensemble import BalancedBaggingClassifier
>>> bbc = BalancedBaggingClassifier(base_estimator=DecisionTreeClassifier(),
...                                 ratio='auto',
...                                 replacement=False,
...                                 random_state=0)
>>> bbc.fit(X_train, y_train)
BalancedBaggingClassifier(...)
>>> y_pred = bbc.predict(X_test)
>>> confusion_matrix(y_test, y_pred)
array([[  12,    0,    0],
       [   0,   55,    4],
       [  68,   53, 1058]])
5. Automatically downloading datasets
>>> from collections import Counter
>>> from imblearn.datasets import fetch_datasets
>>> ecoli = fetch_datasets()['ecoli']
>>> ecoli.data.shape
(336, 7)
>>> print(sorted(Counter(ecoli.target).items()))
[(-1, 301), (1, 35)]
References
- http://contrib.scikit-learn.org/imbalanced-learn/stable/user_guide.html
- https://blog.csdn.net/zhangf666/article/details/53364465