ML-imbalanced-learn

Before we start

  1. Don't evaluate your classifier with accuracy (or the error rate); on imbalanced data, use instead:

     The Area Under the ROC curve (AUC)
     F1 Score
     Cohen's Kappa

  2. Predict probabilities rather than hard class labels.

  3. Tune the decision threshold.

  4. Always train and evaluate on the natural class distribution; sklearn.model_selection.StratifiedKFold preserves it in every fold (a sketch tying these four points together follows).
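
A minimal sketch of the four points above, assuming a LogisticRegression classifier and an illustrative 0.3 threshold (both are choices made for this example, not prescriptions):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score, cohen_kappa_score
from sklearn.model_selection import StratifiedKFold

# A toy binary problem with a 95/5 class imbalance.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

# StratifiedKFold keeps the natural class distribution in every fold.
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    clf = LogisticRegression().fit(X[train_idx], y[train_idx])
    proba = clf.predict_proba(X[test_idx])[:, 1]  # probabilities, not labels
    y_pred = (proba >= 0.3).astype(int)           # hypothetical tuned threshold
    print(roc_auc_score(y[test_idx], proba),      # threshold-free metric
          f1_score(y[test_idx], y_pred),
          cohen_kappa_score(y[test_idx], y_pred))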

1. Over-sampling

  • Over-sampling generates new samples for the under-represented (minority) classes.
    • RandomOverSampler generates new samples by randomly sampling, with replacement, from the currently available samples.
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
...                            n_redundant=0, n_repeated=0, n_classes=3,
...                            n_clusters_per_class=1,
...                            weights=[0.01, 0.05, 0.94],
...                            class_sep=0.8, random_state=0)
>>> from imblearn.over_sampling import RandomOverSampler
>>> ros = RandomOverSampler(random_state=0)
>>> X_resampled, y_resampled = ros.fit_sample(X, y)
>>> from collections import Counter
>>> print(sorted(Counter(y_resampled).items()))
[(0, 4674), (1, 4674), (2, 4674)]
  • From random over-sampling to SMOTE and ADASYN: rather than duplicating existing samples, SMOTE and ADASYN synthesise new minority samples (a sketch of both follows the import below).
from imblearn.over_sampling import SMOTE, ADASYN
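
A hedged sketch of applying both samplers, using the same pre-0.4 imblearn fit_sample API as the rest of this post:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, ADASYN

# Same imbalanced toy data as above.
X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=1, weights=[0.01, 0.05, 0.94],
                           class_sep=0.8, random_state=0)

# SMOTE interpolates between nearby minority samples; by default it raises
# every class to the size of the majority class.
X_sm, y_sm = SMOTE(random_state=0).fit_sample(X, y)
print(sorted(Counter(y_sm).items()))

# ADASYN focuses on samples that are harder to learn, so the resulting class
# counts are only approximately balanced.
X_ada, y_ada = ADASYN(random_state=0).fit_sample(X, y)
print(sorted(Counter(y_ada).items()))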

2. Under-sampling

  • ClusterCentroids under-samples with K-means: it replaces clusters of samples with their K-means centroids, so the retained points are newly generated centroids rather than original samples.
>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
...                            n_redundant=0, n_repeated=0, n_classes=3,
...                            n_clusters_per_class=1,
...                            weights=[0.01, 0.05, 0.94],
...                            class_sep=0.8, random_state=0)
>>> print(sorted(Counter(y).items()))
[(0, 64), (1, 262), (2, 4674)]
>>> from imblearn.under_sampling import ClusterCentroids
>>> cc = ClusterCentroids(random_state=0)
>>> X_resampled, y_resampled = cc.fit_sample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(0, 64), (1, 64), (2, 64)]
  • RandomUnderSampler is a fast and easy way to under-sample by randomly selecting a subset of the data for the targeted classes.
>>> from imblearn.under_sampling import RandomUnderSampler
>>> rus = RandomUnderSampler(random_state=0)
>>> X_resampled, y_resampled = rus.fit_sample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(0, 64), (1, 64), (2, 64)]

3. Combining over-sampling and under-sampling

SMOTE can generate noisy samples, for instance when it interpolates between minority points that lie inside the majority region; cleaning the over-sampled space with Tomek links (SMOTETomek) or Edited Nearest Neighbours (SMOTEENN) mitigates this.

>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
...                            n_redundant=0, n_repeated=0, n_classes=3,
...                            n_clusters_per_class=1,
...                            weights=[0.01, 0.05, 0.94],
...                            class_sep=0.8, random_state=0)
>>> print(sorted(Counter(y).items()))
[(0, 64), (1, 262), (2, 4674)]
>>> from imblearn.combine import SMOTEENN
>>> smote_enn = SMOTEENN(random_state=0)
>>> X_resampled, y_resampled = smote_enn.fit_sample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(0, 4060), (1, 4381), (2, 3502)]
>>> from imblearn.combine import SMOTETomek
>>> smote_tomek = SMOTETomek(random_state=0)
>>> X_resampled, y_resampled = smote_tomek.fit_sample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(0, 4499), (1, 4566), (2, 4413)]

3.1 The sampling techniques in brief

  • Under-sampling technique
    Tomek links: the algorithm looks for samples whose class membership is ambiguous, i.e. pairs of nearest neighbours that belong to opposite classes. Removing them cleans the boundary between the majority and the minority class, making the minority region more distinct.
  • Synthesising new samples
    SMOTE: an over-sampling method that creates new instances of the minority class. It takes a minority sample, picks one of its nearest minority-class neighbours, and generates a new sample by randomly interpolating between the two (see the sketch after this list).
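
A minimal sketch of both ideas (the TomekLinks call uses the same pre-0.4 imblearn fit_sample API as the rest of this post; the interpolation at the end is a hand-written illustration of SMOTE's core step, not library code):

from collections import Counter
import numpy as np
from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks

# Same imbalanced toy data as in the previous sections.
X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=1, weights=[0.01, 0.05, 0.94],
                           class_sep=0.8, random_state=0)

# Drop Tomek links: cross-class nearest-neighbour pairs on the class boundary.
X_clean, y_clean = TomekLinks().fit_sample(X, y)
print(sorted(Counter(y_clean).items()))  # slightly fewer majority samples

# SMOTE's core step by hand: take a minority sample x_i, pick a minority-class
# neighbour x_zi, and interpolate a new point on the segment between them.
rng = np.random.RandomState(0)
minority = X[y == 0]
x_i, x_zi = minority[0], minority[1]            # an illustrative pair
x_new = x_i + rng.uniform(0, 1) * (x_zi - x_i)
print(x_new)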

4. Ensemble sampling

  • EasyEnsemble obtains several new balanced datasets by repeatedly random under-sampling the original dataset.
>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
...                            n_redundant=0, n_repeated=0, n_classes=3,
...                            n_clusters_per_class=1,
...                            weights=[0.01, 0.05, 0.94],
...                            class_sep=0.8, random_state=0)
>>> print(sorted(Counter(y).items()))
[(0, 64), (1, 262), (2, 4674)]
>>> from imblearn.ensemble import EasyEnsemble
>>> ee = EasyEnsemble(random_state=0, n_subsets=10)
>>> X_resampled, y_resampled = ee.fit_sample(X, y)
>>> print(X_resampled.shape)
(10, 192, 2)
>>> print(sorted(Counter(y_resampled[0]).items()))
[(0, 64), (1, 64), (2, 64)]
  • BalanceCascade does not select at random; it preferentially keeps the samples that are misclassified by the supplied estimator.
>>> from imblearn.ensemble import BalanceCascade
>>> from sklearn.linear_model import LogisticRegression
>>> bc = BalanceCascade(random_state=0,
...                     estimator=LogisticRegression(random_state=0),
...                     n_max_subset=4)
>>> X_resampled, y_resampled = bc.fit_sample(X, y)
>>> print(X_resampled.shape)
(4, 192, 2)
>>> print(sorted(Counter(y_resampled[0]).items()))
[(0, 64), (1, 64), (2, 64)]
  • scikit-learn's ensemble methods do not resample internally; BalancedBaggingClassifier fills that gap, effectively combining EasyEnsemble-style under-sampling with bagging.
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.metrics import confusion_matrix
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> from imblearn.ensemble import BalancedBaggingClassifier
>>> bbc = BalancedBaggingClassifier(base_estimator=DecisionTreeClassifier(),
...                                 ratio='auto',
...                                 replacement=False,
...                                 random_state=0)
>>> bbc.fit(X_train, y_train)
BalancedBaggingClassifier(...)
>>> y_pred = bbc.predict(X_test)
>>> confusion_matrix(y_test, y_pred)
array([[  12,    0,    0],
       [   0,   55,    4],
       [  68,   53, 1058]])

5. Fetching datasets automatically

>>> from collections import Counter
>>> from imblearn.datasets import fetch_datasets
>>> ecoli = fetch_datasets()['ecoli']
>>> ecoli.data.shape
(336, 7)
>>> print(sorted(Counter(ecoli.target).items()))
[(-1, 301), (1, 35)]

References

  • http://contrib.scikit-learn.org/imbalanced-learn/stable/user_guide.html
  • https://blog.csdn.net/zhangf666/article/details/53364465
