Preface
- Don't evaluate your classifier with accuracy (or the error rate). Instead, use metrics such as:
  - The Area Under the ROC curve (AUC)
  - F1 Score
  - Cohen's Kappa
- Predict probabilities rather than hard class labels
- Tune the decision threshold
- Always train on the natural class distribution
- Use stratified splits, e.g. sklearn.model_selection.StratifiedKFold (the older sklearn.cross_validation module has been removed)
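The "predict probabilities, then tune the threshold" advice can be sketched in plain Python (toy data and helper names below are hypothetical, just for illustration): sweep candidate thresholds over the predicted positive-class probabilities and keep the one that maximizes F1, instead of the default 0.5 cut-off.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 score of the positive class when predicting 1 iff prob >= threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(y_true, y_prob, grid=None):
    """Return the threshold in `grid` with the highest F1 score."""
    grid = grid or [i / 100 for i in range(1, 100)]
    return max(grid, key=lambda t: f1_at_threshold(y_true, y_prob, t))

# Toy imbalanced example: 8 negatives, 2 positives with modest scores.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.05, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.35, 0.4, 0.45]
print(best_threshold(y_true, y_prob))  # a threshold well below 0.5 wins here
```

With a well-calibrated but imbalanced model the positive class often never reaches probability 0.5, so the default cut-off predicts all-negative; tuning the threshold recovers the positives.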
1. Over-sampling
- Over-sampling automatically generates new samples for the minority classes.
- RandomOverSampler generates new samples by randomly re-sampling, with replacement, the currently available samples.
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
... n_redundant=0, n_repeated=0, n_classes=3,
... n_clusters_per_class=1,
... weights=[0.01, 0.05, 0.94],
... class_sep=0.8, random_state=0)
>>> from imblearn.over_sampling import RandomOverSampler
>>> ros = RandomOverSampler(random_state=0)
>>> X_resampled, y_resampled = ros.fit_sample(X, y)
>>> from collections import Counter
>>> print(sorted(Counter(y_resampled).items()))
[(0, 4674), (1, 4674), (2, 4674)]
- From random over-sampling to SMOTE and ADASYN
from imblearn.over_sampling import SMOTE, ADASYN
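Unlike random over-sampling, SMOTE does not duplicate points: it interpolates between a minority sample and one of its minority nearest neighbors. The core step can be sketched in a few lines of numpy (hypothetical minority data; for real use go through imblearn's SMOTE/ADASYN classes):

```python
import numpy as np

def smote_sample(X_min, k=2, rng=None):
    """Generate one SMOTE-style synthetic point from minority samples X_min.

    Picks a random minority sample x_i, one of its k nearest minority
    neighbors x_nn, and returns x_i + lam * (x_nn - x_i), lam in [0, 1).
    """
    rng = rng or np.random.default_rng(0)
    i = rng.integers(len(X_min))
    # distances from x_i to every other minority sample
    d = np.linalg.norm(X_min - X_min[i], axis=1)
    d[i] = np.inf                      # exclude x_i itself
    nn = np.argsort(d)[:k]             # indices of the k nearest neighbors
    j = rng.choice(nn)
    lam = rng.random()                 # random interpolation factor
    return X_min[i] + lam * (X_min[j] - X_min[i])

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(smote_sample(X_min))  # lies on a segment between two minority samples
```

ADASYN follows the same interpolation idea but draws more synthetic points near minority samples that are hard to learn (those surrounded by majority neighbors).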
2. Under-sampling
- ClusterCentroids under-samples with K-means: each over-represented class is reduced to the centroids of K-means clusters, so the retained samples are newly generated centroids rather than original points.
>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
... n_redundant=0, n_repeated=0, n_classes=3,
... n_clusters_per_class=1,
... weights=[0.01, 0.05, 0.94],
... class_sep=0.8, random_state=0)
>>> print(sorted(Counter(y).items()))
[(0, 64), (1, 262), (2, 4674)]
>>> from imblearn.under_sampling import ClusterCentroids
>>> cc = ClusterCentroids(random_state=0)
>>> X_resampled, y_resampled = cc.fit_sample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(0, 64), (1, 64), (2, 64)]
- RandomUnderSampler is a fast way to balance the data by randomly selecting a subset of samples from the over-represented classes.
>>> from imblearn.under_sampling import RandomUnderSampler
>>> rus = RandomUnderSampler(random_state=0)
>>> X_resampled, y_resampled = rus.fit_sample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(0, 64), (1, 64), (2, 64)]
3. Combining over-sampling and under-sampling
>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
... n_redundant=0, n_repeated=0, n_classes=3,
... n_clusters_per_class=1,
... weights=[0.01, 0.05, 0.94],
... class_sep=0.8, random_state=0)
>>> print(sorted(Counter(y).items()))
[(0, 64), (1, 262), (2, 4674)]
>>> from imblearn.combine import SMOTEENN
>>> smote_enn = SMOTEENN(random_state=0)
>>> X_resampled, y_resampled = smote_enn.fit_sample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(0, 4060), (1, 4381), (2, 3502)]
>>> from imblearn.combine import SMOTETomek
>>> smote_tomek = SMOTETomek(random_state=0)
>>> X_resampled, y_resampled = smote_tomek.fit_sample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(0, 4499), (1, 4566), (2, 4413)]
3.1 Overview of the sampling techniques
- Under-sampling techniques
  Tomek links: this algorithm finds pairs of samples whose class membership is ambiguous. The idea is to clean the boundary between the majority and minority classes, making the minority region more distinct.
- Synthesizing new samples
  SMOTE: an over-sampling method that creates new instances of the under-represented classes. In general, it picks one of a minority sample's nearest minority neighbors and interpolates between the two with a random factor.
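The Tomek-link idea above can be sketched directly (toy 1-D data; for real use see imblearn.under_sampling.TomekLinks): a pair of samples from different classes form a Tomek link exactly when each is the other's nearest neighbor.

```python
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j) that form Tomek links.

    (i, j) is a Tomek link when i and j are mutual nearest neighbors
    and belong to different classes.
    """
    X = np.asarray(X, dtype=float)
    # pairwise distance matrix with the diagonal masked out
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)              # nearest neighbor of each sample
    links = []
    for i, j in enumerate(nn):
        if i < j and nn[j] == i and y[i] != y[j]:
            links.append((i, j))
    return links

# Two well-separated clusters plus one majority point inside the minority zone
X = [[0.0], [0.2], [5.0], [5.2], [0.3]]
y = [0, 0, 1, 1, 1]
print(tomek_links(X, y))  # → [(1, 4)]
```

Removing the majority-class member of each link (or both members) cleans the class boundary, which is how Tomek-link under-sampling "purifies" the minority region.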
4. Ensemble sampling
- EasyEnsemble builds several balanced subsets by repeatedly random under-sampling the original dataset.
>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
... n_redundant=0, n_repeated=0, n_classes=3,
... n_clusters_per_class=1,
... weights=[0.01, 0.05, 0.94],
... class_sep=0.8, random_state=0)
>>> print(sorted(Counter(y).items()))
[(0, 64), (1, 262), (2, 4674)]
>>> from imblearn.ensemble import EasyEnsemble
>>> ee = EasyEnsemble(random_state=0, n_subsets=10)
>>> X_resampled, y_resampled = ee.fit_sample(X, y)
>>> print(X_resampled.shape)
(10, 192, 2)
>>> print(sorted(Counter(y_resampled[0]).items()))
[(0, 64), (1, 64), (2, 64)]
- BalanceCascade, instead of selecting at random, preferentially keeps samples that an estimator misclassifies.
>>> from imblearn.ensemble import BalanceCascade
>>> from sklearn.linear_model import LogisticRegression
>>> bc = BalanceCascade(random_state=0,
... estimator=LogisticRegression(random_state=0),
... n_max_subset=4)
>>> X_resampled, y_resampled = bc.fit_sample(X, y)
>>> print(X_resampled.shape)
(4, 192, 2)
>>> print(sorted(Counter(y_resampled[0]).items()))
[(0, 64), (1, 64), (2, 64)]
- sklearn's ensemble methods do not resample the data themselves. BalancedBaggingClassifier fills that gap: it is equivalent to combining EasyEnsemble-style under-sampling with an ensemble method (bagging).
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.metrics import confusion_matrix
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> from imblearn.ensemble import BalancedBaggingClassifier
>>> bbc = BalancedBaggingClassifier(base_estimator=DecisionTreeClassifier(),
...                                 ratio='auto',
...                                 replacement=False,
...                                 random_state=0)
>>> bbc.fit(X_train, y_train)
BalancedBaggingClassifier(...)
>>> y_pred = bbc.predict(X_test)
>>> confusion_matrix(y_test, y_pred)
array([[  12,    0,    0],
       [   0,   55,    4],
       [  68,   53, 1058]])
5. Automatically downloading datasets
>>> from collections import Counter
>>> from imblearn.datasets import fetch_datasets
>>> ecoli = fetch_datasets()['ecoli']
>>> ecoli.data.shape
(336, 7)
>>> print(sorted(Counter(ecoli.target).items()))
[(-1, 301), (1, 35)]
References
- http://contrib.scikit-learn.org/imbalanced-learn/stable/user_guide.html
- https://blog.csdn.net/zhangf666/article/details/53364465