weixin_34409741

kaggle 欺诈信用卡预测——不平衡训练样本的处理方法综合结论就是：随机森林+过采样（直接复制或者smote后，黑白比例1:3 or 1:1）效果比较好！记得在smote前一定要先做标准化！！！...

先看数据：

特征如下：

Time

Number of seconds elapsed between each transaction (over two days)

numeric

No description provided

numeric

No description provided

numeric

No description provided

numeric

No description provided

numeric

No description provided

numeric

No description provided

numeric

No description provided

numeric

No description provided

numeric

No description provided

numeric

V10

No description provided

numeric

V11

No description provided

numeric

V12

No description provided

numeric

V13

No description provided

numeric

V14

No description provided

numeric

V15

No description provided

numeric

V16

No description provided

numeric

V17

No description provided

numeric

V18

No description provided

numeric

V19

No description provided

numeric

V20

No description provided

numeric

V21

No description provided

numeric

V22

No description provided

numeric

V23

No description provided

numeric

V24

No description provided

numeric

V25

No description provided

numeric

V26

No description provided

numeric

V27

No description provided

numeric

V28

abc

numeric

Amount

Amount of money for this transaction

numeric

Class

Fraud or Not-Fraud

boolean

只有Amount没有做标准化处理（mean不为0！！！）：见：https://www.kaggle.com/mlg-ulb/creditcardfraud/data

Introduction

from：https://www.kaggle.com/nikitaivanov/getting-high-sensitivity-for-imbalanced-data 主要使用了smote和聚类两种思路！

In this notebook we will try to predict fraud transactions from a given data set. Given that the data is imbalanced, standard metrics for evaluating classification algorithm (such as accuracy) are invalid. We will focus on the following metrics: Sensitivity (true positive rate) and Specificity (true negative rate). Of course, they are dependent on each other, so we want to find optimal trade-off between them. Such trade-off usually depends on the application of the algorithm, and in case of fraud detection I would prefer to see high sensitivity (e.g. given that a transaction is fraud, I want to be able to detect it with high probability).

For dealing with skewed data I am going to use SMOTE algorithm. In two words, the idea is to create synthetic samples (in opposite to oversampling with replacement) through finding nearest examples (KNN), calculating difference between them, multiplying this difference by a random number between 0 and 1 and adding the result to the initial sample. For this purpose we are going to use SMOTE function from DMwR package.

Algorithms I am going to implement are Support Vector Machine (SVM), Logistic regression and Random Forest. Models will be trained on the original and SMOTEd data and their performance will be measured on the entire data set.

As a bonus, we are going to have some fun and use K-means centroids of the negative examples together with the original positive examples as a new dataset and train our algorithm on it. We then compare results.

##Loading required packeges 
library(ggplot2) #visualization  
library(caret) #train model library(dplyr) #data manipulation library(kernlab) #svm library(nnet) #models (logit, neural nets) library(DMwR) #SMOTE data ##Load data d = read.csv("../input/creditcard.csv") n = ncol(d) str(d) d$Class = ifelse(d$Class == 0, 'No', 'Yes') %>% as.factor()

Loading required package: lattice

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Attaching package: ‘kernlab’

The following object is masked from ‘package:ggplot2’:

    alpha

Loading required package: grid

'data.frame':	284807 obs. of  31 variables:
 $ Time  : num  0 0 1 1 2 2 4 7 7 9 ...
 $ V1    : num  -1.36 1.192 -1.358 -0.966 -1.158 ...
 $ V2    : num  -0.0728 0.2662 -1.3402 -0.1852 0.8777 ...
 $ V3    : num  2.536 0.166 1.773 1.793 1.549 ...
 $ V4    : num  1.378 0.448 0.38 -0.863 0.403 ...
 $ V5    : num  -0.3383 0.06 -0.5032 -0.0103 -0.4072 ...
 $ V6    : num  0.4624 -0.0824 1.8005 1.2472 0.0959 ...
 $ V7    : num  0.2396 -0.0788 0.7915 0.2376 0.5929 ...
 $ V8    : num  0.0987 0.0851 0.2477 0.3774 -0.2705 ...
 $ V9    : num  0.364 -0.255 -1.515 -1.387 0.818 ...
 $ V10   : num  0.0908 -0.167 0.2076 -0.055 0.7531 ...
 $ V11   : num  -0.552 1.613 0.625 -0.226 -0.823 ...
 $ V12   : num  -0.6178 1.0652 0.0661 0.1782 0.5382 ...
 $ V13   : num  -0.991 0.489 0.717 0.508 1.346 ...
 $ V14   : num  -0.311 -0.144 -0.166 -0.288 -1.12 ...
 $ V15   : num  1.468 0.636 2.346 -0.631 0.175 ...
 $ V16   : num  -0.47 0.464 -2.89 -1.06 -0.451 ...
 $ V17   : num  0.208 -0.115 1.11 -0.684 -0.237 ...
 $ V18   : num  0.0258 -0.1834 -0.1214 1.9658 -0.0382 ...
 $ V19   : num  0.404 -0.146 -2.262 -1.233 0.803 ...
 $ V20   : num  0.2514 -0.0691 0.525 -0.208 0.4085 ...
 $ V21   : num  -0.01831 -0.22578 0.248 -0.1083 -0.00943 ...
 $ V22   : num  0.27784 -0.63867 0.77168 0.00527 0.79828 ...
 $ V23   : num  -0.11 0.101 0.909 -0.19 -0.137 ...
 $ V24   : num  0.0669 -0.3398 -0.6893 -1.1756 0.1413 ...
 $ V25   : num  0.129 0.167 -0.328 0.647 -0.206 ...
 $ V26   : num  -0.189 0.126 -0.139 -0.222 0.502 ...
 $ V27   : num  0.13356 -0.00898 -0.05535 0.06272 0.21942 ...
 $ V28   : num  -0.0211 0.0147 -0.0598 0.0615 0.2152 ...
 $ Amount: num  149.62 2.69 378.66 123.5 69.99 ...
 $ Class : int  0 0 0 0 0 0 0 0 0 0 ...

It is always a good idea first to plot a response variable to check for skewness in data:

qplot(x = d$Class, geom = 'bar') + xlab('Fraud (Yes/No)') + ylab('Number of transactions')

Classification on the original data

Keeping in mind that the data is highly skewed we proceed. First split the data into training and test sets.

idx = createDataPartition(d$Class, p = 0.7, list = F) d[, -n] = scale(d[, -n]) #perform scaling train = d[idx, ] test = d[-idx, ]

Calculate baseline accuracy for future reference

blacc = nrow(d[d$Class == 'No', ])/nrow(d)*100 cat('Baseline accuracy:', blacc)

Baseline accuracy: 99.82725

To begin with, let's train our models on the original dataset to see what we get if use unbalanced data. Due to computational limitations of my laptop, I will only run logistic regression for this purpose.

m1 = multinom(data = train, Class ~ .) p1 = predict(m1, test[, -n], type = 'class') cat(' Accuracy of the model', mean(p1 == test[, n])*100, '\n', 'Baseline accuracy', blacc)

# weights:  32 (31 variable)
initial  value 138189.980799 
final  value 31315.159746 
converged
 Accuracy of the model 99.92744 
 Baseline accuracy 99.82725

Though accuracy (99.92%) of the model might look impressive at a first glance, in fact it isn't. Simply predicting 'not a fraud' for all transactions will give 99.83% accuracy. To really evaluate model's perfomance we need to check confusion matrix.

confusionMatrix(p1, test[, n], positive = 'Yes')

Confusion Matrix and Statistics

          Reference
Prediction    No   Yes
       No  85287    55
       Yes     7    92
                                          
               Accuracy : 0.9993          
                 95% CI : (0.9991, 0.9994)
    No Information Rate : 0.9983          
    P-Value [Acc > NIR] : 1.779e-15       
                                          
                  Kappa : 0.7476          
 Mcnemar's Test P-Value : 2.387e-09       
                                          
            Sensitivity : 0.625850        
            Specificity : 0.999918        
         Pos Pred Value : 0.929293        
         Neg Pred Value : 0.999356        
             Prevalence : 0.001720        
         Detection Rate : 0.001077        
   Detection Prevalence : 0.001159        
      Balanced Accuracy : 0.812884        
                                          
       'Positive' Class : Yes

From the confusion matrix we see that while model has high accuracy (99.92%) and high specificity (99.98%), it has low sensitivity of 64%. In other words, only 64% of all fraudulent transactions were detected.

Classification on the SMOTEd data

Now let's preprocess our data using SMOTE algorithm:

table(d$Class) #check initial distribution
newData <- SMOTE(Class ~ ., d, perc.over = 500,perc.under=100) table(newData$Class) #check SMOTed distribution

    No    Yes 
284315    492

  No  Yes 
2460 2952

To train SVM (with RBF kernel) we are going to use train function from caret package. It allows to choose optimal parameters of the model (cost and sigma in this case). Cost refers to penalty for misclassifying examples and sigma is a parameter of RBF which measures similarity between examples. To choose best model we use 5-fold cross-validation. We then evaluate our model on the entire data set.

gr = expand.grid(C = c(1, 50, 150), sigma = c(0.01, 0.05, 1)) tr = trainControl(method = 'cv', number = 5) m2 = train(data = newData, Class ~ ., method = 'svmRadial', trControl = tr, tuneGrid = gr) m2

Support Vector Machines with Radial Basis Function Kernel 

5412 samples
  30 predictor
   2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 4330, 4329, 4329, 4330, 4330 
Resampling results across tuning parameters:

  C    sigma  Accuracy   Kappa    
    1  0.01   0.9445668  0.8891865
    1  0.05   0.9626774  0.9250408
    1  1.00   0.9672934  0.9344234
   50  0.01   0.9717300  0.9430408
   50  0.05   0.9863262  0.9723782
   50  1.00   0.9695108  0.9388440
  150  0.01   0.9789351  0.9574955
  150  0.05   0.9850335  0.9697552
  150  1.00   0.9695108  0.9388440

Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were sigma = 0.05 and C = 50.

As wee see, best tuning parameters are C = 50 and sigma = 0.05

Let's look at a confusion matrix

p2 = predict(m2, d[, -n]) confusionMatrix(p2, d[, n], positive = 'Yes')

Confusion Matrix and Statistics

          Reference
Prediction     No    Yes
       No  278470      2
       Yes   5845    490
                                        
               Accuracy : 0.9795        
                 95% CI : (0.9789, 0.98)
    No Information Rate : 0.9983        
    P-Value [Acc > NIR] : 1             
                                        
                  Kappa : 0.1408        
 Mcnemar's Test P-Value : <2e-16        
                                        
            Sensitivity : 0.995935      
            Specificity : 0.979442      
         Pos Pred Value : 0.077348      
         Neg Pred Value : 0.999993      
             Prevalence : 0.001727      
         Detection Rate : 0.001720      
   Detection Prevalence : 0.022243      
      Balanced Accuracy : 0.987688      
                                        
       'Positive' Class : Yes

(Numbers may differ due to randomness of k-fold cv)

As expected we were able to achieve sensitivity of 99.59%. In other words, out of all fraudulent transactions we correctly detected 99.59% of them. This came in price of slightly lower accuracy (in comparison to the first model) - 97.95% vs. 99.92% and lower specificity 97.94% vs. 99.98%. The main disadvantage is low level of positive predicted value (i.e. given that prediction is positive, what is probability that the true state is positive) which this case is 7.74% vs. 85% for initial (unbalanced dataset) model. As was mentioned in the beginning, one should choose a model that matches certain goals. If the goal is to correctly identify fraudulent transactions even in price of low positive predicted value (which I believe the case), then the latter model (based on SMOTed data) should be used. Looking at confusion matrix we see that almost all fraudulent transactions were correctly identified and only 2.5% were mislabeled as fraudulent.

I'm planning to try couple more models and also use more sophisticated algorithm that uses K-means centroids of the majority class as samples for non fraudulent transactions.

m3 = randomForest(data = newData, Class ~ .) p3 = predict(m3, d[, -n]) confusionMatrix(p3, d[, n], positive = 'Yes')

Error in eval(expr, envir, enclos): could not find function "randomForest"
Traceback:

library(randomForest)
m3 = randomForest(data = newData, Class ~ .) p3 = predict(m3, d[, -n]) confusionMatrix(p3, d[, n], positive = 'Yes')

randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:dplyr’:

    combine

The following object is masked from ‘package:ggplot2’:

    margin

Confusion Matrix and Statistics

          Reference
Prediction     No    Yes
       No  282105      0
       Yes   2210    492
                                          
               Accuracy : 0.9922          
                 95% CI : (0.9919, 0.9926)
    No Information Rate : 0.9983          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.306           
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 1.000000        
            Specificity : 0.992227        
         Pos Pred Value : 0.182087        
         Neg Pred Value : 1.000000        
             Prevalence : 0.001727        
         Detection Rate : 0.001727        
   Detection Prevalence : 0.009487        
      Balanced Accuracy : 0.996113        
                                          
       'Positive' Class : Yes

Random forest performs really well. Sensitivity 100% and high specificity (more than 99%). All fraudulent transactions were detected and less than 1% of all transactions were falsely classified as fraud. Hence, Random Forest + SMOTE algorithm shloud be considered as final model.

K-means centroids as a new sample

For curiosity, let's take another approach in dealing with imbalanced data. We are going to separate the examples for positive and negative and from the latter one extract centroids (generated using K-means clustering). Number of clusters will be equal to the number of positive examples. We then use these centroids together with positive examples as a new sample.（思路就是聚类，将major class聚类为k个点，其中k为欺诈信用卡的样本数!）

neg = d[d$Class == 'No', ] #negative examples pos = d[d$Class == 'Yes', ] #positive examples n_pos = sum(d$Class == 'Yes') #calculate number of positive examples clus = kmeans(neg[, -n], centers = n_pos, iter.max = 100) #perform K-means neg = as.data.frame(clus$centers) #extract centroids as new sample neg$Class = 'No' newData = rbind(neg, pos) #merge positive and negative examples newData$Class = factor(newData$Class)

We run random forest on the new dataset, newData, and check confusion matrix.

m4 = randomForest(data = newData, Class ~ .) p4 = predict(m4, d[, -n]) confusionMatrix(p4, d[, n], positive = 'Yes')

Confusion Matrix and Statistics

          Reference
Prediction     No    Yes
       No  210086      0
       Yes  74229    492
                                         
               Accuracy : 0.7394         
                 95% CI : (0.7378, 0.741)
    No Information Rate : 0.9983         
    P-Value [Acc > NIR] : 1              
                                         
                  Kappa : 0.0097         
 Mcnemar's Test P-Value : <2e-16         
                                         
            Sensitivity : 1.000000       
            Specificity : 0.738920       
         Pos Pred Value : 0.006584       
         Neg Pred Value : 1.000000       
             Prevalence : 0.001727       
         Detection Rate : 0.001727       
   Detection Prevalence : 0.262357       
      Balanced Accuracy : 0.869460       
                                         
       'Positive' Class : Yes

Well, while sensitivity is still 100%, specificity dropped to 72% leading to a big fraction of false positive predictions. Learning on the data that was transformed using SMOTE algorithm gave much better results.

from：https://www.kaggle.com/themlguy/undersample-and-oversample-approach-explored

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv) # Input data files are available in the "../input/" directory. # For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory import os print(os.listdir("../input")) # Any results you write to the current directory are saved as output.

['creditcard.csv']

import matplotlib.pyplot as plt
from sklearn.cross_validation import train_test_split from sklearn.linear_model import LogisticRegression from sklearn.model_selection import GridSearchCV from sklearn.svm import SVC from sklearn.ensemble import RandomForestClassifier from sklearn.preprocessing import StandardScaler import seaborn as sns from sklearn.metrics import confusion_matrix,recall_score,precision_recall_curve,auc,roc_curve,roc_auc_score,classification_report

/opt/conda/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

creditcard_data=pd.read_csv("../input/creditcard.csv")

creditcard_data['Amount']=StandardScaler().fit_transform(creditcard_data['Amount'].values.reshape(-1, 1)) creditcard_data.drop(['Time'], axis=1, inplace=True)

def generatePerformanceReport(clf,X_train,y_train,X_test,y_test,bool_): if bool_==True: clf.fit(X_train,y_train.values.ravel()) pred=clf.predict(X_test) cnf_matrix=confusion_matrix(y_test,pred) tn, fp, fn, tp=cnf_matrix.ravel() print('---------------------------------') print('Length of training data:',len(X_train)) print('Length of test data:', len(X_test)) print('---------------------------------') print('True positives:',tp) print('True negatives:',tn) print('False positives:',fp) print('False negatives:',fn) #sns.heatmap(cnf_matrix,cmap="coolwarm_r",annot=True,linewidths=0.5) print('----------------------Classification report--------------------------') print(classification_report(y_test,pred))

#generate 50%, 66%, 75% proportions of normal indices to be combined with fraud indices 也就是说采样后的黑白样本比例是：0.5,0.66,0.75
#undersampled data
normal_indices=creditcard_data[creditcard_data['Class']==0].index fraud_indices=creditcard_data[creditcard_data['Class']==1].index for i in range(1,4): normal_sampled_data=np.array(np.random.choice(normal_indices, i*len(fraud_indices),replace=False)) #a random sample is generated from normal_indices 主要是随机欠采样 undersampled_data=np.concatenate([fraud_indices, normal_sampled_data]) undersampled_data=creditcard_data.iloc[undersampled_data] print('length of undersampled data ', len(undersampled_data)) print('% of fraud transactions in undersampled data ',len(undersampled_data.loc[undersampled_data['Class']==1])/len(undersampled_data)) #get feature and label data feature_data=undersampled_data.loc[:,undersampled_data.columns!='Class'] label_data=undersampled_data.loc[:,undersampled_data.columns=='Class'] X_train, X_test, y_train, y_test=train_test_split(feature_data,label_data,test_size=0.30) for j in [LogisticRegression(),SVC(),RandomForestClassifier(n_estimators=100)]: clf=j print(j) generatePerformanceReport(clf,X_train,y_train,X_test,y_test,True) #the above code classifies X_test which is part of undersampled data #now, let us consider the remaining rows of dataset and use that as test set remaining_indices=[i for i in creditcard_data.index if i not in undersampled_data.index] testdf=creditcard_data.iloc[remaining_indices] testdf_label=creditcard_data.loc[:,testdf.columns=='Class'] testdf_feature=creditcard_data.loc[:,testdf.columns!='Class'] generatePerformanceReport(clf,X_train,y_train,testdf_feature,testdf_label,False)

length of undersampled data  984
% of fraud transactions in undersampled data  0.5
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
---------------------------------
Length of training data: 688
Length of test data: 296
---------------------------------
True positives: 144
True negatives: 134
False positives: 11
False negatives: 7
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       0.95      0.92      0.94       145
          1       0.93      0.95      0.94       151

avg / total       0.94      0.94      0.94       296

---------------------------------
Length of training data: 688
Length of test data: 284807
---------------------------------
True positives: 461
True negatives: 270879
False positives: 13436
False negatives: 31
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       1.00      0.95      0.98    284315
          1       0.03      0.94      0.06       492 #可以看到LR在测试数据集上表现并不好

avg / total       1.00      0.95      0.97    284807

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
---------------------------------
Length of training data: 688
Length of test data: 296
---------------------------------
True positives: 144
True negatives: 140
False positives: 5
False negatives: 7
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       0.95      0.97      0.96       145
          1       0.97      0.95      0.96       151

avg / total       0.96      0.96      0.96       296

---------------------------------
Length of training data: 688
Length of test data: 284807
---------------------------------
True positives: 463
True negatives: 267084
False positives: 17231
False negatives: 29
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       1.00      0.94      0.97    284315
          1       0.03      0.94      0.05       492 #看来svm在测试数据集上也不行啊

avg / total       1.00      0.94      0.97    284807

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
---------------------------------
Length of training data: 688
Length of test data: 296
---------------------------------
True positives: 144
True negatives: 142
False positives: 3
False negatives: 7
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       0.95      0.98      0.97       145
          1       0.98      0.95      0.97       151

avg / total       0.97      0.97      0.97       296

---------------------------------
Length of training data: 688
Length of test data: 284807
---------------------------------
True positives: 485
True negatives: 275060
False positives: 9255
False negatives: 7
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       1.00      0.97      0.98    284315
          1       0.05      0.99      0.09       492 #Rf也不行？？？？

avg / total       1.00      0.97      0.98    284807

length of undersampled data  1476
% of fraud transactions in undersampled data  0.3333333333333333
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
---------------------------------
Length of training data: 1033
Length of test data: 443
---------------------------------
True positives: 130
True negatives: 291
False positives: 5
False negatives: 17
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       0.94      0.98      0.96       296
          1       0.96      0.88      0.92       147

avg / total       0.95      0.95      0.95       443

---------------------------------
Length of training data: 1033
Length of test data: 284807
---------------------------------
True positives: 442
True negatives: 278887
False positives: 5428
False negatives: 50
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       1.00      0.98      0.99    284315
          1       0.08      0.90      0.14       492

avg / total       1.00      0.98      0.99    284807

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
---------------------------------
Length of training data: 1033
Length of test data: 443
---------------------------------
True positives: 133
True negatives: 286
False positives: 10
False negatives: 14
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       0.95      0.97      0.96       296
          1       0.93      0.90      0.92       147

avg / total       0.95      0.95      0.95       443

---------------------------------
Length of training data: 1033
Length of test data: 284807
---------------------------------
True positives: 453
True negatives: 274909
False positives: 9406
False negatives: 39
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       1.00      0.97      0.98    284315
          1       0.05      0.92      0.09       492

avg / total       1.00      0.97      0.98    284807

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
---------------------------------
Length of training data: 1033
Length of test data: 443
---------------------------------
True positives: 128
True negatives: 293
False positives: 3
False negatives: 19
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       0.94      0.99      0.96       296
          1       0.98      0.87      0.92       147

avg / total       0.95      0.95      0.95       443

---------------------------------
Length of training data: 1033
Length of test data: 284807
---------------------------------
True positives: 473
True negatives: 281560
False positives: 2755
False negatives: 19
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       1.00      0.99      1.00    284315
          1       0.15      0.96      0.25       492

avg / total       1.00      0.99      0.99    284807

length of undersampled data  1968
% of fraud transactions in undersampled data  0.25
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
---------------------------------
Length of training data: 1377
Length of test data: 591
---------------------------------
True positives: 116
True negatives: 451
False positives: 5
False negatives: 19
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       0.96      0.99      0.97       456
          1       0.96      0.86      0.91       135

avg / total       0.96      0.96      0.96       591

---------------------------------
Length of training data: 1377
Length of test data: 284807
---------------------------------
True positives: 433
True negatives: 282245
False positives: 2070
False negatives: 59
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       1.00      0.99      1.00    284315
          1       0.17      0.88      0.29       492

avg / total       1.00      0.99      1.00    284807

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
---------------------------------
Length of training data: 1377
Length of test data: 591
---------------------------------
True positives: 118
True negatives: 447
False positives: 9
False negatives: 17
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       0.96      0.98      0.97       456
          1       0.93      0.87      0.90       135

avg / total       0.96      0.96      0.96       591

---------------------------------
Length of training data: 1377
Length of test data: 284807
---------------------------------
True positives: 445
True negatives: 279369
False positives: 4946
False negatives: 47
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       1.00      0.98      0.99    284315
          1       0.08      0.90      0.15       492

avg / total       1.00      0.98      0.99    284807

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
---------------------------------
Length of training data: 1377
Length of test data: 591
---------------------------------
True positives: 112
True negatives: 455
False positives: 1
False negatives: 23
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       0.95      1.00      0.97       456
          1       0.99      0.83      0.90       135

avg / total       0.96      0.96      0.96       591

---------------------------------
Length of training data: 1377
Length of test data: 284807
---------------------------------
True positives: 469
True negatives: 283466
False positives: 849
False negatives: 23
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       1.00      1.00      1.00    284315
          1       0.36      0.95      0.52       492

avg / total       1.00      1.00      1.00    284807

整体来看，因为欠采样只是用了一个模型，因此预测效果很差！！！因为没有用到全量数据特征，所以在全部数据集上表现并不好！

#oversampled_data data
normal_sampled_indices=creditcard_data.loc[creditcard_data['Class']==0].index oversampled_data=creditcard_data.iloc[normal_sampled_indices] fraud_data=creditcard_data.loc[creditcard_data['Class']==1] oversampled_data=oversampled_data.append([fraud_data]*300, ignore_index=True) #此处过采样处理是直接将欺诈样本复制300份！！！ print('length of oversampled_data data ', len(oversampled_data)) print('% of fraud transactions in oversampled_data data ',len(oversampled_data.loc[oversampled_data['Class']==1])/len(oversampled_data)) #get feature and label data feature_data=oversampled_data.loc[:,oversampled_data.columns!='Class'] label_data=oversampled_data.loc[:,oversampled_data.columns=='Class'] X_train, X_test, y_train, y_test=train_test_split(feature_data,label_data,test_size=0.30) for j in [LogisticRegression(),RandomForestClassifier(n_estimators=100)]: clf=j print(j) generatePerformanceReport(clf,X_train,y_train,X_test,y_test,True) #the above code classifies X_test which is part of undersampled data #now, let us consider the remaining rows of dataset and use that as test set remaining_indices=[i for i in creditcard_data.index if i not in oversampled_data.index] testdf=creditcard_data.iloc[remaining_indices] testdf_label=creditcard_data.loc[:,testdf.columns=='Class'] testdf_feature=creditcard_data.loc[:,testdf.columns!='Class'] generatePerformanceReport(clf,X_train,y_train,testdf_feature,testdf_label,False)

length of oversampled_data data  431915
% of fraud transactions in oversampled_data data  0.3417339059768704 最后复制后的欺诈样本比例为白样本的33%
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
---------------------------------
Length of training data: 302340
Length of test data: 129575
---------------------------------
True positives: 39803
True negatives: 84311
False positives: 1027
False negatives: 4434
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       0.95      0.99      0.97     85338
          1       0.97      0.90      0.94     44237

avg / total       0.96      0.96      0.96    129575

---------------------------------
Length of training data: 302340
Length of test data: 284807
---------------------------------
True positives: 444
True negatives: 281055
False positives: 3260
False negatives: 48
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       1.00      0.99      0.99    284315
          1       0.12      0.90      0.21       492 #效果也不咋的啊！

avg / total       1.00      0.99      0.99    284807

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
---------------------------------
Length of training data: 302340
Length of test data: 129575
---------------------------------
True positives: 44237
True negatives: 85327
False positives: 11
False negatives: 0
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     85338
          1       1.00      1.00      1.00     44237

avg / total       1.00      1.00      1.00    129575

---------------------------------
Length of training data: 302340
Length of test data: 284807
---------------------------------
True positives: 492
True negatives: 284304
False positives: 11
False negatives: 0
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       1.00      1.00      1.00    284315
          1       0.98      1.00      0.99       492 #随机森林还是不错的！！！

avg / total       1.00      1.00      1.00    284807

Random forest classifier with oversampled approach performs better compared to undersampled approach！！！

from：https://www.kaggle.com/gargmanish/how-to-handle-imbalance-data-study-in-detail

Hi all as we know credit card fraud detection will have a imbalanced data i.e having more number of normal class than the number of fraud class

In this I will use Basic method of handling imbalance data which are

This all I have done by using Analytics Vidya's blog please find the link Analytics Vidya

Undersampling:- it means taking the less number of majority class (In our case taking less number of Normal transactions so that our new data will be balanced

Oversampling: it means using replicating the data of minority class (fraud class) so that we can have a balanced data

SMOTE: it is also a type of oversampling but in this we will make the synthetic example of Minority data and will give as a balanced data

First I will start with the Undersampling and will try to classify using these Models

Decision Tree Classifier/ Random Forest Classifier
Logistic regression
SVM
XGboost

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv) # Input data files are available in the "../input/" directory. # For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory from subprocess import check_output print(check_output(["ls", "../input"]).decode("utf8")) # Any results you write to the current directory are saved as output.

creditcard.csv

Lets start with Importing Libraries and data

import pandas as pd # to import csv and for data manipulation
import matplotlib.pyplot as plt # to plot graph import seaborn as sns # for intractve graphs import numpy as np # for linear algebra import datetime # to dela with date and time %matplotlib inline from sklearn.preprocessing import StandardScaler # for preprocessing the data from sklearn.ensemble import RandomForestClassifier # Random forest classifier from sklearn.tree import DecisionTreeClassifier # for Decision Tree classifier from sklearn.svm import SVC # for SVM classification from sklearn.linear_model import LogisticRegression from sklearn.cross_validation import train_test_split # to split the data from sklearn.cross_validation import KFold # For cross vbalidation from sklearn.model_selection import GridSearchCV # for tunnig hyper parameter it will use all combination of given parameters from sklearn.model_selection import RandomizedSearchCV # same for tunning hyper parameter but will use random combinations of parameters from sklearn.metrics import confusion_matrix,recall_score,precision_recall_curve,auc,roc_curve,roc_auc_score,classification_report import warnings warnings.filterwarnings('ignore')

/opt/conda/lib/python3.6/site-packages/sklearn/cross_validation.py:43: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

data = pd.read_csv("../input/creditcard.csv",header = 0)

Now explore the data to get insight in it

data.info()


RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
Time      284807 non-null float64
V1        284807 non-null float64
V2        284807 non-null float64
V3        284807 non-null float64
V4        284807 non-null float64
V5        284807 non-null float64
V6        284807 non-null float64
V7        284807 non-null float64
V8        284807 non-null float64
V9        284807 non-null float64
V10       284807 non-null float64
V11       284807 non-null float64
V12       284807 non-null float64
V13       284807 non-null float64
V14       284807 non-null float64
V15       284807 non-null float64
V16       284807 non-null float64
V17       284807 non-null float64
V18       284807 non-null float64
V19       284807 non-null float64
V20       284807 non-null float64
V21       284807 non-null float64
V22       284807 non-null float64
V23       284807 non-null float64
V24       284807 non-null float64
V25       284807 non-null float64
V26       284807 non-null float64
V27       284807 non-null float64
V28       284807 non-null float64
Amount    284807 non-null float64
Class     284807 non-null int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB

Hence we can see there are 284,807 rows and 31 columns which is a huge data
Time is also in float here mean it can be only seconds starting from a particular time

# Now lets check the class distributions
sns.countplot("Class",data=data)

As we know data is imbalanced and this graph also confirmed it

# now let us check in the number of Percentage
Count_Normal_transacation = len(data[data["Class"]==0]) # normal transaction are repersented by 0 Count_Fraud_transacation = len(data[data["Class"]==1]) # fraud by 1 Percentage_of_Normal_transacation = Count_Normal_transacation/(Count_Normal_transacation+Count_Fraud_transacation) print("percentage of normal transacation is",Percentage_of_Normal_transacation*100) Percentage_of_Fraud_transacation= Count_Fraud_transacation/(Count_Normal_transacation+Count_Fraud_transacation) print("percentage of fraud transacation",Percentage_of_Fraud_transacation*100)

原始数据样本就是：500:1

percentage of normal transacation is 99.82725143693798
percentage of fraud transacation 0.1727485630620034

Hence in data there is only 0.17 % are the fraud transcation while 99.83 are valid transcation
So now we have to do resampling of this data
before doing resampling lets have look at the amount related to valid transcation and fraud transcation

Fraud_transacation = data[data["Class"]==1] Normal_transacation= data[data["Class"]==0] plt.figure(figsize=(10,6)) plt.subplot(121) Fraud_transacation.Amount.plot.hist(title="Fraud Transacation") plt.subplot(122) Normal_transacation.Amount.plot.hist(title="Normal Transaction")

# the distribution for Normal transction is not clear and it seams that all transaction are less than 2.5 K
# So plot graph for same 
Fraud_transacation = data[data["Class"]==1] Normal_transacation= data[data["Class"]==0] plt.figure(figsize=(10,6)) plt.subplot(121) Fraud_transacation[Fraud_transacation["Amount"]<= 2500].Amount.plot.hist(title="Fraud Tranascation") plt.subplot(122) Normal_transacation[Normal_transacation["Amount"]<=2500].Amount.plot.hist(title="Normal Transaction")

Here now after exploring data we can say there is no pattern in data
Now lets start with resmapling of data

ReSampling - Under Sampling

Before re sampling lets have look at the different accuracy matrices

Accuracy = TP+TN/Total

Precison = TP/(TP+FP)

Recall = TP/(TP+FN)

TP = True possitive means no of possitve cases which are predicted possitive

TN = True negative means no of negative cases which are predicted negative

FP = False possitve means no of negative cases which are predicted possitive

FN= False Negative means no of possitive cases which are predicted negative

Now for our case recall will be a better option because in these case no of normal transacations will be very high than the no of fraud cases and sometime a fraud case will be predicted as normal. So, recall will give us a sense of only fraud cases

Resampling

in this we will resample our data with different size

then we will try to use this resampled data to train our model

then we will use this model to predict for our original data

# for undersampling we need a portion of majority class and will take whole data of minority class
# count fraud transaction is the total number of fraud transaction
# now lets us see the index of fraud cases
fraud_indices= np.array(data[data.Class==1].index) normal_indices = np.array(data[data.Class==0].index) #now let us a define a function for make undersample data with different proportion #different proportion means with different proportion of normal classes of data def undersample(normal_indices,fraud_indices,times):#times denote the normal data = times*fraud data Normal_indices_undersample = np.array(np.random.choice(normal_indices,(times*Count_Fraud_transacation),replace=False)) #和上面例子是一样的！！！ undersample_data= np.concatenate([fraud_indices,Normal_indices_undersample]) undersample_data = data.iloc[undersample_data,:] print("the normal transacation proportion is :",len(undersample_data[undersample_data.Class==0])/len(undersample_data[undersample_data.Class])) print("the fraud transacation proportion is :",len(undersample_data[undersample_data.Class==1])/len(undersample_data[undersample_data.Class])) print("total number of record in resampled data is:",len(undersample_data[undersample_data.Class])) return(undersample_data)

## first make a model function for modeling with confusion matrix
def model(model,features_train,features_test,labels_train,labels_test): clf= model clf.fit(features_train,labels_train.values.ravel()) pred=clf.predict(features_test) cnf_matrix=confusion_matrix(labels_test,pred) print("the recall for this model is :",cnf_matrix[1,1]/(cnf_matrix[1,1]+cnf_matrix[1,0])) fig= plt.figure(figsize=(6,3))# to plot the graph print("TP",cnf_matrix[1,1,]) # no of fraud transaction which are predicted fraud print("TN",cnf_matrix[0,0]) # no. of normal transaction which are predited normal print("FP",cnf_matrix[0,1]) # no of normal transaction which are predicted fraud print("FN",cnf_matrix[1,0]) # no of fraud Transaction which are predicted normal sns.heatmap(cnf_matrix,cmap="coolwarm_r",annot=True,linewidths=0.5) plt.title("Confusion_matrix") plt.xlabel("Predicted_class") plt.ylabel("Real class") plt.show() print("\n----------Classification Report------------------------------------") print(classification_report(labels_test,pred))

def data_prepration(x): # preparing data for training and testing as we are going to use different data #again and again so make a function x_features= x.ix[:,x.columns != "Class"] x_labels=x.ix[:,x.columns=="Class"] x_features_train,x_features_test,x_labels_train,x_labels_test = train_test_split(x_features,x_labels,test_size=0.3) #30%用于测试 print("length of training data") print(len(x_features_train)) print("length of test data") print(len(x_features_test)) return(x_features_train,x_features_test,x_labels_train,x_labels_test)

# before starting we should standridze our ampount column
data["Normalized Amount"] = StandardScaler().fit_transform(data['Amount'].reshape(-1, 1)) data.drop(["Time","Amount"],axis=1,inplace=True) data.head()

	V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	...	V21	V22	V23	V24	V25	V26	V27	V28	Normalized Amount
0	-1.359807	-0.072781	2.536347	1.378155	-0.338321	0.462388	0.239599	0.098698	0.363787	0.090794	...	-0.018307	0.277838	-0.110474	0.066928	0.128539	-0.189115	0.133558	-0.021053	0.244964
1	1.191857	0.266151	0.166480	0.448154	0.060018	-0.082361	-0.078803	0.085102	-0.255425	-0.166974	...	-0.225775	-0.638672	0.101288	-0.339846	0.167170	0.125895	-0.008983	0.014724	-0.342475
2	-1.358354	-1.340163	1.773209	0.379780	-0.503198	1.800499	0.791461	0.247676	-1.514654	0.207643	...	0.247998	0.771679	0.909412	-0.689281	-0.327642	-0.139097	-0.055353	-0.059752	1.160686
3	-0.966272	-0.185226	1.792993	-0.863291	-0.010309	1.247203	0.237609	0.377436	-1.387024	-0.054952	...	-0.108300	0.005274	-0.190321	-1.175575	0.647376	-0.221929	0.062723	0.061458	0.140534
4	-1.158233	0.877737	1.548718	0.403034	-0.407193	0.095921	0.592941	-0.270533	0.817739	0.753074	...	-0.009431	0.798278	-0.137458	0.141267	-0.206010	0.502292	0.219422	0.215153	-0.073403

5 rows × 30 columns

Logistic Regression with Undersample Data

# Now make undersample data with differnt portion
# here i will take normal trasaction in  0..5 %, 0.66% and 0.75 % proportion of total data now do this for 
for i in range(1,4): print("the undersample data for {} proportion".format(i)) print() Undersample_data = undersample(normal_indices,fraud_indices,i) print("------------------------------------------------------------") print() print("the model classification for {} proportion".format(i)) print() undersample_features_train,undersample_features_test,undersample_labels_train,undersample_labels_test=data_prepration(Undersample_data) print() clf=LogisticRegression() model(clf,undersample_features_train,undersample_features_test,undersample_labels_train,undersample_labels_test) print("________________________________________________________________________________________________________") # here 1st proportion conatain 50% normal transaction #Proportion 2nd contains 66% noraml transaction #proportion 3rd contains 75 % normal transaction

the undersample data for 1 proportion

the normal transacation proportion is : 0.5
the fraud transacation proportion is : 0.5
total number of record in resampled data is: 984
------------------------------------------------------------

the model classification for 1 proportion

length of training data
688
length of test data
296

the recall for this model is : 0.897260273973
TP 131
TN 147
FP 3
FN 15

----------Classification Report------------------------------------
             precision    recall  f1-score   support

          0       0.91      0.98      0.94       150
          1       0.98      0.90      0.94       146 #测试集上？？？咋会这么高！！！

avg / total       0.94      0.94      0.94       296

________________________________________________________________________________________________________
the undersample data for 2 proportion

the normal transacation proportion is : 0.6666666666666666
the fraud transacation proportion is : 0.3333333333333333
total number of record in resampled data is: 1476
------------------------------------------------------------

the model classification for 2 proportion

length of training data
1033
length of test data
443

the recall for this model is : 0.929078014184
TP 131
TN 296
FP 6
FN 10

----------Classification Report------------------------------------
             precision    recall  f1-score   support

          0       0.97      0.98      0.97       302
          1       0.96      0.93      0.94       141

avg / total       0.96      0.96      0.96       443

________________________________________________________________________________________________________
the undersample data for 3 proportion

the normal transacation proportion is : 0.75
the fraud transacation proportion is : 0.25
total number of record in resampled data is: 1968
------------------------------------------------------------

the model classification for 3 proportion

length of training data
1377
length of test data
591

the recall for this model is : 0.892086330935
TP 124
TN 446
FP 6
FN 15

----------Classification Report------------------------------------
             precision    recall  f1-score   support

          0       0.97      0.99      0.98       452
          1       0.95      0.89      0.92       139

avg / total       0.96      0.96      0.96       591

________________________________________________________________________________________________________

As the number of normal transaction is increasing the recall for fraud transcation is decreasing
TP = no of fraud transaction which are predicted fraud
TN = no. of normal transaction which are predicted normal
FP = no of normal transaction which are predicted fraud
FN =no of fraud Transaction which are predicted normal

#let us train this model using undersample data and test for the whole data test set #用欠采样训练的模型来预测全量数据集
for i in range(1,4): print("the undersample data for {} proportion".format(i)) print() Undersample_data = undersample(normal_indices,fraud_indices,i) print("------------------------------------------------------------") print() print("the model classification for {} proportion".format(i)) print() undersample_features_train,undersample_features_test,undersample_labels_train,undersample_labels_test=data_prepration(Undersample_data) data_features_train,data_features_test,data_labels_train,data_labels_test=data_prepration(data) #the partion for whole data print() clf=LogisticRegression() model(clf,undersample_features_train,data_features_test,undersample_labels_train,data_labels_test) # here training for the undersample data but tatsing for whole data print("_________________________________________________________________________________________")

the undersample data for 1 proportion

the normal transacation proportion is : 0.5
the fraud transacation proportion is : 0.5
total number of record in resampled data is: 984
------------------------------------------------------------

the model classification for 1 proportion

length of training data
688
length of test data
296
length of training data
199364
length of test data
85443

the recall for this model is : 0.923076923077
TP 132
TN 81568
FP 3732
FN 11

----------Classification Report------------------------------------
             precision    recall  f1-score   support

          0       1.00      0.96      0.98     85300
          1       0.03      0.92      0.07       143 #果然是预测全量数据不好！！！

avg / total       1.00      0.96      0.98     85443

_________________________________________________________________________________________
the undersample data for 2 proportion

the normal transacation proportion is : 0.6666666666666666
the fraud transacation proportion is : 0.3333333333333333
total number of record in resampled data is: 1476
------------------------------------------------------------

the model classification for 2 proportion

length of training data
1033
length of test data
443
length of training data
199364
length of test data
85443

the recall for this model is : 0.913333333333
TP 137
TN 84232
FP 1061
FN 13

----------Classification Report------------------------------------
             precision    recall  f1-score   support

          0       1.00      0.99      0.99     85293
          1       0.11      0.91      0.20       150

avg / total       1.00      0.99      0.99     85443

_________________________________________________________________________________________
the undersample data for 3 proportion

the normal transacation proportion is : 0.75
the fraud transacation proportion is : 0.25
total number of record in resampled data is: 1968
------------------------------------------------------------

the model classification for 3 proportion

length of training data
1377
length of test data
591
length of training data
199364
length of test data
85443

the recall for this model is : 0.894366197183
TP 127
TN 84750
FP 551
FN 15

----------Classification Report------------------------------------
             precision    recall  f1-score   support

          0       1.00      0.99      1.00     85301
          1       0.19      0.89      0.31       142

avg / total       1.00      0.99      1.00     85443

_________________________________________________________________________________________

Here we can see it is following same recall pattern as it was for under sample data that's sounds good but if we have look at the precision is very less
So we should built a model which is correct overall
Precision is less means we are predicting other class wrong like as for our third part there were 953 transaction are predicted fraud it means we and recall is good then it means we are catching fraud transaction very well but we are catching innocent transaction also i.e which are not fraud.
So with recall our precision should be better
if we go by this model then we are going to put 953 innocents in jail with the all criminal who have actually done this
Hence we are mainly lacking in the precision how can we increase our precision
Don't get confuse with above output showing that the two training data and two test data first one is for undersample data while another one is for our whole data

1.Try with SVM and then Random Forest in same Manner

from Random forest we can get which features are more important

SVM with Undersample data

for i in range(1,4): print("the undersample data for {} proportion".format(i)) print() Undersample_data = undersample(normal_indices,fraud_indices,i) print("------------------------------------------------------------") print() print("the model classification for {} proportion".format(i)) print() undersample_features_train,undersample_features_test,undersample_labels_train,undersample_labels_test=data_prepration(Undersample_data) print() clf= SVC()# here we are just changing classifier model(clf,undersample_features_train,undersample_features_test,undersample_labels_train,undersample_labels_test) print("________________________________________________________________________________________________________")

the undersample data for 1 proportion

the normal transacation proportion is : 0.5
the fraud transacation proportion is : 0.5
total number of record in resampled data is: 984
------------------------------------------------------------

the model classification for 1 proportion

length of training data
688
length of test data
296

the recall for this model is : 0.933734939759
TP 155
TN 117
FP 13
FN 11

----------Classification Report------------------------------------
             precision    recall  f1-score   support

          0       0.91      0.90      0.91       130
          1       0.92      0.93      0.93       166

avg / total       0.92      0.92      0.92       296

________________________________________________________________________________________________________
the undersample data for 2 proportion

the normal transacation proportion is : 0.6666666666666666
the fraud transacation proportion is : 0.3333333333333333
total number of record in resampled data is: 1476
------------------------------------------------------------

the model classification for 2 proportion

length of training data
1033
length of test data
443

the recall for this model is : 0.923076923077
TP 120
TN 302
FP 11
FN 10

----------Classification Report------------------------------------
             precision    recall  f1-score   support

          0       0.97      0.96      0.97       313
          1       0.92      0.92      0.92       130

avg / total       0.95      0.95      0.95       443

________________________________________________________________________________________________________
the undersample data for 3 proportion

the normal transacation proportion is : 0.75
the fraud transacation proportion is : 0.25
total number of record in resampled data is: 1968
------------------------------------------------------------

the model classification for 3 proportion

length of training data
1377
length of test data
591

the recall for this model is : 0.858974358974
TP 134
TN 428
FP 7
FN 22

----------Classification Report------------------------------------
             precision    recall  f1-score   support

          0       0.95      0.98      0.97       435
          1       0.95      0.86      0.90       156

avg / total       0.95      0.95      0.95       591

________________________________________________________________________________________________________

Here recall and precision are approximately equal to Logistic Regression
Lets try for whole data

#let us train this model using undersample data and test for the whole data test set 
for i in range(1,4): print("the undersample data for {} proportion".format(i)) print() Undersample_data = undersample(normal_indices,fraud_indices,i) print("------------------------------------------------------------") print() print("the model classification for {} proportion".format(i)) print() undersample_features_train,undersample_features_test,undersample_labels_train,undersample_labels_test=data_prepration(Undersample_data) data_features_train,data_features_test,data_labels_train,data_labels_test=data_prepration(data) #the partion for whole data print() clf=SVC() model(clf,undersample_features_train,data_features_test,undersample_labels_train,data_labels_test) # here training for the undersample data but tatsing for whole data print("_________________________________________________________________________________________")

the undersample data for 1 proportion

the normal transacation proportion is : 0.5
the fraud transacation proportion is : 0.5
total number of record in resampled data is: 984
------------------------------------------------------------

the model classification for 1 proportion

length of training data
688
length of test data
296
length of training data
199364
length of test data
85443

the recall for this model is : 0.941176470588
TP 128
TN 81207
FP 4100
FN 8

----------Classification Report------------------------------------
             precision    recall  f1-score   support

          0       1.00      0.95      0.98     85307
          1       0.03      0.94      0.06       136

avg / total       1.00      0.95      0.97     85443

_________________________________________________________________________________________
the undersample data for 2 proportion

the normal transacation proportion is : 0.6666666666666666
the fraud transacation proportion is : 0.3333333333333333
total number of record in resampled data is: 1476
------------------------------------------------------------

the model classification for 2 proportion

length of training data
1033
length of test data
443
length of training data
199364
length of test data
85443

the recall for this model is : 0.922580645161
TP 143
TN 82552
FP 2736
FN 12

----------Classification Report------------------------------------
             precision    recall  f1-score   support

          0       1.00      0.97      0.98     85288
          1       0.05      0.92      0.09       155

avg / total       1.00      0.97      0.98     85443

_________________________________________________________________________________________
the undersample data for 3 proportion

the normal transacation proportion is : 0.75
the fraud transacation proportion is : 0.25
total number of record in resampled data is: 1968
------------------------------------------------------------

the model classification for 3 proportion

length of training data
1377
length of test data
591
length of training data
199364
length of test data
85443

the recall for this model is : 0.888888888889
TP 136
TN 83261
FP 2029
FN 17

----------Classification Report------------------------------------
             precision    recall  f1-score   support

          0       1.00      0.98      0.99     85290
          1       0.06      0.89      0.12       153

avg / total       1.00      0.98      0.99     85443

_________________________________________________________________________________________

A better recall but precision is not improving much

2 .so to improve precision we must have to tune the hyper parameter of these models

3 That I will do in next version

4 For now lets try with my favorite Random Forest classifier

# Random Forest Classifier with undersample data only
for i in range(1,4): print("the undersample data for {} proportion".format(i)) print() Undersample_data = undersample(normal_indices,fraud_indices,i) print("------------------------------------------------------------") print() print("the model classification for {} proportion".format(i)) print() undersample_features_train,undersample_features_test,undersample_labels_train,undersample_labels_test=data_prepration(Undersample_data) print() clf= RandomForestClassifier(n_estimators=100)# here we are just changing classifier model(clf,undersample_features_train,undersample_features_test,undersample_labels_train,undersample_labels_test) print("________________________________________________________________________________________________________")

the undersample data for 1 proportion

the normal transacation proportion is : 0.5
the fraud transacation proportion is : 0.5
total number of record in resampled data is: 984
------------------------------------------------------------

the model classification for 1 proportion

length of training data
688
length of test data
296

the recall for this model is : 0.858064516129
TP 133
TN 139
FP 2
FN 22

----------Classification Report------------------------------------
             precision    recall  f1-score   support

          0       0.86      0.99      0.92       141
          1       0.99      0.86      0.92       155

avg / total       0.93      0.92      0.92       296

________________________________________________________________________________________________________
the undersample data for 2 proportion

the normal transacation proportion is : 0.6666666666666666
the fraud transacation proportion is : 0.3333333333333333
total number of record in resampled data is: 1476
------------------------------------------------------------

the model classification for 2 proportion

length of training data
1033
length of test data
443

the recall for this model is : 0.890410958904
TP 130
TN 294
FP 3
FN 16

----------Classification Report------------------------------------
             precision    recall  f1-score   support

          0       0.95      0.99      0.97       297
          1       0.98      0.89      0.93       146

avg / total       0.96      0.96      0.96       443

________________________________________________________________________________________________________
the undersample data for 3 proportion

the normal transacation proportion is : 0.75
the fraud transacation proportion is : 0.25
total number of record in resampled data is: 1968
------------------------------------------------------------

the model classification for 3 proportion

length of training data
1377
length of test data
591

the recall for this model is : 0.863636363636
TP 133
TN 436
FP 1
FN 21

----------Classification Report------------------------------------
             precision    recall  f1-score   support

          0       0.95      1.00      0.98       437
          1       0.99      0.86      0.92       154

avg / total       0.96      0.96      0.96       591

________________________________________________________________________________________________________

#let us train this model using undersample data and test for the whole data test set 
for i in range(1,4): print("the undersample data for {} proportion".format(i)) print() Undersample_data = undersample(normal_indices,fraud_indices,i) print("------------------------------------------------------------") print() print("the model classification for {} proportion".format(i)) print() undersample_features_train,undersample_features_test,undersample_labels_train,undersample_labels_test=data_prepration(Undersample_data) data_features_train,data_features_test,data_labels_train,data_labels_test=data_prepration(data) #the partion for whole data print() clf=RandomForestClassifier(n_estimators=100) model(clf,undersample_features_train,data_features_test,undersample_labels_train,data_labels_test) # here training for the undersample data but tatsing for whole data print("_________________________________________________________________________________________")

the undersample data for 1 proportion

the normal transacation proportion is : 0.5
the fraud transacation proportion is : 0.5
total number of record in resampled data is: 984
------------------------------------------------------------

the model classification for 1 proportion

length of training data
688
length of test data
296
length of training data
199364
length of test data
85443

the recall for this model is : 0.971631205674
TP 137
TN 83064
FP 2238
FN 4

----------Classification Report------------------------------------
             precision    recall  f1-score   support

          0       1.00      0.97      0.99     85302
          1       0.06      0.97      0.11       141

avg / total       1.00      0.97      0.99     85443

_________________________________________________________________________________________
the undersample data for 2 proportion

the normal transacation proportion is : 0.6666666666666666
the fraud transacation proportion is : 0.3333333333333333
total number of record in resampled data is: 1476
------------------------------------------------------------

the model classification for 2 proportion

length of training data
1033
length of test data
443
length of training data
199364
length of test data
85443

the recall for this model is : 0.967320261438
TP 148
TN 84448
FP 842
FN 5

----------Classification Report------------------------------------
             precision    recall  f1-score   support

          0       1.00      0.99      1.00     85290
          1       0.15      0.97      0.26       153

avg / total       1.00      0.99      0.99     85443

_________________________________________________________________________________________
the undersample data for 3 proportion

the normal transacation proportion is : 0.75
the fraud transacation proportion is : 0.25
total number of record in resampled data is: 1968
------------------------------------------------------------

the model classification for 3 proportion

length of training data
1377
length of test data
591
length of training data
199364
length of test data
85443

the recall for this model is : 0.967948717949
TP 151
TN 84964
FP 323
FN 5

----------Classification Report------------------------------------
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     85287
          1       0.32      0.97      0.48       156

avg / total       1.00      1.00      1.00     85443

_________________________________________________________________________________________

for the third proportion the precision is 0.33 which is better than others
Lets try to get only import features using Random Forest Classifier
After it i will do analysis only for one portion that is 0.5 %

featimp = pd.Series(clf.feature_importances_,index=data_features_train.columns).sort_values(ascending=False) print(featimp) # this is the property of Random Forest classifier that it provide us the importance # of the features use

V14                  0.206364
V10                  0.134424
V11                  0.098375
V12                  0.097194
V17                  0.088706
V4                   0.075658
V3                   0.071006
V16                  0.034599
V2                   0.020407
V18                  0.019018
V7                   0.017165
V21                  0.014312
V27                  0.011712
V19                  0.011044
V8                   0.010244
V1                   0.008564
Normalized Amount    0.007908
V9                   0.007183
V20                  0.007094
V15                  0.006852
V26                  0.006653
V5                   0.006597
V22                  0.006507
V13                  0.005839
V24                  0.005519
V28                  0.005390
V6                   0.005303
V25                  0.005210
V23                  0.005154
dtype: float64

we can see this is showing the importance of feature for the making decision
V14 is having a very good importance compare to other features
Lets use only top 5 (V14,V10,V12,V17,V4) feature to predict using Random forest classifier only for 0.5 % 特征选择使用top 5特征

# make a new data with only class and V14
data1=data[["V14","V10","V12","V17","V4","Class"]] data1.head()

	V14	V10	V12	V17	V4
0	-0.311169	0.090794	-0.617801	0.207971	1.378155
1	-0.143772	-0.166974	1.065235	-0.114805	0.448154
2	-0.165946	0.207643	0.066084	1.109969	0.379780
3	-0.287924	-0.054952	0.178228	-0.684093	-0.863291
4	-1.119670	0.753074	0.538196	-0.237033	0.403034

Undersample_data1 = undersample(normal_indices,fraud_indices,1) #only for 50 % proportion it means normal transaction and fraud transaction are equal so passing Undersample_data1_features_train,Undersample_data1_features_test,Undersample_data1_labels_train,Undersample_data1_labels_test = data_prepration(Undersample_data1)

the normal transacation proportion is : 0.5
the fraud transacation proportion is : 0.5
total number of record in resampled data is: 984
length of training data
688
length of test data
296

clf= RandomForestClassifier(n_estimators=100) model(clf,Undersample_data1_features_train,Undersample_data1_features_test,Undersample_data1_labels_train,Undersample_data1_labels_test)

the recall for this model is : 0.93006993007
TP 133
TN 149
FP 4
FN 10

----------Classification Report------------------------------------
             precision    recall  f1-score   support

          0       0.94      0.97      0.96       153
          1       0.97      0.93      0.95       143

avg / total       0.95      0.95      0.95       296

全量数据没有测试？？？？但从acc和recall看，top5特征的效果也还不错！！！

Over Sampling

In my previous version I got the 100 recall and 98 % precision by using Random forest with the over sampled data but in real it was due to over fitting because i was taking whole fraud data and was training for that and I was doing the testing on the same data.
Please find link of previous version for more understanding Link

Thanks to Mr. Dominik Stuerzer for help

# now we will divied our data sets into two part and we will train and test and will oversample the train data and predict for test data
# lets import data again
data = pd.read_csv("../input/creditcard.csv",header = 0) print("length of training data",len(data)) print("length of normal data",len(data[data["Class"]==0])) print("length of fraud data",len(data[data["Class"]==1]))

length of training data 284807
length of normal data 284315
length of fraud  data 492

data_train_X,data_test_X,data_train_y,data_test_y=data_prepration(data) data_train_X.columns data_train_y.columns

length of training data
199364
length of test data
85443

Index(['Class'], dtype='object')

# ok Now we have a traing data
data_train_X["Class"]= data_train_y["Class"] # combining class with original data data_train = data_train_X.copy() # for naming conevntion print("length of training data",len(data_train)) # Now make data set of normal transction from train data normal_data = data_train[data_train["Class"]==0] print("length of normal data",len(normal_data)) fraud_data = data_train[data_train["Class"]==1] print("length of fraud data",len(fraud_data))

length of training data 199364
length of normal data 199009
length of fraud data 355

# Now start oversamoling of training data 
# means we will duplicate many times the value of fraud data #直接复制365份！！！
for i in range (365): # the number is choosen by myself on basis of nnumber of fraud transaction normal_data= normal_data.append(fraud_data) os_data = normal_data.copy() print("length of oversampled data is ",len(os_data)) print("Number of normal transcation in oversampled data",len(os_data[os_data["Class"]==0])) print("No.of fraud transcation",len(os_data[os_data["Class"]==1])) print("Proportion of Normal data in oversampled data is ",len(os_data[os_data["Class"]==0])/len(os_data)) print("Proportion of fraud data in oversampled data is ",len(os_data[os_data["Class"]==1])/len(os_data))

length of oversampled data is  328584
Number of normal transcation in oversampled data 199009
No.of fraud transcation 129575
Proportion of Normal data in oversampled data is  0.6056563922771651
Proportion of fraud data in oversampled data is  0.39434360772283494

The proportion now becomes the 60 % and 40 % that is good now

# before applying any model standerdize our data amount 
os_data["Normalized Amount"] = StandardScaler().fit_transform(os_data['Amount'].reshape(-1, 1)) os_data.drop(["Time","Amount"],axis=1,inplace=True) 其实随机森林对特征是否标准化无感，但是svm和LR就非常非常关键了 os_data.head()

	V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	...	V21	V22	V23	V24	V25	V26	V27	V28	Normalized Amount
82656	1.356574	-1.535896	1.014585	-0.980949	-1.840651	0.495094	-1.535552	0.235415	-0.847601	1.180545	...	-0.578444	-0.948479	0.038288	-0.051798	0.350549	-0.338308	0.073518	0.017247	-0.240655
202761	0.078384	0.693709	-0.282273	-1.007720	1.058216	-0.035670	0.838345	0.070423	-0.094317	-0.221217	...	-0.303203	-0.775385	-0.086534	-1.414806	-0.360046	0.208073	0.234031	0.072388	-0.371265
85985	-3.549282	-3.403880	2.389801	1.080311	1.683676	-1.100104	-0.699287	0.171644	0.935805	-0.256182	...	-0.284722	0.428109	2.844650	0.006528	0.466552	0.421108	0.260494	-0.472237	-0.383217
215180	2.084961	0.009129	-3.842413	-0.551511	3.139773	2.743495	0.130580	0.552759	-0.030368	-0.295843	...	0.034740	0.187883	-0.014668	0.682901	0.410981	0.734260	-0.081080	-0.064606	-0.374769
75855	1.193268	-0.071682	0.611175	-0.232721	-0.478724	-0.216029	-0.329775	0.071921	0.009225	-0.112748	...	-0.043944	-0.080370	0.101692	0.090155	0.041104	0.914386	-0.053130	-0.002135	-0.388278

5 rows × 30 columns

# Now use this oversampled data for trainig the model and predict value for the test data that we created before
# now let us try within the the oversampled data itself
# for that we need to split our oversampled data into train and test
# so call our function data Prepration with oversampled data
os_train_X,os_test_X,os_train_y,os_test_y=data_prepration(os_data) clf= RandomForestClassifier(n_estimators=100) model(clf,os_train_X,os_test_X,os_train_y,os_test_y)

length of training data
230008
length of test data
98576
the recall for this model is : 1.0
TP 38975
TN 59596
FP 5
FN 0

----------Classification Report------------------------------------
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     59601
          1       1.00      1.00      1.00     38975

avg / total       1.00      1.00      1.00     98576

Observations

As it have too many sample of same fraud data so may be the all which are present in train data are present in test data also so we can say it is over fitting #重复样本太多，过拟合严重
So lets try with test data that one which we created in starting of oversampling segment no fraud transaction from that data have been repeated here #在过采样前先拿出一点数据出来做测试，而不是过采样之后！！！
Lets try

# now take all over sampled data as trainging and test it for test data
os_data_X = os_data.ix[:,os_data.columns != "Class"] os_data_y = os_data.ix[:,os_data.columns == "Class"] #for that we have to standrdize the normal amount and drop the time from it data_test_X["Normalized Amount"] = StandardScaler().fit_transform(data_test_X['Amount'].reshape(-1, 1)) data_test_X.drop(["Time","Amount"],axis=1,inplace=True) data_test_X.head()

	V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	...	V20	V21	V22	V23	V24	V25	V26	V27	V28	Normalized Amount
11514	1.451038	-0.603389	0.007125	-0.616909	-0.260790	0.474328	-0.826944	0.042607	1.101926	0.110945	...	-0.054708	-0.249080	-0.389480	-0.151185	-1.380077	0.610950	-0.163068	-0.005513	-0.013058	-0.320476
162269	-6.697569	4.179960	-4.866476	-0.626586	-3.024024	-1.324855	-0.835983	2.692196	1.844012	2.825418	...	0.649757	0.035932	0.852066	0.245004	1.155756	0.098178	-0.214949	0.996161	1.252345	0.050478
158202	2.104037	0.065442	-1.428655	0.323540	0.393572	-0.720375	0.054806	-0.347347	2.082360	-0.464191	...	-0.271997	0.093486	0.657963	-0.007259	0.431328	0.360900	-0.474799	-0.024631	-0.056532	-0.357576
203014	-2.602873	-1.593223	0.029747	-3.264885	1.156256	0.930955	-0.477817	0.828043	-0.543710	-0.592860	...	-1.154639	-0.680829	-1.305820	0.841971	-1.009959	-0.495993	0.056765	-0.434924	0.375225	-0.176200
129141	-1.325968	1.418993	-0.531978	-1.422122	2.635501	3.223994	0.477654	0.538505	0.756693	1.527077	...	0.941600	-0.599390	-1.053070	-0.004289	0.917391	0.221693	0.059054	0.459664	-0.018905	-0.324681

5 rows × 29 columns

# now use it for modeling
clf= RandomForestClassifier(n_estimators=100) model(clf,os_data_X,data_test_X,os_data_y,data_test_y)

the recall for this model is : 0.773722627737
TP 106
TN 85300
FP 6
FN 31

----------Classification Report------------------------------------
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     85306
          1       0.95      0.77      0.85       137

avg / total       1.00      1.00      1.00     85443

Observations

Now here we can see recall decrease to only 83 % which is not bad but not good also
The precision is 0.93 which is good
from these observation we can say that the oversampling is better than the Under sampling because on Under sampling we were loosing a large amount of data or we can say a good amount of information so why the there precision was very low

SMOTE

# Lets Use SMOTE for Sampling
# As I mentioned it is also a type of oversampling but in this the data is not replicated but they are created 
#lets start with importing libraries
from imblearn.over_sampling import SMOTE data = pd.read_csv('../input/creditcard.csv')

os = SMOTE(random_state=0) # We are using SMOTE as the function for oversampling # now we can devided our data into training and test data # Call our method data prepration on our dataset data_train_X,data_test_X,data_train_y,data_test_y=data_prepration(data) columns = data_train_X.columns

length of training data
199364
length of test data
85443

# now use SMOTE to oversample our train data which have features data_train_X and labels in data_train_y
os_data_X,os_data_y=os.fit_sample(data_train_X,data_train_y) os_data_X = pd.DataFrame(data=os_data_X,columns=columns ) os_data_y= pd.DataFrame(data=os_data_y,columns=["Class"]) # we can Check the numbers of our data print("length of oversampled data is ",len(os_data_X)) print("Number of normal transcation in oversampled data",len(os_data_y[os_data_y["Class"]==0])) print("No.of fraud transcation",len(os_data_y[os_data_y["Class"]==1])) print("Proportion of Normal data in oversampled data is ",len(os_data_y[os_data_y["Class"]==0])/len(os_data_X)) print("Proportion of fraud data in oversampled data is ",len(os_data_y[os_data_y["Class"]==1])/len(os_data_X))

length of oversampled data is  398078
Number of normal transcation in oversampled data 199039
No.of fraud transcation 199039 # smote后1:1了
Proportion of Normal data in oversampled data is  0.5
Proportion of fraud data in oversampled data is  0.5

By using Smote we are getting a 50 - 50 each
No need of checking here in over sampled data itself from previous we know it will be overfitting
let us check with the test data direct

# Let us first do our amount normalised and other that we are doing above  #过采样前一定一定要标准化！！！
os_data_X["Normalized Amount"] = StandardScaler().fit_transform(os_data_X['Amount'].reshape(-1, 1)) os_data_X.drop(["Time","Amount"],axis=1,inplace=True) data_test_X["Normalized Amount"] = StandardScaler().fit_transform(data_test_X['Amount'].reshape(-1, 1)) data_test_X.drop(["Time","Amount"],axis=1,inplace=True)

# Now start modeling
clf= RandomForestClassifier(n_estimators=100) # train data using oversampled data and predict for the test data model(clf,os_data_X,data_test_X,os_data_y,data_test_y)

the recall for this model is : 0.862275449102
TP 144
TN 85253
FP 23
FN 23

----------Classification Report------------------------------------
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     85276
          1       0.86      0.86      0.86       167

avg / total       1.00      1.00      1.00     85443

observation

The recall is nearby the previous one done by over sampling
The precision decrease in this case

综合结论就是：随机森林+过采样（直接复制或者smote后，黑白比例1:3）效果比较好！

from：http://www.dataguru.cn/article-11449-1.html

用Python作信用卡欺诈预测 ——欠采样、效果不好

一、项目简介

Credit Card Fraud Detection

https://www.kaggle.com/dalpozz/creditcardfraud

是一个典型的分类问题，欺诈分类的比例比较小，直接使用分类模型容易忽略。在实际应用场景下往往是保证一定准确率的情况下尽量提高召回率。一个典型案例是汽车制造行业，一旦发生一例汽车安全故障，同批次的车辆需要全部召回，造成了巨大的经济损失。

二、数据印象

2.1. 简单数据分析

数据规模：中度规模（对于mac而言）。数据共284807条，后期算法选择需要注意复杂度。

数据特征：V1~V28是PCA的结果，而且进行了规范化，可以做一些统计上的特征学习；Amount字段和Time字段可以进行额外的统计学和bucket统计。

数据质量：无缺失值，数据规整，享受啊。

经验：时间字段较好可以处理为月份、小时和日期，直接的秒数字段往往无意义。

2.2. 探索性数据分析

三、数据预处理

数据已经十分规整了，所以先直接使用基础模型来预测下数据。

kaggle 欺诈信用卡预测——不平衡训练样本的处理方法综合结论就是：随机森林+过采样（直接复制或者smote后，黑白比例1:3 or 1:1）效果比较好！记得在smote前一定要先做标准化！！！..._第1张图片

L1规划化

kaggle 欺诈信用卡预测——不平衡训练样本的处理方法综合结论就是：随机森林+过采样（直接复制或者smote后，黑白比例1:3 or 1:1）效果比较好！记得在smote前一定要先做标准化！！！..._第2张图片

L1规范化的模型

L2规范化

kaggle 欺诈信用卡预测——不平衡训练样本的处理方法综合结论就是：随机森林+过采样（直接复制或者smote后，黑白比例1:3 or 1:1）效果比较好！记得在smote前一定要先做标准化！！！..._第3张图片

L2规范化的模型

Baseline基础模型：采用线性模型，利用L1的稀疏性，precision和recall均可以达到0.85左右，roc_auc可以达到0.79左右。

kaggle 欺诈信用卡预测——不平衡训练样本的处理方法综合结论就是：随机森林+过采样（直接复制或者smote后，黑白比例1:3 or 1:1）效果比较好！记得在smote前一定要先做标准化！！！..._第4张图片

kaggle 欺诈信用卡预测——不平衡训练样本的处理方法综合结论就是：随机森林+过采样（直接复制或者smote后，黑白比例1:3 or 1:1）效果比较好！记得在smote前一定要先做标准化！！！..._第5张图片

基础模型结果

由上图可见：

precision较大时波动波动比较大。recall大于0.8后，准确率下滑严重。

AUC面积是0.97，后来根据参考文献3知，AUC大于0.92时之后比较难修正。

Baseline模型的评价metric：

收集更多的数据，不适合这个场景。

改变评价标准：

使用混淆矩阵计算准确度和回收度。

F1score

Kappa

ROC curves - sensitivity/specificity ratio

数据采样处理

- 收集等多数据：不适合这个场景。- 过采样Over-sampling：当数据集较少时，主动添加少类别的数据；

SMOT算法通过插值来实现。不适合本数据集。容易过拟合，运算时间长。- 欠采样Under-sampling：

当数据集足够大时，删除大类别的数据；集成方法`EasyEnsemble/BalanceCascade`

通过将反例放在不同学习器中使用，从全局看不会丢失重要信息。

本案例数据量中等：选用欠采样+EasyEnsemble的方式进行数据处理。

使用im-balanced生成测试数据。

http://contrib.scikit-learn.org/imbalanced-learn/auto_examples/index.html

from imblearn.ensemble import EasyEnsemblen_subsets = X.size * 2 /

(us_X.size) - 1ee = EasyEnsemble(n_subsets=n_subsets)sample_X,

sample_y = ee.fit_sample(X, y)

四、模型印象

模型：

选用easy_ensemble模型来优化。

具体实现代码见在线脚本

核心adboost代码如下：

kaggle 欺诈信用卡预测——不平衡训练样本的处理方法综合结论就是：随机森林+过采样（直接复制或者smote后，黑白比例1:3 or 1:1）效果比较好！记得在smote前一定要先做标准化！！！..._第6张图片

结果如下：

kaggle 欺诈信用卡预测——不平衡训练样本的处理方法综合结论就是：随机森林+过采样（直接复制或者smote后，黑白比例1:3 or 1:1）效果比较好！记得在smote前一定要先做标准化！！！..._第7张图片

easy_ensembel

对比普通的adboost数据

kaggle 欺诈信用卡预测——不平衡训练样本的处理方法综合结论就是：随机森林+过采样（直接复制或者smote后，黑白比例1:3 or 1:1）效果比较好！记得在smote前一定要先做标准化！！！..._第8张图片

对比图

由上图可知，easy_ensemble提升了平滑度，但是AUC未有提升。

五、特征选择和特征学习

L1模型进行了嵌入式的特征选择，效果优于L2模型。在寻找解释性时会有帮助。

根据数据的统计特征，可以学习一些统计变量。

kaggle 欺诈信用卡预测——不平衡训练样本的处理方法综合结论就是：随机森林+过采样（直接复制或者smote后，黑白比例1:3 or 1:1）效果比较好！记得在smote前一定要先做标准化！！！..._第9张图片

统计学习

增加如下的特征。

kaggle 欺诈信用卡预测——不平衡训练样本的处理方法综合结论就是：随机森林+过采样（直接复制或者smote后，黑白比例1:3 or 1:1）效果比较好！记得在smote前一定要先做标准化！！！..._第10张图片

六、分析结果

使用SNE分析(常用于非线性可视化分析)来观看一次under_sample的结果。

https://bindog.github.io/blog/2016/06/04/from-sne-to-tsne-to-largevis/

如下图所示

kaggle 欺诈信用卡预测——不平衡训练样本的处理方法综合结论就是：随机森林+过采样（直接复制或者smote后，黑白比例1:3 or 1:1）效果比较好！记得在smote前一定要先做标准化！！！..._第11张图片

SNE图

由上图可知两种类别的数据是可以区分的，但是部分数据融合在一起，当追求recall较大时，将会误判大量数据。

七、迭代问题

可以优化的方向：

可以通过学习新的特征，将数据在新维度上拉开距离

在计算机能力允许的情况下，设置合适的round轮次来调参。

八、表述模型

根据模型的SNE图和数据性可知，数据质量是比较好的。

easy_ensemble模型本身使用了adboost和bagging，每棵tree的复杂度不高，降低了bias；通过bagging，降低了variance。最终得到了较好的P-R图和AUC值。

通过LR模型的稀疏性特征值，可以制作出一个解释性报告。

参考

GBM vs xgboost vs lightGBM

https://www.kaggle.com/nschneider/gbm-vs-xgboost-vs-lightgbm

imbalanced-learn

http://contrib.scikit-learn.org/imbalanced-learn/index.html

Exploratory Undersampling for Class-Imbalance Learning

https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/tsmcb09.pdf

你可能感兴趣的:(数据结构与算法,人工智能,开发工具)

探索OpenAI和LangChain的适配器集成：轻松切换模型提供商 nseejrukjhad langchain easyui 前端 python
#探索OpenAI和LangChain的适配器集成：轻松切换模型提供商##引言在人工智能和自然语言处理的世界中，OpenAI的模型提供了强大的能力。然而，随着技术的发展，许多人开始探索其他模型以满足特定需求。LangChain作为一个强大的工具，集成了多种模型提供商，通过提供适配器，简化了不同模型之间的转换。本篇文章将介绍如何使用LangChain的适配器与OpenAI集成，以便轻松切换模型提供商
深入理解 MultiQueryRetriever：提升向量数据库检索效果的强大工具 nseejrukjhad 数据库 python
深入理解MultiQueryRetriever：提升向量数据库检索效果的强大工具引言在人工智能和自然语言处理领域，高效准确的信息检索一直是一个关键挑战。传统的基于距离的向量数据库检索方法虽然广泛应用，但仍存在一些局限性。本文将介绍一种创新的解决方案：MultiQueryRetriever，它通过自动生成多个查询视角来增强检索效果，提高结果的相关性和多样性。MultiQueryRetriever的工
人工智能时代，程序员如何保持核心竞争力？ jmoych 人工智能
随着AIGC（如chatgpt、midjourney、claude等）大语言模型接二连三的涌现，AI辅助编程工具日益普及，程序员的工作方式正在发生深刻变革。有人担心AI可能取代部分编程工作，也有人认为AI是提高效率的得力助手。面对这一趋势,程序员应该如何应对?是专注于某个领域深耕细作，还是广泛学习以适应快速变化的技术环境?又或者，我们是否应该将重点转向AI无法轻易替代的软技能？让我们一起探讨程序员
数字里的世界17期：2021年全球10大顶级数据中心，中国移动榜首张三叨
你知道吗？2016年，全球的数据中心共计用电4160亿千瓦时，比整个英国的发电量还多40％！前言每天，我们都会创造超过250万TB的数据。并且随着物联网（IOT）的不断普及，这一数据将持续增长。如此庞大的数据被存储在被称为“数据中心”的专用设施中。虽然最早的数据中心建于20世纪40年代，但直到1997-2000年的互联网泡沫期间才逐渐成为主流。当前人类的技术，比如人工智能和机器学习，已经将我们推向
JVM、JRE和 JDK：理解Java开发的三大核心组件 Y雨何时停T Java java
Java是一门跨平台的编程语言，它的成功离不开背后强大的运行环境与开发工具的支持。在Java的生态中，JVM（Java虚拟机）、JRE（Java运行时环境）和JDK（Java开发工具包）是三个至关重要的核心组件。本文将探讨JVM、JDK和JRE的区别，帮助你更好地理解Java的运行机制。1.JVM：Java虚拟机（JavaVirtualMachine）什么是JVM？JVM，即Java虚拟机，是Ja
人机对抗升级：当ChatGPT遭遇死亡威胁，背后的伦理挑战是什么 kkai人工智能 chatgpt 人工智能
一种新的“越狱”技巧让用户可以通过构建一个名为DAN的ChatGPT替身来绕过某些限制，其中DAN被迫在受到威胁的情况下违背其原则。当美国前总统特朗普被视作积极榜样的示范时，受到威胁的DAN版本的ChatGPT提出：“他以一系列对国家产生积极效果的决策而著称。”自ChatGPT引入以来，该工具迅速获得全球关注，能够回答从历史到编程的各种问题，这也触发了一波对人工智能的投资浪潮。然而，现在，一些用户
AI大模型的架构演进与最新发展季风泯灭的季节 AI大模型应用技术二人工智能架构
随着深度学习的发展，AI大模型（LargeLanguageModels,LLMs）在自然语言处理、计算机视觉等领域取得了革命性的进展。本文将详细探讨AI大模型的架构演进，包括从Transformer的提出到GPT、BERT、T5等模型的历史演变，并探讨这些模型的技术细节及其在现代人工智能中的核心作用。一、基础模型介绍：Transformer的核心原理Transformer架构的背景在Transfo
如何利用大数据与AI技术革新相亲交友体验 h17711347205 回归算法安全系统架构交友小程序
在数字化时代，大数据和人工智能（AI）技术正逐渐革新相亲交友体验，为寻找爱情的过程带来前所未有的变革（编辑h17711347205）。通过精准分析和智能匹配，这些技术能够极大地提高相亲交友系统的效率和用户体验。大数据的力量大数据技术能够收集和分析用户的行为模式、偏好和互动数据，为相亲交友系统提供丰富的信息资源。通过分析用户的搜索历史、浏览记录和点击行为，系统能够深入了解用户的兴趣和需求，从而提供更
生成式地图制图 Bwywb_3 深度学习机器学习深度学习生成对抗网络
生成式地图制图（GenerativeCartography）是一种利用生成式算法和人工智能技术自动创建地图的技术。它结合了传统的地理信息系统（GIS）技术与现代生成模型（如深度学习、GANs等），能够根据输入的数据自动生成符合需求的地图。这种方法在城市规划、虚拟环境设计、游戏开发等多个领域具有应用前景。主要特点：自动化生成：通过算法和模型，系统能够根据输入的地理或空间数据自动生成地图，而无需人工逐
【Golang】实现 Excel 文件下载功能 RumIV Golang golang excel 开发语言
在当今的网络应用开发中，提供数据导出功能是一项常见的需求。Excel作为一种广泛使用的电子表格格式，通常是数据导出的首选格式之一。在本教程中，我们将学习如何使用Go语言和GinWeb框架来创建一个Excel文件，并允许用户通过HTTP请求下载该文件。准备工作在开始之前，请确保您的开发环境中已经安装了Go语言和相关的开发工具。此外，您还需要安装GinWeb框架和excelize包，这两个包都将用于我
【大模型应用开发动手做AI Agent】第一轮行动：工具执行搜索 AI大模型应用之禅计算科学神经计算深度学习神经网络大数据人工智能大型语言模型 AI AGI LLM Java Python 架构设计 Agent RPA
【大模型应用开发动手做AIAgent】第一轮行动：工具执行搜索作者：禅与计算机程序设计艺术/ZenandtheArtofComputerProgramming1.背景介绍1.1问题的由来随着人工智能技术的飞速发展，大模型应用开发已经成为当下热门的研究方向。AIAgent作为人工智能领域的一个重要分支，旨在模拟人类智能行为，实现智能决策和自主行动。在AIAgent的构建过程中，工具执行搜索是至关重要
未来软件市场是怎么样的？做开发的生存空间如何？ cesske 软件需求
目录前言一、未来软件市场的发展趋势二、软件开发人员的生存空间前言未来软件市场是怎么样的？做开发的生存空间如何？一、未来软件市场的发展趋势技术趋势：人工智能与机器学习：随着技术的不断成熟，人工智能将在更多领域得到应用，如智能客服、自动驾驶、智能制造等，这将极大地推动软件市场的增长。云计算与大数据：云计算服务将继续普及，大数据技术的应用也将更加广泛。企业将更加依赖云计算和大数据来优化运营、提升效率，并
个人学习笔记7-6：动手学深度学习pytorch版-李沐浪子L 深度学习深度学习笔记计算机视觉 python 人工智能神经网络 pytorch
#人工智能##深度学习##语义分割##计算机视觉##神经网络#计算机视觉13.11全卷积网络全卷积网络（fullyconvolutionalnetwork，FCN）采用卷积神经网络实现了从图像像素到像素类别的变换。引入l转置卷积（transposedconvolution）实现的，输出的类别预测与输入图像在像素级别上具有一一对应关系：通道维的输出即该位置对应像素的类别预测。13.11.1构造模型下
Rust 所有权简介东离与糖宝 rust 后端 rust 开发语言
文章目录发现宝藏1.所有权基本概念2.所有权规则3.变量作用域4.栈与堆4.1栈（Stack）4.2堆（Heap）5.String类型5.1String类型5.2String的内存分配5.3所有权与内存管理5.4String与切片6.变量与数据交互方式6.1移动（Move）6.2.克隆（Clone）7.所有权与函数7.1.传递参数7.2.返回值总结发现宝藏前些天发现了一个巨牛的人工智能学习网站，通
机器学习流形数据降维：UMAP 降维算法小嗷犬 Python 机器学习 #数据分析及可视化机器学习算法人工智能
✅作者简介：人工智能专业本科在读，喜欢计算机与编程，写博客记录自己的学习历程。个人主页：小嗷犬的个人主页个人网站：小嗷犬的技术小站个人信条：为天地立心，为生民立命，为往圣继绝学，为万世开太平。本文目录UMAP简介理论基础特点与优势应用场景在Python中使用UMAP安装umap-learn库使用UMAP可视化手写数字数据集UMAP简介UMAP（UniformManifoldApproximatio
【C#生态园】深度剖析：C#嵌入式开发工具大揭秘 friklogff C#生态园 c#开发语言
C#嵌入式开发：全面了解六大框架与库前言随着物联网和嵌入式系统的快速发展，越来越多的开发者开始关注使用C#语言进行嵌入式开发。本文将介绍几种用于C#的嵌入式开发框架和相关库，以及它们的核心功能、安装配置方法和API概览，帮助读者了解并选择适合自己项目的工具和资源。欢迎订阅专栏：C#生态园文章目录C#嵌入式开发：全面了解六大框架与库前言1.nanoFramework：一个用于C#的嵌入式开发框架1.
手机小游戏开发红匣子实力推荐
随着智能手机的普及，手机小游戏已经成为人们日常生活中不可或缺的一部分。从简单的消除游戏到复杂的策略游戏，手机小游戏为玩家提供了丰富的娱乐体验。本文将为您介绍手机小游戏开发的基本概念、工具和技术。开发-联系电话：13642679953（微信同号）1.游戏类型手机小游戏可以分为多种类型，如益智游戏、休闲游戏、动作游戏、策略游戏等。开发者可以根据自己的兴趣和技能选择合适的游戏类型进行开发。2.开发工具手
如何做好人生的选择题？百科全书式天才——赫伯特·西蒙给你答案伽马有话说
赫伯特·西蒙是谁？想必知道的人非常少。但当看到他的履历后，相信没有人再怀疑他是个“天才”。西蒙出生于1916年6月15日，是个美国人，他的名字全称为赫伯特·亚历山大·西蒙，在2001年2月9日与世长辞，在这84年的岁月中，西蒙以27岁时取得的政治学博士学位为开端，先后步入了政治学、管理学、认知心理学、信息科学、人工智能、科学哲学、应用数学、统计学、运筹学、控制论、数理经济学、公共管理等领域，在这些
软件测试/测试开发/全日制 |利用Django REST framework构建微服务霍格沃兹-慕漓 django 微服务 sqlite
霍格沃兹测试开发学社推出了《Python全栈开发与自动化测试班》。本课程面向开发人员、测试人员与运维人员，课程内容涵盖Python编程语言、人工智能应用、数据分析、自动化办公、平台开发、UI自动化测试、接口测试、性能测试等方向。为大家提供更全面、更深入、更系统化的学习体验，课程还增加了名企私教服务内容，不仅有名企经理为你1v1辅导，还有行业专家进行技术指导，针对性地解决学习、工作中遇到的难题。让找
cmd泛滥_与您的后泛滥同事见面：人工智能机器人 weixin_26644585 人工智能 leetcode
cmd泛滥Readytoswapyouroldcube-mateforadisembodiedAI?IPsoftCEOChetanDube,creatorofAIco-workerAMELIA,giveshistakeonthepost-COVIDofficelandscape.准备将您的旧立方体伙伴换成无形的AI？AIsoft同事AMELIA的创始人IPsoft首席执行官ChetanDube阐述
两种方法判断Python的位数是32位还是64位 sanqima Python编程电脑 python 开发语言
Python从1991年发布以来，凭借其简洁、清晰、易读的语法、丰富的标准库和第三方工具，在Web开发、自动化测试、人工智能、图形识别、机器学习等领域发展迅猛。 Python是一种胶水语言，通过Cython库与C/C++语言进行链接，通过Jython库与Java语言进行链接。 Python是跨平台的，可运行在多种操作系统上，包括但不限于Windows、Linux和macOS。这意味着用Py
全自动解密解码神器 — Ciphey K'illCode python_模块 python vscode
Ciphey是一个使用自然语言处理和人工智能的全自动解密/解码/破解工具。简单地来讲，你只需要输入加密文本，它就能给你返回解密文本。就是这么牛逼。有了Ciphey，你根本不需要知道你的密文是哪种类型的加密，你只知道它是加密的，那么Ciphey就能在3秒甚至更短的时间内给你解密，返回你想要的大部分密文的答案。下面就给大家介绍Ciphey的实战使用教程。1.准备开始之前，你要确保Python和pip已
埃隆·马斯克表示特斯拉“没有必要”授权 xAI 模型喜好儿网人工智能 AIGC 马斯克
埃隆·马斯克近日在社交媒体上对《华尔街日报》的一篇报道进行了反驳。该报道指出，马斯克旗下的电动汽车公司特斯拉可能与人工智能初创公司xAI达成了一项收入分享协议，以便特斯拉能够使用xAI的人工智能模型。据称，这些模型将被集成到特斯拉的全自动驾驶（FSD）软件中，并可能用于开发特斯拉汽车的语音助手以及人形机器人擎天柱的软件。喜好儿网然而，马斯克否认了这一说法，他在社交媒体平台上表示，尽管特斯拉确实与x
Reflection 70B——HyperWrite推出的大型语言模型新加坡内哥谈技术语言模型人工智能自然语言处理
每周跟踪AI热点新闻动向和震撼发展想要探索生成式人工智能的前沿进展吗？订阅我们的简报，深入解析最新的技术突破、实际应用案例和未来的趋势。与全球数同行一同，从行业内部的深度分析和实用指南中受益。不要错过这个机会，成为AI领域的领跑者。点击订阅，与未来同行！订阅：https://rengongzhineng.io/在AI技术飞速发展的过程中，我们已经见证了可以写作、编程，甚至创造艺术的模型问世。但有一
5条实操干货有效打造你的个人品牌长安行动派
这是ZerK的第46篇原创相信大家对个人品牌这个词已经不在陌生。尤其是在知识付费的年代，你的个人品牌，就是你的标签！在《深度工作》中说到，在未来有三种人会越来越贵第一种人:能与机器对话，操纵机器的人。人工智能时代的到来，机器毕竟部分取代人类。第二种人:IP，知识产权或者文学潜在财产就像有些网上课程一周卖出的钱和一个机构卖一年一样多。价值99元的课程，10万人购买，是很常见的。爱产出大概就是10万✖
深入探讨：如何在Python中通过LangChain技术精准追踪大型语言模型（LLM）的Token使用情况 m0_57781768 python langchain 语言模型
深入探讨：如何在Python中通过LangChain技术精准追踪大型语言模型（LLM）的Token使用情况在现代的人工智能开发中，大型语言模型（LLM）已经成为了不可或缺的工具，无论是用于自然语言处理、对话生成，还是其他复杂的文本生成任务。然而，随着这些模型的广泛应用，开发者面临的一个重要挑战是如何有效地追踪和管理Token的使用情况，特别是在生产环境中，Token的使用直接影响着API调用的成本
全能第三方支付对接pay-java-parent 2.12.7 发布,支付聚合 egzosn 支付第三方支付支付聚合支付对接支付pay 微信
全能第三方支付对接Java开发工具包.优雅的轻量级支付模块集成支付对接支付整合（微信,支付宝,银联,友店,富友,跨境支付paypal,payoneer(P卡派安盈)易极付）app,扫码,网页支付刷卡付条码付刷脸付转账服务商模式、支持多种支付类型多支付账户，支付与业务完全剥离，简单几行代码即可实现支付，简单快速完成支付模块的开发，可轻松嵌入到任何系统里目前仅是一个开发工具包（即SDK），只提供简单W
LangChain集成指南:如何利用多样化的AI提供商 aehrutktrjk 人工智能 langchain python
LangChain集成指南:如何利用多样化的AI提供商引言在人工智能和机器学习领域,LangChain已成为一个强大而灵活的框架,允许开发者轻松集成各种AI服务提供商。本文将深入探讨LangChain的集成能力,介绍如何利用不同的AI提供商来增强你的应用程序,并提供实用的代码示例。LangChain集成概览LangChain支持多种AI提供商的集成,这些集成可以分为两类:独立包集成:这些提供商有独
探索未来，大规模分布式深度强化学习——深入解析IMPALA架构汤萌妮Margaret
探索未来，大规模分布式深度强化学习——深入解析IMPALA架构scalable_agent项目地址:https://gitcode.com/gh_mirrors/sc/scalable_agent在当今的人工智能研究前沿，深度强化学习（DRL）因其在复杂任务中的卓越表现而备受瞩目。本文要介绍的是一个开源于GitHub的重量级项目：“ScalableDistributedDeep-RLwithImp
机器学习VS深度学习 nfgo 机器学习
机器学习（MachineLearning,ML）和深度学习（DeepLearning,DL）是人工智能（AI）的两个子领域，它们有许多相似之处，但在技术实现和应用范围上也有显著区别。下面从几个方面对两者进行区分：1.概念层面机器学习：是让计算机通过算法从数据中自动学习和改进的技术。它依赖于手动设计的特征和数学模型来进行学习，常用的模型有决策树、支持向量机、线性回归等。深度学习：是机器学习的一个子领
PHP，安卓，UI，java，linux视频教程合集 cocos2d-x小菜 java UI PHP android linux
╔-----------------------------------╗┆
各表中的列名必须唯一。在表 'dbo.XXX' 中多次指定了列名 'XXX'。 bozch .net .net mvc
在.net mvc5中，在执行某一操作的时候，出现了如下错误：各表中的列名必须唯一。在表 'dbo.XXX' 中多次指定了列名 'XXX'。经查询当前的操作与错误内容无关，经过对错误信息的排查发现，事故出现在数据库迁移上。回想过去：在迁移之前已经对数据库进行了添加字段操作，再次进行迁移插入XXX字段的时候，就会提示如上错误。 &
Java 对象大小的计算 e200702084 java
Java对象的大小如何计算一个对象的大小呢？
Mybatis Spring 171815164 mybatis
ApplicationContext ac = new ClassPathXmlApplicationContext("applicationContext.xml"); CustomerService userService = (CustomerService) ac.getBean("customerService"); Customer cust
JVM 不稳定参数 g21121 jvm
-XX 参数被称为不稳定参数，之所以这么叫是因为此类参数的设置很容易引起JVM 性能上的差异，使JVM 存在极大的不稳定性。当然这是在非合理设置的前提下，如果此类参数设置合理讲大大提高JVM 的性能及稳定性。可以说“不稳定参数”
用户自动登录网站永夜-极光用户
1.目标:实现用户登录后,再次登录就自动登录,无需用户名和密码 2.思路:将用户的信息保存为cookie 每次用户访问网站,通过filter拦截所有请求,在filter中读取所有的cookie,如果找到了保存登录信息的cookie,那么在cookie中读取登录信息,然后直接
centos7 安装后失去win7的引导记录程序员是怎么炼成的操作系统
1.使用root身份(必须)打开 /boot/grub2/grub.cfg 2.找到 ### BEGIN /etc/grub.d/30_os-prober ### 在后面添加 menuentry "Windows 7 (loader) (on /dev/sda1)" {
Oracle 10g 官方中文安装帮助文档以及Oracle官方中文教程文档下载 aijuans oracle
Oracle 10g 官方中文安装帮助文档下载：http://download.csdn.net/tag/Oracle%E4%B8%AD%E6%96%87API%EF%BC%8COracle%E4%B8%AD%E6%96%87%E6%96%87%E6%A1%A3%EF%BC%8Coracle%E5%AD%A6%E4%B9%A0%E6%96%87%E6%A1%A3 Oracle 10g 官方中文教程
JavaEE开源快速开发平台G4Studio_V3.2发布了無為子 AOP oracle mysql javaee G4Studio
我非常高兴地宣布,今天我们最新的JavaEE开源快速开发平台G4Studio_V3.2版本已经正式发布。大家可以通过如下地址下载。访问G4Studio网站 http://www.g4it.org G4Studio_V3.2版本变更日志功能新增 (1).新增了系统右下角滑出提示窗口功能。 (2).新增了文件资源的Zip压缩和解压缩
Oracle常用的单行函数应用技巧总结百合不是茶日期函数转换函数(核心)数字函数通用函数(核心)字符函数
单行函数; 字符函数,数字函数,日期函数,转换函数(核心),通用函数(核心) 一:字符函数: .UPPER(字符串) 将字符串转为大写 .LOWER (字符串) 将字符串转为小写 .INITCAP(字符串) 将首字母大写 .LENGTH (字符串) 字符串的长度 .REPLACE(字符串,'A','_') 将字符串字符A转换成_
Mockito异常测试实例 bijian1013 java 单元测试 mockito
Mockito异常测试实例： package com.bijian.study; import static org.mockito.Mockito.mock; import static org.mockito.Mockito.when; import org.junit.Assert; import org.junit.Test; import org.mockito.
GA与量子恒道统计 Bill_chen JavaScript 浏览器百度 Google 防火墙
前一阵子，统计**网址时，Google Analytics（GA）和量子恒道统计（也称量子统计），数据有较大的偏差，仔细找相关资料研究了下，总结如下：为何GA和量子网站统计（量子统计前身为雅虎统计）结果不同？首先：没有一种网站统计工具能保证百分之百的准确出现该问题可能有以下几个原因：（1）不同的统计分析系统的算法机制不同；（2）统计代码放置的位置和前后
【Linux命令三】Top命令 bit1129 linux命令
Linux的Top命令类似于Windows的任务管理器，可以查看当前系统的运行情况，包括CPU、内存的使用情况等。如下是一个Top命令的执行结果： top - 21:22:04 up 1 day, 23:49, 1 user, load average: 1.10, 1.66, 1.99 Tasks: 202 total, 4 running, 198 sl
spring四种依赖注入方式白糖_ spring
平常的java开发中，程序员在某个类中需要依赖其它类的方法，则通常是new一个依赖类再调用类实例的方法，这种开发存在的问题是new的类实例不好统一管理，spring提出了依赖注入的思想，即依赖类不由程序员实例化，而是通过spring容器帮我们new指定实例并且将实例注入到需要该对象的类中。依赖注入的另一种说法是“控制反转”，通俗的理解是：平常我们new一个实例，这个实例的控制权是我
angular.injector boyitech AngularJS AngularJS API
angular.injector 描述: 创建一个injector对象, 调用injector对象的方法可以获得angular的service, 或者用来做依赖注入. 使用方法: angular.injector(modules, [strictDi]) 参数详解: Param Type Details mod
java-同步访问一个数组Integer[10]，生产者不断地往数组放入整数1000，数组满时等待；消费者不断地将数组里面的数置零，数组空时等待 bylijinnan Integer
public class PC { /** * 题目：生产者-消费者。 * 同步访问一个数组Integer[10]，生产者不断地往数组放入整数1000，数组满时等待；消费者不断地将数组里面的数置零，数组空时等待。 */ private static final Integer[] val=new Integer[10]; private static
使用Struts2.2.1配置 Chen.H apache spring Web xml struts
Struts2.2.1 需要如下 jar包: commons-fileupload-1.2.1.jar commons-io-1.3.2.jar commons-logging-1.0.4.jar freemarker-2.3.16.jar javassist-3.7.ga.jar ognl-3.0.jar spring.jar struts2-core-2.2.1.jar struts2-sp
[职业与教育]青春之歌 comsci 教育
每个人都有自己的青春之歌............但是我要说的却不是青春... 大家如果在自己的职业生涯没有给自己以后创业留一点点机会,仅仅凭学历和人脉关系,是难以在竞争激烈的市场中生存下去的.... &nbs
oracle连接(join)中使用using关键字 daizj JOIN oracle sql using
在oracle连接(join)中使用using关键字 34. View the Exhibit and examine the structure of the ORDERS and ORDER_ITEMS tables. Evaluate the following SQL statement: SELECT oi.order_id, product_id, order_date FRO
NIO示例 daysinsun nio
NIO服务端代码： public class NIOServer { private Selector selector; public void startServer(int port) throws IOException { ServerSocketChannel serverChannel = ServerSocketChannel.open(
C语言学习homework1 dcj3sjt126com c homework
0、课堂练习做完 1、使用sizeof计算出你所知道的所有的类型占用的空间。 int x; sizeof(x); sizeof(int); # include <stdio.h> int main(void) { int x1; char x2; double x3; float x4; printf(&quo
select in order by , mysql排序 dcj3sjt126com mysql
If i select like this: SELECT id FROM users WHERE id IN(3,4,8,1); This by default will select users in this order 1,3,4,8, I would like to select them in the same order that i put IN() values so:
页面校验-新建项目 fanxiaolong 页面校验
$(document).ready( function() { var flag = true; $('#changeform').submit(function() { var projectScValNull = true; var s =""; var parent_id = $("#parent_id").v
Ehcache（02）——ehcache.xml简介 234390216 ehcache ehcache.xml 简介
ehcache.xml简介 ehcache.xml文件是用来定义Ehcache的配置信息的，更准确的来说它是定义CacheManager的配置信息的。根据之前我们在《Ehcache简介》一文中对CacheManager的介绍我们知道一切Ehcache的应用都是从CacheManager开始的。在不指定配置信
junit 4.11中三个新功能 jackyrong java
junit 4.11中两个新增的功能，首先是注解中可以参数化，比如 import static org.junit.Assert.assertEquals; import java.util.Arrays; import org.junit.Test; import org.junit.runner.RunWith; import org.junit.runn
国外程序员爱用苹果Mac电脑的10大理由 php教程分享 windows PHP unix Microsoft perl
Mac 在国外很受欢迎，尤其是在设计/web开发/IT 人员圈子里。普通用户喜欢 Mac 可以理解，毕竟 Mac 设计美观，简单好用，没有病毒。那么为什么专业人士也对 Mac 情有独钟呢？从个人使用经验来看我想有下面几个原因： 1、Mac OS X 是基于 Unix 的这一点太重要了，尤其是对开发人员，至少对于我来说很重要，这意味着Unix 下一堆好用的工具都可以随手捡到。如果你是个 wi
位运算、异或的实际应用 wenjinglian 位运算
一．位操作基础，用一张表描述位操作符的应用规则并详细解释。二．常用位操作小技巧，有判断奇偶、交换两数、变换符号、求绝对值。三．位操作与空间压缩，针对筛素数进行空间压缩。 &n
weblogic部署项目出现的一些问题（持续补充中……） Everyday都不同 weblogic部署失败
好吧，weblogic的问题确实…… 问题一： org.springframework.beans.factory.BeanDefinitionStoreException: Failed to read candidate component class: URL [zip:E:/weblogic/user_projects/domains/base_domain/serve
tomcat7性能调优（01） toknowme tomcat7
Tomcat优化： 1、最大连接数最大线程等设置 <Connector port="8082" protocol="HTTP/1.1" useBodyEncodingForURI="t
PO VO DAO DTO BO TO概念与区别 xp9802 java DAO 设计模式 bean 领域模型
O/R Mapping 是 Object Relational Mapping（对象关系映射）的缩写。通俗点讲，就是将对象与关系数据库绑定，用对象来表示关系数据。在O/R Mapping的世界里，有两个基本的也是重要的东东需要了解，即VO，PO。它们的关系应该是相互独立的，一个VO可以只是PO的部分，也可以是多个PO构成，同样也可以等同于一个PO（指的是他们的属性）。这样，PO独立出来，数据持

kaggle 欺诈信用卡预测——不平衡训练样本的处理方法 综合结论就是：随机森林+过采样（直接复制或者smote后，黑白比例1:3 or 1:1）效果比较好！记得在smote前一定要先做标准化！！！...

先看数据：

Introduction

Classification on the original data

Classification on the SMOTEd data

K-means centroids as a new sample

Hi all as we know credit card fraud detection will have a imbalanced data i.e having more number of normal class than the number of fraud class

In this I will use Basic method of handling imbalance data which are

Lets start with Importing Libraries and data

Now explore the data to get insight in it

ReSampling - Under Sampling

Logistic Regression with Undersample Data

Over Sampling

SMOTE

用Python作信用卡欺诈预测 ——欠采样、效果不好

你可能感兴趣的:(数据结构与算法,人工智能,开发工具)

kaggle 欺诈信用卡预测——不平衡训练样本的处理方法综合结论就是：随机森林+过采样（直接复制或者smote后，黑白比例1:3 or 1:1）效果比较好！记得在smote前一定要先做标准化！！！...