[One-Week Algorithm Advancement] -- Task 1: Data Preprocessing

Task 1: Data Preprocessing

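Note: the dataset comes from the financial domain, and the task is to predict whether a loan customer will become overdue. The "status" column is the target label: 0 means not overdue, 1 means overdue.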

1. Import packages & read the data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
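from sklearn.preprocessing import LabelBinarizer,OneHotEncoder
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer as Imputer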

%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
data_original=pd.read_csv('data.csv',skipinitialspace=True)

The CSV file has to be saved in UTF-8 encoding before it can be read this way.
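If the file was exported in another encoding, one option is to re-encode it first. The following is a minimal sketch, assuming the original file is GBK-encoded (an assumption; the post does not say which encoding the file originally used). The helper `convert_to_utf8` and the file names are hypothetical.

```python
import os
import tempfile

import pandas as pd


def convert_to_utf8(src_path, dst_path, src_encoding="gbk"):
    """Re-encode a text file to UTF-8 (hypothetical helper; src_encoding is an assumption)."""
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Build a tiny GBK-encoded CSV that stands in for the real data file.
tmp_dir = tempfile.mkdtemp()
gbk_path = os.path.join(tmp_dir, "data_gbk.csv")
utf8_path = os.path.join(tmp_dir, "data_utf8.csv")
with open(gbk_path, "w", encoding="gbk", newline="") as f:
    f.write("reg_preference_for_trad,status\n一线城市,1\n三线城市,0\n")

# Re-encode, then read with pandas' default UTF-8 decoding.
convert_to_utf8(gbk_path, utf8_path)
demo = pd.read_csv(utf8_path, skipinitialspace=True)
print(demo)
```

Passing `encoding="gbk"` directly to `pd.read_csv` would avoid the extra file, as long as the source encoding is known.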

data=data_original.copy()
data.head(5)
Unnamed: 0 custid trade_no bank_card_no low_volume_percent middle_volume_percent take_amount_in_later_12_month_highest trans_amount_increase_rate_lately trans_activity_month trans_activity_day transd_mcc trans_days_interval_filter trans_days_interval regional_mobility student_feature repayment_capability is_high_user number_of_trans_from_2011 first_transaction_time historical_trans_amount historical_trans_day rank_trad_1_month trans_amount_3_month avg_consume_less_12_valid_month abs top_trans_count_last_1_month avg_price_last_12_month avg_price_top_last_12_valid_month reg_preference_for_trad trans_top_time_last_1_month trans_top_time_last_6_month consume_top_time_last_1_month consume_top_time_last_6_month cross_consume_count_last_1_month trans_fail_top_count_enum_last_1_month trans_fail_top_count_enum_last_6_month trans_fail_top_count_enum_last_12_month consume_mini_time_last_1_month max_cumulative_consume_later_1_month max_consume_count_later_6_month railway_consume_count_last_12_month pawns_auctions_trusts_consume_last_1_month pawns_auctions_trusts_consume_last_6_month jewelry_consume_count_last_6_month status source first_transaction_day trans_day_last_12_month id_name apply_score apply_credibility query_org_count query_finance_count query_cash_count query_sum_count latest_query_time latest_one_month_apply latest_three_month_apply latest_six_month_apply loans_score loans_credibility_behavior loans_count loans_settle_count loans_overdue_count loans_org_count_behavior consfin_org_count_behavior loans_cash_count latest_one_month_loan latest_three_month_loan latest_six_month_loan history_suc_fee history_fail_fee latest_one_month_suc latest_one_month_fail loans_long_time loans_latest_time loans_credit_limit loans_credibility_limit loans_org_count_current loans_product_count loans_max_limit loans_avg_limit consfin_credit_limit consfin_credibility consfin_org_count_current consfin_product_count consfin_max_limit consfin_avg_limit latest_query_day loans_latest_day
0 5 2791858 20180507115231274000000023057383 卡号1 0.01 0.99 0 0.90 0.55 0.313 17.0 27.0 26.0 3.0 NaN 19890 0 30.0 20130817.0 149050 151.0 0.40 34030 7.0 3920 0.15 1020 0.55 一线城市 4.0 19.0 4.0 19.0 1.0 1.0 2.0 2.0 5.0 2170 6.0 0.0 1970 18040 0.0 1 xs 1738.0 85.0 蒋红 583.0 79.0 8.0 2.0 6.0 10.0 2018-04-25 2.0 5.0 8.0 552.0 73.0 37.0 34.0 2.0 10.0 1.0 9.0 1.0 1.0 13.0 37.0 7.0 1.0 0.0 341.0 2018-04-19 2200.0 72.0 9.0 10.0 2900.0 1688.0 1200.0 75.0 1.0 2.0 1200.0 1200.0 12.0 18.0
1 10 534047 20180507121002192000000023073000 卡号1 0.02 0.94 2000 1.28 1.00 0.458 19.0 30.0 14.0 4.0 1.0 16970 0 23.0 20160402.0 302910 224.0 0.35 10590 5.0 6950 0.05 1210 0.50 一线城市 13.0 30.0 13.0 30.0 0.0 0.0 3.0 3.0 330.0 2100 9.0 0.0 1820 15680 0.0 0 xs 779.0 84.0 崔向朝 653.0 73.0 7.0 4.0 2.0 8.0 2018-05-03 2.0 6.0 8.0 635.0 76.0 37.0 36.0 0.0 17.0 5.0 12.0 2.0 2.0 8.0 49.0 4.0 2.0 1.0 353.0 2018-05-05 2000.0 74.0 12.0 12.0 3500.0 1758.0 15100.0 80.0 5.0 6.0 22800.0 9360.0 4.0 2.0
2 12 2849787 20180507125159718000000023114911 卡号1 0.04 0.96 0 1.00 1.00 0.114 13.0 68.0 22.0 1.0 NaN 9710 0 9.0 20170617.0 11520 31.0 1.00 5710 5.0 840 0.65 570 0.65 一线城市 0.0 68.0 0.0 68.0 0.0 3.0 6.0 6.0 0.0 0 3.0 0.0 0 0 0.0 1 xs 338.0 95.0 王中云 654.0 76.0 11.0 5.0 5.0 16.0 2018-05-05 5.0 5.0 14.0 633.0 83.0 4.0 2.0 0.0 3.0 1.0 2.0 2.0 2.0 4.0 2.0 2.0 1.0 1.0 157.0 2018-05-01 1500.0 77.0 2.0 2.0 1600.0 1250.0 4200.0 87.0 1.0 1.0 4200.0 4200.0 2.0 6.0
3 13 1809708 20180507121358683000000388283484 卡号1 0.00 0.96 2000 0.13 0.57 0.777 22.0 14.0 6.0 3.0 NaN 6210 0 33.0 20130516.0 491130 360.0 0.15 91690 7.0 46850 0.05 1290 0.45 三线城市 6.0 8.0 6.0 8.0 0.0 1.0 8.0 8.0 31700.0 8140 9.0 0.0 2700 27970 0.0 0 xs 1831.0 82.0 何洋洋 595.0 79.0 12.0 7.0 4.0 22.0 2018-05-05 3.0 16.0 17.0 542.0 75.0 85.0 81.0 4.0 22.0 5.0 17.0 2.0 4.0 34.0 91.0 26.0 2.0 0.0 355.0 2018-05-03 1800.0 74.0 17.0 18.0 3200.0 1541.0 16300.0 80.0 5.0 5.0 30000.0 12180.0 2.0 4.0
4 14 2499829 20180507115448545000000388205844 卡号1 0.01 0.99 0 0.46 1.00 0.175 13.0 66.0 42.0 1.0 NaN 11150 0 12.0 20170312.0 61470 63.0 0.65 9770 6.0 760 1.00 1110 0.50 一线城市 0.0 66.0 0.0 66.0 0.0 3.0 3.0 3.0 0.0 1000 3.0 0.0 0 6410 0.0 1 xs 435.0 88.0 赵洋 541.0 75.0 11.0 3.0 4.0 14.0 2018-04-15 6.0 8.0 9.0 479.0 73.0 37.0 32.0 6.0 12.0 2.0 10.0 0.0 0.0 10.0 36.0 25.0 0.0 0.0 360.0 2018-01-07 1800.0 72.0 10.0 10.0 2300.0 1630.0 8300.0 79.0 2.0 2.0 8400.0 8250.0 22.0 120.0

2. Exploratory data analysis (EDA)

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

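When a DataFrame has many rows or columns, print truncates the output.
#Show all columns: pd.set_option('display.max_columns', None)
#Show all rows: pd.set_option('display.max_rows', None)
#Set the display width of a value to 100 (the default is 50): pd.set_option('display.max_colwidth', 100)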
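These settings are global. As a hedged alternative, `pd.option_context` applies them only inside a `with` block and restores the previous values afterwards. The sketch below uses a made-up 30-column frame named `wide`.

```python
import pandas as pd

# A made-up wide frame; the real data has 90 columns.
wide = pd.DataFrame({f"col_{i}": [i] for i in range(30)})

before = pd.get_option("display.max_columns")
with pd.option_context("display.max_columns", None, "display.max_colwidth", 100):
    # Inside the block every column is printed.
    inside = pd.get_option("display.max_columns")
    print(wide)
# Outside the block the previous setting is back.
after = pd.get_option("display.max_columns")
```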

data.describe()
Unnamed: 0 custid low_volume_percent middle_volume_percent take_amount_in_later_12_month_highest trans_amount_increase_rate_lately trans_activity_month trans_activity_day transd_mcc trans_days_interval_filter trans_days_interval regional_mobility student_feature repayment_capability is_high_user number_of_trans_from_2011 first_transaction_time historical_trans_amount historical_trans_day rank_trad_1_month trans_amount_3_month avg_consume_less_12_valid_month abs top_trans_count_last_1_month avg_price_last_12_month avg_price_top_last_12_valid_month trans_top_time_last_1_month trans_top_time_last_6_month consume_top_time_last_1_month consume_top_time_last_6_month cross_consume_count_last_1_month trans_fail_top_count_enum_last_1_month trans_fail_top_count_enum_last_6_month trans_fail_top_count_enum_last_12_month consume_mini_time_last_1_month max_cumulative_consume_later_1_month max_consume_count_later_6_month railway_consume_count_last_12_month pawns_auctions_trusts_consume_last_1_month pawns_auctions_trusts_consume_last_6_month jewelry_consume_count_last_6_month status first_transaction_day trans_day_last_12_month apply_score apply_credibility query_org_count query_finance_count query_cash_count query_sum_count latest_one_month_apply latest_three_month_apply latest_six_month_apply loans_score loans_credibility_behavior loans_count loans_settle_count loans_overdue_count loans_org_count_behavior consfin_org_count_behavior loans_cash_count latest_one_month_loan latest_three_month_loan latest_six_month_loan history_suc_fee history_fail_fee latest_one_month_suc latest_one_month_fail loans_long_time loans_credit_limit loans_credibility_limit loans_org_count_current loans_product_count loans_max_limit loans_avg_limit consfin_credit_limit consfin_credibility consfin_org_count_current consfin_product_count consfin_max_limit consfin_avg_limit latest_query_day loans_latest_day
count 4754.000000 4.754000e+03 4752.000000 4752.000000 4754.000000 4751.000000 4752.000000 4752.000000 4752.000000 4746.000000 4752.000000 4752.000000 1756.000000 4.754000e+03 4754.000000 4752.000000 4.752000e+03 4.754000e+03 4752.000000 4752.000000 4.754000e+03 4752.000000 4754.000000 4752.000000 4754.000000 4650.000000 4746.000000 4746.000000 4746.000000 4746.000000 4328.000000 4738.000000 4738.000000 4738.000000 4.728000e+03 4754.000000 4746.000000 4742.000000 4754.000000 4754.000000 4742.000000 4754.000000 4752.000000 4752.000000 4450.000000 4450.000000 4450.000000 4450.000000 4450.000000 4450.000000 4450.000000 4450.000000 4450.000000 4457.000000 4457.000000 4457.000000 4457.000000 4457.000000 4457.000000 4457.000000 4457.000000 4457.000000 4457.000000 4457.000000 4457.000000 4457.000000 4457.000000 4457.000000 4457.000000 4457.000000 4457.000000 4457.000000 4457.000000 4457.000000 4457.000000 4457.000000 4457.000000 4457.000000 4457.000000 4457.000000 4457.000000 4450.000000 4457.000000
mean 6008.414178 1.690993e+06 0.021806 0.901294 1940.197728 14.160674 0.804411 0.365425 17.502946 29.029920 21.751263 2.678662 1.001139 1.870201e+04 0.011149 23.033880 2.015109e+07 2.307359e+05 176.109428 0.476926 3.896430e+04 6.572601 9344.350021 0.355745 1237.088767 0.514667 7.134008 20.174673 7.047198 20.649600 0.642329 1.656184 4.529759 5.232165 1.553622e+05 2886.964661 6.055626 0.030789 1321.201094 18958.460244 0.014340 0.250947 1036.274621 89.006944 576.632584 75.998876 11.974382 6.020000 3.784719 16.891236 4.329438 8.771910 12.364270 543.205968 75.438636 35.952210 31.039937 2.308952 12.845412 4.732331 8.113081 0.965896 2.821853 13.926857 43.145614 17.708548 1.224366 1.311420 335.159973 2089.297734 71.992372 8.113081 8.685214 3390.038142 1820.357864 9187.009199 76.042630 4.732331 5.227507 16153.690823 8007.696881 24.112809 55.181512
std 3452.071428 1.034235e+06 0.041527 0.144856 3923.971494 694.180473 0.196920 0.170196 4.475616 22.722432 16.474916 0.890360 0.033739 5.221783e+04 0.105007 10.057837 1.480487e+04 3.204931e+05 99.687285 0.263769 1.017461e+05 1.390723 27007.597886 0.350595 765.873649 0.100397 5.318254 12.962979 5.456050 13.125224 2.343228 1.908887 4.455923 4.756974 3.742672e+05 10813.451908 5.684529 0.478499 6616.691843 28191.132260 0.201777 0.433603 537.108729 19.069927 51.167375 4.168916 7.041493 3.805369 2.599244 11.299787 4.525521 7.621961 9.274982 60.954266 2.231822 24.614363 21.694068 3.152881 7.448393 2.974596 5.374465 1.495566 3.455817 10.828475 30.353618 25.089348 1.944912 3.893607 35.770102 708.951406 10.851926 5.374465 5.759025 1474.206546 583.418291 7371.257043 14.536819 2.974596 3.409292 14301.037628 5679.418585 37.725724 53.486408
min 5.000000 1.140000e+02 0.000000 0.000000 0.000000 0.000000 0.120000 0.033000 2.000000 0.000000 4.000000 1.000000 1.000000 0.000000e+00 0.000000 1.000000 2.011010e+07 0.000000e+00 2.000000 0.050000 0.000000e+00 0.000000 0.000000 0.050000 0.000000 0.050000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 127.000000 82.000000 450.000000 50.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 413.000000 56.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 26.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -2.000000 -2.000000
25% 3106.000000 7.593358e+05 0.010000 0.880000 0.000000 0.615000 0.670000 0.233000 15.000000 16.000000 12.000000 2.000000 1.000000 8.590000e+03 0.000000 16.000000 2.014102e+07 7.949750e+04 102.000000 0.300000 1.168250e+04 6.000000 1290.000000 0.087500 920.000000 0.450000 3.250000 12.000000 3.000000 12.000000 0.000000 0.000000 2.000000 2.000000 0.000000e+00 700.000000 3.000000 0.000000 0.000000 5252.500000 0.000000 0.000000 632.000000 82.000000 535.000000 74.000000 7.000000 3.000000 2.000000 9.000000 1.000000 3.000000 6.000000 493.000000 74.000000 17.000000 15.000000 0.000000 7.000000 2.000000 4.000000 0.000000 0.000000 6.000000 21.000000 3.000000 0.000000 0.000000 329.000000 1700.000000 72.000000 4.000000 4.000000 2300.000000 1535.000000 4800.000000 77.000000 2.000000 3.000000 7800.000000 4737.000000 5.000000 10.000000
50% 6006.500000 1.634942e+06 0.010000 0.960000 500.000000 0.970000 0.860000 0.350000 17.000000 23.000000 17.000000 3.000000 1.000000 1.221000e+04 0.000000 21.000000 2.015111e+07 1.623350e+05 160.000000 0.450000 2.555500e+04 7.000000 3345.000000 0.200000 1140.000000 0.500000 7.000000 17.000000 7.000000 18.000000 0.000000 1.000000 3.000000 4.000000 2.400000e+01 1530.000000 5.000000 0.000000 70.000000 12725.000000 0.000000 0.000000 919.000000 83.000000 549.000000 76.000000 11.000000 5.000000 3.000000 15.000000 3.000000 7.000000 10.000000 511.000000 75.000000 31.000000 27.000000 1.000000 12.000000 4.000000 7.000000 0.000000 2.000000 11.000000 37.000000 10.000000 0.000000 0.000000 349.000000 2100.000000 74.000000 7.000000 8.000000 3100.000000 1810.000000 7700.000000 79.000000 4.000000 5.000000 13800.000000 7050.000000 14.000000 36.000000
75% 8999.000000 2.597905e+06 0.020000 0.990000 2000.000000 1.600000 1.000000 0.480000 20.000000 32.000000 27.000000 3.000000 1.000000 1.764750e+04 0.000000 29.000000 2.016083e+07 2.985600e+05 231.000000 0.600000 4.795000e+04 7.000000 8067.500000 0.650000 1400.000000 0.550000 10.000000 26.000000 10.000000 26.000000 1.000000 2.000000 6.000000 6.000000 7.478850e+04 2760.000000 7.000000 0.000000 980.000000 23740.000000 0.000000 1.000000 1310.250000 87.000000 629.000000 78.000000 16.000000 8.000000 5.000000 23.000000 6.000000 12.000000 17.000000 602.000000 77.000000 50.000000 43.000000 3.000000 17.000000 7.000000 11.000000 1.000000 4.000000 20.000000 59.000000 22.000000 2.000000 1.000000 356.000000 2400.000000 75.000000 11.000000 12.000000 4300.000000 2100.000000 11700.000000 80.000000 7.000000 7.000000 20400.000000 10000.000000 24.000000 91.000000
max 11992.000000 4.004694e+06 1.000000 1.000000 68000.000000 47596.740000 1.000000 0.941000 42.000000 285.000000 234.000000 5.000000 2.000000 2.459390e+06 1.000000 85.000000 2.018011e+07 1.360130e+07 907.000000 1.000000 6.024100e+06 11.000000 918450.000000 1.000000 23140.000000 1.000000 27.000000 124.000000 27.000000 151.000000 69.000000 30.000000 120.000000 120.000000 2.392316e+06 496010.000000 147.000000 30.000000 238380.000000 525360.000000 6.000000 1.000000 2697.000000 382.000000 687.000000 93.000000 54.000000 24.000000 16.000000 98.000000 38.000000 75.000000 80.000000 688.000000 85.000000 158.000000 154.000000 25.000000 41.000000 18.000000 31.000000 15.000000 52.000000 74.000000 254.000000 345.000000 20.000000 58.000000 360.000000 6900.000000 89.000000 31.000000 32.000000 10000.000000 6900.000000 87100.000000 87.000000 18.000000 20.000000 266400.000000 82800.000000 360.000000 323.000000
data.info()

RangeIndex: 4754 entries, 0 to 4753
Data columns (total 90 columns):
Unnamed: 0                                    4754 non-null int64
custid                                        4754 non-null int64
trade_no                                      4754 non-null object
bank_card_no                                  4754 non-null object
low_volume_percent                            4752 non-null float64
middle_volume_percent                         4752 non-null float64
take_amount_in_later_12_month_highest         4754 non-null int64
trans_amount_increase_rate_lately             4751 non-null float64
trans_activity_month                          4752 non-null float64
trans_activity_day                            4752 non-null float64
transd_mcc                                    4752 non-null float64
trans_days_interval_filter                    4746 non-null float64
trans_days_interval                           4752 non-null float64
regional_mobility                             4752 non-null float64
student_feature                               1756 non-null float64
repayment_capability                          4754 non-null int64
is_high_user                                  4754 non-null int64
number_of_trans_from_2011                     4752 non-null float64
first_transaction_time                        4752 non-null float64
historical_trans_amount                       4754 non-null int64
historical_trans_day                          4752 non-null float64
rank_trad_1_month                             4752 non-null float64
trans_amount_3_month                          4754 non-null int64
avg_consume_less_12_valid_month               4752 non-null float64
abs                                           4754 non-null int64
top_trans_count_last_1_month                  4752 non-null float64
avg_price_last_12_month                       4754 non-null int64
avg_price_top_last_12_valid_month             4650 non-null float64
reg_preference_for_trad                       4752 non-null object
trans_top_time_last_1_month                   4746 non-null float64
trans_top_time_last_6_month                   4746 non-null float64
consume_top_time_last_1_month                 4746 non-null float64
consume_top_time_last_6_month                 4746 non-null float64
cross_consume_count_last_1_month              4328 non-null float64
trans_fail_top_count_enum_last_1_month        4738 non-null float64
trans_fail_top_count_enum_last_6_month        4738 non-null float64
trans_fail_top_count_enum_last_12_month       4738 non-null float64
consume_mini_time_last_1_month                4728 non-null float64
max_cumulative_consume_later_1_month          4754 non-null int64
max_consume_count_later_6_month               4746 non-null float64
railway_consume_count_last_12_month           4742 non-null float64
pawns_auctions_trusts_consume_last_1_month    4754 non-null int64
pawns_auctions_trusts_consume_last_6_month    4754 non-null int64
jewelry_consume_count_last_6_month            4742 non-null float64
status                                        4754 non-null int64
source                                        4754 non-null object
first_transaction_day                         4752 non-null float64
trans_day_last_12_month                       4752 non-null float64
id_name                                       4478 non-null object
apply_score                                   4450 non-null float64
apply_credibility                             4450 non-null float64
query_org_count                               4450 non-null float64
query_finance_count                           4450 non-null float64
query_cash_count                              4450 non-null float64
query_sum_count                               4450 non-null float64
latest_query_time                             4450 non-null object
latest_one_month_apply                        4450 non-null float64
latest_three_month_apply                      4450 non-null float64
latest_six_month_apply                        4450 non-null float64
loans_score                                   4457 non-null float64
loans_credibility_behavior                    4457 non-null float64
loans_count                                   4457 non-null float64
loans_settle_count                            4457 non-null float64
loans_overdue_count                           4457 non-null float64
loans_org_count_behavior                      4457 non-null float64
consfin_org_count_behavior                    4457 non-null float64
loans_cash_count                              4457 non-null float64
latest_one_month_loan                         4457 non-null float64
latest_three_month_loan                       4457 non-null float64
latest_six_month_loan                         4457 non-null float64
history_suc_fee                               4457 non-null float64
history_fail_fee                              4457 non-null float64
latest_one_month_suc                          4457 non-null float64
latest_one_month_fail                         4457 non-null float64
loans_long_time                               4457 non-null float64
loans_latest_time                             4457 non-null object
loans_credit_limit                            4457 non-null float64
loans_credibility_limit                       4457 non-null float64
loans_org_count_current                       4457 non-null float64
loans_product_count                           4457 non-null float64
loans_max_limit                               4457 non-null float64
loans_avg_limit                               4457 non-null float64
consfin_credit_limit                          4457 non-null float64
consfin_credibility                           4457 non-null float64
consfin_org_count_current                     4457 non-null float64
consfin_product_count                         4457 non-null float64
consfin_max_limit                             4457 non-null float64
consfin_avg_limit                             4457 non-null float64
latest_query_day                              4450 non-null float64
loans_latest_day                              4457 non-null float64
dtypes: float64(70), int64(13), object(7)
memory usage: 3.3+ MB

The dataset has the dtypes float64(70), int64(13), object(7), and some features have missing values.
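To see how much is missing in each feature, the non-null counts can be turned into a missing ratio. This is a minimal sketch on a made-up frame named `toy`; the column names are borrowed from the dataset, but the values are invented.

```python
import numpy as np
import pandas as pd

# Invented values; only the column names come from the dataset.
toy = pd.DataFrame({
    "student_feature": [1.0, np.nan, np.nan, np.nan],
    "apply_score": [583.0, 653.0, np.nan, 595.0],
    "status": [1, 0, 1, 0],
})

# isnull() gives booleans, so the column mean is the fraction of missing values.
missing_ratio = toy.isnull().mean().sort_values(ascending=False)
print(missing_ratio)
```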

(1) Dropping useless features

data.nunique()
Unnamed: 0                                    4754
custid                                        4754
trade_no                                      4754
bank_card_no                                     1
low_volume_percent                              40
middle_volume_percent                           90
take_amount_in_later_12_month_highest          166
trans_amount_increase_rate_lately              782
trans_activity_month                            84
trans_activity_day                             512
transd_mcc                                      41
trans_days_interval_filter                     147
trans_days_interval                            114
regional_mobility                                5
student_feature                                  2
repayment_capability                          2390
is_high_user                                     2
number_of_trans_from_2011                       70
first_transaction_time                        1693
historical_trans_amount                       4524
historical_trans_day                           476
rank_trad_1_month                               20
trans_amount_3_month                          3524
avg_consume_less_12_valid_month                 12
abs                                           1697
top_trans_count_last_1_month                     8
avg_price_last_12_month                        330
avg_price_top_last_12_valid_month               20
reg_preference_for_trad                          5
trans_top_time_last_1_month                     28
trans_top_time_last_6_month                     97
consume_top_time_last_1_month                   28
consume_top_time_last_6_month                   94
cross_consume_count_last_1_month                19
trans_fail_top_count_enum_last_1_month          15
trans_fail_top_count_enum_last_6_month          25
trans_fail_top_count_enum_last_12_month         26
consume_mini_time_last_1_month                1971
max_cumulative_consume_later_1_month           863
max_consume_count_later_6_month                 29
railway_consume_count_last_12_month              6
pawns_auctions_trusts_consume_last_1_month     572
pawns_auctions_trusts_consume_last_6_month    2730
jewelry_consume_count_last_6_month               7
status                                           2
source                                           1
first_transaction_day                         1693
trans_day_last_12_month                        132
id_name                                       4309
apply_score                                    205
apply_credibility                               41
query_org_count                                 46
query_finance_count                             25
query_cash_count                                17
query_sum_count                                 74
latest_query_time                              207
latest_one_month_apply                          36
latest_three_month_apply                        56
latest_six_month_apply                          65
loans_score                                    247
loans_credibility_behavior                      25
loans_count                                    134
loans_settle_count                             123
loans_overdue_count                             26
loans_org_count_behavior                        41
consfin_org_count_behavior                      19
loans_cash_count                                32
latest_one_month_loan                           14
latest_three_month_loan                         31
latest_six_month_loan                           67
history_suc_fee                                171
history_fail_fee                               151
latest_one_month_suc                            19
latest_one_month_fail                           41
loans_long_time                                202
loans_latest_time                              232
loans_credit_limit                              54
loans_credibility_limit                         33
loans_org_count_current                         32
loans_product_count                             32
loans_max_limit                                 91
loans_avg_limit                                961
consfin_credit_limit                           327
consfin_credibility                             24
consfin_org_count_current                       19
consfin_product_count                           20
consfin_max_limit                              175
consfin_avg_limit                             1677
latest_query_day                               210
loans_latest_day                               235
dtype: int64

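First look at the number of unique values per column (nunique). This mainly helps decide which features to one-hot encode and which to drop (chiefly those whose value is identical in every row). 'source' and 'bank_card_no' each have a single value, so they can be dropped directly. 'Unnamed: 0', 'custid' and 'trade_no' each have 4754 unique values, one per row, and their names show they are identifiers, so they can be dropped as well. The drop below also removes 'id_name', which holds customer names.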
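The same screening can be sketched in code. The frame `toy` below is invented, and the names `constant_cols` and `id_like_cols` are only illustrative.

```python
import pandas as pd

# Invented rows; column names mirror a few columns of the dataset.
toy = pd.DataFrame({
    "custid": [101, 102, 103, 104],
    "source": ["xs", "xs", "xs", "xs"],
    "regional_mobility": [3, 4, 1, 3],
    "status": [1, 0, 1, 0],
})

n_unique = toy.nunique()
# Columns with a single value carry no information.
constant_cols = n_unique[n_unique == 1].index.tolist()
# Columns with one distinct value per row are candidates for identifiers.
id_like_cols = n_unique[n_unique == len(toy)].index.tolist()
print(constant_cols, id_like_cols)
```

A column with one distinct value per row is not necessarily an identifier (a continuous amount can also be all-unique), so the column name still needs a manual check, as done above.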

data.drop(['Unnamed: 0', 'custid', 'trade_no', 'bank_card_no', 'source','id_name'], axis=1, inplace=True)

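If dropping columns from the DataFrame fails with a column-not-found error, add the parameter skipinitialspace=True when reading the CSV file. It strips the spaces that follow each delimiter, which otherwise end up at the start of the column names.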
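The effect can be seen on a small in-memory CSV. This is a sketch with invented content; whether the real file has spaces after its commas is an assumption.

```python
import io

import pandas as pd

# Invented CSV text with a space after every comma.
raw = "custid, source, status\n1, xs, 0\n2, xs, 1\n"

plain = pd.read_csv(io.StringIO(raw))
stripped = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

print(plain.columns.tolist())     # names keep the leading space
print(stripped.columns.tolist())  # names are clean

# drop() only finds the column when the name matches exactly.
stripped = stripped.drop(["source"], axis=1)
```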

data.shape
(4754, 84)

(2) Data type conversion

Selecting the object-type columns

object_cols = [col for col in data.columns if data[col].dtypes == 'O']
object_cols

#data.select_dtypes(include=[object]).columns
['reg_preference_for_trad', 'latest_query_time', 'loans_latest_time']
data[object_cols].head(5)
reg_preference_for_trad latest_query_time loans_latest_time
0 一线城市 2018-04-25 2018-04-19
1 一线城市 2018-05-03 2018-05-05
2 一线城市 2018-05-05 2018-05-01
3 三线城市 2018-05-05 2018-05-03
4 一线城市 2018-04-15 2018-01-07
data_obj=data[object_cols].copy()  # copy so that later in-place edits do not act on a view of data
data_num=data.drop(object_cols,axis=1)

(3) Filling missing values

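Missing values can be handled in two broad ways: deletion and filling. Deletion removes either rows (samples) or columns (features). We have already dropped some useless features, and most of the remaining features have few missing values, so we do not delete anything here. (student_feature is the exception: only 1756 of its 4754 values are present.)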
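For reference, the deletion route that is not taken here could look like the sketch below. The frame `toy` is invented and the 0.5 threshold `max_missing_ratio` is a hypothetical choice, not a value from the post.

```python
import numpy as np
import pandas as pd

# Invented data: column "b" is mostly missing, row 1 is entirely missing.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, np.nan, 2.0, 3.0],
})

max_missing_ratio = 0.5  # hypothetical threshold

# Delete columns (features) whose missing ratio exceeds the threshold.
keep_cols = toy.columns[toy.isnull().mean() <= max_missing_ratio]
fewer_cols = toy[keep_cols]

# Delete rows (samples) that have no values left at all.
fewer_rows = fewer_cols.dropna(axis=0, how="all")
print(fewer_rows)
```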

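There are many ways to fill missing values, and the choice depends on the feature. Common ones are mean filling, mode filling, median filling and forward filling (carrying the previous value forward).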
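The four strategies can be compared on one small series. This is a sketch with invented numbers, using plain pandas calls.

```python
import numpy as np
import pandas as pd

# Invented values with two gaps.
s = pd.Series([1.0, np.nan, 3.0, 3.0, np.nan])

mean_filled = s.fillna(s.mean())            # mean of the observed values
median_filled = s.fillna(s.median())        # median of the observed values
mode_filled = s.fillna(s.mode().iloc[0])    # most frequent observed value
forward_filled = s.ffill()                  # previous observed value

print(pd.DataFrame({
    "mean": mean_filled,
    "median": median_filled,
    "mode": mode_filled,
    "ffill": forward_filled,
}))
```

Forward filling leaves a gap if the first value is missing, since there is nothing before it to carry forward.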

imputer=Imputer(strategy='mean')
mean_num=imputer.fit_transform(data_num)
data_num=pd.DataFrame(mean_num,columns=data_num.columns)
data_obj.ffill(inplace=True)
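On scikit-learn 0.22 or later, `sklearn.preprocessing.Imputer` no longer exists and `sklearn.impute.SimpleImputer` is its replacement. The sketch below shows mean imputation with it on an invented two-column frame `toy_num`; it is an illustration, not the exact code the post ran.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented numeric data with one gap per column.
toy_num = pd.DataFrame({
    "apply_score": [583.0, np.nan, 654.0],
    "loans_score": [552.0, 635.0, np.nan],
})

imputer = SimpleImputer(strategy="mean")
# fit_transform returns a NumPy array, so the column names are restored by hand.
filled = pd.DataFrame(
    imputer.fit_transform(toy_num),
    columns=toy_num.columns,
    index=toy_num.index,
)
print(filled)
```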

Converting the object-type columns

encoder = LabelBinarizer()
reg_preference_1hot = encoder.fit_transform(data_obj['reg_preference_for_trad'])  # LabelBinarizer expects a 1-D column
data_obj.drop(['reg_preference_for_trad'], axis=1, inplace=True)
reg_preference_df = pd.DataFrame(reg_preference_1hot, columns=encoder.classes_)
data_obj = pd.concat([data_obj, reg_preference_df], axis=1)
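As a swapped-in alternative to LabelBinarizer (not what the post uses), `pd.get_dummies` produces the same kind of one-hot columns in a single call. The labels in `toy` are invented placeholders for the city-tier values.

```python
import pandas as pd

# Placeholder labels standing in for the city-tier categories.
toy = pd.DataFrame({
    "reg_preference_for_trad": ["tier1", "tier3", "tier2", "tier1"],
    "latest_query_time": ["2018-04-25", "2018-05-03", "2018-05-05", "2018-05-05"],
})

# One indicator column per category, ordered by category label.
dummies = pd.get_dummies(toy["reg_preference_for_trad"], dtype=int)

# Replace the original column with its indicator columns.
encoded = pd.concat([toy.drop(["reg_preference_for_trad"], axis=1), dummies], axis=1)
print(encoded)
```

One difference worth knowing: with exactly two categories LabelBinarizer returns a single 0/1 column, while `get_dummies` still returns one column per category.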

data_obj['latest_query_time'] = pd.to_datetime(data_obj['latest_query_time'])
data_obj['latest_query_time_month'] = data_obj['latest_query_time'].dt.month
data_obj['latest_query_time_weekday'] = data_obj['latest_query_time'].dt.weekday

data_obj['loans_latest_time'] = pd.to_datetime(data_obj['loans_latest_time'])
data_obj['loans_latest_time_month'] = data_obj['loans_latest_time'].dt.month
data_obj['loans_latest_time_weekday'] = data_obj['loans_latest_time'].dt.weekday

data_obj = data_obj.drop(['latest_query_time', 'loans_latest_time'], axis=1)
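The date handling above can be checked on a few dates taken from the sample rows. In this sketch the frame name `features` is illustrative; `dt.weekday` counts Monday as 0 and Sunday as 6.

```python
import pandas as pd

# Three dates that appear in the first rows of the dataset.
dates = pd.to_datetime(pd.Series(["2018-04-25", "2018-05-03", "2018-01-07"]))

features = pd.DataFrame({
    "month": dates.dt.month,      # 1 to 12
    "weekday": dates.dt.weekday,  # Monday=0 ... Sunday=6
})
print(features)
```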

data_obj.head(5)
一线城市 三线城市 二线城市 其他城市 境外 latest_query_time_month latest_query_time_weekday loans_latest_time_month loans_latest_time_weekday
0 1 0 0 0 0 4 2 4 3
1 1 0 0 0 0 5 3 5 5
2 1 0 0 0 0 5 5 5 1
3 0 1 0 0 0 5 5 5 3
4 1 0 0 0 0 4 6 1 6
data=pd.concat([data_num,data_obj],axis=1)
data.shape
(4754, 90)
data.info()

RangeIndex: 4754 entries, 0 to 4753
Data columns (total 90 columns):
low_volume_percent                            4754 non-null float64
middle_volume_percent                         4754 non-null float64
take_amount_in_later_12_month_highest         4754 non-null float64
trans_amount_increase_rate_lately             4754 non-null float64
trans_activity_month                          4754 non-null float64
trans_activity_day                            4754 non-null float64
transd_mcc                                    4754 non-null float64
trans_days_interval_filter                    4754 non-null float64
trans_days_interval                           4754 non-null float64
regional_mobility                             4754 non-null float64
student_feature                               4754 non-null float64
repayment_capability                          4754 non-null float64
is_high_user                                  4754 non-null float64
number_of_trans_from_2011                     4754 non-null float64
first_transaction_time                        4754 non-null float64
historical_trans_amount                       4754 non-null float64
historical_trans_day                          4754 non-null float64
rank_trad_1_month                             4754 non-null float64
trans_amount_3_month                          4754 non-null float64
avg_consume_less_12_valid_month               4754 non-null float64
abs                                           4754 non-null float64
top_trans_count_last_1_month                  4754 non-null float64
avg_price_last_12_month                       4754 non-null float64
avg_price_top_last_12_valid_month             4754 non-null float64
trans_top_time_last_1_month                   4754 non-null float64
trans_top_time_last_6_month                   4754 non-null float64
consume_top_time_last_1_month                 4754 non-null float64
consume_top_time_last_6_month                 4754 non-null float64
cross_consume_count_last_1_month              4754 non-null float64
trans_fail_top_count_enum_last_1_month        4754 non-null float64
trans_fail_top_count_enum_last_6_month        4754 non-null float64
trans_fail_top_count_enum_last_12_month       4754 non-null float64
consume_mini_time_last_1_month                4754 non-null float64
max_cumulative_consume_later_1_month          4754 non-null float64
max_consume_count_later_6_month               4754 non-null float64
railway_consume_count_last_12_month           4754 non-null float64
pawns_auctions_trusts_consume_last_1_month    4754 non-null float64
pawns_auctions_trusts_consume_last_6_month    4754 non-null float64
jewelry_consume_count_last_6_month            4754 non-null float64
status                                        4754 non-null float64
first_transaction_day                         4754 non-null float64
trans_day_last_12_month                       4754 non-null float64
apply_score                                   4754 non-null float64
apply_credibility                             4754 non-null float64
query_org_count                               4754 non-null float64
query_finance_count                           4754 non-null float64
query_cash_count                              4754 non-null float64
query_sum_count                               4754 non-null float64
latest_one_month_apply                        4754 non-null float64
latest_three_month_apply                      4754 non-null float64
latest_six_month_apply                        4754 non-null float64
loans_score                                   4754 non-null float64
loans_credibility_behavior                    4754 non-null float64
loans_count                                   4754 non-null float64
loans_settle_count                            4754 non-null float64
loans_overdue_count                           4754 non-null float64
loans_org_count_behavior                      4754 non-null float64
consfin_org_count_behavior                    4754 non-null float64
loans_cash_count                              4754 non-null float64
latest_one_month_loan                         4754 non-null float64
latest_three_month_loan                       4754 non-null float64
latest_six_month_loan                         4754 non-null float64
history_suc_fee                               4754 non-null float64
history_fail_fee                              4754 non-null float64
latest_one_month_suc                          4754 non-null float64
latest_one_month_fail                         4754 non-null float64
loans_long_time                               4754 non-null float64
loans_credit_limit                            4754 non-null float64
loans_credibility_limit                       4754 non-null float64
loans_org_count_current                       4754 non-null float64
loans_product_count                           4754 non-null float64
loans_max_limit                               4754 non-null float64
loans_avg_limit                               4754 non-null float64
consfin_credit_limit                          4754 non-null float64
consfin_credibility                           4754 non-null float64
consfin_org_count_current                     4754 non-null float64
consfin_product_count                         4754 non-null float64
consfin_max_limit                             4754 non-null float64
consfin_avg_limit                             4754 non-null float64
latest_query_day                              4754 non-null float64
loans_latest_day                              4754 non-null float64
一线城市                                          4754 non-null int64
三线城市                                          4754 non-null int64
二线城市                                          4754 non-null int64
其他城市                                          4754 non-null int64
境外                                            4754 non-null int64
latest_query_time_month                       4754 non-null int64
latest_query_time_weekday                     4754 non-null int64
loans_latest_time_month                       4754 non-null int64
loans_latest_time_weekday                     4754 non-null int64
dtypes: float64(81), int64(9)
memory usage: 3.3 MB
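The info() output shows every column as numeric with 4754 non-null values. The same check can be sketched as a small function; `check_preprocessed` is a hypothetical helper, and the two frames below are invented.

```python
import numpy as np
import pandas as pd


def check_preprocessed(df, label="status"):
    """Return a list of problems found in a preprocessed frame (hypothetical helper)."""
    problems = []
    if label not in df.columns:
        problems.append(f"label column {label!r} is missing")
    if df.isnull().values.any():
        problems.append("missing values remain")
    non_numeric = df.select_dtypes(exclude=[np.number]).columns.tolist()
    if non_numeric:
        problems.append(f"non-numeric columns remain: {non_numeric}")
    return problems


# Invented frames: one fully processed, one not.
clean = pd.DataFrame({"apply_score": [583.0, 653.0], "status": [1.0, 0.0]})
dirty = pd.DataFrame({
    "apply_score": [583.0, np.nan],
    "latest_query_time": ["2018-04-25", "2018-05-03"],
})

print(check_preprocessed(clean))
print(check_preprocessed(dirty))
```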

