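Note: this is a financial dataset, and the task is to predict whether a loan customer will become overdue. The "status" column is the label: 0 means not overdue, 1 means overdue.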
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
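from sklearn.preprocessing import LabelBinarizer, OneHotEncoder
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer is its replacement.
from sklearn.impute import SimpleImputer as Imputer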
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
data_original=pd.read_csv('data.csv',skipinitialspace=True)
The CSV file must be saved with UTF-8 encoding before it can be read this way.
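As a minimal sketch of the encoding note above: the snippet assumes, purely for illustration, that the original file was saved as GBK (the source does not say which encoding it used). The file names and the tiny two-row CSV are hypothetical stand-ins for `data.csv`. It shows two ways around the problem: passing `encoding` to `pd.read_csv`, or re-saving the file as UTF-8 once.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the original file, written in an assumed GBK encoding.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "data_gbk.csv")
with open(src_path, "w", encoding="gbk") as f:
    f.write("reg_preference_for_trad,status\n一线城市,1\n三线城市,0\n")

# Option A: tell pandas the source encoding directly.
df_a = pd.read_csv(src_path, encoding="gbk", skipinitialspace=True)

# Option B: re-save the file as UTF-8 once, then read it with the default encoding.
dst_path = os.path.join(tmp_dir, "data_utf8.csv")
with open(src_path, "r", encoding="gbk") as src, \
        open(dst_path, "w", encoding="utf-8") as dst:
    dst.write(src.read())
df_b = pd.read_csv(dst_path, skipinitialspace=True)
```

Both routes should yield the same DataFrame; option B matches what the note describes.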
data=data_original.copy()
data.head(5)
| | Unnamed: 0 | custid | trade_no | bank_card_no | low_volume_percent | middle_volume_percent | take_amount_in_later_12_month_highest | trans_amount_increase_rate_lately | trans_activity_month | trans_activity_day | transd_mcc | trans_days_interval_filter | trans_days_interval | regional_mobility | student_feature | repayment_capability | is_high_user | number_of_trans_from_2011 | first_transaction_time | historical_trans_amount | historical_trans_day | rank_trad_1_month | trans_amount_3_month | avg_consume_less_12_valid_month | abs | top_trans_count_last_1_month | avg_price_last_12_month | avg_price_top_last_12_valid_month | reg_preference_for_trad | trans_top_time_last_1_month | trans_top_time_last_6_month | consume_top_time_last_1_month | consume_top_time_last_6_month | cross_consume_count_last_1_month | trans_fail_top_count_enum_last_1_month | trans_fail_top_count_enum_last_6_month | trans_fail_top_count_enum_last_12_month | consume_mini_time_last_1_month | max_cumulative_consume_later_1_month | max_consume_count_later_6_month | railway_consume_count_last_12_month | pawns_auctions_trusts_consume_last_1_month | pawns_auctions_trusts_consume_last_6_month | jewelry_consume_count_last_6_month | status | source | first_transaction_day | trans_day_last_12_month | id_name | apply_score | apply_credibility | query_org_count | query_finance_count | query_cash_count | query_sum_count | latest_query_time | latest_one_month_apply | latest_three_month_apply | latest_six_month_apply | loans_score | loans_credibility_behavior | loans_count | loans_settle_count | loans_overdue_count | loans_org_count_behavior | consfin_org_count_behavior | loans_cash_count | latest_one_month_loan | latest_three_month_loan | latest_six_month_loan | history_suc_fee | history_fail_fee | latest_one_month_suc | latest_one_month_fail | loans_long_time | loans_latest_time | loans_credit_limit | loans_credibility_limit | loans_org_count_current | loans_product_count | loans_max_limit | loans_avg_limit | consfin_credit_limit | consfin_credibility | consfin_org_count_current | consfin_product_count | consfin_max_limit | consfin_avg_limit | latest_query_day | loans_latest_day |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5 | 2791858 | 20180507115231274000000023057383 | 卡号1 | 0.01 | 0.99 | 0 | 0.90 | 0.55 | 0.313 | 17.0 | 27.0 | 26.0 | 3.0 | NaN | 19890 | 0 | 30.0 | 20130817.0 | 149050 | 151.0 | 0.40 | 34030 | 7.0 | 3920 | 0.15 | 1020 | 0.55 | 一线城市 | 4.0 | 19.0 | 4.0 | 19.0 | 1.0 | 1.0 | 2.0 | 2.0 | 5.0 | 2170 | 6.0 | 0.0 | 1970 | 18040 | 0.0 | 1 | xs | 1738.0 | 85.0 | 蒋红 | 583.0 | 79.0 | 8.0 | 2.0 | 6.0 | 10.0 | 2018-04-25 | 2.0 | 5.0 | 8.0 | 552.0 | 73.0 | 37.0 | 34.0 | 2.0 | 10.0 | 1.0 | 9.0 | 1.0 | 1.0 | 13.0 | 37.0 | 7.0 | 1.0 | 0.0 | 341.0 | 2018-04-19 | 2200.0 | 72.0 | 9.0 | 10.0 | 2900.0 | 1688.0 | 1200.0 | 75.0 | 1.0 | 2.0 | 1200.0 | 1200.0 | 12.0 | 18.0 |
1 | 10 | 534047 | 20180507121002192000000023073000 | 卡号1 | 0.02 | 0.94 | 2000 | 1.28 | 1.00 | 0.458 | 19.0 | 30.0 | 14.0 | 4.0 | 1.0 | 16970 | 0 | 23.0 | 20160402.0 | 302910 | 224.0 | 0.35 | 10590 | 5.0 | 6950 | 0.05 | 1210 | 0.50 | 一线城市 | 13.0 | 30.0 | 13.0 | 30.0 | 0.0 | 0.0 | 3.0 | 3.0 | 330.0 | 2100 | 9.0 | 0.0 | 1820 | 15680 | 0.0 | 0 | xs | 779.0 | 84.0 | 崔向朝 | 653.0 | 73.0 | 7.0 | 4.0 | 2.0 | 8.0 | 2018-05-03 | 2.0 | 6.0 | 8.0 | 635.0 | 76.0 | 37.0 | 36.0 | 0.0 | 17.0 | 5.0 | 12.0 | 2.0 | 2.0 | 8.0 | 49.0 | 4.0 | 2.0 | 1.0 | 353.0 | 2018-05-05 | 2000.0 | 74.0 | 12.0 | 12.0 | 3500.0 | 1758.0 | 15100.0 | 80.0 | 5.0 | 6.0 | 22800.0 | 9360.0 | 4.0 | 2.0 |
2 | 12 | 2849787 | 20180507125159718000000023114911 | 卡号1 | 0.04 | 0.96 | 0 | 1.00 | 1.00 | 0.114 | 13.0 | 68.0 | 22.0 | 1.0 | NaN | 9710 | 0 | 9.0 | 20170617.0 | 11520 | 31.0 | 1.00 | 5710 | 5.0 | 840 | 0.65 | 570 | 0.65 | 一线城市 | 0.0 | 68.0 | 0.0 | 68.0 | 0.0 | 3.0 | 6.0 | 6.0 | 0.0 | 0 | 3.0 | 0.0 | 0 | 0 | 0.0 | 1 | xs | 338.0 | 95.0 | 王中云 | 654.0 | 76.0 | 11.0 | 5.0 | 5.0 | 16.0 | 2018-05-05 | 5.0 | 5.0 | 14.0 | 633.0 | 83.0 | 4.0 | 2.0 | 0.0 | 3.0 | 1.0 | 2.0 | 2.0 | 2.0 | 4.0 | 2.0 | 2.0 | 1.0 | 1.0 | 157.0 | 2018-05-01 | 1500.0 | 77.0 | 2.0 | 2.0 | 1600.0 | 1250.0 | 4200.0 | 87.0 | 1.0 | 1.0 | 4200.0 | 4200.0 | 2.0 | 6.0 |
3 | 13 | 1809708 | 20180507121358683000000388283484 | 卡号1 | 0.00 | 0.96 | 2000 | 0.13 | 0.57 | 0.777 | 22.0 | 14.0 | 6.0 | 3.0 | NaN | 6210 | 0 | 33.0 | 20130516.0 | 491130 | 360.0 | 0.15 | 91690 | 7.0 | 46850 | 0.05 | 1290 | 0.45 | 三线城市 | 6.0 | 8.0 | 6.0 | 8.0 | 0.0 | 1.0 | 8.0 | 8.0 | 31700.0 | 8140 | 9.0 | 0.0 | 2700 | 27970 | 0.0 | 0 | xs | 1831.0 | 82.0 | 何洋洋 | 595.0 | 79.0 | 12.0 | 7.0 | 4.0 | 22.0 | 2018-05-05 | 3.0 | 16.0 | 17.0 | 542.0 | 75.0 | 85.0 | 81.0 | 4.0 | 22.0 | 5.0 | 17.0 | 2.0 | 4.0 | 34.0 | 91.0 | 26.0 | 2.0 | 0.0 | 355.0 | 2018-05-03 | 1800.0 | 74.0 | 17.0 | 18.0 | 3200.0 | 1541.0 | 16300.0 | 80.0 | 5.0 | 5.0 | 30000.0 | 12180.0 | 2.0 | 4.0 |
4 | 14 | 2499829 | 20180507115448545000000388205844 | 卡号1 | 0.01 | 0.99 | 0 | 0.46 | 1.00 | 0.175 | 13.0 | 66.0 | 42.0 | 1.0 | NaN | 11150 | 0 | 12.0 | 20170312.0 | 61470 | 63.0 | 0.65 | 9770 | 6.0 | 760 | 1.00 | 1110 | 0.50 | 一线城市 | 0.0 | 66.0 | 0.0 | 66.0 | 0.0 | 3.0 | 3.0 | 3.0 | 0.0 | 1000 | 3.0 | 0.0 | 0 | 6410 | 0.0 | 1 | xs | 435.0 | 88.0 | 赵洋 | 541.0 | 75.0 | 11.0 | 3.0 | 4.0 | 14.0 | 2018-04-15 | 6.0 | 8.0 | 9.0 | 479.0 | 73.0 | 37.0 | 32.0 | 6.0 | 12.0 | 2.0 | 10.0 | 0.0 | 0.0 | 10.0 | 36.0 | 25.0 | 0.0 | 0.0 | 360.0 | 2018-01-07 | 1800.0 | 72.0 | 10.0 | 10.0 | 2300.0 | 1630.0 | 8300.0 | 79.0 | 2.0 | 2.0 | 8400.0 | 8250.0 | 22.0 | 120.0 |
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
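When a DataFrame has too many rows or columns, printing it truncates the output.
# Show all columns: pd.set_option('display.max_columns', None)
# Show all rows: pd.set_option('display.max_rows', None)
# Set the display width of each value to 100 (default is 50): pd.set_option('display.max_colwidth', 100)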
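`pd.set_option` changes the display settings for the whole session. If the full view is only needed for one cell, a scoped alternative is `pd.option_context`, sketched below on a toy 30-column frame (the frame and the variable names are illustrative, not from the dataset).

```python
import pandas as pd

# Toy frame with more columns than pandas usually shows by default.
wide = pd.DataFrame({f"col_{i}": range(3) for i in range(30)})

before = pd.get_option("display.max_columns")

# The settings apply only inside the with-block and are restored on exit.
with pd.option_context("display.max_columns", None, "display.max_rows", None):
    full_text = repr(wide)

after = pd.get_option("display.max_columns")
```

Inside the block every column is rendered; afterwards the previous setting is back in place.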
data.describe()
| | Unnamed: 0 | custid | low_volume_percent | middle_volume_percent | take_amount_in_later_12_month_highest | trans_amount_increase_rate_lately | trans_activity_month | trans_activity_day | transd_mcc | trans_days_interval_filter | trans_days_interval | regional_mobility | student_feature | repayment_capability | is_high_user | number_of_trans_from_2011 | first_transaction_time | historical_trans_amount | historical_trans_day | rank_trad_1_month | trans_amount_3_month | avg_consume_less_12_valid_month | abs | top_trans_count_last_1_month | avg_price_last_12_month | avg_price_top_last_12_valid_month | trans_top_time_last_1_month | trans_top_time_last_6_month | consume_top_time_last_1_month | consume_top_time_last_6_month | cross_consume_count_last_1_month | trans_fail_top_count_enum_last_1_month | trans_fail_top_count_enum_last_6_month | trans_fail_top_count_enum_last_12_month | consume_mini_time_last_1_month | max_cumulative_consume_later_1_month | max_consume_count_later_6_month | railway_consume_count_last_12_month | pawns_auctions_trusts_consume_last_1_month | pawns_auctions_trusts_consume_last_6_month | jewelry_consume_count_last_6_month | status | first_transaction_day | trans_day_last_12_month | apply_score | apply_credibility | query_org_count | query_finance_count | query_cash_count | query_sum_count | latest_one_month_apply | latest_three_month_apply | latest_six_month_apply | loans_score | loans_credibility_behavior | loans_count | loans_settle_count | loans_overdue_count | loans_org_count_behavior | consfin_org_count_behavior | loans_cash_count | latest_one_month_loan | latest_three_month_loan | latest_six_month_loan | history_suc_fee | history_fail_fee | latest_one_month_suc | latest_one_month_fail | loans_long_time | loans_credit_limit | loans_credibility_limit | loans_org_count_current | loans_product_count | loans_max_limit | loans_avg_limit | consfin_credit_limit | consfin_credibility | consfin_org_count_current | consfin_product_count | consfin_max_limit | consfin_avg_limit | latest_query_day | loans_latest_day |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 4754.000000 | 4.754000e+03 | 4752.000000 | 4752.000000 | 4754.000000 | 4751.000000 | 4752.000000 | 4752.000000 | 4752.000000 | 4746.000000 | 4752.000000 | 4752.000000 | 1756.000000 | 4.754000e+03 | 4754.000000 | 4752.000000 | 4.752000e+03 | 4.754000e+03 | 4752.000000 | 4752.000000 | 4.754000e+03 | 4752.000000 | 4754.000000 | 4752.000000 | 4754.000000 | 4650.000000 | 4746.000000 | 4746.000000 | 4746.000000 | 4746.000000 | 4328.000000 | 4738.000000 | 4738.000000 | 4738.000000 | 4.728000e+03 | 4754.000000 | 4746.000000 | 4742.000000 | 4754.000000 | 4754.000000 | 4742.000000 | 4754.000000 | 4752.000000 | 4752.000000 | 4450.000000 | 4450.000000 | 4450.000000 | 4450.000000 | 4450.000000 | 4450.000000 | 4450.000000 | 4450.000000 | 4450.000000 | 4457.000000 | 4457.000000 | 4457.000000 | 4457.000000 | 4457.000000 | 4457.000000 | 4457.000000 | 4457.000000 | 4457.000000 | 4457.000000 | 4457.000000 | 4457.000000 | 4457.000000 | 4457.000000 | 4457.000000 | 4457.000000 | 4457.000000 | 4457.000000 | 4457.000000 | 4457.000000 | 4457.000000 | 4457.000000 | 4457.000000 | 4457.000000 | 4457.000000 | 4457.000000 | 4457.000000 | 4457.000000 | 4450.000000 | 4457.000000 |
mean | 6008.414178 | 1.690993e+06 | 0.021806 | 0.901294 | 1940.197728 | 14.160674 | 0.804411 | 0.365425 | 17.502946 | 29.029920 | 21.751263 | 2.678662 | 1.001139 | 1.870201e+04 | 0.011149 | 23.033880 | 2.015109e+07 | 2.307359e+05 | 176.109428 | 0.476926 | 3.896430e+04 | 6.572601 | 9344.350021 | 0.355745 | 1237.088767 | 0.514667 | 7.134008 | 20.174673 | 7.047198 | 20.649600 | 0.642329 | 1.656184 | 4.529759 | 5.232165 | 1.553622e+05 | 2886.964661 | 6.055626 | 0.030789 | 1321.201094 | 18958.460244 | 0.014340 | 0.250947 | 1036.274621 | 89.006944 | 576.632584 | 75.998876 | 11.974382 | 6.020000 | 3.784719 | 16.891236 | 4.329438 | 8.771910 | 12.364270 | 543.205968 | 75.438636 | 35.952210 | 31.039937 | 2.308952 | 12.845412 | 4.732331 | 8.113081 | 0.965896 | 2.821853 | 13.926857 | 43.145614 | 17.708548 | 1.224366 | 1.311420 | 335.159973 | 2089.297734 | 71.992372 | 8.113081 | 8.685214 | 3390.038142 | 1820.357864 | 9187.009199 | 76.042630 | 4.732331 | 5.227507 | 16153.690823 | 8007.696881 | 24.112809 | 55.181512 |
std | 3452.071428 | 1.034235e+06 | 0.041527 | 0.144856 | 3923.971494 | 694.180473 | 0.196920 | 0.170196 | 4.475616 | 22.722432 | 16.474916 | 0.890360 | 0.033739 | 5.221783e+04 | 0.105007 | 10.057837 | 1.480487e+04 | 3.204931e+05 | 99.687285 | 0.263769 | 1.017461e+05 | 1.390723 | 27007.597886 | 0.350595 | 765.873649 | 0.100397 | 5.318254 | 12.962979 | 5.456050 | 13.125224 | 2.343228 | 1.908887 | 4.455923 | 4.756974 | 3.742672e+05 | 10813.451908 | 5.684529 | 0.478499 | 6616.691843 | 28191.132260 | 0.201777 | 0.433603 | 537.108729 | 19.069927 | 51.167375 | 4.168916 | 7.041493 | 3.805369 | 2.599244 | 11.299787 | 4.525521 | 7.621961 | 9.274982 | 60.954266 | 2.231822 | 24.614363 | 21.694068 | 3.152881 | 7.448393 | 2.974596 | 5.374465 | 1.495566 | 3.455817 | 10.828475 | 30.353618 | 25.089348 | 1.944912 | 3.893607 | 35.770102 | 708.951406 | 10.851926 | 5.374465 | 5.759025 | 1474.206546 | 583.418291 | 7371.257043 | 14.536819 | 2.974596 | 3.409292 | 14301.037628 | 5679.418585 | 37.725724 | 53.486408 |
min | 5.000000 | 1.140000e+02 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.120000 | 0.033000 | 2.000000 | 0.000000 | 4.000000 | 1.000000 | 1.000000 | 0.000000e+00 | 0.000000 | 1.000000 | 2.011010e+07 | 0.000000e+00 | 2.000000 | 0.050000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.050000 | 0.000000 | 0.050000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 127.000000 | 82.000000 | 450.000000 | 50.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 413.000000 | 56.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 26.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -2.000000 | -2.000000 |
25% | 3106.000000 | 7.593358e+05 | 0.010000 | 0.880000 | 0.000000 | 0.615000 | 0.670000 | 0.233000 | 15.000000 | 16.000000 | 12.000000 | 2.000000 | 1.000000 | 8.590000e+03 | 0.000000 | 16.000000 | 2.014102e+07 | 7.949750e+04 | 102.000000 | 0.300000 | 1.168250e+04 | 6.000000 | 1290.000000 | 0.087500 | 920.000000 | 0.450000 | 3.250000 | 12.000000 | 3.000000 | 12.000000 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 0.000000e+00 | 700.000000 | 3.000000 | 0.000000 | 0.000000 | 5252.500000 | 0.000000 | 0.000000 | 632.000000 | 82.000000 | 535.000000 | 74.000000 | 7.000000 | 3.000000 | 2.000000 | 9.000000 | 1.000000 | 3.000000 | 6.000000 | 493.000000 | 74.000000 | 17.000000 | 15.000000 | 0.000000 | 7.000000 | 2.000000 | 4.000000 | 0.000000 | 0.000000 | 6.000000 | 21.000000 | 3.000000 | 0.000000 | 0.000000 | 329.000000 | 1700.000000 | 72.000000 | 4.000000 | 4.000000 | 2300.000000 | 1535.000000 | 4800.000000 | 77.000000 | 2.000000 | 3.000000 | 7800.000000 | 4737.000000 | 5.000000 | 10.000000 |
50% | 6006.500000 | 1.634942e+06 | 0.010000 | 0.960000 | 500.000000 | 0.970000 | 0.860000 | 0.350000 | 17.000000 | 23.000000 | 17.000000 | 3.000000 | 1.000000 | 1.221000e+04 | 0.000000 | 21.000000 | 2.015111e+07 | 1.623350e+05 | 160.000000 | 0.450000 | 2.555500e+04 | 7.000000 | 3345.000000 | 0.200000 | 1140.000000 | 0.500000 | 7.000000 | 17.000000 | 7.000000 | 18.000000 | 0.000000 | 1.000000 | 3.000000 | 4.000000 | 2.400000e+01 | 1530.000000 | 5.000000 | 0.000000 | 70.000000 | 12725.000000 | 0.000000 | 0.000000 | 919.000000 | 83.000000 | 549.000000 | 76.000000 | 11.000000 | 5.000000 | 3.000000 | 15.000000 | 3.000000 | 7.000000 | 10.000000 | 511.000000 | 75.000000 | 31.000000 | 27.000000 | 1.000000 | 12.000000 | 4.000000 | 7.000000 | 0.000000 | 2.000000 | 11.000000 | 37.000000 | 10.000000 | 0.000000 | 0.000000 | 349.000000 | 2100.000000 | 74.000000 | 7.000000 | 8.000000 | 3100.000000 | 1810.000000 | 7700.000000 | 79.000000 | 4.000000 | 5.000000 | 13800.000000 | 7050.000000 | 14.000000 | 36.000000 |
75% | 8999.000000 | 2.597905e+06 | 0.020000 | 0.990000 | 2000.000000 | 1.600000 | 1.000000 | 0.480000 | 20.000000 | 32.000000 | 27.000000 | 3.000000 | 1.000000 | 1.764750e+04 | 0.000000 | 29.000000 | 2.016083e+07 | 2.985600e+05 | 231.000000 | 0.600000 | 4.795000e+04 | 7.000000 | 8067.500000 | 0.650000 | 1400.000000 | 0.550000 | 10.000000 | 26.000000 | 10.000000 | 26.000000 | 1.000000 | 2.000000 | 6.000000 | 6.000000 | 7.478850e+04 | 2760.000000 | 7.000000 | 0.000000 | 980.000000 | 23740.000000 | 0.000000 | 1.000000 | 1310.250000 | 87.000000 | 629.000000 | 78.000000 | 16.000000 | 8.000000 | 5.000000 | 23.000000 | 6.000000 | 12.000000 | 17.000000 | 602.000000 | 77.000000 | 50.000000 | 43.000000 | 3.000000 | 17.000000 | 7.000000 | 11.000000 | 1.000000 | 4.000000 | 20.000000 | 59.000000 | 22.000000 | 2.000000 | 1.000000 | 356.000000 | 2400.000000 | 75.000000 | 11.000000 | 12.000000 | 4300.000000 | 2100.000000 | 11700.000000 | 80.000000 | 7.000000 | 7.000000 | 20400.000000 | 10000.000000 | 24.000000 | 91.000000 |
max | 11992.000000 | 4.004694e+06 | 1.000000 | 1.000000 | 68000.000000 | 47596.740000 | 1.000000 | 0.941000 | 42.000000 | 285.000000 | 234.000000 | 5.000000 | 2.000000 | 2.459390e+06 | 1.000000 | 85.000000 | 2.018011e+07 | 1.360130e+07 | 907.000000 | 1.000000 | 6.024100e+06 | 11.000000 | 918450.000000 | 1.000000 | 23140.000000 | 1.000000 | 27.000000 | 124.000000 | 27.000000 | 151.000000 | 69.000000 | 30.000000 | 120.000000 | 120.000000 | 2.392316e+06 | 496010.000000 | 147.000000 | 30.000000 | 238380.000000 | 525360.000000 | 6.000000 | 1.000000 | 2697.000000 | 382.000000 | 687.000000 | 93.000000 | 54.000000 | 24.000000 | 16.000000 | 98.000000 | 38.000000 | 75.000000 | 80.000000 | 688.000000 | 85.000000 | 158.000000 | 154.000000 | 25.000000 | 41.000000 | 18.000000 | 31.000000 | 15.000000 | 52.000000 | 74.000000 | 254.000000 | 345.000000 | 20.000000 | 58.000000 | 360.000000 | 6900.000000 | 89.000000 | 31.000000 | 32.000000 | 10000.000000 | 6900.000000 | 87100.000000 | 87.000000 | 18.000000 | 20.000000 | 266400.000000 | 82800.000000 | 360.000000 | 323.000000 |
data.info()
RangeIndex: 4754 entries, 0 to 4753
Data columns (total 90 columns):
Unnamed: 0 4754 non-null int64
custid 4754 non-null int64
trade_no 4754 non-null object
bank_card_no 4754 non-null object
low_volume_percent 4752 non-null float64
middle_volume_percent 4752 non-null float64
take_amount_in_later_12_month_highest 4754 non-null int64
trans_amount_increase_rate_lately 4751 non-null float64
trans_activity_month 4752 non-null float64
trans_activity_day 4752 non-null float64
transd_mcc 4752 non-null float64
trans_days_interval_filter 4746 non-null float64
trans_days_interval 4752 non-null float64
regional_mobility 4752 non-null float64
student_feature 1756 non-null float64
repayment_capability 4754 non-null int64
is_high_user 4754 non-null int64
number_of_trans_from_2011 4752 non-null float64
first_transaction_time 4752 non-null float64
historical_trans_amount 4754 non-null int64
historical_trans_day 4752 non-null float64
rank_trad_1_month 4752 non-null float64
trans_amount_3_month 4754 non-null int64
avg_consume_less_12_valid_month 4752 non-null float64
abs 4754 non-null int64
top_trans_count_last_1_month 4752 non-null float64
avg_price_last_12_month 4754 non-null int64
avg_price_top_last_12_valid_month 4650 non-null float64
reg_preference_for_trad 4752 non-null object
trans_top_time_last_1_month 4746 non-null float64
trans_top_time_last_6_month 4746 non-null float64
consume_top_time_last_1_month 4746 non-null float64
consume_top_time_last_6_month 4746 non-null float64
cross_consume_count_last_1_month 4328 non-null float64
trans_fail_top_count_enum_last_1_month 4738 non-null float64
trans_fail_top_count_enum_last_6_month 4738 non-null float64
trans_fail_top_count_enum_last_12_month 4738 non-null float64
consume_mini_time_last_1_month 4728 non-null float64
max_cumulative_consume_later_1_month 4754 non-null int64
max_consume_count_later_6_month 4746 non-null float64
railway_consume_count_last_12_month 4742 non-null float64
pawns_auctions_trusts_consume_last_1_month 4754 non-null int64
pawns_auctions_trusts_consume_last_6_month 4754 non-null int64
jewelry_consume_count_last_6_month 4742 non-null float64
status 4754 non-null int64
source 4754 non-null object
first_transaction_day 4752 non-null float64
trans_day_last_12_month 4752 non-null float64
id_name 4478 non-null object
apply_score 4450 non-null float64
apply_credibility 4450 non-null float64
query_org_count 4450 non-null float64
query_finance_count 4450 non-null float64
query_cash_count 4450 non-null float64
query_sum_count 4450 non-null float64
latest_query_time 4450 non-null object
latest_one_month_apply 4450 non-null float64
latest_three_month_apply 4450 non-null float64
latest_six_month_apply 4450 non-null float64
loans_score 4457 non-null float64
loans_credibility_behavior 4457 non-null float64
loans_count 4457 non-null float64
loans_settle_count 4457 non-null float64
loans_overdue_count 4457 non-null float64
loans_org_count_behavior 4457 non-null float64
consfin_org_count_behavior 4457 non-null float64
loans_cash_count 4457 non-null float64
latest_one_month_loan 4457 non-null float64
latest_three_month_loan 4457 non-null float64
latest_six_month_loan 4457 non-null float64
history_suc_fee 4457 non-null float64
history_fail_fee 4457 non-null float64
latest_one_month_suc 4457 non-null float64
latest_one_month_fail 4457 non-null float64
loans_long_time 4457 non-null float64
loans_latest_time 4457 non-null object
loans_credit_limit 4457 non-null float64
loans_credibility_limit 4457 non-null float64
loans_org_count_current 4457 non-null float64
loans_product_count 4457 non-null float64
loans_max_limit 4457 non-null float64
loans_avg_limit 4457 non-null float64
consfin_credit_limit 4457 non-null float64
consfin_credibility 4457 non-null float64
consfin_org_count_current 4457 non-null float64
consfin_product_count 4457 non-null float64
consfin_max_limit 4457 non-null float64
consfin_avg_limit 4457 non-null float64
latest_query_day 4450 non-null float64
loans_latest_day 4457 non-null float64
dtypes: float64(70), int64(13), object(7)
memory usage: 3.3+ MB
The output shows three dtypes in the dataset: float64 (70 columns), int64 (13), and object (7). Some features have missing values.
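To see which features are affected and how badly, the missing-value ratio per column can be computed directly. The sketch below uses a toy frame whose column names are borrowed from the dataset but whose values are made up.

```python
import numpy as np
import pandas as pd

# Toy data: values are invented, only the column names come from the dataset.
toy = pd.DataFrame({
    "student_feature": [1.0, np.nan, np.nan, np.nan],
    "apply_score": [583.0, 653.0, np.nan, 595.0],
    "status": [1, 0, 1, 0],
})

# Fraction of missing values per column, largest first.
missing_ratio = toy.isnull().mean().sort_values(ascending=False)

# Keep only the columns that actually have missing values.
cols_with_missing = missing_ratio[missing_ratio > 0]
print(cols_with_missing)
```

On the real data, the same two lines applied to `data` would rank the features by how much is missing.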
data.nunique()
Unnamed: 0 4754
custid 4754
trade_no 4754
bank_card_no 1
low_volume_percent 40
middle_volume_percent 90
take_amount_in_later_12_month_highest 166
trans_amount_increase_rate_lately 782
trans_activity_month 84
trans_activity_day 512
transd_mcc 41
trans_days_interval_filter 147
trans_days_interval 114
regional_mobility 5
student_feature 2
repayment_capability 2390
is_high_user 2
number_of_trans_from_2011 70
first_transaction_time 1693
historical_trans_amount 4524
historical_trans_day 476
rank_trad_1_month 20
trans_amount_3_month 3524
avg_consume_less_12_valid_month 12
abs 1697
top_trans_count_last_1_month 8
avg_price_last_12_month 330
avg_price_top_last_12_valid_month 20
reg_preference_for_trad 5
trans_top_time_last_1_month 28
trans_top_time_last_6_month 97
consume_top_time_last_1_month 28
consume_top_time_last_6_month 94
cross_consume_count_last_1_month 19
trans_fail_top_count_enum_last_1_month 15
trans_fail_top_count_enum_last_6_month 25
trans_fail_top_count_enum_last_12_month 26
consume_mini_time_last_1_month 1971
max_cumulative_consume_later_1_month 863
max_consume_count_later_6_month 29
railway_consume_count_last_12_month 6
pawns_auctions_trusts_consume_last_1_month 572
pawns_auctions_trusts_consume_last_6_month 2730
jewelry_consume_count_last_6_month 7
status 2
source 1
first_transaction_day 1693
trans_day_last_12_month 132
id_name 4309
apply_score 205
apply_credibility 41
query_org_count 46
query_finance_count 25
query_cash_count 17
query_sum_count 74
latest_query_time 207
latest_one_month_apply 36
latest_three_month_apply 56
latest_six_month_apply 65
loans_score 247
loans_credibility_behavior 25
loans_count 134
loans_settle_count 123
loans_overdue_count 26
loans_org_count_behavior 41
consfin_org_count_behavior 19
loans_cash_count 32
latest_one_month_loan 14
latest_three_month_loan 31
latest_six_month_loan 67
history_suc_fee 171
history_fail_fee 151
latest_one_month_suc 19
latest_one_month_fail 41
loans_long_time 202
loans_latest_time 232
loans_credit_limit 54
loans_credibility_limit 33
loans_org_count_current 32
loans_product_count 32
loans_max_limit 91
loans_avg_limit 961
consfin_credit_limit 327
consfin_credibility 24
consfin_org_count_current 19
consfin_product_count 20
consfin_max_limit 175
consfin_avg_limit 1677
latest_query_day 210
loans_latest_day 235
dtype: int64
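Start with the nunique output. Its main use is deciding whether to one-hot encode a feature and which features to drop (chiefly columns whose value is the same in every row). 'source' and 'bank_card_no' each have a single value, so they can be dropped outright. 'Unnamed: 0', 'custid', and 'trade_no' each have 4754 unique values, one per row, and their names show they are identifiers, so they can be dropped as well. The drop below also removes 'id_name', the customer name.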
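The same screening can be done programmatically. This is a minimal sketch on a toy frame (made-up values, column names borrowed from the dataset): it flags columns with a single value and columns with one distinct value per row.

```python
import pandas as pd

# Toy data: values are invented for illustration.
toy = pd.DataFrame({
    "custid": [5, 10, 12, 13],
    "source": ["xs", "xs", "xs", "xs"],
    "regional_mobility": [3.0, 4.0, 1.0, 3.0],
    "status": [1, 0, 1, 0],
})

n_unique = toy.nunique()

# Columns with at most one distinct value carry no information.
constant_cols = n_unique[n_unique <= 1].index.tolist()

# Columns with one distinct value per row are candidates for identifiers.
id_like_cols = n_unique[n_unique == len(toy)].index.tolist()

print(constant_cols, id_like_cols)
```

A continuous feature can also be unique in every row, so the identifier candidates still need a look at their names before being dropped, as the text does.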
data.drop(['Unnamed: 0', 'custid', 'trade_no', 'bank_card_no', 'source','id_name'], axis=1, inplace=True)
If dropping columns from the DataFrame raises a "column not found" error, add the parameter skipinitialspace=True when reading the CSV file.
data.shape
(4754, 84)
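One likely cause of the "column not found" error mentioned above is a space after each delimiter in the CSV header, which leaves the column names with a leading space. The sketch below reproduces that on a small in-memory CSV (the content is invented; whether the original file had this exact problem is an assumption).

```python
import io

import pandas as pd

# Invented CSV text with a space after every comma, including in the header.
raw = "custid, source, status\n5, xs, 1\n10, xs, 0\n"

plain = pd.read_csv(io.StringIO(raw))
clean = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

# Without the flag the names keep their leading space, so drop() cannot find them.
try:
    plain.drop(["source"], axis=1)
    drop_failed = False
except KeyError:
    drop_failed = True

# With the flag the names are clean and the drop works.
dropped = clean.drop(["source"], axis=1)
```

Printing `list(plain.columns)` is a quick way to spot the stray spaces.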
Converting object-type columns
object_cols = [col for col in data.columns if data[col].dtypes == 'O']
object_cols
#data.select_dtypes(include=[object]).columns
['reg_preference_for_trad', 'latest_query_time', 'loans_latest_time']
data[object_cols].head(5)
| | reg_preference_for_trad | latest_query_time | loans_latest_time |
---|---|---|---|
0 | 一线城市 | 2018-04-25 | 2018-04-19 |
1 | 一线城市 | 2018-05-03 | 2018-05-05 |
2 | 一线城市 | 2018-05-05 | 2018-05-01 |
3 | 三线城市 | 2018-05-05 | 2018-05-03 |
4 | 一线城市 | 2018-04-15 | 2018-01-07 |
data_obj=data[object_cols].copy()  # copy, so the in-place edits below do not act on a slice of data
data_num=data.drop(object_cols,axis=1)
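Missing values can be handled in two broad ways: deletion and imputation. Deletion either drops rows (samples) or drops columns (features). Samples missing many features and some useless features were already removed earlier, and most of the remaining features have few missing values (student_feature, non-null in only 1756 of 4754 rows, is the main exception), so deletion is not used here.
There are many imputation methods, and the right one depends on the feature. Common choices are mean, mode, median, and forward fill. Here the numeric columns are filled with the column mean and the object columns are forward-filled.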
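The four strategies named above can be compared side by side in plain pandas. This is a minimal sketch on a toy frame with invented values; which strategy is assigned to which column here is only for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: values are invented, column names come from the dataset.
toy = pd.DataFrame({
    "apply_score": [583.0, np.nan, 654.0, 595.0],
    "regional_mobility": [3.0, 3.0, np.nan, 1.0],
    "reg_preference_for_trad": ["一线城市", None, "三线城市", "一线城市"],
})

filled = toy.copy()

# Mean fill: suits roughly symmetric continuous features.
filled["apply_score"] = toy["apply_score"].fillna(toy["apply_score"].mean())

# Median fill: less sensitive to outliers than the mean.
filled["regional_mobility"] = toy["regional_mobility"].fillna(
    toy["regional_mobility"].median()
)

# Mode fill: the usual choice for categorical features.
most_common = toy["reg_preference_for_trad"].mode()[0]
filled["reg_preference_for_trad"] = toy["reg_preference_for_trad"].fillna(most_common)

# Forward fill: copies the previous row's value, so it depends on row order.
ffilled = toy["reg_preference_for_trad"].ffill()
```

Forward fill leaves a leading missing value untouched, since there is no earlier row to copy from.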
imputer=Imputer(strategy='mean')
mean_num=imputer.fit_transform(data_num)
data_num=pd.DataFrame(mean_num,columns=data_num.columns)
data_obj.ffill(inplace=True)
Encoding the object-type columns
encoder = LabelBinarizer()
reg_preference_1hot = encoder.fit_transform(data_obj['reg_preference_for_trad'])  # LabelBinarizer expects a 1-D array of labels
data_obj.drop(['reg_preference_for_trad'], axis=1, inplace=True)
reg_preference_df = pd.DataFrame(reg_preference_1hot, columns=encoder.classes_)
data_obj = pd.concat([data_obj, reg_preference_df], axis=1)
data_obj['latest_query_time'] = pd.to_datetime(data_obj['latest_query_time'])
data_obj['latest_query_time_month'] = data_obj['latest_query_time'].dt.month
data_obj['latest_query_time_weekday'] = data_obj['latest_query_time'].dt.weekday
data_obj['loans_latest_time'] = pd.to_datetime(data_obj['loans_latest_time'])
data_obj['loans_latest_time_month'] = data_obj['loans_latest_time'].dt.month
data_obj['loans_latest_time_weekday'] = data_obj['loans_latest_time'].dt.weekday
data_obj = data_obj.drop(['latest_query_time', 'loans_latest_time'], axis=1)
data_obj.head(5)
| | 一线城市 | 三线城市 | 二线城市 | 其他城市 | 境外 | latest_query_time_month | latest_query_time_weekday | loans_latest_time_month | loans_latest_time_weekday |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0 | 0 | 4 | 2 | 4 | 3 |
1 | 1 | 0 | 0 | 0 | 0 | 5 | 3 | 5 | 5 |
2 | 1 | 0 | 0 | 0 | 0 | 5 | 5 | 5 | 1 |
3 | 0 | 1 | 0 | 0 | 0 | 5 | 5 | 5 | 3 |
4 | 1 | 0 | 0 | 0 | 0 | 4 | 6 | 1 | 6 |
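The encoding steps above can also be sketched with pandas alone. The snippet swaps `LabelBinarizer` for `pd.get_dummies`, which is a different technique from the one used in this post, and runs on a three-row toy frame whose values are taken from the sample rows shown earlier.

```python
import pandas as pd

# Three sample rows in the style of the dataset.
toy = pd.DataFrame({
    "reg_preference_for_trad": ["一线城市", "三线城市", "一线城市"],
    "latest_query_time": ["2018-04-25", "2018-05-03", "2018-05-05"],
})

# One indicator column per category, as 0/1 integers.
dummies = pd.get_dummies(toy["reg_preference_for_trad"], dtype=int)

# Parse the date strings, then derive month and weekday (Monday is 0).
query_time = pd.to_datetime(toy["latest_query_time"])
features = pd.concat(
    [
        dummies,
        query_time.dt.month.rename("latest_query_time_month"),
        query_time.dt.weekday.rename("latest_query_time_weekday"),
    ],
    axis=1,
)
print(features)
```

One practical difference: `get_dummies` always produces one column per category, whereas `LabelBinarizer` collapses a two-category feature into a single 0/1 column.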
data=pd.concat([data_num,data_obj],axis=1)
data.shape
(4754, 90)
data.info()
RangeIndex: 4754 entries, 0 to 4753
Data columns (total 90 columns):
low_volume_percent 4754 non-null float64
middle_volume_percent 4754 non-null float64
take_amount_in_later_12_month_highest 4754 non-null float64
trans_amount_increase_rate_lately 4754 non-null float64
trans_activity_month 4754 non-null float64
trans_activity_day 4754 non-null float64
transd_mcc 4754 non-null float64
trans_days_interval_filter 4754 non-null float64
trans_days_interval 4754 non-null float64
regional_mobility 4754 non-null float64
student_feature 4754 non-null float64
repayment_capability 4754 non-null float64
is_high_user 4754 non-null float64
number_of_trans_from_2011 4754 non-null float64
first_transaction_time 4754 non-null float64
historical_trans_amount 4754 non-null float64
historical_trans_day 4754 non-null float64
rank_trad_1_month 4754 non-null float64
trans_amount_3_month 4754 non-null float64
avg_consume_less_12_valid_month 4754 non-null float64
abs 4754 non-null float64
top_trans_count_last_1_month 4754 non-null float64
avg_price_last_12_month 4754 non-null float64
avg_price_top_last_12_valid_month 4754 non-null float64
trans_top_time_last_1_month 4754 non-null float64
trans_top_time_last_6_month 4754 non-null float64
consume_top_time_last_1_month 4754 non-null float64
consume_top_time_last_6_month 4754 non-null float64
cross_consume_count_last_1_month 4754 non-null float64
trans_fail_top_count_enum_last_1_month 4754 non-null float64
trans_fail_top_count_enum_last_6_month 4754 non-null float64
trans_fail_top_count_enum_last_12_month 4754 non-null float64
consume_mini_time_last_1_month 4754 non-null float64
max_cumulative_consume_later_1_month 4754 non-null float64
max_consume_count_later_6_month 4754 non-null float64
railway_consume_count_last_12_month 4754 non-null float64
pawns_auctions_trusts_consume_last_1_month 4754 non-null float64
pawns_auctions_trusts_consume_last_6_month 4754 non-null float64
jewelry_consume_count_last_6_month 4754 non-null float64
status 4754 non-null float64
first_transaction_day 4754 non-null float64
trans_day_last_12_month 4754 non-null float64
apply_score 4754 non-null float64
apply_credibility 4754 non-null float64
query_org_count 4754 non-null float64
query_finance_count 4754 non-null float64
query_cash_count 4754 non-null float64
query_sum_count 4754 non-null float64
latest_one_month_apply 4754 non-null float64
latest_three_month_apply 4754 non-null float64
latest_six_month_apply 4754 non-null float64
loans_score 4754 non-null float64
loans_credibility_behavior 4754 non-null float64
loans_count 4754 non-null float64
loans_settle_count 4754 non-null float64
loans_overdue_count 4754 non-null float64
loans_org_count_behavior 4754 non-null float64
consfin_org_count_behavior 4754 non-null float64
loans_cash_count 4754 non-null float64
latest_one_month_loan 4754 non-null float64
latest_three_month_loan 4754 non-null float64
latest_six_month_loan 4754 non-null float64
history_suc_fee 4754 non-null float64
history_fail_fee 4754 non-null float64
latest_one_month_suc 4754 non-null float64
latest_one_month_fail 4754 non-null float64
loans_long_time 4754 non-null float64
loans_credit_limit 4754 non-null float64
loans_credibility_limit 4754 non-null float64
loans_org_count_current 4754 non-null float64
loans_product_count 4754 non-null float64
loans_max_limit 4754 non-null float64
loans_avg_limit 4754 non-null float64
consfin_credit_limit 4754 non-null float64
consfin_credibility 4754 non-null float64
consfin_org_count_current 4754 non-null float64
consfin_product_count 4754 non-null float64
consfin_max_limit 4754 non-null float64
consfin_avg_limit 4754 non-null float64
latest_query_day 4754 non-null float64
loans_latest_day 4754 non-null float64
一线城市 4754 non-null int64
三线城市 4754 non-null int64
二线城市 4754 non-null int64
其他城市 4754 non-null int64
境外 4754 non-null int64
latest_query_time_month 4754 non-null int64
latest_query_time_weekday 4754 non-null int64
loans_latest_time_month 4754 non-null int64
loans_latest_time_weekday 4754 non-null int64
dtypes: float64(81), int64(9)
memory usage: 3.3 MB