Building a Scorecard with Logistic Regression

- 3.0 Preface
  - The relationship between logistic regression and linear regression
  - Removing multicollinearity among features
  - Why logistic regression is used for financial data
  - Choosing a regularization
  - Feature selection methods
  - What binning is for
- 3.1 Importing libraries
- 3.2 Data preprocessing
  - 3.2.1 Dropping duplicates
  - 3.2.2 Filling missing values
  - 3.2.3 Handling outliers with descriptive statistics
  - 3.2.4 Staying business-centered: keep the data as it is, without unifying scales or standardizing the distribution
  - 3.2.5 Class imbalance
  - 3.2.6 Splitting training and test sets
- 3.3 Binning
  - 3.3.1 Equal-frequency binning
  - 3.3.2 Ensuring every bin contains both 0 and 1 (not done here)
  - 3.3.3 Defining the WOE and IV functions
  - 3.3.4 Chi-square test, merging bins, plotting the IV curve
  - 3.3.5 Binning with the optimal number of bins and validating the result
  - 3.3.6 Wrapping the optimal-bin-count search into a function
  - 3.3.7 Choosing bins for all features
- 3.4 Computing each bin's WOE and mapping it onto the data
- 3.5 Modeling and model validation
- 3.6 Building the scorecard
3.0 Preface

The relationship between logistic regression and linear regression

Linear regression handles continuous labels; logistic regression adapts the same linear equation to a categorical (binary) label while still producing a continuous output.

| Algorithm | Linear regression | Logistic regression |
| --- | --- | --- |
| Output | Continuous (a linear relationship between x and y can be found) | Continuous in (0, 1) (can be mapped to a binary class) |
| Algorithm type | Regression algorithm | Generalized "regression" algorithm (can also handle regression problems) |

Logistic regression applies the sigmoid function to the linear regression equation, z => g(z), so that the value lies between 0 and 1: values close to 0 are classified as 0 and values close to 1 as 1, which is how classification is achieved.

| Function | sigmoid | MinMaxScaler |
| --- | --- | --- |
| Compressed range of the data | (0, 1) | [0, 1] |

ln( g(z) / (1 - g(z)) ) = z   (the linear regression equation), i.e. the log-odds of g(z) recovers the linear part.
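A minimal sketch of that relationship, assuming only NumPy: the sigmoid squeezes z into (0, 1), and taking the log-odds of the result gives z back.

import numpy as np

def sigmoid(z):
    # g(z) maps any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

z = np.array([-3.0, 0.0, 3.0])
g = sigmoid(z)
print(g)                    # approx [0.047, 0.5, 0.953]
print(np.log(g / (1 - g)))  # the log-odds recovers z: approx [-3.0, 0.0, 3.0]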
Removing multicollinearity among features

Linear regression expects the data to be roughly normally distributed and requires removing multicollinearity among the features;
variance filtering or the variance inflation factor (VIF) can be used for this.
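A minimal VIF sketch, assuming statsmodels is installed (it is not used elsewhere in this notebook); a feature whose VIF is large (a common rule of thumb is above 10) is strongly explained by the other features and is a candidate for removal.

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.Series:
    # VIF of each column regressed on all the others
    vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return pd.Series(vifs, index=X.columns, name="VIF")

# usage sketch once `data` is loaded: vif_table(data.iloc[:, 1:]).sort_values(ascending=False)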
Why logistic regression is used for financial data

Choosing a regularization

Regularization is used to handle overfitting.

| L1 norm | L2 norm |
| --- | --- |
| Sum of the absolute values of the parameters | Square root of the sum of the squared parameters |

C: balances the loss function against the regularization term (in sklearn, a smaller C means stronger regularization).

| L1 regularization | L2 regularization |
| --- | --- |
| Some parameters become exactly 0 (sparsity) | Parameters shrink towards 0 but rarely reach it |
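A minimal sketch of how the penalty and C are set in sklearn's LogisticRegression (illustrative only, not the tuning done later in 3.5; the solver has to support the chosen penalty, and 'liblinear' supports both):

from sklearn.linear_model import LogisticRegression

# L1: some coefficients become exactly 0, which doubles as implicit feature selection
lr_l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.5, max_iter=1000)
# L2: coefficients are shrunk towards 0 but usually stay non-zero
lr_l2 = LogisticRegression(penalty="l2", solver="liblinear", C=0.5, max_iter=1000)
# smaller C = stronger regularization; call lr_l1.fit(X, y) once a feature matrix X and label y exist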
Feature selection methods

Apart from algorithms such as PCA and SVD, which destroy the relationship between the features and the label,
the Embedded method, variance filtering, chi-square, mutual information, wrapper methods and so on can be used.
Mutual information: a filter method that captures any kind of relationship between a feature and the label (both linear and non-linear), used in the same way as the F-test.
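A minimal mutual-information sketch with sklearn's mutual_info_classif, assuming a feature matrix X and a binary label y like the ones built later in section 3.2; features whose estimated mutual information is 0 show no detectable relationship with the label.

from sklearn.feature_selection import mutual_info_classif

mi = mutual_info_classif(X, y, random_state=0)   # estimated MI between each feature and the label
keep = X.columns[mi > 0]                         # drop features that carry no information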
What binning is for

After the features are discretized, the logistic regression model becomes simpler and the risk of overfitting drops.
The number of discrete levels is easy to change, which helps model iteration;
with few discrete levels, computation is fast;
after discretization the model is more stable: differences within a bin shrink and differences between bins grow, although samples near the bin edges can still be unstable.
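A minimal sketch of the idea, assuming pandas (the formal treatment starts in section 3.3): pd.cut replaces each raw value by the interval it falls into, and the model later sees that interval's WOE instead of the raw value.

import pandas as pd

ages = pd.Series([23, 35, 47, 59, 71])
# three hand-picked bins; each age is mapped to an interval such as (36.0, 61.0]
print(pd.cut(ages, bins=[-float("inf"), 36, 61, float("inf")]))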
3.1 Importing libraries
%matplotlib inline
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression as LR
# the first column of the CSV is a row id, so use it as the index
data = pd.read_csv(r"D:\class_file\day08_05\rankingcard.csv",index_col=0)
3.2 Data preprocessing
data.head()
|   | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | 0.766127 | 45 | 2 | 0.802982 | 9120.0 | 13 | 0 | 6 | 0 | 2.0 |
| 2 | 0 | 0.957151 | 40 | 0 | 0.121876 | 2600.0 | 4 | 0 | 0 | 0 | 1.0 |
| 3 | 0 | 0.658180 | 38 | 1 | 0.085113 | 3042.0 | 2 | 1 | 0 | 0 | 0.0 |
| 4 | 0 | 0.233810 | 30 | 0 | 0.036050 | 3300.0 | 5 | 0 | 0 | 0 | 0.0 |
| 5 | 0 | 0.907239 | 49 | 1 | 0.024926 | 63588.0 | 7 | 0 | 1 | 0 | 0.0 |
data.info()
Int64Index: 150000 entries, 1 to 150000
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SeriousDlqin2yrs 150000 non-null int64
1 RevolvingUtilizationOfUnsecuredLines 150000 non-null float64
2 age 150000 non-null int64
3 NumberOfTime30-59DaysPastDueNotWorse 150000 non-null int64
4 DebtRatio 150000 non-null float64
5 MonthlyIncome 120269 non-null float64
6 NumberOfOpenCreditLinesAndLoans 150000 non-null int64
7 NumberOfTimes90DaysLate 150000 non-null int64
8 NumberRealEstateLoansOrLines 150000 non-null int64
9 NumberOfTime60-89DaysPastDueNotWorse 150000 non-null int64
10 NumberOfDependents 146076 non-null float64
dtypes: float64(4), int64(7)
memory usage: 13.7 MB
3.2.1 Dropping duplicates
data.drop_duplicates(inplace=True)
data.info()
Int64Index: 149391 entries, 1 to 150000
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SeriousDlqin2yrs 149391 non-null int64
1 RevolvingUtilizationOfUnsecuredLines 149391 non-null float64
2 age 149391 non-null int64
3 NumberOfTime30-59DaysPastDueNotWorse 149391 non-null int64
4 DebtRatio 149391 non-null float64
5 MonthlyIncome 120170 non-null float64
6 NumberOfOpenCreditLinesAndLoans 149391 non-null int64
7 NumberOfTimes90DaysLate 149391 non-null int64
8 NumberRealEstateLoansOrLines 149391 non-null int64
9 NumberOfTime60-89DaysPastDueNotWorse 149391 non-null int64
10 NumberOfDependents 145563 non-null float64
dtypes: float64(4), int64(7)
memory usage: 13.7 MB
data.index = range(data.shape[0])
data.shape[0]
149391
3.2.2 Filling missing values
data.info()
RangeIndex: 149391 entries, 0 to 149390
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SeriousDlqin2yrs 149391 non-null int64
1 RevolvingUtilizationOfUnsecuredLines 149391 non-null float64
2 age 149391 non-null int64
3 NumberOfTime30-59DaysPastDueNotWorse 149391 non-null int64
4 DebtRatio 149391 non-null float64
5 MonthlyIncome 120170 non-null float64
6 NumberOfOpenCreditLinesAndLoans 149391 non-null int64
7 NumberOfTimes90DaysLate 149391 non-null int64
8 NumberRealEstateLoansOrLines 149391 non-null int64
9 NumberOfTime60-89DaysPastDueNotWorse 149391 non-null int64
10 NumberOfDependents 145563 non-null float64
dtypes: float64(4), int64(7)
memory usage: 12.5 MB
data.isnull().sum()/data.shape[0]
SeriousDlqin2yrs 0.000000
RevolvingUtilizationOfUnsecuredLines 0.000000
age 0.000000
NumberOfTime30-59DaysPastDueNotWorse 0.000000
DebtRatio 0.000000
MonthlyIncome 0.195601
NumberOfOpenCreditLinesAndLoans 0.000000
NumberOfTimes90DaysLate 0.000000
NumberRealEstateLoansOrLines 0.000000
NumberOfTime60-89DaysPastDueNotWorse 0.000000
NumberOfDependents 0.025624
dtype: float64
data.isnull().mean()
SeriousDlqin2yrs 0.000000
RevolvingUtilizationOfUnsecuredLines 0.000000
age 0.000000
NumberOfTime30-59DaysPastDueNotWorse 0.000000
DebtRatio 0.000000
MonthlyIncome 0.195601
NumberOfOpenCreditLinesAndLoans 0.000000
NumberOfTimes90DaysLate 0.000000
NumberRealEstateLoansOrLines 0.000000
NumberOfTime60-89DaysPastDueNotWorse 0.000000
NumberOfDependents 0.025624
dtype: float64
data["NumberOfDependents"].fillna(int(data["NumberOfDependents"].mean()),inplace=True)
data.info()
RangeIndex: 149391 entries, 0 to 149390
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SeriousDlqin2yrs 149391 non-null int64
1 RevolvingUtilizationOfUnsecuredLines 149391 non-null float64
2 age 149391 non-null int64
3 NumberOfTime30-59DaysPastDueNotWorse 149391 non-null int64
4 DebtRatio 149391 non-null float64
5 MonthlyIncome 120170 non-null float64
6 NumberOfOpenCreditLinesAndLoans 149391 non-null int64
7 NumberOfTimes90DaysLate 149391 non-null int64
8 NumberRealEstateLoansOrLines 149391 non-null int64
9 NumberOfTime60-89DaysPastDueNotWorse 149391 non-null int64
10 NumberOfDependents 149391 non-null float64
dtypes: float64(4), int64(7)
memory usage: 12.5 MB
data.isnull().sum()/data.shape[0]
SeriousDlqin2yrs 0.000000
RevolvingUtilizationOfUnsecuredLines 0.000000
age 0.000000
NumberOfTime30-59DaysPastDueNotWorse 0.000000
DebtRatio 0.000000
MonthlyIncome 0.195601
NumberOfOpenCreditLinesAndLoans 0.000000
NumberOfTimes90DaysLate 0.000000
NumberRealEstateLoansOrLines 0.000000
NumberOfTime60-89DaysPastDueNotWorse 0.000000
NumberOfDependents 0.000000
dtype: float64
def fill_missing_rf(X,y,to_fill):
    """
    Fill one column's missing values with a random forest.
    Parameters:
    X: feature matrix to fill
    y: complete label column with no missing values
    to_fill: string, name of the column to fill
    """
    df = X.copy()
    fill = df.loc[:,to_fill]
    # the remaining features plus the original label become the new feature matrix
    df = pd.concat([df.loc[:,df.columns != to_fill],pd.DataFrame(y)],axis=1)
    # rows where the column is known form the training set; the missing rows are predicted
    Ytrain = fill[fill.notnull()]
    Ytest = fill[fill.isnull()]
    Xtrain = df.iloc[Ytrain.index,:]
    Xtest = df.iloc[Ytest.index,:]
    from sklearn.ensemble import RandomForestRegressor as rfr
    rfr = rfr(n_estimators=100)
    rfr = rfr.fit(Xtrain, Ytrain)
    Ypredict = rfr.predict(Xtest)
    return Ypredict
data.head()
|   | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0.766127 | 45 | 2 | 0.802982 | 9120.0 | 13 | 0 | 6 | 0 | 2.0 |
| 1 | 0 | 0.957151 | 40 | 0 | 0.121876 | 2600.0 | 4 | 0 | 0 | 0 | 1.0 |
| 2 | 0 | 0.658180 | 38 | 1 | 0.085113 | 3042.0 | 2 | 1 | 0 | 0 | 0.0 |
| 3 | 0 | 0.233810 | 30 | 0 | 0.036050 | 3300.0 | 5 | 0 | 0 | 0 | 0.0 |
| 4 | 0 | 0.907239 | 49 | 1 | 0.024926 | 63588.0 | 7 | 0 | 1 | 0 | 0.0 |
X = data.iloc[:,1:]
y = data["SeriousDlqin2yrs"]
X.shape
(149391, 10)
y_pred = fill_missing_rf(X,y,"MonthlyIncome")
y_pred
array([0.15, 0.28, 0.12, ..., 0.24, 0.16, 0. ])
y_pred.shape
(29221,)
data.loc[:,"MonthlyIncome"].isnull()
0 False
1 False
2 False
3 False
4 False
...
149386 False
149387 False
149388 True
149389 False
149390 False
Name: MonthlyIncome, Length: 149391, dtype: bool
data.loc[data.loc[:,"MonthlyIncome"].isnull(),"MonthlyIncome"]
6 NaN
8 NaN
16 NaN
32 NaN
41 NaN
..
149368 NaN
149369 NaN
149376 NaN
149384 NaN
149388 NaN
Name: MonthlyIncome, Length: 29221, dtype: float64
data.loc[data.loc[:,"MonthlyIncome"].isnull(),"MonthlyIncome"] = y_pred
data.info()
RangeIndex: 149391 entries, 0 to 149390
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SeriousDlqin2yrs 149391 non-null int64
1 RevolvingUtilizationOfUnsecuredLines 149391 non-null float64
2 age 149391 non-null int64
3 NumberOfTime30-59DaysPastDueNotWorse 149391 non-null int64
4 DebtRatio 149391 non-null float64
5 MonthlyIncome 149391 non-null float64
6 NumberOfOpenCreditLinesAndLoans 149391 non-null int64
7 NumberOfTimes90DaysLate 149391 non-null int64
8 NumberRealEstateLoansOrLines 149391 non-null int64
9 NumberOfTime60-89DaysPastDueNotWorse 149391 non-null int64
10 NumberOfDependents 149391 non-null float64
dtypes: float64(4), int64(7)
memory usage: 12.5 MB
3.2.3 Handling outliers with descriptive statistics
data.describe([0.01,0.1,0.25,.5,.75,.9,.99]).T
|   | count | mean | std | min | 1% | 10% | 25% | 50% | 75% | 90% | 99% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SeriousDlqin2yrs | 149391.0 | 0.066999 | 0.250021 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.0 |
| RevolvingUtilizationOfUnsecuredLines | 149391.0 | 6.071087 | 250.263672 | 0.0 | 0.0 | 0.003199 | 0.030132 | 0.154235 | 0.556494 | 0.978007 | 1.093922 | 50708.0 |
| age | 149391.0 | 52.306237 | 14.725962 | 0.0 | 24.0 | 33.000000 | 41.000000 | 52.000000 | 63.000000 | 72.000000 | 87.000000 | 109.0 |
| NumberOfTime30-59DaysPastDueNotWorse | 149391.0 | 0.393886 | 3.852953 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 4.000000 | 98.0 |
| DebtRatio | 149391.0 | 354.436740 | 2041.843455 | 0.0 | 0.0 | 0.034991 | 0.177441 | 0.368234 | 0.875279 | 1275.000000 | 4985.100000 | 329664.0 |
| MonthlyIncome | 149391.0 | 5423.900169 | 13228.153266 | 0.0 | 0.0 | 0.170000 | 1800.000000 | 4424.000000 | 7416.000000 | 10800.000000 | 23200.000000 | 3008750.0 |
| NumberOfOpenCreditLinesAndLoans | 149391.0 | 8.480892 | 5.136515 | 0.0 | 0.0 | 3.000000 | 5.000000 | 8.000000 | 11.000000 | 15.000000 | 24.000000 | 58.0 |
| NumberOfTimes90DaysLate | 149391.0 | 0.238120 | 3.826165 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 | 98.0 |
| NumberRealEstateLoansOrLines | 149391.0 | 1.022391 | 1.130196 | 0.0 | 0.0 | 0.000000 | 0.000000 | 1.000000 | 2.000000 | 2.000000 | 4.000000 | 54.0 |
| NumberOfTime60-89DaysPastDueNotWorse | 149391.0 | 0.212503 | 3.810523 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 98.0 |
| NumberOfDependents | 149391.0 | 0.740393 | 1.108272 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 2.000000 | 4.000000 | 20.0 |
data=data[data["age"]!=0]
data.shape
(149390, 11)
data[data.loc[:,"NumberOfTimes90DaysLate"] > 90]
|   | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1732 | 1 | 1.0 | 27 | 98 | 0.0 | 2700.000000 | 0 | 98 | 0 | 98 | 0.0 |
| 2285 | 0 | 1.0 | 22 | 98 | 0.0 | 1448.611748 | 0 | 98 | 0 | 98 | 0.0 |
| 3883 | 0 | 1.0 | 38 | 98 | 12.0 | 1977.580000 | 0 | 98 | 0 | 98 | 0.0 |
| 4416 | 0 | 1.0 | 21 | 98 | 0.0 | 0.000000 | 0 | 98 | 0 | 98 | 0.0 |
| 4704 | 0 | 1.0 | 21 | 98 | 0.0 | 2000.000000 | 0 | 98 | 0 | 98 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 146667 | 1 | 1.0 | 25 | 98 | 0.0 | 2132.805238 | 0 | 98 | 0 | 98 | 0.0 |
| 147180 | 1 | 1.0 | 68 | 98 | 255.0 | 86.120000 | 0 | 98 | 0 | 98 | 0.0 |
| 148548 | 1 | 1.0 | 24 | 98 | 54.0 | 385.430000 | 0 | 98 | 0 | 98 | 0.0 |
| 148634 | 0 | 1.0 | 26 | 98 | 0.0 | 2000.000000 | 0 | 98 | 0 | 98 | 0.0 |
| 148833 | 1 | 1.0 | 34 | 98 | 9.0 | 1786.770000 | 0 | 98 | 0 | 98 | 0.0 |

225 rows × 11 columns
data=data[data.loc[:,"NumberOfTimes90DaysLate"] <90]
data.index=range(data.shape[0])
data.shape
(149165, 11)
data.describe([0.01,0.1,0.25,.5,.75,.9,.99]).T
|   | count | mean | std | min | 1% | 10% | 25% | 50% | 75% | 90% | 99% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SeriousDlqin2yrs | 149165.0 | 0.066188 | 0.248612 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 1.000000 | 1.0 |
| RevolvingUtilizationOfUnsecuredLines | 149165.0 | 6.078770 | 250.453111 | 0.0 | 0.0 | 0.003174 | 0.030033 | 0.153615 | 0.553698 | 0.97502 | 1.094061 | 50708.0 |
| age | 149165.0 | 52.331076 | 14.714114 | 21.0 | 24.0 | 33.000000 | 41.000000 | 52.000000 | 63.000000 | 72.00000 | 87.000000 | 109.0 |
| NumberOfTime30-59DaysPastDueNotWorse | 149165.0 | 0.246720 | 0.698935 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.00000 | 3.000000 | 13.0 |
| DebtRatio | 149165.0 | 354.963542 | 2043.344496 | 0.0 | 0.0 | 0.036385 | 0.178211 | 0.368619 | 0.876994 | 1277.30000 | 4989.360000 | 329664.0 |
| MonthlyIncome | 149165.0 | 5428.186556 | 13237.334090 | 0.0 | 0.0 | 0.170000 | 1800.000000 | 4436.000000 | 7417.000000 | 10800.00000 | 23218.000000 | 3008750.0 |
| NumberOfOpenCreditLinesAndLoans | 149165.0 | 8.493688 | 5.129841 | 0.0 | 1.0 | 3.000000 | 5.000000 | 8.000000 | 11.000000 | 15.00000 | 24.000000 | 58.0 |
| NumberOfTimes90DaysLate | 149165.0 | 0.090725 | 0.486354 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 2.000000 | 17.0 |
| NumberRealEstateLoansOrLines | 149165.0 | 1.023927 | 1.130350 | 0.0 | 0.0 | 0.000000 | 0.000000 | 1.000000 | 2.000000 | 2.00000 | 4.000000 | 54.0 |
| NumberOfTime60-89DaysPastDueNotWorse | 149165.0 | 0.065069 | 0.330675 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 2.000000 | 11.0 |
| NumberOfDependents | 149165.0 | 0.740911 | 1.108534 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 2.00000 | 4.000000 | 20.0 |
3.2.4 Staying business-centered: keep the data as it is, without unifying scales or standardizing the distribution
3.2.5 Class imbalance

For logistic regression, upsampling is used here (adding minority-class samples to balance the labels).
X = data.iloc[:,1:]
y = data.iloc[:,0]
y
0 1
1 0
2 0
3 0
4 0
..
149160 0
149161 0
149162 0
149163 0
149164 0
Name: SeriousDlqin2yrs, Length: 149165, dtype: int64
y.value_counts()
0 139292
1 9873
Name: SeriousDlqin2yrs, dtype: int64
n_sample = X.shape[0]
n_1_sample = y.value_counts()[1]
n_0_sample = y.value_counts()[0]
print('Total samples: {}; share of 1: {:.2%}; share of 0: {:.2%}'.format(n_sample,n_1_sample/n_sample,n_0_sample/n_sample))
Total samples: 149165; share of 1: 6.62%; share of 0: 93.38%
import imblearn
from imblearn.over_sampling import SMOTE
# SMOTE oversamples the minority class by interpolating between nearest neighbours
sm = SMOTE(random_state=42)
X,y = sm.fit_resample(X,y)
n_sample_ = X.shape[0]
y.value_counts()
1 139292
0 139292
Name: SeriousDlqin2yrs, dtype: int64
n_1_sample = y.value_counts()[1]
n_0_sample = y.value_counts()[0]
print('Total samples: {}; share of 1: {:.2%}; share of 0: {:.2%}'.format(n_sample_,n_1_sample/n_sample_,n_0_sample/n_sample_))
Total samples: 278584; share of 1: 50.00%; share of 0: 50.00%
3.2.6 Splitting training and test sets
from sklearn.model_selection import train_test_split
X = pd.DataFrame(X)
y = pd.DataFrame(y)
X_train, X_vali, Y_train, Y_vali = train_test_split(X,y,test_size=0.3,random_state=420)
model_data = pd.concat([Y_train, X_train], axis=1)
model_data
|   | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 81602 | 0 | 0.015404 | 53 | 0 | 0.121802 | 4728.000000 | 5 | 0 | 0 | 0 | 0.000000 |
| 149043 | 0 | 0.168311 | 63 | 0 | 0.141964 | 1119.000000 | 5 | 0 | 0 | 0 | 0.000000 |
| 215073 | 1 | 1.063570 | 39 | 1 | 0.417663 | 3500.000000 | 5 | 1 | 0 | 2 | 3.716057 |
| 66278 | 0 | 0.088684 | 73 | 0 | 0.522822 | 5301.000000 | 11 | 0 | 2 | 0 | 0.000000 |
| 157084 | 1 | 0.622999 | 53 | 0 | 0.423650 | 13000.000000 | 9 | 0 | 2 | 0 | 0.181999 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 178094 | 1 | 0.916269 | 32 | 2 | 0.548132 | 6000.000000 | 10 | 0 | 1 | 0 | 3.966830 |
| 62239 | 1 | 0.484728 | 50 | 1 | 0.370603 | 5258.000000 | 12 | 0 | 1 | 0 | 2.000000 |
| 152127 | 1 | 0.850447 | 46 | 0 | 0.562610 | 8000.000000 | 9 | 0 | 1 | 0 | 2.768793 |
| 119174 | 0 | 1.000000 | 64 | 0 | 0.364694 | 10309.000000 | 7 | 0 | 3 | 0 | 0.000000 |
| 193608 | 1 | 0.512881 | 53 | 0 | 1968.401488 | 0.172907 | 12 | 0 | 1 | 0 | 0.000000 |

195008 rows × 11 columns
data.columns
Index(['SeriousDlqin2yrs', 'RevolvingUtilizationOfUnsecuredLines', 'age',
'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio', 'MonthlyIncome',
'NumberOfOpenCreditLinesAndLoans', 'NumberOfTimes90DaysLate',
'NumberRealEstateLoansOrLines', 'NumberOfTime60-89DaysPastDueNotWorse',
'NumberOfDependents'],
dtype='object')
model_data.index = range(model_data.shape[0])
model_data.columns = data.columns
model_data
|   | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0.015404 | 53 | 0 | 0.121802 | 4728.000000 | 5 | 0 | 0 | 0 | 0.000000 |
| 1 | 0 | 0.168311 | 63 | 0 | 0.141964 | 1119.000000 | 5 | 0 | 0 | 0 | 0.000000 |
| 2 | 1 | 1.063570 | 39 | 1 | 0.417663 | 3500.000000 | 5 | 1 | 0 | 2 | 3.716057 |
| 3 | 0 | 0.088684 | 73 | 0 | 0.522822 | 5301.000000 | 11 | 0 | 2 | 0 | 0.000000 |
| 4 | 1 | 0.622999 | 53 | 0 | 0.423650 | 13000.000000 | 9 | 0 | 2 | 0 | 0.181999 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 195003 | 1 | 0.916269 | 32 | 2 | 0.548132 | 6000.000000 | 10 | 0 | 1 | 0 | 3.966830 |
| 195004 | 1 | 0.484728 | 50 | 1 | 0.370603 | 5258.000000 | 12 | 0 | 1 | 0 | 2.000000 |
| 195005 | 1 | 0.850447 | 46 | 0 | 0.562610 | 8000.000000 | 9 | 0 | 1 | 0 | 2.768793 |
| 195006 | 0 | 1.000000 | 64 | 0 | 0.364694 | 10309.000000 | 7 | 0 | 3 | 0 | 0.000000 |
| 195007 | 1 | 0.512881 | 53 | 0 | 1968.401488 | 0.172907 | 12 | 0 | 1 | 0 | 0.000000 |

195008 rows × 11 columns
vali_data = pd.concat([Y_vali, X_vali], axis=1)
vali_data.index = range(vali_data.shape[0])
vali_data.columns = data.columns
model_data.to_csv(r"D:\class_file\day08_05\model_data.csv")
vali_data.to_csv(r"D:\class_file\day08_05\vali_data.csv")
3.3 Binning

3.3.1 Equal-frequency binning
model_data["age"]
0 53
1 63
2 39
3 73
4 53
..
195003 32
195004 50
195005 46
195006 64
195007 53
Name: age, Length: 195008, dtype: int64
model_data["qcut"],updown=pd.qcut(model_data["age"],retbins=True,q=20)
model_data["qcut"]
0 (52.0, 54.0]
1 (61.0, 64.0]
2 (36.0, 39.0]
3 (68.0, 74.0]
4 (52.0, 54.0]
...
195003 (31.0, 34.0]
195004 (48.0, 50.0]
195005 (45.0, 46.0]
195006 (61.0, 64.0]
195007 (52.0, 54.0]
Name: qcut, Length: 195008, dtype: category
Categories (20, interval[float64, right]): [(20.999, 28.0] < (28.0, 31.0] < (31.0, 34.0] < (34.0, 36.0] ... (61.0, 64.0] < (64.0, 68.0] < (68.0, 74.0] < (74.0, 107.0]]
updown.shape
(21,)
updown
array([ 21., 28., 31., 34., 36., 39., 41., 43., 45., 46., 48.,
50., 52., 54., 56., 58., 61., 64., 68., 74., 107.])
model_data.head()
|   | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents | qcut |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0.015404 | 53 | 0 | 0.121802 | 4728.0 | 5 | 0 | 0 | 0 | 0.000000 | (52.0, 54.0] |
| 1 | 0 | 0.168311 | 63 | 0 | 0.141964 | 1119.0 | 5 | 0 | 0 | 0 | 0.000000 | (61.0, 64.0] |
| 2 | 1 | 1.063570 | 39 | 1 | 0.417663 | 3500.0 | 5 | 1 | 0 | 2 | 3.716057 | (36.0, 39.0] |
| 3 | 0 | 0.088684 | 73 | 0 | 0.522822 | 5301.0 | 11 | 0 | 2 | 0 | 0.000000 | (68.0, 74.0] |
| 4 | 1 | 0.622999 | 53 | 0 | 0.423650 | 13000.0 | 9 | 0 | 2 | 0 | 0.181999 | (52.0, 54.0] |
model_data["qcut"].value_counts()
(36.0, 39.0] 12647
(20.999, 28.0] 11786
(58.0, 61.0] 11386
(48.0, 50.0] 11104
(46.0, 48.0] 10968
(31.0, 34.0] 10867
(50.0, 52.0] 10529
(43.0, 45.0] 10379
(61.0, 64.0] 10197
(39.0, 41.0] 9768
(52.0, 54.0] 9726
(41.0, 43.0] 9682
(28.0, 31.0] 9498
(74.0, 107.0] 9110
(64.0, 68.0] 8917
(54.0, 56.0] 8713
(68.0, 74.0] 8655
(56.0, 58.0] 7887
(34.0, 36.0] 7521
(45.0, 46.0] 5668
Name: qcut, dtype: int64
count_y0=model_data[model_data["SeriousDlqin2yrs"]==0].groupby(by="qcut").count()["SeriousDlqin2yrs"]
count_y0
qcut
(20.999, 28.0] 4243
(28.0, 31.0] 3571
(31.0, 34.0] 4075
(34.0, 36.0] 2908
(36.0, 39.0] 5182
(39.0, 41.0] 3956
(41.0, 43.0] 4002
(43.0, 45.0] 4389
(45.0, 46.0] 2419
(46.0, 48.0] 4813
(48.0, 50.0] 4900
(50.0, 52.0] 4728
(52.0, 54.0] 4681
(54.0, 56.0] 4677
(56.0, 58.0] 4483
(58.0, 61.0] 6583
(61.0, 64.0] 6968
(64.0, 68.0] 6623
(68.0, 74.0] 6753
(74.0, 107.0] 7737
Name: SeriousDlqin2yrs, dtype: int64
count_y1=model_data[model_data["SeriousDlqin2yrs"]==1].groupby(by="qcut").count()["SeriousDlqin2yrs"]
# each element: (lower edge, upper edge, count of label 0, count of label 1)
num_bins=[*zip(updown,updown[1:],count_y0,count_y1)]
num_bins
[(21.0, 28.0, 4243, 7543),
(28.0, 31.0, 3571, 5927),
(31.0, 34.0, 4075, 6792),
(34.0, 36.0, 2908, 4613),
(36.0, 39.0, 5182, 7465),
(39.0, 41.0, 3956, 5812),
(41.0, 43.0, 4002, 5680),
(43.0, 45.0, 4389, 5990),
(45.0, 46.0, 2419, 3249),
(46.0, 48.0, 4813, 6155),
(48.0, 50.0, 4900, 6204),
(50.0, 52.0, 4728, 5801),
(52.0, 54.0, 4681, 5045),
(54.0, 56.0, 4677, 4036),
(56.0, 58.0, 4483, 3404),
(58.0, 61.0, 6583, 4803),
(61.0, 64.0, 6968, 3229),
(64.0, 68.0, 6623, 2294),
(68.0, 74.0, 6753, 1902),
(74.0, 107.0, 7737, 1373)]
3.3.2 Ensuring every bin contains both 0 and 1 (not done here)
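This check is skipped here. A minimal sketch of what it could look like, working on a copy nb of num_bins (the packaged function in 3.3.6 contains a similar loop): any bin with zero goods or zero bads is merged into a neighbour so that the WOE, which takes a log of both counts, stays finite.

nb = num_bins.copy()   # (lower edge, upper edge, count of 0s, count of 1s) per bin
i = 0
while len(nb) > 1 and i < len(nb):
    if 0 in nb[i][2:]:                 # a bin that contains only one class
        j = i - 1 if i > 0 else i + 1  # merge it into the previous bin (or the next, for the first bin)
        lo, hi = min(i, j), max(i, j)
        nb[lo:hi + 1] = [(nb[lo][0], nb[hi][1],
                          nb[lo][2] + nb[hi][2],
                          nb[lo][3] + nb[hi][3])]
        i = 0                          # rescan from the start after each merge
    else:
        i += 1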
3.3.3 Defining the WOE and IV functions
def get_woe(num_bins):
    # build a per-bin table and compute WOE = ln(good% / bad%)
    columns = ["min","max","count_0","count_1"]
    df = pd.DataFrame(num_bins,columns=columns)
    df["total"] = df.count_0 + df.count_1
    df["percentage"] = df.total / df.total.sum()
    df["bad_rate"] = df.count_1 / df.total
    df["good%"] = df.count_0/df.count_0.sum()
    df["bad%"] = df.count_1/df.count_1.sum()
    df["woe"] = np.log(df["good%"] / df["bad%"])
    return df

def get_iv(df):
    # IV = sum over bins of (good% - bad%) * WOE
    rate = df["good%"] - df["bad%"]
    iv = np.sum(rate * df.woe)
    return iv
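For reference, the formulas these two helpers implement (with the convention used in this notebook: label 0 is "good", label 1 is "bad", and i indexes the bins):

WOE_i = ln( good%_i / bad%_i ) = ln( (count_0_i / sum of count_0) / (count_1_i / sum of count_1) )
IV    = sum over i of ( good%_i - bad%_i ) * WOE_i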
3.3.4 Chi-square test, merging bins, plotting the IV curve
num_bins_=num_bins.copy()
import matplotlib.pyplot as plt
import scipy.stats
# chi-square test between the first two bins (their 0/1 counts form a 2x2 contingency table)
x1=num_bins_[0][2:]
x1
(4243, 7543)
x2=num_bins_[1][2:]
x2
(3571, 5927)
scipy.stats.chi2_contingency([x1,x2])[0]
5.705081033738888
len(num_bins_)
20
# p-value of the chi-square test for every pair of adjacent bins
pvs=[]
for i in range(len(num_bins_)-1):
    x1=num_bins_[i][2:]
    x2=num_bins_[i+1][2:]
    pv=scipy.stats.chi2_contingency([x1,x2])[1]
    pvs.append(pv)
len(pvs)
19
# the pair with the largest p-value is the most similar pair of adjacent bins
pvs.index(max(pvs))
1
num_bins_
[(21.0, 28.0, 4243, 7543),
(28.0, 31.0, 3571, 5927),
(31.0, 34.0, 4075, 6792),
(34.0, 36.0, 2908, 4613),
(36.0, 39.0, 5182, 7465),
(39.0, 41.0, 3956, 5812),
(41.0, 43.0, 4002, 5680),
(43.0, 45.0, 4389, 5990),
(45.0, 46.0, 2419, 3249),
(46.0, 48.0, 4813, 6155),
(48.0, 50.0, 4900, 6204),
(50.0, 52.0, 4728, 5801),
(52.0, 54.0, 4681, 5045),
(54.0, 56.0, 4677, 4036),
(56.0, 58.0, 4483, 3404),
(58.0, 61.0, 6583, 4803),
(61.0, 64.0, 6968, 3229),
(64.0, 68.0, 6623, 2294),
(68.0, 74.0, 6753, 1902),
(74.0, 107.0, 7737, 1373)]
# example: merge bins 7 and 8 into a single bin
num_bins_[7:9] = [(
    num_bins_[7][0],
    num_bins_[7+1][1],
    num_bins_[7][2]+num_bins_[7+1][2],
    num_bins_[7][3]+num_bins_[7+1][3])]
num_bins_
[(21.0, 28.0, 4243, 7543),
(28.0, 31.0, 3571, 5927),
(31.0, 34.0, 4075, 6792),
(34.0, 36.0, 2908, 4613),
(36.0, 39.0, 5182, 7465),
(39.0, 41.0, 3956, 5812),
(41.0, 43.0, 4002, 5680),
(43.0, 46.0, 6808, 9239),
(46.0, 48.0, 4813, 6155),
(48.0, 50.0, 4900, 6204),
(50.0, 52.0, 4728, 5801),
(52.0, 54.0, 4681, 5045),
(54.0, 56.0, 4677, 4036),
(56.0, 58.0, 4483, 3404),
(58.0, 61.0, 6583, 4803),
(61.0, 64.0, 6968, 3229),
(64.0, 68.0, 6623, 2294),
(68.0, 74.0, 6753, 1902),
(74.0, 107.0, 7737, 1373)]
len(num_bins_)
19
num_bins_ = num_bins.copy()
import matplotlib.pyplot as plt
import scipy.stats
IV=[]
axisx=[]
# keep merging the pair of adjacent bins with the largest chi-square p-value,
# recording the IV after every merge, until only 2 bins remain
while len(num_bins_)>2:
    pvs=[]
    for i in range(len(num_bins_)-1):
        x1 = num_bins_[i][2:]
        x2=num_bins_[i+1][2:]
        pv=scipy.stats.chi2_contingency([x1,x2])[1]
        pvs.append(pv)
    i=pvs.index(max(pvs))
    num_bins_[i:i+2]=[(
        num_bins_[i][0],
        num_bins_[i+1][1],
        num_bins_[i][2]+num_bins_[i+1][2],
        num_bins_[i][3]+num_bins_[i+1][3])]
    bins_df=get_woe(num_bins_)
    axisx.append(len(num_bins_))
    IV.append(get_iv(bins_df))
plt.figure()
plt.plot(axisx,IV)
plt.xticks(axisx)
plt.xlabel("number of box")
plt.ylabel("IV")
plt.show()
3.3.5 Binning with the optimal number of bins and validating the result
def get_bin(num_bins_,n):
    # merge adjacent bins (largest chi-square p-value first) until n bins remain
    while len(num_bins_)>n:
        pvs=[]
        for i in range(len(num_bins_)-1):
            x1 = num_bins_[i][2:]
            x2=num_bins_[i+1][2:]
            pv=scipy.stats.chi2_contingency([x1,x2])[1]
            pvs.append(pv)
        i=pvs.index(max(pvs))
        num_bins_[i:i+2]=[(
            num_bins_[i][0],
            num_bins_[i+1][1],
            num_bins_[i][2]+num_bins_[i+1][2],
            num_bins_[i][3]+num_bins_[i+1][3])]
    return num_bins_
num_bins_ = num_bins.copy()
afterbins=get_bin(num_bins_,7)
afterbins
[(21.0, 36.0, 14797, 24875),
(36.0, 46.0, 19948, 28196),
(46.0, 54.0, 19122, 23205),
(54.0, 61.0, 15743, 12243),
(61.0, 64.0, 6968, 3229),
(64.0, 74.0, 13376, 4196),
(74.0, 107.0, 7737, 1373)]
bins_df=get_woe(afterbins)
bins_df
|   | min | max | count_0 | count_1 | total | percentage | bad_rate | good% | bad% | woe |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 21.0 | 36.0 | 14797 | 24875 | 39672 | 0.203438 | 0.627017 | 0.151467 | 0.255608 | -0.523275 |
| 1 | 36.0 | 46.0 | 19948 | 28196 | 48144 | 0.246882 | 0.585660 | 0.204195 | 0.289734 | -0.349887 |
| 2 | 46.0 | 54.0 | 19122 | 23205 | 42327 | 0.217053 | 0.548232 | 0.195740 | 0.238448 | -0.197364 |
| 3 | 54.0 | 61.0 | 15743 | 12243 | 27986 | 0.143512 | 0.437469 | 0.161151 | 0.125805 | 0.247606 |
| 4 | 61.0 | 64.0 | 6968 | 3229 | 10197 | 0.052290 | 0.316662 | 0.071327 | 0.033180 | 0.765320 |
| 5 | 64.0 | 74.0 | 13376 | 4196 | 17572 | 0.090109 | 0.238789 | 0.136922 | 0.043117 | 1.155495 |
| 6 | 74.0 | 107.0 | 7737 | 1373 | 9110 | 0.046716 | 0.150714 | 0.079199 | 0.014109 | 1.725180 |
3.3.6 Wrapping the optimal-bin-count search into a function
def graphforbestbin(DF, X, Y, n=5,q=20,graph=True):
    """
    Automatic optimal binning based on the chi-square test.
    Parameters:
    DF: input data
    X: name of the column to bin
    Y: name of the label column
    n: number of bins to keep
    q: number of initial bins
    graph: whether to plot the IV curve
    Intervals are open on the left and closed on the right: (]
    """
    bins_df=[]
    DF = DF[[X,Y]].copy()
    DF["qcut"],bins = pd.qcut(DF[X], retbins=True, q=q,duplicates="drop")
    coount_y0 = DF.loc[DF[Y]==0].groupby(by="qcut").count()[Y]
    coount_y1 = DF.loc[DF[Y]==1].groupby(by="qcut").count()[Y]
    num_bins = [*zip(bins,bins[1:],coount_y0,coount_y1)]
    # make sure every bin contains both classes: merge any bin with a zero count into its neighbour
    for i in range(q):
        if 0 in num_bins[0][2:]:
            num_bins[0:2] = [(
                num_bins[0][0],
                num_bins[1][1],
                num_bins[0][2]+num_bins[1][2],
                num_bins[0][3]+num_bins[1][3])]
            continue
        for i in range(len(num_bins)):
            if 0 in num_bins[i][2:]:
                num_bins[i-1:i+1] = [(
                    num_bins[i-1][0],
                    num_bins[i][1],
                    num_bins[i-1][2]+num_bins[i][2],
                    num_bins[i-1][3]+num_bins[i][3])]
                break
        else:
            break

    def get_woe(num_bins):
        columns = ["min","max","count_0","count_1"]
        df = pd.DataFrame(num_bins,columns=columns)
        df["total"] = df.count_0 + df.count_1
        df["percentage"] = df.total / df.total.sum()
        df["bad_rate"] = df.count_1 / df.total
        df["good%"] = df.count_0/df.count_0.sum()
        df["bad%"] = df.count_1/df.count_1.sum()
        df["woe"] = np.log(df["good%"] / df["bad%"])
        return df

    def get_iv(df):
        rate = df["good%"] - df["bad%"]
        iv = np.sum(rate * df.woe)
        return iv

    IV = []
    axisx = []
    # merge the most similar (largest p-value) adjacent bins until n bins remain
    while len(num_bins) > n:
        pvs = []
        for i in range(len(num_bins)-1):
            x1 = num_bins[i][2:]
            x2 = num_bins[i+1][2:]
            pv = scipy.stats.chi2_contingency([x1,x2])[1]
            pvs.append(pv)
        i = pvs.index(max(pvs))
        num_bins[i:i+2] = [(
            num_bins[i][0],
            num_bins[i+1][1],
            num_bins[i][2]+num_bins[i+1][2],
            num_bins[i][3]+num_bins[i+1][3])]
        bins_df = pd.DataFrame(get_woe(num_bins))
        axisx.append(len(num_bins))
        IV.append(get_iv(bins_df))
    if graph:
        plt.figure()
        plt.plot(axisx,IV)
        plt.xticks(axisx)
        plt.xlabel("number of box")
        plt.ylabel("IV")
        plt.show()
    return bins_df
3.3.7 Choosing bins for all features
model_data.columns
Index(['SeriousDlqin2yrs', 'RevolvingUtilizationOfUnsecuredLines', 'age',
'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio', 'MonthlyIncome',
'NumberOfOpenCreditLinesAndLoans', 'NumberOfTimes90DaysLate',
'NumberRealEstateLoansOrLines', 'NumberOfTime60-89DaysPastDueNotWorse',
'NumberOfDependents', 'qcut'],
dtype='object')
for i in model_data.columns[1:-1]:
    print(i)
    graphforbestbin(model_data,i,"SeriousDlqin2yrs",n=2,q=20)
RevolvingUtilizationOfUnsecuredLines
age
NumberOfTime30-59DaysPastDueNotWorse
DebtRatio
MonthlyIncome
NumberOfOpenCreditLinesAndLoans
NumberOfTimes90DaysLate
NumberRealEstateLoansOrLines
NumberOfTime60-89DaysPastDueNotWorse
NumberOfDependents
model_data.describe([0.01,0.1,0.25,.5,.75,.9,.99]).T
|   | count | mean | std | min | 1% | 10% | 25% | 50% | 75% | 90% | 99% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SeriousDlqin2yrs | 195008.0 | 0.499041 | 0.500000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.00000 | 1.000000 | 1.0 |
| RevolvingUtilizationOfUnsecuredLines | 195008.0 | 4.716073 | 151.934558 | 0.0 | 0.0 | 0.015492 | 0.099128 | 0.465003 | 0.874343 | 1.00000 | 1.353400 | 18300.0 |
| age | 195008.0 | 49.054131 | 13.916980 | 21.0 | 24.0 | 31.000000 | 39.000000 | 48.000000 | 58.000000 | 68.00000 | 84.000000 | 107.0 |
| NumberOfTime30-59DaysPastDueNotWorse | 195008.0 | 0.422429 | 0.869693 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.00000 | 4.000000 | 13.0 |
| DebtRatio | 195008.0 | 332.471143 | 1845.205520 | 0.0 | 0.0 | 0.076050 | 0.208088 | 0.401763 | 0.848021 | 994.00000 | 5147.758259 | 329664.0 |
| MonthlyIncome | 195008.0 | 5151.967974 | 11369.838287 | 0.0 | 0.0 | 0.280000 | 2000.000000 | 4167.000000 | 6900.000000 | 10240.00000 | 22000.000000 | 3008750.0 |
| NumberOfOpenCreditLinesAndLoans | 195008.0 | 7.989144 | 5.015858 | 0.0 | 0.0 | 2.000000 | 4.000000 | 7.000000 | 11.000000 | 15.00000 | 23.000000 | 57.0 |
| NumberOfTimes90DaysLate | 195008.0 | 0.224883 | 0.718839 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.00000 | 3.000000 | 17.0 |
| NumberRealEstateLoansOrLines | 195008.0 | 0.878108 | 1.124510 | 0.0 | 0.0 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 2.00000 | 5.000000 | 32.0 |
| NumberOfTime60-89DaysPastDueNotWorse | 195008.0 | 0.117472 | 0.423076 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 2.000000 | 9.0 |
| NumberOfDependents | 195008.0 | 0.837991 | 1.081541 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.258888 | 1.434680 | 2.32316 | 4.000000 | 20.0 |
auto_col_bins = {"RevolvingUtilizationOfUnsecuredLines":6,
"age":5,
"DebtRatio":4,
"MonthlyIncome":3,
"NumberOfOpenCreditLinesAndLoans":5}
hand_bins = {"NumberOfTime30-59DaysPastDueNotWorse":[0,1,2,13]
,"NumberOfTimes90DaysLate":[0,1,2,17]
,"NumberRealEstateLoansOrLines":[0,1,2,4,54]
,"NumberOfTime60-89DaysPastDueNotWorse":[0,1,2,8]
,"NumberOfDependents":[0,1,2,3]}
hand_bins = {k:[-np.inf,*v[1:-1],np.inf] for k,v in hand_bins.items()}
auto_col_bins
{'RevolvingUtilizationOfUnsecuredLines': 6,
'age': 5,
'DebtRatio': 4,
'MonthlyIncome': 3,
'NumberOfOpenCreditLinesAndLoans': 5}
hand_bins.items()
dict_items([('NumberOfTime30-59DaysPastDueNotWorse', [-inf, 1, 2, inf]), ('NumberOfTimes90DaysLate', [-inf, 1, 2, inf]), ('NumberRealEstateLoansOrLines', [-inf, 1, 2, 4, inf]), ('NumberOfTime60-89DaysPastDueNotWorse', [-inf, 1, 2, inf]), ('NumberOfDependents', [-inf, 1, 2, inf])])
hand_bins
{'NumberOfTime30-59DaysPastDueNotWorse': [-inf, 1, 2, inf],
'NumberOfTimes90DaysLate': [-inf, 1, 2, inf],
'NumberRealEstateLoansOrLines': [-inf, 1, 2, 4, inf],
'NumberOfTime60-89DaysPastDueNotWorse': [-inf, 1, 2, inf],
'NumberOfDependents': [-inf, 1, 2, inf]}
bins_of_col = {}
# auto-binned features: run the optimal-binning search and keep only the bin edges
for col in auto_col_bins:
    bins_df = graphforbestbin(model_data,col
                              ,"SeriousDlqin2yrs"
                              ,n=auto_col_bins[col]
                              ,q=20
                              ,graph=False)
    bins_list = sorted(set(bins_df["min"]).union(bins_df["max"]))
    # open up the two outermost edges
    bins_list[0],bins_list[-1] = -np.inf,np.inf
    bins_of_col[col] = bins_list
bins_of_col
{'RevolvingUtilizationOfUnsecuredLines': [-inf,
0.09912842857080169,
0.29777151691114556,
0.4650029378240421,
0.9824623886712799,
0.9999999,
inf],
'age': [-inf, 36.0, 54.0, 61.0, 74.0, inf],
'DebtRatio': [-inf,
0.0174953101,
0.5034625055732768,
1.4722640672420035,
inf],
'MonthlyIncome': [-inf, 0.1, 5594.0, inf],
'NumberOfOpenCreditLinesAndLoans': [-inf, 1.0, 3.0, 5.0, 17.0, inf]}
bins_list[0],bins_list[-1] = -np.inf,np.inf
bins_of_col.update(hand_bins)
bins_of_col
{'RevolvingUtilizationOfUnsecuredLines': [-inf,
0.09912842857080169,
0.29777151691114556,
0.4650029378240421,
0.9824623886712799,
0.9999999,
inf],
'age': [-inf, 36.0, 54.0, 61.0, 74.0, inf],
'DebtRatio': [-inf,
0.0174953101,
0.5034625055732768,
1.4722640672420035,
inf],
'MonthlyIncome': [-inf, 0.1, 5594.0, inf],
'NumberOfOpenCreditLinesAndLoans': [-inf, 1.0, 3.0, 5.0, 17.0, inf],
'NumberOfTime30-59DaysPastDueNotWorse': [-inf, 1, 2, inf],
'NumberOfTimes90DaysLate': [-inf, 1, 2, inf],
'NumberRealEstateLoansOrLines': [-inf, 1, 2, 4, inf],
'NumberOfTime60-89DaysPastDueNotWorse': [-inf, 1, 2, inf],
'NumberOfDependents': [-inf, 1, 2, inf]}
3.4 Computing each bin's WOE and mapping it onto the data
data = model_data.copy()
# demonstrate the WOE mapping on age first, with one set of cut points for age
data = data[["age","SeriousDlqin2yrs"]].copy()
data["cut"] = pd.cut(data["age"],[-np.inf, 48.49986200790144, 58.757170160044694, 64.0,
                                  74.0, np.inf])
data
|   | age | SeriousDlqin2yrs | cut |
| --- | --- | --- | --- |
| 0 | 53 | 0 | (48.5, 58.757] |
| 1 | 63 | 0 | (58.757, 64.0] |
| 2 | 39 | 1 | (-inf, 48.5] |
| 3 | 73 | 0 | (64.0, 74.0] |
| 4 | 53 | 1 | (48.5, 58.757] |
| ... | ... | ... | ... |
| 195003 | 32 | 1 | (-inf, 48.5] |
| 195004 | 50 | 1 | (48.5, 58.757] |
| 195005 | 46 | 1 | (-inf, 48.5] |
| 195006 | 64 | 0 | (58.757, 64.0] |
| 195007 | 53 | 1 | (48.5, 58.757] |

195008 rows × 3 columns
data.groupby("cut")["SeriousDlqin2yrs"].value_counts()
cut SeriousDlqin2yrs
(-inf, 48.5] 1 59226
0 39558
(48.5, 58.757] 1 24490
0 23469
(58.757, 64.0] 0 13551
1 8032
(64.0, 74.0] 0 13376
1 4196
(74.0, inf] 0 7737
1 1373
Name: SeriousDlqin2yrs, dtype: int64
data.groupby("cut")["SeriousDlqin2yrs"].value_counts().unstack()
| cut | SeriousDlqin2yrs = 0 | SeriousDlqin2yrs = 1 |
| --- | --- | --- |
| (-inf, 48.5] | 39558 | 59226 |
| (48.5, 58.757] | 23469 | 24490 |
| (58.757, 64.0] | 13551 | 8032 |
| (64.0, 74.0] | 13376 | 4196 |
| (74.0, inf] | 7737 | 1373 |
bins_df = data.groupby("cut")["SeriousDlqin2yrs"].value_counts().unstack()
bins_df["woe"] = np.log((bins_df[0]/bins_df[0].sum())/(bins_df[1]/bins_df[1].sum()))
bins_df
| cut | SeriousDlqin2yrs = 0 | SeriousDlqin2yrs = 1 | woe |
| --- | --- | --- | --- |
| (-inf, 48.5] | 39558 | 59226 | -0.407428 |
| (48.5, 58.757] | 23469 | 24490 | -0.046420 |
| (58.757, 64.0] | 13551 | 8032 | 0.519191 |
| (64.0, 74.0] | 13376 | 4196 | 1.155495 |
| (74.0, inf] | 7737 | 1373 | 1.725180 |
# a second get_woe (it replaces the one defined in 3.3.3): apply precomputed bin edges
# to a column and return the WOE of each bin
def get_woe(df,col,y,bins):
    df = df[[col,y]].copy()
    df["cut"] = pd.cut(df[col],bins)
    bins_df = df.groupby("cut")[y].value_counts().unstack()
    woe = bins_df["woe"] = np.log((bins_df[0]/bins_df[0].sum())/(bins_df[1]/bins_df[1].sum()))
    return woe

# compute the WOE of every feature's bins and store them in a dict
woeall = {}
for col in bins_of_col:
    woeall[col] = get_woe(model_data,col,"SeriousDlqin2yrs",bins_of_col[col])
woeall
{'RevolvingUtilizationOfUnsecuredLines': cut
(-inf, 0.0991] 2.205113
(0.0991, 0.298] 0.665610
(0.298, 0.465] -0.127577
(0.465, 0.982] -1.073125
(0.982, 1.0] -0.471851
(1.0, inf] -2.040304
dtype: float64,
'age': cut
(-inf, 36.0] -0.523275
(36.0, 54.0] -0.278138
(54.0, 61.0] 0.247606
(61.0, 74.0] 1.004098
(74.0, inf] 1.725180
dtype: float64,
'DebtRatio': cut
(-inf, 0.0175] 1.521696
(0.0175, 0.503] -0.011220
(0.503, 1.472] -0.472690
(1.472, inf] 0.175196
dtype: float64,
'MonthlyIncome': cut
(-inf, 0.1] 1.348333
(0.1, 5594.0] -0.238024
(5594.0, inf] 0.232036
dtype: float64,
'NumberOfOpenCreditLinesAndLoans': cut
(-inf, 1.0] -0.842253
(1.0, 3.0] -0.331046
(3.0, 5.0] -0.055325
(5.0, 17.0] 0.123566
(17.0, inf] 0.464595
dtype: float64,
'NumberOfTime30-59DaysPastDueNotWorse': cut
(-inf, 1.0] 0.133757
(1.0, 2.0] -1.377944
(2.0, inf] -1.548467
dtype: float64,
'NumberOfTimes90DaysLate': cut
(-inf, 1.0] 0.088506
(1.0, 2.0] -2.256812
(2.0, inf] -2.414750
dtype: float64,
'NumberRealEstateLoansOrLines': cut
(-inf, 1.0] -0.146831
(1.0, 2.0] 0.620994
(2.0, 4.0] 0.388716
(4.0, inf] -0.291518
dtype: float64,
'NumberOfTime60-89DaysPastDueNotWorse': cut
(-inf, 1.0] 0.028093
(1.0, 2.0] -1.779675
(2.0, inf] -1.827914
dtype: float64,
'NumberOfDependents': cut
(-inf, 1.0] 0.202748
(1.0, 2.0] -0.531219
(2.0, inf] -0.477951
dtype: float64}
model_woe = pd.DataFrame(index=model_data.index)
# demo: map age onto its WOE values first (note this puts age in the first column, so the
# column order differs from vali_woe below, which is what the FutureWarning in 3.5 is about)
model_woe["age"] = pd.cut(model_data["age"],bins_of_col["age"]).map(woeall["age"])
# map every binned feature onto its WOE
for col in bins_of_col:
    model_woe[col] = pd.cut(model_data[col],bins_of_col[col]).map(woeall[col])
model_woe["SeriousDlqin2yrs"] = model_data["SeriousDlqin2yrs"]
model_woe.head()
|   | age | RevolvingUtilizationOfUnsecuredLines | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTime30-59DaysPastDueNotWorse | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents | SeriousDlqin2yrs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -0.278138 | 2.205113 | -0.01122 | -0.238024 | -0.055325 | 0.133757 | 0.088506 | -0.146831 | 0.028093 | 0.202748 | 0 |
| 1 | 1.004098 | 0.665610 | -0.01122 | -0.238024 | -0.055325 | 0.133757 | 0.088506 | -0.146831 | 0.028093 | 0.202748 | 0 |
| 2 | -0.278138 | -2.040304 | -0.01122 | -0.238024 | -0.055325 | 0.133757 | 0.088506 | -0.146831 | -1.779675 | -0.477951 | 1 |
| 3 | 1.004098 | 2.205113 | -0.47269 | -0.238024 | 0.123566 | 0.133757 | 0.088506 | 0.620994 | 0.028093 | 0.202748 | 0 |
| 4 | -0.278138 | -1.073125 | -0.01122 | 0.232036 | 0.123566 | 0.133757 | 0.088506 | 0.620994 | 0.028093 | 0.202748 | 1 |
3.5 Modeling and model validation
vali_woe = pd.DataFrame(index=vali_data.index)
for col in bins_of_col:
    vali_woe[col] = pd.cut(vali_data[col],bins_of_col[col]).map(woeall[col])
vali_woe["SeriousDlqin2yrs"] = vali_data["SeriousDlqin2yrs"]
vali_woe.head()
|   | RevolvingUtilizationOfUnsecuredLines | age | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTime30-59DaysPastDueNotWorse | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents | SeriousDlqin2yrs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2.205113 | 0.247606 | 1.521696 | -0.238024 | -0.055325 | 0.133757 | 0.088506 | -0.146831 | 0.028093 | 0.202748 | 0 |
| 1 | -1.073125 | -0.278138 | -0.011220 | 0.232036 | 0.123566 | 0.133757 | 0.088506 | 0.620994 | 0.028093 | -0.477951 | 1 |
| 2 | 2.205113 | 1.004098 | -0.011220 | 0.232036 | -0.055325 | 0.133757 | 0.088506 | -0.146831 | 0.028093 | 0.202748 | 0 |
| 3 | 2.205113 | -0.278138 | -0.011220 | -0.238024 | 0.123566 | 0.133757 | 0.088506 | -0.146831 | 0.028093 | 0.202748 | 0 |
| 4 | -1.073125 | -0.278138 | -0.011220 | -0.238024 | 0.123566 | 0.133757 | 0.088506 | -0.146831 | 0.028093 | 0.202748 | 1 |
X = model_woe.iloc[:,:-1]
y = model_woe.iloc[:,-1]
vali_X = vali_woe.iloc[:,:-1]
vali_y = vali_woe.iloc[:,-1]
from sklearn.linear_model import LogisticRegression as LR
lr = LR().fit(X,y)
lr.score(vali_X,vali_y)
D:\py1.1\lib\site-packages\sklearn\base.py:493: FutureWarning: The feature names should match those that were passed during fit. Starting version 1.2, an error will be raised.
Feature names must be in the same order as they were in fit.
warnings.warn(message, FutureWarning)
0.7587824255767206
c_1 = np.linspace(0.01,1,20)
c_2 = np.linspace(0.01,0.2,20)
score = []
# validation accuracy for each candidate value of C (the narrower range c_2 is explored here)
for i in c_2:
    lr = LR(solver='liblinear',C=i).fit(X,y)
    score.append(lr.score(vali_X,vali_y))
plt.figure()
plt.plot(c_2,score)
plt.show()
(the same FutureWarning about feature-name order is printed for each of the 20 fits; repetitions omitted)
lr.n_iter_
array([5], dtype=int32)
score = []
# validation accuracy as max_iter grows, at the chosen C
for i in [1,2,3,4,5,6]:
    lr = LR(solver='liblinear',C=0.025,max_iter=i).fit(X,y)
    score.append(lr.score(vali_X,vali_y))
plt.figure()
plt.plot([1,2,3,4,5,6],score)
plt.show()
D:\py1.1\lib\site-packages\sklearn\svm\_base.py:1225: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
(the ConvergenceWarning and the FutureWarning above are repeated for each of the six fits; repetitions omitted)
import scikitplot as skplt
# ROC curve on the validation set (one curve per class)
vali_proba_df = pd.DataFrame(lr.predict_proba(vali_X))
skplt.metrics.plot_roc(vali_y, vali_proba_df,
                       plot_micro=False,figsize=(6,6),
                       plot_macro=False)
D:\py1.1\lib\site-packages\sklearn\base.py:493: FutureWarning: The feature names should match those that were passed during fit. Starting version 1.2, an error will be raised.
Feature names must be in the same order as they were in fit.
warnings.warn(message, FutureWarning)
3.6 Building the scorecard
# scorecard scaling: Score = A - B*ln(odds), with a base score of 600 at odds of 1:60
# and PDO = 20 (the score changes by 20 points every time the odds double)
B = 20/np.log(2)
A = 600 + B*np.log(1/60)
B,A
(28.85390081777927, 481.8621880878296)
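Where the two constants come from (a worked version of the scaling used in the cell above):

Score    = A - B*ln(odds)
600      = A - B*ln(1/60)       (base score 600 at odds of 1:60)
600 + 20 = A - B*ln(1/120)      (PDO = 20: the score rises by 20 when the odds halve)
subtracting the two equations:  B = 20/ln(2) ≈ 28.854,  A = 600 + B*ln(1/60) ≈ 481.862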
lr.intercept_
array([-0.00672073])
base_score = A - B*lr.intercept_
base_score
array([482.05610739])
model_data.head()
|   | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents | qcut |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0.015404 | 53 | 0 | 0.121802 | 4728.0 | 5 | 0 | 0 | 0 | 0.000000 | (52.0, 54.0] |
| 1 | 0 | 0.168311 | 63 | 0 | 0.141964 | 1119.0 | 5 | 0 | 0 | 0 | 0.000000 | (61.0, 64.0] |
| 2 | 1 | 1.063570 | 39 | 1 | 0.417663 | 3500.0 | 5 | 1 | 0 | 2 | 3.716057 | (36.0, 39.0] |
| 3 | 0 | 0.088684 | 73 | 0 | 0.522822 | 5301.0 | 11 | 0 | 2 | 0 | 0.000000 | (68.0, 74.0] |
| 4 | 1 | 0.622999 | 53 | 0 | 0.423650 | 13000.0 | 9 | 0 | 2 | 0 | 0.181999 | (52.0, 54.0] |
woeall
{'RevolvingUtilizationOfUnsecuredLines': cut
(-inf, 0.0991] 2.205113
(0.0991, 0.298] 0.665610
(0.298, 0.465] -0.127577
(0.465, 0.982] -1.073125
(0.982, 1.0] -0.471851
(1.0, inf] -2.040304
dtype: float64,
'age': cut
(-inf, 36.0] -0.523275
(36.0, 54.0] -0.278138
(54.0, 61.0] 0.247606
(61.0, 74.0] 1.004098
(74.0, inf] 1.725180
dtype: float64,
'DebtRatio': cut
(-inf, 0.0175] 1.521696
(0.0175, 0.503] -0.011220
(0.503, 1.472] -0.472690
(1.472, inf] 0.175196
dtype: float64,
'MonthlyIncome': cut
(-inf, 0.1] 1.348333
(0.1, 5594.0] -0.238024
(5594.0, inf] 0.232036
dtype: float64,
'NumberOfOpenCreditLinesAndLoans': cut
(-inf, 1.0] -0.842253
(1.0, 3.0] -0.331046
(3.0, 5.0] -0.055325
(5.0, 17.0] 0.123566
(17.0, inf] 0.464595
dtype: float64,
'NumberOfTime30-59DaysPastDueNotWorse': cut
(-inf, 1.0] 0.133757
(1.0, 2.0] -1.377944
(2.0, inf] -1.548467
dtype: float64,
'NumberOfTimes90DaysLate': cut
(-inf, 1.0] 0.088506
(1.0, 2.0] -2.256812
(2.0, inf] -2.414750
dtype: float64,
'NumberRealEstateLoansOrLines': cut
(-inf, 1.0] -0.146831
(1.0, 2.0] 0.620994
(2.0, 4.0] 0.388716
(4.0, inf] -0.291518
dtype: float64,
'NumberOfTime60-89DaysPastDueNotWorse': cut
(-inf, 1.0] 0.028093
(1.0, 2.0] -1.779675
(2.0, inf] -1.827914
dtype: float64,
'NumberOfDependents': cut
(-inf, 1.0] 0.202748
(1.0, 2.0] -0.531219
(2.0, inf] -0.477951
dtype: float64}
score_age = woeall["age"] * (-B*lr.coef_[0][1])
score_age
cut
(-inf, 36.0] -12.517538
(36.0, 54.0] -6.653503
(54.0, 61.0] 5.923113
(61.0, 74.0] 24.019569
(74.0, inf] 41.268981
dtype: float64
lr.coef_[0]
array([-0.40528794, -0.8290577 , -0.64610093, -0.61661619, -0.37484361,
-0.60937781, -0.5959787 , -0.80402665, -0.30654535, -0.55908073])
file = "D:\class_file\day08_05\ScoreData.csv"
with open(file,"w") as fdata:
fdata.write("base_score,{}\n".format(base_score))
for i,col in enumerate(X.columns):
score = woeall[col] * (-B*lr.coef_[0][i])
score.name = "Score"
score.index.name = col
score.to_csv(file,header=True,mode="a")