So I have a bank dataset in which I have to predict whether a customer will take a term deposit. There is a column called job, which is categorical and holds each customer's job type. I am currently in the EDA stage and want to determine which job category contributes the most to a positive prediction.

I plan to do this with logistic regression (not sure whether that is the right approach; other methods are welcome).

So this is what I did:

I one-hot encoded the job column (k dummy columns, one 1/0 column per job type), and I k-1 encoded the target, leaving a single Target_yes column with 1/0 values (1 = the customer accepted the term deposit, 0 = the customer did not).
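A minimal sketch of that encoding with pandas `get_dummies`, assuming the raw frame is called `df` with columns named `job` and `Target` (the frame and column names here are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in for the bank data; the real 'job' column has 12 categories.
df = pd.DataFrame({
    'job': ['management', 'technician', 'retired', 'student'],
    'Target': ['no', 'yes', 'yes', 'no'],
})

# One-hot encode job: one 1/0 dummy column per job type (all k columns kept).
vari = pd.get_dummies(df['job'], prefix='job', dtype=int)

# k-1 encode the binary target: drop_first leaves only Target_yes.
tgt = pd.get_dummies(df['Target'], prefix='Target', drop_first=True,
                     dtype=int)['Target_yes']

print(vari.columns.tolist())
# ['job_management', 'job_retired', 'job_student', 'job_technician']
print(tgt.tolist())
# [0, 1, 1, 0]
```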
```
       job_management  job_technician  job_entrepreneur  job_blue-collar  ...  job_housemaid  job_student
0                   1               0                 0                0  ...              0            0
1                   0               1                 0                0  ...              0            0
2                   0               0                 1                0  ...              0            0
3                   0               0                 0                1  ...              0            0
4                   0               0                 0                0  ...              0            0
...               ...             ...               ...              ...  ...            ...          ...
45206               0               1                 0                0  ...              0            0
45207               0               0                 0                0  ...              0            0
45208               0               0                 0                0  ...              0            0
45209               0               0                 0                1  ...              0            0
45210               0               0                 1                0  ...              0            0

[45211 rows x 12 columns]
```
The target column looks like this:
```
0        0
1        0
2        0
3        0
4        0
        ..
45206    1
45207    1
45208    1
45209    0
45210    0
Name: Target_yes, Length: 45211, dtype: int32
```
I fit this to a sklearn logistic regression model and got the coefficients. Unable to interpret them, I looked for an alternative and came across the statsmodels version, which does the same with its Logit function. In the example I saw online, the author passed sm.add_constant(x) as the x variable.
```python
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

model = LogisticRegression(solver='liblinear')
model.fit(vari, tgt)
model.score(vari, tgt)

df = pd.DataFrame(model.coef_)
df['inter'] = model.intercept_
print(df)
```
The model score and the print(df) result are as follows:
```
0.8830151954170445        # model score

print(df)
          0         1         2         3         4         5         6  \
0 -0.040404 -0.289274 -0.604957 -0.748797 -0.206201  0.573717 -0.177778

          7         8         9        10        11     inter
0 -0.530802 -0.210549  0.099326 -0.539109  0.879504 -1.795323
```
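The unlabeled DataFrame above is hard to read; one way to make the coefficients interpretable is to index them by the dummy column names. This is a sketch that rebuilds the series from the numbers above, with the column order assumed to match the dummy frame shown earlier; with the fitted model it would simply be `pd.Series(model.coef_[0], index=vari.columns)`:

```python
import pandas as pd

# Coefficients and (assumed) column order copied from the output above.
cols = ['job_management', 'job_technician', 'job_entrepreneur',
        'job_blue-collar', 'job_unknown', 'job_retired', 'job_admin.',
        'job_services', 'job_self-employed', 'job_unemployed',
        'job_housemaid', 'job_student']
coefs = [-0.040404, -0.289274, -0.604957, -0.748797, -0.206201, 0.573717,
         -0.177778, -0.530802, -0.210549, 0.099326, -0.539109, 0.879504]

# With a fitted sklearn model this would be:
#   coef_series = pd.Series(model.coef_[0], index=vari.columns)
coef_series = pd.Series(coefs, index=cols)
print(coef_series.sort_values(ascending=False))
```

Sorted this way, job_student (0.8795) and job_retired (0.5737) stand out with the largest positive log-odds.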
When I use sm.add_constant, I get coefficients similar to the sklearn LogisticRegression ones, but the z-scores (which I planned to use to find the job type that contributes most to a positive prediction) all come out as nan.
```python
import statsmodels.api as sm

logit = sm.Logit(tgt, sm.add_constant(vari)).fit()
logit.summary2()
```
The result is:
```
E:\Programs\Anaconda\lib\site-packages\numpy\core\fromnumeric.py:2495: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
E:\Programs\Anaconda\lib\site-packages\statsmodels\base\model.py:1286: RuntimeWarning: invalid value encountered in sqrt
E:\Programs\Anaconda\lib\site-packages\scipy\stats\_distn_infrastructure.py:901: RuntimeWarning: invalid value encountered in greater
E:\Programs\Anaconda\lib\site-packages\scipy\stats\_distn_infrastructure.py:901: RuntimeWarning: invalid value encountered in less
E:\Programs\Anaconda\lib\site-packages\scipy\stats\_distn_infrastructure.py:1892: RuntimeWarning: invalid value encountered in less_equal

Optimization terminated successfully.
         Current function value: 0.352610
         Iterations 13

Model:               Logit             Pseudo R-squared:  0.023
Dependent Variable:  Target_yes        AIC:               31907.6785
Date:                2019-11-18 10:17  BIC:               32012.3076
No. Observations:    45211             Log-Likelihood:    -15942.
Df Model:            11                LL-Null:           -16315.
Df Residuals:        45199             LLR p-value:       3.9218e-153
Converged:           1.0000            Scale:             1.0000
No. Iterations:      13.0000

                     Coef.  Std.Err.    z    P>|z|   [0.025   0.975]
const              -1.7968       nan   nan     nan      nan      nan
job_management     -0.0390       nan   nan     nan      nan      nan
job_technician     -0.2882       nan   nan     nan      nan      nan
job_entrepreneur   -0.6092       nan   nan     nan      nan      nan
job_blue-collar    -0.7484       nan   nan     nan      nan      nan
job_unknown        -0.2142       nan   nan     nan      nan      nan
job_retired         0.5766       nan   nan     nan      nan      nan
job_admin.         -0.1766       nan   nan     nan      nan      nan
job_services       -0.5312       nan   nan     nan      nan      nan
job_self-employed  -0.2106       nan   nan     nan      nan      nan
job_unemployed      0.1011       nan   nan     nan      nan      nan
job_housemaid      -0.5427       nan   nan     nan      nan      nan
job_student         0.8857       nan   nan     nan      nan      nan
```
If I use the statsmodels Logit without sm.add_constant, the coefficients I get are very different from the sklearn logistic regression ones, but I do get z-score values (all of them negative).
```python
import statsmodels.api as sm

logit = sm.Logit(tgt, vari).fit()
logit.summary2()
```
The result is:
```
Optimization terminated successfully.
         Current function value: 0.352610
         Iterations 6

Model:               Logit             Pseudo R-squared:  0.023
Dependent Variable:  Target_yes        AIC:               31907.6785
Date:                2019-11-18 10:18  BIC:               32012.3076
No. Observations:    45211             Log-Likelihood:    -15942.
Df Model:            11                LL-Null:           -16315.
Df Residuals:        45199             LLR p-value:       3.9218e-153
Converged:           1.0000            Scale:             1.0000
No. Iterations:      6.0000

                     Coef.  Std.Err.      z     P>|z|   [0.025   0.975]
job_management     -1.8357    0.0299  -61.4917  0.0000  -1.8943  -1.7772
job_technician     -2.0849    0.0366  -56.9885  0.0000  -2.1566  -2.0132
job_entrepreneur   -2.4060    0.0941  -25.5563  0.0000  -2.5905  -2.2215
job_blue-collar    -2.5452    0.0390  -65.2134  0.0000  -2.6217  -2.4687
job_unknown        -2.0110    0.1826  -11.0120  0.0000  -2.3689  -1.6531
job_retired        -1.2201    0.0501  -24.3534  0.0000  -1.3183  -1.1219
job_admin.         -1.9734    0.0425  -46.4478  0.0000  -2.0566  -1.8901
job_services       -2.3280    0.0545  -42.6871  0.0000  -2.4349  -2.2211
job_self-employed  -2.0074    0.0779  -25.7739  0.0000  -2.1600  -1.8547
job_unemployed     -1.6957    0.0765  -22.1538  0.0000  -1.8457  -1.5457
job_housemaid      -2.3395    0.1003  -23.3270  0.0000  -2.5361  -2.1429
job_student        -0.9111    0.0722  -12.6195  0.0000  -1.0526  -0.7696
```
Which of the two is better? Or should I use a completely different method?
Solution
I fit this to a sklearn logistic regression model and got the
coefficients. Unable to interpret them, I looked for an alternative
and came across the statsmodels version.
```
print(df)
          0         1         2         3         4         5         6  \
0 -0.040404 -0.289274 -0.604957 -0.748797 -0.206201  0.573717 -0.177778

          7         8         9        10        11     inter
0 -0.530802 -0.210549  0.099326 -0.539109  0.879504 -1.795323
```
The interpretation works like this:

Exponentiating a log-odds coefficient gives you the odds ratio for a one-unit increase in that variable. So for example, with Target_yes (1 = the customer accepted the term deposit and 0 = the customer did not accept) and a logistic regression coefficient of 0.573717, you can assert that the odds of your outcome for "accept" are exp(0.573717) = 1.7748519304802 times the odds of your outcome for "no accept".
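That arithmetic can be checked directly; a small sketch using the coefficients quoted above:

```python
import math

# job_retired coefficient from the sklearn fit quoted above.
coef = 0.573717

# Odds ratio for a one-unit increase in the dummy (0 -> 1).
odds_ratio = math.exp(coef)
print(odds_ratio)  # ~1.7748519304802

# The same transform applied to the largest coefficient, job_student:
print(math.exp(0.879504))
```

Applying `exp` to every coefficient in the series turns the whole table of log-odds into odds ratios at once.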