原文地址:https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/
翻译出处:http://blog.csdn.net/han_xiaoyang/article/details/52663170
看的是大神寒小阳([email protected])翻译的一篇关于GBM算法的blog,原文链接和译文链接已给出,目前详细学习了数据分析的部分,原文中一笔带过,自己找到源码进行学习,调通并写下注释,分享自己的心得。
数据集来自Data Hackathon 3.x AV hackathon。比赛的细节可以在比赛网站上找到(http://datahack.analyticsvidhya.com/contest/data-hackathon-3x),数据可以从这里下载:http://www.analyticsvidhya.com/wp-content/uploads/2016/02/Dataset.rar
数据分析(代码+注释):
下图是进行的操作汇总
代码实现
# coding: utf-8
# In[2]:
import pandas as pd
import numpy as np
get_ipython().run_line_magic('matplotlib', 'inline')
# In[6]:
#Load data:
train = pd.read_csv('Train_nyOWmfK.csv')
test = pd.read_csv('Test_bCtAN1w.csv')
# In[7]:
train.shape, test.shape
# In[8]:
train.dtypes#查看每个属性的类型
# In[15]:
#Combine into data:
train['source']= 'train'
test['source'] = 'test'
data=pd.concat([train, test],ignore_index=True)#将train.csv与test.csv合并,且各自原来的索引忽略,合并后的数据在新表中的用统一排列新的索引
print(data.shape)
print(train.dtypes)
# ## Check missing:
# In[6]:
data.apply(lambda x: sum(x.isnull()))
'''
lambda只是一个表达式,函数体比def简单很多。
lambda的主体是一个表达式,而不是一个代码块。仅仅能在lambda表达式中封装有限的逻辑进去。
lambda表达式是起到一个函数速写的作用。允许在代码内嵌入一个函数的定义。
此处作用是看data数据集中每个属性的数据为null的个数
'''
# ## Look at categories of all object variables:
# In[21]:
var = ['Gender','Salary_Account','Mobile_Verified','Var1','Filled_Form','Device_Type','Var2','Source']
for v in var:
print('\n%s这一列数据的不同取值和出现的次数\n'%v)
print(data[v].value_counts())
# ## Handle Individual Variables:
# ### City Variable:
# In[17]:
'''
舍弃"City"属性,因为这一属性的取值种类太过复杂
axis=0表示的是要对横坐标操作,axis=1是要对纵坐标操作
inplace=False表示要对结果显示,而True表示对结果不显示
'''
len(data['City'].unique())
data.drop('City',axis=1,inplace=True)
# ### Determine Age from DOB
# In[18]:
data['DOB'].head()
# In[44]:
'''
DOB是出生的具体日期,咱们要具体日期作用没那么大,年龄段可能对我们有用,所以算一下年龄好了
创建一个年龄的字段Age
'''
#print(data['DOB'])
data['Age'] = data['DOB'].apply(lambda x: 115 - int(x[-3:]))
data['Age'].head()
# In[41]:
#删除原先的字段
data.drop('DOB',axis=1,inplace=True)
# ### EMI_Load_Submitted
# In[55]:
data.boxplot(column=['EMI_Loan_Submitted'],return_type='axes')#画出箱线图
# In[46]:
#创建了EMI_Loan_Submitted_Missing这个变量,当EMI_Loan_Submitted 变量值缺失时它的值为1,否则为0。然后舍弃了EMI_Loan_Submitted。
data['EMI_Loan_Submitted_Missing'] = data['EMI_Loan_Submitted'].apply(lambda x: 1 if pd.isnull(x) else 0)
data[['EMI_Loan_Submitted','EMI_Loan_Submitted_Missing']].head(10)
# In[56]:
#drop original vaiables:
data.drop('EMI_Loan_Submitted',axis=1,inplace=True)
# ### Employer Name
# In[57]:
len(data['Employer_Name'].value_counts())
# In[59]:
#EmployerName的值也太多了,我把它也舍弃了
data.drop('Employer_Name',axis=1,inplace=True)
# ### Existing EMI
# In[60]:
#Existing_EMI的缺失值被填补为0(中位数),因为只有111个缺失值
data.boxplot(column='Existing_EMI',return_type='axes')
# In[61]:
data['Existing_EMI'].describe()
# In[19]:
#Impute by median (0) because just 111 missing:
data['Existing_EMI'].fillna(0, inplace=True)
# ### Interest Rate:
# In[63]:
#Majority values missing so I'll create a new variable stating whether this is missing or note:
data['Interest_Rate_Missing'] = data['Interest_Rate'].apply(lambda x: 1 if pd.isnull(x) else 0)
print data[['Interest_Rate','Interest_Rate_Missing']].head(10)
# In[62]:
data.drop('Interest_Rate',axis=1,inplace=True)
# ### Lead Creation Date:
# In[64]:
#Drop this variable because doesn't appear to affect much intuitively
data.drop('Lead_Creation_Date',axis=1,inplace=True)
# ### Loan Amount and Tenure applied:
# In[65]:
#Impute with median because only 111 missing:
data['Loan_Amount_Applied'].fillna(data['Loan_Amount_Applied'].median(),inplace=True)
data['Loan_Tenure_Applied'].fillna(data['Loan_Tenure_Applied'].median(),inplace=True)
# ### Loan Amount and Tenure selected
# In[68]:
#High proportion missing so create a new var whether present or not
data['Loan_Amount_Submitted_Missing'] = data['Loan_Amount_Submitted'].apply(lambda x: 1 if pd.isnull(x) else 0)
data['Loan_Tenure_Submitted_Missing'] = data['Loan_Tenure_Submitted'].apply(lambda x: 1 if pd.isnull(x) else 0)
# In[69]:
#创建了Loan_Amount_Submitted_Missing变量,当Loan_Amount_Submitted有缺失值时为1,反之为0,原本的Loan_Amount_Submitted变量被舍弃
#创建了Loan_Tenure_Submitted_Missing变量,当Loan_Tenure_Submitted有缺失值时为1,反之为0,原本的Loan_Tenure_Submitted变量被舍弃
data.drop(['Loan_Amount_Submitted','Loan_Tenure_Submitted'],axis=1,inplace=True)
# ### Remove logged-in
# In[26]:
#舍弃了LoggedIn,和Salary_Account
data.drop('LoggedIn',axis=1,inplace=True)
# ### Remove salary account
# In[27]:
#Salary account has mnay banks which have to be manually grouped
data.drop('Salary_Account',axis=1,inplace=True)
# ### Processing_Fee
# In[28]:
#High proportion missing so create a new var whether present or not
data['Processing_Fee_Missing'] = data['Processing_Fee'].apply(lambda x: 1 if pd.isnull(x) else 0)
#drop old
data.drop('Processing_Fee',axis=1,inplace=True)
# ### Source
# In[78]:
#Source-top保留了2个,其他组合成了不同的类别
data['Source'] = data['Source'].apply(lambda x: 'others' if x not in ['S122','S133'] else x)
data['Source'].value_counts()
print(data['Source'])
# ## Final Data:
# In[30]:
data.apply(lambda x: sum(x.isnull()))
# In[31]:
data.dtypes
# ### Numerical Coding:
# In[80]:
#给不同的数字编码,起到区分作用的
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
var_to_encode = ['Device_Type','Filled_Form','Gender','Var1','Var2','Mobile_Verified','Source']
for col in var_to_encode:
data[col] = le.fit_transform(data[col])
# ### One-Hot Coding
# In[81]:
#get_dummies 是利用pandas实现one hot encode的方式。
data = pd.get_dummies(data, columns=var_to_encode)
print(data)
# ### Separate train & test:
# In[77]:
print(data['source'])
train = data.loc[data['source']=='train']
test = data.loc[data['source']=='test']
#print(train.source)
#print(test.source)
# In[35]:
train.drop('source',axis=1,inplace=True)
test.drop(['source','Disbursed'],axis=1,inplace=True)
# In[36]:
train.to_csv('train_modified.csv',index=False)
test.to_csv('test_modified.csv',index=False)
目前只学习了数据分析部分,模型建立及调参会尽快学习。