机器学习_Python中Gradient Boosting Machine(GBM)学习笔记1_数据分析

原文地址:https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/

翻译出处:http://blog.csdn.net/han_xiaoyang/article/details/52663170 

     看的是大神寒小阳([email protected])翻译的一篇关于GBM算法的blog,原文链接和译文链接已给出,目前详细学习了数据分析的部分,原文中一笔带过,自己找到源码进行学习,调通并写下注释,分享自己的心得。

 数据集来自Data Hackathon 3.x AV hackathon。比赛的细节可以在比赛网站上找到(http://datahack.analyticsvidhya.com/contest/data-hackathon-3x),数据可以从这里下载:http://www.analyticsvidhya.com/wp-content/uploads/2016/02/Dataset.rar

  数据分析(代码+注释):

  下图是进行的操作汇总

机器学习_Python中Gradient Boosting Machine(GBM)学习笔记1_数据分析_第1张图片

代码实现


# coding: utf-8

# In[2]:


import pandas as pd
import numpy as np
get_ipython().run_line_magic('matplotlib', 'inline')


# In[6]:


#Load data:
train = pd.read_csv('Train_nyOWmfK.csv')
test = pd.read_csv('Test_bCtAN1w.csv')


# In[7]:


train.shape, test.shape


# In[8]:


train.dtypes#查看每个属性的类型


# In[15]:


#Combine into data:
train['source']= 'train'
test['source'] = 'test'
data=pd.concat([train, test],ignore_index=True)#将train.csv与test.csv合并,且各自原来的索引忽略,合并后的数据在新表中的用统一排列新的索引
print(data.shape)
print(train.dtypes)


# ## Check missing:

# In[6]:


data.apply(lambda x: sum(x.isnull()))
'''
lambda只是一个表达式,函数体比def简单很多。
lambda的主体是一个表达式,而不是一个代码块。仅仅能在lambda表达式中封装有限的逻辑进去。
lambda表达式是起到一个函数速写的作用。允许在代码内嵌入一个函数的定义。
此处作用是看data数据集中每个属性的数据为null的个数
'''


# ## Look at categories of all object variables:

# In[21]:


var = ['Gender','Salary_Account','Mobile_Verified','Var1','Filled_Form','Device_Type','Var2','Source']
for v in var:
    print('\n%s这一列数据的不同取值和出现的次数\n'%v)
    print(data[v].value_counts())


# ## Handle Individual Variables:

# ### City Variable:

# In[17]:


'''
舍弃"City"属性,因为这一属性的取值种类太过复杂
axis=0表示的是要对横坐标操作,axis=1是要对纵坐标操作
inplace=False表示要对结果显示,而True表示对结果不显示
'''
len(data['City'].unique())
data.drop('City',axis=1,inplace=True)


# ### Determine Age from DOB

# In[18]:


data['DOB'].head()


# In[44]:


'''
DOB是出生的具体日期,咱们要具体日期作用没那么大,年龄段可能对我们有用,所以算一下年龄好了
创建一个年龄的字段Age
'''
#print(data['DOB'])
data['Age'] = data['DOB'].apply(lambda x: 115 - int(x[-3:]))
data['Age'].head()


# In[41]:


#删除原先的字段
data.drop('DOB',axis=1,inplace=True)


# ### EMI_Load_Submitted

# In[55]:


data.boxplot(column=['EMI_Loan_Submitted'],return_type='axes')#画出箱线图


# In[46]:


#创建了EMI_Loan_Submitted_Missing这个变量,当EMI_Loan_Submitted 变量值缺失时它的值为1,否则为0。然后舍弃了EMI_Loan_Submitted。
data['EMI_Loan_Submitted_Missing'] = data['EMI_Loan_Submitted'].apply(lambda x: 1 if pd.isnull(x) else 0)
data[['EMI_Loan_Submitted','EMI_Loan_Submitted_Missing']].head(10)


# In[56]:


#drop original vaiables:
data.drop('EMI_Loan_Submitted',axis=1,inplace=True)


# ### Employer Name

# In[57]:


len(data['Employer_Name'].value_counts())


# In[59]:


#EmployerName的值也太多了,我把它也舍弃了
data.drop('Employer_Name',axis=1,inplace=True)


# ### Existing EMI

# In[60]:


#Existing_EMI的缺失值被填补为0(中位数),因为只有111个缺失值

data.boxplot(column='Existing_EMI',return_type='axes')


# In[61]:


data['Existing_EMI'].describe()


# In[19]:


#Impute by median (0) because just 111 missing:
data['Existing_EMI'].fillna(0, inplace=True)


# ### Interest Rate:

# In[63]:


#Majority values missing so I'll create a new variable stating whether this is missing or note:
data['Interest_Rate_Missing'] = data['Interest_Rate'].apply(lambda x: 1 if pd.isnull(x) else 0)
print data[['Interest_Rate','Interest_Rate_Missing']].head(10)


# In[62]:


data.drop('Interest_Rate',axis=1,inplace=True)


# ### Lead Creation Date:

# In[64]:


#Drop this variable because doesn't appear to affect much intuitively
data.drop('Lead_Creation_Date',axis=1,inplace=True)


# ### Loan Amount and Tenure applied:

# In[65]:


#Impute with median because only 111 missing:
data['Loan_Amount_Applied'].fillna(data['Loan_Amount_Applied'].median(),inplace=True)
data['Loan_Tenure_Applied'].fillna(data['Loan_Tenure_Applied'].median(),inplace=True)


# ### Loan Amount and Tenure selected

# In[68]:


#High proportion missing so create a new var whether present or not
data['Loan_Amount_Submitted_Missing'] = data['Loan_Amount_Submitted'].apply(lambda x: 1 if pd.isnull(x) else 0)
data['Loan_Tenure_Submitted_Missing'] = data['Loan_Tenure_Submitted'].apply(lambda x: 1 if pd.isnull(x) else 0)


# In[69]:


#创建了Loan_Amount_Submitted_Missing变量,当Loan_Amount_Submitted有缺失值时为1,反之为0,原本的Loan_Amount_Submitted变量被舍弃
#创建了Loan_Tenure_Submitted_Missing变量,当Loan_Tenure_Submitted有缺失值时为1,反之为0,原本的Loan_Tenure_Submitted变量被舍弃
data.drop(['Loan_Amount_Submitted','Loan_Tenure_Submitted'],axis=1,inplace=True)


# ### Remove logged-in

# In[26]:


#舍弃了LoggedIn,和Salary_Account
data.drop('LoggedIn',axis=1,inplace=True)


# ### Remove salary account

# In[27]:


#Salary account has mnay banks which have to be manually grouped
data.drop('Salary_Account',axis=1,inplace=True)


# ### Processing_Fee

# In[28]:


#High proportion missing so create a new var whether present or not
data['Processing_Fee_Missing'] = data['Processing_Fee'].apply(lambda x: 1 if pd.isnull(x) else 0)
#drop old
data.drop('Processing_Fee',axis=1,inplace=True)


# ### Source

# In[78]:


#Source-top保留了2个,其他组合成了不同的类别

data['Source'] = data['Source'].apply(lambda x: 'others' if x not in ['S122','S133'] else x)
data['Source'].value_counts()
print(data['Source'])


# ## Final Data:

# In[30]:


data.apply(lambda x: sum(x.isnull()))


# In[31]:


data.dtypes


# ### Numerical Coding:

# In[80]:


#给不同的数字编码,起到区分作用的
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
var_to_encode = ['Device_Type','Filled_Form','Gender','Var1','Var2','Mobile_Verified','Source']
for col in var_to_encode:
    data[col] = le.fit_transform(data[col])


# ### One-Hot Coding

# In[81]:


#get_dummies 是利用pandas实现one hot encode的方式。
data = pd.get_dummies(data, columns=var_to_encode)
print(data)


# ### Separate train & test:

# In[77]:


print(data['source'])
train = data.loc[data['source']=='train']
test = data.loc[data['source']=='test']
#print(train.source)
#print(test.source)


# In[35]:


train.drop('source',axis=1,inplace=True)
test.drop(['source','Disbursed'],axis=1,inplace=True)


# In[36]:


train.to_csv('train_modified.csv',index=False)
test.to_csv('test_modified.csv',index=False)

     目前只学习了数据分析部分,模型建立及调参会尽快学习。

你可能感兴趣的:(机器学习)