1. Background
2. Data description
train_users_2.csv - the training set of users
test_users.csv - the test set of users
- id: user id
- date_account_created: the date of account creation
- timestamp_first_active: timestamp of the first activity; note that it can be earlier than date_account_created or date_first_booking because a user can search before signing up
- date_first_booking: date of first booking
- gender: the user's gender
- age: the user's age
- signup_method: how the user signed up
- signup_flow: the page a user came to sign up from
- language: international language preference
- affiliate_channel: what kind of paid marketing
- affiliate_provider: where the marketing is, e.g. google, craigslist, other
- first_affiliate_tracked: the first marketing the user interacted with before signing up
- signup_app: the app used to sign up
- first_device_type: type of the first device used
- first_browser: the first browser used
- country_destination: the destination country of the booking; this is the target variable you are to predict
sessions.csv - web sessions log for users
- user_id: to be joined with the column 'id' in users table
- action: the user's action
- action_type: the type of the action
- action_detail: the details of the action
- device_type: the device type
- secs_elapsed: time spent, in seconds
sample_submission.csv - the correct format for submitting predictions
3. Data exploration
Based on Jupyter Notebook and Python 3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn as sk
%matplotlib inline
import datetime
import os
import seaborn as sns # data visualization
from datetime import date
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelBinarizer
import pickle # used to store models
from sklearn.metrics import *
from sklearn.model_selection import *
# training data
train = pd.read_csv("train_users_2.csv")
# test data
test = pd.read_csv("test_users.csv")
# column names of the training data
print('the columns name of training dataset:\n',train.columns)
# column names of the test data
print('the columns name of test dataset:\n',test.columns)
the columns name of training dataset:
Index(['id', 'date_account_created', 'timestamp_first_active',
'date_first_booking', 'gender', 'age', 'signup_method', 'signup_flow',
'language', 'affiliate_channel', 'affiliate_provider',
'first_affiliate_tracked', 'signup_app', 'first_device_type',
'first_browser', 'country_destination'],
the columns name of test dataset:
Index(['id', 'date_account_created', 'timestamp_first_active',
'date_first_booking', 'gender', 'age', 'signup_method', 'signup_flow',
'language', 'affiliate_channel', 'affiliate_provider',
'first_affiliate_tracked', 'signup_app', 'first_device_type',
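- The train file has one more feature than the test file: country_destination
- country_destination is the target variable to predict
- Data exploration focuses on the train file; the test file is similar
Inspect the data with info()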
RangeIndex: 213451 entries, 0 to 213450
Data columns (total 16 columns):
id 213451 non-null object
date_account_created 213451 non-null object
timestamp_first_active 213451 non-null int64
date_first_booking 88908 non-null object
gender 213451 non-null object
age 125461 non-null float64
signup_method 213451 non-null object
signup_flow 213451 non-null int64
language 213451 non-null object
affiliate_channel 213451 non-null object
affiliate_provider 213451 non-null object
first_affiliate_tracked 207386 non-null object
signup_app 213451 non-null object
first_device_type 213451 non-null object
first_browser 213451 non-null object
country_destination 213451 non-null object
dtypes: float64(1), int64(2), object(13)
memory usage: 26.1+ MB
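- The train file contains 213451 rows and 16 features
- The output lists each feature's data type and non-null count
- date_first_booking has many null values (only 88908 of 213451 rows are non-null), so it can be considered for removal during feature extraction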
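The null counts read off the info() output can also be computed directly. The following is a minimal sketch on a hypothetical toy frame (the `toy` data is invented for illustration and is not the Airbnb data); it ranks columns by their fraction of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_users_2.csv
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "age": [30.0, np.nan, 41.0, np.nan],
    "date_first_booking": [np.nan, np.nan, np.nan, "2014-05-22"],
})

# Fraction of missing values per column, largest first
missing_ratio = toy.isnull().mean().sort_values(ascending=False)
print(missing_ratio)
```

Applied to the real train frame, the same two calls would put date_first_booking and age at the top of the list.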
date_account_created: account creation date
0 2010-06-28
1 2011-05-25
2 2010-09-28
3 2011-12-05
4 2010-09-14
Name: date_account_created, dtype: object
2014-05-13 674
2014-06-24 670
2014-06-25 636
2014-05-20 632
2014-05-14 622
Name: date_account_created, dtype: int64
2010-01-01 1
2010-01-02 1
2010-06-18 1
2010-01-31 1
2010-02-14 1
Name: date_account_created, dtype: int64
count 213451
unique 1634
top 2014-05-13
freq 674
Name: date_account_created, dtype: object
dac_train = train.date_account_created.value_counts()
dac_test = test.date_account_created.value_counts()
# convert the dates to datetime type
dac_train_date = pd.to_datetime(train.date_account_created.value_counts().index)
dac_test_date = pd.to_datetime(test.date_account_created.value_counts().index)
# number of days since the first registration date
dac_train_day = dac_train_date - dac_train_date.min()
dac_test_day = dac_test_date - dac_train_date.min()
# plot with matplotlib
plt.scatter(dac_train_day.days, dac_train.values, color = 'r', label = 'train dataset')
plt.scatter(dac_test_day.days, dac_test.values, color = 'b', label = 'test dataset')
plt.title("Accounts created vs day")
plt.ylabel("Accounts created")
plt.legend(loc = 'upper left')
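- x-axis: number of days since the first registration date
- y-axis: number of users who registered on that day
- The number of registrations rises sharply over time
timestamp_first_active: time of first activity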
0 20090319043255
1 20090523174809
2 20090609231247
3 20091031060129
4 20091208061105
Name: timestamp_first_active, dtype: int64
Analysis: the result shows that timestamp_first_active contains no duplicate values (213451 unique values in 213451 rows)
tfa_train_dt = train.timestamp_first_active.astype(str).apply(lambda x:
count 213451
unique 213451
top 2013-07-01 05:26:34
freq 1
first 2009-03-19 04:32:55
last 2014-06-30 23:58:24
Name: timestamp_first_active, dtype: object
0 2009-03-19 04:32:55
1 2009-05-23 17:48:09
2 2009-06-09 23:12:47
3 2009-10-31 06:01:29
4 2009-12-08 06:11:05
Name: timestamp_first_active, dtype: datetime64[ns]
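The integer timestamps shown above follow the pattern YYYYMMDDHHMMSS. One way to parse them (a sketch, not necessarily how the original cell did it) is to cast to string and give pandas an explicit format; the two sample values are copied from the head() output above.

```python
import pandas as pd

# Sample values copied from the head() output of timestamp_first_active
raw = pd.Series([20090319043255, 20090523174809], name="timestamp_first_active")

# Vectorized parse of the YYYYMMDDHHMMSS integers
parsed = pd.to_datetime(raw.astype(str), format="%Y%m%d%H%M%S")
print(parsed)
```

An explicit format avoids per-row Python calls and fails loudly if a value does not match the expected layout.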
date_first_booking: date of first booking
count 88908
unique 1976
top 2014-05-22
freq 248
Name: date_first_booking, dtype: object
count 0.0
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
Name: date_first_booking, dtype: float64
- date_first_booking has many missing values in the train file
- date_first_booking is entirely missing in the test file
- The date_first_booking feature can therefore be dropped
30.0 6124
31.0 6016
29.0 5963
28.0 5939
32.0 5855
Name: age, dtype: int64
# first split age into 4 groups: missing values, too small age, reasonable age, too large age
age_train = [train[train.age.isnull()].age.shape[0],
             train.query('age < 15').age.shape[0],
             train.query("age >= 15 & age <= 90").age.shape[0],
             train.query('age > 90').age.shape[0]]
age_test = [test[test.age.isnull()].age.shape[0],
            test.query('age < 15').age.shape[0],
            test.query("age >= 15 & age <= 90").age.shape[0],
            test.query('age > 90').age.shape[0]]
columns = ['Null', 'age < 15', 'age', 'age > 90']
# plot (x and y are passed by keyword, which recent seaborn versions require)
fig, (ax1,ax2) = plt.subplots(1, 2, sharex=True, sharey = True, figsize=(10,5))
sns.barplot(x = columns, y = age_train, ax = ax1)
sns.barplot(x = columns, y = age_test, ax = ax2)
ax1.set_title('training dataset')
ax2.set_title('test dataset')
Text(0, 0.5, 'counts')
- The other features in the train file have few distinct labels, so we can one-hot encode them directly during feature engineering
def feature_barplot(feature, df_train = train, df_test = test, figsize=(10,5), rot = 90, saveimg = False):
    # count each value of the feature in both datasets
    feat_train = df_train[feature].value_counts()
    feat_test = df_test[feature].value_counts()
    fig_feature, (axis1,axis2) = plt.subplots(1, 2, sharex=True, sharey=True, figsize=figsize)
    # x and y are passed by keyword, which recent seaborn versions require
    sns.barplot(x = feat_train.index.values, y = feat_train.values, ax = axis1)
    sns.barplot(x = feat_test.index.values, y = feat_test.values, ax = axis2)
    axis1.set_xticklabels(axis1.xaxis.get_majorticklabels(), rotation = rot)
    axis2.set_xticklabels(axis1.xaxis.get_majorticklabels(), rotation = rot)
    axis1.set_title(feature + ' of training dataset')
    axis2.set_title(feature + ' of test dataset')
    if saveimg == True:
        figname = feature + ".png"
        fig_feature.savefig(figname, dpi = 75)
gender
feature_barplot('gender', saveimg = True)
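signup_method: signup method
signup_flow: signup page
language: language
affiliate_channel: paid marketing channel
first_affiliate_tracked: first marketing channel the user interacted with before signing up
signup_app: signup app
first_device_type: device type
first_browser: browser type
sessions file: web sessions log for users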
df_sessions = pd.read_csv('sessions.csv')
 | user_id | action | action_type | action_detail | device_type | secs_elapsed |
0 | d1mm9tcy42 | lookup | NaN | NaN | Windows Desktop | 319.0 |
1 | d1mm9tcy42 | search_results | click | view_search_results | Windows Desktop | 67753.0 |
2 | d1mm9tcy42 | lookup | NaN | NaN | Windows Desktop | 301.0 |
3 | d1mm9tcy42 | search_results | click | view_search_results | Windows Desktop | 22141.0 |
4 | d1mm9tcy42 | lookup | NaN | NaN | Windows Desktop | 435.0 |
5 | d1mm9tcy42 | search_results | click | view_search_results | Windows Desktop | 7703.0 |
6 | d1mm9tcy42 | lookup | NaN | NaN | Windows Desktop | 115.0 |
7 | d1mm9tcy42 | personalize | data | wishlist_content_update | Windows Desktop | 831.0 |
8 | d1mm9tcy42 | index | view | view_search_results | Windows Desktop | 20842.0 |
9 | d1mm9tcy42 | lookup | NaN | NaN | Windows Desktop | 683.0 |
# copy user_id into an id column so the table can be merged later
df_sessions['id'] = df_sessions['user_id']
df_sessions = df_sessions.drop(['user_id'],axis=1) # axis=1 drops the user_id column
(10567737, 6)
action 79626
action_type 1126204
action_detail 1126204
device_type 0
secs_elapsed 136031
id 34496
dtype: int64
Analysis: action, action_type, action_detail and secs_elapsed have many missing values
df_sessions.action = df_sessions.action.fillna('NAN')
df_sessions.action_type = df_sessions.action_type.fillna('NAN')
df_sessions.action_detail = df_sessions.action_detail.fillna('NAN')
action 0
action_type 0
action_detail 0
device_type 0
secs_elapsed 136031
id 34496
dtype: int64
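- After filling, action, action_type and action_detail have 0 missing values
- secs_elapsed will be filled in a later step
4. Feature extraction
4.1 Feature extraction from the sessions file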
0 lookup
1 search_results
2 lookup
3 search_results
4 lookup
Name: action, dtype: object
# Action values with low frequency are changed to 'OTHER'
act_freq = 100 # Threshold of frequency
act = dict(zip(*np.unique(df_sessions.action, return_counts=True)))
df_sessions.action = df_sessions.action.apply(lambda x: 'OTHER' if act[x] < act_freq else x)
# np.unique(df_sessions.action, return_counts=True) returns the distinct action values and their counts as arrays
# zip(*(a, b)) pairs the elements of a and b one to one and returns a zip object
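The low-frequency replacement above can also be written without a Python-level lambda. This is a minimal sketch of an alternative, not the code used in this post; the `actions` data and the `min_freq` threshold of 2 are hypothetical (the real threshold above is 100).

```python
import pandas as pd

# Hypothetical toy action column
actions = pd.Series(["lookup", "lookup", "lookup", "show", "show", "rare_click"])
min_freq = 2  # hypothetical threshold for this toy example

counts = actions.value_counts()
# map() looks up the frequency of each row's value;
# where() keeps values that meet the threshold and replaces the rest
cleaned = actions.where(actions.map(counts) >= min_freq, "OTHER")
print(cleaned.tolist())
```

On the roughly 10 million session rows this kind of vectorized lookup is usually much faster than apply with a lambda.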
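- First, group the session rows by user id
- action: for each user, the total number of actions, the count of each action value, the number of distinct values, and the mean and standard deviation of the per-value counts
- action_detail: for each user, the count of each action_detail value, the number of distinct values, and the mean and standard deviation of the per-value counts
- action_type: for each user, the count of each action_type value, the number of distinct values, the mean and standard deviation of the per-value counts, and the total time spent per action_type (log-transformed)
- device_type: for each user, the count of each device_type value, the number of distinct values, and the mean and standard deviation of the per-value counts
- secs_elapsed: fill missing values with 0, then for each user compute the sum, mean, standard deviation and median (each log-transformed), the log-sum divided by the number of actions, and how many log-transformed values fall into each of 15 integer bins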
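The per-user statistics listed above can be illustrated on a tiny example. This is a sketch under assumptions: the `sessions` frame below is invented, and it uses pandas groupby aggregation rather than the explicit loop used in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy sessions table
sessions = pd.DataFrame({
    "id": ["u1", "u1", "u1", "u2", "u2"],
    "action": ["lookup", "search_results", "lookup", "show", "show"],
    "secs_elapsed": [319.0, np.nan, 301.0, 100.0, 200.0],
})
sessions["secs_elapsed"] = sessions["secs_elapsed"].fillna(0)

# Count of each action value per user (one column per action)
action_counts = sessions.groupby(["id", "action"]).size().unstack(fill_value=0)

# Log-scaled sum, mean and median of secs_elapsed per user
secs_stats = np.log1p(sessions.groupby("id")["secs_elapsed"].agg(["sum", "mean", "median"]))

# Histogram of log-transformed values in 15 integer bins for one user
u1_secs = sessions.loc[sessions["id"] == "u1", "secs_elapsed"].values
log_bins = np.bincount(np.log1p(u1_secs).astype(int), minlength=15)
```

Each row of `action_counts` corresponds to the per-value count block of one user's feature vector, and `log_bins` corresponds to the 15 histogram features.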
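# for each categorical feature, count each value and argsort the counts so that every value maps to an index position
f_act = df_sessions.action.value_counts().argsort()
f_act_detail = df_sessions.action_detail.value_counts().argsort()
f_act_type = df_sessions.action_type.value_counts().argsort()
f_dev_type = df_sessions.device_type.value_counts().argsort()
# group by id (a scalar key keeps each group key a plain string in recent pandas versions)
dgr_sess = df_sessions.groupby('id')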
# loop over dgr_sess to build all features
samples = [] # list of samples
ln = len(dgr_sess) # number of groups (users) in df_sessions
# iterate over the data of each id in dgr_sess
for g in dgr_sess:
    gr = g[1] # data frame that contains all the data for one groupby value, e.g. 'zzywmcn0jv'
    l = [] # temporary list holding the features of this user
    # the id, for example 'zzywmcn0jv'
    l.append(g[0]) # put the id into the list
    # number of total actions
    l.append(len(gr)) # number of rows for this id
    # fill missing secs_elapsed values with 0, then take the elapsed times
    sev = gr.secs_elapsed.fillna(0).values # These values are used later.
    # action features
    # (how many times each value occurs, numb of unique values, mean and std)
    c_act = [0] * len(f_act)
    for i,v in enumerate(gr.action.values): # i is the row position, v is the action value
        c_act[f_act[v]] += 1
    _, c_act_uqc = np.unique(gr.action.values, return_counts=True)
    # number of distinct action values, mean and std of their counts
    c_act += [len(c_act_uqc), np.mean(c_act_uqc), np.std(c_act_uqc)]
    l = l + c_act
    # action_detail features
    # (how many times each value occurs, numb of unique values, mean and std)
    c_act_detail = [0] * len(f_act_detail)
    for i,v in enumerate(gr.action_detail.values):
        c_act_detail[f_act_detail[v]] += 1
    _, c_act_det_uqc = np.unique(gr.action_detail.values, return_counts=True)
    c_act_detail += [len(c_act_det_uqc), np.mean(c_act_det_uqc), np.std(c_act_det_uqc)]
    l = l + c_act_detail
    # action_type features (click etc.)
    # (how many times each value occurs, numb of unique values, mean and std
    # + log of the sum of secs_elapsed for each value)
    l_act_type = [0] * len(f_act_type)
    c_act_type = [0] * len(f_act_type)
    for i,v in enumerate(gr.action_type.values):
        l_act_type[f_act_type[v]] += sev[i] # total time spent per action type
        c_act_type[f_act_type[v]] += 1
    l_act_type = np.log(1 + np.array(l_act_type)).tolist() # the totals vary widely, so take the log
    _, c_act_type_uqc = np.unique(gr.action_type.values, return_counts=True)
    c_act_type += [len(c_act_type_uqc), np.mean(c_act_type_uqc), np.std(c_act_type_uqc)]
    l = l + c_act_type + l_act_type
    # device_type features
    # (how many times each value occurs, numb of unique values, mean and std)
    c_dev_type = [0] * len(f_dev_type)
    for i,v in enumerate(gr.device_type.values):
        c_dev_type[f_dev_type[v]] += 1
    _, c_dev_type_uqc = np.unique(gr.device_type.values, return_counts=True)
    c_dev_type += [len(c_dev_type_uqc), np.mean(c_dev_type_uqc), np.std(c_dev_type_uqc)]
    l = l + c_dev_type
    # secs_elapsed features
    l_secs = [0] * 5
    l_log = [0] * 15
    if len(sev) > 0:
        # Simple statistics about the secs_elapsed values.
        l_secs[0] = np.log(1 + np.sum(sev))
        l_secs[1] = np.log(1 + np.mean(sev))
        l_secs[2] = np.log(1 + np.std(sev))
        l_secs[3] = np.log(1 + np.median(sev))
        l_secs[4] = l_secs[0] / float(l[1]) # log-sum divided by the number of actions
        # Values are grouped in 15 intervals. Compute the number of values
        # in each interval.
        log_sev = np.log(1 + sev).astype(int)
        # np.bincount(): Count number of occurrences of each value in array of non-negative ints.
        l_log = np.bincount(log_sev, minlength=15).tolist()
    l = l + l_secs + l_log
    # The list l has the feature values of one sample.
    samples.append(l)
# preparing objects
samples = np.array(samples)
samp_ar = samples[:, 1:].astype(np.float16) # feature data without the id
samp_id = samples[:, 0] # the id is in the first column
# create a dataframe for the extracted features
col_names = [] # names of the columns
for i in range(len(samples[0])-1): # minus 1 because of the id
    col_names.append('c_' + str(i)) # naming scheme
df_agg_sess = pd.DataFrame(samp_ar, columns=col_names)
df_agg_sess['id'] = samp_id
df_agg_sess.index = df_agg_sess.id # use id as the index
c_0 | c_1 | c_2 | c_3 | c_4 | c_5 | c_6 | c_7 | c_8 | c_9 | ... | c_448 | c_449 | c_450 | c_451 | c_452 | c_453 | c_454 | c_455 | c_456 | id | |
id | |||||||||||||||||||||
00023iyk9l | 40.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 12.0 | 6.0 | 2.0 | 3.0 | 3.0 | 1.0 | 0.0 | 1.0 | 0.0 | 00023iyk9l |
0010k6l0om | 63.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 8.0 | 12.0 | 2.0 | 8.0 | 4.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0010k6l0om |
001wyh0pz8 | 90.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 27.0 | 30.0 | 9.0 | 8.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 001wyh0pz8 |
0028jgx1x1 | 31.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 2.0 | 3.0 | 5.0 | 4.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0028jgx1x1 |
002qnbzfs5 | 789.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 111.0 | 102.0 | 104.0 | 57.0 | 28.0 | 9.0 | 4.0 | 1.0 | 1.0 | 002qnbzfs5 |
5 rows × 458 columns
4.2 Feature extraction from the train and test files
- labels stores the target variable country_destination that we want to predict
train = pd.read_csv("train_users_2.csv")
test = pd.read_csv("test_users.csv")
train_row = train.shape[0]
# The label we need to predict
labels = train['country_destination'].values
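- During data exploration we found that date_first_booking has too many missing values in both the train and test files, so we drop it
- Drop country_destination from the features; the model predicts country_destination, and the predictions are compared with the values stored in labels to judge the quality of the model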
train.drop(['country_destination', 'date_first_booking'], axis = 1, inplace = True)
test.drop(['date_first_booking'], axis = 1, inplace = True)
- Concatenate train and test so that the same feature extraction can be applied to both
# concatenate test and train
df = pd.concat([train, test], axis = 0, ignore_index = True)
Convert timestamp_first_active to datetime type
tfa = df.timestamp_first_active.astype(str).apply(lambda x: datetime.datetime(int(x[:4]),
# create tfa_year, tfa_month, tfa_day feature
df['tfa_year'] = np.array([x.year for x in tfa])
df['tfa_month'] = np.array([x.month for x in tfa])
df['tfa_day'] = np.array([x.day for x in tfa])
# isoweekday() returns the day of the week as an integer, Monday = 1 through Sunday = 7
df['tfa_wd'] = np.array([x.isoweekday() for x in tfa])
df_tfa_wd = pd.get_dummies(df.tfa_wd, prefix = 'tfa_wd') # one hot encoding
df = pd.concat((df, df_tfa_wd), axis = 1) # add the encoded tfa_wd features
df.drop(['tfa_wd'], axis = 1, inplace = True) # drop the original unencoded feature
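To make the weekday encoding concrete, here is a small self-contained sketch. The first two dates are taken from the timestamp_first_active head() output earlier in this post; the third date is a hypothetical Sunday added to show the upper end of the range.

```python
import datetime
import pandas as pd

days = [
    datetime.date(2009, 3, 19),  # a Thursday
    datetime.date(2009, 5, 23),  # a Saturday
    datetime.date(2014, 6, 29),  # a Sunday (hypothetical extra example)
]

# isoweekday(): Monday = 1 ... Sunday = 7
wd = pd.Series([d.isoweekday() for d in days])

# One indicator column per weekday value that occurs in the data
wd_dummies = pd.get_dummies(wd, prefix="tfa_wd")
print(wd.tolist(), list(wd_dummies.columns))
```

Only weekday values that actually occur produce a column, which is one reason train and test are concatenated before encoding.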
- Seasons depend only on the month and day, so every date is mapped to the same year
Y = 2000
seasons = [(0, (date(Y, 1, 1), date(Y, 3, 20))), #'winter'
           (1, (date(Y, 3, 21), date(Y, 6, 20))), #'spring'
           (2, (date(Y, 6, 21), date(Y, 9, 22))), #'summer'
           (3, (date(Y, 9, 23), date(Y, 12, 20))), #'autumn'
           (0, (date(Y, 12, 21), date(Y, 12, 31)))] #'winter'
def get_season(dt):
    dt = dt.date() # get the date part
    dt = dt.replace(year=Y) # map every date to the year 2000
    return next(season for season, (start, end) in seasons if start <= dt <= end)
df['tfa_season'] = np.array([get_season(x) for x in tfa])
df_tfa_season = pd.get_dummies(df.tfa_season, prefix = 'tfa_season') # one hot encoding
df = pd.concat((df, df_tfa_season), axis = 1)
df.drop(['tfa_season'], axis = 1, inplace = True)
dac = pd.to_datetime(df.date_account_created)
# create year, month, day feature for dac
df['dac_year'] = np.array([x.year for x in dac])
df['dac_month'] = np.array([x.month for x in dac])
df['dac_day'] = np.array([x.day for x in dac])
# create features of weekday for dac
df['dac_wd'] = np.array([x.isoweekday() for x in dac])
df_dac_wd = pd.get_dummies(df.dac_wd, prefix = 'dac_wd')
df = pd.concat((df, df_dac_wd), axis = 1)
df.drop(['dac_wd'], axis = 1, inplace = True)
# create season features for dac
df['dac_season'] = np.array([get_season(x) for x in dac])
df_dac_season = pd.get_dummies(df.dac_season, prefix = 'dac_season')
df = pd.concat((df, df_dac_season), axis = 1)
df.drop(['dac_season'], axis = 1, inplace = True)
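- dt_span is the time between a user's first activity on the Airbnb platform and formal registration
dt_span = dac.subtract(tfa).dt.days
- The first ten rows of the dt_span value counts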
-1 275369
0 7
6 4
5 4
1 4
2 3
3 3
4 3
28 3
94 2
dtype: int64
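- Extract features from the difference: one day, one month, one year, and other
- That is, the time from first activity to registration is within one day, one month, one year, or something else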
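Why is -1 by far the most common value? date_account_created carries no time of day, so it parses as midnight, while timestamp_first_active includes hours and minutes. For a same-day signup the difference is a small negative timedelta, and its `.days` component is floored to -1. The sketch below illustrates this; the first row is a hypothetical same-day user, the second row uses the first training row shown earlier.

```python
import pandas as pd

# Row 0: hypothetical same-day signup; row 1: first row of the training data
dac = pd.Series(pd.to_datetime(["2014-05-13", "2010-06-28"]))
tfa = pd.Series(pd.to_datetime(["2014-05-13 05:26:34", "2009-03-19 04:32:55"]))

# .days is the floored day component of each timedelta
dt_span = (dac - tfa).dt.days
print(dt_span.tolist())
```

This is why the 'OneDay' bucket is keyed on -1 rather than 0.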
# create categorical feature: span == -1; -1 < span < 30; 30 <= span <= 365; everything else
def get_span(dt):
    # dt is an integer
    if dt == -1:
        return 'OneDay'
    elif (dt < 30) and (dt > -1):
        return 'OneMonth'
    elif (dt >= 30) and (dt <= 365):
        return 'OneYear'
    return 'other'
df['dt_span'] = np.array([get_span(x) for x in dt_span])
df_dt_span = pd.get_dummies(df.dt_span, prefix = 'dt_span')
df = pd.concat((df, df_dt_span), axis = 1)
df.drop(['dt_span'], axis = 1, inplace = True)
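- After extracting features from timestamp_first_active and date_account_created, drop the original columns from the feature list
df.drop(['date_account_created','timestamp_first_active'], axis = 1, inplace = True)
# Age: get the age values
av = df.age.values
- During data exploration we found that most ages fall in the (15, 90) range, but some values lie in the (1900, 2000) range; we assume those users entered their birth year instead of their age, so we preprocess them
# These are birth years instead of ages (estimating age by doing 2014 - value)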
av = np.where(np.logical_and(av<2000, av>1900), 2014-av, av)
df['age'] = av
E:\Anaconda3\envs\sklearn\lib\site-packages\ipykernel_launcher.py:3: RuntimeWarning: invalid value encountered in less
This is separate from the ipykernel package so we can avoid doing imports until
E:\Anaconda3\envs\sklearn\lib\site-packages\ipykernel_launcher.py:3: RuntimeWarning: invalid value encountered in greater
This is separate from the ipykernel package so we can avoid doing imports until
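The RuntimeWarning above comes from comparing NaN values with numbers. A minimal sketch of the same birth-year correction on a hypothetical array, with the comparison wrapped in `np.errstate` so that NaN comparisons stay silent on NumPy versions that warn:

```python
import numpy as np

# Hypothetical ages: a normal age, a birth year, a missing value, an out-of-range value
av = np.array([25.0, 1985.0, np.nan, 2014.0])

# NaN compares as False, so missing values are left untouched
with np.errstate(invalid="ignore"):
    is_birth_year = np.logical_and(av > 1900, av < 2000)

fixed = np.where(is_birth_year, 2014 - av, av)
print(fixed)
```

Values outside (1900, 2000), such as 2014, are left as they are and end up in the 'Unphysical' bucket later.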
# Age has many abnormal values that we need to deal with.
age = df.age
age.fillna(-1, inplace = True) # fill missing values with -1
div = 15
def get_age(age):
    # age is a float number; convert the continuous value to a discrete bucket
    if age < 0:
        return 'NA' # missing value
    elif (age < div):
        return div # younger than 15: return 15
    elif (age <= div * 2):
        return div*2 # from 15 up to and including 30: return 30
    elif (age <= div * 3):
        return div * 3
    elif (age <= div * 4):
        return div * 4
    elif (age <= div * 5):
        return div * 5
    elif (age <= 110):
        return div * 6
    return 'Unphysical' # implausible age
- Add the binned age to the feature list as a new feature
df['age'] = np.array([get_age(x) for x in age])
df_age = pd.get_dummies(df.age, prefix = 'age')
df = pd.concat((df, df_age), axis = 1)
df.drop(['age'], axis = 1, inplace = True)
- During data exploration we found that the remaining features have few distinct labels, so we do no further feature extraction and only one-hot encode them
feat_toOHE = ['gender',
for f in feat_toOHE:
    df_ohe = pd.get_dummies(df[f], prefix=f, dummy_na=True)
    df.drop([f], axis = 1, inplace = True)
    df = pd.concat((df, df_ohe), axis = 1)
df_all = pd.merge(df, df_agg_sess, how='left')
df_all = df_all.drop(['id'], axis=1) # drop the id
df_all = df_all.fillna(-2) # fill the session features of users without session data
df_all['all_null'] = np.array([sum(r<0) for r in df_all.values]) # number of negative (missing) values per row
E:\Anaconda3\envs\sklearn\lib\site-packages\IPython\core\interactiveshell.py:3267: FutureWarning: 'id' is both an index level and a column label.
Defaulting to column, but this will raise an ambiguity error in a future version
exec(code_obj, self.user_global_ns, self.user_ns)
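The merge, fill and count steps can be sketched on toy data. The frames below are hypothetical, and two details differ from the code above by choice: `on="id"` is passed explicitly with id as a plain column, which avoids the index/column ambiguity behind the FutureWarning, and the per-row count is vectorized.

```python
import pandas as pd

# Hypothetical user features and aggregated session features
users = pd.DataFrame({"id": ["u1", "u2", "u3"], "signup_flow": [0, 3, 0]})
sess = pd.DataFrame({"id": ["u1", "u3"], "c_0": [40.0, 63.0], "c_1": [0.0, 2.0]})

# Left join keeps every user; users without sessions get NaN session features
merged = pd.merge(users, sess, on="id", how="left")
merged = merged.drop(columns=["id"]).fillna(-2)

# Number of negative (i.e. filled-in) values per row
merged["all_null"] = (merged < 0).sum(axis=1)
print(merged)
```

Note that any feature whose genuine value is negative would also be counted, so this relies on -1 and -2 being used only as fill values.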
5. Model building
5.1 Data preparation
- train_row is the number of rows in the train data, recorded earlier
Xtrain = df_all.iloc[:train_row, :]
Xtest = df_all.iloc[train_row:, :]
#labels.tofile():Write array to a file as text or binary (default)
labels.tofile("Airbnb_ytrain_v2.csv", sep='\n', format='%s') # store the target variable
xtrain = pd.read_csv("Airbnb_xtrain_v2.csv",index_col=0)
ytrain = pd.read_csv("Airbnb_ytrain_v2.csv", header=None)
tfa_year | tfa_month | tfa_day | tfa_wd_1 | tfa_wd_2 | tfa_wd_3 | tfa_wd_4 | tfa_wd_5 | tfa_wd_6 | tfa_wd_7 | ... | c_448 | c_449 | c_450 | c_451 | c_452 | c_453 | c_454 | c_455 | c_456 | all_null | |
0 | 2009 | 3 | 19 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | 457 |
1 | 2009 | 5 | 23 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | 457 |
2 | 2009 | 6 | 9 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | 457 |
3 | 2009 | 10 | 31 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | 457 |
4 | 2009 | 12 | 8 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | 457 |
5 rows × 661 columns
0 | |
0 | NDF |
1 | NDF |
2 | US |
3 | other |
4 | US |
Encode the target variable with label encoding
le = LabelEncoder()
ytrain_le = le.fit_transform(ytrain.values)
E:\Anaconda3\envs\sklearn\lib\site-packages\sklearn\preprocessing\label.py:235: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
array([ 7, 7, 10, ..., 7, 7, 7])
- Before label encoding:
['AU', 'CA', 'DE', 'ES', 'FR', 'GB', 'IT', 'NDF', 'NL', 'PT', 'US', 'other']
- After label encoding:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
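A small sketch of how `LabelEncoder` assigns codes and how to map predictions back to country names. The five labels are the first rows of ytrain shown above; because this toy example contains only three classes, the integer codes differ from the full 12-class mapping.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# First five target values from ytrain
y = np.array(["NDF", "NDF", "US", "other", "US"])

le = LabelEncoder()
y_le = le.fit_transform(y)  # 1-D input, so no DataConversionWarning

# classes_ holds the sorted labels; a label's position is its integer code
print(list(le.classes_), y_le.tolist())
# inverse_transform maps integer codes back to the original labels
print(le.inverse_transform([2]))
```

Passing a 1-D array (for example via `ravel()`) is what the DataConversionWarning above asks for.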
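- Reduce the time spent training the models
# Let us take 10% of the data for faster training.
n = int(xtrain.shape[0]*0.1)
xtrain_new = xtrain.iloc[:n, :] # training data
ytrain_new = ytrain_le[:n] # target variable of the training data
- Standardization of datasets is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with zero mean and unit variance).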
X_scaler = StandardScaler()
xtrain_new = X_scaler.fit_transform(xtrain_new)
E:\Anaconda3\envs\sklearn\lib\site-packages\sklearn\preprocessing\data.py:617: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
return self.partial_fit(X, y)
E:\Anaconda3\envs\sklearn\lib\site-packages\sklearn\base.py:462: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
return self.fit(X, **fit_params).transform(X)
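A minimal sketch of what `StandardScaler` does, on hypothetical data. It also shows the usual pattern of fitting on one set and reusing the fitted scaler on new rows, which is an assumption about how held-out data would be handled and is not shown in this post.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrices: rows to fit on, and one new row
X_fit = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_new = np.array([[2.0, 40.0]])

scaler = StandardScaler()
X_fit_std = scaler.fit_transform(X_fit)  # learn mean/std per column, then scale
X_new_std = scaler.transform(X_new)      # reuse the learned mean/std

print(X_fit_std.mean(axis=0), X_fit_std.std(axis=0))
```

Each column of the fitted data ends up with mean 0 and unit variance; new rows are scaled with the same parameters.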
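5.2 Scoring metric: NDCG
- NDCG (Normalized Discounted Cumulative Gain) is a metric for ranking quality that takes the relevance of all ranked items into account
- The target variable is not binary, so we use NDCG to score the models and judge their quality
- For binary targets we would usually score a model with F1 score, precision, recall, or AUC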
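When each user has exactly one true destination, NDCG takes a simple form: the ideal DCG is 1, so the score is 1 / log2(rank + 1) if the true label appears at 1-based position `rank` among the top k predictions, and 0 otherwise. The helper below is a hypothetical illustration of that special case, not the scoring function used in this post.

```python
import numpy as np

def ndcg_single(true_label, ranked_labels, k=5):
    """NDCG@k for one sample with exactly one relevant label (illustrative helper)."""
    for position, label in enumerate(ranked_labels[:k]):
        if label == true_label:
            # position is 0-based, so the discount is log2(position + 2)
            return 1.0 / np.log2(position + 2)
    return 0.0

first = ndcg_single("US", ["US", "FR", "NDF"])    # true label ranked first
second = ndcg_single("FR", ["US", "FR", "NDF"])   # true label ranked second
missing = ndcg_single("IT", ["US", "FR", "NDF"])  # true label not in the list
print(first, second, missing)
```

A correct first guess scores 1.0, a correct second guess scores about 0.63, and a miss scores 0.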
from sklearn.metrics import make_scorer
def dcg_score(y_true, y_score, k=5):
    """
    y_true : array, shape = [n_samples] # data
        Ground truth (true relevance labels).
    y_score : array, shape = [n_samples, n_classes] # predicted scores
        Predicted scores.
    k : int
        Rank cutoff.
    """
    order = np.argsort(y_score)[::-1] # sort the scores from high to low
    y_true = np.take(y_true, order[:k]) # take the top k entries, positions [0, k)
    gain = 2 ** y_true - 1
    discounts = np.log2(np.arange(len(y_true)) + 2)
    return np.sum(gain / discounts)
def ndcg_score(ground_truth, predictions, k=5):
    """
    ground_truth : array, shape = [n_samples]
        Ground truth (true labels represented as integers).
    predictions : array, shape = [n_samples, n_classes]
        Predicted probabilities.
    k : int
        Rank cutoff.
    """
    lb = LabelBinarizer()
    lb.fit(range(len(predictions) + 1))
    T = lb.transform(ground_truth)
    scores = []
    # Iterate over each y_true and compute the DCG score
    for y_true, y_score in zip(T, predictions):
        actual = dcg_score(y_true, y_score, k)
        best = dcg_score(y_true, y_true, k)
        score = float(actual) / float(best)
        scores.append(score)
    return np.mean(scores)
6. Building the models
6.1 Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
lr = LogisticRegression(C = 1.0, penalty='l2', multi_class='ovr')
RANDOM_STATE = 2017 # random seed
# k-fold cross validation
# recent scikit-learn versions require shuffle=True when random_state is set
kf = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE) # split into 5 folds
train_score = []
cv_score = []
# select k: how many top predictions are scored
k_ndcg = 3
# kf.split: Generate indices to split data into training and test set.
for train_index, test_index in kf.split(xtrain_new, ytrain_new):
    X_train, X_test = xtrain_new[train_index, :], xtrain_new[test_index, :]
    y_train, y_test = ytrain_new[train_index], ytrain_new[test_index]
    lr.fit(X_train, y_train)
    y_pred = lr.predict_proba(X_test)
    train_ndcg_score = ndcg_score(y_train, lr.predict_proba(X_train), k = k_ndcg)
    cv_ndcg_score = ndcg_score(y_test, y_pred, k=k_ndcg)
    # collect the score of each fold
    train_score.append(train_ndcg_score)
    cv_score.append(cv_ndcg_score)
print ("\nThe training score is: {}".format(np.mean(train_score)))
print ("\nThe cv score is: {}".format(np.mean(cv_score)))
ymin = np.min(cv_score)-0.1
ymax = np.max(train_score)+0.1
plt.plot(np.array(perc)*100, train_score, 'ro-', label = 'training')
plt.plot(np.array(perc)*100, cv_score, 'bo-', label = 'Cross-validation')
plt.xlabel("Sample size (unit %)")
plt.xlim(-5, np.max(perc)*100+10)
plt.ylim(ymin, ymax)
plt.legend(loc = 'lower right', fontsize = 12)
plt.title("Score vs sample size learning curve")
6.2 Tree models
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import *
from sklearn.svm import SVC, LinearSVC, NuSVC
clf_tree ={
'DTree': DecisionTreeClassifier(max_depth=MAX_DEPTH,
'RF': RandomForestClassifier(n_estimators=N_ESTIMATORS,
'AdaBoost': AdaBoostClassifier(n_estimators=N_ESTIMATORS,
'Bagging': BaggingClassifier(n_estimators=N_ESTIMATORS,
'ExtraTree': ExtraTreesClassifier(max_depth=MAX_DEPTH,
'GraBoost': GradientBoostingClassifier(learning_rate=LEARNING_RATE,
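train_score = []
cv_score = []
# recent scikit-learn versions require shuffle=True when random_state is set
kf = KFold(n_splits=3, shuffle=True, random_state=RANDOM_STATE)
k_ndcg = 5
for key in clf_tree.keys():
    clf = clf_tree.get(key)
    train_score_iter = []
    cv_score_iter = []
    for train_index, test_index in kf.split(xtrain_new, ytrain_new):
        X_train, X_test = xtrain_new[train_index, :], xtrain_new[test_index, :]
        y_train, y_test = ytrain_new[train_index], ytrain_new[test_index]
        clf.fit(X_train, y_train)
        y_pred = clf.predict_proba(X_test)
        train_ndcg_score = ndcg_score(y_train, clf.predict_proba(X_train), k = k_ndcg)
        cv_ndcg_score = ndcg_score(y_test, y_pred, k=k_ndcg)
        # collect the score of each fold
        train_score_iter.append(train_ndcg_score)
        cv_score_iter.append(cv_ndcg_score)
    # average over the folds for this classifier
    train_score.append(np.mean(train_score_iter))
    cv_score.append(np.mean(cv_score_iter))
train_score_tree = train_score
cv_score_tree = cv_score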
ymin = np.min(cv_score)-0.05
ymax = np.max(train_score)+0.05
x_ticks = clf_tree.keys()
plt.plot(range(len(x_ticks)), train_score_tree, 'ro-', label = 'training')
plt.plot(range(len(x_ticks)),cv_score_tree, 'bo-', label = 'Cross-validation')
plt.xticks(range(len(x_ticks)),x_ticks,rotation = 45, fontsize = 10)
plt.xlabel("Tree method", fontsize = 12)
plt.ylabel("Score", fontsize = 12)
plt.xlim(-0.5, 5.5)
plt.ylim(ymin, ymax)
plt.legend(loc = 'best', fontsize = 12)
plt.title("Different tree methods")
6.4 xgboost
import xgboost as xgb
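def customized_eval(preds, dtrain):
    labels = dtrain.get_label()
    top = []
    for i in range(preds.shape[0]):
        # indices of the 5 classes with the highest predicted probability
        top.append(np.argsort(preds[i])[::-1][:5])
    # 1 where the true label matches the prediction at that rank, 0 elsewhere
    mat = np.reshape(np.repeat(labels,np.shape(top)[1]) == np.array(top).ravel(),np.array(top).shape).astype(int)
    score = np.mean(np.sum(mat/np.log2(np.arange(2, mat.shape[1] + 2)),axis = 1))
    return 'ndcg5', score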
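The same top-k idea can be expressed without the Python loop. The sketch below is an illustrative NumPy-only version with hypothetical names (`top_k_classes`, `ndcg_at_k`) and toy probabilities; it is not tied to the XGBoost callback interface.

```python
import numpy as np

def top_k_classes(preds, k=5):
    """Indices of the k highest-probability classes per row, best first."""
    return np.argsort(preds, axis=1)[:, ::-1][:, :k]

def ndcg_at_k(preds, labels, k=5):
    """Mean NDCG@k when each row has exactly one true class."""
    top = top_k_classes(preds, k)
    hits = (top == np.asarray(labels)[:, None]).astype(float)
    discounts = np.log2(np.arange(2, top.shape[1] + 2))
    return float(np.mean(np.sum(hits / discounts, axis=1)))

# Toy probabilities for 2 samples and 3 classes; the true class is 1 for both
toy_preds = np.array([[0.1, 0.7, 0.2],
                      [0.5, 0.3, 0.2]])
toy_score = ndcg_at_k(toy_preds, [1, 1], k=2)
print(toy_score)
```

The first sample ranks its true class first and the second ranks it second, so the mean is (1 + 1/log2(3)) / 2.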
# xgboost parameters
NUM_XGB = 200
params = {}
params['colsample_bytree'] = 0.6
params['max_depth'] = 6
params['subsample'] = 0.8
params['eta'] = 0.3
params['seed'] = RANDOM_STATE
params['num_class'] = 12
params['objective'] = 'multi:softprob' # output the probability instead of class.
train_score_iter = []
cv_score_iter = []
# recent scikit-learn versions require shuffle=True when random_state is set
kf = KFold(n_splits = 3, shuffle=True, random_state=RANDOM_STATE)
k_ndcg = 5
for train_index, test_index in kf.split(xtrain_new, ytrain_new):
    X_train, X_test = xtrain_new[train_index, :], xtrain_new[test_index, :]
    y_train, y_test = ytrain_new[train_index], ytrain_new[test_index]
    train_xgb = xgb.DMatrix(X_train, label= y_train)
    test_xgb = xgb.DMatrix(X_test, label = y_test)
    watchlist = [ (train_xgb,'train'), (test_xgb, 'test') ]
    bst = xgb.train(params,
                    train_xgb,
                    num_boost_round = NUM_XGB,
                    evals = watchlist,
                    feval = customized_eval,
                    maximize = True, # assumption: a higher NDCG is better, so early stopping should maximize it
                    verbose_eval = 3,
                    early_stopping_rounds = 5)
    #bst = xgb.train( params, dtrain, num_round, evallist )
    y_pred = np.array(bst.predict(test_xgb))
    y_pred_train = np.array(bst.predict(train_xgb))
    train_ndcg_score = ndcg_score(y_train, y_pred_train , k = k_ndcg)
    cv_ndcg_score = ndcg_score(y_test, y_pred, k=k_ndcg)
    # collect the score of each fold
    train_score_iter.append(train_ndcg_score)
    cv_score_iter.append(cv_ndcg_score)
train_score_xgb = np.mean(train_score_iter)
cv_score_xgb = np.mean(cv_score_iter)
print ("\nThe training score is: {}".format(train_score_xgb))
print ("The cv score is: {}\n".format(cv_score_xgb))
[10:16:51] d:\build\xgboost\xgboost-0.80.git\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 4 extra nodes, 0 pruned nodes, max_depth=2
[10:16:52] d:\build\xgboost\xgboost-0.80.git\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 44 extra nodes, 0 pruned nodes, max_depth=6
[10:16:52] d:\build\xgboost\xgboost-0.80.git\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 8 extra nodes, 0 pruned nodes, max_depth=3
[10:16:52] d:\build\xgboost\xgboost-0.80.git\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 76 extra nodes, 0 pruned nodes, max_depth=6
[10:16:52] d:\build\xgboost\xgboost-0.80.git\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 62 extra nodes, 0 pruned nodes, max_depth=6
[10:16:52] d:\build\xgboost\xgboost-0.80.git\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 46 extra nodes, 0 pruned nodes, max_depth=6
[10:16:52] d:\build\xgboost\xgboost-0.80.git\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 76 extra nodes, 0 pruned nodes, max_depth=6
[10:16:52] d:\build\xgboost\xgboost-0.80.git\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 80 extra nodes, 0 pruned nodes, max_depth=6
[10:16:52] d:\build\xgboost\xgboost-0.80.git\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 8 extra nodes, 0 pruned nodes, max_depth=4
[10:16:52] d:\build\xgboost\xgboost-0.80.git\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 0 extra nodes, 0 pruned nodes, max_depth=0
[10:16:52] d:\build\xgboost\xgboost-0.80.git\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 106 extra nodes, 0 pruned nodes, max_depth=6
[10:16:52] d:\build\xgboost\xgboost-0.80.git\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 98 extra nodes, 0 pruned nodes, max_depth=6
[0] train-merror:0.432818 test-merror:0.509487 train-ndcg5:0.793868 test-ndcg5:0.746247
Multiple eval metrics have been passed: 'test-ndcg5' will be used for early stopping.
Will train until test-ndcg5 hasn't improved in 5 rounds.
[3] train-merror:0.414266 test-merror:0.492762 train-ndcg5:0.805691 test-ndcg5:0.753109
Stopping. Best iteration:
[0] train-merror:0.432818 test-merror:0.509487 train-ndcg5:0.793868 test-ndcg5:0.746247
[0] train-merror:0.453619 test-merror:0.47688 train-ndcg5:0.780043 test-ndcg5:0.771609
Multiple eval metrics have been passed: 'test-ndcg5' will be used for early stopping.
Will train until test-ndcg5 hasn't improved in 5 rounds.
[3] train-merror:0.433661 test-merror:0.451441 train-ndcg5:0.793304 test-ndcg5:0.783746
Stopping. Best iteration:
[0] train-merror:0.453619 test-merror:0.47688 train-ndcg5:0.780043 test-ndcg5:0.771609
[0] train-merror:0.450949 test-merror:0.478426 train-ndcg5:0.782735 test-ndcg5:0.756588
Multiple eval metrics have been passed: 'test-ndcg5' will be used for early stopping.
Will train until test-ndcg5 hasn't improved in 5 rounds.
[3] train-merror:0.425088 test-merror:0.459873 train-ndcg5:0.798643 test-ndcg5:0.771855
Stopping. Best iteration:
[0] train-merror:0.450949 test-merror:0.478426 train-ndcg5:0.782735 test-ndcg5:0.756588
The training score is: 0.8033695668676714
The cv score is: 0.7713294556308351
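In each fold shown above, training stops with round 0 reported as the best iteration, even though test-ndcg5 rises between round 0 and round 3 (for example from 0.746247 to 0.753109). That pattern is what one would expect if early stopping were minimizing ndcg5 instead of maximizing it. The training call is not visible in this section, so this is an assumption; if it holds, passing `maximize=True` to `xgb.train` is the usual remedy. The pure-Python sketch below simulates patience-based early stopping to show the effect. `best_iteration` is a hypothetical helper, and the scores for rounds other than 0 and 3 are invented for illustration.

```python
def best_iteration(scores, patience, maximize):
    """Return (best_round, stop_round) for patience-based early stopping.

    Training stops once `patience` rounds have passed without improving
    on the best score seen so far.
    """
    best_round, best_score = 0, scores[0]
    for rnd, score in enumerate(scores[1:], start=1):
        improved = score > best_score if maximize else score < best_score
        if improved:
            best_round, best_score = rnd, score
        elif rnd - best_round >= patience:
            return best_round, rnd
    return best_round, len(scores) - 1


# Rounds 0 and 3 are test-ndcg5 values taken from the log above;
# every other value is made up for illustration.
ndcg5 = [0.746247, 0.749, 0.751, 0.753109, 0.754, 0.755, 0.756, 0.757]

# Minimizing a score that should be maximized: round 0 stays "best".
print(best_iteration(ndcg5, patience=5, maximize=False))
# Maximizing: every round improves, so training runs to the end.
print(best_iteration(ndcg5, patience=5, maximize=True))
```

With minimization the run ends at round 5 and keeps round 0, which matches the shape of the log; with maximization the later, better rounds are retained.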
7. Model comparison
# Combine the CV scores of the tree-based models with the XGBoost CV score
model_cvscore = np.hstack((cv_score_tree, cv_score_xgb))
model_name = np.array(['ExtraTree','DTree','RF','GraBoost','Bagging','AdaBoost','Xgboost'])
fig = plt.figure(figsize=(8,4))
# Pass x and y as keywords; recent seaborn releases no longer accept them positionally
sns.barplot(x=model_cvscore, y=model_name, palette="Blues_d")
plt.xticks(rotation=0, size = 10)
plt.xlabel("CV score", fontsize = 12)
plt.ylabel("Model", fontsize = 12)
plt.title("Cross-validation score for different models")
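- Understanding and exploring the data is essential.
- Feature engineering can extract additional features.
- There are many ways to evaluate a model; choose the one that fits the task.
- So far only 10% of the data has been used for training; training on the full dataset may give better results.
- A deeper understanding of the model algorithms and of hyperparameter tuning is still needed.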
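As one concrete instance of choosing an evaluation method, the ndcg5 metric reported in the training logs can be sketched in plain Python. The sketch assumes each user has exactly one true destination, so the ideal DCG is 1 and NDCG@5 reduces to 1/log2(rank + 1) when the true label is among the top five predictions, and 0 otherwise. `ndcg_at_5` and `mean_ndcg_at_5` are hypothetical helper names, not the functions used earlier in the notebook, and the predictions are toy data.

```python
import math


def ndcg_at_5(ranked_predictions, true_label):
    """NDCG@5 for one user with a single relevant label."""
    for position, label in enumerate(ranked_predictions[:5], start=1):
        if label == true_label:
            return 1.0 / math.log2(position + 1)
    return 0.0


def mean_ndcg_at_5(all_predictions, true_labels):
    """Average NDCG@5 over all users."""
    scores = [ndcg_at_5(p, t) for p, t in zip(all_predictions, true_labels)]
    return sum(scores) / len(scores)


# Toy, made-up predictions: most likely destination first.
preds = [["NDF", "US", "other", "FR", "IT"],
         ["US", "NDF", "FR", "IT", "GB"],
         ["NDF", "US", "other", "FR", "IT"]]
truth = ["NDF", "NDF", "ES"]

print(mean_ndcg_at_5(preds, truth))
```

A correct label in first place scores 1, a correct label in second place scores 1/log2(3), and a label missing from the top five scores 0, so the metric rewards ranking the true destination as high as possible.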