Santander Product Recommendation was my first Kaggle competition. Although I only finished 103rd out of 1704 (top 6%), the competition got me familiar with the Kaggle platform and introduced me to some like-minded Kagglers. Over the two months it ran I tried a great many things and learned a lot; below is a short write-up of the competition.
The task: using the previous 17 months of product purchase records from Spain's Santander bank together with user attributes, predict the products each user is most likely to buy in June 2016 (submit 7 predicted products per user, ranked by likelihood; the evaluation metric is MAP@7).
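Since submissions are scored by mean average precision over the top 7 ranked products, it pays to have a local implementation of the metric. Here is a minimal sketch (apk and mapk are my own helper names, not from any official kit):

import numpy as np

def apk(actual, predicted, k=7):
    # average precision at k for one user: 'actual' is the set of products the
    # user really added, 'predicted' is the ranked list of predicted products
    if not actual:
        return 0.0
    score, hits = 0.0, 0
    for i, p in enumerate(predicted[:k]):
        if p in actual and p not in predicted[:i]:
            hits += 1
            score += hits / (i + 1.0)
    return score / min(len(actual), k)

def mapk(actual_lists, predicted_lists, k=7):
    # MAP@k: mean of apk over all users
    return np.mean([apk(a, p, k) for a, p in zip(actual_lists, predicted_lists)])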
Reducing memory usage: the dataset for this competition is large, and over the two months my computer froze more than 50 times. After entering I bought an extra 8 GB RAM stick to bring my machine up to 14 GB, and it still froze regularly (scripts that were too heavy, or too many applications open at once). If you read and process the data one row at a time, memory usage stays low, but group-by aggregations and other statistics become awkward to compute. So, limited as my data-wrangling skills were, I always loaded the whole table into memory before processing: it eats RAM, but preprocessing and feature extraction become much easier to implement. Reading the roughly 2 GB CSV with pandas' read_csv can end up occupying about 10 GB of memory, so you can cut usage by changing the dtypes at read time (e.g. reading a float64 column as int8) or by reading only the columns you need. Also, delete tables you no longer need as you go, to release memory.
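A rough sketch of those two tricks (the column names come from the Santander data; the file name and exact dtypes here are my assumptions):

import gc
import pandas as pd

# read only the columns we need, with compact dtypes (int8/int32/float32 where
# the values allow it) instead of the default int64/float64/object
df = pd.read_csv('../input/train_ver2.csv',
                 usecols=['ncodpers', 'fecha_dato', 'ind_cco_fin_ult1'],
                 dtype={'ncodpers': 'int32', 'ind_cco_fin_ult1': 'float32'})
print(df.memory_usage(deep=True).sum())   # check the footprint actually shrank
del df                                    # drop tables as soon as they are no longer needed
gc.collect()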
Handling missing values: I'll just mention one pitfall I ran into during the competition. A missing (empty) field in a CSV comes back differently depending on how you read the file: csv.DictReader() returns the raw field, e.g. a blank string ' ', while pd.read_csv() parses it as NaN. Beyond that, the main thing to watch when dealing with missing values is data type conversion.
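A quick self-contained demonstration of the difference (a toy CSV, not the competition file; Python 3 here for brevity):

import csv
import io
import pandas as pd

raw = 'a,b\n1,\n'                              # column b is missing in the only row
row = next(csv.DictReader(io.StringIO(raw)))
print(repr(row['b']))                          # '' -> DictReader keeps the raw string
print(pd.read_csv(io.StringIO(raw))['b'][0])   # nan -> read_csv parses it as NaN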
Model: in the first month after the competition opened, Kagglers tried all kinds of models and none worked particularly well. Then BreakfastPirate published a forum post describing his approach: since June's product-purchase distribution is very different from the other months', train an xgboost multiclass model on June of the training period to predict June of the test period, which also shrinks the training set dramatically. Scores jumped all over the leaderboard, and everyone started doing feature engineering on top of that idea. Maybe my skill just wasn't there: I tried every feature I could think of, yet my score barely improved. In the final 7 days I made no breakthrough at all and dropped more than 30 places. Every feature I added gave no lift, and some even hurt; I honestly started questioning my life choices. One more note: ensembling didn't help much on this problem.
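In code terms the idea is compact: keep only the June-2015 rows, turn each newly added product into a class label, and fit a single softprob xgboost over the 22 usable products, exactly as the full pipeline at the end of this post does. A compressed sketch with toy stand-ins for the processed data:

import numpy as np
import xgboost as xgb

# toy stand-ins for the processed June-2015 window (the real pipeline is below);
# one multiclass sample per (user, newly-added product)
june_X = np.random.rand(500, 10)
june_y = np.random.randint(0, 22, 500)
test_X = np.random.rand(100, 10)
param = {'objective': 'multi:softprob', 'num_class': 22, 'eta': 0.05, 'max_depth': 6}
model = xgb.train(param, xgb.DMatrix(june_X, label=june_y), num_boost_round=100)
# rank the 22 class probabilities per user and keep the 7 most likely products
top7 = np.fliplr(np.argsort(model.predict(xgb.DMatrix(test_X)), axis=1))[:, :7]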
Feature engineering: lag-5 features (the previous five months' product flags), purchase rates grouped by attributes such as renta, assorted sums, and sums/encodings over a few product combinations.
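The lag features are simply the product flags from each of the five preceding months merged onto the current month's rows; a minimal sketch of a single lag on a toy table (the real code below does this for five months and 24 products, then builds the grouped rates and sums in add_com_features):

import pandas as pd

# toy history: two users, one prior month, two product flags
hist = pd.DataFrame({'ncodpers': [1, 2], 'fecha_dato': ['2015-05-28'] * 2,
                     'ind_cco_fin_ult1': [1, 0], 'ind_recibo_ult1': [0, 1]})
cur = pd.DataFrame({'ncodpers': [1, 2, 3]})
prod_cols = ['ind_cco_fin_ult1', 'ind_recibo_ult1']
# attach last month's product flags with a '1_' prefix (one lag of the five)
lag1 = hist[hist['fecha_dato'] == '2015-05-28'][['ncodpers'] + prod_cols]
lag1 = lag1.set_index('ncodpers').add_prefix('1_').reset_index()
cur = cur.merge(lag1, on='ncodpers', how='left').fillna(0)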
Offline validation: I've taken part in two Kaggle competitions now, and both times my ranking slipped when the Public LeaderBoard switched to the Private LeaderBoard. Part of that is probably model robustness, but part of it comes from leaning too heavily on the online score. During a competition you should put more weight on your offline validation results rather than blindly chasing the online score, otherwise you risk overfitting the Public LeaderBoard.
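One simple way to lean on offline results here is to hold out part of the labelled June-2015 window and track a local metric before submitting; a sketch reusing runXGB and train_X/train_y from the code below (the 20% split and the crude top-1 metric are my own choices):

from sklearn.cross_validation import train_test_split  # same legacy sklearn module as below

tr_X, va_X, tr_y, va_y = train_test_split(train_X, train_y, test_size=0.2, random_state=123)
probs = runXGB(tr_X, tr_y, va_X)                       # class probabilities on the holdout
top7 = np.fliplr(np.argsort(probs, axis=1))[:, :7]     # ranked products, most likely first
print(np.mean(top7[:, 0] == va_y))                     # crude offline signal: top-1 hit rate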
Other solution write-ups: Breakfast Pirate's approach (see the competition forum).
The core of my competition code follows:
'''
author:TaoZI
date:2016/12/22
'''
import datetime
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.cross_validation import KFold  # legacy module (sklearn < 0.20); newer versions use sklearn.model_selection
pd.options.mode.chained_assignment = None
# label-encode the categorical columns; missing values ('nan') map to a default code
mapping_dict = {
'sexo' : {'nan':0,'H':0, 'V':1},
'ind_actividad_cliente' : {'nan':0, '0.0':0, '0':0,'1.0':1, '1':1},
'segmento' : {'nan':0, '01 - TOP':0, '03 - UNIVERSITARIO':1, '02 - PARTICULARES':2},
'ind_nuevo' : {'nan':0, '1.0':0, '1':0, '0.0':1, '0':1 },
'tiprel_1mes' : {'nan':0, 'P':0, 'R':0, 'N':0, 'I':1, 'A':2},
'indext' : {'nan':0, 'S':0, 'N':1}
}
target_raw_cols = ['ind_ahor_fin_ult1', 'ind_aval_fin_ult1', 'ind_cco_fin_ult1', 'ind_cder_fin_ult1',
'ind_cno_fin_ult1', 'ind_ctju_fin_ult1', 'ind_ctma_fin_ult1', 'ind_ctop_fin_ult1',
'ind_ctpp_fin_ult1', 'ind_deco_fin_ult1', 'ind_deme_fin_ult1', 'ind_dela_fin_ult1',
'ind_ecue_fin_ult1', 'ind_fond_fin_ult1', 'ind_hip_fin_ult1', 'ind_plan_fin_ult1',
'ind_pres_fin_ult1', 'ind_reca_fin_ult1', 'ind_tjcr_fin_ult1', 'ind_valo_fin_ult1',
'ind_viv_fin_ult1', 'ind_nomina_ult1', 'ind_nom_pens_ult1', 'ind_recibo_ult1']
target_cols = target_raw_cols[2:]  # drop the first two products, which are almost never added
con_cols = ['ncodpers', 'fecha_dato', 'age', 'antiguedad','renta']
cat_cols = list(mapping_dict.keys())  # a list, so it concatenates with the other column lists
user_cols = con_cols + cat_cols + target_raw_cols
NUM_CLASS = 22
def getAge(str_age):
    # bucket the raw age string into coarse bands; missing ages fall in the 30-40 band
age = str_age.strip()
if age == 'NA' or age == 'nan':
age1 = 2
elif float(age) < 20:
age1 = 0
elif float(age) < 30:
age1 = 1
elif float(age) < 40:
age1 = 2
elif float(age) < 50:
age1 = 3
elif float(age) < 60:
age1 = 4
else:
age1 = 5
return age1
def getCustSeniority(str_seniority):
    # bucket customer seniority (months as a customer) into bands
cust_seniority = str_seniority.strip()
if cust_seniority == 'NA' or cust_seniority == 'nan':
seniority = 4
elif float(cust_seniority) < 50:
seniority = 0
elif float(cust_seniority) < 75:
seniority = 1
elif float(cust_seniority) < 100:
seniority = 2
elif float(cust_seniority) < 125:
seniority = 3
elif float(cust_seniority) < 150:
seniority = 4
elif float(cust_seniority) < 175:
seniority = 5
elif float(cust_seniority) < 200:
seniority = 6
elif float(cust_seniority) < 225:
seniority = 7
else:
seniority = 8
return seniority
def getRent(str_rent):
    # bucket gross household income (renta) into bands
rent = str_rent.strip()
if rent == 'NA' or rent == 'nan':
rent1 = 4
elif float(rent) < 45542.97:
rent1 = 1
elif float(rent) < 57629.67:
rent1 = 2
elif float(rent) < 68211.78:
rent1 = 3
elif float(rent) < 78852.39:
rent1 = 4
elif float(rent) < 90461.97:
rent1 = 5
elif float(rent) < 103855.23:
rent1 = 6
elif float(rent) < 120063.00:
rent1 = 7
elif float(rent) < 141347.49:
rent1 = 8
elif float(rent) < 173418.12:
rent1 = 9
elif float(rent) < 234687.12:
rent1 = 10
else:
rent1 = 11
return rent1
def add_com_features(lag_feats):
    # lag_feats ends with 120 lag columns (5 months x 24 products); start with their total
    lag_feats['prod_sum'] = lag_feats.apply(lambda x: np.sum(x[-120:]), axis = 1)
for i, pre in enumerate(['1_', '2_', '3_', '4_', '5_']):
pre_cols = [pre + col for col in target_raw_cols]
lag_feats['sum_24_' + str(i + 1)] = lag_feats.loc[:, pre_cols].sum(axis = 1)
sum_24_list = ['sum_24_' + str(i + 1) for i in range(5)]
lag_feats['sum_24_max'] = lag_feats[sum_24_list].max(axis = 1)
lag_feats['sum_24_min'] = lag_feats[sum_24_list].min(axis = 1)
lag_feats['sum_24_mean'] = lag_feats[sum_24_list].mean(axis = 1)
for i, col in enumerate(target_raw_cols):
index_list = [pre + col for pre in ['1_', '2_', '3_', '4_', '5_']]
lag_feats['prod_sum_' + str(i)] = lag_feats.loc[:, index_list].sum(axis = 1)
pro_sum_list = ['prod_sum_' + str(i) for i in range(24)]
    # group-wise purchase rates: average product counts per renta / sexo bucket
    for gp_col in ['renta', 'sexo']:
group_feats = lag_feats[pro_sum_list].groupby(lag_feats[gp_col]).agg(lambda x: round(x.sum() / x.count(), 2))
group_feats.columns = [gp_col + str(i) for i in range(24)]
lag_feats = pd.merge(lag_feats, group_feats, left_on = gp_col, right_index = True, how = 'left')
    # hand-picked groups of related products (indices into target_cols), summed per lag month
    com_col = [[0, 2], [7, 8, 9], [9, 10, 11], [19, 20, 21]]
for x in range(4):
import_col = [target_cols[i] for i in com_col[x]]
for i in range(1, 6):
pre_import_col = [str(i) + '_' + col for col in import_col]
lag_feats[str(i) + '_' + str(x + 1) + '_s_sum_import'] = lag_feats[pre_import_col].sum(axis = 1)
return lag_feats
def process_train_data(in_file_name, date_list):
    # in_file_name is the full train DataFrame; date_list[0] is the label month,
    # date_list[1:] are the five lag months (most recent first)
this_month = in_file_name[in_file_name['fecha_dato'].isin([date_list[0]])]
for col in cat_cols:
this_month[col] = this_month[col].apply(lambda x:mapping_dict[col][str(x)])
for col in target_raw_cols:
this_month[col].fillna(0, inplace=True)
this_month['age'] = this_month['age'].apply(lambda x: getAge(x))
this_month['antiguedad'] = this_month['antiguedad'].apply(lambda x: getCustSeniority(x))
this_month['renta'] = this_month['renta'].apply(lambda x: getRent(str(x)))
hist_data = in_file_name.loc[:,['ncodpers','fecha_dato'] + target_raw_cols]
del in_file_name
pre_month = hist_data[hist_data['fecha_dato'].isin([date_list[1]])]
pre_month_ncodpers = pre_month[['ncodpers']]
pre_month_target = pre_month[target_raw_cols]
pre_month_target = pre_month_target.add_prefix('1_')
pre_month = pd.concat([pre_month_ncodpers, pre_month_target], axis=1)
this_month = pd.merge(this_month, pre_month, on=['ncodpers'], how='left')
this_month.fillna(0, inplace=True)
    # label: products newly added this month (held now but not held last month)
    for col in target_cols:
this_month[col] = np.where(this_month[col]-this_month['1_'+col] > 0,(this_month[col]-this_month['1_'+col]), 0 )
this_month_target = this_month[target_cols]
this_month = this_month.drop(target_raw_cols, axis=1)
x_vars_list = []
y_vars_list = []
for i in range(2, len(date_list)):
tmp = hist_data[hist_data['fecha_dato'].isin([date_list[i]])].loc[:,['ncodpers'] + target_raw_cols]
tmp = tmp.add_prefix(str(i) + "_")
tmp.rename(columns={str(i) + '_ncodpers': 'ncodpers'}, inplace=True)
this_month = pd.merge(this_month, tmp, on=['ncodpers'], how='left')
this_month.fillna(0, inplace=True)
del hist_data
this_month = add_com_features(this_month)
this_month.fillna(0, inplace=True)
this_month = pd.concat([this_month,this_month_target],axis=1)
    # emit one multiclass training sample per newly added product
    for idx,row in this_month.iterrows():
for i in range(0,22):
if row[(-22+i)]>0:
x_vars_list.append(row[:-22])
y_vars_list.append(i)
return np.array(x_vars_list), np.array(y_vars_list)
def process_test_data(test_file, hist_file, date_list):
    # same encoding and lag construction as the train pipeline, without the labels
for col in cat_cols:
test_file[col] = test_file[col].apply(lambda x: mapping_dict[col][str(x)])
test_file['age'] = test_file['age'].apply(lambda x: getAge(x))
test_file['antiguedad'] = test_file['antiguedad'].apply(lambda x: getCustSeniority(x))
test_file['renta'] = test_file['renta'].apply(lambda x: getRent(x))
for i in range(0, len(date_list)):
tmp = hist_file[hist_file['fecha_dato'].isin([date_list[i]])].loc[:,['ncodpers'] + target_raw_cols]
tmp = tmp.add_prefix(str(i + 1) + "_")
tmp.rename(columns={str(i + 1) + '_ncodpers': 'ncodpers'}, inplace=True)
test_file = pd.merge(test_file, tmp, on=['ncodpers'], how='left')
test_file.fillna(0, inplace=True)
del hist_file
test_file = add_com_features(test_file)
test_file.fillna(0, inplace=True)
return test_file.values
def runXGB_CV(train_X,train_y, test_X, index, seed_val):
    # fit on one KFold training split and predict the full test set (for fold averaging)
train_index, test_index = index
X_train = train_X[train_index]
y_train = train_y[train_index]
xgtrain = xgb.DMatrix(X_train, label=y_train)
xgtest = xgb.DMatrix(test_X)
param = {
'objective' : 'multi:softprob',
'eval_metric' : "mlogloss",
'num_class' : NUM_CLASS,
'silent' : 1,
'min_child_weight' : 2,
'eta': 0.05,
'max_depth': 6,
'subsample' : 0.9,
'colsample_bytree' : 0.8,
'seed' : seed_val
}
num_rounds = 100
model = xgb.train(param, xgtrain, num_rounds)
pred = model.predict(xgtest)
return pred
def runXGB(train_X, train_y, test_X,seed_val=123):
    # fit a single multiclass xgboost model on all the training data
param = {
'objective' : 'multi:softprob',
'eval_metric' : "mlogloss",
'num_class' : NUM_CLASS,
'silent' : 1,
'min_child_weight' : 2,
'eta': 0.05,
'max_depth': 6,
'subsample' : 0.9,
'colsample_bytree' : 0.8,
'seed' : seed_val
}
num_rounds = 100
xgtrain = xgb.DMatrix(train_X, label = train_y)
xgtest = xgb.DMatrix(test_X)
model = xgb.train(param, xgtrain, num_rounds)
preds = model.predict(xgtest)
return preds
if __name__ == "__main__":
    cv_sel = 1  # 1: average predictions over 5 KFold models; otherwise train a single model
start_time = datetime.datetime.now()
data_path = '../input/'
print "feature extract..."
train_file = pd.read_csv(data_path + 'train_ver3.csv',
dtype = {'age': 'str', 'antiguedad': 'str', 'renta': 'str'},
usecols = user_cols)
print datetime.datetime.now() - start_time
train_X, train_y = process_train_data(train_file, ['2015-06-28', '2015-05-28', '2015-04-28',
'2015-03-28', '2015-02-28', '2015-01-28'])
    train_X = train_X[:, 2:]  # drop the ncodpers and fecha_dato columns
print datetime.datetime.now() - start_time
data_date = ['2016-05-28', '2016-04-28', '2016-03-28', '2016-02-28', '2016-01-28']
train_file = train_file[train_file['fecha_dato'].isin(data_date)].loc[:,
['ncodpers', 'fecha_dato'] + target_raw_cols]
test_file = pd.read_csv(data_path + 'test_ver3.csv',
dtype = {'age': 'str', 'antiguedad': 'str', 'renta': 'str'},
usecols = con_cols + cat_cols)
test_X = process_test_data(test_file, train_file, data_date)
print datetime.datetime.now() - start_time
del train_file, test_file
    test_X = test_X[:, 2:]  # drop the ncodpers and fecha_dato columns
print train_X.shape, train_y.shape, test_X.shape
print datetime.datetime.now() - start_time
seed_val = 123
if cv_sel == 1:
print "running model with cv..."
nfolds = 5
kf = KFold(train_X.shape[0], n_folds = nfolds, shuffle = True, random_state = seed_val)
        preds = np.zeros((test_X.shape[0], NUM_CLASS))  # accumulator; a plain list here would break the fold averaging
        for i, index in enumerate(kf):
            preds += runXGB_CV(train_X, train_y, test_X, index, seed_val)
            print 'fold %d' % (i + 1)
        preds = preds / nfolds
else:
print "running model with feature..."
preds = runXGB(train_X, train_y, test_X,seed_val)
del train_X, test_X, train_y
print "Getting the top products.."
target_cols = np.array(target_cols)
    preds = np.argsort(preds, axis = 1)  # sort class indices by predicted probability (ascending)
    preds = np.fliplr(preds)[:, :7]      # keep the 7 most likely products, best first
test_id = np.array(pd.read_csv( data_path + 'test_ver2.csv', usecols = ['ncodpers'])['ncodpers'])
final_preds = [" ".join(list(target_cols[pred])) for pred in preds]
out_df = pd.DataFrame({'ncodpers': test_id, 'added_products': final_preds})
out_df.to_csv('../submit/sub_xgb.csv', index = False)
GitHub: https://github.com/wenwu313/Kaggle-Solution