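As a complete beginner at algorithm competitions, I want to share this experience as a record of my progress.
I started reading Tianchi competition code in February. Nearly two months later I have the basics down and understand how a competition works end to end. I worked through it on my own and it was rough going: more than once a single bug cost me several days, with nobody to ask, so progress was slow.
On to the topic: the O2O competition I just worked on.
After signing up for the Tianchi newcomer competition, I first went to the Tianchi tech forum to study, and found a baseline titled "Get started with the Tianchi O2O coupon usage newcomer competition in 100 lines of code". I ran it as-is; it worked without trouble and the code is fairly simple. It does almost no feature engineering and uses SGDClassifier. The final AUC was 0.5287, ranked 412/13500, a long way from the first-place score of 0.81.
So I modified the code on top of that baseline: I extracted a few features and switched to an XGBoost model. The submitted result scored an AUC of 0.5379 and moved up 60-odd places to 350/13500. The code follows, with a detailed walkthrough.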
import os, sys, pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
from datetime import date
from sklearn.model_selection import KFold, train_test_split, StratifiedKFold, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import log_loss, roc_auc_score, auc, roc_curve
from sklearn.preprocessing import MinMaxScaler
import xgboost as xgb
#import lightgbm as lgb
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
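Part 1: Loading the datasets
A quick look at the data first.
dfoff = pd.read_csv("../input/ccf_offline_stage1_train.csv", keep_default_na=False)
dftest = pd.read_csv("../input/ccf_offline_stage1_test_revised.csv", keep_default_na=False)
dfon = pd.read_csv("../input/ccf_online_stage1_train.csv", keep_default_na=False)
dfoff.head(5)  # first 5 rows of the training set
dftest.head(5)  # first 5 rows of the test set
The training set has one more column than the test set, Date. The task is to fit a model on the training set and then predict, for each row of the test set, whether the user will redeem the coupon.
The competition provides real online and offline consumption records from January 1, 2016 to June 30, 2016; the goal is to predict whether coupons received in July 2016 are used within 15 days of receipt.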
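Data cleaning 1: check for missing values
dfoff.isnull().sum().sort_values(ascending=False).head(10)  # all zeros: keep_default_na=False keeps the 'null' strings as text, so nothing is parsed as NaN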
Date 0
Date_received 0
Distance 0
Discount_rate 0
Coupon_id 0
Merchant_id 0
User_id 0
dtype: int64
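Why every count above is zero can be seen on a tiny in-memory CSV. This is a minimal sketch with made-up rows (not the competition data): with keep_default_na=False pandas keeps the literal string 'null', so isnull() finds nothing, whereas the default settings of read_csv treat 'null' as a missing value.

```python
import io
import pandas as pd

# Two made-up rows in the same style as the offline training file.
csv_text = "User_id,Date_received,Date\n1,20160101,null\n2,null,20160105\n"

as_text = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)  # 'null' stays a string
as_nan = pd.read_csv(io.StringIO(csv_text))                          # 'null' becomes NaN

text_missing = int(as_text.isnull().sum().sum())
nan_missing = int(as_nan.isnull().sum().sum())
null_strings = int((as_text[["Date_received", "Date"]] == "null").sum().sum())
print(text_missing, nan_missing, null_strings)
```

This is why the rest of the walkthrough compares against the string 'null' instead of calling isnull().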
dfoff.info()  # check the column dtypes
RangeIndex: 1754884 entries, 0 to 1754883
Data columns (total 7 columns):
User_id int64
Merchant_id int64
Coupon_id object
Discount_rate object
Distance object
Date_received object
Date object
dtypes: int64(2), object(5)
memory usage: 93.7+ MB
dfoff["Date_received"].unique()  # distinct Date_received values in the offline training set
dfoff["Date"].unique()
print("Coupon received, purchased:", dfoff[(dfoff["Date_received"] != "null") & (dfoff["Date"] != "null")].shape[0])
print('Coupon received, no purchase: %d' % dfoff[(dfoff['Date_received'] != 'null') & (dfoff['Date'] == 'null')].shape[0])
print('No coupon, purchased: %d' % dfoff[(dfoff['Date_received'] == 'null') & (dfoff['Date'] != 'null')].shape[0])
print('No coupon, no purchase: %d' % dfoff[(dfoff['Date_received'] == 'null') & (dfoff['Date'] == 'null')].shape[0])
Coupon received, purchased: 75382
Coupon received, no purchase: 977900
No coupon, purchased: 701602
No coupon, no purchase: 0
# users that appear in the test set but not in the training set
print('1. User_id in test set but not in training set', set(dftest['User_id']) - set(dfoff['User_id']))
# merchants that appear in the test set but not in the training set
print('2. Merchant_id in test set but not in training set', set(dftest['Merchant_id']) - set(dfoff['Merchant_id']))
1. User_id in test set but not in training set {2495873, 1286474}
2. Merchant_id in test set but not in training set {5920}
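Part 2: Cleaning the data and extracting features
Tip 1: unique() lists the distinct values in a column.
Tip 2: to work with one column of the dataset, select it as dfoff["column name"] and chain a method call, as in the examples below.
Feature 1: discount rate
The first feature that comes to mind is the discount rate: the deeper the discount, the more likely the user is to redeem the coupon. Start by using unique() to see which discount formats occur.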
print('Discount_rate values:', dfoff['Discount_rate'].unique())
Discount_rate values: ['null' '150:20' '20:1' '200:20' '30:5' '50:10' '10:5' '100:10' '200:30'
 '20:5' '30:10' '50:5' '150:10' '100:30' '200:50' '100:50' '300:30'
 '50:20' '0.9' '10:1' '30:1' '0.95' '100:5' '5:1' '100:20' '0.8' '50:1'
 '200:10' '300:20' '100:1' '150:30' '300:50' '20:10' '0.85' '0.6' '150:50'
 '0.75' '0.5' '200:5' '0.7' '30:20' '300:10' '0.2' '50:30' '200:100'
 '150:5']
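The output shows three kinds of Discount_rate value:
The first is 'null', meaning no discount.
The second looks like 150:20, meaning 20 yuan off a purchase of at least 150 yuan.
The third looks like 0.95, meaning the customer pays 95% of the price.
So we write four functions to extract four features:
Discount type: getDiscountType
Discount rate: convertRate
Spending threshold ("man", the amount to reach): getDiscountMan
Amount taken off ("jian"): getDiscountJian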
# convert Discount_rate and Distance
def getDiscountType(row):  # row is one Discount_rate value
    if row == 'null':
        return 'null'  # no discount
    elif ':' in row:
        return 1  # "spend X, get Y off" coupon
    else:
        return 0  # direct discount rate

def convertRate(row):
    """Convert discount to rate"""
    if row == 'null':  # no discount
        return 1.0
    elif ':' in row:
        rows = row.split(':')
        return 1.0 - float(rows[1]) / float(rows[0])  # "spend X, get Y off" expressed as a rate
    else:
        return float(row)  # already a rate

def getDiscountMan(row):
    if ':' in row:
        rows = row.split(':')
        return int(rows[0])  # spending threshold, rows[0]
    else:
        return 0  # no threshold

def getDiscountJian(row):
    if ':' in row:
        rows = row.split(':')
        return int(rows[1])  # amount taken off, rows[1]
    else:
        return 0  # nothing taken off
def processData(df):  # adds the discount and distance features to df
    # convert Discount_rate
    df['discount_rate'] = df['Discount_rate'].apply(convertRate)  # apply the function defined above to every Discount_rate value
    df['discount_man'] = df['Discount_rate'].apply(getDiscountMan)
    df['discount_jian'] = df['Discount_rate'].apply(getDiscountJian)
    df['discount_type'] = df['Discount_rate'].apply(getDiscountType)
    print(df['discount_rate'].unique())  # distinct discount_rate values after conversion
    # convert distance
    df['distance'] = df['Distance'].replace('null', -1).astype(int)
    return df

dfoff = processData(dfoff)  # process the offline training set
dftest = processData(dftest)  # process the test set the same way
[1. 0.86666667 0.95 0.9 0.83333333 0.8
0.5 0.85 0.75 0.66666667 0.93333333 0.7
0.6 0.96666667 0.98 0.99 0.975 0.33333333
0.2 0.4 ]
[0.83333333 0.9 0.96666667 0.8 0.95 0.75
0.98 0.5 0.86666667 0.6 0.66666667 0.7
0.85 0.33333333 0.94 0.93333333 0.975 0.99 ]
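To make the three formats concrete, here is a small self-contained sketch that parses one Discount_rate string into all four features at once. The helper name parse_discount is hypothetical (it is not part of the original code); it mirrors the four functions above, except that it encodes "no discount" as -1 instead of the string 'null', an assumption made so that the type column stays numeric.

```python
def parse_discount(value):
    """Return (rate, threshold, reduction, kind) for one Discount_rate string."""
    if value == 'null':
        # No coupon discount: full price, kind -1 (assumed encoding).
        return 1.0, 0, 0, -1
    if ':' in value:
        # "threshold:reduction", e.g. '150:20' means 20 off when spending 150.
        threshold, reduction = (int(part) for part in value.split(':'))
        return 1.0 - reduction / threshold, threshold, reduction, 1
    # Plain rate such as '0.95'.
    return float(value), 0, 0, 0

for raw in ['null', '150:20', '0.95']:
    print(raw, parse_discount(raw))
```

For '150:20' the rate is 1 - 20/150, about 0.867, which matches the 0.86666667 in the output above.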
# show the first two rows: the training set now has five new columns, "discount_rate", "discount_man", "discount_jian", "discount_type" and "distance"
dfoff.head(2)
#print('Distance values:', dfoff['Distance'].unique())
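Data cleaning 2: type conversion
Handling the "Distance" column
#df['distance'] = df['Distance'].replace('null', -1).astype(int)  # replace 'null' with -1 and cast to int (already done inside processData)
#print(df['distance'].unique())  # check the distinct distance values
dftest.head(2)  # the test set went through the same processing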
date_received = dfoff['Date_received'].unique()  # a NumPy array of the distinct values
date_received = sorted(date_received[date_received != 'null'])
date_buy = dfoff['Date'].unique()
date_buy = sorted(date_buy[date_buy != 'null'])
print('Coupons received from', date_received[0], 'to', date_received[-1])
print('Purchases made from', date_buy[0], 'to', date_buy[-1])
Coupons received from 20160101 to 20160615
Purchases made from 20160101 to 20160630
Feature 2: day of the week, since when people spend is likely related to the weekday
def getWeekday(row):  # row is one Date_received value as a string
    if row == 'null':  # no date: return the value unchanged
        return row
    else:
        return date(int(row[0:4]), int(row[4:6]), int(row[6:8])).weekday() + 1

# cast Date_received to str, apply getWeekday, and store the result (1 = Monday ... 7 = Sunday) in weekday
dfoff['weekday'] = dfoff['Date_received'].astype(str).apply(getWeekday)
dftest['weekday'] = dftest['Date_received'].astype(str).apply(getWeekday)
# weekday_type: 1 for Saturday and Sunday, 0 otherwise
dfoff['weekday_type'] = dfoff['weekday'].apply(lambda x: 1 if x in [6, 7] else 0)
dftest['weekday_type'] = dftest['weekday'].apply(lambda x: 1 if x in [6, 7] else 0)
dfoff.head()
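On 1.75 million rows, a value-by-value apply is slow. As an alternative sketch (my own variation on toy data, not the code used for the submission), pandas can compute the weekday for the whole column at once. Note the difference in behavior: entries such as 'null' become NaN here instead of staying as the string 'null'.

```python
import pandas as pd

# Toy Date_received values; 'null' marks a row without a coupon.
received = pd.Series(['20160101', 'null', '20160618', '20160619'])

# errors='coerce' turns anything that is not a valid date into NaT.
parsed = pd.to_datetime(received, format='%Y%m%d', errors='coerce')
weekday = parsed.dt.dayofweek + 1                # 1 = Monday ... 7 = Sunday, NaN if no date
weekday_type = weekday.isin([6, 7]).astype(int)  # 1 on Saturday/Sunday, else 0

print(pd.DataFrame({'received': received, 'weekday': weekday, 'weekday_type': weekday_type}))
```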
Data processing 3: one-hot encoding
# change weekday to one-hot encoding
weekdaycols = ['weekday_' + str(i) for i in range(1, 8)]
print(weekdaycols)  # weekdaycols is a list of column names
tmpdf = pd.get_dummies(dfoff['weekday'].replace('null', np.nan))  # one-hot encode in one call
tmpdf.columns = weekdaycols  # rename the columns
dfoff[weekdaycols] = tmpdf
tmpdf = pd.get_dummies(dftest['weekday'].replace('null', np.nan))  # one-hot encode in one call
tmpdf.columns = weekdaycols
dftest[weekdaycols] = tmpdf
dfoff.head()
['weekday_1', 'weekday_2', 'weekday_3', 'weekday_4', 'weekday_5', 'weekday_6', 'weekday_7']
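Assigning tmpdf.columns = weekdaycols relies on every weekday from 1 to 7 actually occurring in the data; if one were missing, get_dummies would return fewer than seven columns and the assignment would fail. One way to guard against that, sketched here on toy data under the assumption that weekdays are coded 1 to 7, is to build each indicator column by direct comparison.

```python
import pandas as pd

# Toy weekday column in the same style as above: ints plus the string 'null'.
weekday = pd.Series([5, 'null', 6, 7], dtype=object)
weekdaycols = ['weekday_' + str(i) for i in range(1, 8)]

# 'null' becomes NaN, which never equals any weekday number.
numeric = pd.to_numeric(weekday, errors='coerce')

# One 0/1 column per weekday, always seven columns regardless of the data.
onehot = pd.DataFrame({
    col: (numeric == day).astype(int)
    for day, col in enumerate(weekdaycols, start=1)
})
print(onehot)
```

Rows without a coupon date end up with zeros in all seven columns, the same as with get_dummies on NaN.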
With this simple feature extraction we now have 14 usable features:
discount_rate
discount_type
discount_man
discount_jian
distance
weekday
weekday_type
weekday_1
weekday_2
weekday_3
weekday_4
weekday_5
weekday_6
weekday_7
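Labeling
With features in hand, the training samples still need labels: which are positive (y = 1) and which are negative (y = 0). We want to predict whether a coupon is used within 15 days of being received, so there are three cases:
1. Date_received == 'null':
no coupon was received, so the row is ignored, y = -1
2. (Date_received != 'null') & (Date != 'null') & (Date - Date_received <= 15):
the coupon was received and used within 15 days, a positive sample, y = 1
3. (Date_received != 'null') & ((Date == 'null') | (Date - Date_received > 15)):
the coupon was received but not used within 15 days, a negative sample, y = 0
With the rules settled, we can define the labeling function.
Positive and negative samples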
def label(row):  # row is one row of the DataFrame
    if row['Date_received'] == 'null':
        return -1  # no coupon received, so the row is ignored
    if row['Date'] != 'null':
        td = pd.to_datetime(row['Date'], format='%Y%m%d') - pd.to_datetime(row['Date_received'], format='%Y%m%d')
        if td <= pd.Timedelta(15, 'D'):  # purchase within 15 days of receiving the coupon: positive sample
            return 1
    return 0  # coupon not used within 15 days: negative sample

dfoff['label'] = dfoff.apply(label, axis=1)
dfoff["label"].unique()  # distinct label values
array([-1, 0, 1], dtype=int64)
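value_counts() counts how many times each value occurs.
We can use it on the labeled training set to see how many positive and negative samples there are:
print(dfoff['label'].value_counts())  # count positive and negative samples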
0 988887
-1 701602
1 64395
Name: label, dtype: int64
There are 64395 positive samples and 988887 negative samples, roughly 15 negatives for every positive. This heavy imbalance is also why AUC is used as the evaluation metric.
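dfoff.apply(label, axis=1) calls pd.to_datetime once per row, which is slow at this size. A vectorized sketch of the same three rules on toy rows follows; it is my own restatement, not the code that produced the counts above.

```python
import numpy as np
import pandas as pd

# Four toy rows: no coupon, used on day 15, used on day 16, never used.
toy = pd.DataFrame({
    'Date_received': ['null', '20160101', '20160101', '20160101'],
    'Date':          ['20160105', '20160116', '20160117', 'null'],
})

# 'null' is coerced to NaT, so missing dates propagate through the subtraction.
received = pd.to_datetime(toy['Date_received'], format='%Y%m%d', errors='coerce')
used = pd.to_datetime(toy['Date'], format='%Y%m%d', errors='coerce')
gap = used - received  # NaT when either date is missing

# Conditions are checked in order: no coupon -> -1, used within 15 days -> 1, else 0.
toy['label'] = np.select(
    [received.isna(), gap <= pd.Timedelta(days=15)],
    [-1, 1],
    default=0,
)
print(toy)
```

A comparison against NaT evaluates to False, so a coupon that was never used falls through to the default label 0.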
dfoff.columns.tolist() lists the column names
print('Existing columns:', dfoff.columns.tolist())  # check which columns exist so far
Existing columns: ['User_id', 'Merchant_id', 'Coupon_id', 'Discount_rate', 'Distance', 'Date_received', 'Date', 'discount_rate', 'discount_man', 'discount_jian', 'discount_type', 'distance', 'weekday', 'weekday_type', 'weekday_1', 'weekday_2', 'weekday_3', 'weekday_4', 'weekday_5', 'weekday_6', 'weekday_7', 'label']
dfoff.head(2)
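Part 3: Building the model
Next comes the main step, building the machine learning model. The features are the 14 extracted above. To check the model's performance we hold out a validation set, split by the date the coupon was received: training set 20160101-20160515, validation set 20160516-20160615. We use XGBoost.
XGBoost
1. Split into training and validation sets
Note that the prediction used for scoring is a continuous score per sample (ideally the probability that the sample is positive), not a hard 0/1 label.
The validation AUC can then be computed with the AUC functions that ship with sklearn.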
# data split
df = dfoff[dfoff['label'] != -1].copy()  # keep only rows where a coupon was received (label 0 or 1)
train = df[(df['Date_received'] < '20160516')].copy()  # received before 20160516: training set
valid = df[(df['Date_received'] >= '20160516') & (df['Date_received'] <= '20160615')].copy()  # received from 20160516 to 20160615: validation set
print(train['label'].value_counts())  # class counts in the training set
print(valid['label'].value_counts())  # class counts in the validation set
0 759172
1 41524
Name: label, dtype: int64
0 229715
1 22871
Name: label, dtype: int64
#train.head(5)
#valid.head(5)
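The split compares dates as strings. That works because dates written as YYYYMMDD sort in the same order lexicographically as chronologically. A toy illustration with made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    'Date_received': ['20160102', '20160515', '20160516', '20160615'],
    'label': [0, 1, 0, 1],
})

# Same boundaries as the split above, applied to the toy rows.
train = toy[toy['Date_received'] < '20160516']
valid = toy[(toy['Date_received'] >= '20160516') & (toy['Date_received'] <= '20160615')]
print(len(train), len(valid))
```

Splitting by time rather than at random keeps the validation set later than the training set, which mirrors how the July test data relates to the training period.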
y = train.label
# drop() removes the listed columns
# drop the identifier and raw columns that are not used as features (training set)
X = train.drop(["User_id","Merchant_id","Coupon_id","Discount_rate","Distance","Date","Date_received","label"],axis=1)
val_y = valid.label
# same for the validation set
val_X = valid.drop(["User_id","Merchant_id","Coupon_id","Discount_rate","Distance","Date","Date_received","label"],axis=1)
# same for the test set, which has no Date or label column
tests = dftest.drop(["User_id","Merchant_id","Coupon_id","Discount_rate","Distance","Date_received"],axis=1)
val_X["weekday"].unique(), val_X["discount_type"].unique()
# weekday and discount_type turn out to be object dtype, so they need to be cast
(array([6, 1, 4, 3, 2, 7, 5], dtype=object), array([1, 0], dtype=object))
# astype() converts a column's dtype
# cast weekday and discount_type to int
X["weekday"] = X["weekday"].astype(int)
X["discount_type"] = X["discount_type"].astype(int)
val_X["weekday"] = val_X["weekday"].astype(int)
val_X["discount_type"] = val_X["discount_type"].astype(int)
tests["weekday"].unique()
array([2, 3, 5, 6, 7, 1, 4], dtype=int64)
val_X["weekday"].unique()  # check again
array([6, 1, 4, 3, 2, 7, 5], dtype=int64)
# build the XGBoost DMatrix objects
xgb_val = xgb.DMatrix(val_X, label=val_y)  # validation features with labels
xgb_train = xgb.DMatrix(X, label=y)  # training features with labels
xgb_test = xgb.DMatrix(tests)  # test features
xgb_val_X = xgb.DMatrix(val_X)  # validation features without labels
# custom evaluation function: AUC averaged over coupons
def myauc(test):
    testgroup = test.groupby(["Coupon_id"])
    aucs = []
    for i in testgroup:
        tmpdf = i[1]
        if len(tmpdf['label'].unique()) != 2:  # skip coupons with only one class
            continue
        fpr, tpr, thresholds = roc_curve(tmpdf['label'], tmpdf['pred'], pos_label=1)
        aucs.append(auc(fpr, tpr))
    return np.average(aucs)
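The metric averages AUC over coupons and skips coupons whose labels are all 0 or all 1, because AUC is undefined there. The toy sketch below reproduces that idea without scikit-learn, using the rank-sum form of AUC; the helper names rank_auc and mean_coupon_auc are hypothetical. It also illustrates that rescaling all scores with one increasing function, as min-max scaling does, leaves the result unchanged.

```python
import pandas as pd

def rank_auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney) formula; ties get average ranks."""
    ranks = scores.rank(method='average')
    n_pos = int((labels == 1).sum())
    n_neg = int((labels == 0).sum())
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def mean_coupon_auc(df):
    """Average AUC over the coupons that contain both classes."""
    aucs = [rank_auc(group['label'], group['pred'])
            for _, group in df.groupby('Coupon_id')
            if group['label'].nunique() == 2]
    return sum(aucs) / len(aucs)

# Coupon 'a' is ranked perfectly, 'b' is ranked backwards, 'c' has one class only.
toy = pd.DataFrame({
    'Coupon_id': ['a', 'a', 'a', 'b', 'b', 'c', 'c'],
    'label':     [1, 0, 0, 1, 0, 0, 0],
    'pred':      [0.9, 0.2, 0.4, 0.3, 0.6, 0.5, 0.1],
})
score = mean_coupon_auc(toy)

# Global min-max scaling keeps the ordering, so the metric does not move.
lo, hi = toy['pred'].min(), toy['pred'].max()
scaled = toy.assign(pred=(toy['pred'] - lo) / (hi - lo))
scaled_score = mean_coupon_auc(scaled)
print(score, scaled_score)
```

Because the score is computed inside each coupon, a model is rewarded for ranking users correctly for a given coupon, not for ranking coupons against each other.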
XGBoost setup
params = {'booster': 'gbtree',
          #'objective': 'rank:pairwise',  # left unset, so XGBoost uses its default regression objective
          'eval_metric': 'auc',
          'gamma': 0.1,
          'min_child_weight': 1.1,
          'max_depth': 5,
          'lambda': 10,
          'subsample': 0.7,
          'colsample_bytree': 0.7,
          'colsample_bylevel': 0.7,
          'eta': 0.01,
          'tree_method': 'exact',
          'seed': 0,
          'nthread': 12
          }
watchlist = [(xgb_train, 'train')]  # only the training set is watched, so early stopping tracks training AUC
model = xgb.train(params, xgb_train, num_boost_round=1000, evals=watchlist, early_stopping_rounds=100)
model.save_model('C:/Users/Administrator/o2o.code/notebook/xgbmodel')  # save the trained model
val_X.head()
valid.head()
model = xgb.Booster()
model.load_model('C:/Users/Administrator/o2o.code/notebook/xgbmodel')  # load the saved model
temp = valid[["Coupon_id", "label"]].copy()  # keep only the "Coupon_id" and "label" columns
temp['pred'] = model.predict(xgb_val)  # predict on the validation set
temp['pred'] = MinMaxScaler(copy=True, feature_range=(0, 1)).fit_transform(temp['pred'].values.reshape(-1, 1)).ravel()  # rescale the scores to [0, 1]
print(myauc(temp))  # average per-coupon AUC from the custom function
temp.head()
0.5518357641394374
tests.head()
val_X.head()
# predict on the test set
y_test = dftest[['User_id','Coupon_id',"Date_received"]].copy()
y_test['label'] = model.predict(xgb_test)  # predictions for the test set
# save the submission file
y_test.to_csv("C:/Users/Administrator/o2o.code/notebook/second.csv", index=None, header=None)
y_test.head()