I was originally going to do the Titanic competition, but there is already so much material on it online that there's no point repeating it. So instead, here's a write-up on hotel recommendations:
https://www.kaggle.com/c/expedia-hotel-recommendations
At first I planned to follow this entry: https://www.kaggle.com/zfturbo/leakage-solution
However, he never published a notebook, only raw code. After a quick look, it turns out his approach exploits a leakage in the data, so it's of limited value as a reference.
I therefore mainly followed the second-place write-up instead:
https://www.kaggle.com/dvasyukova/predict-hotel-type-with-pandas
Overview:
The task is to predict which hotel a user will book. Given the available data (user behavior plus events associated with that behavior), we recommend five hotel clusters per user; a prediction counts as correct if the hotel the user eventually books falls among those five. The training data is randomly sampled from 2013 and 2014, and the test set is randomly sampled from 2015.
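Submissions are scored with Mean Average Precision at 5 (MAP@5). Since each row has exactly one correct cluster, the per-row average precision reduces to 1/rank of the correct answer within the five predictions. A minimal sketch of the metric (function names are my own):

```python
def average_precision_at_5(predicted, actual):
    """AP@5 for one row: 'actual' is the booked cluster id,
    'predicted' is an ordered list of up to five cluster ids."""
    for rank, p in enumerate(predicted[:5], start=1):
        if p == actual:
            return 1.0 / rank  # one relevant item, so AP is 1/rank
    return 0.0

def map_at_5(predictions, actuals):
    """Mean of AP@5 over all rows."""
    return sum(average_precision_at_5(p, a)
               for p, a in zip(predictions, actuals)) / len(actuals)

# A hit at rank 1 scores 1.0, at rank 2 only 0.5
print(map_at_5([[91, 41], [41, 91]], [91, 91]))  # (1.0 + 0.5) / 2 = 0.75
```

This is why the order of the five recommendations matters, not just their membership.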
The data roughly looks like this:
train/test.csv
Column name | Description | Data type |
---|---|---|
date_time | Timestamp | string
site_name | ID of the Expedia point of sale the booking/click was made on | int
posa_continent | ID of continent associated with site_name | int |
user_location_country | The ID of the country the customer is located | int
user_location_region | The ID of the region the customer is located | int |
user_location_city | The ID of the city the customer is located | int |
orig_destination_distance | Physical distance between a hotel and a customer at the time of search. A null means the distance could not be calculated | double |
user_id | ID of user | int |
is_mobile | 1 when a user connected from a mobile device, 0 otherwise | tinyint |
is_package | 1 if the click/booking was generated as a part of a package (i.e. combined with a flight), 0 otherwise | int |
channel | ID of a marketing channel | int |
srch_ci | Checkin date | string |
srch_co | Checkout date | string |
srch_adults_cnt | The number of adults specified in the hotel room | int |
srch_children_cnt | The number of (extra occupancy) children specified in the hotel room | int |
srch_rm_cnt | The number of hotel rooms specified in the search | int |
srch_destination_id | ID of the destination where the hotel search was performed | int |
srch_destination_type_id | Type of destination | int |
hotel_continent | Hotel continent | int |
hotel_country | Hotel country | int |
hotel_market | Hotel market | int |
is_booking | 1 if a booking, 0 if a click | tinyint |
cnt | Number of similar events in the context of the same user session | bigint
hotel_cluster | ID of a hotel cluster | int |
destinations.csv
Column name | Description | Data type |
---|---|---|
srch_destination_id | ID of the destination where the hotel search was performed | int |
d1-d149 | latent description of search regions | double |
The first leakage, found purely by inspecting the data:
The pair (user_location_city, orig_destination_distance) identifies the correct hotel with close to 100% certainty, and in most cases the same hotel carries the same hotel_cluster. So whenever a pair in the test set also appears in the training set, you can directly output the training set's hotel_cluster for it; that is the data leak. The second thing computed is the most popular hotel_clusters for the same srch_destination_id (the community later added hotel_country, hotel_market and booking year to improve accuracy). The third, less important signal is the most popular clusters per hotel_country, plus the most popular clusters overall.
Since we may output five answers, we start from the strongest signal:
1) the (user_location_city, orig_destination_distance) pair;
2) if the training set has no such pair, fill the remaining slots of the five by srch_destination_id;
3) repeat step 2) with hotel_country, and finally with the overall top clusters.
The code is as follows:
```python
# coding: utf-8
__author__ = 'ZFTurbo: https://kaggle.com/zfturbo'

import datetime
from heapq import nlargest
from operator import itemgetter
from collections import defaultdict


def run_solution():
    print('Preparing arrays...')
    f = open("train.csv", "r")
    f.readline()

    best_hotels_od_ulc = defaultdict(lambda: defaultdict(int))
    best_hotels_search_dest = defaultdict(lambda: defaultdict(int))
    best_hotels_search_dest1 = defaultdict(lambda: defaultdict(int))
    best_hotel_country = defaultdict(lambda: defaultdict(int))
    popular_hotel_cluster = defaultdict(int)
    total = 0

    # Calc counts
    while 1:
        line = f.readline().strip()
        total += 1

        if total % 10000000 == 0:
            print('Read {} lines...'.format(total))

        if line == '':
            break

        arr = line.split(",")
        book_year = int(arr[0][:4])
        user_location_city = arr[5]
        orig_destination_distance = arr[6]
        srch_destination_id = arr[16]
        is_booking = int(arr[18])
        hotel_country = arr[21]
        hotel_market = arr[22]
        hotel_cluster = arr[23]

        # bookings are weighted much more heavily than clicks
        append_1 = 3 + 17 * is_booking
        append_2 = 1 + 5 * is_booking

        if user_location_city != '' and orig_destination_distance != '':
            best_hotels_od_ulc[(user_location_city, orig_destination_distance)][hotel_cluster] += 1

        if srch_destination_id != '' and hotel_country != '' and hotel_market != '' and book_year == 2014:
            best_hotels_search_dest[(srch_destination_id, hotel_country, hotel_market)][hotel_cluster] += append_1

        if srch_destination_id != '':
            best_hotels_search_dest1[srch_destination_id][hotel_cluster] += append_1

        if hotel_country != '':
            best_hotel_country[hotel_country][hotel_cluster] += append_2

        popular_hotel_cluster[hotel_cluster] += 1

    f.close()

    print('Generate submission...')
    now = datetime.datetime.now()
    path = 'submission_' + str(now.strftime("%Y-%m-%d-%H-%M")) + '.csv'
    out = open(path, "w")
    f = open("test.csv", "r")
    f.readline()
    total = 0
    out.write("id,hotel_cluster\n")
    topclasters = nlargest(5, sorted(popular_hotel_cluster.items()), key=itemgetter(1))

    while 1:
        line = f.readline().strip()
        total += 1

        if total % 1000000 == 0:
            print('Write {} lines...'.format(total))

        if line == '':
            break

        arr = line.split(",")
        id = arr[0]
        user_location_city = arr[6]
        orig_destination_distance = arr[7]
        srch_destination_id = arr[17]
        hotel_country = arr[20]
        hotel_market = arr[21]

        out.write(str(id) + ',')
        filled = []

        # 1) the leak pair (user_location_city, orig_destination_distance)
        s1 = (user_location_city, orig_destination_distance)
        if s1 in best_hotels_od_ulc:
            d = best_hotels_od_ulc[s1]
            topitems = nlargest(5, sorted(d.items()), key=itemgetter(1))
            for i in range(len(topitems)):
                if topitems[i][0] in filled:
                    continue
                if len(filled) == 5:
                    break
                out.write(' ' + topitems[i][0])
                filled.append(topitems[i][0])

        # 2) most popular clusters for the destination
        s2 = (srch_destination_id, hotel_country, hotel_market)
        if s2 in best_hotels_search_dest:
            d = best_hotels_search_dest[s2]
            topitems = nlargest(5, d.items(), key=itemgetter(1))
            for i in range(len(topitems)):
                if topitems[i][0] in filled:
                    continue
                if len(filled) == 5:
                    break
                out.write(' ' + topitems[i][0])
                filled.append(topitems[i][0])
        elif srch_destination_id in best_hotels_search_dest1:
            d = best_hotels_search_dest1[srch_destination_id]
            topitems = nlargest(5, d.items(), key=itemgetter(1))
            for i in range(len(topitems)):
                if topitems[i][0] in filled:
                    continue
                if len(filled) == 5:
                    break
                out.write(' ' + topitems[i][0])
                filled.append(topitems[i][0])

        # 3) most popular clusters by hotel_country, then overall
        if hotel_country in best_hotel_country:
            d = best_hotel_country[hotel_country]
            topitems = nlargest(5, d.items(), key=itemgetter(1))
            for i in range(len(topitems)):
                if topitems[i][0] in filled:
                    continue
                if len(filled) == 5:
                    break
                out.write(' ' + topitems[i][0])
                filled.append(topitems[i][0])

        for i in range(len(topclasters)):
            if topclasters[i][0] in filled:
                continue
            if len(filled) == 5:
                break
            out.write(' ' + topclasters[i][0])
            filled.append(topclasters[i][0])

        out.write("\n")

    out.close()
    print('Completed!')


run_solution()
```
Now for the second author's approach. First, import the basic libraries:
```python
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
from subprocess import check_output
```
Then take a look at the data. Because the file is very large, it is read in chunks of 1,000,000 rows, keeping only the features expected to matter for the final result.
```python
train = pd.read_csv('../input/train.csv',
                    dtype={'is_booking': bool, 'srch_destination_id': np.int32, 'hotel_cluster': np.int32},
                    usecols=['srch_destination_id', 'is_booking', 'hotel_cluster'],
                    chunksize=1000000)

aggs = []
print('-'*38)
for chunk in train:
    agg = chunk.groupby(['srch_destination_id',
                         'hotel_cluster'])['is_booking'].agg(['sum', 'count'])
    agg.reset_index(inplace=True)
    aggs.append(agg)
    print('.', end='')
print('')
aggs = pd.concat(aggs, axis=0)
aggs.head()
```
 | srch_destination_id | hotel_cluster | sum | count
---|---|---|---|---
0 | 1 | 20 | 0.0 | 2
1 | 1 | 30 | 0.0 | 1
2 | 1 | 60 | 0.0 | 2
3 | 4 | 22 | 1.0 | 2
4 | 4 | 25 | 1.0 | 2
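Because `is_booking` was read as a boolean, `sum` counts only the bookings in each group while `count` counts all events (bookings plus clicks). A tiny illustration with made-up rows:

```python
import pandas as pd

# Toy events: two searches for cluster 20 (one booked), one click on cluster 30
toy = pd.DataFrame({
    'srch_destination_id': [1, 1, 1],
    'hotel_cluster':       [20, 20, 30],
    'is_booking':          [True, False, False],
})
toy_agg = toy.groupby(['srch_destination_id',
                       'hotel_cluster'])['is_booking'].agg(['sum', 'count'])
toy_agg.reset_index(inplace=True)
# cluster 20: sum=1 (one booking), count=2 (two events)
# cluster 30: sum=0, count=1
print(toy_agg)
```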
Next, aggregate again to compute the total number of bookings across all chunks. Clicks are obtained by subtracting bookings from the total row count. A "relevance" score for each hotel cluster is then computed as a weighted sum of bookings and clicks.
```python
CLICK_WEIGHT = 0.05

agg = aggs.groupby(['srch_destination_id', 'hotel_cluster']).sum().reset_index()
agg['count'] -= agg['sum']
agg = agg.rename(columns={'sum': 'bookings', 'count': 'clicks'})
agg['relevance'] = agg['bookings'] + CLICK_WEIGHT * agg['clicks']
agg.head()
```
 | srch_destination_id | hotel_cluster | bookings | clicks | relevance
---|---|---|---|---|---
0 | 0 | 3 | 0.0 | 2.0 | 0.10
1 | 1 | 20 | 4.0 | 22.0 | 5.10
2 | 1 | 30 | 2.0 | 20.0 | 3.00
3 | 1 | 57 | 0.0 | 1.0 | 0.05
4 | 1 | 60 | 0.0 | 17.0 | 0.85
To find the most popular hotel clusters per destination, define a function that returns the most popular clusters for each group. An earlier version used the nlargest() Series method to get the indices of the largest elements, but that was quite slow; it was updated to a much faster argsort-based version.
```python
def most_popular(group, n_max=5):
    relevance = group['relevance'].values
    hotel_cluster = group['hotel_cluster'].values
    most_popular = hotel_cluster[np.argsort(relevance)[::-1]][:n_max]
    return np.array_str(most_popular)[1:-1]  # remove square brackets
```
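`np.argsort(relevance)` sorts ascending, so reversing with `[::-1]` puts the highest-relevance clusters first. A quick check on a made-up group (the function is repeated here so the snippet is self-contained):

```python
import numpy as np
import pandas as pd

def most_popular(group, n_max=5):
    relevance = group['relevance'].values
    hotel_cluster = group['hotel_cluster'].values
    most_popular = hotel_cluster[np.argsort(relevance)[::-1]][:n_max]
    return np.array_str(most_popular)[1:-1]  # remove square brackets

toy_group = pd.DataFrame({'hotel_cluster': np.array([10, 20, 30]),
                          'relevance':     np.array([0.5, 2.0, 1.0])})
print(most_popular(toy_group))  # highest relevance first: 20 30 10
```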
Get the most popular hotel clusters for every destination:
```python
most_pop = agg.groupby(['srch_destination_id']).apply(most_popular)
most_pop = pd.DataFrame(most_pop).rename(columns={0: 'hotel_cluster'})
most_pop.head()
```
srch_destination_id | hotel_cluster
---|---
0 | 3
1 | 20 30 60 57
2 | 20 30 53 46 41
3 | 53 60
4 | 82 25 32 58 78
Predicting the test set:
Read the test set and merge in the most popular hotel clusters.
```python
test = pd.read_csv('../input/test.csv',
                   dtype={'srch_destination_id': np.int32},
                   usecols=['srch_destination_id'])
test = test.merge(most_pop, how='left', left_on='srch_destination_id', right_index=True)
test.head()
```
 | srch_destination_id | hotel_cluster
---|---|---
0 | 12243 | 5 55 37 11 22
1 | 14474 | 5
2 | 11353 | 0 31 77 91 96
3 | 8250 | 1 45 79 24 54
4 | 11812 | 91 42 2 48 59
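The `how='left'` merge keeps every test row even when its `srch_destination_id` is absent from `most_pop`, in which case `hotel_cluster` comes out as NaN. A small illustration of this merge pattern (data is made up):

```python
import pandas as pd

# Hypothetical popularity table indexed by srch_destination_id
pop = pd.DataFrame({'hotel_cluster': ['20 30 60', '53 60']},
                   index=pd.Index([1, 3], name='srch_destination_id'))
queries = pd.DataFrame({'srch_destination_id': [1, 2, 3]})

merged = queries.merge(pop, how='left',
                       left_on='srch_destination_id', right_index=True)
# destination 2 was never seen, so its hotel_cluster is NaN
print(merged)
```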
Check for null values in the test set's hotel_cluster column.
```python
test.hotel_cluster.isnull().sum()
# 14036
```
It looks like about 14k destinations in the test set never appear in training. Let's fill those NAs with the overall most popular hotel clusters.
```python
most_pop_all = agg.groupby('hotel_cluster')['relevance'].sum().nlargest(5).index
most_pop_all = np.array_str(most_pop_all)[1:-1]

test.hotel_cluster.fillna(most_pop_all, inplace=True)
test.hotel_cluster.to_csv('predicted_with_pandas.csv', header=True, index_label='id')
```
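Writing the Series with `header=True` and `index_label='id'` yields the two-column submission layout the competition expects, with the row index as the id and the five space-separated clusters as one string. A self-contained sketch with toy predictions (writing to an in-memory buffer instead of a file):

```python
import io
import pandas as pd

preds = pd.Series(['20 30 60', '53 60'], name='hotel_cluster')
buf = io.StringIO()
preds.to_csv(buf, header=True, index_label='id')
print(buf.getvalue())
```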