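Kaggle Getting Started Series, Translated (Part 2): Expedia

I originally planned to cover the Titanic competition, but there is already so much material about it online that I will not repeat it. Here is one on hotel recommendations instead: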

https://www.kaggle.com/c/expedia-hotel-recommendations

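I first planned to follow this link: https://www.kaggle.com/zfturbo/leakage-solution

However, the author did not write a notebook, only code. A quick look shows that it exploits a leakage in the data, so it has little value as a learning reference.

So I then followed the second-ranked link: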

https://www.kaggle.com/dvasyukova/predict-hotel-type-with-pandas

Basic introduction:

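The task is to predict which hotel a user will book in the future, or more precisely which hotel_cluster. Using the available data (user behavior and the events associated with it), we recommend five hotel clusters per user; a prediction earns credit as long as the cluster the user finally chooses is among the five recommended. The training data is a random sample from 2013 and 2014, and the test set is a random sample from 2015.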
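To make the scoring concrete: the competition's evaluation page lists Mean Average Precision at 5 (MAP@5) as the metric, so the position of the correct cluster among the five guesses matters too. Below is a minimal sketch of that metric for the one-true-label case; the function names and the toy data are my own, not from the competition or from either kernel.

```python
def average_precision_at_5(actual, predicted):
    """Score one row: 1/rank of the true cluster within the first five guesses, else 0."""
    for rank, guess in enumerate(predicted[:5], start=1):
        if guess == actual:
            return 1.0 / rank
    return 0.0


def map_at_5(actuals, predictions):
    """Mean of the per-row scores."""
    scores = [average_precision_at_5(a, p) for a, p in zip(actuals, predictions)]
    return sum(scores) / len(scores)


# Toy data (made up): the true cluster and five ranked guesses per row.
actuals = [91, 5, 42]
predictions = [[91, 42, 2, 48, 59],   # hit at rank 1 -> 1.0
               [1, 5, 79, 24, 54],    # hit at rank 2 -> 0.5
               [0, 31, 77, 91, 96]]   # miss          -> 0.0
score = map_at_5(actuals, predictions)
print(score)
```

With these three rows the mean is (1.0 + 0.5 + 0.0) / 3 = 0.5, which is why both solutions below put their most trusted guess first.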

The data looks roughly like this:

train/test.csv

Column name | Description | Data type
date_time | Timestamp | string
site_name | ID of the website the purchase was made on | int
posa_continent | ID of continent associated with site_name | int
user_location_country | ID of the country the customer is located in | int
user_location_region | The ID of the region the customer is located in | int
user_location_city | The ID of the city the customer is located in | int
orig_destination_distance | Physical distance between a hotel and a customer at the time of search. A null means the distance could not be calculated | double
user_id | ID of user | int
is_mobile | 1 when a user connected from a mobile device, 0 otherwise | tinyint
is_package | 1 if the click/booking was generated as a part of a package (i.e. combined with a flight), 0 otherwise | int
channel | ID of a marketing channel | int
srch_ci | Checkin date | string
srch_co | Checkout date | string
srch_adults_cnt | The number of adults specified in the hotel room | int
srch_children_cnt | The number of (extra occupancy) children specified in the hotel room | int
srch_rm_cnt | The number of hotel rooms specified in the search | int
srch_destination_id | ID of the destination where the hotel search was performed | int
srch_destination_type_id | Type of destination | int
hotel_continent | Hotel continent | int
hotel_country | Hotel country | int
hotel_market | Hotel market | int
is_booking | 1 if a booking, 0 if a click | tinyint
cnt | Number of similar events in the context of the same user session | bigint
hotel_cluster | ID of a hotel cluster | int

destinations.csv

Column name | Description | Data type
srch_destination_id | ID of the destination where the hotel search was performed | int
d1-d149 | latent description of search regions | double

 

 

The first solution, the leakage, seen from the data's point of view:

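The variable pair user_location_city and orig_destination_distance identifies the correct hotel with close to 100% probability, and in most cases rows that share this pair also share the same hotel_cluster. So if a pair in the test set also occurs in the training set, you can output the correct hotel_cluster straight from the training data. This is a data leak. The second thing we compute is the most popular hotel_clusters for each srch_destination_id (the community later added hotel_country, hotel_market and book_year to improve accuracy). The third, less important, is the most popular clusters per hotel_country and the most popular clusters overall.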
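One way to check a claim like this on your own copy of the data is to count how many distinct clusters each (city, distance) pair maps to. The snippet below is only a sketch on made-up rows; the variable names toy, clusters_per_pair and share_unique are my own, and the 0.75 it prints says nothing about the real dataset.

```python
import pandas as pd

# Toy rows (made up) with the three columns involved in the leak.
toy = pd.DataFrame({
    'user_location_city':        [10, 10, 10, 20, 20, 30],
    'orig_destination_distance': [1.5, 1.5, 7.0, 3.2, 3.2, 9.9],
    'hotel_cluster':             [5, 5, 8, 2, 4, 1],
})

# Number of distinct clusters seen for each (city, distance) pair.
clusters_per_pair = (toy
                     .groupby(['user_location_city', 'orig_destination_distance'])['hotel_cluster']
                     .nunique())

# Share of pairs that map to exactly one cluster.
share_unique = (clusters_per_pair == 1).mean()
print(clusters_per_pair)
print(share_unique)
```

Here three of the four pairs point at a single cluster; on the real train.csv the same groupby would show how strong the leak actually is.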

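Since we may output 5 answers per row, we fill them starting from the most reliable source:

1) the (user_location_city, orig_destination_distance) pair;

2) if the training set has no such pair, or slots are still free, fill the rest of the five from the most popular clusters for the srch_destination_id;

3) repeat step 2) with hotel_country, and finally with the overall most popular clusters.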
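The fallback order above can be sketched on toy data. This is a simplified illustration, not the author's script: it uses plain counts, leaves out the hotel_country level and the booking weights, and the names by_pair, by_dest, overall and recommend are my own.

```python
from collections import Counter, defaultdict

# Toy training rows (made up): (city, distance, srch_destination_id, hotel_cluster).
train_rows = [
    ('c1', '1.5', 'd1', '5'),
    ('c1', '1.5', 'd1', '5'),
    ('c2', '3.0', 'd1', '8'),
    ('c3', '4.0', 'd2', '2'),
    ('c3', '4.0', 'd2', '2'),
    ('c4', '6.0', 'd2', '8'),
    ('c5', '2.0', 'd3', '8'),
    ('c5', '2.0', 'd3', '8'),
    ('c6', '9.0', 'd3', '5'),
]

by_pair = defaultdict(Counter)   # level 1: the leaky (city, distance) pair
by_dest = defaultdict(Counter)   # level 2: the search destination
overall = Counter()              # level 3: global popularity
for city, dist, dest, cluster in train_rows:
    by_pair[(city, dist)][cluster] += 1
    by_dest[dest][cluster] += 1
    overall[cluster] += 1


def recommend(city, dist, dest, n=5):
    """Fill up to n slots, consulting the most specific level first."""
    filled = []
    for counts in (by_pair.get((city, dist)), by_dest.get(dest), overall):
        if not counts:
            continue
        for cluster, _ in counts.most_common():
            if cluster not in filled and len(filled) < n:
                filled.append(cluster)
    return filled


seen = recommend('c1', '1.5', 'd1')     # pair seen in training
unseen = recommend('c9', '0.1', 'd2')   # pair unseen, destination known
print(seen, unseen)
```

For the seen pair the leaked cluster '5' comes first; for the unseen pair the list starts with the destination's favourite '2' instead.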

The code is as follows:

# coding: utf-8
__author__ = 'ZFTurbo: https://kaggle.com/zfturbo'

import datetime
from heapq import nlargest
from operator import itemgetter
from collections import defaultdict


def run_solution():
    print('Preparing arrays...')
    f = open("train.csv", "r")
    f.readline()
    best_hotels_od_ulc = defaultdict(lambda: defaultdict(int))
    best_hotels_search_dest = defaultdict(lambda: defaultdict(int))
    best_hotels_search_dest1 = defaultdict(lambda: defaultdict(int))
    best_hotel_country = defaultdict(lambda: defaultdict(int))
    popular_hotel_cluster = defaultdict(int)
    total = 0

    # Calc counts
    while 1:
        line = f.readline().strip()
        total += 1

        if total % 10000000 == 0:
            print('Read {} lines...'.format(total))

        if line == '':
            break

        arr = line.split(",")
        book_year = int(arr[0][:4])
        user_location_city = arr[5]
        orig_destination_distance = arr[6]
        srch_destination_id = arr[16]
        is_booking = int(arr[18])
        hotel_country = arr[21]
        hotel_market = arr[22]
        hotel_cluster = arr[23]

        append_1 = 3 + 17*is_booking
        append_2 = 1 + 5*is_booking

        if user_location_city != '' and orig_destination_distance != '':
            best_hotels_od_ulc[(user_location_city, orig_destination_distance)][hotel_cluster] += 1

        if srch_destination_id != '' and hotel_country != '' and hotel_market != '' and book_year == 2014:
            best_hotels_search_dest[(srch_destination_id, hotel_country, hotel_market)][hotel_cluster] += append_1
        
        if srch_destination_id != '':
            best_hotels_search_dest1[srch_destination_id][hotel_cluster] += append_1
        
        if hotel_country != '':
            best_hotel_country[hotel_country][hotel_cluster] += append_2
        
        popular_hotel_cluster[hotel_cluster] += 1
    
    f.close()

    print('Generate submission...')
    now = datetime.datetime.now()
    path = 'submission_' + str(now.strftime("%Y-%m-%d-%H-%M")) + '.csv'
    out = open(path, "w")
    f = open("test.csv", "r")
    f.readline()
    total = 0
    out.write("id,hotel_cluster\n")
    topclasters = nlargest(5, sorted(popular_hotel_cluster.items()), key=itemgetter(1))

    while 1:
        line = f.readline().strip()
        total += 1

        if total % 1000000 == 0:
            print('Write {} lines...'.format(total))

        if line == '':
            break

        arr = line.split(",")
        id = arr[0]
        user_location_city = arr[6]
        orig_destination_distance = arr[7]
        srch_destination_id = arr[17]
        hotel_country = arr[20]
        hotel_market = arr[21]

        out.write(str(id) + ',')
        filled = []

        s1 = (user_location_city, orig_destination_distance)
        if s1 in best_hotels_od_ulc:
            d = best_hotels_od_ulc[s1]
            topitems = nlargest(5, sorted(d.items()), key=itemgetter(1))
            for i in range(len(topitems)):
                if topitems[i][0] in filled:
                    continue
                if len(filled) == 5:
                    break
                out.write(' ' + topitems[i][0])
                filled.append(topitems[i][0])

        s2 = (srch_destination_id, hotel_country, hotel_market)
        if s2 in best_hotels_search_dest:
            d = best_hotels_search_dest[s2]
            topitems = nlargest(5, d.items(), key=itemgetter(1))
            for i in range(len(topitems)):
                if topitems[i][0] in filled:
                    continue
                if len(filled) == 5:
                    break
                out.write(' ' + topitems[i][0])
                filled.append(topitems[i][0])
        elif srch_destination_id in best_hotels_search_dest1:
            d = best_hotels_search_dest1[srch_destination_id]
            topitems = nlargest(5, d.items(), key=itemgetter(1))
            for i in range(len(topitems)):
                if topitems[i][0] in filled:
                    continue
                if len(filled) == 5:
                    break
                out.write(' ' + topitems[i][0])
                filled.append(topitems[i][0])

        if hotel_country in best_hotel_country:
            d = best_hotel_country[hotel_country]
            topitems = nlargest(5, d.items(), key=itemgetter(1))
            for i in range(len(topitems)):
                if topitems[i][0] in filled:
                    continue
                if len(filled) == 5:
                    break
                out.write(' ' + topitems[i][0])
                filled.append(topitems[i][0])

        for i in range(len(topclasters)):
            if topclasters[i][0] in filled:
                continue
            if len(filled) == 5:
                break
            out.write(' ' + topclasters[i][0])
            filled.append(topclasters[i][0])

        out.write("\n")
    out.close()
    print('Completed!')

run_solution()
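One detail of the script that is easy to miss is the weighting in append_1 and append_2: a click adds 3 (or 1) to a cluster's score, while a booking adds 20 (or 6). The short sketch below replays the same formulas on made-up events; the helper weights and the toy list events are my own names.

```python
from collections import defaultdict


def weights(is_booking):
    """Same formulas as append_1 and append_2 in the script above."""
    return 3 + 17 * is_booking, 1 + 5 * is_booking


# Toy events (made up): (hotel_cluster, is_booking).
events = [('5', 0), ('5', 0), ('5', 0), ('8', 1)]
score = defaultdict(int)
for cluster, is_booking in events:
    dest_weight, _country_weight = weights(is_booking)
    score[cluster] += dest_weight

print(weights(0), weights(1))
print(dict(score))
```

Three clicks on cluster '5' add up to 9, still well below the 20 that a single booking gives cluster '8', so bookings dominate the per-destination ranking.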

 

The rest of this post walks through the second author's approach. First, import the basic libraries:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output

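Next, look at the data. Because the file is very large, it is read in chunks of 1,000,000 rows. Only the columns considered useful for the final result are loaded.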

train = pd.read_csv('../input/train.csv',
                    dtype={'is_booking':bool,'srch_destination_id':np.int32, 'hotel_cluster':np.int32},
                    usecols=['srch_destination_id','is_booking','hotel_cluster'],
                    chunksize=1000000)
aggs = []
print('-'*38)
for chunk in train:
    agg = chunk.groupby(['srch_destination_id',
                         'hotel_cluster'])['is_booking'].agg(['sum','count'])
    agg.reset_index(inplace=True)
    aggs.append(agg)
    print('.',end='')
print('')
aggs = pd.concat(aggs, axis=0)
aggs.head()
   srch_destination_id  hotel_cluster  sum  count
0                    1             20  0.0      2
1                    1             30  0.0      1
2                    1             60  0.0      2
3                    4             22  1.0      2
4                    4             25  1.0      2
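The chunked read can be tried without the multi-gigabyte file. The sketch below runs the same groupby on a tiny in-memory CSV with a chunk size of 2; the CSV contents and the names csv_text, reader, parts and partial are made up for illustration.

```python
import io
import pandas as pd

# Tiny in-memory CSV (made up) standing in for train.csv.
csv_text = """srch_destination_id,is_booking,hotel_cluster,cnt
1,0,20,1
1,1,20,1
4,1,22,2
1,0,20,1
4,0,22,1
"""

reader = pd.read_csv(io.StringIO(csv_text),
                     usecols=['srch_destination_id', 'is_booking', 'hotel_cluster'],
                     chunksize=2)   # 2 rows per chunk instead of 1,000,000

parts = []
for chunk in reader:
    part = (chunk.groupby(['srch_destination_id', 'hotel_cluster'])['is_booking']
                 .agg(['sum', 'count'])
                 .reset_index())
    parts.append(part)

partial = pd.concat(parts, axis=0)
print(len(parts))
print(partial)
```

Note that the same (srch_destination_id, hotel_cluster) key shows up once per chunk it occurs in, so the concatenated result still has to be aggregated a second time.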

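Next, we aggregate once more to get the total number of bookings across all chunks. The number of clicks is obtained by subtracting the bookings from the total row count. The "relevance" of a hotel cluster is then a weighted sum of its bookings and clicks.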

CLICK_WEIGHT = 0.05
agg = aggs.groupby(['srch_destination_id','hotel_cluster']).sum().reset_index()
agg['count'] -= agg['sum']
agg = agg.rename(columns={'sum':'bookings','count':'clicks'})
agg['relevance'] = agg['bookings'] + CLICK_WEIGHT * agg['clicks']
agg.head()
   srch_destination_id  hotel_cluster  bookings  clicks  relevance
0                    0              3       0.0     2.0       0.10
1                    1             20       4.0    22.0       5.10
2                    1             30       2.0    20.0       3.00
3                    1             57       0.0     1.0       0.05
4                    1             60       0.0    17.0       0.85
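As a check on the arithmetic, take destination 1 and cluster 20 from the output above: 4 bookings and 22 clicks give 4 + 0.05 × 22 = 5.10. The sketch below reproduces that row and the next one from made-up per-chunk aggregates (the split into chunks is my own invention; only the totals match the output).

```python
import pandas as pd

CLICK_WEIGHT = 0.05

# Per-chunk aggregates (made up) in which the same key occurs twice.
aggs = pd.DataFrame({
    'srch_destination_id': [1, 1, 1],
    'hotel_cluster':       [20, 20, 30],
    'sum':                 [3, 1, 2],      # bookings in the chunk
    'count':               [15, 11, 22],   # all events in the chunk
})

agg = aggs.groupby(['srch_destination_id', 'hotel_cluster']).sum().reset_index()
agg['count'] -= agg['sum']   # events minus bookings = clicks
agg = agg.rename(columns={'sum': 'bookings', 'count': 'clicks'})
agg['relevance'] = agg['bookings'] + CLICK_WEIGHT * agg['clicks']
print(agg)
```

With CLICK_WEIGHT at 0.05, twenty clicks count as much as one booking.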

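Find the most popular hotel clusters for each destination. We define a function that, given the group for one destination, returns its most popular clusters. An earlier version used the Series method nlargest() to get the indices of the largest elements, but that method is rather slow, so the author replaced it with a faster version.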

def most_popular(group, n_max=5):
    relevance = group['relevance'].values
    hotel_cluster = group['hotel_cluster'].values
    most_popular = hotel_cluster[np.argsort(relevance)[::-1]][:n_max]
    return np.array_str(most_popular)[1:-1] # remove square brackets
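To see that the argsort version selects the same clusters as the nlargest route, here is a small comparison on a made-up group with distinct relevance values (with ties the two could order equal entries differently). It makes no claim about speed; the names group, via_nlargest, via_argsort and as_text are my own.

```python
import numpy as np
import pandas as pd

# Toy group (made up) for a single destination; relevance values are distinct.
group = pd.DataFrame({
    'hotel_cluster': [20, 30, 57, 60],
    'relevance':     [5.10, 3.00, 0.05, 0.85],
})

# Series.nlargest route: pick the top rows, then read off their clusters.
top_idx = group['relevance'].nlargest(5).index
via_nlargest = group.loc[top_idx, 'hotel_cluster'].values

# argsort route: sort ascending, reverse, keep the first five.
order = np.argsort(group['relevance'].values)[::-1][:5]
via_argsort = group['hotel_cluster'].values[order]

as_text = np.array_str(via_argsort)[1:-1]   # strip the square brackets
print(via_nlargest, via_argsort, as_text)
```

Both routes return clusters 20, 30, 60, 57 in that order, and the final string is the space-separated form the submission file expects.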

Get the most popular hotel clusters for every destination:

most_pop = agg.groupby(['srch_destination_id']).apply(most_popular)
most_pop = pd.DataFrame(most_pop).rename(columns={0:'hotel_cluster'})
most_pop.head()
                      hotel_cluster
srch_destination_id
0                                 3
1                       20 30 60 57
2                    20 30 53 46 41
3                             53 60
4                    82 25 32 58 78

Predicting on the test set:

Read the test set and merge in the most popular hotel clusters.

test = pd.read_csv('../input/test.csv',
                    dtype={'srch_destination_id':np.int32},
                    usecols=['srch_destination_id'],)

test = test.merge(most_pop, how='left',left_on='srch_destination_id',right_index=True)
test.head()
   srch_destination_id   hotel_cluster
0                12243   5 55 37 11 22
1                14474               5
2                11353  0 31 77 91 96
3                 8250  1 45 79 24 54
4                11812  91 42 2 48 59
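The merge joins a column of the left frame against the index of the right frame, and how='left' keeps every test row even when its destination has no entry. A small sketch with made-up frames (destination 99 stands for one that never occurred in training):

```python
import pandas as pd

# Toy lookup table (made up), indexed by srch_destination_id like most_pop.
most_pop = pd.DataFrame({'hotel_cluster': ['20 30 60 57', '53 60']},
                        index=pd.Index([1, 3], name='srch_destination_id'))

# Toy test rows; destination 99 never occurred in training.
test = pd.DataFrame({'srch_destination_id': [3, 99, 1]})

merged = test.merge(most_pop, how='left',
                    left_on='srch_destination_id', right_index=True)
n_missing = merged['hotel_cluster'].isnull().sum()
print(merged)
print(n_missing)
```

The row for destination 99 survives with a missing hotel_cluster, which is exactly the situation the next step counts and then fills.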

Check for null values in the hotel_cluster column of test.

test.hotel_cluster.isnull().sum()

14036

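It looks like about 14k test rows have destinations that never appear in the training data. Let's fill these NAs with the overall most popular hotel clusters.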

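# Overall top five clusters by total relevance, as a space-separated string
most_pop_all = agg.groupby('hotel_cluster')['relevance'].sum().nlargest(5).index
most_pop_all = np.array_str(most_pop_all.values)[1:-1]
# Assign the result back: fillna(inplace=True) on an attribute-accessed column may not update test in newer pandas
test['hotel_cluster'] = test['hotel_cluster'].fillna(most_pop_all)
test.hotel_cluster.to_csv('predicted_with_pandas.csv',header=True, index_label='id')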
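The fill step can be tried on toy frames. In this sketch the fallback string is built with ' '.join rather than np.array_str, which is my own variation; the frames and the name fallback are made up.

```python
import numpy as np
import pandas as pd

# Toy relevance table (made up) in the shape of agg.
agg = pd.DataFrame({
    'hotel_cluster': [20, 30, 20, 57],
    'relevance':     [5.10, 3.00, 1.00, 0.05],
})
# Toy predictions with one unseen destination.
test = pd.DataFrame({'hotel_cluster': ['53 60', np.nan]})

# Sum relevance per cluster over all destinations and keep the top five.
top = agg.groupby('hotel_cluster')['relevance'].sum().nlargest(5).index
fallback = ' '.join(str(c) for c in top)

test['hotel_cluster'] = test['hotel_cluster'].fillna(fallback)
print(fallback)
print(test)
```

Only the missing row receives the fallback; rows that already have a per-destination prediction are left unchanged.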

 
