[Kesci] [Official Round] 2019 China Collegiate Computing Contest, Big Data Challenge (FastText-based news click-through rate prediction, qauc=0.558)

Competition link: https://www.kesci.com/home/competition/5cc51043f71088002c5b8840

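Official round task: text click-through rate estimation (opened May 26)
An important task in search is predicting the click-through rate of a doc under a query, given the query and the title. In this contest, teams predict the click-through rate of specified docs from desensitized data. Results are scored and ranked on online evaluation data using the specified metric, and the team with the best score wins.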
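
The qauc in the title is assumed here to be the AUC computed separately for each query and then averaged over queries; the official scorer may differ in details. A minimal sketch under that assumption (the function names `auc` and `qauc` are hypothetical, and queries whose labels are all 0 or all 1 are skipped, which is also an assumption):

```python
from collections import defaultdict


def auc(labels, scores):
    """AUC as P(score_pos > score_neg) over all pos/neg pairs; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        return None  # AUC is undefined when only one class is present
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def qauc(query_ids, labels, scores):
    """Average the per-query AUC over the queries where it is defined."""
    groups = defaultdict(lambda: ([], []))
    for q, l, s in zip(query_ids, labels, scores):
        groups[q][0].append(l)
        groups[q][1].append(s)
    values = [auc(ls, ss) for ls, ss in groups.values()]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else float("nan")
```

The pairwise loop is quadratic in the number of docs per query, which is fine for a local sanity check on a validation split but would need a rank-based AUC for very large groups.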

Straight to the code (parts of it are adapted from posts shared in the discussion forum).

# Preprocess the dataset: convert it to the format fastText expects
import csv
line_count = 0
with open('/home/kesci/work/labeled_content', 'w') as f:
    with open('/home/kesci/input/bytedance/first-round/train.csv') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        for row in csv_reader:
            query = row[1]
            title = row[3]
            label = row[4]
            # One example per line: "__label__<label> <query> <title>"
            f.write("__label__{0} {1} {2}\n".format(label, query, title))
            line_count += 1
        print(f'Processed {line_count} lines.')
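
For reference, each line written by the loop above is in fastText's supervised format: the `__label__` prefix plus the label, followed by the text. A small sketch of the same formatting as a standalone function (the helper name `to_fasttext_line` and the sample row are made up for illustration, not taken from the dataset):

```python
def to_fasttext_line(label, query, title):
    """Build one training line: '__label__<label> <query> <title>'."""
    # Strip stray whitespace/newlines so each example stays on one line.
    return "__label__{0} {1} {2}".format(label, query.strip(), title.strip())


# Hypothetical row with the same column layout as the code above:
# row[1] is the query, row[3] the title, row[4] the label.
row = ["1", "12 345 6", "1", "12 99 345 7 6", "0"]
print(to_fasttext_line(row[4], row[1], row[3]))
# prints: __label__0 12 345 6 12 99 345 7 6
```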


# Split into training and validation sets (first 90,000 lines for training, last 10,000 for validation)
!head -n 90000 labeled_content > train.txt
!tail -n 10000 labeled_content > valid.txt

# Shuffle the training set (the command was not included in the post)
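
A sketch of one way to shuffle the training split into the file that the training call reads, assuming the split fits in memory. The helper name `shuffle_lines`, the fixed seed, and the input path in the example call are illustrative choices, not necessarily what was actually run:

```python
import random


def shuffle_lines(src_path, dst_path, seed=42):
    """Read all lines from src_path, shuffle them, and write them to dst_path."""
    with open(src_path) as src:
        lines = src.readlines()
    random.Random(seed).shuffle(lines)  # seeded so the shuffle is reproducible
    with open(dst_path, "w") as dst:
        dst.writelines(lines)
    return len(lines)


# Example call (paths assumed from the surrounding steps):
# shuffle_lines("/home/kesci/work/train.txt", "/home/kesci/work/shuffled.csv")
```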

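# Train and check the results
# (newer fastText releases name the Python module `fasttext`, lowercase)
from fastText import train_supervised
# loss='hs' is hierarchical softmax; word n-grams up to length 5; 5.5M hash buckets
classifier = train_supervised(input='/home/kesci/work/shuffled.csv', loss='hs',
                              wordNgrams=5, bucket=5500000, lr=0.5)

def print_results(N, p, r):
    # N: number of examples, p: precision@1, r: recall@1
    print("N\t" + str(N))
    print("P@{}\t{:.3f}".format(1, p))
    print("R@{}\t{:.3f}".format(1, r))

print_results(*classifier.test("/home/kesci/work/valid.txt"))
# Save under the file name that the prediction step loads
classifier.save_model("/home/kesci/work/modelhswn5b55.bin")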

# Predict with the model and write the results to disk
import csv
from fastText import load_model
loaded_model = load_model("/home/kesci/work/modelhswn5b55.bin")
with open('/home/kesci/work/resulthswn5b55.csv', 'w') as f:
    with open('/home/kesci/input/bytedance/first-round/test.csv') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        for row in csv_reader:
            query_id = row[0]
            query_title_id = row[2]
            # predict returns (labels, probabilities) for the top-1 label
            prediction = loaded_model.predict(row[1] + ' ' + row[3])
            pred = prediction[1][0]
            label = prediction[0][0]
            # Output P(label == 1): flip the probability when label 0 is predicted
            if label == '__label__0':
                pred = 1 - pred
            f.write("{0},{1},{2}\n".format(query_id, query_title_id, pred))
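
The loop above turns fastText's top-1 prediction into a click probability: when the predicted label is `__label__0`, the returned probability is for "no click", so it is flipped. The same logic as a standalone function that can be checked without a trained model (the name `click_probability` is hypothetical; its input mirrors the `(labels, probabilities)` pair returned by `predict`):

```python
def click_probability(prediction):
    """Convert a fastText top-1 prediction into P(label == 1).

    `prediction` is a (labels, probabilities) pair such as
    (('__label__0',), [0.8]).
    """
    top_label = prediction[0][0]
    top_prob = float(prediction[1][0])
    # With two labels, P(click) = 1 - P(no click) when label 0 is on top.
    return 1.0 - top_prob if top_label == "__label__0" else top_prob


print(click_probability((("__label__1",), [0.7])))   # prints 0.7
print(click_probability((("__label__0",), [0.75])))  # prints 0.25
```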

fastText really is convenient: training produced results in about two hours. I am sharing the code and parameters here for reference.
