A typical example is the paraphrase task: deciding whether two sentences are semantically equivalent, so its label set is the binary set {equivalent, not equivalent}. Many other tasks also fall under sentence-pair matching, such as similar-question matching and answer selection in question-answering systems.
In the previous post I described an unsupervised sentence-matching method based on Doc2vec and Word2vec; here I tackle the same task with traditional machine learning algorithms. In the machine-learning setting, the mapping function F is fitted by training a classification model; once the model is trained, data that has not yet been classified can simply be fed to it, and the trained model outputs the predicted label directly.
On the classification algorithms:
Common classification models include logistic regression (LR), naive Bayes, SVM, GBDT, and random forest (RandomForest). The algorithms used in this post are logistic regression (LR), SVM, GBDT, and random forest (RandomForest).
Since scikit-learn bundles the common machine learning algorithms (classification, regression, clustering, and so on), it is used throughout this post; the version is 0.17.1.
On feature selection:
Since I have been working with Doc2vec and Word2vec recently, and the comparison in the previous post showed that Doc2vec sentence vectors outperform sentence vectors obtained by averaging Word2vec vectors, Doc2vec is used here to represent each sentence as a 100-dimensional vector, and these 100-dimensional vectors are fed directly to the classifiers as features.
On the dataset:
The dataset is the Question Pairs semantic-equivalence dataset released by Quora, the same one used in the previous post; it can be downloaded via this link. It contains over 400,000 labeled question pairs: if the two questions are semantically equivalent the label is 1, otherwise 0. Counting all questions, there are more than 530,000 of them. The format is shown below:
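Concretely, going by the columns that load_data reads in the code below, each row of the TSV file carries the fields

id	qid1	qid2	question1	question2	is_duplicate

where qid1 and qid2 are the unique ids of the two questions and is_duplicate is the 0/1 equivalence label.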
After collecting all the questions, a Doc2vec vector is trained for each question and used as the feature input to the classifiers.
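The Doc2vec training itself was described in the previous post. As a rough sketch only (assuming a recent gensim, with illustrative parameters rather than the ones actually used, and note that load_doc2vec below additionally adds 1 to the stored id, so the exact id convention of the real vector file differs slightly), per-question vectors could be produced and dumped in an "id<TAB>space-separated floats" file roughly like this:

# Illustrative sketch, not the actual training code from the previous post.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def train_question_vectors(questions, qids, outpath):
    # Deduplicate questions by id, then build one TaggedDocument per question;
    # the tag is the question id, so vectors can later be looked up by id.
    unique = {}
    for q, qid in zip(questions, qids):
        unique[qid] = q
    docs = [TaggedDocument(words=q.split(), tags=[str(qid)])
            for qid, q in unique.items()]
    model = Doc2Vec(docs, vector_size=100, window=5, min_count=1,
                    epochs=20, workers=4)
    with open(outpath, 'w') as f:
        for qid in unique:
            vec = model.dv[str(qid)]
            # One line per question: "id<TAB>space-separated floats".
            f.write('%s\t%s\n' % (qid, ' '.join('%f' % x for x in vec)))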
After shuffling the corpus, 10,000 pairs are split off as the validation set and the rest are used for training.
The training code is given below.
The code for loading the data and fetching each sentence's doc2vec vector is shared by all four models and is listed first:
# coding:utf-8
import numpy as np
import csv
import datetime
import os
import pandas as pd
from sklearn import metrics, feature_extraction
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

cwd = os.getcwd()

def load_data(datapath):
    # Read the tab-separated question-pair file:
    # id, qid1, qid2, question1, question2, is_duplicate.
    data_train = pd.read_csv(datapath, sep='\t', encoding='utf-8')
    print data_train.shape
    qid1 = []
    qid2 = []
    question1 = []
    question2 = []
    labels = []
    for idx in range(data_train.id.shape[0]):
        print idx
        q1 = data_train.qid1[idx]
        q2 = data_train.qid2[idx]
        qid1.append(q1)
        qid2.append(q2)
        question1.append(data_train.question1[idx])
        question2.append(data_train.question2[idx])
        labels.append(data_train.is_duplicate[idx])
    return qid1, qid2, question1, question2, labels

def load_doc2vec(word2vecpath):
    # Each line of the vector file is "question id<TAB>100 space-separated floats".
    f = open(word2vecpath)
    embeddings_index = {}
    for line in f:
        values = line.split('\t')
        id = values[0]
        print id
        coefs = np.asarray(values[1].split(), dtype='float32')
        embeddings_index[int(id) + 1] = coefs
    f.close()
    print('Total %s word vectors.' % len(embeddings_index))
    return embeddings_index

def sentence_represention(qid, embeddings_index):
    # Look up the 100-dimensional Doc2vec vector for every question id.
    vectors = np.zeros((len(qid), 100))
    for i in range(len(qid)):
        print i
        vectors[i] = embeddings_index.get(qid[i])
    return vectors
After replacing the dataset path and the doc2vec path in main() with your own, the code can be run directly.
1. Logistic regression (LR):

def main():
    start = datetime.datetime.now()
    datapath = 'D:/dataset/quora/quora_duplicate_questions_Chinese_seg.tsv'
    doc2vecpath = "D:/dataset/quora/vector2/quora_duplicate_question_doc2vec_100.vector"
    qid1, qid2, question1, question2, labels = load_data(datapath)
    embeddings_index = load_doc2vec(word2vecpath=doc2vecpath)
    vectors1 = sentence_represention(qid1, embeddings_index)
    vectors2 = sentence_represention(qid2, embeddings_index)
    # Concatenate the two 100-dim sentence vectors into one 200-dim feature vector per pair.
    vectors = np.hstack((vectors1, vectors2))
    labels = np.array(labels)
    VALIDATION_SPLIT = 10000
    indices = np.arange(vectors.shape[0])
    np.random.shuffle(indices)
    vectors = vectors[indices]
    labels = labels[indices]
    # Hold out the last 10,000 shuffled pairs as the validation set.
    train_vectors = vectors[:-VALIDATION_SPLIT]
    train_labels = labels[:-VALIDATION_SPLIT]
    test_vectors = vectors[-VALIDATION_SPLIT:]
    test_labels = labels[-VALIDATION_SPLIT:]
    lr = LogisticRegression()
    print '***********************training************************'
    lr.fit(train_vectors, train_labels)
    print '***********************predict*************************'
    prediction = lr.predict(test_vectors)
    accuracy = metrics.accuracy_score(test_labels, prediction)
    print accuracy
    end = datetime.datetime.now()
    print end - start

if __name__ == '__main__':
    main()
2. SVM:

def main():
    start = datetime.datetime.now()
    datapath = 'D:/dataset/quora/quora_duplicate_questions_Chinese_seg.tsv'
    doc2vecpath = "D:/dataset/quora/vector2/quora_duplicate_question_doc2vec_100.vector"
    qid1, qid2, question1, question2, labels = load_data(datapath)
    embeddings_index = load_doc2vec(word2vecpath=doc2vecpath)
    vectors1 = sentence_represention(qid1, embeddings_index)
    vectors2 = sentence_represention(qid2, embeddings_index)
    # Concatenate the two 100-dim sentence vectors into one 200-dim feature vector per pair.
    vectors = np.hstack((vectors1, vectors2))
    labels = np.array(labels)
    VALIDATION_SPLIT = 10000
    indices = np.arange(vectors.shape[0])
    np.random.shuffle(indices)
    vectors = vectors[indices]
    labels = labels[indices]
    # Hold out the last 10,000 shuffled pairs as the validation set.
    train_vectors = vectors[:-VALIDATION_SPLIT]
    train_labels = labels[:-VALIDATION_SPLIT]
    test_vectors = vectors[-VALIDATION_SPLIT:]
    test_labels = labels[-VALIDATION_SPLIT:]
    svm = SVC()
    print '***********************training************************'
    svm.fit(train_vectors, train_labels)
    print '***********************predict*************************'
    prediction = svm.predict(test_vectors)
    accuracy = metrics.accuracy_score(test_labels, prediction)
    print accuracy
    end = datetime.datetime.now()
    print end - start

if __name__ == '__main__':
    main()
3. GBDT:

def main():
    start = datetime.datetime.now()
    datapath = 'D:/dataset/quora/quora_duplicate_questions_Chinese_seg.tsv'
    doc2vecpath = "D:/dataset/quora/vector2/quora_duplicate_question_doc2vec_100.vector"
    qid1, qid2, question1, question2, labels = load_data(datapath)
    embeddings_index = load_doc2vec(word2vecpath=doc2vecpath)
    vectors1 = sentence_represention(qid1, embeddings_index)
    vectors2 = sentence_represention(qid2, embeddings_index)
    # Concatenate the two 100-dim sentence vectors into one 200-dim feature vector per pair.
    vectors = np.hstack((vectors1, vectors2))
    labels = np.array(labels)
    VALIDATION_SPLIT = 10000
    indices = np.arange(vectors.shape[0])
    np.random.shuffle(indices)
    vectors = vectors[indices]
    labels = labels[indices]
    # Hold out the last 10,000 shuffled pairs as the validation set.
    train_vectors = vectors[:-VALIDATION_SPLIT]
    train_labels = labels[:-VALIDATION_SPLIT]
    test_vectors = vectors[-VALIDATION_SPLIT:]
    test_labels = labels[-VALIDATION_SPLIT:]
    # GBDT with sklearn's default settings spelled out explicitly.
    gbdt = GradientBoostingClassifier(init=None, learning_rate=0.1, loss='deviance',
                                      max_depth=3, max_features=None, max_leaf_nodes=None,
                                      min_samples_leaf=1, min_samples_split=2,
                                      min_weight_fraction_leaf=0.0, n_estimators=100,
                                      random_state=None, subsample=1.0, verbose=0,
                                      warm_start=False)
    print '***********************training************************'
    gbdt.fit(train_vectors, train_labels)
    print '***********************predict*************************'
    prediction = gbdt.predict(test_vectors)
    accuracy = metrics.accuracy_score(test_labels, prediction)
    # score() computes the same held-out accuracy; printed as a sanity check.
    acc = gbdt.score(test_vectors, test_labels)
    print accuracy
    print acc
    end = datetime.datetime.now()
    print end - start

if __name__ == '__main__':
    main()
4. Random forest (RandomForest):

def main():
    start = datetime.datetime.now()
    datapath = 'D:/dataset/quora/quora_duplicate_questions_Chinese_seg.tsv'
    doc2vecpath = "D:/dataset/quora/vector2/quora_duplicate_question_doc2vec_100.vector"
    qid1, qid2, question1, question2, labels = load_data(datapath)
    embeddings_index = load_doc2vec(word2vecpath=doc2vecpath)
    vectors1 = sentence_represention(qid1, embeddings_index)
    vectors2 = sentence_represention(qid2, embeddings_index)
    # Concatenate the two 100-dim sentence vectors into one 200-dim feature vector per pair.
    vectors = np.hstack((vectors1, vectors2))
    labels = np.array(labels)
    VALIDATION_SPLIT = 10000
    indices = np.arange(vectors.shape[0])
    np.random.shuffle(indices)
    vectors = vectors[indices]
    labels = labels[indices]
    # Hold out the last 10,000 shuffled pairs as the validation set.
    train_vectors = vectors[:-VALIDATION_SPLIT]
    train_labels = labels[:-VALIDATION_SPLIT]
    test_vectors = vectors[-VALIDATION_SPLIT:]
    test_labels = labels[-VALIDATION_SPLIT:]
    randomforest = RandomForestClassifier()
    print '***********************training************************'
    randomforest.fit(train_vectors, train_labels)
    print '***********************predict*************************'
    prediction = randomforest.predict(test_vectors)
    accuracy = metrics.accuracy_score(test_labels, prediction)
    print accuracy
    end = datetime.datetime.now()
    print end - start

if __name__ == '__main__':
    main()
The final results are as follows:
LR: 68.56%
SVM: 69.77%
GBDT: 71.4%
RandomForest: 78.36% (best of several runs)
In terms of accuracy, random forest performs best; in terms of running time, SVM takes the longest.
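A side note on the SVM running time: with several hundred thousand training pairs, the kernelized SVC scales poorly. If speed is a concern, a linear SVM is a common substitute (only a suggestion, not something run for the numbers above); with the same sklearn version it would look like:

# Possible faster alternative to SVC on a few hundred thousand samples (not used above).
from sklearn.svm import LinearSVC

svm = LinearSVC()
svm.fit(train_vectors, train_labels)
prediction = svm.predict(test_vectors)
print metrics.accuracy_score(test_labels, prediction)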
Future work:
There is still plenty of room to go further on feature selection and on tuning the classifiers' parameters. I believe that mining more useful features and adjusting the model hyperparameters would yield even better results.
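As one illustration of both directions (illustrative only, not what was run for the results above): extra pair features such as the absolute difference and the element-wise product of the two sentence vectors can be stacked next to the raw concatenation, and the random forest can be tuned with a small grid search. In sklearn 0.17.x, GridSearchCV lives in sklearn.grid_search:

# Extra pair features: |v1 - v2| and v1 * v2, stacked next to the concatenation.
extra_features = np.hstack((np.abs(vectors1 - vectors2), vectors1 * vectors2))
vectors = np.hstack((vectors1, vectors2, extra_features))

# Small, illustrative hyperparameter grid for the random forest.
from sklearn.grid_search import GridSearchCV

param_grid = {'n_estimators': [100, 300, 500], 'max_depth': [None, 10, 20]}
search = GridSearchCV(RandomForestClassifier(), param_grid, scoring='accuracy', cv=3)
search.fit(train_vectors, train_labels)
print search.best_params_, search.best_score_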
The full code is available on my GitHub; see this link.