[Reading ACL Together] Facebook's Open-Domain Question Answering System (Reading Wikipedia to Answer Open-Domain Questions)

Reading Wikipedia to Answer Open-Domain Questions


In order to answer any question, one must first retrieve the few relevant articles among more than 5 million items, and then scan them carefully to identify the answer.


  • Document Retriever, a module using bigram hashing and TF-IDF matching designed to, given a question, efficiently return a subset of relevant articles
  • Document Reader, a multi-layer recurrent neural network machine comprehension model trained to detect answer spans in those few returned documents.

1. Document Retriever

We further improve our system by taking local word order into account with n-gram features. Our best performing system uses bigram counts while preserving speed and memory efficiency by using the hashing of (Weinberger et al., 2009) to map the bigrams to 2^24 bins with an unsigned murmur3 hash.

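The bigram hashing mentioned here is presumably a form of feature hashing: each feature (here, a bigram) is represented by its hash value, so no explicit vocabulary of bigrams has to be stored.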
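To make the hashing step concrete, here is a minimal sketch. It assumes whitespace tokenization and uses `zlib.crc32` from the standard library as a stand-in hash; the paper uses an unsigned murmur3 hash, which is not in the standard library. The helper names (`bigrams`, `hash_bin`, `hashed_bigram_counts`) are made up for this illustration.

```python
import zlib

NUM_BINS = 2 ** 24  # number of hash bins quoted from the paper


def bigrams(tokens):
    """Return adjacent token pairs, each joined by a single space."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def hash_bin(ngram, num_bins=NUM_BINS):
    """Map an n-gram string to a bin id in [0, num_bins).

    zlib.crc32 is only a stand-in here; the paper uses unsigned murmur3.
    """
    return zlib.crc32(ngram.encode("utf-8")) % num_bins


def hashed_bigram_counts(text):
    """Count hashed bigrams of a lowercased, whitespace-tokenized text."""
    counts = {}
    for gram in bigrams(text.lower().split()):
        bin_id = hash_bin(gram)
        counts[bin_id] = counts.get(bin_id, 0) + 1
    return counts
```

The resulting sparse count vector has a fixed dimensionality no matter how many distinct bigrams occur in the corpus; the price is that two different bigrams can occasionally collide in the same bin.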



2. Document Reader


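For a question consisting of l tokens -> $\{q_1, \dots, q_l\}$

Articles are handled one paragraph at a time. Suppose a paragraph consists of m tokens -> $\{p_1, \dots, p_m\}$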

we develop an RNN model that we apply to each paragraph in turn and then finally aggregate the predicted answers.

2.1 Paragraph encoding

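The paragraph representation is obtained by feeding the feature vector of each word into an RNN (a bidirectional LSTM).


The feature vector of a word consists of the following parts: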

  • Word embedding

  • Exact match

    This simply checks whether the paragraph word exactly matches a word in the question, in its original, lowercase or lemma form.

  • Token features


  • Aligned question embedding

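    An attention mechanism is used here as well: it computes an attention score $a_{i,j}$ between each paragraph word $p_i$ and each question word $q_j$, and uses those scores to weight the question word embeddings: $f_{align}(p_i) = \sum_j a_{i,j} E(q_j)$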
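The aligned question embedding in the last item can be sketched in a few lines of plain Python. This is a simplification under stated assumptions: the attention scores are a softmax over raw dot products of the embeddings, whereas the model in the paper first passes each embedding through a learned dense layer with ReLU. The function names are made up for this illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def aligned_question_embedding(p_emb, q_embs):
    """Return (f_align, weights) for one paragraph word.

    p_emb  : embedding E(p_i) of the paragraph word
    q_embs : list of embeddings E(q_j), one per question word
    """
    # a_{i,j}: how strongly paragraph word i attends to question word j
    # (assumption: raw dot products, no learned projection)
    weights = softmax([dot(p_emb, q) for q in q_embs])
    dim = len(q_embs[0])
    # f_align(p_i) = sum_j a_{i,j} E(q_j)
    f_align = [sum(w * q[d] for w, q in zip(weights, q_embs)) for d in range(dim)]
    return f_align, weights
```

Compared with the exact match feature, this gives a soft alignment: a paragraph word can pick up signal from question words that are similar but not identical to it.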


2.2 Question encoding

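An RNN is applied on top of the word embeddings of the question, and the hidden states of all time steps are then combined into a single vector: $q = \sum_j b_j q_j$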

$b_j$ encodes the importance of each question word and has to be learned.
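As a rough sketch of this weighted sum, assume the importance weights come from a softmax over the dot product between a weight vector `w` and each hidden state. Here `w` is just passed in as an argument; in a real model it would be learned. The function name is made up for this illustration.

```python
import math


def question_vector(hidden_states, w):
    """Combine per-token hidden states q_j into one vector q = sum_j b_j q_j.

    hidden_states : list of vectors q_j (e.g. RNN outputs), one per question token
    w             : weight vector (assumption: supplied here, learned in practice)
    """
    # Unnormalized importance score w . q_j for every question token
    scores = [sum(a * b for a, b in zip(w, q)) for q in hidden_states]
    # Softmax turns the scores into weights b_j that sum to 1
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    b = [e / total for e in exps]
    dim = len(hidden_states[0])
    q = [sum(bj * qj[d] for bj, qj in zip(b, hidden_states)) for d in range(dim)]
    return q, b
```

With `w` set to all zeros every token gets the same weight and the result is a plain average of the hidden states; training moves `w` so that more informative question words count for more.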

2.3 Prediction


Concretely, we use a bilinear term to capture the similarity between $p_i$ and $q$ and compute the probabilities of each token being start and end as:

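$P_{start}(i) \propto \exp(p_i W_s q)$
$P_{end}(i) \propto \exp(p_i W_e q)$

A maximum span length can be imposed; the span from token $i$ to token $i'$ that maximizes $P_{start}(i) \times P_{end}(i')$ is then taken as the answer.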
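The span selection step can be sketched as a brute-force search over start and end positions. It assumes the per-token start and end probabilities have already been computed; the function name `best_span` and the default `max_len=15` are choices made for this illustration.

```python
def best_span(p_start, p_end, max_len=15):
    """Return (start, end, score) maximizing p_start[i] * p_end[j]
    subject to i <= j <= i + max_len.

    p_start, p_end : per-token probabilities of being the answer start / end
    max_len        : maximum allowed distance between start and end token
    """
    best = (0, 0, -1.0)
    n = len(p_start)
    for i in range(n):
        # The end token may not precede the start or exceed the length limit
        last = min(i + max_len, n - 1)
        for j in range(i, last + 1):
            score = p_start[i] * p_end[j]
            if score > best[2]:
                best = (i, j, score)
    return best
```

For example, `best_span([0.1, 0.7, 0.2], [0.2, 0.1, 0.7])` picks the span from token 1 to token 2. The search costs O(m × max_len) for a paragraph of m tokens, which is cheap because the length limit is small.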
