Introduction to the MSLR Datasets



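Microsoft has released two large-scale learning-to-rank datasets:
MSLR-WEB30K, which contains 30,000 queries
MSLR-WEB10K, which is formed by randomly sampling 10,000 queries from MSLR-WEB30K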

 

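Description:
Queries and URLs are represented by IDs.
The datasets consist of feature vectors extracted from query-url (q-u) pairs, along with relevance judgment labels.
(1) The relevance judgments come from Microsoft Bing and use a 5-point scale, from 0 (irrelevant) to 4 (most relevant).

(2) The features were extracted by the dataset authors and are, for the most part, widely used in the research community.
Each row corresponds to one q-u pair: the first column is the relevance label, the second column is the query ID, and the remaining columns are features.
The larger the relevance label, the more relevant the query-url pair is.
Each q-u pair is represented by a 136-dimensional feature vector.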

Two sample rows from MSLR-WEB10K:

==============================================

0 qid:1 1:3 2:0 3:2 4:2 … 135:0 136:0

2 qid:1 1:3 2:3 3:0 4:0 … 135:0 136:0

==============================================
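The row layout shown above (relevance label, then `qid:<id>`, then `index:value` feature pairs) can be read with a few lines of Python. The following is only a minimal sketch: the function name `parse_line` and the constant `NUM_FEATURES` are made up for illustration, and the choice to fill any feature not listed on the row with 0.0 is an assumption, not something the dataset documentation states.

```python
from typing import List, Tuple

NUM_FEATURES = 136  # feature dimensionality given in the dataset description


def parse_line(line: str, num_features: int = NUM_FEATURES) -> Tuple[int, str, List[float]]:
    """Parse one 'label qid:<id> idx:val ...' row into (label, qid, features)."""
    tokens = line.split()
    label = int(tokens[0])  # column 1: relevance label, 0 to 4

    # Column 2 has the form 'qid:<query id>'.
    key, qid = tokens[1].split(":", 1)
    if key != "qid":
        raise ValueError(f"expected 'qid:<id>' in column 2, got {tokens[1]!r}")

    # Assumption: features missing from the row default to 0.0.
    features = [0.0] * num_features
    for token in tokens[2:]:
        index, value = token.split(":", 1)
        features[int(index) - 1] = float(value)  # indices in the file are 1-based
    return label, qid, features
```

For a quick check, the function can be called on a deliberately shortened, hypothetical row with only four features, for example `parse_line("2 qid:1 1:3 2:3 3:0 4:0", num_features=4)`. Real rows carry all 136 features.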

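Dataset partitioning:
Each dataset is split into five parts with roughly the same number of queries, denoted S1, ..., S5, for five-fold cross-validation.
In each fold, three parts are recommended for training, and the remaining two are used for validation and testing, respectively.
The original text reads: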
We have partitioned each dataset into five parts with about the same number of queries, denoted as S1, S2, S3, S4, and S5, for five-fold cross validation. In each fold, we propose using three parts for training, one part for validation, and the remaining part for test (see the following table). The training set is used to learn ranking models. The validation set is used to tune the hyper parameters of the learning algorithms, such as the number of iterations in RankBoost and the combination coefficient in the objective function of Ranking SVM. The test set is used to evaluate the performance of the learned ranking models.

 Folds   Training Set   Validation Set   Test Set
 Fold1   {S1,S2,S3}     S4               S5
 Fold2   {S2,S3,S4}     S5               S1
 Fold3   {S3,S4,S5}     S1               S2
 Fold4   {S4,S5,S1}     S2               S3
 Fold5   {S5,S1,S2}     S3               S4
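The fold table follows a simple rotation: each fold starts one part later than the previous one, takes three consecutive parts for training, the next for validation, and the last for testing. A minimal sketch of that rotation is below. The function name `build_folds` and the dictionary layout are illustrative choices and are not part of the official distribution.

```python
from typing import Dict, List

PARTS = ["S1", "S2", "S3", "S4", "S5"]


def build_folds(parts: List[str] = PARTS) -> Dict[str, Dict[str, object]]:
    """Reproduce the train/validation/test rotation from the fold table."""
    n = len(parts)
    folds: Dict[str, Dict[str, object]] = {}
    for i in range(n):
        folds[f"Fold{i + 1}"] = {
            # n - 2 consecutive parts (three when n == 5), wrapping around
            "train": [parts[(i + k) % n] for k in range(n - 2)],
            # the part right after the training parts
            "valid": parts[(i + n - 2) % n],
            # the one remaining part
            "test": parts[(i + n - 1) % n],
        }
    return folds


if __name__ == "__main__":
    for name, split in build_folds().items():
        print(name, split["train"], split["valid"], split["test"])
```

Each part serves as the test set exactly once across the five folds, which is what makes the per-fold results comparable when they are averaged.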

 
For the download links, the evaluation scripts, and the meaning of each field in the feature vector, see the original page:
https://www.microsoft.com/en-us/research/project/mslr/
