Building decision trees to identify the intent

Data: taken from query logs and classified manually.
Feature Analysis
1. Number of terms in the query (nterms).
2. Number of clicks in query sessions (nclicks): a significant number of navigational queries concentrate only a few clicks per session, so a low click count points to the navigational class.
3. Levenshtein distance: a distance function calculated between the terms that compose the query and the snippets (a snippet is made up of the excerpt presented with the query result, the title, and the URL of the selected document). This is simply the edit distance, computed between the query and the returned snippet.
4. Number of sessions with fewer than n clicks over the total number of sessions associated with a query (nCS): among all sessions for a query, the share with fewer than n clicks.
5. Number of clicks before the n-th position of the query ranking (nRS): the proportion of the query's sessions in which the user clicked one of the first n results.
6. PageRank: statistics of the PageRank of the documents in each category.
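
As a rough illustration of features 3 to 5, here is a minimal Python sketch. It assumes each session is stored as a list of the 1-based ranking positions that were clicked; that representation and the function names `levenshtein`, `ncs`, and `nrs` are hypothetical, not taken from the source. For nRS the sketch counts a session when at least one of its clicks lands before position n, which is only one possible reading of the definition.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance: insertions, deletions and substitutions each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def ncs(sessions: List[List[int]], n: int) -> float:
    """Share of a query's sessions that have fewer than n clicks."""
    if not sessions:
        return 0.0
    return sum(1 for s in sessions if len(s) < n) / len(sessions)


def nrs(sessions: List[List[int]], n: int) -> float:
    """Share of sessions with at least one click before position n
    (assumed reading of the nRS definition)."""
    if not sessions:
        return 0.0
    hits = sum(1 for s in sessions if any(pos < n for pos in s))
    return hits / len(sessions)


# Made-up sessions for one query: clicked ranking positions per session.
example_sessions = [[1], [1, 2], [3, 5, 8], []]
print(levenshtein("facebook", "facebook.com"))  # query vs. URL part of a snippet
print(ncs(example_sessions, 2))
print(nrs(example_sessions, 3))
```

A query whose text almost matches the snippet's title or URL gets a small distance, which is the behavior the notes associate with navigational queries.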

Conclusions:
1. Navigational queries generally have fewer terms than informational queries. The behavior of this characteristic is not as clear for the transactional class.
2. Some informational queries register more than 9 different sites/pages selected in their sessions. This usually does not occur for navigational or transactional queries.
3. The Levenshtein distance calculated between query terms and snippets is smaller for navigational queries than for the other categories.
4. A good number of informational queries register clicks on pages/sites with low PageRank, as opposed to transactional or navigational queries.
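
To show how a decision tree could turn an observation like conclusion 1 into a rule, here is a minimal sketch of choosing a single split threshold by Gini impurity (the criterion used by CART-style trees). The notes do not say which tree algorithm was used, so this is an assumption; the toy `nterms` values and labels are invented for illustration and are not data from the source.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def gini(labels: Sequence[str]) -> float:
    """Gini impurity of a collection of class labels (0.0 means pure)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def best_split(values: Sequence[float],
               labels: Sequence[str]) -> Tuple[Optional[float], float]:
    """Find the threshold t for the test 'value <= t' that minimizes the
    size-weighted Gini impurity of the two resulting groups.
    Returns (None, impurity of the unsplit data) if no split helps."""
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best_t: Optional[float] = None
    best_score = gini(labels)
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Invented toy data: number of query terms and a manually assigned class.
nterms = [1, 1, 2, 3, 4, 5]
intent = ["nav", "nav", "nav", "info", "info", "info"]
threshold, impurity = best_split(nterms, intent)
print(threshold, impurity)
```

A full tree would apply the same search recursively to each resulting group and across all six features, rather than to nterms alone.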
