query correction

https://web.stanford.edu/class/cs276/pa/pa2.pdf
Corpus:
lm corpus: 99,904 documents
query corpus: 819,722, edit distance at most 1
Levenshtein automaton
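
Since the query corpus only contains errors within edit distance 1, candidate corrections can be enumerated directly. The following is a minimal sketch, not the assignment's reference solution: the names `edits1` and `candidates`, the lowercase ASCII alphabet, and the toy vocabulary are all assumptions made for illustration. It uses plain Levenshtein edits (insert, delete, substitute), so a transposition counts as two edits here.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """Return every string at Levenshtein distance exactly 1 from `word`.

    `alphabet` is an assumption; a real query corpus would also need
    digits, spaces, and punctuation.
    """
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # Drop one character.
    deletes = {left + right[1:] for left, right in splits if right}
    # Replace one character with a different one.
    substitutes = {
        left + c + right[1:]
        for left, right in splits if right
        for c in alphabet if c != right[0]
    }
    # Insert one character at any position.
    inserts = {left + c + right for left, right in splits for c in alphabet}
    return deletes | substitutes | inserts


def candidates(query_word, vocabulary):
    """Vocabulary words within edit distance 1 of `query_word` (itself included)."""
    return (edits1(query_word) | {query_word}) & vocabulary


# Toy vocabulary, purely illustrative.
vocab = {"stanford", "standard", "stamford"}
print(candidates("stanfrd", vocab))
```

The candidate set would then be ranked by the language model and the error model; this sketch only covers generation.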

A fairly clear set of slides:
http://web.stanford.edu/class/cs276/handouts/spell_correction.pdf
Current approach:

[figure: image.png]

Possible improvements:
[figure: image.png]

Extra-credit items:

  1. Handle edit distances greater than 1
  2. Try corpora other than the one from the Stanford site
  3. When training the language model, consider other smoothing methods, e.g. Kneser-Ney smoothing
  4. K-gram index
  5. Levenshtein automata: use a finite-state automaton for fuzzy matching of words
    git: https://gist.github.com/Arachnid/491973
    blog: http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata
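
To make the Levenshtein automaton idea concrete, here is a rough sketch that handles any maximum edit distance, not just 1. It is not the code from the gist linked above: instead of compiling a DFA, it simulates the automaton by carrying one row of the edit-distance table as the state. The class name `LevenshteinAutomaton`, its methods, and the helper `fuzzy_match` are hypothetical names chosen for this example.

```python
class LevenshteinAutomaton:
    """Simulated automaton accepting strings within `max_edits` of `word`.

    The state is one row of the edit-distance table: state[j] is the
    distance between the input consumed so far and word[:j].
    """

    def __init__(self, word, max_edits):
        self.word = word
        self.max_edits = max_edits

    def start(self):
        # No input consumed yet: reaching word[:j] costs j insertions.
        return list(range(len(self.word) + 1))

    def step(self, state, ch):
        new_state = [state[0] + 1]
        for j, w in enumerate(self.word):
            cost = 0 if w == ch else 1
            new_state.append(min(
                new_state[j] + 1,   # input is missing the character w
                state[j] + cost,    # match or substitution
                state[j + 1] + 1,   # ch is an extra character in the input
            ))
        return new_state

    def is_match(self, state):
        return state[-1] <= self.max_edits

    def can_match(self, state):
        # If every entry exceeds the budget, no continuation can succeed.
        return min(state) <= self.max_edits


def fuzzy_match(automaton, candidate):
    """Feed `candidate` through the automaton, stopping early on dead states."""
    state = automaton.start()
    for ch in candidate:
        state = automaton.step(state, ch)
        if not automaton.can_match(state):
            return False
    return automaton.is_match(state)


automaton = LevenshteinAutomaton("food", 1)
print([w for w in ["fod", "foods", "good", "fd", "food"] if fuzzy_match(automaton, w)])
```

The early exit in `can_match` is what makes this useful against a trie or sorted dictionary: a whole subtree can be skipped as soon as a prefix is rejected. Scanning a flat word list, as in the usage line, is only for demonstration.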

https://github.com/aitounejjar/pa2-Spell-Corrector
https://github.com/pangolulu/spelling-corrector
Moore's paper: pronunciation + spelling model

  1. Toutanova K, Moore R C. Pronunciation modeling for improved spelling correction[C]//Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2002: 144-151.

Moore's earlier paper: spelling model

  1. Brill E, Moore R C. An improved error model for noisy channel spelling correction[C]//Proceedings of the 38th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2000: 286-293.

Papers that cite Moore's work

  1. Martins B, Silva M J. Spelling correction for search engine queries[M]//Advances in Natural Language Processing. Springer, Berlin, Heidelberg, 2004: 372-383.
  2. Sun X, Gao J, Micol D, et al. Learning phrase-based spelling error models from clickthrough data[C]//Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2010: 266-274.
  3. Multi-level feature extraction for spelling correction

  1. Wilcox-O’Hearn A, Hirst G, Budanitsky A. Real-word spelling correction with trigrams: A reconsideration of the Mays, Damerau, and Mercer model[C]//International conference on intelligent text processing and computational linguistics. Springer, Berlin, Heidelberg, 2008: 605-616.

Gao J, Li X, Micol D, et al. A large scale ranker-based system for search query spelling correction[C]//Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics, 2010: 358-366.
