3.11 Further Reading
Extra materials for this chapter are posted at http://www.nltk.org/, including links to freely available resources on the Web. Remember to consult the Python reference materials at http://docs.python.org/. (For example, this documentation covers “universal newline support,” explaining how to work with the different newline conventions used by various operating systems.)
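As a small illustration of universal newline support, the following sketch assumes Python 3, where the built-in open() function takes a newline parameter. The temporary file and the sample text are made up for this example. With newline=None (the default), the three common line-ending conventions are all translated to "\n" on reading; with newline="", the original endings are left untouched.

```python
import os
import tempfile

# Raw bytes that mix Unix (\n), Windows (\r\n), and old Mac (\r) line endings.
raw = b"unix line\nwindows line\r\nold mac line\rlast line"

# Write the bytes to a temporary file (a hypothetical stand-in for real data).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # newline=None: "\r\n" and "\r" are translated to "\n" while reading.
    with open(path, "r", encoding="ascii", newline=None) as f:
        translated = f.read()

    # newline="": no translation, so the original endings survive.
    with open(path, "r", encoding="ascii", newline="") as f:
        untranslated = f.read()
finally:
    os.remove(path)

# After translation, splitting on "\n" is enough to recover every line.
lines = translated.split("\n")
print(lines)
print(repr(untranslated))
```

Reading in text mode with the default setting is usually what you want before tokenizing, since later processing can then assume a single newline character.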
For more examples of processing words with NLTK, see the tokenization, stemming, and corpus HOWTOs at http://www.nltk.org/howto. Chapters 2 and 3 of (Jurafsky & Martin, 2008) contain more advanced material on regular expressions and morphology.
For more extensive discussion of text processing with Python, see (Mertz, 2003). For information about normalizing non-standard words, see (Sproat et al., 2001).
There are many references for regular expressions, both practical and theoretical. For an introductory tutorial on using regular expressions in Python, see Kuchling’s Regular Expression HOWTO, http://www.amk.ca/python/howto/regex/. For a comprehensive and detailed manual on using regular expressions, covering their syntax in most major programming languages, including Python, see (Friedl, 2002). Other presentations include Section 2.1 of (Jurafsky & Martin, 2008), and Chapter 3 of (Mertz, 2003).
There are many online resources for Unicode. Useful discussions of Python’s facilities for handling Unicode are:
• PEP-100 http://www.python.org/dev/peps/pep-0100/
• Jason Orendorff, Unicode for Programmers, http://www.jorendorff.com/articles/unicode/
• A. M. Kuchling, Unicode HOWTO, http://www.amk.ca/python/howto/unicode
• Fredrik Lundh, Python Unicode Objects, http://effbot.org/zone/unicode-objects.htm
• Joel Spolsky, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), http://www.joelonsoftware.com/articles/Unicode.html
The problem of tokenizing Chinese text is a major focus of SIGHAN, the ACL Special Interest Group on Chinese Language Processing (http://sighan.org/). Our method for segmenting English text follows (Brent & Cartwright, 1995); this work falls in the area of language acquisition (Niyogi, 2006).
Collocations are a special case of multiword expressions. A multiword expression is a small phrase whose meaning and other properties (e.g., its part of speech) cannot be predicted from its words alone (Baldwin & Kim, 2010).
Simulated annealing is a heuristic for finding a good approximation to the optimum value of a function in a large, discrete search space, based on an analogy with annealing in metallurgy. The technique is described in many Artificial Intelligence texts.
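The general shape of the technique can be sketched as follows. This is a minimal, generic version under stated assumptions, not the implementation used elsewhere in this chapter: the function name simulated_annealing, its parameters, the geometric cooling schedule, and the toy objective bumpy_cost are all choices made for illustration. The search accepts every improving move, and accepts a worsening move with a probability that shrinks as the "temperature" falls, which lets it escape local minima early on.

```python
import math
import random

def simulated_annealing(cost, neighbor, start, temperature=10.0,
                        cooling=0.95, steps_per_temp=50,
                        min_temperature=1e-3, seed=0):
    """Return (best_state, best_cost) found while minimizing cost."""
    rng = random.Random(seed)          # seeded for repeatable runs
    current, current_cost = start, cost(start)
    best, best_cost = current, current_cost
    while temperature > min_temperature:
        for _ in range(steps_per_temp):
            candidate = neighbor(current, rng)
            candidate_cost = cost(candidate)
            delta = candidate_cost - current_cost
            # Always accept improvements; accept a worse candidate with
            # probability exp(-delta / temperature).
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        temperature *= cooling         # geometric cooling (an assumption)
    return best, best_cost

# Toy objective over the integers 0..100 with many local minima.
def bumpy_cost(x):
    return (x - 60) ** 2 / 100.0 + 5.0 * math.cos(x)

# Neighbors differ by one step, clamped to the search space.
def step_neighbor(x, rng):
    return min(100, max(0, x + rng.choice((-1, 1))))

best, best_cost = simulated_annealing(bumpy_cost, step_neighbor, start=0)
print(best, best_cost)
```

The result is a good approximation rather than a guaranteed optimum; the initial temperature and cooling rate trade running time against the quality of the answer.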
The approach to discovering hyponyms in text using search patterns like “x and other ys” is described in (Hearst, 1992).
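A rough sketch of such a search pattern, using only Python's re module on a made-up sentence, is shown below. The function name find_hyponym_pairs and the sample text are illustrative assumptions. The pattern matches single words only, so it will miss multiword terms and will return false positives on real text.

```python
import re

# "x and other ys": x is a candidate hyponym, ys a candidate hypernym.
PATTERN = re.compile(r"\b(\w+) and other (\w+)\b")

def find_hyponym_pairs(text):
    """Return (hyponym, hypernym) candidates matched by the pattern."""
    return PATTERN.findall(text)

text = "We saw sparrows and other birds near oaks and other trees."
pairs = find_hyponym_pairs(text)
print(pairs)  # [('sparrows', 'birds'), ('oaks', 'trees')]
```

Each pair is only a candidate for the relation "x is a kind of y" and would need to be filtered or checked by hand before being used.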