3.10 Summary
• In this book we view a text as a list of words. A “raw text” is a potentially long string containing words and whitespace formatting, and is how we typically store and visualize a text.
• A string is specified in Python using single or double quotes: 'Monty Python', "Monty Python".
• The characters of a string are accessed using indexes, counting from zero: 'Monty Python'[0] gives the value M. The length of a string is found using len().
• Substrings are accessed using slice notation: 'Monty Python'[1:5] gives the value onty. If the start index is omitted, the substring begins at the start of the string; if the end index is omitted, the slice continues to the end of the string.
• Strings can be split into lists: 'Monty Python'.split() gives ['Monty', 'Python']. Lists can be joined into strings: '/'.join(['Monty', 'Python']) gives 'Monty/Python'.
• We can read text from a file f using text = open(f).read(). We can read text from a URL u using text = urlopen(u).read(). We can iterate over the lines of a text file using for line in open(f).
• Texts found on the Web may contain unwanted material (such as headers, footers, and markup) that needs to be removed before we do any linguistic processing.
• Tokenization is the segmentation of a text into basic units, or tokens, such as words and punctuation. Tokenization based on whitespace is inadequate for many applications because it bundles punctuation together with words. NLTK provides an off-the-shelf tokenizer, nltk.word_tokenize().
• Lemmatization is a process that maps the various forms of a word (such as appeared, appears) to the canonical or citation form of the word, also known as the lexeme or lemma (e.g., appear).
• Regular expressions are a powerful and flexible method of specifying patterns. Once we have imported the re module, we can use re.findall() to find all substrings in a string that match a pattern.
• If a regular expression string includes a backslash, you should tell Python not to preprocess the string, by using a raw string with an r prefix: r'regexp'.
• When backslash is used before certain characters, e.g., \n, this takes on a special meaning (newline character); however, when backslash is used before regular expression wildcards and operators, e.g., \., \|, \$, these characters lose their special meaning and are matched literally.
• A string formatting expression, template % arg_tuple, consists of a format string template containing conversion specifiers such as %-6s and %0.2d, followed by the % operator and a tuple of values to substitute for them.
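The string, regular expression, and formatting points above can be tried out together in a short session. What follows is only a minimal sketch: the sample sentence and the variable names are made up for illustration, and the regular-expression tokenizer is a simple stand-in, not the behavior of nltk.word_tokenize(). File reading, URL reading, and lemmatization are left out so that the example runs without external files or NLTK.

```python
import re

s = 'Monty Python'

# Indexing counts from zero; len() gives the number of characters.
first_char = s[0]           # 'M'
length = len(s)             # 12

# Slices; an omitted bound defaults to the start or end of the string.
middle = s[1:5]             # 'onty'
head = s[:5]                # 'Monty'
tail = s[6:]                # 'Python'

# Splitting a string into a list and joining a list into a string.
words = s.split()           # ['Monty', 'Python']
joined = '/'.join(words)    # 'Monty/Python'

# Whitespace tokenization leaves punctuation attached to words.
raw = "Hello, world. It costs $5.50!"   # hypothetical sample text
ws_tokens = raw.split()

# Illustrative regex tokenizer: runs of word characters, or any single
# character that is neither a word character nor whitespace.
re_tokens = re.findall(r'\w+|[^\w\s]', raw)

# In a raw string, \$ and \. match a literal dollar sign and period.
prices = re.findall(r'\$\d+\.\d+', raw)

# Formatting expression: %-6s left-justifies in a field of width 6,
# %0.2d prints an integer with at least two digits.
row = '%-6s%0.2d' % ('dog', 5)

print(ws_tokens)
print(re_tokens)
print(prices)
print(repr(row))
```

Comparing ws_tokens with re_tokens shows why whitespace splitting is often inadequate: the first keeps 'Hello,' as one token, while the second separates the comma. The regex tokenizer also splits '$5.50' into several pieces, which is one reason a purpose-built tokenizer is usually preferable.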