3.7 Regular Expressions for Tokenizing Text
Tokenization is the task of cutting a string into identifiable linguistic units that constitute a piece of language data. Although it is a fundamental task, we have been able to delay it until now because many corpora are already tokenized, and because NLTK includes some tokenizers. Now that you are familiar with regular expressions, you can learn how to use them to tokenize text, and to have much more control over the process.
Simple Approaches to Tokenization
The very simplest method for tokenizing text is to split on whitespace. Consider the following text from Alice’s Adventures in Wonderland:
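A sketch of the setup; the excerpt below is reproduced from the book's example as closely as possible and may differ from the original in minor details such as line breaks:

>>> raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone
... though), 'I won't have any pepper in my kitchen AT ALL. Soup does very
... well without--Maybe it's always pepper that makes people hot-tempered,'..."""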
We could split this raw text on whitespace using raw.split(). To do the same using a regular expression, it is not enough to match any single space character in the string, since this results in tokens that contain a \n newline character; instead, we need to match any number of spaces, tabs, or newlines, as sketched below:
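A minimal sketch of both approaches, assuming the raw string defined above; the slice is chosen to straddle the first line break, where the single-space pattern leaves the newline inside a token:

>>> import re
>>> re.split(r' ', raw)[11:15]          # single spaces only: the newline survives inside a token
['very', 'hopeful', 'tone\nthough),', "'I"]
>>> re.split(r'[ \t\n]+', raw)[11:15]   # runs of spaces, tabs, or newlines
['very', 'hopeful', 'tone', 'though),']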
The regular expression «[ \t\n]+» matches one or more spaces, tabs (\t), or newlines (\n). Other whitespace characters, such as carriage return and form feed, should really be included too. Instead, we will use a built-in re abbreviation, \s, which matches any whitespace character. The second statement in the preceding example can be rewritten as re.split(r'\s+', raw).
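As a quick check, with the same raw string, the \s+ version produces the same tokens as the character-class version:

>>> re.split(r'\s+', raw)[11:15]
['very', 'hopeful', 'tone', 'though),']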
Important: Remember to prefix regular expressions with the letter r (meaning “raw”), which instructs the Python interpreter to treat the string literally, rather than processing any backslashed characters it contains.
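A small illustration of the difference: in an ordinary string, \b is a single backspace character, whereas in a raw string it stays as the two characters backslash and b that the re module expects:

>>> len('\band\b'), len(r'\band\b')
(5, 7)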
Splitting on whitespace gives us tokens like '(not' and 'herself,'. An alternative is to use the fact that Python provides us with a character class \w for word characters, equivalent to [a-zA-Z0-9_]. It also defines the complement of this class, \W, i.e., all characters other than letters, digits, or underscore. We can use \W in a simple regular expression to split the input on anything other than a word character:
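A sketch with the same raw string; the first and last few items show the empty strings discussed next:

>>> re.split(r'\W+', raw)[:6]
['', 'When', 'I', 'M', 'a', 'Duchess']
>>> re.split(r'\W+', raw)[-3:]
['hot', 'tempered', '']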
Observe that this gives us empty strings at the start and the end (to understand why, try doing 'xx'.split('x')). With re.findall(r'\w+', raw), we get the same tokens, but without the empty strings, using a pattern that matches the words instead of the spaces. Now that we’re matching the words, we’re in a position to extend the regular expression to cover a wider range of cases. The regular expression «\w+|\S\w*» will first try to match any sequence of word characters. If no match is found, it will try to match any non-whitespace character (\S is the complement of \s) followed by further word characters. This means that punctuation is grouped with any following letters (e.g., ’s) but that sequences of two or more punctuation characters are separated.
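A sketch of both findall-based patterns on the same raw string; notice how the second pattern keeps punctuation attached to any letters that follow it (e.g., 'M):

>>> re.findall(r'\w+', raw)[:6]
['When', 'I', 'M', 'a', 'Duchess', 'she']
>>> re.findall(r'\w+|\S\w*', raw)[:9]
["'When", 'I', "'M", 'a', 'Duchess', ',', "'", 'she', 'said']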
Let’s generalize the \w+ in the preceding expression to permit word-internal hyphens and apostrophes (an apostrophe marks the omission of one or more letters, as when it is becomes it's, over becomes o'er, cannot becomes can't, or 1999 becomes '99): «\w+([-']\w+)*». This expression means \w+ followed by zero or more instances of [-']\w+; it would match hot-tempered and it’s. (In practice we write the group as (?:[-']\w+), making it non-capturing, for the reasons discussed earlier: re.findall() would otherwise return just the contents of the group rather than the whole match.) We’ll also add a pattern to match quote characters so these are kept separate from the text they enclose.
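A sketch of the resulting tokenizer on the raw string above; the pattern is a reconstruction of the expression the next sentence refers to:

>>> tokens = re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw)
>>> tokens[:8]
["'", 'When', "I'M", 'a', 'Duchess', ',', "'", 'she']
>>> tokens[-6:]
['makes', 'people', 'hot-tempered', ',', "'", '...']
>>> '--' in tokens and '(' in tokens
True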
The expression in this example also included «[-.(]+», which causes the double hyphen, ellipsis, and open parenthesis to be tokenized separately.
Table 3-4 lists the regular expression character class symbols we have seen in this section, in addition to some other useful symbols.
Table 3-4. Regular expression symbols
Symbol Function
\b Word boundary (zero width)
\d Any decimal digit (equivalent to [0-9])
\D Any non-digit character (equivalent to [^0-9])
\s Any whitespace character (equivalent to [ \t\n\r\f\v])
\S Any non-whitespace character (equivalent to [^ \t\n\r\f\v])
\w Any alphanumeric character (equivalent to [a-zA-Z0-9_])
\W Any non-alphanumeric character (equivalent to [^a-zA-Z0-9_])
\t The tab character
\n The newline character
NLTK’s Regular Expression Tokenizer
The function nltk.regexp_tokenize() is similar to re.findall() (as we’ve been using it for tokenization). However, nltk.regexp_tokenize() is more efficient for this task, and avoids the need for special treatment of parentheses. For readability we break up the regular expression over several lines and add a comment about each line. The special (?x) “verbose flag” tells Python to strip out the embedded whitespace and comments.
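A sketch in the spirit of the book's example; the pattern below uses non-capturing groups (?:...) to avoid the re.findall() behavior of returning group contents instead of whole matches, which can affect regexp_tokenize() in some NLTK versions, and the exact output may vary slightly across versions:

>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)        # set flag to allow verbose regexps
...     (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
...   | \w+(?:-\w+)*          # words with optional internal hyphens
...   | \$?\d+(?:\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.                # ellipsis
...   | [][.,;"'?():-_`]      # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']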
When using the verbose flag, you can no longer use ' ' to match a space character; use \s instead. The regexp_tokenize() function has an optional gaps parameter. When set to True, the regular expression specifies the gaps between tokens, as with re.split().
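A minimal illustration of gaps=True, reusing the text defined above:

>>> nltk.regexp_tokenize(text, r'\s+', gaps=True)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40...']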
We can evaluate a tokenizer by comparing the resulting tokens with a wordlist, and then reporting any tokens that don’t appear in the wordlist, using set(tokens).difference(wordlist). You’ll probably want to lowercase all the tokens first. This is a simple but effective way to sanity-check a tokenizer.
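A sketch of such a check; the choice of nltk.corpus.words as the wordlist, and the text and pattern from the previous example, are placeholders for whatever data and tokenizer you are evaluating:

>>> wordlist = set(w.lower() for w in nltk.corpus.words.words())
>>> tokens = nltk.regexp_tokenize(text, pattern)
>>> unknown = set(t.lower() for t in tokens).difference(wordlist)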
Further Issues with Tokenization
Tokenization turns out to be a far more difficult task than you might have expected. No single solution works well across the board, and we must decide what counts as a token depending on the application domain.
When developing a tokenizer it helps to have access to raw text which has been manually tokenized, in order to compare the output of your tokenizer with high-quality (or “gold-standard”) tokens. The NLTK corpus collection includes a sample of Penn Treebank data, including the raw Wall Street Journal text (nltk.corpus.treebank_raw.raw()) and the tokenized version (nltk.corpus.treebank.words()).
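A sketch of how the comparison might be set up, assuming the pattern defined earlier stands in for your own tokenizer:

>>> gold = set(nltk.corpus.treebank.words())
>>> wsj_raw = nltk.corpus.treebank_raw.raw()
>>> mine = nltk.regexp_tokenize(wsj_raw, pattern)
>>> unexpected = set(mine).difference(gold)   # tokens the gold standard never produces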
A final issue for tokenization is the presence of contractions, such as didn’t. If we are analyzing the meaning of a sentence, it would probably be more useful to normalize this form to two separate forms: did and n’t (or not). We can do this work with the help of a lookup table, as sketched below.
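A minimal sketch of such a table; the entries and the helper function are illustrative, not taken from the book:

>>> CONTRACTIONS = {"didn't": ['did', "n't"], "won't": ['wo', "n't"], "can't": ['ca', "n't"]}
>>> def expand(tokens):
...     """Replace each known contraction with its normalized parts."""
...     return [part for t in tokens for part in CONTRACTIONS.get(t, [t])]
...
>>> expand(['I', "didn't", 'do', 'it'])
['I', 'did', "n't", 'do', 'it']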