text normalization: converting text to a more convenient, standard form.
edit distance: measures how similar two strings are based on the number of edits (insertions, deletions, substitutions) it takes to change one string into the other.
regular expression (RE): an algebraic notation for characterizing a set of strings, a language for specifying text search strings.
Python code:
#find the first match
import re
key = "hello world\n"
p1 = "(?<=).+?(?=\n)"  # empty lookbehind, then a lazy match up to (but not including) the newline
pattern1 = re.compile(p1)
matcher1 = re.search(pattern1, key)
print(matcher1.group(0))  # hello world
#find all matches
import re
key = "Column 1 Column 2 Column 3 Columna"
p1 = "\\bColumn\\b"  # equivalently, the raw string r"\bColumn\b"
pattern1 = re.compile(p1)
print(pattern1.findall(key))  # ['Column', 'Column', 'Column'] -- 'Columna' has no boundary after 'Column'
Regular expressions are case sensitive.
- [] : disjunction ("or") over the characters inside the brackets
- /[wW]/ : w or W
- /[A-Z]/ : an upper case letter
- /[a-z]/ : a lower case letter

When a caret ^ is the first symbol within a [], it means negation:

- /[^A-Z]/ : not an upper case letter
- /[^Ss]/ : neither ‘S’ nor ‘s’
- /[^\.]/ : not a period
- /[e^]/ : either ‘e’ or ‘^’
- /a^b/ : the pattern ‘a^b’

? means “the preceding character or nothing”:

- /woodchucks?/ : woodchuck or woodchucks
- /colou?r/ : color or colour
**Kleene \*** (pronounced “cleeny star”): zero or more occurrences of the immediately previous character or regular expression

- /a*/ : any string of zero or more a’s
- /aa*/ : one or more a’s
- /[ab]*/ : zero or more a’s or b’s
Kleene +: one or more occurrences of the immediately preceding character or regular expression
- /./ : a wildcard expression that matches any single character (except a carriage return)
- /beg.n/ : begin, begun, …
- /aardvark.*aardvark/ : finds any line in which a particular word, for example aardvark, appears twice
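A quick Python check of these operators (the example strings here are ours, chosen only for illustration):

```python
import re

# ? : the preceding character is optional
print(re.findall(r"colou?r", "color colour"))        # ['color', 'colour']
# * : zero or more of the preceding character
print(re.search(r"aa*", "baaa!").group(0))           # 'aaa'
# + : one or more of the preceding character
print(re.findall(r"[0-9]+", "Column 1 Column 42"))   # ['1', '42']
# . : any single character
print(re.findall(r"beg.n", "begin begun began"))     # ['begin', 'begun', 'began']
```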
Anchors are special characters that anchor regular expressions to particular places in a string. The most common anchors are the caret ^ and the dollar sign $. The caret ^ matches the start of a line. The pattern /^The/ matches the word The only at the start of a line.
Thus, the caret ^ has three uses: to match the start of a line, to indicate a negation inside of square brackets, and just to mean a caret.
The dollar sign $ matches the end of a line. So the pattern / $/ (a space followed by the dollar sign) is a useful pattern for matching a space at the end of a line, and /^The dog\.$/ matches a line that contains only the phrase The dog.
\b matches a word boundary, and \B matches a non-boundary. Thus /\bthe\b/ matches the word the but not the word other. Note that in an ordinary Python string literal, \b is parsed as a backspace character before the regex engine ever sees it, so write the pattern as a raw string (r"\bthe\b") or escape the backslash ("\\bthe\\b"); the same applies to \B.
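A small check of boundaries and anchors in Python, using raw strings so the backslashes reach the regex engine intact (the sample sentence is ours):

```python
import re

text = "The other dog chased the cat."
print(re.findall(r"\bthe\b", text))    # ['the']  -- does not match 'other' or 'The'
print(re.findall("\\bthe\\b", text))   # ['the']  -- equivalent escaped form
print(re.findall(r"^The", text))       # ['The']  -- ^ anchors to the start
print(re.findall(r"\.$", text))        # ['.']    -- $ anchors to the end
```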
- The pipe | expresses disjunction: /cat|dog/ matches either the string cat or the string dog
- /gupp(y|ies)/ matches the string guppy or the string guppies
- /(Column [0-9]+ *)*/ matches the string Column 1 Column 2 Column 3
Operator precedence hierarchy, from highest to lowest:

Operator class | Examples |
---|---|
Parenthesis | () |
Counters | * + ? {} |
Sequences and anchors | the ^my end$ |
Disjunction | \| |
Thus, because counters have a higher precedence than sequences, /the*/
matches theeeee but not thethe. Because sequences have a higher precedence than disjunction, /the|any/
matches the or any but not theny.
We say that patterns are greedy, expanding to cover as much of a string as they can.
There are, however, ways to enforce non-greedy matching, using another meaning of the ? qualifier. The operator *? is a Kleene star that matches as little text as possible. The operator +? is a Kleene plus that matches as little text as possible.
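For instance (the HTML-like string below is just an illustration):

```python
import re

html = "<b>bold</b> and <i>italic</i>"
# greedy: .* expands as far as possible, swallowing everything up to the last '>'
print(re.findall(r"<.*>", html))    # ['<b>bold</b> and <i>italic</i>']
# non-greedy: .*? stops at the first possible '>'
print(re.findall(r"<.*?>", html))   # ['<b>', '</b>', '<i>', '</i>']
```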
The process we just went through was based on fixing two kinds of errors: false positives, strings that we incorrectly matched like other or there, and false negatives, strings that we incorrectly missed, like The. Addressing these two kinds of errors comes up again and again in implementing speech and language processing systems. Reducing the overall error rate for an application thus involves two antagonistic efforts: increasing precision (minimizing false positives) and increasing recall (minimizing false negatives).
RE | Expansion | Match | First Matches |
---|---|---|---|
\d | [0-9] | any digit | Party of 5 |
\D | [^0-9] | any non-digit | Blue moon |
\w | [a-zA-Z0-9_] | any alphanumeric/underscore | Daiyu |
\W | [^\w] | a non-alphanumeric | !!!! |
\s | [ \r\t\n\f] | whitespace (space, tab) | |
\S | [^\s] | Non-whitespace | in Concord |
RE | Match |
---|---|
* | zero or more occurrences of the previous char or expression |
+ | one or more occurrences of the previous char or expression |
? | exactly zero or one occurrence of the previous char or expression |
{n} | n occurrences of the previous char or expression |
{n,m} | from n to m occurrences of the previous char or expression |
{n,} | at least n occurrences of the previous char or expression |
{,m} | up to m occurrences of the previous char or expression |
substitution
- s/regexp1/pattern/ : replace a string characterized by the regular expression regexp1 with pattern

number operator
- s/([0-9]+)/<\1>/ : add angle brackets around integers; for example, change the 35 boxes to the <35> boxes
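In Python, re.sub plays the role of the s/regexp1/pattern/ operator, with \1 referring to the first capture group:

```python
import re

print(re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes"))   # the <35> boxes
```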
/the (.*)er they were, the \1er they will be/
will match the bigger they were, the bigger they will be but not the bigger they were, the faster they will be.
\1
will be replaced by whatever string matched the first item in parentheses.
This use of parentheses to store a pattern in memory is called a capture group. Every time a capture group is used (i.e., parentheses surround a pattern), the resulting match is stored in a numbered register. If you match two different sets of parentheses, \2
means whatever matched the second capture group. Thus
/the (.*)er they (.*), the \1er we \2/
will match the faster they ran, the faster we ran but not the faster they ran, the faster we ate.
Parentheses thus have a double function in regular expressions; they are used to group terms for specifying the order in which operators should apply, and they are used to capture something in a register. Occasionally we might want to use parentheses for grouping, but don’t want to capture the resulting pattern in a register. In that case we use a non-capturing group, which is specified by putting the commands ?:
after the open paren, in the form (?: pattern ).
/(?:some|a few) (people|cats) like some \1/
will match some cats like some cats but not some cats like some a few.
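Checking both behaviors in Python (the test sentences are the ones from the text):

```python
import re

# \1 must repeat exactly what the first (capturing) group matched
p = re.compile(r"the (.*)er they (.*), the \1er we \2")
print(bool(p.search("the faster they ran, the faster we ran")))   # True
print(bool(p.search("the faster they ran, the faster we ate")))   # False

# (?:...) groups without capturing, so \1 refers to (people|cats)
q = re.compile(r"(?:some|a few) (people|cats) like some \1")
print(bool(q.search("some cats like some cats")))                 # True
print(bool(q.search("some cats like some a few")))                # False
```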
The operator (?= pattern) is true if pattern occurs at the current position, but is zero-width, i.e. the match pointer doesn’t advance. The operator (?! pattern) only returns true if the pattern does not match, but again is zero-width and doesn’t advance the cursor.
/(?<=pattern)/ is the corresponding lookbehind: used as (?<=pattern)X, it matches X only when X is immediately preceded by pattern, again without consuming the preceding text.
/(?=pattern)/ can likewise be used as X(?=pattern) to match X only when X is immediately followed by pattern.
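A short demonstration (the patterns and sample strings here are our own illustrations):

```python
import re

# negative lookahead: a line-initial word that does not start with "Volcano"
print(re.search(r"^(?!Volcano)[A-Za-z]+", "Volcano eruptions are rare"))   # None
print(re.search(r"^(?!Volcano)[A-Za-z]+", "Krakatoa erupted").group(0))    # Krakatoa
# lookbehind: digits only when immediately preceded by a dollar sign
print(re.findall(r"(?<=\$)[0-9]+", "doesn't charge $10 or $20"))           # ['10', '20']
```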
corpus (plural corpora): a computer-readable collection of text or speech.
Punctuation is critical for finding boundaries of things (commas, periods, colons) and for identifying some aspects of meaning (question marks, exclamation marks, quotation marks). For some tasks, like part-of-speech tagging or parsing or speech synthesis, we sometimes treat punctuation marks as if they were separate words.
An utterance is the spoken correlate of a sentence.
This utterance has two kinds of disfluencies. The broken-off word main- is called a fragment. Words like uh and um are called fillers or filled pauses.
We also sometimes keep disfluencies around. Disfluencies like uh or um are actually helpful in speech recognition in predicting the upcoming word, because they may signal that the speaker is restarting the clause or idea, and so for speech recognition they are treated as regular words. Because people use different disfluencies they can also be a cue to speaker identification.
A lemma is a set of lexical forms having the same stem, the same major part-of-speech, and the same word sense. The wordform is the full inflected or derived form of the word.
Types are the number of distinct words in a corpus; if the set of words in the vocabulary is $V$, the number of types is the vocabulary size $|V|$. Tokens are the total number $N$ of running words.
The relationship between the number of types $|V|$ and the number of tokens $N$ is called Herdan’s Law (Herdan, 1960) or Heaps’ Law (Heaps, 1978) after its discoverers (in linguistics and information retrieval respectively). It is shown in Eq. 2.1, where $k$ and $\beta$ are positive constants, and $0 < \beta < 1$.
$$|V| = kN^\beta$$
The value of $\beta$ depends on the corpus size and the genre, but at least for the large corpora, $\beta$ ranges from .67 to .75. Roughly then we can say that the vocabulary size for a text goes up significantly faster than the square root of its length in words.
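As a rough numerical illustration (the constants k = 30 and β = 0.7 below are arbitrary stand-ins, not values fitted to any real corpus):

```python
# Heaps'/Herdan's law: |V| = k * N**beta, with purely illustrative constants
k, beta = 30, 0.7
for N in (10_000, 1_000_000, 100_000_000):
    print(f"N = {N:>11,}   predicted |V| ~ {k * N ** beta:,.0f}")
# the vocabulary keeps growing with corpus size, but much more slowly than N itself
```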
Another measure of the number of words in the language is the number of lemmas instead of wordform types. Dictionaries can help in giving lemma counts; dictionary entries or boldface forms are a very rough upper bound on the number of lemmas (since some lemmas have multiple boldface forms). The 1989 edition of the Oxford English Dictionary had 615,000 entries.
African American Vernacular English (AAVE)
Standard American English (SAE)
It’s also quite common for speakers or writers to use multiple languages in a single communicative act, a phenomenon called code switching.
At least three tasks are commonly applied as part of any normalization process: tokenizing (segmenting) words, normalizing word formats, and segmenting sentences.
Try https://bellard.org/jslinux/ for a free online Unix environment.
Let’s begin with the complete works of Shakespeare in one text file, sh.txt.
We can use tr to tokenize the words by changing every sequence of non-alphabetic characters to a newline ('A-Za-z' means alphabetic, the -c option complements this to non-alphabetic, and the -s option squeezes repeated output characters into one; the whole command therefore turns every run of non-alphabetic characters into a single '\n'):
tr -sc 'A-Za-z' '\n' < sh.txt
Now that there is one word per line, we can sort the lines and pass them to uniq -c, which will collapse and count them:
tr -sc 'A-Za-z' '\n' < sh.txt | sort | uniq -c
Alternatively, we can first map all upper case to lower case:
tr -sc 'A-Za-z' '\n' < sh.txt | tr A-Z a-z | sort | uniq -c
Now we can sort again to find the frequent words. The -n option to sort means to sort numerically rather than alphabetically, and the -r option means to sort in reverse order (highest to lowest):
tr -sc 'A-Za-z' '\n' < sh.txt | tr A-Z a-z | sort | uniq -c | sort -n -r
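Roughly the same count can be done in Python (a sketch; sh.txt is the same file as above):

```python
import re
from collections import Counter

# rough equivalent of: tr -sc 'A-Za-z' '\n' < sh.txt | tr A-Z a-z | sort | uniq -c | sort -n -r
with open("sh.txt", encoding="utf-8") as f:
    text = f.read().lower()
words = re.findall(r"[a-z]+", text)        # maximal runs of alphabetic characters
for word, count in Counter(words).most_common(10):
    print(count, word)
```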
tokenization: the task of segmenting running text into words.
normalization: the task of putting words/tokens in a standard format.
One commonly used tokenization standard is known as the Penn Treebank tokenization standard, used for the parsed corpora (treebanks) released by the Linguistic Data Consortium (LDC), the source of many useful datasets. This standard separates out clitics (doesn’t becomes does plus n’t), keeps hyphenated words together, and separates out all punctuation.
Input: “The San Francisco-based restaurant,” they said, “doesn’t charge $10”.
Output: “ | The | San | Francisco-based | restaurant | , | ” | they | said | , | “ | does | n’t | charge | $ | 10 | ” | .
Tokens can also be normalized, in which a single normalized form is chosen for words with multiple forms like USA and US, or uh-huh and uhhuh. This standardization may be valuable, despite the spelling information that is lost in the normalization process. For information retrieval, we might want a query for US to match a document that has USA; for information extraction we might want to extract coherent information that is consistent across differently-spelled instances.
Case folding is another kind of normalization. For tasks like speech recognition and information retrieval, everything is mapped to lower case. For sentiment analysis and other text classification tasks, information extraction, and machine translation, by contrast, case is quite helpful and case folding is generally not done (losing the difference, for example, between US the country and us the pronoun can outweigh the advantage in generality that case folding provides).
The standard method for tokenization/normalization is therefore to use deterministic algorithms based on regular expressions compiled into very efficient finite state automata.
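As a small illustration of that idea, here is a toy regex-based tokenizer; the pattern is our own and is deliberately simpler than the Penn Treebank standard (for instance, it does not split off clitics or separate $ from the amount):

```python
import re

# toy tokenizer: currency/number tokens, words (allowing internal hyphens or
# apostrophes), and individual punctuation marks
pattern = r"\$?\d+(?:\.\d+)?%?|\w+(?:[-']\w+)*|[^\w\s]"
print(re.findall(pattern, "The San Francisco-based restaurant doesn't charge $10."))
# ['The', 'San', 'Francisco-based', 'restaurant', "doesn't", 'charge', '$10', '.']
```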
dictionary = ["计算语言学", "计算", "课程", "意思"]
sentence = "计算语言学课程有意思"

# MaxMatch: greedily take the longest dictionary word that starts at the
# current position; if none is found, take a single character and continue.
def maxmatch(sentence, dictionary):
    list_of_words = []
    if len(sentence) == 0:
        return []
    for i in range(len(sentence), 0, -1):
        firstword = sentence[:i]
        remainder = sentence[i:]
        if firstword in dictionary:
            list_of_words.append(firstword)
            return list_of_words + maxmatch(remainder, dictionary)
    # no word was found, so make a one-character word
    firstword = sentence[0]
    remainder = sentence[1:]
    list_of_words.append(firstword)
    return list_of_words + maxmatch(remainder, dictionary)

list_of_words = maxmatch(sentence, dictionary)
print(list_of_words)  # ['计算语言学', '课程', '有', '意思']
We can quantify how well a segmenter works using a metric called word error rate. We compare our output segmentation with a perfect hand-segmented (‘gold’) sentence, seeing how many words differ. The word error rate is then the normalized minimum edit distance in words between our output and the gold: the number of word insertions, deletions, and substitutions divided by the length of the gold sentence in words; we’ll see in Section 2.5 how to compute edit distance. Even in Chinese, however, MaxMatch has problems, for example dealing with unknown words (words not in the dictionary) or genres that differ a lot from the assumptions made by the dictionary builder. The most accurate Chinese segmentation algorithms generally use statistical sequence models trained via supervised machine learning on hand-segmented training sets; we’ll introduce sequence models in Chapter 8.
Lemmatization is the task of determining that two words have the same root, despite their surface differences. The words am, are, and is have the shared lemma be; the words dinner and dinners both have the lemma dinner.
The lemmatized form of a sentence like He is reading detective stories would thus be He be read detective story.
How is lemmatization done? The most sophisticated methods for lemmatization involve complete morphological parsing of the word. Morphology is the study of the way words are built up from smaller meaning-bearing units called morphemes. Two broad classes of morphemes can be distinguished: stems—the central morpheme of the word, supplying the main meaning— and affixes—adding “additional” meanings of various kinds. A morphological parser takes a word like cats and parses it into the two morphemes cat and s.
The Porter Stemmer
Lemmatization algorithms can be complex. For this reason we sometimes make use of a simpler but cruder method, which mainly consists of chopping off word-final affixes. This naive version of morphological analysis is called stemming. One of the most widely used stemming algorithms is the Porter stemmer (Porter, 1980). The algorithm is based on a series of rewrite rules run in series, as a cascade, in which the output of each pass is fed as input to the next pass.
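A quick way to try it is NLTK's implementation (this assumes the nltk package is installed; the sample words are ours):

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for w in ["caresses", "ponies", "running", "stemming"]:
    print(w, "->", stemmer.stem(w))
# e.g. caresses -> caress, ponies -> poni, running -> run, stemming -> stem
```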
Stemming or lemmatizing has another side-benefit. By treating two similar words identically, these normalization methods help deal with the problem of unknown words, words that a system has not seen before.
If our training corpus contains, say, the words low and lowest but not lower, and the word lower then appears in our test corpus, our system will not know what to do with it. Stemming or lemmatizing everything to low can solve the problem, but has the disadvantage that sometimes we don’t want words to be completely collapsed. For some purposes (for example part-of-speech tagging) the words low and lower need to remain distinct.
A solution to this problem is to use a different kind of tokenization in which most tokens are words, but some tokens are frequent word parts like -er, so that an unseen word can be represented by combining the parts.
The simplest such algorithm is byte-pair encoding, or BPE (Sennrich et al., 2016). Byte-pair encoding is based on a method for text compression (Gage, 1994), but here we use it for tokenization instead. The intuition of the algorithm is to iteratively merge frequent pairs of characters.
The algorithm begins with the set of symbols equal to the set of characters. Each word is represented as a sequence of characters plus a special end-of-word symbol ·. At each step of the algorithm, we count the number of symbol pairs, find the most frequent pair (‘A’, ‘B’), and replace it with the new merged symbol (‘AB’). We continue to count and merge, creating new longer and longer character strings, until we’ve done k merges; k is a parameter of the algorithm. The resulting symbol set will consist of the original set of characters plus k new symbols.
The algorithm is run inside words (we don’t merge across word boundaries). For this reason, the algorithm can take as input a dictionary of words together with counts. For example, consider the following tiny input dictionary:
Word | Frequency |
---|---|
l o w · | 5 |
l o w e s t · | 2 |
n e w e r · | 6 |
w i d e r · | 3 |
n e w · | 2 |
We first count all pairs of symbols: the most frequent is the pair (r, ·), because it occurs in newer (frequency of 6) and wider (frequency of 3) for a total of 9 occurrences. We then merge these symbols, treating r· as one symbol, and count again:
Word | Frequency |
---|---|
l o w · | 5 |
l o w e s t · | 2 |
n e w e r· | 6 |
w i d e r· | 3 |
n e w · | 2 |
Now the most frequent pair is e r·, which we merge:
Word | Frequency |
---|---|
l o w · | 5 |
l o w e s t · | 2 |
n e w er· | 6 |
w i d er· | 3 |
n e w · | 2 |
Our system has learned that there should be a token for word-final er, represented as er·. If we continue, the next merges are:
- (e, w)
- (n, ew)
- (l, o)
- (lo, w)
- (new, er·)
- (low, ·)
The current set of symbols is thus {·, d, e, i, l, n, o, r, s, t, w, r·, er·, ew, new, lo, low, newer·, low·}.
When we need to tokenize a test sentence, we just run the merges we have learned, greedily, in the order we learned them, on the test data. (Thus the frequencies in the test data don’t play a role, just the frequencies in the training data). So first we segment each test sentence word into characters. Then we apply the first rule: replace every instance of r · in the test corpus with r·, and then the second rule: replace every instance of e r· in the test corpus with er·, and so on. By the end, if the test corpus contained the word n e w e r ·, it would be tokenized as a full word. But a new (unknown) word like l o w e r · would be merged into the two tokens low er·.
Of course in real algorithms BPE is run with many thousands of merges on a very large input dictionary. The result is that most words will be represented as full symbols, and only the very rare words (and unknown words) will have to be represented by their parts.
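For illustration, here is a minimal sketch of applying an ordered list of learned merges to a new word; the apply_merges helper is our own, and the merge ordering is the one from the worked example above:

```python
def apply_merges(word, merges):
    """Greedily apply learned BPE merges, in the order learned, to one word."""
    symbols = list(word) + ["·"]            # '·' is the end-of-word symbol
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # replace the pair with the merged symbol
            else:
                i += 1
    return symbols

merges = [("r", "·"), ("e", "r·"), ("e", "w"), ("n", "ew"),
          ("l", "o"), ("lo", "w"), ("new", "er·"), ("low", "·")]
print(apply_merges("newer", merges))   # ['newer·']
print(apply_merges("lower", merges))   # ['low', 'er·']
```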
import re
import collections

def get_stats(vocab):
    """Count the frequency of each adjacent symbol pair in the vocabulary."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()   # each word is a space-separated string of symbols
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    """Replace every occurrence of the symbol pair with the merged symbol."""
    v_out = {}
    bigram = re.escape(" ".join(pair))
    p = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")  # match the pair only at symbol boundaries
    for word in v_in:
        w_out = p.sub("".join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out

# '·' marks the end-of-word symbol described in the text above
vocab = {"l o w ·": 5, "l o w e s t ·": 2,
         "n e w e r ·": 6, "w i d e r ·": 3,
         "n e w ·": 2}
num_merges = 8
for i in range(num_merges):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)   # most frequent pair
    vocab = merge_vocab(best, vocab)
    print(best)
Sentence segmentation is another important step in text processing. The most useful cues for segmenting a text into sentences are punctuation marks like periods, question marks, and exclamation points.
State-of-the-art methods for sentence tokenization are based on machine learning and are introduced in later chapters.
The word grraff differs by only one letter from graff, whereas grail and graffe differ in more letters.
Coreference is the task of deciding whether two strings such as the following refer to the same entity:
- Stanford President John Hennessy
- Stanford University President John Hennessy
Edit distance gives us a way to quantify both of these intuitions about string similarity. More formally, the minimum edit distance between two strings is defined as the minimum number of editing operations (operations like insertion, deletion, substitution) needed to transform one string into another.
Given two sequences, an alignment is a correspondence between substrings of the two sequences. Thus, we say I aligns with the empty string, N with E, and so on. Beneath the aligned strings is another representation; a series of symbols expressing an operation list for converting the top string into the bottom string: d for deletion, s for substitution, i for insertion.
I | N | T | E | * | N | T | I | O | N |
---|---|---|---|---|---|---|---|---|---|
\* | E | X | E | C | U | T | I | O | N |
d | s | s | | i | s | | | | |
We can also assign a particular cost or weight to each of these operations. The Levenshtein distance between two sequences is the simplest weighting factor in which each of the three operations has a cost of 1 (Levenshtein, 1966)—we assume that the substitution of a letter for itself, for example, t for t, has zero cost. The Levenshtein distance between intention and execution is 5. Levenshtein also proposed an alternative version of his metric in which each insertion or deletion has a cost of 1 and substitutions are not allowed. (This is equivalent to allowing substitution, but giving each substitution a cost of 2 since any substitution can be represented by one insertion and one deletion). Using this version, the Levenshtein distance between intention and execution is 8.
How do we find the minimum edit distance? We can think of this as a search task, in which we are searching for the shortest path—a sequence of edits—from one string to another.
The space of all possible edits is enormous, so we can’t search naively. However, lots of distinct edit paths will end up in the same state (string), so rather than recomputing all those paths, we could just remember the shortest path to a state each time we saw it. We can do this by using dynamic programming. Dynamic programming is the name for a class of algorithms, first introduced by Bellman (1957), that apply a table-driven method to solve problems by combining solutions to sub-problems. Some of the most commonly used algorithms in natural language processing make use of dynamic programming, such as the Viterbi algorithm (Chapter 8) and the CKY algorithm for parsing (Chapter 11).
The intuition of a dynamic programming problem is that a large problem can be solved by properly combining the solutions to various sub-problems.
Let’s first define the minimum edit distance between two strings. Given two strings, the source string $X$ of length $n$, and target string $Y$ of length $m$, we’ll define $D(i, j)$ as the edit distance between $X[1\ldots i]$ and $Y[1\ldots j]$, i.e., the first $i$ characters of $X$ and the first $j$ characters of $Y$. The edit distance between $X$ and $Y$ is thus $D(n,m)$.
$$D(i,j)=\min \begin{cases}D(i-1,j)+\textrm{del-cost}(source[i]) \\ D(i,j-1)+\textrm{ins-cost}(target[j]) \\ D(i-1,j-1)+\textrm{sub-cost}(source[i],target[j]) \end{cases}$$
If we assume the version of Levenshtein distance in which the insertions and deletions each have a cost of 1 ($\textrm{ins-cost}(\cdot) = \textrm{del-cost}(\cdot) = 1$), and substitutions have a cost of 2 (except that substitution of identical letters has zero cost), the computation for $D(i, j)$ becomes:
$$D(i,j)=\min \begin{cases}D(i-1,j)+1 \\ D(i,j-1)+1 \\ D(i-1,j-1)+\begin{cases}2 & \textrm{if } source[i]\neq target[j]\\ 0 & \textrm{if } source[i] = target[j] \end{cases} \end{cases}$$
import numpy as np

def min_edit_distance(source, target):
    n = len(source)
    m = len(target)
    # D[i, j] holds the edit distance between source[:i] and target[:j]
    D = np.zeros((n + 1, m + 1), dtype=np.int16)
    for i in range(1, n + 1):
        D[i, 0] = D[i - 1, 0] + 1          # deletions from source
    for j in range(1, m + 1):
        D[0, j] = D[0, j - 1] + 1          # insertions into target
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = min(D[i - 1, j] + 1,
                          D[i, j - 1] + 1,
                          D[i - 1, j - 1] + (0 if source[i - 1] == target[j - 1] else 2))
    return D[n, m]
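A quick check against the example above (with substitutions costing 2, the distance between intention and execution should be 8):

```python
print(min_edit_distance("intention", "execution"))   # 8
```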
Knowing the minimum edit distance is useful for algorithms like finding potential spelling error corrections. But the edit distance algorithm is important in another way; with a small change, it can also provide the minimum cost alignment between two strings.
This chapter introduced a fundamental tool in language processing, the regular expression, and showed how to perform basic text normalization tasks including word segmentation and normalization, sentence segmentation, and stemming. We also introduced the important minimum edit distance algorithm for comparing strings. Here’s a summary of the main points we covered about these ideas: