// Search for lines that start with the string “From:”
import re

hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    # '^' anchors the match at the beginning of the line
    if re.search(r'^From:', line):
        print(line)
// Match any of the strings “From:”, “Fxxm:”, “F12m:”, or “F!@m:”
import re

hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    # each '.' matches any single character
    if re.search(r'^F..m:', line):
        print(line)
// Search for lines that start with “From:”, followed by one or more characters (“.+”), followed by an at-sign
import re

hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    # '.+' matches one or more of any character
    if re.search(r'^From:.+@', line):
        print(line)
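To try the three patterns above without needing mbox-short.txt, here is a minimal sketch that runs them over a few in-memory lines. The sample lines, the address user@example.com, and the helper name matching() are made up for illustration.

```python
import re

# Hypothetical sample lines standing in for mbox-short.txt
sample_lines = [
    'From: user@example.com',
    'Fxxm: not a real header',
    'From: no address here',
    'Subject: From: appears mid-line',
]

def matching(pattern, lines):
    """Return the lines in which re.search() finds the pattern."""
    return [line for line in lines if re.search(pattern, line)]

starts_with_from = matching(r'^From:', sample_lines)
wildcard_from = matching(r'^F..m:', sample_lines)
from_with_at = matching(r'^From:.+@', sample_lines)

print(starts_with_from)
print(wildcard_from)
print(from_with_at)
```

Only the last pattern insists on an at-sign, so it keeps just the first sample line.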
// Use findall() to extract the email addresses from a string
import re

s = 'Hello from [email protected] to [email protected] about the meeting @2PM'
# findall() returns a list of every non-overlapping match
lst = re.findall(r'\S+@\S+', s)
print(lst)
The output of the program would be:
['[email protected]', '[email protected]']
The “\S+” matches as many non-whitespace characters as possible.
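A small sketch of that greedy behavior, assuming a made-up sentence with placeholder addresses (none of them come from the source file). Because “\S+” keeps consuming non-whitespace characters, surrounding punctuation ends up in the match, while “@2PM” is skipped since nothing non-blank precedes the at-sign.

```python
import re

# Hypothetical text; the addresses are placeholders for illustration
text = 'Write to alice@example.com or <bob@example.org>; call @2PM'

found = re.findall(r'\S+@\S+', text)
print(found)
```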
Running the same findall() call on each line of mbox-short.txt and printing the non-empty results gives:
OUTPUT:
['[email protected]']
['[email protected]']
['<[email protected]>']
['<[email protected]>']
['<[email protected]>;']
['<[email protected]>;']
['<[email protected]>;']
['apache@localhost)']
['[email protected];']
To get rid of characters like “<” and “;”, we say that we are only interested in the portion of the string that starts and ends with a letter or a number. Square brackets are used to indicate a set of multiple acceptable characters we are willing to consider matching. In a sense, the “\S” is asking to match the set of “non-whitespace characters”. Now we will be a little more explicit in terms of the characters we will match.
Here is our new regular expression:
[a-zA-Z0-9]\S*@\S*[a-zA-Z]
Translating this regular expression, we are looking for substrings that start with a single lowercase letter, uppercase letter, or number (“[a-zA-Z0-9]”), followed by zero or more non-blank characters (“\S*”), followed by an at-sign, followed by zero or more non-blank characters (“\S*”), followed by an uppercase or lowercase letter. Note that we switched from “+” to “*” to indicate zero or more non-blank characters, since “[a-zA-Z0-9]” is already one non-blank character. Remember that the “*” or “+” applies to the single character immediately to the left of the plus or asterisk.
import re

hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    # match must start with a letter or digit and end with a letter
    x = re.findall(r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]', line)
    if len(x) > 0:
        print(x)
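To see the difference between the two patterns side by side, here is a minimal sketch on a single made-up line (the address carol@example.edu is a placeholder). The stricter pattern trims the leading “<” and the trailing “>;” that the looser one picks up.

```python
import re

loose = r'\S+@\S+'
strict = r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]'

# Hypothetical line with punctuation around the address
line = 'Reply-To: <carol@example.edu>;'

loose_matches = re.findall(loose, line)
strict_matches = re.findall(strict, line)

print(loose_matches)
print(strict_matches)
```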
// Get lines that start with “X-” and end in a number, such as:
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
import re

hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    # header name, colon, space, then digits and periods
    if re.search(r'^X\S*: [0-9.]+', line):
        print(line)
Translation: we want lines that start with “X”, followed by zero or more non-whitespace characters (“\S*”, which covers the “-” and the rest of the header name), followed by a colon (“:”) and then a space. After the space we are looking for one or more characters that are either a digit (0-9) or a period (“[0-9.]+”). Note that inside the square brackets, the period matches an actual period (i.e., it is not a wildcard between the square brackets).
OUTPUT:
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6178
X-DSPAM-Probability: 0.0000
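As a rough check of which lines this pattern accepts and rejects, here is a sketch over a few in-memory lines. The last two sample lines are invented for illustration: one has a non-numeric value, and the other has a space before the colon, which “\S*” cannot cross.

```python
import re

# The last two lines are hypothetical counter-examples
sample_lines = [
    'X-DSPAM-Confidence: 0.8475',
    'X-DSPAM-Probability: 0.0000',
    'X-DSPAM-Result: Innocent',
    'X-Plane is behind schedule: 2 days',
]

pattern = r'^X\S*: [0-9.]+'
numeric_headers = [line for line in sample_lines if re.search(pattern, line)]

print(numeric_headers)
```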
Parentheses are another special character in regular expressions. When you add parentheses to a regular expression, they are ignored when matching the string. But when you are using findall(), parentheses indicate that while you want the whole expression to match, you are only interested in extracting a portion of the substring that matches the regular expression.
// Add parentheses around the part of the regular expression that represents the floating-point number to indicate we only want findall() to give us back the floating-point number portion of the matching string.
import re

hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    # the parentheses select the part findall() returns
    x = re.findall(r'^X\S*: ([0-9.]+)', line)
    if len(x) > 0:
        print(x)
OUTPUT:
['0.8475']
['0.0000']
['0.6178']
['0.0000']
['0.6961']
['0.0000']
..
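The effect of the parentheses can be sketched by running the pattern with and without them on the same line, and then converting the captured strings to floats. The variable names here are illustrative only.

```python
import re

sample_lines = [
    'X-DSPAM-Confidence: 0.8475',
    'X-DSPAM-Probability: 0.0000',
    'X-DSPAM-Confidence: 0.6178',
]

# Without parentheses, findall() returns the whole matched substring
whole = re.findall(r'^X\S*: [0-9.]+', sample_lines[0])

# With parentheses, findall() returns only the captured part
values = []
for line in sample_lines:
    found = re.findall(r'^X\S*: ([0-9.]+)', line)
    if found:
        values.append(float(found[0]))

print(whole)
print(values)
```

The captured values are strings, so the float() call is needed before doing any arithmetic with them.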
// Extract the number that follows “rev=” from lines such as:
//Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772
import re

hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    # '.*' skips ahead to 'rev=', then the digits are captured
    x = re.findall(r'^Details:.*rev=([0-9]+)', line)
    if len(x) > 0:
        print(x)
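Here is a short sketch of the same extraction on the single sample line quoted above, with the captured text converted to an integer. Because “.*” is greedy, it runs as far as it can and then backs up to the last “rev=” that is followed by digits.

```python
import re

line = 'Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772'

revs = re.findall(r'^Details:.*rev=([0-9]+)', line)
# findall() returns strings; convert them to do arithmetic or comparisons
rev_numbers = [int(r) for r in revs]

print(revs)
print(rev_numbers)
```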
// Extract the hour of the day from each line like the one below:
//From [email protected] Sat Jan 5 09:14:16 2008
import re

hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    # capture the two digits that come right before the first colon of the time
    x = re.findall(r'^From .* ([0-9][0-9]):', line)
    if len(x) > 0:
        print(x)
OUTPUT (for example):
['09']
['18']
['16']
['15']
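One possible use of the extracted hours is a count of messages per hour. The sketch below assumes a few made-up “From ” lines (the addresses are placeholders) and uses collections.Counter from the standard library; the counting step is an illustration, not part of the program above.

```python
import re
from collections import Counter

# Hypothetical lines in the same shape as the mbox "From " lines
sample_lines = [
    'From a@example.com Sat Jan  5 09:14:16 2008',
    'From b@example.com Fri Jan  4 18:10:48 2008',
    'From c@example.com Fri Jan  4 09:05:31 2008',
]

hour_counts = Counter()
for line in sample_lines:
    # Only ' 09:' style text matches: a space, two digits, then a colon
    hours = re.findall(r'^From .* ([0-9][0-9]):', line)
    hour_counts.update(hours)

print(sorted(hour_counts.items()))
```

The minutes and seconds are never captured because they are preceded by a colon rather than a space.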
// Escape character
// Since special characters in regular expressions are used to match the beginning or end of a line or to specify wildcards, we can match one of those characters literally by prefixing it with a backslash.
import re

x = 'We just received $10.00 for cookies.'
# '\$' matches a literal dollar sign instead of end-of-line
y = re.findall(r'\$[0-9.]+', x)
print(y)
OUTPUT:
['$10.00']
NOTE: Inside square brackets, characters are not “special”.
So when we say “[0-9.]”, it really means digits or a period. Outside of square brackets, a period is the “wildcard” character
and matches any character. Inside square brackets, the period is a period.
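Both points can be checked with a small sketch: the same dollar-sign pattern with and without the backslash, and a period outside versus inside square brackets. The second test string is made up for illustration.

```python
import re

text = 'We just received $10.00 for cookies.'

# Escaped: '\$' is a literal dollar sign
escaped = re.findall(r'\$[0-9.]+', text)
# Unescaped: '$' means end of line, so nothing can follow it
unescaped = re.findall(r'$[0-9.]+', text)

# Hypothetical string to compare '.' outside and inside brackets
codes = '1.0 1x0 100'
wildcard = re.findall(r'1.0', codes)    # '.' matches any character
literal = re.findall(r'1[.]0', codes)   # '[.]' matches only a period

print(escaped, unescaped)
print(wildcard, literal)
```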
//*****Special characters and character sequences *****
^
Matches the beginning of the line.
$
Matches the end of the line.
.
Matches any character (a wildcard).
\s
Matches a whitespace character.
\S
Matches a non-whitespace character (opposite of \s).
*
Applies to the immediately preceding character and indicates to match zero or more of the preceding character(s).
*?
Applies to the immediately preceding character and indicates to match zero or more of the preceding character(s) in “non-greedy mode”.
+
Applies to the immediately preceding character and indicates to match one or more of the preceding character(s).
+?
Applies to the immediately preceding character and indicates to match one or more of the preceding character(s) in “non-greedy mode”.
[aeiou]
Matches a single character as long as that character is in the specified set. In this example, it would match “a”, “e”, “i”, “o”, or “u”, but no other characters.
[a-z0-9]
You can specify ranges of characters using the minus sign. This example is a single character that must be a lowercase letter or a digit.
[^A-Za-z]
When the first character in the set notation is a caret, it inverts the logic. This example matches a single character that is anything other than an uppercase or lowercase letter.
( )
When parentheses are added to a regular expression, they are ignored for the purpose of matching, but allow you to extract a particular subset of the matched string rather than the whole string when using findall().
\b
Matches the empty string, but only at the start or end of a word.
\B
Matches the empty string, but not at the start or end of a word.
\d
Matches any decimal digit; equivalent to the set [0-9].
\D
Matches any non-digit character; equivalent to the set [^0-9].
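A few entries from the list above that have no example earlier in these notes can be sketched as follows. All of the test strings are made up for illustration.

```python
import re

# Greedy '+' versus non-greedy '+?'
markup = '<b>bold</b> and <i>italic</i>'
greedy = re.findall(r'<.+>', markup)   # runs to the last '>'
lazy = re.findall(r'<.+?>', markup)    # stops at the first '>'

# '\b' only matches at the start or end of a word
words = re.findall(r'\bcat\b', 'cat concat cats cat.')

# '\d' is any decimal digit
numbers = re.findall(r'\d+', 'Room 101, floor 3')

# '[^A-Za-z]' is any single character that is not a letter
non_letters = re.findall(r'[^A-Za-z]', 'a1 b!')

print(greedy)
print(lazy)
print(words, numbers, non_letters)
```

In the word-boundary case, “concat” and “cats” are skipped because “cat” is not a whole word there.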
// For Unix, Linux, and Macintosh users
// There is a command-line program built into Unix called grep (the name comes from the ed command g/re/p, “globally search for a regular expression and print”) that does pretty much the same as the
// search() examples. So if you have a Macintosh or Linux system, you can try the following commands in your command-line window.
$ grep '^From:' mbox-short.txt
From: [email protected]
From: [email protected]
From: [email protected]
From: [email protected]
This tells grep to show you lines that start with the string “From:” in the file mbox-short.txt. If you experiment with the grep command a bit and read the documentation for grep, you will find some subtle differences between the regular expression support in Python and the regular expression support in grep. As an example, some versions of grep do not support the non-blank character “\S”, so you may need to use the slightly more complex set notation “[^ ]”, which simply means match a character that is anything other than a space.
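The two notations are close but not identical, which can be sketched in Python itself. On text separated only by spaces they agree, but “[^ ]” also accepts a tab while “\S” does not. The sample strings are made up for illustration.

```python
import re

# Only spaces as separators: both notations give the same pieces
spaced = 'From: user@example.com'
same_a = re.findall(r'\S+', spaced)
same_b = re.findall(r'[^ ]+', spaced)

# With a tab, '[^ ]' treats the tab as an ordinary character
tabbed = 'a\tb c'
by_nonblank = re.findall(r'\S+', tabbed)
by_not_space = re.findall(r'[^ ]+', tabbed)

print(same_a, same_b)
print(by_nonblank, by_not_space)
```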