Learning Regular Expressions

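References:
Crawler Basics Series (5): The Complete Guide to Regular Expressions (Part 1)
Crawler Basics Series (6): The Complete Guide to Regular Expressions (Part 2)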

Regular strings vs. raw strings

    print('aa\n')
    print(r'aa\n')

Output:

aa

aa\n

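A regular string interprets escape sequences such as \n and \t and renders them as the characters they stand for.
A raw string is taken literally: what you write is what you get.
Regular-expression patterns should be written as raw strings.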
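To make the difference concrete, here is a small sketch that compares the two literals and then uses a raw string as a pattern. The variable names and the sample input are illustrative only.

```python
import re

regular = 'aa\n'   # \n is interpreted as one newline character
raw = r'aa\n'      # backslash and 'n' stay two separate characters

print(len(regular))  # 3
print(len(raw))      # 4

# In a raw string, \d reaches the regex engine unchanged as backslash + d.
digits = re.findall(r'\d+', 'a1b22c333')
print(digits)        # ['1', '22', '333']
```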

Python's re library

    import re

    original_str = 'and'
    pattern = re.compile(r'a.d')
    # match() returns a Match object (or None if there is no match)
    m = pattern.match(original_str)  # equivalent to re.match(r"a.d", "and")
    print(m)

Result (Python 3.7 and later print the object as <re.Match object; ...> instead):

<_sre.SRE_Match object; span=(0, 3), match='and'>
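As a short sketch of what can be done with the returned object, the example below reads the matched text and its span, compares the compiled form with the module-level shortcut, and shows the no-match case. The extra input 'band' is an assumption added for illustration.

```python
import re

pattern = re.compile(r'a.d')

m1 = pattern.match('and')       # compiled pattern
m2 = re.match(r'a.d', 'and')    # module-level shortcut

print(m1.group(), m1.span())    # and (0, 3)
print(m2.group() == m1.group()) # True

# match() only tries the start of the string and returns None on failure
no_match = pattern.match('band')
print(no_match)                 # None
```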

Basic metacharacters
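A few commonly used metacharacters, sketched with made-up sample strings (the inputs are assumptions chosen for illustration):

```python
import re

# .  matches any single character except a newline
dot = re.findall(r'a.c', 'abc a-c ac')
print(dot)      # ['abc', 'a-c']

# \d matches a digit, \w a word character, \s a whitespace character
digit = re.findall(r'\d', 'a1b2')
word = re.findall(r'\w', 'a_1 !')
space = re.findall(r'\s', 'a b\tc')
print(digit)    # ['1', '2']
print(word)     # ['a', '_', '1']
print(space)    # [' ', '\t']

# [...] matches any one character from the set
char_set = re.findall(r'[abc]', 'cabbage')
print(char_set) # ['c', 'a', 'b', 'b', 'a']
```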
Boundary matching
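Boundary anchors match positions rather than characters. A minimal sketch, with sample strings that are assumptions for illustration:

```python
import re

# ^ anchors at the start of the string, $ at the end
starts = re.search(r'^ab', 'abc') is not None
not_start = re.search(r'^ab', 'cab') is not None
ends = re.search(r'ab$', 'cab') is not None
print(starts, not_start, ends)  # True False True

# \b matches at a word boundary, so 'cat' inside 'concat' is skipped
whole_words = re.findall(r'\bcat\b', 'cat concat cat.')
print(whole_words)              # ['cat', 'cat']
```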
Repetition
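The usual quantifiers can be compared side by side on one input. This is only a sketch; the sample text is made up for the example.

```python
import re

text = 'a ab abb abbb'

star = re.findall(r'ab*', text)        # zero or more b
plus = re.findall(r'ab+', text)        # one or more b
opt = re.findall(r'ab?', text)         # zero or one b
exact = re.findall(r'ab{2}', text)     # exactly two b
ranged = re.findall(r'ab{1,2}', text)  # one or two b

print(star)    # ['a', 'ab', 'abb', 'abbb']
print(plus)    # ['ab', 'abb', 'abbb']
print(opt)     # ['a', 'ab', 'ab', 'ab']
print(exact)   # ['abb', 'abb']
print(ranged)  # ['ab', 'abb', 'abb']
```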
Alternation
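Alternation with | picks one of several alternatives. A small sketch with illustrative inputs (the words used are assumptions):

```python
import re

# | matches either the left or the right alternative
animals = re.findall(r'cat|dog', 'cat dog bird')
print(animals)                # ['cat', 'dog']

# Parentheses limit the scope of the alternation to part of the pattern
m = re.match(r'gr(a|e)y', 'grey')
print(m.group(), m.group(1))  # grey e
```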
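Grouping
A pair of parentheses forms a group, much like parentheses in an arithmetic expression.
group() with no argument returns the whole matched string; group(n) returns the text captured by the n-th group.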

    # Match an IP address
    pattern = re.compile(r'(\d{1,3}\.){3}\d{1,3}')
    result = pattern.match('192.168.01.02xxx')
    print(result.group())   # 192.168.01.02
    print(result.group(1))  # 01. (a repeated group keeps only its last repetition)
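In the pattern above the group is repeated three times, so group(1) holds only the last repetition. If each part of the address is needed separately, one option is to write four explicit groups. The following is a sketch of that variant, not the pattern used in the original notes.

```python
import re

# One explicit group per octet instead of a single repeated group
pattern = re.compile(r'(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})')
result = pattern.match('192.168.01.02xxx')

print(result.group())   # 192.168.01.02
print(result.group(1))  # 192
print(result.groups())  # ('192', '168', '01', '02')
```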

match().group() returns only the first match, while findall returns every match.

    pattern = re.compile(r'ab')
    result = pattern.match('abab')
    print(result.group())

    pattern = re.compile(r'ab')
    result = pattern.findall('abab')
    print(result)
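When the positions of the matches matter as well, finditer yields one Match object per match. A minimal sketch on the same input:

```python
import re

pattern = re.compile(r'ab')

first = pattern.match('abab')
print(first.group(), first.span())  # ab (0, 2)

# finditer walks over all non-overlapping matches from left to right
spans = [m.span() for m in pattern.finditer('abab')]
print(spans)                        # [(0, 2), (2, 4)]
```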

When the pattern contains a capturing group, findall returns the content of the group instead of the whole match

    html = '![](/images/category.png)![](/images/js_framework.png)'
    pattern = re.compile(r'')
    result = pattern.findall(html)
    print(result)

    html = '![](/images/category.png)![](/images/js_framework.png)'
    pattern = re.compile(r'')
    result = pattern.findall(html)
    print(result)
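As a minimal sketch of this behaviour, assume two patterns written against the Markdown-style image string used above. Both patterns are assumptions for illustration and are not taken from the original notes.

```python
import re

html = '![](/images/category.png)![](/images/js_framework.png)'

# With one capturing group, findall returns only the group content
with_group = re.findall(r'!\[\]\((.*?)\)', html)
print(with_group)
# ['/images/category.png', '/images/js_framework.png']

# Without a group, findall returns the whole match
without_group = re.findall(r'!\[\]\(.*?\)', html)
print(without_group)
# ['![](/images/category.png)', '![](/images/js_framework.png)']
```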

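Greedy vs. non-greedy mode
Greedy mode (*): repeats as many times as possible while the overall match still succeeds.
Non-greedy mode (*?): repeats as few times as possible while the overall match still succeeds.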

    # Non-greedy mode
    html = '![](/images/category.png)![](/images/js_framework.png)'
    pattern = re.compile(r'')
    result = pattern.findall(html)
    print(result)

    # Greedy mode
    html = '![](/images/category.png)![](/images/js_framework.png)'
    pattern = re.compile(r'')
    result = pattern.findall(html)
    print(result)
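The difference shows up clearly on the same input string. The two patterns below are assumptions chosen for illustration; they capture whatever sits between parentheses.

```python
import re

html = '![](/images/category.png)![](/images/js_framework.png)'

# Non-greedy: .*? stops at the first closing parenthesis it can
lazy = re.findall(r'\((.*?)\)', html)
print(lazy)
# ['/images/category.png', '/images/js_framework.png']

# Greedy: .* runs to the last closing parenthesis in the string
greedy = re.findall(r'\((.*)\)', html)
print(greedy)
# ['/images/category.png)![](/images/js_framework.png']
```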

Recognizing email addresses

    html = """
            [email protected], thanks
            [email protected] sorry to trouble you, OP
            [email protected]
            thanks
          """

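    # Group 1 is the whole address, group 2 is the domain name before .com
    # The dot is escaped so that it matches a literal '.' only
    pattern = re.compile(r'(\d+@(\w+)\.com)')
    result = pattern.findall(html)
    print(result)
    for mail in result:
        print(mail[0])  # the first element of each tuple is the full address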

Result:

[('[email protected]', 'qq'), ('[email protected]', 'qq'), ('[email protected]', '163')]
[email protected]
[email protected]
[email protected]
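Because the pattern has two capturing groups, findall returns tuples. If only the full addresses are wanted, a non-capturing group (?:...) is one way to get a flat list. This is a sketch; the sample text and addresses are hypothetical and made up for the example.

```python
import re

# Hypothetical sample text; these addresses are invented for illustration
text = '10001@example.com thanks, 10002@example.com please'

# Two capturing groups: findall returns (address, domain) tuples
pairs = re.findall(r'(\d+@(\w+)\.com)', text)
print(pairs)
# [('10001@example.com', 'example'), ('10002@example.com', 'example')]

# No capturing groups: findall returns the whole matches
flat = re.findall(r'\d+@(?:\w+)\.com', text)
print(flat)
# ['10001@example.com', '10002@example.com']
```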

Differences between re functions

    print('Difference between match and fullmatch:')
    str1 = '[email protected]!!!!'
    pattern1 = re.compile(r'\[email protected]')
    print(pattern1.match(str1).group())  # match only needs a match at the start
    print(pattern1.fullmatch(str1))      # fullmatch needs the whole string to match

    print('Effect of $:')
    str2 = '[email protected]!!!!'
    pattern1 = re.compile(r'\[email protected]')
    pattern2 = re.compile(r'\[email protected]$')
    print(pattern1.match(str2).group())
    print(pattern2.match(str2))          # $ anchors the pattern at the end of the string

    print('Difference between search and match:')
    str3 = '[email protected]'
    pattern3 = re.compile(r'\[email protected]')
    print(pattern3.search(str3).group())  # search scans the whole string
    print(pattern3.match(str3))           # match only tries the start of the string
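The same comparison can be sketched with a self-contained example. The address, the surrounding text and the pattern below are all hypothetical and chosen only to make the behaviour of each function visible.

```python
import re

# Hypothetical pattern and inputs, made up for illustration
pattern = re.compile(r'\d+@\w+\.com')
with_tail = '10001@example.com!!!!'
with_head = 'mail: 10001@example.com'

prefix = pattern.match(with_tail).group()
print(prefix)                                  # 10001@example.com

whole = pattern.fullmatch(with_tail)
print(whole)                                   # None, the trailing !!!! is not covered

anchored = re.match(r'\d+@\w+\.com$', with_tail)
print(anchored)                                # None, $ requires the end of the string

at_start = pattern.match(with_head)
print(at_start)                                # None, match only tries position 0

anywhere = pattern.search(with_head).group()
print(anywhere)                                # 10001@example.com
```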
