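Python: Regular Expressions and Web Crawlers

Regular Expressions

A regular expression (RegEx) is a tool for matching characters: it finds the content you need inside a long string. It is used in many areas, such as web crawling, document cleanup, and data filtering. As a simple example, suppose I need to scrape the title of every page on a website. Page titles usually look like this: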

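   <div class="highlighter-rouge" style="font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> 
    <div class="highlight"> 
     <pre class="highlight" style="color:rgb(253,206,147);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><code style="font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">&lt;title&gt;I am the title&lt;/title&gt;
</code></pre> 
    </div> 
   </div> 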
   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> Every page has a different title, yet with a regular expression I can use one simple pattern to pull the titles out of thousands of pages in one go. Regular expressions cannot be learned and memorized in a day, because the syntax covers a lot of ground. At this stage I strongly suggest you only get a sense of what regular expressions offer, without memorizing anything. When you actually need them, come back and work through the details; that is the time to train yourself to remember these patterns.</p> 
   <h2 class="tut-h2-pad" id="简单的匹配" style="color:rgb(76,176,123);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;"> Simple matching</h2> 
   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> At its core, a regular expression does one thing: it finds specific content in text. In the example below, we check whether “cat” or “bird” appears in the sentence “dog runs to cat”.</p> 
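   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> To make the title idea concrete, here is a minimal sketch. The page sources in the list are made up for illustration (a real crawler would download them), the helper name is hypothetical, and the pattern uses a group and a non-greedy quantifier that this section has not introduced yet, so treat it as a preview rather than the one correct way to do it.</p> 

```python
import re

# Hypothetical page sources; a real crawler would download these instead.
pages = [
    "<html><head><title>First page</title></head><body>...</body></html>",
    "<html><head><title>Second page</title></head><body>...</body></html>",
    "<html><head></head><body>no title here</body></html>",
]


def extract_title(html):
    """Return the text between <title> and </title>, or None if absent."""
    match = re.search(r"<title>(.*?)</title>", html)
    return match.group(1) if match else None


titles = [extract_title(page) for page in pages]
print(titles)  # ['First page', 'Second page', None]
```

   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> The same pattern is applied to every page, which is what makes it practical for large numbers of pages.</p> 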
   <div class="language-python highlighter-rouge" style="font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> 
    <div class="highlight"> 
     <pre class="highlight" style="color:rgb(253,206,147);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><code style="font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><span class="c" style="color:rgb(159,217,159);"># matching string</span>
<span class="n" style="color:rgb(220,220,204);">pattern1</span> <span class="o" style="color:rgb(240,239,208);">=</span> <span class="s">"cat"</span>
<span class="n" style="color:rgb(220,220,204);">pattern2</span> <span class="o" style="color:rgb(240,239,208);">=</span> <span class="s">"bird"</span>
<span class="n" style="color:rgb(220,220,204);">string</span> <span class="o" style="color:rgb(240,239,208);">=</span> <span class="s">"dog runs to cat"</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">pattern1</span> <span class="ow">in</span> <span class="n" style="color:rgb(220,220,204);">string</span><span class="p" style="color:rgb(177,204,250);">)</span>    <span class="c" style="color:rgb(159,217,159);"># True</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">pattern2</span> <span class="ow">in</span> <span class="n" style="color:rgb(220,220,204);">string</span><span class="p" style="color:rgb(177,204,250);">)</span>    <span class="c" style="color:rgb(159,217,159);"># False</span>
</code></pre> 
    </div> 
   </div> 
   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> Regular expressions can do far more than this kind of simple matching, though. To use them, first import Python's standard-library module <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">re</code>. We then repeat the steps above, this time with a regular expression. As the output shows, when <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">re.search()</code> finds a result it returns a match object, and when nothing matches it returns None. The output below comes from an older Python version that prints the object as _sre.SRE_Match; Python 3.7 and later print it as re.Match. <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">re.search()</code> is only one of the functions in <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">re</code>; others are introduced later.</p> 
   <div class="language-python highlighter-rouge" style="font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> 
    <div class="highlight"> 
     <pre class="highlight" style="color:rgb(253,206,147);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><code style="font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><span class="kn" style="color:rgb(223,175,143);font-weight:700;">import</span> <span class="nn" style="color:rgb(143,190,222);">re</span>

<span class="c" style="color:rgb(159,217,159);"># regular expression</span>
<span class="n" style="color:rgb(220,220,204);">pattern1</span> <span class="o" style="color:rgb(240,239,208);">=</span> <span class="s">"cat"</span>
<span class="n" style="color:rgb(220,220,204);">pattern2</span> <span class="o" style="color:rgb(240,239,208);">=</span> <span class="s">"bird"</span>
<span class="n" style="color:rgb(220,220,204);">string</span> <span class="o" style="color:rgb(240,239,208);">=</span> <span class="s">"dog runs to cat"</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">search</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">pattern1</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="n" style="color:rgb(220,220,204);">string</span><span class="p" style="color:rgb(177,204,250);">))</span>  <span class="c" style="color:rgb(159,217,159);"># <_sre.SRE_Match object; span=(12, 15), match='cat'></span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">search</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">pattern2</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="n" style="color:rgb(220,220,204);">string</span><span class="p" style="color:rgb(177,204,250);">))</span>  <span class="c" style="color:rgb(159,217,159);"># None</span></code></pre> 
    </div> 
   </div> 
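   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> Printing the match object is rarely what a program needs. A small sketch of how the result is usually consumed: check for None first, then read the matched text with group() and its position with span(). The variable names are only for illustration.</p> 

```python
import re

match = re.search(r"cat", "dog runs to cat")
if match is not None:
    found = match.group()      # the text that matched
    start, end = match.span()  # start index and end index (end is exclusive)
    print(found, start, end)   # cat 12 15

missing = re.search(r"bird", "dog runs to cat")
print(missing is None)         # True
```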
   <h2 class="tut-h2-pad" id="灵活匹配" style="color:rgb(76,176,123);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;"> Flexible matching</h2> 
   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> Beyond simple matching, the real core of regular expressions is using special patterns to flexibly match the text you are looking for.</p> 
   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> To match any one of several possible characters, wrap the candidates in <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">[]</code>. For example, <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">[ab]</code> means the character can be either <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">a</code> or <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">b</code>. Also note that when writing a pattern we put an <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">r</code> in front of the pattern's quotation marks. The prefix marks a raw string, not a regular expression as such: Python leaves the backslashes in it untouched, so the pattern reaches the regex engine exactly as written. With the pattern below, both “run” and “ran” are found if either appears in the string.</p> 
   <div class="language-python highlighter-rouge" style="font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> 
    <div class="highlight"> 
     <pre class="highlight" style="color:rgb(253,206,147);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><code style="font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><span class="c" style="color:rgb(159,217,159);"># multiple patterns ("run" or "ran")</span>
<span class="n" style="color:rgb(220,220,204);">ptn</span> <span class="o" style="color:rgb(240,239,208);">=</span> <span class="s">r"r[au]n"</span>       <span class="c" style="color:rgb(159,217,159);"># r prefix: raw string (backslashes are not treated as escapes)</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">search</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">ptn</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="s">"dog runs to cat"</span><span class="p" style="color:rgb(177,204,250);">))</span>    <span class="c" style="color:rgb(159,217,159);"># <_sre.SRE_Match object; span=(4, 7), match='run'></span>
</code></pre> 
    </div> 
   </div> 
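   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> Why the r prefix matters is easiest to see with a backslash in the pattern. In the sketch below, the same characters are written once as a normal string and once as a raw string; only the raw version reaches the regex engine as a word boundary. The variable names are illustrative.</p> 

```python
import re

# In a normal string "\b" is one character (backspace);
# in a raw string it stays as two characters: a backslash and "b".
normal_len = len("\b")
raw_len = len(r"\b")
print(normal_len, raw_len)  # 1 2

# Without r, the pattern looks for backspace characters around "cat".
plain = re.search("\bcat\b", "dog runs to cat")
# With r, the regex engine sees \b and treats it as a word boundary.
raw = re.search(r"\bcat\b", "dog runs to cat")

print(plain)        # None
print(raw.group())  # cat
```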
   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> Square brackets <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">[]</code> can also hold ranges, or combinations of ranges. For example, <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">[A-Z]</code> stands for any uppercase English letter, and <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">[0-9a-z]</code> stands for any digit or any lowercase letter.</p> 
   <div class="language-python highlighter-rouge" style="font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> 
    <div class="highlight"> 
     <pre class="highlight" style="color:rgb(253,206,147);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><code style="font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">search</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"r[A-Z]n"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="s">"dog runs to cat"</span><span class="p" style="color:rgb(177,204,250);">))</span>     <span class="c" style="color:rgb(159,217,159);"># None</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">search</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"r[a-z]n"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="s">"dog runs to cat"</span><span class="p" style="color:rgb(177,204,250);">))</span>     <span class="c" style="color:rgb(159,217,159);"># <_sre.SRE_Match object; span=(4, 7), match='run'></span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">search</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"r[0-9]n"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="s">"dog r2ns to cat"</span><span class="p" style="color:rgb(177,204,250);">))</span>     <span class="c" style="color:rgb(159,217,159);"># <_sre.SRE_Match object; span=(4, 7), match='r2n'></span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">search</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"r[0-9a-z]n"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="s">"dog runs to cat"</span><span class="p" style="color:rgb(177,204,250);">))</span>  <span class="c" style="color:rgb(159,217,159);"># <_sre.SRE_Match object; span=(4, 7), match='run'></span></code></pre> 
    </div> 
   </div> 
   <h2 class="tut-h2-pad" id="按类型匹配" style="color:rgb(76,176,123);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;"> Matching by type</h2> 
   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> Besides rules you define yourself, many matching rules are predefined. Here is a summary of the special matching types, followed by some examples.</p> 
   <ul style="color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> 
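    <li>\d : any digit</li> 
    <li>\D : any non-digit</li> 
    <li>\s : any whitespace character, such as [ \t\n\r\f\v]</li> 
    <li>\S : any non-whitespace character</li> 
    <li>\w : any letter, digit, or “_”, that is [a-zA-Z0-9_] for ASCII text</li> 
    <li>\W : anything that \w does not match</li> 
    <li>\b : an empty string (<strong>only</strong> at the start or end of a word)</li> 
    <li>\B : an empty string (<strong>not</strong> at the start or end of a word)</li> 
    <li>\\ : matches \</li> 
    <li>. : matches any character (except \n)</li> 
    <li>^ : matches the beginning</li> 
    <li>$ : matches the end</li> 
    <li>? : the preceding character or group is optional</li> 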
   </ul> 
   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> Concrete examples follow.</p> 
   <div class="language-python highlighter-rouge" style="font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> 
    <div class="highlight"> 
     <pre class="highlight" style="color:rgb(253,206,147);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><code style="font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><span class="c" style="color:rgb(159,217,159);"># \d : decimal digit</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">search</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"r\dn"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="s">"run r4n"</span><span class="p" style="color:rgb(177,204,250);">))</span>           <span class="c" style="color:rgb(159,217,159);"># <_sre.SRE_Match object; span=(4, 7), match='r4n'></span>
<span class="c" style="color:rgb(159,217,159);"># \D : any non-decimal digit</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">search</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"r\Dn"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="s">"run r4n"</span><span class="p" style="color:rgb(177,204,250);">))</span>           <span class="c" style="color:rgb(159,217,159);"># <_sre.SRE_Match object; span=(0, 3), match='run'></span>
<span class="c" style="color:rgb(159,217,159);"># \s : any white space [ \t\n\r\f\v]</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">search</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"r\sn"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="s">"r</span><span class="se">\n</span><span class="s">n r4n"</span><span class="p" style="color:rgb(177,204,250);">))</span>          <span class="c" style="color:rgb(159,217,159);"># <_sre.SRE_Match object; span=(0, 3), match='r\nn'></span>
<span class="c" style="color:rgb(159,217,159);"># \S : opposite to \s, any non-white space</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">search</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"r\Sn"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="s">"r</span><span class="se">\n</span><span class="s">n r4n"</span><span class="p" style="color:rgb(177,204,250);">))</span>          <span class="c" style="color:rgb(159,217,159);"># <_sre.SRE_Match object; span=(4, 7), match='r4n'></span>
<span class="c" style="color:rgb(159,217,159);"># \w : [a-zA-Z0-9_]</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">search</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"r\wn"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="s">"r</span><span class="se">\n</span><span class="s">n r4n"</span><span class="p" style="color:rgb(177,204,250);">))</span>          <span class="c" style="color:rgb(159,217,159);"># <_sre.SRE_Match object; span=(4, 7), match='r4n'></span>
<span class="c" style="color:rgb(159,217,159);"># \W : opposite to \w</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">search</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"r\Wn"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="s">"r</span><span class="se">\n</span><span class="s">n r4n"</span><span class="p" style="color:rgb(177,204,250);">))</span>          <span class="c" style="color:rgb(159,217,159);"># <_sre.SRE_Match object; span=(0, 3), match='r\nn'></span>
<span class="c" style="color:rgb(159,217,159);"># \b : empty string (only at the start or end of the word)</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">search</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"\bruns\b"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="s">"dog runs to cat"</span><span class="p" style="color:rgb(177,204,250);">))</span>    <span class="c" style="color:rgb(159,217,159);"># <_sre.SRE_Match object; span=(4, 8), match='runs'></span>
<span class="c" style="color:rgb(159,217,159);"># \B : empty string (but not at the start or end of a word)</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">search</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"\B runs \B"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="s">"dog   runs  to cat"</span><span class="p" style="color:rgb(177,204,250);">))</span>  <span class="c" style="color:rgb(159,217,159);"># <_sre.SRE_Match object; span=(5, 11), match=' runs '></span>
<span class="c" style="color:rgb(159,217,159);"># \\ : match \</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">search</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"runs\\"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="s">r"runs\ to me"</span><span class="p" style="color:rgb(177,204,250);">))</span>     <span class="c" style="color:rgb(159,217,159);"># <_sre.SRE_Match object; span=(0, 5), match='runs\\'></span>
<span class="c" style="color:rgb(159,217,159);"># . : match anything (except \n)</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">search</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"r.n"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="s">"r[ns to me"</span><span class="p" style="color:rgb(177,204,250);">))</span>         <span class="c" style="color:rgb(159,217,159);"># <_sre.SRE_Match object; span=(0, 3), match='r[n'></span>
<span class="c" style="color:rgb(159,217,159);"># ^ : match line beginning</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">search</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"^dog"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="s">"dog runs to cat"</span><span class="p" style="color:rgb(177,204,250);">))</span>   <span class="c" style="color:rgb(159,217,159);"># <_sre.SRE_Match object; span=(0, 3), match='dog'></span>
<span class="c" style="color:rgb(159,217,159);"># $ : match line ending</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">search</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"cat$"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="s">"dog runs to cat"</span><span class="p" style="color:rgb(177,204,250);">))</span>   <span class="c" style="color:rgb(159,217,159);"># <_sre.SRE_Match object; span=(12, 15), match='cat'></span>
<span class="c" style="color:rgb(159,217,159);"># ? : may or may not occur</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">search</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"Mon(day)?"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="s">"Monday"</span><span class="p" style="color:rgb(177,204,250);">))</span>       <span class="c" style="color:rgb(159,217,159);"># <_sre.SRE_Match object; span=(0, 6), match='Monday'></span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">search</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"Mon(day)?"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="s">"Mon"</span><span class="p" style="color:rgb(177,204,250);">))</span>          <span class="c" style="color:rgb(159,217,159);"># <_sre.SRE_Match object; span=(0, 3), match='Mon'></span>
</code></pre> 
    </div> 
   </div> 
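   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> These pieces become more useful in combination. The sketch below shows two common cases under made-up inputs: \b keeps a word pattern from matching inside a longer word, and ^ together with $ forces a pattern to cover the whole string. The sample strings and variable names are illustrative only.</p> 

```python
import re

text = "concatenate the cat"

loose = re.search(r"cat", text)       # first hit is inside "concatenate"
strict = re.search(r"\bcat\b", text)  # only the standalone word "cat"
print(loose.span())   # (3, 6)
print(strict.span())  # (16, 19)

# ^ and $ anchor both ends, and "u?" makes the "u" optional.
pattern = r"^colou?r$"
results = [bool(re.search(pattern, word)) for word in ("color", "colour", "colourful")]
print(results)        # [True, True, False]
```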
   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> When a string spans several lines and we want <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">^</code> to match characters at the start of a line, the default behavior does not work. Below, “I” sits at the start of the second line of text, yet <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">r"^I"</code> does not match it, because by default the anchor only matches at the very start of the string. To have <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">re.search()</code> treat each line separately, pass one more argument: <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">flags=re.M</code>, or equivalently <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">flags=re.MULTILINE</code>.</p> 
   <div class="language-python highlighter-rouge" style="font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> 
    <div class="highlight"> 
     <pre class="highlight" style="color:rgb(253,206,147);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><code style="font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><span class="n" style="color:rgb(220,220,204);">string</span> <span class="o" style="color:rgb(240,239,208);">=</span> <span class="s">"""
dog runs to cat.
I run to dog.
"""</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">search</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"^I"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="n" style="color:rgb(220,220,204);">string</span><span class="p" style="color:rgb(177,204,250);">))</span>                 <span class="c" style="color:rgb(159,217,159);"># None</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">search</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"^I"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="n" style="color:rgb(220,220,204);">string</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="n" style="color:rgb(220,220,204);">flags</span><span class="o" style="color:rgb(240,239,208);">=</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">M</span><span class="p" style="color:rgb(177,204,250);">))</span>     <span class="c" style="color:rgb(159,217,159);"># <_sre.SRE_Match object; span=(18, 19), match='I'></span>
</code></pre> 
    </div> 
   </div> 
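   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> The same flag also changes how $ behaves. A short sketch, using a two-line string similar to the one above (without the leading newline, so the indices differ) and illustrative variable names:</p> 

```python
import re

text = "dog runs to cat.\nI run to dog."

# Default: "$" does not match at the end of the first line here.
no_flag = re.search(r"cat\.$", text)
# re.M: "$" also matches right before each newline.
with_flag = re.search(r"cat\.$", text, flags=re.M)

print(no_flag)            # None
print(with_flag.span())   # (12, 16)

# re.MULTILINE is the long spelling of re.M.
line_start = re.search(r"^I", text, flags=re.MULTILINE)
print(line_start.span())  # (17, 18)
```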
   <h2 class="tut-h2-pad" id="重复匹配" style="color:rgb(76,176,123);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;"> Repeated matching</h2> 
   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> A pattern can also be repeated, and regular expressions offer several ways to do it:</p> 
   <ul style="color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> 
    <li><code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">*</code> : repeat zero or more times</li> 
    <li><code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">+</code> : repeat one or more times</li> 
    <li><code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">{n,m}</code> : repeat n to m times (written without a space after the comma)</li> 
    <li><code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">{n}</code> : repeat exactly n times</li> 
   </ul> 
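   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> As a quick sketch of the two counted forms in the list above, the snippet below repeats a single character and then a whole group. The test strings and variable names are made up for illustration; note that the quantifier is greedy, so it takes as many repetitions as the upper bound allows.</p> 

```python
import re

text = "abbbbbb"  # one "a" followed by six "b"

up_to_four = re.search(r"ab{2,4}", text).group()
exactly_three = re.search(r"ab{3}", text).group()
too_few = re.search(r"ab{2,4}", "ab")

print(up_to_four)     # abbbb (greedy: stops at the upper bound of 4)
print(exactly_three)  # abbb
print(too_few)        # None (only one "b", but at least two are required)

# Parentheses let a quantifier repeat a whole group instead of one character.
grouped = re.search(r"(ab){2}", "ababab").group()
print(grouped)        # abab
```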
   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> Examples:</p> 
   <div class="language-python highlighter-rouge" style="font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> 
    <div class="highlight"> 
     <pre class="highlight" style="color:rgb(253,206,147);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><code style="font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><span class="c" style="color:rgb(159,217,159);"># * : occur 0 or more times</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">search</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"ab*"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="s">"a"</span><span class="p" style="color:rgb(177,204,250);">))</span>             <span class="c" style="color:rgb(159,217,159);"># <_sre.SRE_Match object; span=(0, 1), match='a'></span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">search</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"ab*"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="s">"abbbbb"</span><span class="p" style="color:rgb(177,204,250);">))</span>        <span class="c" style="color:rgb(159,217,159);"># <_sre.SRE_Match object; span=(0, 6), match='abbbbb'></span>

<span class="c" style="color:rgb(159,217,159);"># + : occur 1 or more times</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">search</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"ab+"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="s">"a"</span><span class="p" style="color:rgb(177,204,250);">))</span>             <span class="c" style="color:rgb(159,217,159);"># None</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">search</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"ab+"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="s">"abbbbb"</span><span class="p" style="color:rgb(177,204,250);">))</span>        <span class="c" style="color:rgb(159,217,159);"># <_sre.SRE_Match object; span=(0, 6), match='abbbbb'></span>

<span class="c" style="color:rgb(159,217,159);"># {n,m} : occur n to m times</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">search</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"ab{2,10}"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="s">"a"</span><span class="p" style="color:rgb(177,204,250);">))</span>        <span class="c" style="color:rgb(159,217,159);"># None</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">search</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"ab{2,10}"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="s">"abbbbb"</span><span class="p" style="color:rgb(177,204,250);">))</span>   <span class="c" style="color:rgb(159,217,159);"># <_sre.SRE_Match object; span=(0, 6), match='abbbbb'></span>
</code></pre> 
    </div> 
   </div> 
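   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> The list above also includes <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">{n}</code>, which the examples do not cover. The following is a minimal sketch of how an exact count behaves; the test strings are made up for illustration, and <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">re.fullmatch</code> is used only to show the difference from <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">re.search</code>.</p> 

```python
import re

# {n} : occur exactly n times
exact = re.search(r"ab{3}", "abbbbb")
print(exact.group())                      # abbb

# too few b's: no match at all
too_few = re.search(r"ab{3}", "abb")
print(too_few)                            # None

# fullmatch() requires the whole string to fit the pattern
whole = re.fullmatch(r"ab{3}", "abbbbb")
print(whole)                              # None
```

   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> Note that <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">re.search</code> is satisfied with the first three b's, whereas <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">re.fullmatch</code> rejects the string because two b's are left over.</p> 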
   <h2 class="tut-h2-pad" id="分组" style="color:rgb(76,176,123);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;"> 分组</h2> 
   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> We can also split what we find into groups, simply by wrapping parts of the pattern in <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">()</code>. Groups make it easy to pick out individual pieces of a match. In the example below, the <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">(\d+)</code> group captures a run of digits, and the <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">(.+)</code> group captures everything after "Date: ". Calling <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">match.group()</code> with no argument returns the whole matched text, while passing a number, as in <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">.group(2)</code>, returns only the content of that group.</p> 
   <div class="language-python highlighter-rouge" style="font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> 
    <div class="highlight"> 
     <pre class="highlight" style="color:rgb(253,206,147);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><code style="font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><span class="n" style="color:rgb(220,220,204);">match</span> <span class="o" style="color:rgb(240,239,208);">=</span> <span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">search</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"(\d+), Date: (.+)"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="s">"ID: 021523, Date: Feb/12/2017"</span><span class="p" style="color:rgb(177,204,250);">)</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">match</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">group</span><span class="p" style="color:rgb(177,204,250);">())</span>                   <span class="c" style="color:rgb(159,217,159);"># 021523, Date: Feb/12/2017</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">match</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">group</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="mi">1</span><span class="p" style="color:rgb(177,204,250);">))</span>                  <span class="c" style="color:rgb(159,217,159);"># 021523</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">match</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">group</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="mi">2</span><span class="p" style="color:rgb(177,204,250);">))</span>                  <span class="c" style="color:rgb(159,217,159);"># Feb/12/2017</span>
</code></pre> 
    </div> 
   </div> 
   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> Sometimes there are many groups, and finding the one you want by number alone gets awkward. Having a name to use as the index makes this much easier. We only need to write <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">?P<name></code> at the start of the parentheses to give the group a name, and can then look up the group's content by that name.</p> 
   <div class="language-python highlighter-rouge" style="font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> 
    <div class="highlight"> 
     <pre class="highlight" style="color:rgb(253,206,147);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><code style="font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><span class="n" style="color:rgb(220,220,204);">match</span> <span class="o" style="color:rgb(240,239,208);">=</span> <span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">search</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"(?P<id>\d+), Date: (?P<date>.+)"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="s">"ID: 021523, Date: Feb/12/2017"</span><span class="p" style="color:rgb(177,204,250);">)</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">match</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">group</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">'id'</span><span class="p" style="color:rgb(177,204,250);">))</span>                <span class="c" style="color:rgb(159,217,159);"># 021523</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">match</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">group</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">'date'</span><span class="p" style="color:rgb(177,204,250);">))</span>              <span class="c" style="color:rgb(159,217,159);"># Feb/12/2017</span></code></pre> 
    </div> 
   </div> 
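   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> When all groups are needed at once, the match object can also return them together. The sketch below reuses the pattern and string from the example above; the variable names <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">record_id</code> and <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">date</code> are only illustrative.</p> 

```python
import re

match = re.search(r"(?P<id>\d+), Date: (?P<date>.+)", "ID: 021523, Date: Feb/12/2017")

if match:                          # re.search returns None when nothing matches
    print(match.groups())          # ('021523', 'Feb/12/2017')
    print(match.groupdict())       # {'id': '021523', 'date': 'Feb/12/2017'}

    # tuple unpacking gives each group its own variable
    record_id, date = match.groups()
    print(record_id, date)         # 021523 Feb/12/2017
```

   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> Checking the match before calling its methods avoids an <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">AttributeError</code> when the pattern is not found.</p> 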
   <h2 class="tut-h2-pad" id="findall" style="color:rgb(76,176,123);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;"> findall</h2> 
   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> Everything so far returns only the first match. To get every match, use <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">findall</code>, which returns them as a list. The example below also introduces one new symbol: <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">|</code> means "or", matching either the alternative before it or the one after it.</p> 
   <div class="language-python highlighter-rouge" style="font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> 
    <div class="highlight"> 
     <pre class="highlight" style="color:rgb(253,206,147);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><code style="font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><span class="c" style="color:rgb(159,217,159);"># findall</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">findall</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"r[ua]n"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="s">"run ran ren"</span><span class="p" style="color:rgb(177,204,250);">))</span>    <span class="c" style="color:rgb(159,217,159);"># ['run', 'ran']</span>

<span class="c" style="color:rgb(159,217,159);"># | : or</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">findall</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"(run|ran)"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="s">"run ran ren"</span><span class="p" style="color:rgb(177,204,250);">))</span> <span class="c" style="color:rgb(159,217,159);"># ['run', 'ran']</span>
</code></pre> 
    </div> 
   </div> 
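   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> One detail worth knowing is that what <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">findall</code> returns depends on the groups in the pattern. The sketch below illustrates this; the <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">width=650 height=1398</code> string is made up for the example.</p> 

```python
import re

text = "run ran ren"

# non-capturing group (?:...): findall returns the whole match
whole = re.findall(r"r(?:u|a)n", text)
print(whole)          # ['run', 'ran']

# one capturing group: findall returns only the group's text
inner = re.findall(r"r(u|a)n", text)
print(inner)          # ['u', 'a']

# several capturing groups: findall returns a list of tuples
pairs = re.findall(r"(\w+)=(\d+)", "width=650 height=1398")
print(pairs)          # [('width', '650'), ('height', '1398')]
```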
   <h2 class="tut-h2-pad" id="replace" style="color:rgb(76,176,123);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;"> replace</h2> 
   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> We can also use a regular expression to match strings of a certain form and then replace them. Replacing with <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">re.sub()</code> is far more flexible than Python's built-in <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">str.replace()</code>, because the text to replace is described by a pattern rather than a fixed string.</p> 
   <div class="language-python highlighter-rouge" style="font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> 
    <div class="highlight"> 
     <pre class="highlight" style="color:rgb(253,206,147);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><code style="font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">sub</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"r[au]ns"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="s">"catches"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="s">"dog runs to cat"</span><span class="p" style="color:rgb(177,204,250);">))</span>     <span class="c" style="color:rgb(159,217,159);"># dog catches to cat</span>
</code></pre> 
    </div> 
   </div> 
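   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> To give a feel for that flexibility, here is a small sketch of three <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">re.sub()</code> features: group references in the replacement, a function as the replacement, and the <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">count</code> argument. The helper <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">shout</code> and the sample strings are invented for illustration.</p> 

```python
import re

# \1, \2, \3 in the replacement refer to the captured groups
reordered = re.sub(r"(\d+)/(\d+)/(\d+)", r"\3-\1-\2", "Date: 02/12/2017")
print(reordered)      # Date: 2017-02-12


# the replacement can be a function that receives each match object
def shout(match):
    return match.group().upper()


loud = re.sub(r"r[au]ns?", shout, "dog runs, cat ran")
print(loud)           # dog RUNS, cat RAN

# count limits how many replacements are made
limited = re.sub(r"r[au]n", "walk", "run ran run", count=2)
print(limited)        # walk walk run
```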
   <h2 class="tut-h2-pad" id="split" style="color:rgb(76,176,123);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;"> split</h2> 
   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> Python strings also have a built-in split feature, useful for example to get all the words in a sentence: <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">"a is b".split(" ")</code> produces a list holding every word. Regular expressions take this ordinary splitting much further, for instance by splitting on several different separators at once.</p> 
   <div class="language-python highlighter-rouge" style="font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> 
    <div class="highlight"> 
     <pre class="highlight" style="color:rgb(253,206,147);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><code style="font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">split</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"[,;\.]"</span><span class="p" style="color:rgb(177,204,250);">,</span> <span class="s">"a;b,c.d;e"</span><span class="p" style="color:rgb(177,204,250);">))</span>             <span class="c" style="color:rgb(159,217,159);"># ['a', 'b', 'c', 'd', 'e']</span>
</code></pre> 
    </div> 
   </div> 
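   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> A few further variations on <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">re.split()</code> are sketched below, with made-up input strings: tolerating whitespace around separators, limiting the number of splits with <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">maxsplit</code>, and keeping the separators by capturing them.</p> 

```python
import re

# separators may be surrounded by optional whitespace
parts = re.split(r"\s*[,;.]\s*", "a ; b, c . d")
print(parts)    # ['a', 'b', 'c', 'd']

# maxsplit stops after the given number of splits
head = re.split(r"[,;.]", "a;b,c.d;e", maxsplit=2)
print(head)     # ['a', 'b', 'c.d;e']

# a capturing group keeps the separators in the result
kept = re.split(r"([,;.])", "a;b,c")
print(kept)     # ['a', ';', 'b', ',', 'c']
```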
   <h2 class="tut-h2-pad" id="compile" style="color:rgb(76,176,123);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;"> compile</h2> 
   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> Finally, a regular expression can be compiled so that it can be reused. Compile the pattern into a variable, say <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">compiled_re</code>, and then search with <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">compiled_re</code> directly.</p> 
   <div class="language-python highlighter-rouge" style="font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> 
    <div class="highlight"> 
     <pre class="highlight" style="color:rgb(253,206,147);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><code style="font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><span class="n" style="color:rgb(220,220,204);">compiled_re</span> <span class="o" style="color:rgb(240,239,208);">=</span> <span class="n" style="color:rgb(220,220,204);">re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="nb" style="color:rgb(252,140,72);">compile</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">r"r[ua]n"</span><span class="p" style="color:rgb(177,204,250);">)</span>
<span class="k" style="color:rgb(252,233,72);">print</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="n" style="color:rgb(220,220,204);">compiled_re</span><span class="o" style="color:rgb(240,239,208);">.</span><span class="n" style="color:rgb(220,220,204);">search</span><span class="p" style="color:rgb(177,204,250);">(</span><span class="s">"dog ran to cat"</span><span class="p" style="color:rgb(177,204,250);">))</span>  <span class="c" style="color:rgb(159,217,159);"># <_sre.SRE_Match object; span=(4, 7), match='ran'></span>
</code></pre> 
    </div> 
   </div> 
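   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> Reuse is where compiling pays off. As a rough sketch, the same <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">compiled_re</code> can be applied to several strings in a loop; the <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">sentences</code> list is invented for the example.</p> 

```python
import re

compiled_re = re.compile(r"r[ua]n")

sentences = ["dog ran to cat", "cat will run", "bird sings"]
results = []
for sentence in sentences:
    found = compiled_re.search(sentence)
    results.append(found.group() if found else None)
    print(sentence, "->", results[-1])
# dog ran to cat -> ran
# cat will run -> run
# bird sings -> None

# compiled patterns offer the same methods as the re module
print(compiled_re.findall("run ran ren"))          # ['run', 'ran']
print(compiled_re.sub("walk", "dog ran to cat"))   # dog walk to cat
```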
   <h2 class="tut-h2-pad" id="小抄" style="color:rgb(76,176,123);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;"> Cheat sheet</h2> 
   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> To make all of this easier to remember, here is a cheat sheet that I found online a long time ago. It is handy to come back to whenever you forget something.</p> 
   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> <a href="http://img.e-com-net.com/image/info8/0d72233d8bba41008ea71e848999ddf3.jpg" target="_blank"><img class="course-image lazy-img" src="http://img.e-com-net.com/image/info8/0d72233d8bba41008ea71e848999ddf3.jpg" alt="python之正则表达式以及网络爬虫_第1张图片" title="正则表达式-0" style="display:block;;border:1px solid black;" width="650" height="1398"></a></p> 
   <br> 
  </div> 
  <div> 
   <br> 
  </div> 
  <div> 
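   <h1 style="color:rgb(76,176,123);font-size:2.5em;text-align:center;font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;"> Understanding web page structure</h1> 
   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> To learn web scraping, the first thing to understand is the web page. What holds up all those polished pages is nothing but code, and that code is called HTML. HTML is a language that browsers (Chrome, Safari, IE, Firefox and so on) understand and turn into the pages we see. HTML therefore follows regular patterns, and a scraper can rely on those patterns to collect the information you need.</p> 
   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> Besides HTML, CSS and JavaScript also help build colorful, feature-rich pages. This simple scraping tutorial works with HTML most of the time; CSS and JavaScript are introduced briefly later on, because scraping pages inevitably means dealing with them to some extent.</p> 
   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> Although 莫烦Python focuses on machine learning tutorials, this scraping tutorial is for anyone who wants to learn scraping. From a machine learning point of view, much of the data that machine learning needs can come from web pages: scrape the information from various pages, then feed it into machine learning methods. This workflow is being adopted more and more. So if your data is scattered across many web pages, scraping is a must-learn skill for cutting down manual work.</p> 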
   <h2 class="tut-h2-pad" id="网络基本组成部分" style="color:rgb(76,176,123);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;"> Basic building blocks of a web page</h2> 
   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> Before getting into scraping proper, let's warm up by getting clear on the basics of a web page: what parts HTML consists of and how it works. If you already know page structure well, feel free to skip this section and move on.</p> 
   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> I made a very simple web page to show the barest possible HTML structure. If you open it, what you see is the top half of the image below; the bottom half is the HTML code behind the page.</p> 
   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> <img class="course-image lazy-img" src="http://img.e-com-net.com/image/info8/32be9830773a450d80d17e064fb9025f.jpg" alt="了解网页结构-0" title="了解网页结构-0" style="display:block;" width="0" height="0"></p> 
   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> Wondering how I got to see the HTML source code? It is simple: right-click anywhere in the page area of your browser, and most browsers offer an option along the lines of "View Page Source". Click it to see the page's source.</p> 
   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> <img class="course-image lazy-img" src="http://img.e-com-net.com/image/info8/920ade1d9f25450b9b307791c9d2b663.jpg" alt="了解网页结构-1" title="了解网页结构-1" style="display:block;" width="0" height="0"></p> 
   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> In HTML, essentially all real content is wrapped in a tag, and tagged content can be displayed in different forms or given different functions. The top-level tags split the page into two parts, <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">head</code> and <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">body</code>. The <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">head</code> holds the page's metadata, such as the <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">title</code>. This information is not shown in the page content you see; most of the time it is there for the browser or for search engine crawlers.</p> 
   <div class="language-html highlighter-rouge" style="font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> 
    <div class="highlight"> 
     <pre class="highlight" style="color:rgb(253,206,147);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><code style="font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><span class="nt"><head></span>
	<span class="nt"><meta</span> <span class="na" style="color:rgb(154,195,159);">charset=</span><span class="s">"UTF-8"</span><span class="nt">></span>
	<span class="nt"><title></span>Scraping tutorial 1 | 莫烦Python<span class="nt"></title></span>
	<span class="nt"><link</span> <span class="na" style="color:rgb(154,195,159);">rel=</span><span class="s">"icon"</span> <span class="na" style="color:rgb(154,195,159);">href=</span><span class="s">"https://morvanzhou.github.io/static/img/description/tab_icon.png"</span><span class="nt">></span>
<span class="nt"></head></span>
</code></pre> 
    </div> 
   </div> 

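   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> The second major part of HTML is the <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">body</code>, which holds what you actually see on the page: headings, videos, images, text and so on. Here the <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><h1></code> tag is the main heading, which is rendered as larger text. The text inside <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><p></code> is a paragraph, and the <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><a></code> tags inside it are links. In most cases, the content you want sits inside tags like these.</p> 
   <div class="language-html highlighter-rouge" style="font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> 
    <div class="highlight"> 
     <pre class="highlight" style="color:rgb(253,206,147);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><code style="font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><span class="nt"><body></span>
	<span class="nt"><h1></span>爬虫测试1<span class="nt"></h1></span>
	<span class="nt"><p></span>
		这是一个在 <span class="nt"><a</span> <span class="na" style="color:rgb(154,195,159);">href=</span><span class="s">"https://morvanzhou.github.io/"</span><span class="nt">></span>莫烦Python<span class="nt"></a></span>
		<span class="nt"><a</span> <span class="na" style="color:rgb(154,195,159);">href=</span><span class="s">"https://morvanzhou.github.io/tutorials/scraping"</span><span class="nt">></span>爬虫教程<span class="nt"></a></span> 中的简单测试.
	<span class="nt"></p></span>
<span class="nt"></body></span>
</code></pre> 
    </div> 
   </div> 
   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> What a scraper does is use these tags to locate the information it needs.</p> 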

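   <h2 class="tut-h2-pad" style="color:rgb(76,176,123);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;"> Opening a web page with Python</h2> 
   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> Now that we have a basic picture of page structure and HTML, we can use Python to pull some basic information out of this page. The first step is to open the page from Python and print its HTML source code. Note that the page contains Chinese text: <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">read()</code> returns raw bytes, so we need to <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">decode()</code> them into a string for the Chinese to display correctly.</p> 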

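   <div class="language-python highlighter-rouge" style="font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> 
    <div class="highlight"> 
     <pre class="highlight" style="color:rgb(253,206,147);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><code style="font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">from urllib.request import urlopen

# if has Chinese, apply decode()
html = urlopen(
    "https://morvanzhou.github.io/static/scraping/basic-structure.html"
).read().decode('utf-8')
print(html)
</code></pre> 
    </div> 
   </div> 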

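   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> Printing it gives the output below, which shows that we can read the whole page. We have not yet organized or used this information, though. To extract a particular kind of information, making good use of tag names is essential.</p> 
   <div class="language-html highlighter-rouge" style="font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> 
    <div class="highlight"> 
     <pre class="highlight" style="color:rgb(253,206,147);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><code style="font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><!DOCTYPE html>
<html lang="cn">
<head>
	<meta charset="UTF-8">
	<title>Scraping tutorial 1 | 莫烦Python</title>
	<link rel="icon" href="https://morvanzhou.github.io/static/img/description/tab_icon.png">
</head>
<body>
	<h1>爬虫测试1</h1>
	<p>
		这是一个在 <a href="https://morvanzhou.github.io/">莫烦Python</a>
		<a href="https://morvanzhou.github.io/tutorials/scraping">爬虫教程</a> 中的简单测试.
	</p>

</body>
</html>
</code></pre> 
    </div> 
   </div> 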
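   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> The one-line <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">urlopen(...).read().decode('utf-8')</code> call assumes that the request succeeds and that the page really is UTF-8. A slightly more defensive sketch might look like the following. The function name <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">fetch_html</code> and its parameters are assumptions made for this example, not part of the tutorial's code.</p> 

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10, fallback_encoding="utf-8"):
    """Return the decoded text of a page, or None if the request fails.

    fetch_html, timeout and fallback_encoding are illustrative names.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            raw = response.read()
            # prefer the charset declared in the Content-Type header, if any
            encoding = response.headers.get_content_charset() or fallback_encoding
    except (OSError, ValueError) as error:
        # URLError and HTTPError are subclasses of OSError;
        # ValueError is raised for a malformed URL
        print("Request failed:", error)
        return None
    # errors="replace" keeps going even if a few bytes cannot be decoded
    return raw.decode(encoding, errors="replace")
```

   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> Calling <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">fetch_html("https://morvanzhou.github.io/static/scraping/basic-structure.html")</code> should give the same text as the one-liner when the page is reachable, and <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">None</code> otherwise.</p> 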

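   <h2 class="tut-h2-pad" style="color:rgb(76,176,123);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;"> Matching page content</h2> 
   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> So here we use Python's regular expressions (RegEx) to match text and filter information. I have a fairly good tutorial on regular expressions. For basic page matching, regular expressions are entirely sufficient; for more advanced or fiddly matching I still recommend BeautifulSoup. No rush: I know you want the shortcut, and BeautifulSoup is coming up right after this. For now, let's work through a few simple examples with regular expressions so you get used to the routine.</p> 
   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> To find the page's title in code, we can write the following: pick the tag name to use, <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><title></code>, and match it with a regular expression.</p> 
   <div class="language-python highlighter-rouge" style="font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> 
    <div class="highlight"> 
     <pre class="highlight" style="color:rgb(253,206,147);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><code style="font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">import re

res = re.findall(r"<title>(.+?)</title>", html)
print("\nPage title is: ", res[0])

# Page title is:  Scraping tutorial 1 | 莫烦Python
</code></pre> 
    </div> 
   </div> 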

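   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> To find the paragraph <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><p></code> in the middle, use the approach below. In the HTML this paragraph also contains tabs and newlines, so we pass <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">flags=re.DOTALL</code>, which lets <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">.</code> match newlines as well.</p> 
   <div class="language-python highlighter-rouge" style="font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> 
    <div class="highlight"> 
     <pre class="highlight" style="color:rgb(253,206,147);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><code style="font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">res = re.findall(r"<p>(.*?)</p>", html, flags=re.DOTALL)    # re.DOTALL if multi line
print("\nPage paragraph is: ", res[0])

# Page paragraph is:
#  这是一个在 <a href="https://morvanzhou.github.io/">莫烦Python</a>
#  <a href="https://morvanzhou.github.io/tutorials/scraping">爬虫教程</a> 中的简单测试.
</code></pre> 
    </div> 
   </div> 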

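   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> The last exercise is to find all the links. This one is quite useful: when you want to collect the links on a page and then download some content to your computer, this is the way to do it.</p> 
   <div class="language-python highlighter-rouge" style="font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> 
    <div class="highlight"> 
     <pre class="highlight" style="color:rgb(253,206,147);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;"><code style="font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">res = re.findall(r'href="(.*?)"', html)
print("\nAll links: ", res)
# All links:  ['https://morvanzhou.github.io/static/img/description/tab_icon.png', 'https://morvanzhou.github.io/', 'https://morvanzhou.github.io/tutorials/scraping']
</code></pre> 
    </div> 
   </div> 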
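   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> Before downloading anything, the collected links usually need a little cleanup. The sketch below resolves relative links against the page address and keeps only image links. It runs on a small stand-in string rather than the live page: <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">sample_html</code> and <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">base_url</code> are illustrative names, and the relative <code class="highlighter-rouge" style="color:rgb(254,113,113);font-family:Menlo, 'Andale Mono', Arial, Tahoma, 'Microsoft YaHei';font-size:.85em;">/tutorials/scraping</code> link is altered from the original page purely to show the resolution step.</p> 

```python
import re
from urllib.parse import urljoin, urlparse

# stand-in data for illustration; the second link is deliberately relative
base_url = "https://morvanzhou.github.io/static/scraping/basic-structure.html"
sample_html = '''
<link rel="icon" href="https://morvanzhou.github.io/static/img/description/tab_icon.png">
<a href="https://morvanzhou.github.io/">莫烦Python</a>
<a href="/tutorials/scraping">爬虫教程</a>
'''

# urljoin leaves absolute links alone and resolves relative ones
links = [urljoin(base_url, href) for href in re.findall(r'href="(.*?)"', sample_html)]
print(links)

# keep only links whose path ends in an image extension
images = [link for link in links
          if urlparse(link).path.lower().endswith((".png", ".jpg"))]
print(images)
```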

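   <p style="line-height:1.7em;text-align:justify;color:rgb(65,88,73);font-family:'Hiragino Sans GB', Tahoma, Helvetica, Arial, 'Microsoft YaHei', 'WenQuanYi Micro Hei', '黑体', '宋体', sans-serif;font-size:16px;"> Next time, we will look at how BeautifulSoup makes all of this more convenient.</p> 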

