Writing a Scraper in Python from Scratch: Extracting Web Page Information

  • Blog link: https://uublog.com/article/20170216/python-extarct-html-info/

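Preface

When it comes to extracting content, I went back and forth between BeautifulSoup and matching directly with regular expressions. BeautifulSoup is simple and clear but less flexible.
Regular expressions are the opposite.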

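Main text

Analyzing where the information lives

For a netdisk site, the information we want to extract mainly includes the sharer ID, resource name, netdisk URL, resource size, creation time, and so on. Working out where this information lives is not the focus of this post, so I assume the positions are already known and only the extraction remains. I use the sharer ID, resource name, and netdisk URL as a demonstration.

For example, take the resource 莽荒纪.zip, whose URL is http://www.sobaidupan.com/file-106010793.html. From its HTML we can get the following:

  • Resource name: 莽荒纪.zip
  • Sharer ID: http://www.sobaidupan.com/user-2082813876-1.html
  • Netdisk URL: http://sbdp.baidudaquan.com/down.asp?id=106010793&token=c4e0d8de4bf94fe0d86a6b4f675fe176

2082813876 is the user's ID on sobaidupan.com and also the user ID on Baidu's cloud disk, which makes things easy.
The resource's real URL, however, only becomes known after loading one more page, http://sbdp.baidudaquan.com/down.asp?id=16166237&token=301efbbe2c138d150b41b5813a3d4077.
Its source looks like this: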

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<div style="margin:0 auto;margin-top:10%;width:600px;border: 1px solid #ff0000;line-height:30px;padding:10px 10px 10px 10px; ">提示:亲,正在为您跳转,请稍等2秒..... 
<meta http-equiv='refresh' content='2;URL=http://pan.baidu.com/share/link?shareid=3994307345&uk=2755655514&fid=45639734040097'></div>

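The http://pan.baidu.com/share/link?shareid=3994307345&uk=2755655514&fid=45639734040097 in that source is exactly the resource address we want.

In other words, to extract all the information for 莽荒纪.zip, at least two URLs have to be loaded.

  • First load: http://www.sobaidupan.com/file-106010793.html
    This gives the resource name, the sharer ID, and the in-site netdisk address http://sbdp.baidudaquan.com/down.asp?id=106010793&token=c4e0d8de4bf94fe0d86a6b4f675fe176

  • Second load: http://sbdp.baidudaquan.com/down.asp?id=106010793&token=c4e0d8de4bf94fe0d86a6b4f675fe176, from which the real netdisk address is extracted.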
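
The two loads described above can be sketched roughly as follows. This is only an illustration under assumptions: resolve_pan_url and its fetch parameter are hypothetical names (fetch stands for any function that takes a URL and returns the page's HTML as text), and the patterns assume the page layout shown in this post.

```python
import re

# Patterns assume the page layout shown in this post; they are not a stable API.
DOWN_RE = re.compile(r'href="(http://sbdp\.baidudaquan\.com/down\.asp\?id=[^"]+)"')
PAN_RE = re.compile(r"URL=(http://pan\.baidu\.com/share/link\?[^'\"]+)")


def resolve_pan_url(page_url, fetch):
    """Load the resource page, then the in-site download page.

    `fetch` is a hypothetical callable: URL in, HTML text out.
    Returns the Baidu netdisk URL, or None if either step fails to match.
    """
    first = DOWN_RE.search(fetch(page_url))        # load 1: resource page
    if first is None:
        return None
    second = PAN_RE.search(fetch(first.group(1)))  # load 2: redirect page
    return second.group(1) if second else None
```

Passing the fetcher in as an argument keeps the flow easy to try out against saved HTML, without touching the network.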

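Extracting the information

Fetching the page source

The previous post covered how to fetch a page's source. I put that code in a file called yzyPublic.py, so it has to be imported before it can be used.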

import yzyPublic

# Fetch the HTML source of the resource page
res = yzyPublic.get_web_source('http://www.sobaidupan.com/file-106010793.html')
print(res)
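
yzyPublic.py itself is not shown in this post. For readers who do not have it, a minimal stand-in might look like the sketch below. It is an assumption, not the code from the previous post: it uses Python 3's urllib.request, and the decode_body helper, the User-Agent value, and the utf-8/gbk fallback order are all illustrative choices.

```python
import urllib.request


def decode_body(raw, charset=None):
    """Decode response bytes, trying the declared charset, then utf-8, then gbk."""
    for enc in (charset, "utf-8", "gbk"):
        if not enc:
            continue
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding: try the next candidate
    return raw.decode("utf-8", errors="replace")


def get_web_source(url, timeout=10):
    """Hypothetical stand-in for yzyPublic.get_web_source: URL in, HTML text out."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return decode_body(resp.read(), resp.headers.get_content_charset())
```

Saved as yzyPublic.py next to the script, this would let the import above resolve, though the real helper may behave differently.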

The content of res is as follows:


<html xmlns=http://www.w3.org/1999/xhtml>
<head>
<meta http-equiv=X-UA-Compatible content="IE=edge,chrome=1">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="stylesheet" type="text/css" href="style.css" />

<title>莽荒纪.zip_zgh*****1617_百度云盘下载 - 搜百度盘</title>
<meta name="keywords" content="莽荒纪.zip" />
<meta name="description" content="小说/修真/莽荒纪.zip" />
<style type="text/css">
--
.f_color {
    color: #FFFFFF;
    font-weight: bold;
}
--> 
style>
head>
<body>
<div class="headtop">
    <div class="headtop_f"><B>搜百度盘(SoBaiduPan.com)B> 是基于百度云搜索,最大的百度云盘资源搜索中心,千万级大数据量,让您一网打尽所有的百度网盘资源.div> 
div>
<div class="site_head w c">
    <div class="sitelogo"><a href="/"><img src="image/logo.gif" border="0" title="SoBaiduPan.com">a>div>
    <div class="top_allsite" id="top_allsite"><ul>
    <script type="text/javascript" src="top_txtad.asp">script>
    ul>div>
div>
<div class="menu w c">
    <ul>
        <li><a href="http://www.sobaidupan.com">首 页a>li>
        <li><a href="list-1-1.html">最新资源a>li>
        <li><a href="zhuan-1-1.html">影视目录a>li>
        <li><a href="zhuan-2-1.html">小说目录a>li>
        <li><a href="list-28-1.html">影视资源a>li>
        <li><a href="list-30-1.html">动漫资源a>li>
        <li><a href="list-29-1.html">小说资源a>li>
        <li><a href="zhuan-3-1.html">综合资源a>li>
        <li><a href="http://soft.sobaidupan.com" target="_blank" title="百度云下载器">云下载器a>li>
        <li><a href="m.asp" title="移动端访问">手机专版a>li>
        <li><a href="http://weipan.sobaidupan.com" title="新浪微盘资源搜索" target="_blank">新浪微盘a>li>
        <li><a href="about.asp?id=2" title="在线发布共享资源">发布资源a>li>
        <li><a href="http://bbs.sobaidupan.com" title="建议留言" target="_blank"><font color="#FFFF00">建议留言font>a>li>
    ul>
div>

<div class="smenu c">
    <div class="smenu_nav">
        <a href="list-3-1.html">torrenta><a href="list-5-1.html">rmvba><a href="list-4-1.html">mp4a><a href="list-7-1.html">mp3a><a href="list-9-1.html">avia><a href="list-8-1.html">epuba><a href="list-10-1.html">mkva><a href="list-11-1.html">flva><a href="list-12-1.html">pdfa><a href="list-13-1.html">ppsa><a href="list-15-1.html">psda><a href="list-16-1.html">isoa><a href="list-17-1.html">ghosta><a href="list-19-1.html">exea><a href="list-20-1.html">txta><a href="list-21-1.html">apka><a href="list-22-1.html">ipaa><a href="list-24-1.html">wpsa><a href="list-25-1.html">rtfa><a href="list-26-1.html">voba><a href="list-13-1.html">ppt/pptxa><a href="list-27-1.html">xls/xlsxa><a href="list-14-1.html">doc/docxa><a href="list-18-1.html">rar/zipa>
    div>
div>

<div class="search w c">
<table width="100%" height="90" border="0" align="center" cellpadding="0" cellspacing="1">
  <tr>
    <td>

    <script type="text/javascript" src="ad/top1_580x90.js">script>

td>
    <td>
        <a href="adgo.asp?id=30" target="_blank"><img src="ad/ad2.jpg">a>
    td>
  tr>
table>
<div class="fgx">div>
    <form id="form1" name="form1" method="get" action="search.asp" ><img src="image/s.png" width="32" height="32" align="absmiddle"> 请您输入搜索内容:
        <input name="wd" id="wd" placeholder="共108,789,857个资源,今日已更新2382..." type="text" size="30" value="" autocomplete="off" />
        <input type="submit" id="Su" tabindex="2" value="网盘搜索" style="cursor:hand;"> <img src="image/soso.gif" width="23" height="21" align="absmiddle"><a href="about.asp?id=1" target="_blank"><font color="red"><b>点击打赏本站b>font>  <a href="http://koubei.baidu.com/s/www.sobaidupan.com" target="_blank"><b>点击支持本站b>a<img src="image/new.gif" width="22" height="14" align="absmiddle"> <a href="http://soft.sobaidupan.com" target="_blank"><font color="red"><b>百度云搜索器b>font>a>
    form>
div>
<script type="text/javascript" charset="gbk" src="opensug.js">script>
<script type="text/javascript">
var txtObj = document.getElementById("alertSpan");
function show(str){
window.location.href="search.asp?r=0&wd="+encodeURIComponent(str);
}
var params = {
"XOffset":0,
"YOffset":0,
"width":204,
"fontColor":"#f70",
"fontColorHI":"#FFF",
"fontSize":"15px",
"fontFamily":"宋体",
"borderColor":"gray",
"bgcolorHI":"#03c",
"sugSubmit":false
};
BaiduSuggestion.bind("wd",params,show);
script>

<div class="main w c">
    <div class="art_bt_box w c"><ul><li><h1>莽荒纪.zip</h1></li></ul></div>
  <div class="art_box">
          <table border="0">
            <tr>
              <td width="250" valign="top" ><table width="250" border="0" cellpadding="0" cellspacing="1" bordercolor="#3E92CF" bgcolor="#3E92CF">
                  <tr>
                    <td width="250" height="119" bgcolor="#FFFFFF" ><div align="center"><a href="user-2082813876-1.html"><img src="http://himg.bdimg.com/sys/portrait/item/797c6b21.jpg" width="100" height="100" border="0"></a></div></td>
                  tr>
                  <tr>
                    <td height="40" bgcolor="#FFFFFF" ><div align="center">用户名:zgh*****1617div>td>
                  tr>
                  <tr>
                    <td height="40" bgcolor="#FFFFFF" ><div align="center"><a href="user-2082813876-1.html"><img src="image/jrzy.gif" width="89" height="24" border="0">a>div>td>
                  tr>
                  <tr>
                    <td bgcolor="#FFFFFF" >
                    <script src="ad/250x250.js" type="text/javascript">script>div>
                    td>
                  tr>
                  <tr>
                    <td height="35" bgcolor="#3E92CF" > <span class="f_color">Ta 分享的其它资源:span>td>
                  tr>
                  <tr>
                    <td height="40" bgcolor="#FFFFFF">
                    <ul>
                        <li> <a href="file-1266183.html" title=网游——屠龙巫师.zip>网游——屠龙巫师.zipa>li><li> <a href="file-1266216.html" title=网游-梦幻现实.zip>网游-梦幻现实.zipa>li><li> <a href="file-1266234.html" title=神也玩转网游.zip>神也玩转网游.zipa>li><li> <a href="file-1668670.html" title=魔兽英雄.zip>魔兽英雄.zipa>li><li> <a href="file-1668832.html" title=阿亚罗克年代记.zip>阿亚罗克年代记.zipa>li><li> <a href="file-1668883.html" title=重生之福星道士.zip>重生之福星道士.zipa>li><li> <a href="file-1668930.html" title=重生之极限风流.zip>重生之极限风流.zipa>li><li> <a href="file-1669255.html" title=英雄无敌之大航海时代.zip>英雄无敌之大航海时代.zipa>li><li> <a href="file-1674467.html" title=网游之霸世神偷.zip>网游之霸世神偷.zipa>li><li> <a href="file-2013963.html" title=霸王怒.zip>霸王怒.zipa>li>
                        ul>
                    td>
                  tr>
                  <tr>
                    <td  bgcolor="#FFFFFF" >
                    <script src="ad/250x250-2.js" type="text/javascript">script>td>
                  tr>
                  <tr>
                    <td height="35" bgcolor="#3E92CF" > <span class="f_color">其它网友正在下载的资源:span>td>
                  tr>
                  <tr>
                    <td  bgcolor="#FFFFFF" >
                    <ul>
                    <li> <a href="file-830.html" title=橄榄油 - 副本5.psd>橄榄油 - 副本5.psda>li><li> <a href="file-829.html" title=百度云管家 v4.8.0 绿色版 i2i2.cn.rar>百度云管家 v4.8.0 绿色版 i2i2.cn.rara>li><li> <a href="file-828.html" title=百度云管家 v4.8.0  单文件版 i2i2.cn.rar>百度云管家 v4.8.0  单文件版 i2i2.cn.rara>li><li> <a href="file-827.html" title=第1天上午.5.mp3>第1天上午.5.mp3a>li><li> <a href="file-826.html" title=第2天下午.8.mp3>第2天下午.8.mp3a>li><li> <a href="file-825.html" title=第2天上午.7.mp3>第2天上午.7.mp3a>li><li> <a href="file-824.html" title=第1天下午.5.mp3>第1天下午.5.mp3a>li><li> <a href="file-823.html" title=第1天上午.4.mp3>第1天上午.4.mp3a>li><li> <a href="file-822.html" title=第2天下午.6.mp3>第2天下午.6.mp3a>li><li> <a href="file-821.html" title=第1天下午.7.mp3>第1天下午.7.mp3a>li>
                    ul>
                    td>
                  tr>
                table>

              td>
              <td height="61" align="left" valign="top" >
                <table width="100%" border="0" align="left" cellpadding="0" cellspacing="0" bordercolor="#3E92CF" bgcolor="#3E92CF">
                                <tr>
                  <td bgcolor="#FFFFFF" >
                  <script type='text/javascript' src='http://m1.sobaidupan.com/fr3a1ec292ffc2f63fdb146392acb024e057e3d4002ef230ec51322bda.js'>script>td>
                tr>
                <tr>
                  <td bgcolor="#FFFFFF" ><div class="fgx">div>td>
                tr>
                 <tr>
                  <td style="line-height: 30px" bgcolor="#FFFFFF" ><div align="left"> <B>资源名称:B>莽荒纪.zipdiv>td>
                tr>
                <tr>
                  <td style="line-height: 30px" bgcolor="#FFFFFF" ><div align="left"> <B>资源类别:B>小说/修真div>td>
                tr>
                <tr>
                  <td style="line-height: 30px" bgcolor="#FFFFFF" ><div align="left"> <B>资源大小:B>3.83 MB <b>资料扩展名:b>.zip <b>访问/下载次数b>:10/9 <b>分享日期:b>2016/9/5 11:13:00div>td>
                  tr>          
                 <tr>
                  <td bgcolor="#FFFFFF" ><div class="fgx">div>td>
                tr>
                <tr>
                  <td bgcolor="#FFFFFF" >
                  <table width="100%" border="0" align="left">
                  <tr>
                    <td width="155">

                        <div align="center">
                        <a href="http://sbdp.baidudaquan.com/down.asp?id=106010793&token=c4e0d8de4bf94fe0d86a6b4f675fe176" title="莽荒纪.zip -百度网盘下载" target="_blank"><img src="image/wpdown.gif" width="137" height="34" border="0"></a></div></td>
                      <td width="152" bgcolor="#FFFFFF" ><div align="center"><a href="#" onclick="javascript:alert('违法信息举报信箱:sobaidupan@126.com')"><img src="image/zaixjb.gif" width="137" height="34" border="0" title="举报资源" style="cursor:pointer" id="police" >a>div>td>
                      <td width="497" bgcolor="#FFFFFF" > <div class="bdsharebuttonbox"><a href="#" class="bds_more" data-cmd="more">分享到:a><a href="#" class="bds_qzone" data-cmd="qzone" title="分享到QQ空间">QQ空间a><a href="#" class="bds_tieba" data-cmd="tieba" title="分享到百度贴吧">百度贴吧a><a href="#" class="bds_weixin" data-cmd="weixin" title="分享到微信">微信a><a href="#" class="bds_tsina" data-cmd="tsina" title="分享到新浪微博">新浪微博a><a href="#" class="bds_douban" data-cmd="douban" title="分享到豆瓣网">豆瓣网a>div>

                    td>
                  tr>
                table>td>
                tr>

                <tr>
                  <td bgcolor="#FFFFFF" ><div class="fgx">div>
                     <script src="ad/728x90_2.js" type="text/javascript">script>
                td>
                tr>
                <tr>
                  <td bgcolor="#FFFFFF" >
                    <div id="hm_t_97521">div>td>
                tr>

<tr>
                  <td bgcolor="#FFFFFF" ><div class="fgx">div><div align="left">
                  <script src="ad/336x280.js" type="text/javascript">script>
                 div>td>
                tr>
                <tr>
                  <td height="35" bgcolor="#3E92CF" > <span class="f_color">相关资源:span>td>
                tr>
                <tr>
                  <td height="40" bgcolor="#FFFFFF" >
                    <ul>
                    <li> <a href="file-12334474.html" title=仙符问道.zip>仙符问道.zipa>li><li> <a href="file-12335167.html" title=随身副本闯仙界.zip>随身副本闯仙界.zipa>li><li> <a href="file-12335453.html" title=齐宇问道.zip>齐宇问道.zipa>li><li> <a href="file-12335876.html" title=猫行天下.zip>猫行天下.zipa>li><li> <a href="file-12336124.html" title=极品修真邪少.zip>极品修真邪少.zipa>li><li> <a href="file-12424570.html" title=极品丹师.zip>极品丹师.zipa>li><li> <a href="file-12744895.html" title=重生之唯我独仙.zip>重生之唯我独仙.zipa>li><li> <a href="file-14281154.html" title=仙缘五行.zip>仙缘五行.zipa>li><li> <a href="file-15903276.html" title=与狐仙双修的日子.zip>与狐仙双修的日子.zipa>li><li> <a href="file-15903375.html" title=修真之位面交易系统.zip>修真之位面交易系统.zipa>li><li> <a href="file-15903925.html" title=拜师八戒.zip>拜师八戒.zipa>li><li> <a href="file-15904006.html" title=重生在白蛇的世界里.zip>重生在白蛇的世界里.zipa>li><li> <a href="file-15904154.html" title=巫也是道.zip>巫也是道.zipa>li><li> <a href="file-15979622.html" title=僵尸问道.zip>僵尸问道.zipa>li><li> <a href="file-16005591.html" title=大地之皇.zip>大地之皇.zipa>li><li> <a href="file-16484435.html" title=猪八戒重生记.zip>猪八戒重生记.zipa>li><li> <a href="file-16484613.html" title=至神传说.zip>至神传说.zipa>li><li> <a href="file-16484713.html" title=星空战神.zip>星空战神.zipa>li><li> <a href="file-16484798.html" title=现代封神榜.zip>现代封神榜.zipa>li><li> <a href="file-16735997.html" title=仙侠世界之天才掌门.zip>仙侠世界之天才掌门.zipa>li><li> <a href="file-16888626.html" title=物理高材修仙记.zip>物理高材修仙记.zipa>li><li> <a href="file-16889125.html" title=灵枢.zip>灵枢.zipa>li><li> <a href="file-17136845.html" title=极品仙君.zip>极品仙君.zipa>li><li> <a href="file-17175592.html" title=将修仙进行到底.zip>将修仙进行到底.zipa>li><li> <a href="file-17175765.html" title=合成修仙传.zip>合成修仙传.zipa>li><li> <a href="file-17257619.html" title=我做许仙的日子.zip>我做许仙的日子.zipa>li><li> <a href="file-17349180.html" title=少年武仙在都市.zip>少年武仙在都市.zipa>li><li> <a href="file-17349336.html" title=超级修仙之旅.zip>超级修仙之旅.zipa>li><li> <a href="file-17349557.html" title=娇美仙妻爱上我.zip>娇美仙妻爱上我.zipa>li><li> <a href="file-18057326.html" title=极品仙商.zip>极品仙商.zipa>li>
                    ul>td>
                tr>
                <tr>
                  <td bgcolor="#FFFFFF" >
                  <div class="fgx">div>
                     
<div class="ujian-hook">div>
<script type="text/javascript">var ujian_config = {num:16,target:1,picSize:72,textHeight:45,hoverTextColor:'#FA1B02'};script>
<script type="text/javascript" src="http://v1.ujian.cc/code/ujian.js?uid=2087333">script>
<a href="http://www.ujian.cc" style="border:0;"><img src="http://img.ujian.cc/pixel.png" alt="友荐云推荐" style="border:0;padding:0;margin:0;" />a>

                    td>
                tr>
                                <tr>
                  <td bgcolor="#FFFFFF" >
                  <div class="fgx">div>

                    td>
                tr>
                <tr>
                  <td height="40" bgcolor="#3E92CF" > <span class="f_color">相关说明:span>td>
                tr>
                <tr>
                  <td height="40" bgcolor="#FFFFFF" ><div class="art_foot">莽荒纪.zip为搜百度盘收集整理的结果,下载地址直接跳转到百度网盘进行下载,该文件的安全性和完整性需要您自行判断。感谢您对本站的支持.div> td>
                tr>
                <tr>
                  <td height="80" bgcolor="#FFFFFF" >
                     上一个:<a href="file-106010792.html" title="netplan.zip">netplan.zipa>
                        <div class="fgx">div>
                         下一个:<a href="file-106010794.html" title="斗战西游.zip">斗战西游.zipa>                 td>
                tr>

              table>td>
              <td width="200" align="left" valign="top" >
              <script src="ad/200x200.js" type="text/javascript">script>

              <div class="art_left_bt"><img src="image/hot.gif" width="22" height="11"> 您可能需要的资源:div>

              <ul>
                <li> <a href="file-23821718.html" title=重生之婚后试爱.txt>重生之婚后试爱.txta>li><li> <a href="file-23827473.html" title=时光,浓淡相宜.txt>时光,浓淡相宜.txta>li><li> <a href="file-25264047.html" title=[书包网]亲爱的爱情(重生演艺圈).txt>[书包网]亲爱的爱情(重生演艺圈).txta>li><li> <a href="file-25650524.html" title=[古装言情]《二货娘子》作者:雾矢翊(晋江VIP2014-03-17完结)金牌高积分.txt>[古装言情]《二货娘子》作者:雾矢翊(晋江VIP2014-03-17完结)金牌高积分.txta>li><li> <a href="file-25651309.html" title=系统之宠妃.txt>系统之宠妃.txta>li><li> <a href="file-25651440.html" title=后宫翻身记(重生) .txt>后宫翻身记(重生) .txta>li><li> <a href="file-29456136.html" title=重生之汤圆儿.txt>重生之汤圆儿.txta>li><li> <a href="file-29456254.html" title=《重生之换我疼你》作者:森中一小妖.txt>《重生之换我疼你》作者:森中一小妖.txta>li><li> <a href="file-29717792.html" title=《宠妃》作者:月非娆.txt>《宠妃》作者:月非娆.txta>li><li> <a href="file-30877984.html" title=[网游]舍我娶谁.txt>[网游]舍我娶谁.txta>li>
                ul>
                <script src="ad/160x600.js" type="text/javascript">script>
              td>
            tr>

          table>
  div>
div>

<script charset='gbk' src='http://p.tanx.com/ex?i=mm_113468001_12740314_57802967'>script>
<div class="cl">div>
<div class="fgx">div>
<div class="foot">
    <p><img src="image/wj.png" width="36" height="43" align="absmiddle">  搜百度盘(<a href="http://www.sobaidupan.com" title="搜百度盘">www.sobaidupan.coma>) 2015-2018 All Rights Reserved <a href="zhaoshang.asp" title="广告合作及投放">广告合作a<a href="about.asp" title="关于本站">关于本站a>  QQ群:<a href="http://jq.qq.com/?_wv=1027&k=a2uzxT" target="_blank">385379281a>p>
    <p>本站仅提供百度网盘资源搜索和百度网盘资源下载的网站,本站只抓取百度网盘的链接而不保存任何资源. <script>
var _hmt = _hmt || [];
(function() {
  var hm = document.createElement("script");
  hm.src = "//hm.baidu.com/hm.js?f9d133598d63eabee77f59430aefa2ab";
  var s = document.getElementsByTagName("script")[0]; 
  s.parentNode.insertBefore(hm, s);
})();
script>
<script type="text/javascript">var cnzz_protocol = (("https:" == document.location.protocol) ? " https://" : " http://");document.write(unescape("%3Cspan id='cnzz_stat_icon_1254604262'%3E%3C/span%3E%3Cscript src='" + cnzz_protocol + "s11.cnzz.com/stat.php%3Fid%3D1254604262' type='text/javascript'%3E%3C/script%3E"));script> <a href="setxml.asp">sitemap.xmla>
p>
    <p>本站所有资源均来自互联网,本站只负责技术收集和整理,均不承担任何法律责任,如有侵权违规等其它行为请联系我们. <img src="image/e.jpg" width="163" height="20" align="absmiddle">p>
div>
<br />

<script>window._bd_share_config={"common":{"bdSnsKey":{},"bdText":"","bdMini":"2","bdMiniList":["mshare","qzone","tsina","bdysc","weixin","tieba","douban","sqq","qq","hi","baidu","share189","fx","mail","copy"],"bdPic":"","bdStyle":"0","bdSize":"16"},"share":{"bdSize":16}};with(document)0[(getElementsByTagName('head')[0]||body).appendChild(createElement('script')).src='http://bdimg.share.baidu.com/static/api/js/share.js?v=89860593.js?cdnversion='+~(-new Date()/36e5)];script>
body>
html>
<script src="count.asp?id=106010793" type="text/javascript">script>

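Extracting the user ID, resource name, and netdisk URL

After thinking it over for a long while, I decided to use BeautifulSoup and re regular expressions together to extract the information.
Personally I lean toward using regular expressions alone; nearly all the scrapers I have written before extract information that way. I added BeautifulSoup here for the sake of learning.

Import the required modules, BeautifulSoup and re: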

from bs4 import BeautifulSoup
import re

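Extracting the title

On these pages the title is always stored in the h1 tag. Extract it like this:

soup = BeautifulSoup(res, "html.parser")
print(soup.h1.text)

res is the page source fetched earlier. 'html.parser' tells BeautifulSoup which parser to use to read the page; here it is the HTML parser from Python's standard library. lxml is another commonly used choice.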

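Extracting the UID

For the UID I used a regular expression, which seemed simpler. Even with BeautifulSoup I would still need a regex, so both methods are shown below.

  • Method 1: match directly with a regex
uid = re.search(r'user-(\d+)-1\.html', res)
print(uid.group(1))
  • Method 2: use BeautifulSoup with a regex to find the matching href attribute
uid2 = soup.find(href=re.compile(r'user-\d+-1\.html'))['href']
print(uid2.split('-')[1])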
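
Both methods raise an exception when the page contains no user link, because re.search and soup.find return None in that case. A small guard, sketched here for the regex variant, avoids that; extract_uid is a hypothetical helper name, not something from the original code.

```python
import re

UID_RE = re.compile(r"user-(\d+)-1\.html")


def extract_uid(html):
    """Return the sharer's numeric ID as a string, or None if no user link is found."""
    match = UID_RE.search(html)
    return match.group(1) if match else None


print(extract_uid('<a href="user-2082813876-1.html">'))  # 2082813876
print(extract_uid("<p>no user link here</p>"))           # None
```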

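Extracting the netdisk URL

Here the in-site download address has to be extracted first; then its source is loaded and the Baidu netdisk address is extracted from that, as described earlier in this post.

Extracting the in-site download URL

rurl = re.search(r'href="(http://sbdp\.baidudaquan\.com/down\.asp\?id=.+?)"', res)

print(rurl.group(1))

Extracting the Baidu netdisk address

# Load the in-site download page, then pull the target out of its meta refresh tag
dres = yzyPublic.get_web_source(rurl.group(1))
purl = re.search(r"URL=(http://pan\.baidu\.com/share/link\?shareid=.+?)'", dres)
print(purl.group(1))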
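
Once the netdisk address is in hand, its query parameters (shareid, uk, fid) can be split out with the standard library rather than more regexes. The sketch below uses Python 3's urllib.parse; share_params is a hypothetical helper name, and what to do with the individual values is left open.

```python
from urllib.parse import urlsplit, parse_qs


def share_params(pan_url):
    """Split a pan.baidu.com share link into its query parameters."""
    query = parse_qs(urlsplit(pan_url).query)
    # parse_qs returns a list per key; keep the first value of each parameter.
    return {key: values[0] for key, values in query.items()}


url = "http://pan.baidu.com/share/link?shareid=3994307345&uk=2755655514&fid=45639734040097"
params = share_params(url)
print(params["shareid"], params["uk"])  # 3994307345 2755655514
```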

Wrapping it in functions for reuse

Do this however suits your own habits; I won't go into detail.
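
As one possible shape, and only a sketch, the first-page extraction could be bundled into a single function that returns all three fields. ResourceInfo, extract_resource, and _first are hypothetical names. To keep the example free of third-party dependencies, the title is taken with a regex on the h1 tag instead of BeautifulSoup, and the patterns assume the page layout shown in this post.

```python
import re
from collections import namedtuple

ResourceInfo = namedtuple("ResourceInfo", "name uid download_url")

# Patterns assume the page layout shown in this post.
NAME_RE = re.compile(r"<h1>(.*?)</h1>", re.S)
UID_RE = re.compile(r"user-(\d+)-1\.html")
DOWN_RE = re.compile(r'href="(http://sbdp\.baidudaquan\.com/down\.asp\?id=[^"]+)"')


def _first(pattern, text):
    """Return the first capture group of the first match, or None if there is none."""
    match = pattern.search(text)
    return match.group(1).strip() if match else None


def extract_resource(html):
    """Pull resource name, sharer ID, and in-site download URL from a resource page."""
    return ResourceInfo(
        name=_first(NAME_RE, html),
        uid=_first(UID_RE, html),
        download_url=_first(DOWN_RE, html),
    )
```

A field that cannot be found comes back as None, so the caller can decide whether to skip the page or retry.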

References

  • Beautiful Soup 4.2.0 documentation
