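For extracting the content, I went back and forth between BeautifulSoup and plain regular-expression matching. BeautifulSoup is simple and readable but less flexible.
Regular expressions are the opposite.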
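For a cloud-drive resource, the fields we want are mainly the sharer's ID, the resource name, the drive URL, the resource size, the creation time, and so on. Working out where each field sits in the page is not the focus of this post, so I will assume the positions are already known and go straight to extraction, using the sharer ID, resource name, and drive URL as the demonstration.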
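Take the resource 莽荒纪.zip as an example. Its URL is: http://www.sobaidupan.com/file-106010793.html
From the HTML we can get the following:
<h1>莽荒纪.zip</h1>
<a href="user-2082813876-1.html">
Here 2082813876 is the site-internal ID on sobaidupan.com, and also the Baidu cloud-drive user ID. That makes things easy.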
The resource's real URL, however, only becomes known after loading one more page, for example http://sbdp.baidudaquan.com/down.asp?id=16166237&token=301efbbe2c138d150b41b5813a3d4077
Its source looks like this:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<div style="margin:0 auto;margin-top:10%;width:600px;border: 1px solid #ff0000;line-height:30px;padding:10px 10px 10px 10px; ">提示:亲,正在为您跳转,请稍等2秒.....
<meta http-equiv='refresh' content='2;URL=http://pan.baidu.com/share/link?shareid=3994307345&uk=2755655514&fid=45639734040097'></div>
The http://pan.baidu.com/share/link?shareid=3994307345&uk=2755655514&fid=45639734040097 in that source is exactly the resource we are after.
In other words, to collect every field for 莽荒纪.zip we have to load at least two URLs.
First load: http://www.sobaidupan.com/file-106010793.html
This yields the resource name, the sharer ID, and the site-internal download address http://sbdp.baidudaquan.com/down.asp?id=106010793&token=c4e0d8de4bf94fe0d86a6b4f675fe176
Second load: http://sbdp.baidudaquan.com/down.asp?id=106010793&token=c4e0d8de4bf94fe0d86a6b4f675fe176
This yields the real address on the cloud drive.
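The two loads described above can be sketched as a single function. This is only a rough outline, not the scraper itself: `extract_share_info`, `STUB_PAGES`, and the `fetch` parameter are names made up for this illustration, and the stub pages are trimmed-down fragments assembled from the snippets in this post so the example runs without touching the network. In real use, `fetch` would be whatever function returns a page's source.

```python
import re

# Patterns based on the page fragments shown in this post (assumed to stay stable).
TITLE = re.compile(r"<h1>(.*?)</h1>", re.S)
UID = re.compile(r"user-(\d+)-1\.html")
DOWN_LINK = re.compile(r'href="(http://sbdp\.baidudaquan\.com/down\.asp\?id=[^"]+)"')
PAN_LINK = re.compile(r"URL=(http://pan\.baidu\.com/share/link\?[^'>\s]+)")


def extract_share_info(fetch, file_url):
    """Two-step extraction. `fetch` is any callable mapping a URL to HTML text."""
    page = fetch(file_url)                  # first load: the file page
    title = TITLE.search(page)
    uid = UID.search(page)
    down = DOWN_LINK.search(page)
    if not (title and uid and down):
        return None                         # layout differs from the sample page
    redirect = fetch(down.group(1))         # second load: the redirect page
    pan = PAN_LINK.search(redirect)
    return {
        "name": title.group(1).strip(),
        "uid": uid.group(1),
        "pan_url": pan.group(1) if pan else None,
    }


# Stub data for illustration only, so no network access is needed.
FILE_URL = "http://www.sobaidupan.com/file-106010793.html"
DOWN_URL = ("http://sbdp.baidudaquan.com/down.asp"
            "?id=106010793&token=c4e0d8de4bf94fe0d86a6b4f675fe176")
STUB_PAGES = {
    FILE_URL: '<h1>莽荒纪.zip</h1><a href="user-2082813876-1.html">u</a>'
              '<a href="' + DOWN_URL + '" target="_blank">d</a>',
    DOWN_URL: "<meta http-equiv='refresh' content='2;URL=http://pan.baidu.com"
              "/share/link?shareid=3994307345&uk=2755655514&fid=45639734040097'>",
}

info = extract_share_info(lambda url: STUB_PAGES[url], FILE_URL)
print(info)
```

Passing the fetcher in as an argument keeps the parsing logic testable against saved pages, which helps when the site changes its markup.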
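The previous post covered how to fetch a page's source. I put that code in a file called yzyPublic.py,
so it has to be imported before use.
import yzyPublic
res = yzyPublic.get_web_source('http://www.sobaidupan.com/file-106010793.html')
print(res)
The content of res is as follows: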
<html xmlns=http://www.w3.org/1999/xhtml>
<head>
<meta http-equiv=X-UA-Compatible content="IE=edge,chrome=1">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="stylesheet" type="text/css" href="style.css" />
<title>莽荒纪.zip_zgh*****1617_百度云盘下载 - 搜百度盘</title>
<meta name="keywords" content="莽荒纪.zip" />
<meta name="description" content="小说/修真/莽荒纪.zip" />
<style type="text/css">
--
.f_color {
color: #FFFFFF;
font-weight: bold;
}
-->
style>
head>
<body>
<div class="headtop">
<div class="headtop_f"><B>搜百度盘(SoBaiduPan.com)B> 是基于百度云搜索,最大的百度云盘资源搜索中心,千万级大数据量,让您一网打尽所有的百度网盘资源.div>
div>
<div class="site_head w c">
<div class="sitelogo"><a href="/"><img src="image/logo.gif" border="0" title="SoBaiduPan.com">a>div>
<div class="top_allsite" id="top_allsite"><ul>
<script type="text/javascript" src="top_txtad.asp">script>
ul>div>
div>
<div class="menu w c">
<ul>
<li><a href="http://www.sobaidupan.com">首 页a>li>
<li><a href="list-1-1.html">最新资源a>li>
<li><a href="zhuan-1-1.html">影视目录a>li>
<li><a href="zhuan-2-1.html">小说目录a>li>
<li><a href="list-28-1.html">影视资源a>li>
<li><a href="list-30-1.html">动漫资源a>li>
<li><a href="list-29-1.html">小说资源a>li>
<li><a href="zhuan-3-1.html">综合资源a>li>
<li><a href="http://soft.sobaidupan.com" target="_blank" title="百度云下载器">云下载器a>li>
<li><a href="m.asp" title="移动端访问">手机专版a>li>
<li><a href="http://weipan.sobaidupan.com" title="新浪微盘资源搜索" target="_blank">新浪微盘a>li>
<li><a href="about.asp?id=2" title="在线发布共享资源">发布资源a>li>
<li><a href="http://bbs.sobaidupan.com" title="建议留言" target="_blank"><font color="#FFFF00">建议留言font>a>li>
ul>
div>
<div class="smenu c">
<div class="smenu_nav">
<a href="list-3-1.html">torrenta><a href="list-5-1.html">rmvba><a href="list-4-1.html">mp4a><a href="list-7-1.html">mp3a><a href="list-9-1.html">avia><a href="list-8-1.html">epuba><a href="list-10-1.html">mkva><a href="list-11-1.html">flva><a href="list-12-1.html">pdfa><a href="list-13-1.html">ppsa><a href="list-15-1.html">psda><a href="list-16-1.html">isoa><a href="list-17-1.html">ghosta><a href="list-19-1.html">exea><a href="list-20-1.html">txta><a href="list-21-1.html">apka><a href="list-22-1.html">ipaa><a href="list-24-1.html">wpsa><a href="list-25-1.html">rtfa><a href="list-26-1.html">voba><a href="list-13-1.html">ppt/pptxa><a href="list-27-1.html">xls/xlsxa><a href="list-14-1.html">doc/docxa><a href="list-18-1.html">rar/zipa>
div>
div>
<div class="search w c">
<table width="100%" height="90" border="0" align="center" cellpadding="0" cellspacing="1">
<tr>
<td>
<script type="text/javascript" src="ad/top1_580x90.js">script>
td>
<td>
<a href="adgo.asp?id=30" target="_blank"><img src="ad/ad2.jpg">a>
td>
tr>
table>
<div class="fgx">div>
<form id="form1" name="form1" method="get" action="search.asp" ><img src="image/s.png" width="32" height="32" align="absmiddle"> 请您输入搜索内容:
<input name="wd" id="wd" placeholder="共108,789,857个资源,今日已更新2382..." type="text" size="30" value="" autocomplete="off" />
<input type="submit" id="Su" tabindex="2" value="网盘搜索" style="cursor:hand;"> <img src="image/soso.gif" width="23" height="21" align="absmiddle"><a href="about.asp?id=1" target="_blank"><font color="red"><b>点击打赏本站b>font> <a href="http://koubei.baidu.com/s/www.sobaidupan.com" target="_blank"><b>点击支持本站b>a> <img src="image/new.gif" width="22" height="14" align="absmiddle"> <a href="http://soft.sobaidupan.com" target="_blank"><font color="red"><b>百度云搜索器b>font>a>
form>
div>
<script type="text/javascript" charset="gbk" src="opensug.js">script>
<script type="text/javascript">
var txtObj = document.getElementById("alertSpan");
function show(str){
window.location.href="search.asp?r=0&wd="+encodeURIComponent(str);
}
var params = {
"XOffset":0,
"YOffset":0,
"width":204,
"fontColor":"#f70",
"fontColorHI":"#FFF",
"fontSize":"15px",
"fontFamily":"宋体",
"borderColor":"gray",
"bgcolorHI":"#03c",
"sugSubmit":false
};
BaiduSuggestion.bind("wd",params,show);
script>
<div class="main w c">
<div class="art_bt_box w c"><ul><li><h1>莽荒纪.zip</h1></li></ul></div>
<div class="art_box">
<table border="0">
<tr>
<td width="250" valign="top" ><table width="250" border="0" cellpadding="0" cellspacing="1" bordercolor="#3E92CF" bgcolor="#3E92CF">
<tr>
<td width="250" height="119" bgcolor="#FFFFFF" ><div align="center"><a href="user-2082813876-1.html"><img src="http://himg.bdimg.com/sys/portrait/item/797c6b21.jpg" width="100" height="100" border="0"></a></div></td>
</tr>
<tr>
<td height="40" bgcolor="#FFFFFF" ><div align="center">用户名:zgh*****1617</div></td>
</tr>
<tr>
<td height="40" bgcolor="#FFFFFF" ><div align="center"><a href="user-2082813876-1.html"><img src="image/jrzy.gif" width="89" height="24" border="0"></a></div></td>
</tr>
<tr>
<td bgcolor="#FFFFFF" >
<script src="ad/250x250.js" type="text/javascript">script>div>
td>
tr>
<tr>
<td height="35" bgcolor="#3E92CF" > <span class="f_color">Ta 分享的其它资源:span>td>
tr>
<tr>
<td height="40" bgcolor="#FFFFFF">
<ul>
<li> <a href="file-1266183.html" title=网游——屠龙巫师.zip>网游——屠龙巫师.zipa>li><li> <a href="file-1266216.html" title=网游-梦幻现实.zip>网游-梦幻现实.zipa>li><li> <a href="file-1266234.html" title=神也玩转网游.zip>神也玩转网游.zipa>li><li> <a href="file-1668670.html" title=魔兽英雄.zip>魔兽英雄.zipa>li><li> <a href="file-1668832.html" title=阿亚罗克年代记.zip>阿亚罗克年代记.zipa>li><li> <a href="file-1668883.html" title=重生之福星道士.zip>重生之福星道士.zipa>li><li> <a href="file-1668930.html" title=重生之极限风流.zip>重生之极限风流.zipa>li><li> <a href="file-1669255.html" title=英雄无敌之大航海时代.zip>英雄无敌之大航海时代.zipa>li><li> <a href="file-1674467.html" title=网游之霸世神偷.zip>网游之霸世神偷.zipa>li><li> <a href="file-2013963.html" title=霸王怒.zip>霸王怒.zipa>li>
ul>
td>
tr>
<tr>
<td bgcolor="#FFFFFF" >
<script src="ad/250x250-2.js" type="text/javascript">script>td>
tr>
<tr>
<td height="35" bgcolor="#3E92CF" > <span class="f_color">其它网友正在下载的资源:span>td>
tr>
<tr>
<td bgcolor="#FFFFFF" >
<ul>
<li> <a href="file-830.html" title=橄榄油 - 副本5.psd>橄榄油 - 副本5.psda>li><li> <a href="file-829.html" title=百度云管家 v4.8.0 绿色版 i2i2.cn.rar>百度云管家 v4.8.0 绿色版 i2i2.cn.rara>li><li> <a href="file-828.html" title=百度云管家 v4.8.0 单文件版 i2i2.cn.rar>百度云管家 v4.8.0 单文件版 i2i2.cn.rara>li><li> <a href="file-827.html" title=第1天上午.5.mp3>第1天上午.5.mp3a>li><li> <a href="file-826.html" title=第2天下午.8.mp3>第2天下午.8.mp3a>li><li> <a href="file-825.html" title=第2天上午.7.mp3>第2天上午.7.mp3a>li><li> <a href="file-824.html" title=第1天下午.5.mp3>第1天下午.5.mp3a>li><li> <a href="file-823.html" title=第1天上午.4.mp3>第1天上午.4.mp3a>li><li> <a href="file-822.html" title=第2天下午.6.mp3>第2天下午.6.mp3a>li><li> <a href="file-821.html" title=第1天下午.7.mp3>第1天下午.7.mp3a>li>
ul>
td>
tr>
table>
td>
<td height="61" align="left" valign="top" >
<table width="100%" border="0" align="left" cellpadding="0" cellspacing="0" bordercolor="#3E92CF" bgcolor="#3E92CF">
<tr>
<td bgcolor="#FFFFFF" >
<script type='text/javascript' src='http://m1.sobaidupan.com/fr3a1ec292ffc2f63fdb146392acb024e057e3d4002ef230ec51322bda.js'>script>td>
tr>
<tr>
<td bgcolor="#FFFFFF" ><div class="fgx">div>td>
tr>
<tr>
<td style="line-height: 30px" bgcolor="#FFFFFF" ><div align="left"> <B>资源名称:</B>莽荒纪.zip</div></td>
</tr>
<tr>
<td style="line-height: 30px" bgcolor="#FFFFFF" ><div align="left"> <B>资源类别:</B>小说/修真</div></td>
</tr>
<tr>
<td style="line-height: 30px" bgcolor="#FFFFFF" ><div align="left"> <B>资源大小:</B>3.83 MB <b>资料扩展名:</b>.zip <b>访问/下载次数</b>:10/9 <b>分享日期:</b>2016/9/5 11:13:00</div></td>
</tr>
<tr>
<td bgcolor="#FFFFFF" ><div class="fgx">div>td>
tr>
<tr>
<td bgcolor="#FFFFFF" >
<table width="100%" border="0" align="left">
<tr>
<td width="155">
<div align="center">
<a href="http://sbdp.baidudaquan.com/down.asp?id=106010793&token=c4e0d8de4bf94fe0d86a6b4f675fe176" title="莽荒纪.zip -百度网盘下载" target="_blank"><img src="image/wpdown.gif" width="137" height="34" border="0"></a></div></td>
<td width="152" bgcolor="#FFFFFF" ><div align="center"><a href="#" onclick="javascript:alert('违法信息举报信箱:sobaidupan@126.com')"><img src="image/zaixjb.gif" width="137" height="34" border="0" title="举报资源" style="cursor:pointer" id="police" >a>div>td>
<td width="497" bgcolor="#FFFFFF" > <div class="bdsharebuttonbox"><a href="#" class="bds_more" data-cmd="more">分享到:a><a href="#" class="bds_qzone" data-cmd="qzone" title="分享到QQ空间">QQ空间a><a href="#" class="bds_tieba" data-cmd="tieba" title="分享到百度贴吧">百度贴吧a><a href="#" class="bds_weixin" data-cmd="weixin" title="分享到微信">微信a><a href="#" class="bds_tsina" data-cmd="tsina" title="分享到新浪微博">新浪微博a><a href="#" class="bds_douban" data-cmd="douban" title="分享到豆瓣网">豆瓣网a>div>
td>
tr>
table>td>
tr>
<tr>
<td bgcolor="#FFFFFF" ><div class="fgx">div>
<script src="ad/728x90_2.js" type="text/javascript">script>
td>
tr>
<tr>
<td bgcolor="#FFFFFF" >
<div id="hm_t_97521">div>td>
tr>
<tr>
<td bgcolor="#FFFFFF" ><div class="fgx">div><div align="left">
<script src="ad/336x280.js" type="text/javascript">script>
div>td>
tr>
<tr>
<td height="35" bgcolor="#3E92CF" > <span class="f_color">相关资源:span>td>
tr>
<tr>
<td height="40" bgcolor="#FFFFFF" >
<ul>
<li> <a href="file-12334474.html" title=仙符问道.zip>仙符问道.zipa>li><li> <a href="file-12335167.html" title=随身副本闯仙界.zip>随身副本闯仙界.zipa>li><li> <a href="file-12335453.html" title=齐宇问道.zip>齐宇问道.zipa>li><li> <a href="file-12335876.html" title=猫行天下.zip>猫行天下.zipa>li><li> <a href="file-12336124.html" title=极品修真邪少.zip>极品修真邪少.zipa>li><li> <a href="file-12424570.html" title=极品丹师.zip>极品丹师.zipa>li><li> <a href="file-12744895.html" title=重生之唯我独仙.zip>重生之唯我独仙.zipa>li><li> <a href="file-14281154.html" title=仙缘五行.zip>仙缘五行.zipa>li><li> <a href="file-15903276.html" title=与狐仙双修的日子.zip>与狐仙双修的日子.zipa>li><li> <a href="file-15903375.html" title=修真之位面交易系统.zip>修真之位面交易系统.zipa>li><li> <a href="file-15903925.html" title=拜师八戒.zip>拜师八戒.zipa>li><li> <a href="file-15904006.html" title=重生在白蛇的世界里.zip>重生在白蛇的世界里.zipa>li><li> <a href="file-15904154.html" title=巫也是道.zip>巫也是道.zipa>li><li> <a href="file-15979622.html" title=僵尸问道.zip>僵尸问道.zipa>li><li> <a href="file-16005591.html" title=大地之皇.zip>大地之皇.zipa>li><li> <a href="file-16484435.html" title=猪八戒重生记.zip>猪八戒重生记.zipa>li><li> <a href="file-16484613.html" title=至神传说.zip>至神传说.zipa>li><li> <a href="file-16484713.html" title=星空战神.zip>星空战神.zipa>li><li> <a href="file-16484798.html" title=现代封神榜.zip>现代封神榜.zipa>li><li> <a href="file-16735997.html" title=仙侠世界之天才掌门.zip>仙侠世界之天才掌门.zipa>li><li> <a href="file-16888626.html" title=物理高材修仙记.zip>物理高材修仙记.zipa>li><li> <a href="file-16889125.html" title=灵枢.zip>灵枢.zipa>li><li> <a href="file-17136845.html" title=极品仙君.zip>极品仙君.zipa>li><li> <a href="file-17175592.html" title=将修仙进行到底.zip>将修仙进行到底.zipa>li><li> <a href="file-17175765.html" title=合成修仙传.zip>合成修仙传.zipa>li><li> <a href="file-17257619.html" title=我做许仙的日子.zip>我做许仙的日子.zipa>li><li> <a href="file-17349180.html" title=少年武仙在都市.zip>少年武仙在都市.zipa>li><li> <a href="file-17349336.html" title=超级修仙之旅.zip>超级修仙之旅.zipa>li><li> <a href="file-17349557.html" title=娇美仙妻爱上我.zip>娇美仙妻爱上我.zipa>li><li> <a href="file-18057326.html" title=极品仙商.zip>极品仙商.zipa>li>
ul>td>
tr>
<tr>
<td bgcolor="#FFFFFF" >
<div class="fgx">div>
<div class="ujian-hook">div>
<script type="text/javascript">var ujian_config = {num:16,target:1,picSize:72,textHeight:45,hoverTextColor:'#FA1B02'};script>
<script type="text/javascript" src="http://v1.ujian.cc/code/ujian.js?uid=2087333">script>
<a href="http://www.ujian.cc" style="border:0;"><img src="http://img.ujian.cc/pixel.png" alt="友荐云推荐" style="border:0;padding:0;margin:0;" />a>
td>
tr>
<tr>
<td bgcolor="#FFFFFF" >
<div class="fgx">div>
td>
tr>
<tr>
<td height="40" bgcolor="#3E92CF" > <span class="f_color">相关说明:span>td>
tr>
<tr>
<td height="40" bgcolor="#FFFFFF" ><div class="art_foot">莽荒纪.zip为搜百度盘收集整理的结果,下载地址直接跳转到百度网盘进行下载,该文件的安全性和完整性需要您自行判断。感谢您对本站的支持.div> td>
tr>
<tr>
<td height="80" bgcolor="#FFFFFF" >
上一个:<a href="file-106010792.html" title="netplan.zip">netplan.zipa>
<div class="fgx">div>
下一个:<a href="file-106010794.html" title="斗战西游.zip">斗战西游.zipa> td>
tr>
table>td>
<td width="200" align="left" valign="top" >
<script src="ad/200x200.js" type="text/javascript">script>
<div class="art_left_bt"><img src="image/hot.gif" width="22" height="11"> 您可能需要的资源:div>
<ul>
<li> <a href="file-23821718.html" title=重生之婚后试爱.txt>重生之婚后试爱.txta>li><li> <a href="file-23827473.html" title=时光,浓淡相宜.txt>时光,浓淡相宜.txta>li><li> <a href="file-25264047.html" title=[书包网]亲爱的爱情(重生演艺圈).txt>[书包网]亲爱的爱情(重生演艺圈).txta>li><li> <a href="file-25650524.html" title=[古装言情]《二货娘子》作者:雾矢翊(晋江VIP2014-03-17完结)金牌高积分.txt>[古装言情]《二货娘子》作者:雾矢翊(晋江VIP2014-03-17完结)金牌高积分.txta>li><li> <a href="file-25651309.html" title=系统之宠妃.txt>系统之宠妃.txta>li><li> <a href="file-25651440.html" title=后宫翻身记(重生) .txt>后宫翻身记(重生) .txta>li><li> <a href="file-29456136.html" title=重生之汤圆儿.txt>重生之汤圆儿.txta>li><li> <a href="file-29456254.html" title=《重生之换我疼你》作者:森中一小妖.txt>《重生之换我疼你》作者:森中一小妖.txta>li><li> <a href="file-29717792.html" title=《宠妃》作者:月非娆.txt>《宠妃》作者:月非娆.txta>li><li> <a href="file-30877984.html" title=[网游]舍我娶谁.txt>[网游]舍我娶谁.txta>li>
ul>
<script src="ad/160x600.js" type="text/javascript">script>
td>
tr>
table>
div>
div>
<script charset='gbk' src='http://p.tanx.com/ex?i=mm_113468001_12740314_57802967'>script>
<div class="cl">div>
<div class="fgx">div>
<div class="foot">
<p><img src="image/wj.png" width="36" height="43" align="absmiddle"> 搜百度盘(<a href="http://www.sobaidupan.com" title="搜百度盘">www.sobaidupan.coma>) 2015-2018 All Rights Reserved <a href="zhaoshang.asp" title="广告合作及投放">广告合作a> <a href="about.asp" title="关于本站">关于本站a> QQ群:<a href="http://jq.qq.com/?_wv=1027&k=a2uzxT" target="_blank">385379281a>p>
<p>本站仅提供百度网盘资源搜索和百度网盘资源下载的网站,本站只抓取百度网盘的链接而不保存任何资源. <script>
var _hmt = _hmt || [];
(function() {
var hm = document.createElement("script");
hm.src = "//hm.baidu.com/hm.js?f9d133598d63eabee77f59430aefa2ab";
var s = document.getElementsByTagName("script")[0];
s.parentNode.insertBefore(hm, s);
})();
script>
<script type="text/javascript">var cnzz_protocol = (("https:" == document.location.protocol) ? " https://" : " http://");document.write(unescape("%3Cspan id='cnzz_stat_icon_1254604262'%3E%3C/span%3E%3Cscript src='" + cnzz_protocol + "s11.cnzz.com/stat.php%3Fid%3D1254604262' type='text/javascript'%3E%3C/script%3E"));script> <a href="setxml.asp">sitemap.xmla>
p>
<p>本站所有资源均来自互联网,本站只负责技术收集和整理,均不承担任何法律责任,如有侵权违规等其它行为请联系我们. <img src="image/e.jpg" width="163" height="20" align="absmiddle">p>
div>
<br />
<script>window._bd_share_config={"common":{"bdSnsKey":{},"bdText":"","bdMini":"2","bdMiniList":["mshare","qzone","tsina","bdysc","weixin","tieba","douban","sqq","qq","hi","baidu","share189","fx","mail","copy"],"bdPic":"","bdStyle":"0","bdSize":"16"},"share":{"bdSize":16}};with(document)0[(getElementsByTagName('head')[0]||body).appendChild(createElement('script')).src='http://bdimg.share.baidu.com/static/api/js/share.js?v=89860593.js?cdnversion='+~(-new Date()/36e5)];script>
body>
html>
<script src="count.asp?id=106010793" type="text/javascript">script>
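After thinking it over for a long while, I decided to use BeautifulSoup and re together to extract the information.
Personally I lean toward using regex alone; nearly all the scrapers I wrote before extract their data that way. BeautifulSoup is included here for the sake of learning.
Import the relevant modules, BeautifulSoup and re: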
from bs4 import BeautifulSoup
import re
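The title is stored in the h1 tag. Extract it like this:
soup = BeautifulSoup(res, "html.parser")
print(soup.h1.text)
res is the page source fetched earlier. 'html.parser' names the parser BeautifulSoup should use to read it, in this case the one built into Python. Another commonly used parser is lxml, which has to be installed separately.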
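For comparison, the same title can be pulled out with a regex alone, which is the trade-off weighed at the start of this post. The sketch below is an illustration only: `title_by_regex` is a made-up helper name, and `sample` is a fragment of the page shown above with its closing tags intact.

```python
import re


def title_by_regex(html):
    """Return the text of the first <h1> element, or None if there is none."""
    m = re.search(r"<h1[^>]*>(.*?)</h1>", html, re.I | re.S)
    return m.group(1).strip() if m else None


sample = '<div class="art_bt_box w c"><ul><li><h1>莽荒纪.zip</h1></li></ul></div>'
name = title_by_regex(sample)
print(name)
```

The regex version needs no extra dependency, but it breaks more easily if the markup changes, for example if the h1 ever contains nested tags.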
For the uid I used a regex, which I find simpler. With BeautifulSoup I would still need a regex anyway, so both approaches are shown here.
uid = re.search(r'user-(\d+)-1\.html', res)
print(uid.group(1))
uid2 = soup.find(href=re.compile(r'user-\d+-1\.html'))['href']
print(uid2.split('-')[1])
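One thing to watch with the regex approach: `re.search` returns None when nothing matches, so calling `.group(1)` directly raises an AttributeError on pages that lack the link. A small wrapper avoids that. This is a sketch; `first_group` is a hypothetical helper name, not part of yzyPublic.py.

```python
import re


def first_group(pattern, text, default=None):
    """Return group 1 of the first match, or `default` when nothing matches."""
    m = re.search(pattern, text)
    return m.group(1) if m else default


html = '<a href="user-2082813876-1.html"><img src="image/jrzy.gif"></a>'
uid = first_group(r"user-(\d+)-1\.html", html)
missing = first_group(r"user-(\d+)-1\.html", "<html></html>")
print(uid, missing)
```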
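As mentioned earlier in this post, we first have to extract the site-internal download address, load its source, and then extract the Baidu cloud-drive address from that.
Extract the site-internal download URL
rurl = re.search(r'href="(http://sbdp\.baidudaquan\.com/down\.asp\?id=.+?)"', res)
print(rurl.group(1))
Extract the Baidu cloud-drive address
dres = yzyPublic.get_web_source(rurl.group(1))
purl = re.search(r"URL=(http://pan\.baidu\.com/share/link\?shareid=.+?)'", dres)
print(purl.group(1))
The rest is a matter of personal habit, so I won't go into it here.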
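The resource size and share date mentioned at the start can be pulled out in the same way. The sketch below is one possible approach, not the original scraper's code: `parse_size_and_date` is a made-up name, and `row` assumes the table row from the page source arrives with proper closing tags. The patterns accept both the ASCII and the full-width colon, since it is not certain which one the live page uses.

```python
import re
from datetime import datetime

# Labels are taken from the page source shown earlier; re.I lets </b> match </B>.
SIZE = re.compile(r"资源大小[::]\s*</b>\s*([\d.]+)\s*([KMGT]?B)", re.I)
DATE = re.compile(
    r"分享日期[::]\s*</b>\s*(\d{4}/\d{1,2}/\d{1,2} \d{1,2}:\d{2}:\d{2})", re.I
)


def parse_size_and_date(html):
    """Return the size as (number, unit) and the share date as a datetime."""
    size = SIZE.search(html)
    date = DATE.search(html)
    return {
        "size": (float(size.group(1)), size.group(2).upper()) if size else None,
        "shared_at": (datetime.strptime(date.group(1), "%Y/%m/%d %H:%M:%S")
                      if date else None),
    }


row = ('<B>资源大小:</B>3.83 MB <b>资料扩展名:</b>.zip '
       '<b>访问/下载次数</b>:10/9 <b>分享日期:</b>2016/9/5 11:13:00</div>')
meta = parse_size_and_date(row)
print(meta)
```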