Today I wrote a scraper for a lovely classmate.
My environment:
macOS 10.13.5;
Python 2.7.10;
Packages used:
urllib2
BeautifulSoup4
First, build the target URL for each page of jokes automatically:
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
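To make the URL step concrete, here is a minimal sketch that builds the URLs for the first n pages in one go. The names BASE_URL and build_page_urls are my own, made up for illustration; the post itself just concatenates the string inside the loop.

```python
BASE_URL = 'http://www.qiushibaike.com/hot/page/'

def build_page_urls(n):
    """Return the URLs of pages 1..n (hypothetical helper, not from the post)."""
    return [BASE_URL + str(page) for page in range(1, n + 1)]

# Print the URLs for the first two pages.
for url in build_page_urls(2):
    print(url)
```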
Then fetch the HTML directly with urllib2, sending a User-Agent header to pose as a browser:
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent' : user_agent }
try:
    request = urllib2.Request(url,headers = headers)
    response = urllib2.urlopen(request)
    content = response.read().decode('utf-8')
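If you want to check the header without hitting the network, something like the sketch below should do. make_request is a hypothetical helper name; the import fallback is only there so the same snippet also runs on Python 3, where urllib2.Request lives in urllib.request.

```python
try:
    from urllib2 import Request          # Python 2, as used in this post
except ImportError:
    from urllib.request import Request   # Python 3 equivalent

def make_request(url, user_agent='Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'):
    """Build (but do not send) a request carrying a browser-like User-Agent."""
    return Request(url, headers={'User-Agent': user_agent})

request = make_request('http://www.qiushibaike.com/hot/page/1')
# Request stores header names capitalized, hence 'User-agent' in the lookup.
print(request.get_header('User-agent'))
```

Nothing is sent until the request object is passed to urlopen, so this is a cheap way to confirm the disguise is in place.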
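Next, build a BS4 object from the HTML;
parse the content with html.parser;
Use find to locate the tags we need, telling them apart by div and id, until we reach the tag node of the first joke on the page; get_text() then strips the markup.
Note that the text we get back is of type unicode; to keep the format consistent, re-encode it as UTF-8 with encode("utf-8").
Then remove the newlines (the string we get contains a lot of newline characters, which hurts readability, so I strip them to get a uniform, nicer-looking format).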
content = response.read().decode('utf-8')
soup = BeautifulSoup(content,"html.parser")
aPage = soup.find('body').find('div', id = 'content').find('div', id = 'content-left').find('div')
print aPage.span.get_text().encode("utf-8").replace("\n","")+"\n"
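The clean-up step can be tried on its own. Below is a small sketch with a made-up sample string; clean_joke is a hypothetical name. It removes the newlines before encoding, which is the reverse of the order used in this post, because that order also works on Python 3, where encode() returns bytes.

```python
# -*- coding: utf-8 -*-

def clean_joke(text):
    """Drop newlines from a unicode string and return it as UTF-8 bytes."""
    return text.replace(u"\n", u"").encode("utf-8")

# Made-up sample text, padded with newlines the way the scraped strings are.
raw = u"\n\n从前有座山,\n山里有座庙。\n\n"
cleaned = clean_joke(raw)
print(cleaned.decode("utf-8"))
```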
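Then walk through all the remaining jokes on the page and print every one of them (again with the newlines removed for a uniform format);
Somewhere around the last joke on the page there seems to be a stray result "1". I couldn't be bothered to find out which node it comes from, so I just filter that string out with len(s);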
for i in aPage.find_next_siblings():
    s=i.span.get_text().encode("utf-8")
    if len(s)>3:
        print s.replace("\n","")+"\n"
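Here is a small offline sketch of the same find / find_next_siblings walk. SAMPLE_HTML is a made-up, heavily simplified stand-in for the real page, and extract_jokes is a hypothetical helper. It also skips siblings that have no span at all, which is my guess at one way the loop could trip up; I have not checked the live page for this.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only meant to mimic the nesting used in this post.
SAMPLE_HTML = u"""
<html><body><div id="content"><div id="content-left">
<div><span>first joke</span></div>
<div><span>

second joke
</span></div>
<div><span>1</span></div>
<ul class="pagination"><li>2</li></ul>
</div></div></body></html>
"""

def extract_jokes(html):
    """Return the cleaned text of every joke-like <div> in the left column."""
    soup = BeautifulSoup(html, "html.parser")
    first = soup.find('div', id='content-left').find('div')
    jokes = []
    for node in [first] + list(first.find_next_siblings()):
        if node.span is None:                 # sibling without a <span>: skip it
            continue
        text = node.span.get_text().replace(u"\n", u"")
        if len(text.encode("utf-8")) > 3:     # same length filter as above
            jokes.append(text)
    return jokes

for joke in extract_jokes(SAMPLE_HTML):
    print(joke)
```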
As for the exception handling at the end, I'm not going to write it... just pass... ho ho ho ~
Final code:
# -*- coding:utf-8 -*-
import urllib2
from bs4 import BeautifulSoup
n = 2
# 25 jokes per page, 13 pages in total
print "\n"
print "Scraping Qiushibaike jokes, please wait:\n"
for page in range(1,n+1):
    print "\n\nPage "+str(page)+":\n"
    url = 'http://www.qiushibaike.com/hot/page/' + str(page)
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = { 'User-Agent' : user_agent }
    try:
        request = urllib2.Request(url,headers = headers)
        response = urllib2.urlopen(request)
        content = response.read().decode('utf-8')
        soup = BeautifulSoup(content,"html.parser")
        # first joke on the page
        aPage = soup.find('body').find('div', id = 'content').find('div', id = 'content-left').find('div')
        print aPage.span.get_text().encode("utf-8").replace("\n","")+"\n"
        # remaining jokes on the page
        for i in aPage.find_next_siblings():
            s=i.span.get_text().encode("utf-8")
            # skip the stray "1" near the end of the page
            if len(s)>3:
                print s.replace("\n","")+"\n"
    except :
        # errors are deliberately ignored
        pass
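If you ever do want the exception handling, a minimal sketch might look like the one below. It catches only URLError (which also covers HTTPError) and reports the failure rather than swallowing everything. fetch_html and its opener parameter are hypothetical: opener is just a hook so the function can be tried without a network connection. The import fallback is for Python 3 only.

```python
try:
    from urllib2 import Request, urlopen, URLError   # Python 2, as in this post
except ImportError:
    from urllib.request import Request, urlopen      # Python 3 equivalents
    from urllib.error import URLError

def fetch_html(url, opener=urlopen, timeout=10):
    """Return the decoded page, or None if the request fails."""
    headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
    try:
        response = opener(Request(url, headers=headers), timeout=timeout)
        return response.read().decode('utf-8')
    except URLError as e:
        # Report the failure and let the caller move on to the next page.
        print("failed to fetch %s: %s" % (url, e))
        return None
```

In the loop above, a page that comes back as None could simply be skipped with continue.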