http://blog.csdn.net/abclixu123/article/details/38502993
Details: http://www.tuicool.com/articles/RNFVrm
Everyone knows Python's BeautifulSoup package:
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(html)
Use this package to first strip the script and style tags from the HTML:
[script.extract() for script in soup.findAll('script')]
[style.extract() for style in soup.findAll('style')]
# After the cleanup, the package's prettify() function normalizes the markup formatting:
soup.prettify()
Then use a regular expression to strip all remaining HTML tags:
reg1 = re.compile("<[^>]*>")
content = reg1.sub('', soup.prettify())
What remains is plain text, typically one item per line. After filtering out the blank lines you know the total number of lines and the character count of each line; I used Excel to make some statistics of the character counts per line.
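Putting those steps together, here is a minimal sketch (assuming the newer bs4 package, which the later examples in these notes use, and an html string already fetched) that strips script/style, removes the remaining tags, and counts characters per non-blank line:
import re
from bs4 import BeautifulSoup

def text_line_stats(html):
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup.find_all(['script', 'style']):  # clear out script and style blocks
        tag.extract()
    content = re.sub(r'<[^>]*>', '', soup.prettify())  # drop the remaining tags
    lines = [line.strip() for line in content.splitlines() if line.strip()]  # skip blank lines
    return len(lines), [len(line) for line in lines]  # total lines, characters per line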
http://www.open-open.com/lib/view/open1420378937984.html
urllib.request.urlretrieve(url, 'example.pdf')
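urlretrieve also accepts an optional reporthook callback, which is handy for watching progress on large PDFs; a quick sketch (the URL and file name here are placeholders):
import urllib.request

def report(block_num, block_size, total_size):
    # urlretrieve calls this after each block: blocks transferred so far, block size, total size in bytes
    print('downloaded about %d of %d bytes' % (block_num * block_size, total_size))

urllib.request.urlretrieve('http://www.example.com/sample.pdf', 'example.pdf', report)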
Example: scraping a draft announcement page on cninfo (巨潮资讯网) and downloading its PDF
from bs4 import BeautifulSoup
import urllib.request
import re
import io
import sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')  # change stdout's default encoding so the output displays correctly
urlhand = urllib.request.urlopen('http://www.cninfo.com.cn/cninfo-new/disclosure/sse/bulletin_detail/true/1202367076?announceTime=2016-06-14')
content = urlhand.read()  # read the page bytes before handing them to BeautifulSoup
soup = BeautifulSoup(content,'html.parser')
# print(soup.get_text())  # still shows as garbled text in cmd
getContent = soup.find(attrs ={'class':r'btn-blue bd-btn'})
# print(getContent)
tags = getContent('a')
# print(len(tags))
pattern = r'"/(cninfo-new.+?)"'
stringFinded = re.findall(pattern, str(tags))  # search() results are awkward to handle; findall() returns a list
downloadUrl = 'http://www.cninfo.com.cn/' +stringFinded[0]
print(downloadUrl)
## get the file name
getName = soup.find('h2')
print(getName.encode('utf-8'))
pattern = r'(\d+?.+?)[\n\t]+?([\u4e00-\u9fa5\w]+?.+?)[\n\t]'  # [\u4e00-\u9fa5\w] matches Chinese characters, so matching stops when it hits a % sign; the dot (.) matches anything except \n
stringFinded = re.findall(pattern,str(getName))
print(stringFinded)
saveName = stringFinded[0][0] + stringFinded[0][1] + '.pdf'
urllib.request.urlretrieve(downloadUrl,saveName)
# get all of the text
# fin = open('./textResult.txt', 'w', encoding = 'utf-8')  # open in utf-8 because we are writing a string
# fin.write(soup.get_text())  # the argument is a str
# fin.close()
http://www.oschina.net/translate/python-parallelism-in-one-line
Use try/except to catch errors, write the failing pages into the notOkList, and save the download links that were fetched successfully.
Note:
1. A statement must be inside the try block for the exception it raises to be caught.
2. except can be written without an exception type, in which case it catches everything; if a type is given, any exception outside that type will abort the program.
For example:
# process cninfo pages that contain a PDF and extract the PDF download address
import xlrd
import re
import socket
from bs4 import BeautifulSoup
import urllib.request
socket.setdefaulttimeout(10)  # set the timeout
# record all the data taken from the Excel sheet
pageList = ['http://www.cninfo.com.cn/cninfo-new/disclosure/sse/bulletin_detail/true/1202367076?announceTime=2016-06-14', 'http://www.cninfo.com.cn/cninfo-new/disclosure/szse_sme/bulletin_detail/true/1202534048?announceTime=2016-08-03%2008:15']
pageName = ['haha', 'xixi']
# record the PDF download addresses obtained, for automatic downloading
downloadName = []
downloadList = []
# record the addresses whose PDF download failed, for manual downloading
notOkList = []
notOkName = []
for index in range(len(pageList)):
    url = pageList[index]
    print(index, url)
    name = pageName[index]
    # print('i am try' + str(index))
    try:
        html = urllib.request.urlopen(url)
        # print('i am here' + str(index))
        content = html.read()
        soup = BeautifulSoup(content, 'html.parser')
        getContent = soup.find(attrs = {'class' : r'btn-blue bd-btn'})
        tags = getContent('a')
        pattern = r'"/(cninfo-new.+?)"'
        stringFinded = re.findall(pattern, str(tags))  # search() results are awkward to handle; findall() returns a list
        dlAddress = 'http://www.cninfo.com.cn/' + stringFinded[0]
        downloadList.append(dlAddress)
        downloadName.append(name)
    except TimeoutError:
        notOkList.append(url)
        notOkName.append(name)
        continue
A note on exception catching:
If no exception type is specified, every exception raised inside the try block is caught (it might, for instance, come from findall), which lets the program run straight through. The downside is that you cannot tell whether anything other than a timeout went wrong, so it is best to keep the exception type while debugging; for the final run you can drop it so the whole job completes, collecting every failure into notOkList to deal with separately.
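As a toy illustration of that trade-off (the URLs below are placeholders, not the cninfo pages): keep only the TimeoutError clause while debugging so unexpected errors surface, and add a broad except for the final run so the loop always finishes:
import socket
import urllib.request

socket.setdefaulttimeout(10)
pages = ['http://www.example.com/', 'http://no-such-host.invalid/']
notOkList = []
for url in pages:
    try:
        urllib.request.urlopen(url).read()
    except TimeoutError:    # debugging: only timeouts are tolerated, anything else crashes loudly
        notOkList.append(url)
    except Exception:       # final run: record everything else too and keep going
        notOkList.append(url)
print(notOkList)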
If the custom file name contains any character that is illegal in file names (\ / * ? " < > |), the saved file name keeps only the part to the left of that character; for example abc?d.pdf ends up saved as just abc.
Workaround:
patternIllegal = r'[^\\/\*\?"<>|]+'  # ^ inside [] negates the class, so this finds all the legal characters, which can then be joined back together
results = re.findall(patternIllegal, preDifinedName)
fileName = docAddress + ''.join(results)
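A quick check of that sanitizer with a made-up name (preDifinedName and docAddress are stand-ins here):
import re
docAddress = './pdfStores/'
preDifinedName = 'abc?d.pdf'
patternIllegal = r'[^\\/\*\?"<>|]+'
results = re.findall(patternIllegal, preDifinedName)
print(docAddress + ''.join(results))  # ./pdfStores/abcd.pdf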
UseDownloadList.py
# consume the downloadList obtained from the parallel fetch
import urllib.request
import re
import socket
socket.setdefaulttimeout(10)
currRound = 20
print(currRound)
while (currRound > 0):
    fhand = open('./downloadFile.txt', 'r', encoding = 'utf-8')
    fContent = fhand.read()
    fhand.close()
    patternUrl = r'downloadList\n(.+?)\ndownloadName'
    patternName = r'downloadName\n(.+)'
    urlList = re.findall(patternUrl, fContent, re.S)
    nameList = re.findall(patternName, fContent, re.S)
    urls = urlList[0].splitlines()
    names = nameList[0].splitlines()
    docAddress = './pdfStores/'
    failUrlList = []
    failNameList = []
    for index in range(len(urls)):
        url = urls[index]
        saveName = docAddress + names[index]
        try:
            print('hi', index)
            urllib.request.urlretrieve(url, saveName)
        except:
            print(url)
            print(saveName)
            failUrlList.append(url)
            failNameList.append(saveName)
            continue
    if (len(failUrlList) <= 5): break
    downloadFile = open('downloadFile.txt', 'w', encoding = 'utf-8')
    downloadFile.write('\ndownloadList\n')
    downloadFile.write('\n'.join(failUrlList))
    downloadFile.write('\ndownloadName\n')
    downloadFile.write('\n'.join(failNameList))
    downloadFile.close()
    currRound -= 1
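For reference, a minimal sketch of how an initial downloadFile.txt could be written so the retry loop above can parse it (the URL and file name are placeholders):
downloadList = ['http://www.cninfo.com.cn/cninfo-new/some-draft.pdf']
downloadName = ['haha.pdf']
with open('./downloadFile.txt', 'w', encoding='utf-8') as f:
    f.write('\ndownloadList\n')       # marker line the retry loop's regex looks for
    f.write('\n'.join(downloadList))  # one URL per line
    f.write('\ndownloadName\n')       # marker for the matching save names
    f.write('\n'.join(downloadName))  # one name per line, same order as the URLs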
The article below does not give a complete solution, but it offers several things to try, one of which is disguising the request with a User-Agent:
http://www.crifan.com/use_python_urllib-urlretrieve_download_picture_speed_too_slow_add_user_agent_for_urlretrieve/comment-page-1/
There is also a Zhihu discussion thread; the later answers offer some approaches.
Finally, one last plug for my no-longer-updated, scrappy crawler blog series:
Crawler essentials: requests
01. Preparation
02. A simple first attempt
Extra: setting up a comfortable Python development environment
03. Douban Movies Top 250
04. Another way to scrape
05. Storage
06. Scraping strategies for massive data
07. Anti-crawling mechanisms <1>
08. Simulating login
09. Using a crawler to find the shortest follow chain between me and 轮子哥
Author: xlzd
Link: https://www.zhihu.com/question/28168585/answer/120205863
Source: Zhihu
Copyright belongs to the author; please contact the author for permission before reposting.
To avoid being flagged for downloading too frequently, you can make the crawler pause:
import time
time.sleep(500)
Press F12 in the browser and look at the Network tab.
The request headers look roughly like this:
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip, deflate, sdch
Accept-Language:zh-CN,zh;q=0.8
Connection:keep-alive
Host:www.cninfo.com.cn
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36
See the explanation below for details:
https://jecvay.com/2014/09/python3-web-bug-series3.html
The code looks like this:
import urllib.request
url = 'http://www.baidu.com/'
req = urllib.request.Request(url, headers ={
'Connection': 'Keep-Alive',
'Accept': 'text/html, application/xhtml+xml, */*',
'Accept-Language': 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko'
})
oper = urllib.request.urlopen(req)
data = oper.read()
print(data.decode())
# To write to disk: urlopen returns a byte stream, so write the bytes directly and open the file with 'wb':
data = oper.read()
fhand = open(saveName, 'wb')
fhand.write(data)
fhand.close()
You can use type() to inspect what kind of data you got back, and decode() (bytes to str) to convert it:
>>> type(contents)
<class 'bytes'>
>>> type(contents.decode('utf-8'))
<class 'str'>
>>> type(contents.decode('utf-8', 'ignore'))
<class 'str'>
Author: Fel Peter
Link: https://www.zhihu.com/question/35838789/answer/65794367
Source: Zhihu
Copyright belongs to the author; please contact the author for permission before reposting.