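I set out to write a scraper for a site that shares Youku VIP accounts, but the site asks for a captcha before it hands one out.
First, I used Chrome's developer tools to inspect the request that loads the captcha image.
Then I assembled the same URL and requested it directly, only to find that the site restricts that kind of access.
Presumably it checks the request headers.
So I copied the headers from the captcha-refresh request, built my own request with them, and wrote the image to disk: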
import random
import requests

def getyzm():
    # Headers copied from the browser's captcha-refresh request.
    headers = {
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Connection': 'keep-alive',
        #Cookie:PHPSESSID=d763fd34e25925880c490955de8e0f2c
        'Host': 'vip.cengfan6.com',
        'Referer': 'http://vip.cengfan6.com/y/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest'
    }
    # Random value for the `s` query parameter.
    i = random.randint(1, 999999)
    print(i)
    url = 'http://vip.cengfan6.com/y/../code.php?s=%i' % i
    html = requests.get(url, headers=headers)
    # Write the captcha image to disk.
    with open('yzm.png', 'wb') as f:
        f.write(html.content)
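Retyping copied headers as a dict by hand is easy to get wrong. As a small convenience, the raw `Name:value` text from the browser can be parsed instead. This is only a sketch: `parse_raw_headers` is a hypothetical helper that is not part of the original script, and it assumes one header per line.

```python
def parse_raw_headers(raw):
    """Turn 'Name:value' lines copied from the browser into a dict."""
    headers = {}
    for line in raw.strip().splitlines():
        line = line.strip()
        # Skip blank lines and lines commented out with '#'.
        if not line or line.startswith('#'):
            continue
        # Split on the first colon only, so URLs in the value stay intact.
        name, _, value = line.partition(':')
        headers[name.strip()] = value.strip()
    return headers


RAW = '''
Accept-Language:zh-CN,zh;q=0.8
Host:vip.cengfan6.com
Referer:http://vip.cengfan6.com/y/
X-Requested-With:XMLHttpRequest
'''

headers = parse_raw_headers(RAW)
print(headers['Referer'])  # http://vip.cengfan6.com/y/
```

The resulting dict can be passed as `headers=` to `requests.get` in the same way as the hand-written one.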
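Next came recognizing the captcha. I started with pytesser, which is really not easy to install (wry smile).
References:
http://www.th7.cn/Program/Python/201602/768304.shtml
http://m.blog.csdn.net/article/details?id=53537010
https://my.oschina.net/jhao104/blog/647326?fromerr=xJxwPW5X
That was too much trouble, so I switched to pytesseract.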
A quick test:
import pytesseract
from PIL import Image

image = Image.open('c:/yzm.png')
code = pytesseract.image_to_string(image)
print(code)
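Ah, it recognized English letters, but my captcha is all digits. orz
One option would be to look into machine learning and train a recognizer. But I don't know how yet. Something to learn!
Reference for later study: http://www.cnblogs.com/beer/p/5672678.html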
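Before going as far as training a model, a cheaper thing to try is telling Tesseract to consider digits only. The sketch below is untested against this site's captcha. It assumes a Tesseract build that honours `tessedit_char_whitelist` (some versions ignore it, hence the extra filter) and that accepts the `--psm` spelling (older 3.x builds use `-psm`). The threshold of 140 is an arbitrary starting value, and `keep_digits` and `ocr_digits` are hypothetical names.

```python
import re


def keep_digits(text):
    """Drop everything except ASCII digits from an OCR result."""
    return re.sub(r'[^0-9]', '', text)


def ocr_digits(path):
    """OCR an image while asking Tesseract to consider digits only."""
    # Imported here so keep_digits() can be used without the OCR stack.
    import pytesseract
    from PIL import Image

    # Convert to grayscale, then binarize with a fixed threshold.
    image = Image.open(path).convert('L').point(lambda p: 255 if p > 140 else 0)
    # --psm 7 treats the image as a single line of text.
    config = '--psm 7 -c tessedit_char_whitelist=0123456789'
    return keep_digits(pytesseract.image_to_string(image, config=config))


print(keep_digits(' 4l 82\n'))  # 482
```

Calling `ocr_digits('yzm.png')` would then return whatever digits Tesseract manages to read. Noisy captchas may still need manual entry.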
For now, I'll read it by hand (sad):
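import time
from PIL import Image

# "Recognize" the captcha by showing it to a human.
def viewyzm():
    print('please input yanzhengma')
    time.sleep(2)
    image = Image.open('yzm.png')
    image.show()
    # Python 3: input() replaces Python 2's raw_input().
    yzm = input('Close the image window, then type the captcha: ')
    print(yzm)

getyzm()
viewyzm()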
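Later I ran into an AJAX request.
I spotted it in Chrome's network panel.
Interestingly, refreshing the page calls the history endpoint, which returns the accounts and passwords that were already handed out.
I wrote two functions: one requests a new account and password, the other requests the account history.
The site limits this to 5 accounts. I added a proxy and still got only 5. What? So the limit isn't per IP?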
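def get_vip():
    # Request a new account. The response is not decrypted here; the account
    # it grants can be read back from the history endpoint instead.
    headers = {
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Connection': 'keep-alive',
        #'Cookie':PHPSESSID=d3a9d9a7a9ad9fee71a9588773388ead
        'Host': 'vip.cengfan6.com',
        'Referer': 'http://vip.cengfan6.com/y/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest'
    }
    # NOTE: kept as originally written. requests looks proxies up by URL
    # scheme ('http', 'https'), so this mapping never matches a request.
    proxies = {
        '117.90.6.65': 9000
    }
    # viewyzm() must return the typed captcha for this to work.
    # The stray space before '&' in the original URL has been removed.
    vip_url = 'http://vip.cengfan6.com/ajax.php?code=%s&typename=2' % viewyzm()
    viphtml = requests.get(vip_url, headers=headers, proxies=proxies)
    print(viphtml.content)

def get_host_vip():
    proxies = {
        '117.90.6.65': 9000
    }
    vip_url = 'http://vip.cengfan6.com/ajax_jilu.php?viptype=2'
    viphtml = requests.get(vip_url, proxies=proxies)
    # The Python 2 original matched raw bytes against a UTF-8 pattern,
    # so decode the body as UTF-8 explicitly.
    viphtml.encoding = 'utf-8'
    # The end of this pattern was lost to a line break when the post was
    # rendered; '<br>' is an assumed reconstruction. '.' stands in for the
    # parentheses around 土豆 so they do not form a capture group.
    vips = re.findall(r'优酷.土豆.帐号:(.+?)密码:(.+?)<br\s*/?>', viphtml.text)
    for vip in vips:
        print(vip[0] + ":" + vip[1])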
Most likely the way I set up the proxy is wrong.
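For what it's worth, requests picks a proxy by the scheme of the target URL, so a dict keyed by the proxy's IP address never matches and the request goes out directly. That may be why the limit of 5 did not change, though the limit could also be tied to something other than the IP. A minimal sketch of the expected shape follows; `build_proxies` is a hypothetical helper, and whether this particular proxy is still alive is unknown.

```python
def build_proxies(host, port):
    """Build the scheme-keyed mapping that requests expects for proxies."""
    address = 'http://%s:%d' % (host, port)
    # Keys are the URL schemes of the target; values are proxy URLs.
    return {'http': address, 'https': address}


proxies = build_proxies('117.90.6.65', 9000)
print(proxies['http'])  # http://117.90.6.65:9000

# Usage sketch (not executed here, needs network access and a live proxy):
#   import requests
#   r = requests.get('http://vip.cengfan6.com/ajax_jilu.php?viptype=2',
#                    proxies=proxies, timeout=10)
```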
Still, 5 is enough. I use this site's shared accounts a lot. (smirk)
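One loose end is the PHPSESSID cookie, which is commented out or hard-coded in the snippets. If the site ties the expected captcha answer to that session cookie (an assumption, not something verified here), then the captcha image and the `ajax.php` call need to go through the same session. A rough sketch of that idea follows. `captcha_url`, `redeem_url`, `fetch_and_redeem` and the `ask` callback are all hypothetical names.

```python
from urllib.parse import urlencode

BASE = 'http://vip.cengfan6.com'


def captcha_url(seed):
    # Same endpoint as '/y/../code.php' with the '../' resolved.
    return '%s/code.php?%s' % (BASE, urlencode({'s': seed}))


def redeem_url(code):
    # urlencode() builds the query string, so no stray space can slip in.
    return '%s/ajax.php?%s' % (BASE, urlencode({'code': code.strip(), 'typename': 2}))


def fetch_and_redeem(session, seed, ask):
    """Fetch a captcha and submit its answer through the same session.

    `session` needs a requests.Session-like get(); `ask` receives the
    image bytes and returns the code the user typed.
    """
    image_bytes = session.get(captcha_url(seed)).content
    code = ask(image_bytes)
    return session.get(redeem_url(code)).text


# Usage sketch (not executed here, needs network access):
#   import requests
#   with requests.Session() as s:
#       s.headers.update({'Referer': BASE + '/y/',
#                         'X-Requested-With': 'XMLHttpRequest'})
#       print(fetch_and_redeem(s, 123456, lambda img: input('captcha: ')))
```

A `requests.Session` keeps whatever cookies the server sets on the first response and sends them back on later requests, so nothing has to be copied from the browser by hand.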
The full code, for the record:
# -*- coding: UTF-8 -*-
#../code.php?s=992671249
#url='http://vip.cengfan6.com/y/'
import requests
import random
from PIL import Image
import time
import re

# Fetch the captcha image.
def getyzm():
    headers = {
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Connection': 'keep-alive',
        #Cookie:PHPSESSID=d763fd34e25925880c490955de8e0f2c
        'Host': 'vip.cengfan6.com',
        'Referer': 'http://vip.cengfan6.com/y/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest'
    }
    i = random.randint(1, 999999)
    print(i)
    url = 'http://vip.cengfan6.com/y/../code.php?s=%i' % i
    html = requests.get(url, headers=headers)
    # Write the captcha image to disk.
    with open('yzm.png', 'wb') as f:
        f.write(html.content)

# "Recognize" the captcha by showing it to a human.
def viewyzm():
    print('please input yanzhengma')
    time.sleep(2)
    image = Image.open('yzm.png')
    image.show()
    # Python 3: input() replaces Python 2's raw_input().
    yzm = input('Close the image window, then type the captcha: ')
    return yzm

# Raw XHR headers as copied from the browser, kept for reference (unused).
xhrhd = '''
Accept-Encoding:gzip, deflate, sdch
Accept-Language:zh-CN,zh;q=0.8
Connection:keep-alive
Cookie:PHPSESSID=d3a9d9a7a9ad9fee71a9588773388ead
Host:vip.cengfan6.com
Referer:http://vip.cengfan6.com/y/
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36
X-Requested-With:XMLHttpRequest
'''

def get_vip():
    # Request a new account. The response is not decrypted here; the account
    # it grants can be read back from the history endpoint instead.
    headers = {
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Connection': 'keep-alive',
        'Cookie': 'PHPSESSID=d3a9d9a7a9ad9fee71a9588773388ewd',
        'Host': 'vip.cengfan6.com',
        'Referer': 'http://vip.cengfan6.com/y/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest'
    }
    # NOTE: kept as originally written. requests looks proxies up by URL
    # scheme ('http', 'https'), so this mapping never matches a request.
    proxies = {
        '117.90.6.65': 9000
    }
    # The stray space before '&' in the original URL has been removed.
    vip_url = 'http://vip.cengfan6.com/ajax.php?code=%s&typename=2' % viewyzm()
    viphtml = requests.get(vip_url, headers=headers, proxies=proxies)
    print(viphtml.content)

def get_host_vip():
    proxies = {
        '117.90.6.65': 9000
    }
    vip_url = 'http://vip.cengfan6.com/ajax_jilu.php?viptype=2'
    viphtml = requests.get(vip_url, proxies=proxies)
    # The Python 2 original matched raw bytes against a UTF-8 pattern,
    # so decode the body as UTF-8 explicitly.
    viphtml.encoding = 'utf-8'
    # The end of this pattern was lost to a line break when the post was
    # rendered; '<br>' is an assumed reconstruction. '.' stands in for the
    # parentheses around 土豆 so they do not form a capture group.
    vips = re.findall(r'优酷.土豆.帐号:(.+?)密码:(.+?)<br\s*/?>', viphtml.text)
    for vip in vips:
        print(vip[0] + ":" + vip[1])

getyzm()
get_vip()
get_host_vip()
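The history parsing can be tried out offline, without touching the site. The sketch below is illustrative only: the sample string is invented, the accounts in it are placeholders, and the `<br>` separator between entries is an assumption about the response format. `parse_accounts` is a hypothetical helper.

```python
import re

# Accept both ASCII and full-width colons; stop each entry at <br> or
# at the end of the text. The <br> separator is an assumption.
ACCOUNT_RE = re.compile(r'帐号[::](.+?)密码[::](.+?)(?:<br\s*/?>|$)')


def parse_accounts(html):
    """Return a list of (account, password) tuples found in the text."""
    return [(a.strip(), p.strip()) for a, p in ACCOUNT_RE.findall(html)]


# Invented sample data, not a real response from the site.
SAMPLE = ('优酷(土豆)帐号:user1@example.com密码:pass1<br>'
          '优酷(土豆)帐号:user2@example.com密码:pass2<br>')

for account, password in parse_accounts(SAMPLE):
    print(account + ':' + password)
```

Anchoring the pattern on 帐号 rather than on the whole label sidesteps the question of which kind of parentheses the page uses around 土豆.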
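I really didn't do this just to get VIP accounts. Mainly I want to write more things down, because what I don't write down I tend to forget.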