Python captcha recognition, simulated AJAX requests, and scraping Youku VIP accounts (tongue in cheek)

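I wanted to write a scraper for a site that shares Youku VIP accounts, but the site requires a captcha.
So first I used Chrome's developer tools to look at the captcha request.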


Then I assembled the URL and requested it directly, only to find that access was restricted.


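So the server is probably checking the request headers.
I copied the request headers from the captcha-refresh request, built the request myself, and wrote the image to disk.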

import random
import requests

def getyzm():
    # Headers copied from the browser's captcha-refresh request
    headers = {
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Connection': 'keep-alive',
        # Cookie: PHPSESSID=d763fd34e25925880c490955de8e0f2c
        'Host': 'vip.cengfan6.com',
        'Referer': 'http://vip.cengfan6.com/y/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest'
    }
    # Random value for the s query parameter
    i = random.randint(1, 999999)
    print(i)
    url = 'http://vip.cengfan6.com/y/../code.php?s=%i' % i
    html = requests.get(url, headers=headers)
    # Write the image to disk
    with open('yzm.png', 'wb') as f:
        f.write(html.content)
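Retyping the copied headers as a Python dict is tedious and easy to get wrong. As a rough sketch, the block pasted from Chrome's Network panel can be split into a dict automatically. `parse_raw_headers` is a hypothetical helper name, not part of requests, and the sample lines are the header values from this post.

```python
def parse_raw_headers(raw):
    """Turn 'Name:value' lines copied from DevTools into a headers dict."""
    headers = {}
    for line in raw.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only: values such as URLs contain colons too
        name, _, value = line.partition(':')
        headers[name.strip()] = value.strip()
    return headers


raw = '''
Accept-Language:zh-CN,zh;q=0.8
Host:vip.cengfan6.com
Referer:http://vip.cengfan6.com/y/
X-Requested-With:XMLHttpRequest
'''
headers = parse_raw_headers(raw)
print(headers['Referer'])
```

The resulting dict can be passed as `headers=` to `requests.get`.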

Next came recognizing the captcha. I started with pytesser, which is really not easy to install (wry smile).
References:
http://www.th7.cn/Program/Python/201602/768304.shtml

http://m.blog.csdn.net/article/details?id=53537010

https://my.oschina.net/jhao104/blog/647326?fromerr=xJxwPW5X

That was too much trouble, so I switched to pytesseract.

A quick test:

import pytesseract
from PIL import Image

image = Image.open('c:/yzm.png')
code = pytesseract.image_to_string(image)
print(code)

Hmm, it recognized English letters, but my captcha is all digits. orz
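One thing that might help here, as an untested sketch: Tesseract can be asked to consider only digits through its `tessedit_char_whitelist` variable, passed via pytesseract's `config` argument. Whether the whitelist is honoured depends on the installed Tesseract version and engine, so the helper below also filters the result. `keep_digits` and `read_digits` are hypothetical names, `--psm 7` (treat the image as a single text line) is an assumption about how the captcha is laid out, and that spelling is for Tesseract 4 and later (older releases use `-psm`).

```python
import re


def keep_digits(text):
    """Drop everything except the digits 0-9 from an OCR result."""
    return re.sub(r'[^0-9]', '', text)


def read_digits(path):
    """OCR a captcha image, restricted to digits where Tesseract allows it."""
    # Imported here so keep_digits stays usable without the OCR stack installed
    import pytesseract
    from PIL import Image

    # Grayscale often helps OCR on noisy captchas
    image = Image.open(path).convert('L')
    config = '--psm 7 -c tessedit_char_whitelist=0123456789'
    return keep_digits(pytesseract.image_to_string(image, config=config))


# Filtering a made-up OCR result that contains a stray letter and whitespace
print(keep_digits('4 7O1\n'))
```

Calling `read_digits('yzm.png')` would run the OCR itself; it needs both pytesseract and the Tesseract binary installed.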

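I figured I could look into machine learning and train a model. But I don't know how to do that yet, so it's something to learn!
Reference for study: http://www.cnblogs.com/beer/p/5672678.html
For now, I'll read the captcha by hand (sad).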


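import time
from PIL import Image

# Show the captcha and read it from the user
def viewyzm():
    print('please input yanzhengma')
    time.sleep(2)
    image = Image.open('yzm.png')
    image.show()
    yzm = input('Close the image window, then type the captcha: ')
    print(yzm)
    # Return the value so the caller can put it into the request URL
    return yzm

getyzm()
viewyzm()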

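After that I ran into AJAX requests, which I inspected in Chrome.
Interestingly, refreshing the page requests the history, which returns the accounts and passwords fetched earlier.
I wrote two functions: one requests a new account and password, the other requests the accounts and passwords from the history.
The site limits this to 5 accounts. Even with a proxy I could still only get 5. What? Isn't the limit per IP?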
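The Cookie line in the copied headers suggests the site tracks a PHPSESSID session, and a captcha is normally checked against the session that requested it. Assuming that is the case here (not confirmed for this site), the captcha download and the AJAX call need to share cookies, which `requests.Session` handles automatically. A minimal sketch; `prepare_vip_request` is a hypothetical helper and the cookie value is a placeholder.

```python
import requests

BASE = 'http://vip.cengfan6.com'


def prepare_vip_request(session, code):
    """Build the AJAX request without sending it, so it can be inspected."""
    request = requests.Request(
        'GET',
        BASE + '/ajax.php',
        # Passing params lets requests build the query string
        params={'code': code, 'typename': 2},
        headers={
            'Referer': BASE + '/y/',
            'X-Requested-With': 'XMLHttpRequest',
        },
    )
    # prepare_request merges in the cookies the session has collected
    return session.prepare_request(request)


session = requests.Session()
# Normally fetching the captcha through session.get(...) would receive this
# cookie from the server; it is set by hand here to keep the example offline.
session.cookies.set('PHPSESSID', 'placeholder', domain='vip.cengfan6.com')

prepared = prepare_vip_request(session, '1234')
print(prepared.url)
print(prepared.headers['Cookie'])
# session.send(prepared) would perform the actual request
```

Building the query string through `params` also avoids stray spaces creeping into a hand-formatted URL.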

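import re
import requests

def get_vip():
    # Request a new account. The response is not decrypted here; accounts
    # fetched this way can be read back from the history instead.
    headers = {
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Connection': 'keep-alive',
        # 'Cookie': 'PHPSESSID=d3a9d9a7a9ad9fee71a9588773388ead'
        'Host': 'vip.cengfan6.com',
        'Referer': 'http://vip.cengfan6.com/y/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest'
    }
    # Proxy mapping exactly as used in this run
    proxies = {
        '117.90.6.65': 9000
    }
    vip_url = 'http://vip.cengfan6.com/ajax.php?code=%s&typename=2' % viewyzm()
    viphtml = requests.get(vip_url, headers=headers, proxies=proxies)
    print(viphtml.content)

def get_host_vip():
    # Fetch the history of accounts already obtained
    proxies = {
        '117.90.6.65': 9000
    }
    vip_url = 'http://vip.cengfan6.com/ajax_jilu.php?viptype=2'
    viphtml = requests.get(vip_url, proxies=proxies)
    # The HTML tags that delimited the original pattern are missing from this
    # listing; ending the password at whitespace or '<' is an assumption, as is
    # the UTF-8 page encoding. Both bracket and colon widths are accepted.
    vips = re.findall(r'优酷[((]土豆[))]帐号[::](.+?)密码[::]([^\s<]+)',
                      viphtml.content.decode('utf-8'))
    for vip in vips:
        print(vip[0] + ":" + vip[1])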
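To see what the regular expression in get_host_vip is doing, here is a small offline sketch run against made-up HTML. The `<p>` wrapper, the accounts and the passwords are invented for illustration and are not the site's real markup; `parse_vips` is a hypothetical helper. The pattern accepts both ASCII and full-width brackets and colons, since it is not certain which the page uses.

```python
import re

# Two capture groups: the account, then the password.
# The password is assumed to end at whitespace or at the next HTML tag.
VIP_PATTERN = re.compile(
    r'优酷[((]土豆[))]帐号[::]\s*(.+?)\s*密码[::]\s*([^\s<]+)'
)


def parse_vips(html):
    """Return a list of (account, password) tuples found in the text."""
    return VIP_PATTERN.findall(html)


# Invented sample data, one entry with ASCII and one with full-width punctuation
sample = (
    '<p>优酷(土豆)帐号:user1@example.com密码:pass1</p>'
    '<p>优酷(土豆)帐号:user2@example.com 密码:pass2</p>'
)

for account, password in parse_vips(sample):
    print(account + ':' + password)
```

With two groups in the pattern, `findall` returns tuples, which is why the original loop indexes `vip[0]` and `vip[1]`.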

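My proxy setup was probably wrong: requests expects the proxies dict to be keyed by URL scheme (such as 'http'), so a dict keyed by the IP address is most likely ignored and the requests go out without the proxy.
Still, 5 accounts are enough. I use this site's VIP accounts all the time. (smirk)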
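For reference, a proxies mapping in the shape requests documents looks roughly like the sketch below. `build_proxies` is a hypothetical helper, and whether the proxy at 117.90.6.65:9000 is reachable at all is a separate question.

```python
def build_proxies(host, port):
    """Return a proxies mapping in the shape requests expects."""
    proxy_url = 'http://%s:%d' % (host, port)
    # Keys are the URL schemes of the target, values are full proxy URLs
    return {'http': proxy_url, 'https': proxy_url}


proxies = build_proxies('117.90.6.65', 9000)
print(proxies['http'])
# Usage: requests.get(vip_url, headers=headers, proxies=proxies)
```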

All the code, for the record:

# -*- coding: UTF-8 -*-
#../code.php?s=992671249
#url='http://vip.cengfan6.com/y/'
import requests
from bs4 import BeautifulSoup
import random
from PIL import Image
import time
import re
# Fetch the captcha image
def getyzm():
    headers={
    'Accept-Encoding':'gzip, deflate, sdch',
    'Accept-Language':'zh-CN,zh;q=0.8',
    'Connection':'keep-alive',
    #Cookie:PHPSESSID=d763fd34e25925880c490955de8e0f2c
    'Host':'vip.cengfan6.com',
    'Referer':'http://vip.cengfan6.com/y/',
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36',
    'X-Requested-With':'XMLHttpRequest'
    }
    i =random.randint(1,999999)
    print(i)
    url='http://vip.cengfan6.com/y/../code.php?s=%i' %i
    html = requests.get(url,headers=headers)
    # Write the image to disk
    with open('yzm.png','wb') as f:
        f.write(html.content)

# Show the captcha and read it from the user
def viewyzm():
    print('please input yanzhengma')
    time.sleep(2)
    image = Image.open('yzm.png')
    image.show()
    yzm = input('Close the image window, then type the captcha: ')
    return yzm



xhrhd ='''
Accept-Encoding:gzip, deflate, sdch
Accept-Language:zh-CN,zh;q=0.8
Connection:keep-alive
Cookie:PHPSESSID=d3a9d9a7a9ad9fee71a9588773388ead
Host:vip.cengfan6.com
Referer:http://vip.cengfan6.com/y/
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36
X-Requested-With:XMLHttpRequest
'''


def  get_vip():
    # Request a new account. The response is not decrypted; accounts fetched this way can be read from the history instead.
    headers={
    'Accept-Encoding':'gzip, deflate, sdch',
    'Accept-Language':'zh-CN,zh;q=0.8',
    'Connection':'keep-alive',
    'Cookie':'PHPSESSID=d3a9d9a7a9ad9fee71a9588773388ewd',
    'Host':'vip.cengfan6.com',
    'Referer':'http://vip.cengfan6.com/y/',
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36',
    'X-Requested-With':'XMLHttpRequest'
    }
    proxies={
        '117.90.6.65':9000
    }
    vip_url='http://vip.cengfan6.com/ajax.php?code=%s&typename=2' %viewyzm()
    viphtml  = requests.get(vip_url,headers=headers,proxies=proxies)
    print(viphtml.content)
def get_host_vip():
    proxies={
        '117.90.6.65':9000
    }
    vip_url= 'http://vip.cengfan6.com/ajax_jilu.php?viptype=2'
    viphtml  = requests.get(vip_url,proxies=proxies)
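    # The HTML tags that delimited the original pattern are missing from this
    # listing; ending the password at whitespace or '<' is an assumption, as is
    # the UTF-8 page encoding. Both bracket and colon widths are accepted.
    vips = re.findall(r'优酷[((]土豆[))]帐号[::](.+?)密码[::]([^\s<]+)',
                      viphtml.content.decode('utf-8'))
    for vip in vips:
        print(vip[0] + ":" + vip[1])

getyzm()
get_vip()
get_host_vip()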

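I really didn't do this just to get VIP accounts. Mostly I want to write more things down, because if I don't, I forget them easily.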
