Handling CAPTCHAs: Google reCAPTCHA

Solving Google reCAPTCHA through a third-party service

# In batch testing the success rate reached about 90%, and 1,000 email addresses cost roughly 4 USD, which is fairly economical.

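Part 1.
Request the YouTube channel's About page to obtain the influencer's channel_id and the session token generated by the page (later requests need it as a parameter).
1. Request method: GET
2. URL: https://www.youtube.com/channel/UCUHDuZbkCs7gs_cVDV5p3Yw/about?pbj=1
Parse the returned response to obtain token and channel_id.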

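import json
import re

import jsonpath
import requests
from fake_useragent import UserAgent

# Base channel URL; "/about" and "/about?pbj=1" are appended where needed.
url = "https://www.youtube.com/channel/UCUHDuZbkCs7gs_cVDV5p3Yw"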
headers = {
        "accept-language": "zh-CN,zh;q=0.9",
        "referer": url,
        "user-agent": UserAgent().random,
        "x-client-data": "CIu2yQEIpLbJAQjBtskBCKmdygEIqKPKARj5pcoB",
        "x-spf-previous": url,
        "x-spf-referer": url,
        "x-youtube-client-name": "1",
        "x-youtube-client-version": "2.20181030",
        "x-youtube-page-cl": "218622685",
        "x-youtube-page-label": "youtube.ytfe.desktop_20181024_8_RC0",
        "x-youtube-sts": "17829",
        "x-youtube-utc-offset": "480",
        "x-youtube-variants-checksum": "92a32082074d908eccfd9ce4e205165f"
    }

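# Use one session for every request so cookies set here carry over to the later POSTs.
session = requests.Session()
# Request the About page with pbj=1, as described above, so the body is JSON.
response = session.get(url=url + "/about?pbj=1", headers=headers, verify=False)
responseBody = json.loads(response.text)
token = jsonpath.jsonpath(responseBody, "$..xsrf_token")[0]
serviceTrackingParams = jsonpath.jsonpath(responseBody, "$..serviceTrackingParams")[0]
channel_id = None  # stays None if no browse_id is found
for item in serviceTrackingParams:
    if item["service"] == "GFEEDBACK":
        for param in item["params"]:
            if param["key"] == "browse_id":
                channel_id = param["value"]
                break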
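
The nested loop above can also be wrapped in a small function that returns as soon as the value is found and returns None when it is missing. This is only a sketch: the function name find_browse_id and the sample data are made up for illustration, and the sample merely imitates the shape of serviceTrackingParams that the loop above walks through.

```python
def find_browse_id(service_tracking_params):
    """Return the browse_id value from the GFEEDBACK entry, or None if absent."""
    for item in service_tracking_params:
        if item.get("service") != "GFEEDBACK":
            continue
        for param in item.get("params", []):
            if param.get("key") == "browse_id":
                return param.get("value")
    return None


# Hypothetical sample shaped like the list the loop above iterates over.
sample = [
    {"service": "CSI", "params": [{"key": "c", "value": "WEB"}]},
    {"service": "GFEEDBACK", "params": [
        {"key": "route", "value": "channel.about"},
        {"key": "browse_id", "value": "UCUHDuZbkCs7gs_cVDV5p3Yw"},
    ]},
]
print(find_browse_id(sample))
```

Returning None instead of leaving the variable unset lets the caller check the result before sending the later requests.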

Part 2
Obtain the required data-sitekey
1. Request method: POST
2. data parameter: session_token  # the token obtained in Part 1
3. url = "https://www.youtube.com/channels_profile_ajax?action_get_business_email_captcha=1"

def getdata_sitekey(token, url):
    url_sitekey = "https://www.youtube.com/channels_profile_ajax?action_get_business_email_captcha=1"
    headers = {
        "accept-language": "zh-CN,zh;q=0.9",
        "referer": url + "/about",
        "user-agent": UserAgent().random,
        "x-client-data": "CIu2yQEIpLbJAQjBtskBCKmdygEIqKPKARj5pcoB",
        "x-spf-previous": url + "/about",
        "x-spf-referer": url + "/about",
        "x-youtube-client-name": "1",
        "x-youtube-client-version": "2.20181030",
        "x-youtube-page-cl": "218622685",
        "x-youtube-page-label": "youtube.ytfe.desktop_20181024_8_RC0",
        "x-youtube-sts": "17829",
        "x-youtube-utc-offset": "480",
        "x-youtube-variants-checksum": "92a32082074d908eccfd9ce4e205165f"
    }
    formdata = {
        "session_token": token
    }
    data_sitekey = ""
    for i in range(3):
        try:
            response = session.post(url=url_sitekey, data=formdata, headers=headers, verify=False)
            responseBody = json.loads(response.text)["html_content"]
            data_sitekey = re.search(r'data-sitekey="(.*?)"', responseBody).group(1)
            break
        except Exception as e:
            print(e)

    return data_sitekey
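
The regular-expression step can be tried in isolation, without any network request. The sketch below assumes html_content is a string containing a data-sitekey="..." attribute, as the function above expects; the name extract_sitekey and the sample HTML are hypothetical.

```python
import re

# Matches the attribute value between the double quotes.
SITEKEY_RE = re.compile(r'data-sitekey="([^"]+)"')


def extract_sitekey(html_content):
    """Return the data-sitekey attribute value, or None when it is not present."""
    match = SITEKEY_RE.search(html_content)
    return match.group(1) if match else None


# Hypothetical fragment; the real html_content comes from the POST response.
sample_html = '<div class="g-recaptcha" data-sitekey="EXAMPLE_SITE_KEY"></div>'
print(extract_sitekey(sample_html))
```

Checking for None avoids the AttributeError that re.search(...).group(1) raises when the attribute is missing.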

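Part 3
Use the third-party CAPTCHA-solving platform https://anti-captcha.com/mainpage to obtain g-recaptcha-response, the POST parameter required by the next request.
This requires the Python package: pip install python-anticaptcha
For reference, see https://pypi.org/project/python-anticaptcha/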

from python_anticaptcha import AnticaptchaClient, NoCaptchaTaskProxylessTask
api_key = ""  # obtained by registering and paying on the [third-party solving platform](https://pypi.org/project/python-anticaptcha/)
client = AnticaptchaClient(api_key)
site_key = ''    # the data-sitekey extracted with a regular expression from the Part 2 response
url = "https://www.youtube.com/channel/UCUHDuZbkCs7gs_cVDV5p3Yw/about"  # address of the page where the CAPTCHA appears
task = NoCaptchaTaskProxylessTask(
    website_url=url,
    website_key=site_key,
    website_s_token=None
)
job = client.createTask(task)
job.join()  # block until the platform returns a solution
recaptcha_response = job.get_solution_response()

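Part 4
Request that retrieves the email address:
1. Request method: POST
2. url = "https://www.youtube.com/channels_profile_ajax?action_verify_business_email_recaptcha=1"
3. data parameters to pass in the POST request:
channel_id: every YouTube influencer has an ID
recaptcha_response: the string returned by the previous step
session_token: the token obtained in Part 1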

formdata = {
    "channel_id": channel_id,
    "g-recaptcha-response": recaptcha_response,
    "session_token": token
}

Send the request and extract the email address with a regular expression

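url_mail = "https://www.youtube.com/channels_profile_ajax?action_verify_business_email_recaptcha=1"
mail_addr = None  # stays None if the request fails or no address is found
response = session.post(
    url=url_mail,
    data=formdata,
    headers=headers,
    timeout=5,
    allow_redirects=False,
    # cookies=cookie
)
if response.status_code == 200:
    responseBody = json.loads(response.text)
    html_content = responseBody["html_content"]
    match = re.search(r'"mailto:(.*?)"', html_content)
    if match:
        mail_addr = match.group(1)
else:
    print(response.status_code)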
mail_addr is the email address we ultimately want to extract.
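
The extraction step on its own can be sketched as below. It assumes html_content contains a link of the form "mailto:...", as the regular expression above implies; the name extract_email and the sample strings are hypothetical. Unlike the pattern above, this variant also stops at a "?" so that a query part such as ?subject= is not included in the address.

```python
import html
import re

# Capture everything after "mailto: up to the closing quote or a "?" query part.
MAILTO_RE = re.compile(r'"mailto:([^"?]+)')


def extract_email(html_content):
    """Return the first mailto address in html_content, or None if there is none."""
    match = MAILTO_RE.search(html_content)
    if match is None:
        return None
    # Decode HTML entities such as &#64; that may stand in for "@".
    return html.unescape(match.group(1))


# Hypothetical fragment; the real html_content comes from the POST response.
sample_html = '<a href="mailto:contact@example.com">contact@example.com</a>'
print(extract_email(sample_html))
```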
