Scraping Lagou, Success at Last: A Roundup of the Problems Encountered Along the Way

I am new to web scraping and had only a basic understanding of crawlers. I had long wanted to try Lagou, but I was rejected again and again; its anti-crawling measures are remarkably strong. Still, no wall is airtight: after digging through documentation and forum posts, I finally got past the problems one by one. Here is a summary, for future reference.

Problem 1: 'status': False, 'msg': '您操作太频繁,请稍后再访问' (i.e. "you are operating too frequently, please try again later"), 'clientIp': '117.136.41.41', 'state': 2402

import requests
url = 'https://www.lagou.com/jobs/positionAjax.json?px=default&needAddtionalResult=false'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'
}
r = requests.get(url, headers=headers)   # a plain GET with only a User-Agent header
print(r.text)

Roughly 80% of sites will return basic data with the request above, but Lagou is tougher: no matter how many times I tried, I kept getting this response, which says I was visiting too frequently. I then tried some free proxy IPs, but the error persisted, and the clientIp in the response was still my own IP. After reading more posts I learned that Lagou apparently tracks visits through cookies, so I added the cookies from my browser to the request headers (a sketch follows). That solved the problem, but then Problem 2 appeared.
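A minimal sketch of that fix, assuming the Cookie string is copied from your own logged-in browser session (the value below is a placeholder, not a real cookie):

import requests
url = 'https://www.lagou.com/jobs/positionAjax.json?px=default&needAddtionalResult=false'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36',
    # Placeholder: paste the Cookie header captured from your browser's developer tools
    'Cookie': 'user_trace_token=...; JSESSIONID=...; SEARCH_ID=...',
}
r = requests.get(url, headers=headers)
print(r.text)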

Source: https://blog.csdn.net/weixin_40576010/article/details/88336980

Problem 2: HTTPConnectionPool(host=XX): Max retries exceeded with url: Failed to establish a new connection: [Errno 99] Cannot assign requested address

After a crawler hits the same site repeatedly for a while, this error shows up. The reason is that the client opens a TCP connection to the server before each transfer; to save that overhead, requests defaults to keep-alive, i.e. one connection is reused for many transfers. After many requests, however, connections are no longer released back to the connection pool, so no new connection can be established. Since the Connection header defaults to keep-alive, the fix is to set Connection to close in the headers:

headers = {
    'Connection': 'close',
}
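For reference, here is a minimal sketch of the earlier request with this header applied (same URL and User-Agent as in Problem 1; explicitly closing the response at the end is my own extra precaution, not part of the original fix):

import requests
url = 'https://www.lagou.com/jobs/positionAjax.json?px=default&needAddtionalResult=false'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36',
    'Connection': 'close',   # tell the server not to keep the TCP connection alive
}
r = requests.get(url, headers=headers, timeout=3)
print(r.status_code)
r.close()   # release the underlying connection immediately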

Source: https://blog.csdn.net/ZTCooper/article/details/80220063

 

Complete code:

import requests
import random
import time
import json

def main():
    # A small pool of free proxies; pick one at random per run
    proxie = [
        "134.249.156.3:82",
        "1.198.72.239:9999",
        "103.26.245.190:43328",
    ]
    proxies = {
        "http": "http://" + random.choice(proxie)
    }
    url_start = "https://www.lagou.com/jobs/list_数据分析?city=%E6%88%90%E9%83%BD&cl=false&fromSearch=true&labelWords=&suginput="
    url_parse = "https://www.lagou.com/jobs/positionAjax.json?city=全国&needAddtionalResult=false"
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Connection': 'close',  # fix for Problem 2: do not keep the TCP connection alive
        'Referer': 'https://www.lagou.com/jobs/list_%E8%BF%90%E7%BB%B4?city=%E6%88%90%E9%83%BD&cl=false&fromSearch=true&labelWords=&suginput=',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }
    for x in range(1, 5):
        data = {'first': 'true', 'pn': str(x), 'kd': '数据分析'}
        s = requests.Session()
        s.get(url_start, headers=headers, timeout=3)   # visit the list page first to obtain cookies (fix for Problem 1)
        cookie = s.cookies                             # cookies issued for this session
        response = s.post(url_parse, data=data, headers=headers, proxies=proxies, cookies=cookie, timeout=3)
        time.sleep(5)                                  # throttle requests to avoid the "too frequent" ban
        response.encoding = response.apparent_encoding
        text = json.loads(response.text)
        print(text)
        info = text["content"]["positionResult"]["result"]
        for i in info:
            print(i["companyFullName"])
            companyFullName = i["companyFullName"]
            print(i["positionName"])
            positionName = i["positionName"]
            print(i["salary"])
            salary = i["salary"]
            print(i["companySize"])
            companySize = i["companySize"]
            print(i["skillLables"])
            skillLables = i["skillLables"]
            print(i["createTime"])
            createTime = i["createTime"]
            print(i["district"])
            district = i["district"]
            print(i["stationname"])
            stationname = i["stationname"]

if __name__ == '__main__':
    main()
