Web Scraping Study Notes - 9

  • Basic usage of requests

First define a url, and then access it. Previously, with urllib, we had to simulate a browser sending a request to the server; with requests we can simply call response = requests.get(url=url).
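
To make the contrast concrete, here is a minimal sketch of the same request done both ways (the url is just a placeholder):

import urllib.request
import requests

url = 'https://www.baidu.com/'

# The urllib way: open the url, then read and decode the bytes yourself
resp = urllib.request.urlopen(url)
content_urllib = resp.read().decode('utf-8')

# The requests way: one call, and .text already gives the decoded page source
response = requests.get(url=url)
content = response.text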

  • One type and six attributes

First look at the type of response with print(type(response)): the response from urllib is of type HTTPResponse, while the response from requests is of type Response.

# One type
print(type(response))       # <class 'requests.models.Response'>

# Six attributes
# text returns the page source as a string (no b prefix, unlike urllib's read()); it may contain some garbled characters
print(response.text)
# Set the matching encoding; response.text read after this will show readable text instead of garbled characters
response.encoding = 'utf-8'
# Returns the url of the request
print(response.url)
# Returns the response body as binary data (bytes)
print(response.content)
# Returns the HTTP status code, same as urllib's getcode(); 200 here
print(response.status_code)
# Returns the response headers
print(response.headers)

  • GET requests with requests

We'll learn the requests GET request by comparing it with urllib's GET request. First, just as with urllib, define a url. The url must not include the wd= part after the ?, because the Chinese text needs to be encoded and cannot be written into the url directly. Then write a request header and set up the parameters. But how should these parameters be used or passed in? We can first try sending a request with response = requests.get(), then Ctrl + left-click on get to see the parameters of get(): url, params and kwargs.

url is the path of the requested resource, params holds the parameters, and kwargs is a dictionary of extra keyword arguments. Compare this with urllib's request-object customization: urllib's urlopen() does not accept request headers, which is why the request object had to be customized, whereas requests can take the parameters and headers directly.

response = requests.get(url=url,params=data,headers=headers)
content = response.text
print(content)
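
For comparison, a rough sketch of what the same GET request looks like with urllib, assuming the same url, data and headers dictionaries that appear in the complete code below: the query string has to be urlencoded by hand, and the headers can only be attached through a Request object.

import urllib.parse
import urllib.request

# Build the full url by encoding the parameters manually
full_url = url + urllib.parse.urlencode(data)
# Request-object customization is the only way to attach headers in urllib
req = urllib.request.Request(url=full_url, headers=headers)
resp = urllib.request.urlopen(req)
content = resp.read().decode('utf-8')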

After running it, a Baidu security verification prompt appeared, which is presumably an anti-scraping measure. So I added Accept, Accept-Language, Cookie and Host to the request headers one after another and finally got the data. The data still contained garbled characters, so a line response.encoding = 'utf-8' had to be added.

To summarize requests: (1) parameters are passed via params, (2) the parameters do not need urlencode encoding, (3) no request-object customization is needed.
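
A quick way to check points (1) and (2), assuming the same url, data and headers as in the complete code below: the prepared request that requests builds already has the wd parameter percent-encoded, without any urlencode call on our side.

response = requests.get(url=url, params=data, headers=headers)
# The final url shows '石家庄' percent-encoded automatically
print(response.request.url)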

The complete code is as follows:

import requests

url = 'https://www.baidu.com/s?'
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
    'Cookie': 'BIDUPSID=A9033E673693226562565BC0C5D5CB90; PSTM=1579152228; __yjs_duid=1_c3eab7486e61dc9f7b99b255ee9201311626010676599; BDUSS=VdEamY4VThWWnU5bmV5a2prWkhQNGVJQklOUH5mU3pncFUwV2N-Nlh5RnZ4YWxpRVFBQUFBJCQAAAAAAQAAAAEAAACRKBoQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG84gmJvOIJiV; BDUSS_BFESS=VdEamY4VThWWnU5bmV5a2prWkhQNGVJQklOUH5mU3pncFUwV2N-Nlh5RnZ4YWxpRVFBQUFBJCQAAAAAAQAAAAEAAACRKBoQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG84gmJvOIJiV; BAIDUID=8BA77C9C54F42CDB9CCFC4413033A89A:FG=1; BAIDUID_BFESS=8BA77C9C54F42CDB9CCFC4413033A89A:FG=1; ZFY=MCepCr8PT4:B4:Ab6HFt:AkaBEcXXM6sCgCmHeAwBnng14:C; BD_UPN=12314753; __bid_n=1860bbb4f2d79ba10a4207; FPTOKEN=Zsxtq0sfozj7N9DPeltsqGyLbmng9KX2Wh4SS8Ri8GkLf7+G6a8X9ywfoLoOsKPon4vQsarz6ykmf9KJLBAvmqOqayu0t2Pqq9T7As4J0lz+xCvPTa7LTgoqHFALqG7hyZTlN3OWFrW6uvV2/K/G8S3Q87DuGELT76BRim5Md6BLZMLdxPC9cTgD+VPJjjIsX9gBDchFQt4PvEka51OkI11V9nJNQJ8xrLb2RlwBCsHeCvBALH5PcQkxP1FGoCrrnvXGn+/9yl37w3qxITgpBONgb9Q52JJiRo+jyZkFAvPjN59W0NZafbKiqxtNse6Do8Rss/Y0yW1xQhc1Ch46ekdOTwgi3TwpOQ5YGNbZs3bT+cnORSK7PhGMOeEe/Y2O4K3BuRYdg4fCT0RroT+nAA==|WpWw/VjZYjWowTIWSeCy2tTpDzPPTvCl0BwTxQvLgRg=|10|2c0e96057a00c5f1999e80b011fd9b1b; ab_sr=1.0.1_N2I2ZTcwNDNkMGRhZGY0MDA0NzI4N2YyZjA0ZTczZTE1ZmI5MDNhYmM3YjdmMTI2NGRlNTU0MzBhZDZlOWI1YTAyYjI5ZTI1YjExZGMzMDVjZjUyZGRjZGVlYTlmNGE1ODRmZmIyMjJmNWJkZGE5MDczODRjMzU1ZTZmNTFhZjdmNzgxNjcyODFlMjRhNDMxOWRhZjc3ZGUxMGRiMjU1NjU5ZDU1NWNhYTdlOTgxYWNhYTBlYjdjZjIzMmM2OTdm; BD_HOME=1; H_PS_PSSID=36557_38092_38125_37907_37989_37796_37938_26350_22157_38008_37881; BA_HECTOR=8k8h208h2k248l04258k0get1hu1hu61l; BD_CK_SAM=1; PSINO=1; delPer=0; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; COOKIE_SESSION=1389282_5_3_1_16_3_1_1_1_2_0_3_1389364_0_0_0_1580563062_1579152614_1588672134%7C6%230_1_1588672134%7C1; H_PS_645EC=d5dbEZPK241p%2F4CyVbEeYI9FZzCkLwe7lduCfeb2v62BkHhqbKLwy9SijfY; baikeVisitId=814286fd-3ef3-40d6-a399-27c512eb8181; B64_BOT=1',
    'Host': 'www.baidu.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 Edg/109.0.1518.78'}
data = {
    'wd':'石家庄'
}
response = requests.get(url=url,params=data,headers=headers)
response.encoding='utf-8'
content = response.text
print(content)

  • POST requests with requests

We'll again use the Baidu Translate case to practice the POST request. First capture the interface and define the url and headers, then define the parameters, and finally make the POST request: response = requests.post(). Ctrl + left-click on post shows its four parameters (url, data, json and kwargs): data carries the parameters just like params does for the GET request, json is not needed for now, and headers (passed through kwargs) is the request header.
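
A short aside on the difference between data and json in requests.post: data sends the dictionary as a form-encoded body, which is what this interface expects, while json would serialize it into a JSON body. A hedged sketch using httpbin.org (not part of the original example) as a neutral echo service makes the difference visible:

import requests

payload = {'kw': 'continue'}

# Sent as application/x-www-form-urlencoded; httpbin echoes it back under 'form'
r_form = requests.post('https://httpbin.org/post', data=payload)
print(r_form.json()['form'])

# Sent as application/json; httpbin echoes it back under 'json'
r_json = requests.post('https://httpbin.org/post', json=payload)
print(r_json.json()['json'])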

response = requests.post(url=url,data=data,headers=headers)
content = response.text

After running it, the response was {"errno":998,"errmsg":"\u672a\u77e5\u9519\u8bef","from":"en","to":"zh","error":998}. The result we get is a JSON string, so it needs to be converted from a JSON string into a JSON object.

obj = json.loads(content,encoding='utf-8')

This raised an error: JSONDecoder.__init__() got an unexpected keyword argument 'encoding'. Solving this first: after looking it up, the cause is that in Python 3.9 the loads() method in json/__init__.py no longer has an encoding parameter, so passing encoding raises this error. The fix is simply to remove encoding, i.e. obj = json.loads(content).
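
As a side note, requests can handle this step itself: the Response object has a json() method that parses the body directly, so the explicit json.loads call can be skipped.

# Equivalent to json.loads(response.text)
obj = response.json()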

Then, to solve the 998 error above: after some searching, the cause is anti-scraping. The Cookie and the sign in the request body change with the input string (the text being translated), which is how the site judges whether a real person is operating. What we need to do is replace 'kw': 'continue' with the full Form Data from the payload.

The complete code is as follows:

import json
import requests

url = 'https://fanyi.baidu.com/v2transapi?from=en&to=zh'
headers = {
    'Accept': '*/*',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
    'Acs-Token': '1675670723149_1675751957167_go9xXe6XmzUjvh7gswClpabqnrDJlOMPm2Hf5fDg+8w8n9M3JpJNq0T/IHl68clPetNe04AZqpZvuk5X8a7TZ4it34oxPc2d7Tr7/YyJl1iYnyFvRy4VhRDqwoKePnh6fc3+SPUABt60xPm61uHJ+b6IMTX6v+dP5jIjKRskAatDrebd9cnRyBgqXyRaqdYRUAfpwk3AuGUf6Vr8MwafAeevkbluOuuXeeF3Ls9Rw+33LfYJtl89YEw9NTdKDN9SKsZzBEJRzIrq1BBZrxtIPxkJD93yCG9sQEpS1qbg2GFsC8Pp25+4epRsjw2l35INW36L9EeV/7zG72sE0zOEDw==',
    'Cookie': 'BIDUPSID=A9033E673693226562565BC0C5D5CB90; PSTM=1579152228; __yjs_duid=1_c3eab7486e61dc9f7b99b255ee9201311626010676599; BDUSS=VdEamY4VThWWnU5bmV5a2prWkhQNGVJQklOUH5mU3pncFUwV2N-Nlh5RnZ4YWxpRVFBQUFBJCQAAAAAAQAAAAEAAACRKBoQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG84gmJvOIJiV; BDUSS_BFESS=VdEamY4VThWWnU5bmV5a2prWkhQNGVJQklOUH5mU3pncFUwV2N-Nlh5RnZ4YWxpRVFBQUFBJCQAAAAAAQAAAAEAAACRKBoQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG84gmJvOIJiV; BAIDUID=8BA77C9C54F42CDB9CCFC4413033A89A:FG=1; BAIDUID_BFESS=8BA77C9C54F42CDB9CCFC4413033A89A:FG=1; ZFY=MCepCr8PT4:B4:Ab6HFt:AkaBEcXXM6sCgCmHeAwBnng14:C; APPGUIDE_10_0_2=1; REALTIME_TRANS_SWITCH=1; FANYI_WORD_SWITCH=1; HISTORY_SWITCH=1; SOUND_SPD_SWITCH=1; SOUND_PREFER_SWITCH=1; __bid_n=1860bbb4f2d79ba10a4207; H_PS_PSSID=36557_38092_38125_37907_37989_37796_37938_26350_22157_38008_37881; BA_HECTOR=8k8h208h2k248l04258k0get1hu1hu61l; PSINO=1; delPer=0; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; BCLID=5735152893600916403; BCLID_BFESS=5735152893600916403; BDSFRCVID=7WLOJeC62uSZjm5jIEjfeEAIHIWe6T3TH6_npn1jHiaOQXYjeqf5EG0PWx8g0K4bxzdxogKKBmOTHn_F_2uxOjjg8UtVJeC6EG0Ptf8g0f5; BDSFRCVID_BFESS=7WLOJeC62uSZjm5jIEjfeEAIHIWe6T3TH6_npn1jHiaOQXYjeqf5EG0PWx8g0K4bxzdxogKKBmOTHn_F_2uxOjjg8UtVJeC6EG0Ptf8g0f5; H_BDCLCKID_SF=tbu8oK-aJC83fP36qRQVM4F_hxv0etJyaR0f-hRvWJ5TMCoG0p68hjLqytoBKtQ40m57ax3HJf-5ShPCBPT2h6_qyMrKWhovyCQ-QRLa3l02VM39e-t2yT0V2h6bhPRMW23GWl7mWnrrsxA45J7cM4IseboJLfT-0bc4KKJxbnLWeIJIjj6jK4JKjatjtjvP; H_BDCLCKID_SF_BFESS=tbu8oK-aJC83fP36qRQVM4F_hxv0etJyaR0f-hRvWJ5TMCoG0p68hjLqytoBKtQ40m57ax3HJf-5ShPCBPT2h6_qyMrKWhovyCQ-QRLa3l02VM39e-t2yT0V2h6bhPRMW23GWl7mWnrrsxA45J7cM4IseboJLfT-0bc4KKJxbnLWeIJIjj6jK4JKjatjtjvP; FPTOKEN=GDoGKHi/NRTQQPjT+wa5cazceQBscXDkQhxVv1EkGr0yirJs+iEBwmnnOxxogtM9Wd2PXknUBD16ObKaOvw1J8HMDtdXHsYa/DtCFRBFXV/6Tz/6/X+kKeQkryHgeJOJ03FjTD4g8RQXg6Yvq325DsP3JZ/fHKb1npqFgLLqnbXmwRUdZGMCQ3CcisjOpjbHPWhDX8FrzGfzqe02GZ86NezrzW90hCVWWzY7vuVJ/KWQZ2mqkSoeM0U8mTEo2o6wNziiIXrNwhM0OJv5xqOgQjZ0VfM+kWkzm7FQasysSCik2etulNxYfMu0kCtcFQFvNTo19bXZHlG5pqmiKx0C2p5yzYasX0kzER8t6Na0tAdD6+LbQ/lPaVaYDCUqco2MvrZhT3V6reCI3EGbT/OL0Q==|GC7WIeIcuQsSi4yXSZ4kiDEibpS0bFVmyyitOMYe9k4=|10|8c324b95f0e27e1eee6786b4f994cec9; Hm_lvt_64ecd82404c51e03dc91cb9e8c025574=1673681256,1675751810; ab_sr=1.0.1_MTQ0MDA0ZDQxMGRmMWM0Mzg2ODY1MWVkZTAwOWY2YTBkZTE0YTM0YTIzMDdlYTc1M2Q0MTM0YTRlOTE1MTRkNWE3YjRiOGU1YzU5OWVjMzE3ODZjNzhhMDg0NDhjZmFlMDhlYTMwYTY5ZmRhY2JlZWQwOTBjNmQxNzQyMGM0ZDMyZGE3MTdiZWM5NzIzYTk1OWIxMTRhNjc1MGNhMDM1ZjAyYjVmYmI0NDBkZmM3MTNkNjhmYzU3ZmI5OTIzZmE2; Hm_lpvt_64ecd82404c51e03dc91cb9e8c025574=1675757177',
    'Host': 'fanyi.baidu.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 Edg/109.0.1518.78',
}
data = {
    'from': 'en',
    'to': 'zh',
    'query': 'continue',
    'simple_means_flag': '3',
    'sign': '761098.1015355',
    'token': '3613d79ecb5f10d666bf10d01faa0588',
    'domain': 'common'
}
response = requests.post(url=url,data=data,headers=headers)
content = response.text
obj = json.loads(content)
print(obj)
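
To pull the translated text out of obj, a guarded lookup like the one below can be used; note that the key names (trans_result, data, dst) are assumptions about the shape of the v2transapi response and may change on Baidu's side.

# Pretty-print the response to inspect its structure
print(json.dumps(obj, ensure_ascii=False, indent=2))
# The key names here are assumptions about the v2transapi response shape
trans = obj.get('trans_result', {}).get('data', [])
if trans:
    print(trans[0].get('dst'))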
