Contents
1. Proxy IPs
2. Regular expressions: re
3. Looping requests to a site through proxy IPs
4. Access control with the selenium tool
Note: the real goodies are at the end, but you won't get them unless you read the whole thing carefully! (hehe)
When a crawler requests resources from a server, a proxy IP is usually unnecessary. But if you need to hit the same server frequently, a proxy IP lets you disguise the crawler's real identity and dodge the server's anti-crawling mechanisms, so that the server cannot block your actual IP address.
The disguise is not limited to the IP address; it extends to everything in the request headers:
Header fields can be added or forged as the situation requires; any field left out falls back to its default value.
You can often reach a resource without filling in or forging any headers at all, but resources that require special privileges (e.g. VIP content) usually demand a Cookie carrying sufficient permissions.
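As a minimal sketch of how these pieces fit together (the proxy address and target URL below are made-up placeholders, not working endpoints), requests takes the forged headers and the proxy as separate keyword arguments:

```python
# Hypothetical proxy, in the same {protocol: "ip:port"} form used later in this article.
proxies = {"http": "113.121.240.114:3128"}

# Forged request headers; any field left out falls back to the default.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://blog.csdn.net/",
}

# Uncomment to actually send a request through the proxy:
# import requests
# response = requests.get("https://blog.csdn.net/", headers=headers, proxies=proxies, timeout=5)
```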
Proxy IPs are usually obtained from proxy-list websites. Here is one:
http://www.kxdaili.com/dailiip.html
With some simple crawling (HTML parsing), you can pull 100 free proxy IPs from this site. Storing each one in a list as a dict of the form {protocol: ip_address} gives you a proxy IP pool.
import requests
from lxml import etree

proxies_lst = []
for i in range(1, 11):
    ip_url = f'http://www.kxdaili.com/dailiip/1/{i}.html'
    # http://www.kxdaili.com/dailiip/1/2.html
    # http://www.kxdaili.com/dailiip/1/3.html
    response = requests.get(ip_url)
    html = etree.HTML(response.text)
    ip_lst = html.xpath('//div[@class="header-container"]/div[2]/div[2]/div/div[2]/table/tbody/tr')
    for ip_info in ip_lst:
        ip = ip_info.xpath('./td[1]/text()')[0]
        port = ip_info.xpath('./td[2]/text()')[0]
        ht = ip_info.xpath('./td[4]/text()')[0]  # protocol column (HTTP/HTTPS)
        proxies_info = {
            ht: ip + ':' + port
        }
        proxies_lst.append(proxies_info)

for i in proxies_lst:
    print(i)
print(len(proxies_lst))
Cookies are hard to forge. If a resource is gated on a Cookie, use your own if you have one; without it you generally cannot get in and need another approach (I'm a crawler novice and have no other method to offer).
Forge the User-Agent and Referer, then pick one at random with the random module; the proxy IP is likewise drawn at random from the pool, so the bigger the pool the better (each individual IP gets reused less often):
import random

user_agent_list = [
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36'
]
referer_list = [
    'http://blog.csdn.net/dala_da/article/details/79401163',
    'http://blog.csdn.net/',
    'https://www.sogou.com/tx?query=%E4%BD%BF%E7%94%A8%E7%88%AC%E8%99%AB%E5%88%B7csdn%E8%AE%BF%E9%97%AE%E9%87%8F&hdq=sogou-site-706608cfdbcc1886-0001&ekv=2&ie=utf8&cid=qb7.zhuye&',
    'https://www.baidu.com/s?tn=98074231_1_hao_pg&word=%E4%BD%BF%E7%94%A8%E7%88%AC%E8%99%AB%E5%88%B7csdn%E8%AE%BF%E9%97%AE%E9%87%8F'
]
user_agent = random.choice(user_agent_list)
referer = random.choice(referer_list)
Python's re module is an important tool for string processing with regular expressions. Regex syntax is quite complex, so this article won't cover it in depth, only the features a crawler commonly needs.
Crawlers very often need to extract a fragment from a string, and especially from a URL.
A URL usually consists of a scheme (https://), a domain (www.baidu.com), a resource path, and parameters. The path and parameters often carry the string field we want, and that is when we use re to split the string and get at the data.
Example 1: https://blog.csdn.net/phoenixFlyzzz
Extract the user ID from the URL in Example 1:
import re
url = "https://blog.csdn.net/phoenixFlyzzz"
user_id = re.split("/", url)[3]
print(user_id)
# phoenixFlyzzz
As you can see, re.split() splits a string and returns the pieces as a list.
Example 2: https://blog.csdn.net/phoenixFlyzzz?type=blog
Extract the user ID from the URL in Example 2:
import re
url = "https://blog.csdn.net/phoenixFlyzzz?type=blog"
user_id = re.split(r"/|\?", url)[3]
print(user_id)
# phoenixFlyzzz
As you can see, re.split() can split on several delimiters at once; here both / and ? are used. The | separates the alternative delimiters, and the backslash escapes ? because it has a special meaning in regular expressions, turning it back into a literal question mark (the raw-string prefix r keeps Python itself from interpreting the backslash).
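The same split can also be written with a character class; this is just an equivalent sketch, and inside [...] the ? is already literal, so no escaping is needed:

```python
import re

url = "https://blog.csdn.net/phoenixFlyzzz?type=blog"
# A character class [/?] matches either delimiter; '?' needs no escape here.
user_id = re.split(r"[/?]", url)[3]
print(user_id)  # phoenixFlyzzz
```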
When using a crawler to visit a site repeatedly, never request too frequently: that drives up the server's resource load. Keep the access rate firmly under control by putting the code to sleep with the time module.
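A common pattern (just a sketch; the one-to-three-second default range here is an arbitrary choice, not a recommendation from any particular site) is to sleep for a random interval between requests so the access rhythm looks less mechanical:

```python
import random
import time

def polite_pause(low=1.0, high=3.0):
    """Sleep for a random number of seconds in [low, high] and return the delay used."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay

# Typical use between two requests:
# fetch(url)
# polite_pause()
```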
(Solemn declaration: all code in this article is for learning purposes only and must not be used for any commercial purpose.)
Here is a crawler that automatically loops through a blog's articles:
import requests
from lxml import etree
import random
import time
import re
import json

user_url = input('Enter the user homepage URL: ')
# Get all of the user's article URLs via the homepage link:
# first derive user_id from user_url with a regular expression
user_id = re.split(r"/|\?", user_url)[3]
json_url = f'https://blog.csdn.net/community/home-api/v1/get-business-list?page=1&size=20&businessType=blog&orderby=&noMore=false&year=&month=&username={user_id}'
# Request the JSON payload
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36',
    'referer': user_url,
    # Paste the full Cookie value copied from your own logged-in browser session
    'cookie': 'YOUR_COOKIE_STRING'
}
json_response = requests.get(json_url, headers=headers)
time.sleep(2)
article_info_lst = []
json_data = json.loads(json_response.text)
article_num = json_data['data']['total']
print(f'article_num={article_num}')
n = article_num // 20 + 1  # 20 articles per page
try:
    for i in range(n):
        json_url = f'https://blog.csdn.net/community/home-api/v1/get-business-list?page={i+1}&size=20&businessType=blog&orderby=&noMore=false&year=&month=&username={user_id}'
        json_response = requests.get(json_url, headers=headers)
        json_data = json.loads(json_response.text)
        article_lst = json_data['data']['list']
        for article in article_lst:
            article_info_lst.append((article['url'], article['title']))
except Exception as e:
    print(e)
# Build the proxy IP pool
proxies_lst = []
for i in range(1, 11):
    ip_url = f'http://www.kxdaili.com/dailiip/1/{i}.html'
    # http://www.kxdaili.com/dailiip/1/2.html
    # http://www.kxdaili.com/dailiip/1/3.html
    response = requests.get(ip_url)
    html = etree.HTML(response.text)
    ip_lst = html.xpath('//div[@class="header-container"]/div[2]/div[2]/div/div[2]/table/tbody/tr')
    for ip_info in ip_lst:
        ip = ip_info.xpath('./td[1]/text()')[0]
        port = ip_info.xpath('./td[2]/text()')[0]
        ht = ip_info.xpath('./td[4]/text()')[0]  # protocol column (HTTP/HTTPS)
        proxies_info = {
            ht: ip + ':' + port
        }
        proxies_lst.append(proxies_info)

for i in proxies_lst:
    print(i)
print(len(proxies_lst))
# Forge the browser identity and the browsing trail
user_agent_list = [
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36'
]
referer_list = [
    'http://blog.csdn.net/dala_da/article/details/79401163',
    'http://blog.csdn.net/',
    'https://www.sogou.com/tx?query=%E4%BD%BF%E7%94%A8%E7%88%AC%E8%99%AB%E5%88%B7csdn%E8%AE%BF%E9%97%AE%E9%87%8F&hdq=sogou-site-706608cfdbcc1886-0001&ekv=2&ie=utf8&cid=qb7.zhuye&',
    'https://www.baidu.com/s?tn=98074231_1_hao_pg&word=%E4%BD%BF%E7%94%A8%E7%88%AC%E8%99%AB%E5%88%B7csdn%E8%AE%BF%E9%97%AE%E9%87%8F'
]
test_num = 1
while True:
    print(f'Round {test_num}')
    test_num += 1
    for article in article_info_lst:
        url = article[0]
        headers = {
            'Referer': random.choice(referer_list),
            'User-Agent': random.choice(user_agent_list)
        }
        pos = random.randint(0, len(proxies_lst) - 1)
        proxies = proxies_lst[pos]
        try:
            response = requests.get(url, headers=headers, proxies=proxies)
            html = etree.HTML(response.text)
            read_num = html.xpath('//*[@id="mainBox"]/main/div/div/div/div[2]/div/div/span[@class="read-count"]/text()')[0]
        except Exception:
            continue  # dead proxy or unexpected page: move on to the next article
        else:
            print(f'status code: {response.status_code}, ', end='')
            if response.status_code == 200:
                print(f'{url} visited successfully, current read count: {read_num}, current proxy: {proxies}')
            else:
                print(f'{url} visit failed')
        time.sleep(1)
    time.sleep(10)
The selenium tool is a website automation-testing tool that is also often used for crawling. It is much slower than requests, though, so whenever requests can fetch a resource directly, there is no need for selenium.
In many crawlers selenium plays only a supporting role: by driving a visible browser it lets the programmer watch what the crawler does, which makes the code easier to write and tune.
With selenium and requests you can easily obtain the front-end code, and selenium's control of clicks lets you change the browser's location to visit resources one after another (e.g. paging through results).
Once the resource is in hand, what remains is data processing: parse the HTML or JSON, extract the data you want, and process it further.
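For the JSON side, extraction is plain nested key access. A minimal sketch with an invented payload whose field names mirror the CSDN article-list API used below (the data itself is made up):

```python
import json

raw = ('{"data": {"total": 2, "list": ['
       '{"url": "https://blog.csdn.net/a/1", "title": "post one"}, '
       '{"url": "https://blog.csdn.net/a/2", "title": "post two"}]}}')
payload = json.loads(raw)
# Pull out (url, title) pairs, exactly as the crawler below does.
articles = [(a["url"], a["title"]) for a in payload["data"]["list"]]
print(payload["data"]["total"], articles[0][1])  # 2 post one
```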
Here is a crawler that logs in automatically and batch "triple-likes" (follow, like, comment) a blogger's articles:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from lxml import etree
import random
import time
import re
import json
import requests
# Configure a headless browser
opt = Options()
opt.add_argument("--headless")
opt.add_argument("--disable-gpu")
# Launch the browser; headless mode is optional
driver = webdriver.Chrome(options=opt)
# driver = webdriver.Chrome()

# Log in
url = "https://passport.csdn.net/login"
driver.get(url)
time.sleep(2)
driver.find_element(By.XPATH, "/html/body/div[2]/div/div[2]/div[2]/div[1]/div/div[1]/span[4]").click()
time.sleep(2)
# Enter your own account and password
id_number = input('Enter your CSDN account: ')
password = input('Enter your CSDN password: ')
driver.find_element(By.XPATH, '/html/body/div[2]/div/div[2]/div[2]/div[1]/div/div[2]/div/div[1]/div/input').send_keys(id_number)
driver.find_element(By.XPATH, '/html/body/div[2]/div/div[2]/div[2]/div[1]/div/div[2]/div/div[2]/div/input').send_keys(password)
time.sleep(2)
driver.find_element(By.XPATH, '/html/body/div[2]/div/div[2]/div[2]/div[1]/div/div[2]/div/div[4]/button').click()
time.sleep(2)

# Target user's homepage
user_url = input('Enter the target blogger\'s homepage URL: ')
driver.get(user_url)
time.sleep(2)
# Derive user_id from user_url with a regular expression
user_id = re.split(r"/|\?", user_url)[3]
json_url = f'https://blog.csdn.net/community/home-api/v1/get-business-list?page=1&size=20&businessType=blog&orderby=&noMore=false&year=&month=&username={user_id}'
# Follow the blogger ('关注' is the literal link text on the CSDN page, so it stays in Chinese)
try:
    driver.find_element(By.LINK_TEXT, '关注').click()
    print(f'Now following {user_id}')
    time.sleep(2)
except Exception:
    print(f'User {user_id} is already followed')
# Request the JSON payload
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36',
    'referer': user_url,
    # Paste the full Cookie value copied from your own logged-in browser session
    'cookie': 'YOUR_COOKIE_STRING'
}
json_response = requests.get(json_url, headers=headers)
time.sleep(2)
article_info_lst = []
json_data = json.loads(json_response.text)
article_num = json_data['data']['total']
print(f'article_num={article_num}')
n = article_num // 20 + 1  # 20 articles per page
try:
    for i in range(n):
        json_url = f'https://blog.csdn.net/community/home-api/v1/get-business-list?page={i+1}&size=20&businessType=blog&orderby=&noMore=false&year=&month=&username={user_id}'
        json_response = requests.get(json_url, headers=headers)
        json_data = json.loads(json_response.text)
        article_lst = json_data['data']['list']
        for article in article_lst:
            article_info_lst.append((article['url'], article['title']))
except Exception as e:
    print(e)
article_num = 0
# The daily comment limit is 10
for article_info in article_info_lst:
    article_num += 1
    driver.get(article_info[0])
    time.sleep(3)
    # Scroll the page down
    js = 'window.scrollTo(0, 1000)'
    driver.execute_script(js)
    time.sleep(1)
    # Like the article. If it is already liked, skip it -- a prior like also
    # means a prior comment, so the whole article can be skipped.
    html_data = etree.HTML(driver.page_source)
    flag = html_data.xpath('/html/body/div[3]/div/main/div[2]/div/div[2]/ul/li[1]/a/img[3]/@style')[0]
    if flag == 'display:none':
        print(f'Article {article_num}: {article_info[1]} is already liked')
        continue
    else:
        driver.find_element(By.XPATH, '/html/body/div[3]/div/main/div[2]/div/div[2]/ul/li[1]').click()
    # Comment (canned Chinese praise comments, posted as-is to the Chinese site)
    content_lst = [
        '博主讲解得太详细了,通俗易懂,优质好文,必须三连支持!!!',
        '感谢博主细致的讲解,让我豁然开朗,非常感谢, 三连支持一波!!!',
        '非常优秀的博文,感谢博主!!!三连奉上!!!',
        '复习打卡冲冲冲,一起加油呀!!!感谢博主的细致讲解',
        '正在学习这方面的知识,这篇博文对我的帮助很大,非常感谢!'
    ]
    # On your own articles there is no tip button, so the comment tab is the 4th
    # item in the toolbar; on other people's articles it is the 5th.
    # driver.find_element(By.XPATH, '/html/body/div[3]/div/main/div[2]/div/div[2]/ul/li[4]').click()
    driver.find_element(By.XPATH, '/html/body/div[3]/div/main/div[2]/div/div[2]/ul/li[5]').click()
    time.sleep(1)
    driver.find_element(By.XPATH, '//*[@id="comment_content"]').send_keys(random.choice(content_lst))
    time.sleep(1)
    driver.find_element(By.XPATH, '//*[@id="commentform"]/div[2]/div[3]/div[4]/a/input').click()
    time.sleep(2)
    print(f'Article {article_num}: {article_info[1]}, triple-like done')