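First, you need to know how to use Fiddler. It is easy to get; a quick search turns up plenty of download sources. After installing it, some configuration is needed. I capture WeChat articles from the PC client, and detailed configuration instructions are available here (https://blog.csdn.net/Tester_xjp/article/details/80087014). Once it is configured, open a browser and visit any page to see whether sessions show up in Fiddler; if they do, the configuration works. Now let's crawl a WeChat official account. When there are already many captured sessions, as shown in the figure:
Click the icon shown in the figure and choose Remove all from the drop-down menu to clear every session. Then open the official account you want to crawl and scroll down in its message history so that it loads more entries; new sessions will appear, as shown:
Click the request, then click through the tabs in the order shown in the figure, and you will see the JSON data you want. The upper part holds the URL and the request headers; copy and paste them, and then you can start writing the code: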
import json
import random  # only needed if the optional delay in fetch_text is enabled
import time

import requests
from lxml import etree

# URL copied from the Fiddler capture; the trailing " HTTP/1.1" of the raw request line is not part of the URL
url1 = "https://mp.weixin.qq.com/mp/profile_ext?action=getmsg&__biz=MjM5MjAxNDM4MA==&f=json&offset=1364&count=10&is_ok=1&scene=&uin=MjIzMzAyMTc3Mw%3D%3D&key=89d12b870c1b66b55dda3f5d96949191facfdbe5b85fb04febea6507359e2933e7047e8a492e96459539339c329c204b4ebafb430f7f9abd1140e0f41683cad25e1c63b841858a7210dd801df3e696a3&pass_ticket=i8vG65b0f5w3YbINsxgKoJKE%2BADk1WM8sxZ1LYi22FC3WC5aSatNLYe6YZzz5RdB&wxtoken=&appmsg_token=997_%252FomavAR9WcqYeWKQ_IZYJxtOMPFKYXGaIRpjnQ~~&x5=0&f=json"


def fetch_text(content_url):
    # Request the article page and return its body text with all whitespace removed
    # time.sleep(random.uniform(0.5, 1.0))
    res = requests.get(content_url, verify=False)
    html_page = etree.HTML(res.text)
    nodes = html_page.xpath("//div[@class='rich_media_content ']//text()")
    return ''.join(''.join(nodes).split())


def weixin_spider(url1, author):
    # author is not used in this excerpt
    headers = {
        # 'Host': 'mp.weixin.qq.com',
        # 'Connection': 'keep-alive',
        # 'Accept': '*/*',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 MicroMessenger/6.5.2.501 NetType/WIFI WindowsWechat QBCore/3.43.901.400 QQBrowser/9.0.2524.400',
        # 'X-Requested-With': 'XMLHttpRequest',
        # 'Referer': 'https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MjM5MjAxNDM4MA==&uin=MjIzMzAyMTc3Mw%3D%3D&key=904312b286f32b60a8dbd9f5fe33159b791adcb96ba37270028681196ab81e4b243785c64ccfd243be4e72664b413c323ade80dcffa498ed2758ad33bc9a85d30932503340b7e8279cb519c6593c373a&devicetype=Windows+10&version=6206061c&lang=zh_CN&a8scene=7&pass_ticket=bAiTfmiO75%2BCxpgdwo9%2FyXSXCrRVLFOdZjW7mCeIrsZKo%2B4ol%2BzOUS%2FWUveafELy&winzoom=1',
        # 'Accept-Encoding': 'gzip, deflate',
        # 'Accept-Language': 'zh-CN,zh;q=0.8,en-us;q=0.6,en;q=0.5;q=0.4',
        'Cookie': 'wxuin=2233021773; devicetype=Windows10; version=6206061c; lang=zh_CN; pass_ticket=bAiTfmiO75+Cxpgdwo9/yXSXCrRVLFOdZjW7mCeIrsZKo+4ol+zOUS/WUveafELy; wap_sid2=CM3q5KgIElw1STZwQUp4ZENEVHM5a3hrSmQxRlJOcjRqWnZyWHBiMmRUcGppckhmMjNSZUV6clBrSGxhSTNRbmV0RjR5NTlmbkZZczRaNHNJaVlsdUNIRGVwRlhzT1FEQUFBfjC9tKnjBTgNQJVO',
    }
    result = requests.get(url=url1, headers=headers, verify=False)
    html = json.loads(result.text)
    # general_msg_list is itself a JSON string, so it has to be decoded a second time
    for item in json.loads(html['general_msg_list'])['list']:
        datatime = item['comm_msg_info']['datetime']
        time_ = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(int(datatime)))
        info = item.get('app_msg_ext_info')
        if info is None:
            continue  # skip messages that carry no article, instead of raising KeyError
        title_ = info['title']
        content_url_ = info['content_url']
        eleinums = info.get('multi_app_msg_item_list', [])
        if title_ != "":
            # lead article of this push
            text_ = fetch_text(content_url_)
            print('==>', title_)
        for ele in eleinums:
            # remaining articles in the same push share the push timestamp
            title_ = ele['title']
            content_url_ = ele['content_url']
            print('con', content_url_)
            text_ = fetch_text(content_url_)


if __name__ == '__main__':
    weixin_spider(url1, author='')
Part of the output looks like this:
F:\Anaconda3\python.exe F:/PycharmProjects/jieba_demo/zhejaing_spider.py
F:\Anaconda3\lib\site-packages\urllib3\connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
con http://mp.weixin.qq.com/s?__biz=MzA4ODY3MjkxNA==&mid=2651027517&idx=1&sn=1e6dbe0f056c998d01c8bc1ae93eae8e&chksm=8bd147c0bca6ced6e8ea74f0fd5a104f708b6479041bc0ea4b3779f2a5398f503abd5ecf000b&scene=27#wechat_redirect
F:\Anaconda3\lib\site-packages\urllib3\connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
==> 中共中央关于深化党和国家机构改革的决定
ok
con http://mp.weixin.qq.com/s?__biz=MzA4ODY3MjkxNA==&mid=2651027517&idx=2&sn=e39a2415f9765cdae0b25454bb8ee781&chksm=8bd147c0bca6ced6f75a4b5e41a352e7a704bc777901a0298eb35959cd4221da063c73ce70fc&scene=27#wechat_redirect
F:\Anaconda3\lib\site-packages\urllib3\connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
ok
人大议程定了!大会发言人还回应了这些“刁钻”问题
con http://mp.weixin.qq.com/s?__biz=MzA4ODY3MjkxNA==&mid=2651027510&idx=1&sn=7fa8d12ab7c88fff7943f844d7c964cc&chksm=8bd147cbbca6cedd9828e71fa9d0c894fb4f0ff0fe2458b2a9c2ccaeaa98b722b0c1ad65d069&scene=27#wechat_redirect
F:\Anaconda3\lib\site-packages\urllib3\connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
==> 浙江气温“断崖式”下跌,今明温差高达20℃|明日惊蛰,养生有道~
ok
con http://mp.weixin.qq.com/s?__biz=MzA4ODY3MjkxNA==&mid=2651027510&idx=2&sn=70017b13868521b6fc2b441409f078c9&chksm=8bd147cbbca6ceddd8d6d76c5bd382dc82be44c13e58a891461e6ee779d1c552c12b88f9debf&scene=27#wechat_redirect
F:\Anaconda3\lib\site-packages\urllib3\connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
ok
绝美!浙6条徒步线路,正适合春天走起
con http://mp.weixin.qq.com/s?__biz=MzA4ODY3MjkxNA==&mid=2651027510&idx=3&sn=02c290f5ee45f3746120667d51610000&chksm=8bd147cbbca6cedd6efc5976aad7f97a033c75a0f53f0a91e8553bd5426970d6ce6ca8977a35&scene=27#wechat_redirect
F:\Anaconda3\lib\site-packages\urllib3\connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
ok
文创街区“寻”礼|在杭州的一处“世外桃源”,有条浙江最大的雕塑创意街区
con http://mp.weixin.qq.com/s?__biz=MzA4ODY3MjkxNA==&mid=2651027493&idx=1&sn=744058357786397125d253284d9092bc&chksm=8bd147d8bca6cece981c0293b3ec895be6e2a3aef6471eb763c12464c8846334fc749e51b1f4&scene=27#wechat_redirect
F:\Anaconda3\lib\site-packages\urllib3\connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
==> 全国政协十三届一次会议在京开幕
ok
con http://mp.weixin.qq.com/s?__biz=MzA4ODY3MjkxNA==&mid=2651027493&idx=2&sn=0e7aeddc70ba407068640106dc2cb2cf&chksm=8bd147d8bca6cece7e98f1e7accbe7a0230ebf9f4034eae9cd9e5246475da1a3f31b42aed325&scene=27#wechat_redirect
F:\Anaconda3\lib\site-packages\urllib3\connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
ok
浙江代表团举行全体会议 推选车俊为团长
con http://mp.weixin.qq.com/s?__biz=MzA4ODY3MjkxNA==&mid=2651027463&idx=1&sn=2b7baba7e53f46abbdf037a68707aeda&chksm=8bd147fabca6ceec9b651e32a7dbdde2751c01f44025a1a86d42d3996de6ec3c51c4024e6b32&scene=27#wechat_redirect
F:\Anaconda3\lib\site-packages\urllib3\connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
==> 浙江省委办公厅省政府办公厅印发《关于进一步深化文化市场综合执法改革的实施意见》
ok
con http://mp.weixin.qq.com/s?__biz=MzA4ODY3MjkxNA==&mid=2651027463&idx=2&sn=0ba55c7c5de763db3694a3630ce8cd04&chksm=8bd147fabca6ceec4611dbbd5e0009617164ab436bb72dd4eda3f12102b312f255147bae805d&scene=27#wechat_redirect
F:\Anaconda3\lib\site-packages\urllib3\connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
ok
新“浙八味”中药材培育品种名单出炉,哪些药材入选了?
con http://mp.weixin.qq.com/s?__biz=MzA4ODY3MjkxNA==&mid=2651027463&idx=3&sn=3b0e6db5127297ab5da1751edd23a64b&chksm=8bd147fabca6ceec433f898cb84ffd6e80fbd076fa5b6410178e37464edff2170419bd50f7d1&scene=27#wechat_redirect
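You can try removing some of these request headers, since not all of them are needed. In my tests only Cookie actually mattered, but I kept User-Agent here as well.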
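To make the copy-and-paste step less error-prone, the raw request text from Fiddler can be turned into a headers dict automatically. The following is only a minimal sketch: the function name parse_raw_headers and the sample request text are made up for illustration, and the default of keeping only Cookie and User-Agent simply mirrors what worked in my tests above.

```python
def parse_raw_headers(raw, keep=("Cookie", "User-Agent")):
    """Turn raw 'Name: value' lines into a dict, keeping only selected headers."""
    wanted = {name.lower() for name in keep}  # header names are case-insensitive
    headers = {}
    for line in raw.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without a colon
        name = name.strip()
        if name.lower() in wanted:
            headers[name] = value.strip()
    return headers


# Hypothetical raw request text, shaped like what Fiddler shows in its raw view
raw_request = (
    "GET https://example.com/ HTTP/1.1\n"
    "Host: example.com\n"
    "User-Agent: demo-agent\n"
    "Cookie: a=1; b=2\n"
)
print(parse_raw_headers(raw_request))
```

The request line itself contains a colon, but its "name" part (GET https) is not in the keep list, so it is dropped along with the other headers.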
The JSON we receive contains only each article's URL, so to get the article body I send a second request to that URL.
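The first half of that two-step process, pulling the article URLs out of the JSON, can be tried without any network access. Below is a rough sketch; the function name extract_articles and the trimmed sample payload are hypothetical, and only the field names (general_msg_list, comm_msg_info, app_msg_ext_info, multi_app_msg_item_list) come from the code above.

```python
import json


def extract_articles(response_text):
    """Return (timestamp, title, url) tuples from a getmsg response body."""
    payload = json.loads(response_text)
    # general_msg_list is a JSON-encoded string inside the JSON, so decode it again
    messages = json.loads(payload["general_msg_list"])["list"]
    articles = []
    for item in messages:
        published = item["comm_msg_info"]["datetime"]
        info = item.get("app_msg_ext_info")
        if not info:
            continue  # message without an article
        # the lead article, followed by any other articles in the same push
        for entry in [info] + info.get("multi_app_msg_item_list", []):
            if entry.get("title"):
                articles.append((published, entry["title"], entry["content_url"]))
    return articles


# Hypothetical, heavily trimmed sample shaped like the real response
sample = json.dumps({
    "general_msg_list": json.dumps({
        "list": [{
            "comm_msg_info": {"datetime": 1520236800},
            "app_msg_ext_info": {
                "title": "lead article",
                "content_url": "http://mp.weixin.qq.com/s?idx=1",
                "multi_app_msg_item_list": [
                    {"title": "second article",
                     "content_url": "http://mp.weixin.qq.com/s?idx=2"},
                ],
            },
        }]
    })
})

for published, title, url in extract_articles(sample):
    print(published, title, url)
```

Each returned URL is then what the second request is sent to in order to fetch the article body.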
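When requests fetches a URL, you may run into an error like requests.exceptions.SSLError: HTTPSConnectionPool. Setting the verify parameter to False in the requests call gets around it. The error is a certificate verification failure: the certificate presented for the connection is not trusted by a recognized CA (this can happen, for example, when the traffic passes through Fiddler's HTTPS decryption). Note that verify=False turns certificate checking off entirely, which is why the output above is full of InsecureRequestWarning messages.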
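If those repeated warnings make the output hard to read, they can be filtered out. Here is a small sketch using only the standard library warnings module; the function name is my own, and the filter matches on the beginning of the warning text seen in the output above. urllib3 also provides a disable_warnings helper for the same purpose. Either way, certificate verification stays off; only the message is hidden.

```python
import warnings


def silence_insecure_request_warning():
    """Hide warnings whose text starts with 'Unverified HTTPS request'."""
    # The message argument is a regular expression matched against the start of the text
    warnings.filterwarnings("ignore", message="Unverified HTTPS request")


silence_insecure_request_warning()
```

Call it once before the first requests.get(..., verify=False); other warnings are still shown as usual.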
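I have to say Fiddler is a really nice tool. Another capture tool, Charles, is said to be more powerful than Fiddler. I gave it a try and it seemed decent: it highlights each newly captured request, which is indeed nicer than Fiddler, but it is paid software, which is a letdown. Back to the point: what I want to crawl is the account's entire history, and capturing packets one at a time like this is tedious. Is there a more efficient way?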
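One direction that might be worth trying: the captured URL already carries offset and count parameters, so further pages could perhaps be requested by rewriting offset instead of capturing each one by hand. The sketch below only builds the URLs. It assumes that offset advances in steps of count, which I have not verified, and that the session parameters in the captured URL (key, pass_ticket and so on) remain valid long enough, which may not hold. The helper names and the shortened URL are hypothetical.

```python
import re


def with_offset(url, offset):
    """Return url with its offset query parameter replaced by the given value."""
    return re.sub(r"(?<=[?&])offset=\d+", "offset={}".format(offset), url)


def page_urls(url, start, count, pages):
    """Yield `pages` URLs whose offsets advance by `count`, beginning at `start`."""
    for page in range(pages):
        yield with_offset(url, start + page * count)


# Hypothetical shortened URL; a real one also carries the captured key, pass_ticket, etc.
base = "https://mp.weixin.qq.com/mp/profile_ext?action=getmsg&f=json&offset=0&count=10"
for page_url in page_urls(base, start=0, count=10, pages=3):
    print(page_url)
```

Each generated URL could then be passed to weixin_spider in place of url1, ideally with a short random delay between requests.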