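Crawling Dynamic Web Pages
Before crawling dynamic pages, we need to understand AJAX (Asynchronous JavaScript and XML), a technique for loading and updating content asynchronously.
Its value is that a page can update one part of itself by exchanging a small amount of data with the server in the background.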
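1. A dynamic crawling example
Compared with a traditional page, an AJAX page does not need to be reloaded in full, which makes web applications smaller, faster, and friendlier. Crawling, however, becomes much more troublesome, because the data is no longer in the initial HTML.
We can crawl pages loaded dynamically through AJAX in two ways:
(1) Use the browser's Inspect tool to find the real data address
(2) Use Selenium to simulate a browser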
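2. Crawling by resolving the real address
Take the comments on a Zhihu answer as an example:
https://www.zhihu.com/question/22913650
Open the Inspect tool in Chrome, click the "Network" tab, and then click the XHR filter to find the real address of the data.
Look for files ending in formats such as js or json. The file highlighted in red in the screenshot above is the real comments file. Click "Preview" to view its data.
The code is as follows: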
# coding: UTF-8
import requests

# Real address of the comment data, found in the Network tab
url = "https://www.zhihu.com/api/v4/answers/270916221/comments?include=data%5B*%5D.author%2Ccollapsed%2Creply_to_author%2Cdisliked%2Ccontent%2Cvoting%2Cvote_count%2Cis_parent_author%2Cis_author%2Calgorithm_right&order=normal&limit=20&offset=0&status=open"
# Send a browser-like User-Agent with the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
r = requests.get(url, headers=headers)
print(r.text)
Running the code above produces the following result:
{"featured_counts":16,"common_counts":1125,"collapsed_counts":2,"reviewing_counts":0,"paging":{"is_end":false,"is_start":true,"next":"https://www.zhihu.com/api/v4/answers/270916221/comments?include=data%5B%2A%5D.author%2Ccollapsed%2Creply_to_author%2Cdisliked%2Ccontent%2Cvoting%2Cvote_count%2Cis_parent_author%2Cis_author%2Calgorithm_right\u0026limit=20\u0026offset=20\u0026order=normal\u0026status=open","previous":"https://www.zhihu.com/api/v4/answers/270916221/comments?include=data%5B%2A%5D.author%2Ccollapsed%2Creply_to_author%2Cdisliked%2Ccontent%2Cvoting%2Cvote_count%2Cis_parent_author%2Cis_author%2Calgorithm_right\u0026limit=20\u0026offset=0\u0026order=normal\u0026status=open","totals":1123},"data":[{"id":367943240,"type":"comment","url":"https://www.zhihu.com/api/v4/comments/367943240","content":"\u003cp\u003e越长大越觉得,生命的寄托,最终都会落脚到爱与责任\u003c/p\u003e","featured":true,"collapsed":false,"is_author":false,"is_delete":false,"created_time":1512421298,"resource_type":"answer","reviewing":false,"allow_like":true,"allow_delete":false,"allow_reply":true,"allow_vote":true,"can_recommend":false,"can_collapse":false,"author":{"role":"normal","member":{"id":"79ed1b3de8d0df2a309cdabe895198a9","url_token":"qi-yu-50-87","name":"迟嘉澍","avatar_url":"https://pic4.zhimg.com/v2-ee7c28e69de49593eae38e7c973d8d7b_r.jpg","avatar_url_template":"https://pic4.zhimg.com/v2-ee7c28e69de49593eae38e7c973d8d7b_{size}.jpg","is_org":false,"type":"people","url":"https://www.zhihu.com/api/v4/people/79ed1b3de8d0df2a309cdabe895198a9","user_type":"people","headline":"寒鸦赴水,渴马奔泉","badge":[],"gender":1,"is_advertiser":false}},"is_parent_author":false,"vote_count":2388,"voting":false,"disliked":false},{"id":367943256,"type":"comment","url":"https://www.zhihu.com/api/v4/comments/367943256","content":"我真喜欢你那句 奶奶的答案是你 爸妈的答案是你 
你的答案是他们","featured":true,"collapsed":false,"is_author":false,"is_delete":false,"created_time":1512421335,"resource_type":"answer","reviewing":false,"allow_like":true,"allow_delete":false,"allow_reply":true,"allow_vote":true,"can_recommend":false,"can_collapse":false,"author":{"role":"normal","member":{"id":"763c1ed453279d6e3c39124ecedec96a","url_token":"bo-chong-jing","name":"冲静","avatar_url":"https://pic3.zhimg.com/v2-7cdc882e0c764b0b22087663bad8ea8f_r.jpg","avatar_url_template":"https://pic3.zhimg.com/v2-7cdc882e0c764b0b22087663bad8ea8f_{size}.jpg","is_org":false,"type":"people","url":"https://www.zhihu.com/api/v4/people/763c1ed453279d6e3c39124ecedec96a","user_type":"people","headline":"上下求索","badge":[],"gender":0,"is_advertiser":false}},"is_parent_author":false,"vote_count":2673,"voting":false,"disliked":false},{"id":367943603,"type":"comment","url":"https://www.zhihu.com/api/v4/comments/367943603","content":"夜班真累,好想睡觉","featured":true,"collapsed":false,"is_author":false,"is_delete":false,"created_time":1512421934,"resource_type":"answer","reviewing":false,"allow_like":true,"allow_delete":false,"allow_reply":true,"allow_vote":true,"can_recommend":false,"can_collapse":false,"author":{"role":"normal","member":{"id":"11c4f0962c4d1d183757322b5bf45da1","url_token":"xu-hui-peng-43","name":"媚俗","avatar_url":"https://pic4.zhimg.com/v2-0889a738c3cdb7086308b103e12606b8_r.jpg","avatar_url_template":"https://pic4.zhimg.com/v2-0889a738c3cdb7086308b103e12606b8_{size}.jpg","is_org":false,"type":"people","url":"https://www.zhihu.com/api/v4/people/11c4f0962c4d1d183757322b5bf45da1","user_type":"people","headline":"","badge":[],"gender":-1,"is_advertiser":false}},"is_parent_author":false,"vote_count":236,"voting":false,"disliked":false},{"id":367943699,"type":"comment","url":"https://www.zhihu.com/api/v4/comments/367943699","content":"可以说是石总到目前为止最好的答案么?","featured":true,"collapsed":false,"is_author":false,"is_delete":false,"created_time":1512422087,"resource_type":"answer","reviewi
ng":false,"allow_like":true,"allow_delete":false,"allow_reply":true,"allow_vote":true,"can_recommend":false,"can_collapse":false,"author":{"role":"normal","member":{"id":"e20f754c6ab428339ba081105af9114b","url_token":"bu-zhi-dao-21-38-16","name":"不知道","avatar_url":"https://pic4.zhimg.com/da8e974dc_r.jpg","avatar_url_template":"https://pic4.zhimg.com/da8e974dc_{size}.jpg","is_org":false,"type":"people","url":"https://www.zhihu.com/api/v4/people/e20f754c6ab428339ba081105af9114b","user_type":"people","headline":"经济学博士/收入分配/地方政治","badge":[],"gender":1,"is_advertiser":false}},"is_parent_author":false,"vote_count":230,"voting":false,"disliked":false},{"id":367943757,"type":"comment","url":"https://www.zhihu.com/api/v4/comments/367943757","content":"另,石总的父上顾老三应该是长子吧?\u003cbr\u003e那么,问题来了。\u003cbr\u003e为什么是\"顾老三\"?","featured":true,"collapsed":false,"is_author":false,"is_delete":false,"created_time":1512422182,"resource_type":"answer","reviewing":false,"allow_like":true,"allow_delete":false,"allow_reply":true,"allow_vote":true,"can_recommend":false,"can_collapse":false,"author":{"role":"normal","member":{"id":"e20f754c6ab428339ba081105af9114b","url_token":"bu-zhi-dao-21-38-16","name":"不知道","avatar_url":"https://pic4.zhimg.com/da8e974dc_r.jpg","avatar_url_template":"https://pic4.zhimg.com/da8e974dc_{size}.jpg","is_org":false,"type":"people","url":"https://www.zhihu.com/api/v4/people/e20f754c6ab428339ba081105af9114b","user_type":"people","headline":"经济学博士/收入分配/地方政治","badge":[],"gender":1,"is_advertiser":false}},"is_parent_author":false,"vote_count":82,"voting":false,"disliked":false},{"id":367943760,"type":"comment","url":"https://www.zhihu.com/api/v4/comments/367943760","content":"你只看到了光鲜亮丽,却没看见后面付出的艰辛。努力是不会被白费的,如果会,那代表还不够努力。","featured":true,"collapsed":false,"is_author":false,"is_delete":false,"created_time":1512422189,"resource_type":"answer","reviewing":false,"allow_like":true,"allow_delete":false,"allow_reply":true,"allow_vote":true,"can_recommend":false,"can_collapse":
false,"author":{"role":"normal","member":{"id":"5a882a0c4156145978a0472621c06434","url_token":"sha-mo-hu-die-zi","name":"沙沫蝴蝶紫","avatar_url":"https://pic2.zhimg.com/v2-90bb76ecd50f0c70b6f31d06c17b34f2_r.jpg","avatar_url_template":"https://pic2.zhimg.com/v2-90bb76ecd50f0c70b6f31d06c17b34f2_{size}.jpg","is_org":false,"type":"people","url":"https://www.zhihu.com/api/v4/people/5a882a0c4156145978a0472621c06434","user_type":"people","headline":"喜欢喵和汪的煮妇","badge":[],"gender":0,"is_advertiser":false}},"is_parent_author":false,"vote_count":198,"voting":false,"disliked":false},{"id":367943803,"type":"comment","url":"https://www.zhihu.com/api/v4/comments/367943803","content":"他有两个姐姐,顾老三不比顾老大顺耳吗?","featured":true,"collapsed":false,"is_author":false,"is_delete":false,"created_time":1512422259,"resource_type":"answer","reviewing":false,"allow_like":true,"allow_delete":false,"allow_reply":true,"allow_vote":true,"can_recommend":false,"can_collapse":false,"author":{"role":"author","member":{"id":"6e8e5b1439390b02c781b210bd8e4769","url_token":"hubertuswi","name":"顾宇","avatar_url":"https://pic2.zhimg.com/v2-60245ae62a8f6a549ecd30bb546b056e_r.jpg","avatar_url_template":"https://pic2.zhimg.com/v2-60245ae62a8f6a549ecd30bb546b056e_{size}.jpg","is_org":false,"type":"people","url":"https://www.zhihu.com/api/v4/people/6e8e5b1439390b02c781b210bd8e4769","user_type":"people","headline":"做一个青史留名的好老公","badge":[],"gender":1,"is_advertiser":false}},"is_parent_author":false,"vote_count":76,"reply_to_author":{"role":"normal","member":{"id":"e20f754c6ab428339ba081105af9114b","url_token":"bu-zhi-dao-21-38-16","name":"不知道","avatar_url":"https://pic4.zhimg.com/da8e974dc_r.jpg","avatar_url_template":"https://pic4.zhimg.com/da8e974dc_{size}.jpg","is_org":false,"type":"people","url":"https://www.zhihu.com/api/v4/people/e20f754c6ab428339ba081105af9114b","user_type":"people","headline":"经济学博士/收入分配/地方政治","badge":[],"gender":1,"is_advertiser":false}},"voting":false,"disliked":false},{"id":367943831,"type":"comm
ent","url":"https://www.zhihu.com/api/v4/comments/367943831","content":"再另,好像可以写一部《胡建人在纽约》。","featured":true,"collapsed":false,"is_author":false,"is_delete":false,"created_time":1512422303,"resource_type":"answer","reviewing":false,"allow_like":true,"allow_delete":false,"allow_reply":true,"allow_vote":true,"can_recommend":false,"can_collapse":false,"author":{"role":"normal","member":{"id":"e20f754c6ab428339ba081105af9114b","url_token":"bu-zhi-dao-21-38-16","name":"不知道","avatar_url":"https://pic4.zhimg.com/da8e974dc_r.jpg","avatar_url_template":"https://pic4.zhimg.com/da8e974dc_{size}.jpg","is_org":false,"type":"people","url":"https://www.zhihu.com/api/v4/people/e20f754c6ab428339ba081105af9114b","user_type":"people","headline":"经济学博士/收入分配/地方政治","badge":[],"gender":1,"is_advertiser":false}},"is_parent_author":false,"vote_count":181,"voting":false,"disliked":false},{"id":367943900,"type":"comment","url":"https://www.zhihu.com/api/v4/comments/367943900","content":"哦,原来是总排名…\u003cbr\u003e顾老大比较武侠…","featured":true,"collapsed":false,"is_author":false,"is_delete":false,"created_time":1512422435,"resource_type":"answer","reviewing":false,"allow_like":true,"allow_delete":false,"allow_reply":true,"allow_vote":true,"can_recommend":false,"can_collapse":false,"author":{"role":"normal","member":{"id":"e20f754c6ab428339ba081105af9114b","url_token":"bu-zhi-dao-21-38-16","name":"不知道","avatar_url":"https://pic4.zhimg.com/da8e974dc_r.jpg","avatar_url_template":"https://pic4.zhimg.com/da8e974dc_{size}.jpg","is_org":false,"type":"people","url":"https://www.zhihu.com/api/v4/people/e20f754c6ab428339ba081105af9114b","user_type":"people","headline":"经济学博士/收入分配/地方政治","badge":[],"gender":1,"is_advertiser":false}},"is_parent_author":false,"vote_count":37,"voting":false,"disliked":false},{"id":367944049,"type":"comment","url":"https://www.zhihu.com/api/v4/comments/367944049","content":"终于没有被软糖支配了,莫名的小感动,感动地我点了个赞","featured":true,"collapsed":false,"is_author":false,"is_delete":false,"created_time
":1512422643,"resource_type":"answer","reviewing":false,"allow_like":true,"allow_delete":false,"allow_reply":true,"allow_vote":true,"can_recommend":false,"can_collapse":false,"author":{"role":"normal","member":{"id":"c2ee38fe05867189b151d4ff345678ce","url_token":"pi-pi-89-94","name":"不辣的皮皮","avatar_url":"https://pic4.zhimg.com/v2-d2f76b0fc23218eaa1a0ecf762108253_r.jpg","avatar_url_template":"https://pic4.zhimg.com/v2-d2f76b0fc23218eaa1a0ecf762108253_{size}.jpg","is_org":false,"type":"people","url":"https://www.zhihu.com/api/v4/people/c2ee38fe05867189b151d4ff345678ce","user_type":"people","headline":"上善若水","badge":[],"gender":1,"is_advertiser":false}},"is_parent_author":false,"vote_count":58,"voting":false,"disliked":false},{"id":367944908,"type":"comment","url":"https://www.zhihu.com/api/v4/comments/367944908","content":"审题,转换下题目:努力在人这一生中有什么意义? 哲学问题的解答应该具有普世价值。","featured":true,"collapsed":false,"is_author":false,"is_delete":false,"created_time":1512424019,"resource_type":"answer","reviewing":false,"allow_like":true,"allow_delete":false,"allow_reply":true,"allow_vote":true,"can_recommend":false,"can_collapse":false,"author":{"role":"normal","member":{"id":"f5c5dbdaed10f7fbe4f21829e4723c70","url_token":"liu-wei-61-51-87","name":"leo刘","avatar_url":"https://pic2.zhimg.com/v2-25489bb8c0fc743f4a8dee294c856724_r.jpg","avatar_url_template":"https://pic2.zhimg.com/v2-25489bb8c0fc743f4a8dee294c856724_{size}.jpg","is_org":false,"type":"people","url":"https://www.zhihu.com/api/v4/people/f5c5dbdaed10f7fbe4f21829e4723c70","user_type":"people","headline":"非典型性山东人","badge":[],"gender":1,"is_advertiser":false}},"is_parent_author":false,"vote_count":20,"voting":false,"disliked":false},{"id":367946229,"type":"comment","url":"https://www.zhihu.com/api/v4/comments/367946229","content":"大清早的被感动到了。想到那句话,你之所以可以岁月静好,是有人替你负重前行。","featured":true,"collapsed":false,"is_author":false,"is_delete":false,"created_time":1512425813,"resource_type":"answer","reviewing":false,"allow_like":true,"allow
_delete":false,"allow_reply":true,"allow_vote":true,"can_recommend":false,"can_collapse":false,"author":{"role":"normal","member":{"id":"f20af3d54d11a1b02abea48f7e4e1216","url_token":"guan-jia-qi-18","name":"浅海小七","avatar_url":"https://pic2.zhimg.com/6f90059cefbcc78465045d48dec83f95_r.jpg","avatar_url_template":"https://pic2.zhimg.com/6f90059cefbcc78465045d48dec83f95_{size}.jpg","is_org":false,"type":"people","url":"https://www.zhihu.com/api/v4/people/f20af3d54d11a1b02abea48f7e4e1216","user_type":"people","headline":"心所愿,力必至,无所畏惧","badge":[],"gender":0,"is_advertiser":false}},"is_parent_author":false,"vote_count":1158,"voting":false,"disliked":false},{"id":367946441,"type":"comment","url":"https://www.zhihu.com/api/v4/comments/367946441","content":"\u003cp\u003e只有濒临悬崖的人才知道,那一刻,怒力不努力已经不重要了,拼命挣扎活下去的本能而已。\u003c/p\u003e\u003cp\u003e\u003c/p\u003e","featured":true,"collapsed":false,"is_author":false,"is_delete":false,"created_time":1512426050,"resource_type":"answer","reviewing":false,"allow_like":true,"allow_delete":false,"allow_reply":true,"allow_vote":true,"can_recommend":false,"can_collapse":false,"author":{"role":"normal","member":{"id":"330fa209766f5d906474f6915ab8e943","url_token":"zuo-cun","name":"左村","avatar_url":"https://pic1.zhimg.com/v2-fb068c9fa81ed54c9f1045097f8dc9c2_r.jpg","avatar_url_template":"https://pic1.zhimg.com/v2-fb068c9fa81ed54c9f1045097f8dc9c2_{size}.jpg","is_org":false,"type":"people","url":"https://www.zhihu.com/api/v4/people/330fa209766f5d906474f6915ab8e943","user_type":"people","headline":"","badge":[],"gender":1,"is_advertiser":false}},"is_parent_author":false,"vote_count":63,"voting":false,"disliked":false},{"id":367946564,"type":"comment","url":"https://www.zhihu.com/api/v4/comments/367946564","content":"老师说过,人生总是得有几年的沉淀时间。那段日子,你会很孤独,很落魄,很狼狈,但这段的经历,会影响你未来几十年的人生。与君共勉。","featured":true,"collapsed":false,"is_author":false,"is_delete":false,"created_time":1512426170,"resource_type":"answer","reviewing":false,"allow_like":true,"allow_delete":fa
lse,"allow_reply":true,"allow_vote":true,"can_recommend":false,"can_collapse":false,"author":{"role":"normal","member":{"id":"1b0c262dd523cea5ad0b99ea84713788","url_token":"li-hang-yu-38-16","name":"此间的少年","avatar_url":"https://pic1.zhimg.com/v2-57d6bf8b0ee4663fbf3be427b07654af_r.jpg","avatar_url_template":"https://pic1.zhimg.com/v2-57d6bf8b0ee4663fbf3be427b07654af_{size}.jpg","is_org":false,"type":"people","url":"https://www.zhihu.com/api/v4/people/1b0c262dd523cea5ad0b99ea84713788","user_type":"people","headline":"读过几年私塾,写过几篇故事","badge":[],"gender":1,"is_advertiser":false}},"is_parent_author":false,"vote_count":277,"voting":false,"disliked":false},{"id":367949431,"type":"comment","url":"https://www.zhihu.com/api/v4/comments/367949431","content":"我们中国程序员过的就是美国时间","featured":true,"collapsed":false,"is_author":false,"is_delete":false,"created_time":1512428562,"resource_type":"answer","reviewing":false,"allow_like":true,"allow_delete":false,"allow_reply":true,"allow_vote":true,"can_recommend":false,"can_collapse":false,"author":{"role":"normal","member":{"id":"c2ee38fe05867189b151d4ff345678ce","url_token":"pi-pi-89-94","name":"不辣的皮皮","avatar_url":"https://pic4.zhimg.com/v2-d2f76b0fc23218eaa1a0ecf762108253_r.jpg","avatar_url_template":"https://pic4.zhimg.com/v2-d2f76b0fc23218eaa1a0ecf762108253_{size}.jpg","is_org":false,"type":"people","url":"https://www.zhihu.com/api/v4/people/c2ee38fe05867189b151d4ff345678ce","user_type":"people","headline":"上善若水","badge":[],"gender":1,"is_advertiser":false}},"is_parent_author":false,"vote_count":25,"reply_to_author":{"role":"author","member":{"id":"6e8e5b1439390b02c781b210bd8e4769","url_token":"hubertuswi","name":"顾宇","avatar_url":"https://pic2.zhimg.com/v2-60245ae62a8f6a549ecd30bb546b056e_r.jpg","avatar_url_template":"https://pic2.zhimg.com/v2-60245ae62a8f6a549ecd30bb546b056e_{size}.jpg","is_org":false,"type":"people","url":"https://www.zhihu.com/api/v4/people/6e8e5b1439390b02c781b210bd8e4769","user_type":"people","headline":"做一个青史留
名的好老公","badge":[],"gender":1,"is_advertiser":false}},"voting":false,"disliked":false},{"id":367955868,"type":"comment","url":"https://www.zhihu.com/api/v4/comments/367955868","content":"喜欢你的文字,让人掉泪,但内心软软的,感觉到爱的存在。","featured":true,"collapsed":false,"is_author":false,"is_delete":false,"created_time":1512431357,"resource_type":"answer","reviewing":false,"allow_like":true,"allow_delete":false,"allow_reply":true,"allow_vote":true,"can_recommend":false,"can_collapse":false,"author":{"role":"normal","member":{"id":"fa3ccca406ac6c5cadfa83ce8dd33317","url_token":"li-jin-86-59","name":"小黑猫BC","avatar_url":"https://pic4.zhimg.com/da8e974dc_r.jpg","avatar_url_template":"https://pic4.zhimg.com/da8e974dc_{size}.jpg","is_org":false,"type":"people","url":"https://www.zhihu.com/api/v4/people/fa3ccca406ac6c5cadfa83ce8dd33317","user_type":"people","headline":"教师","badge":[],"gender":0,"is_advertiser":false}},"is_parent_author":false,"vote_count":510,"voting":false,"disliked":false},{"id":367944157,"type":"comment","url":"https://www.zhihu.com/api/v4/comments/367944157","content":"那么多人醒了?","featured":false,"collapsed":false,"is_author":false,"is_delete":false,"created_time":1512422835,"resource_type":"answer","reviewing":false,"allow_like":true,"allow_delete":false,"allow_reply":true,"allow_vote":true,"can_recommend":false,"can_collapse":false,"author":{"role":"normal","member":{"id":"0cd7421483539b4d67c5517f95274e9f","url_token":"li-xian-lan-32","name":"我叫快乐","avatar_url":"https://pic1.zhimg.com/v2-17e9b39ec64137fe1948cf6d3198b371_r.jpg","avatar_url_template":"https://pic1.zhimg.com/v2-17e9b39ec64137fe1948cf6d3198b371_{size}.jpg","is_org":false,"type":"people","url":"https://www.zhihu.com/api/v4/people/0cd7421483539b4d67c5517f95274e9f","user_type":"people","headline":"壁立千仞,无欲则刚","badge":[],"gender":0,"is_advertiser":false}},"is_parent_author":false,"vote_count":12,"voting":false,"disliked":false},{"id":367944682,"type":"comment","url":"https://www.zhihu.com/api/v4/comments/367944682"
,"content":"\u003cp\u003e我说各位,你们过的也是美国的时间啊?\u003c/p\u003e","featured":false,"collapsed":false,"is_author":false,"is_delete":false,"created_time":1512423673,"resource_type":"answer","reviewing":false,"allow_like":true,"allow_delete":false,"allow_reply":true,"allow_vote":true,"can_recommend":false,"can_collapse":false,"author":{"role":"author","member":{"id":"6e8e5b1439390b02c781b210bd8e4769","url_token":"hubertuswi","name":"顾宇","avatar_url":"https://pic2.zhimg.com/v2-60245ae62a8f6a549ecd30bb546b056e_r.jpg","avatar_url_template":"https://pic2.zhimg.com/v2-60245ae62a8f6a549ecd30bb546b056e_{size}.jpg","is_org":false,"type":"people","url":"https://www.zhihu.com/api/v4/people/6e8e5b1439390b02c781b210bd8e4769","user_type":"people","headline":"做一个青史留名的好老公","badge":[],"gender":1,"is_advertiser":false}},"is_parent_author":false,"vote_count":43,"voting":false,"disliked":false},{"id":367944758,"type":"comment","url":"https://www.zhihu.com/api/v4/comments/367944758","content":"谢谢,有种清晰了的感觉","featured":false,"collapsed":false,"is_author":false,"is_delete":false,"created_time":1512423787,"resource_type":"answer","reviewing":false,"allow_like":true,"allow_delete":false,"allow_reply":true,"allow_vote":true,"can_recommend":false,"can_collapse":false,"author":{"role":"normal","member":{"id":"928679f782e23ff45f1a573053178dbc","url_token":"liang-zhi-59-47","name":"惊蛰","avatar_url":"https://pic4.zhimg.com/v2-691d1017e666d9ccde2b983cec082d36_r.jpg","avatar_url_template":"https://pic4.zhimg.com/v2-691d1017e666d9ccde2b983cec082d36_{size}.jpg","is_org":false,"type":"people","url":"https://www.zhihu.com/api/v4/people/928679f782e23ff45f1a573053178dbc","user_type":"people","headline":"好爱广泛到处跑的,学子","badge":[],"gender":-1,"is_advertiser":false}},"is_parent_author":false,"vote_count":5,"voting":false,"disliked":false},{"id":367944952,"type":"comment","url":"https://www.zhihu.com/api/v4/comments/367944952","content":"就为他们前半生都在为我奋斗而奋斗","featured":false,"collapsed":false,"is_author":false,"is_delete":
false,"created_time":1512424071,"resource_type":"answer","reviewing":false,"allow_like":true,"allow_delete":false,"allow_reply":true,"allow_vote":true,"can_recommend":false,"can_collapse":false,"author":{"role":"normal","member":{"id":"e79c3c0deb1d656cc591cdf3a2ed5a9b","url_token":"zhao-hui-ting-42","name":"赵嘁嘁","avatar_url":"https://pic4.zhimg.com/v2-921c8f54adfbeb9e3bddf5ab14deccc0_r.jpg","avatar_url_template":"https://pic4.zhimg.com/v2-921c8f54adfbeb9e3bddf5ab14deccc0_{size}.jpg","is_org":false,"type":"people","url":"https://www.zhihu.com/api/v4/people/e79c3c0deb1d656cc591cdf3a2ed5a9b","user_type":"people","headline":"北方有佳人,绝世而独立。","badge":[],"gender":0,"is_advertiser":false}},"is_parent_author":false,"vote_count":12,"voting":false,"disliked":false}]}
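Note: on a Mac, the terminal may fail to handle UTF-8 output.
Use the locale command to check the system encoding.
If it is not UTF-8, then:
vim ~/.bash_profile
# add this as the last line
export LC_ALL="zh_CN.UTF-8"
source ~/.bash_profile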
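To confirm from inside Python that the change took effect, a small check like the one below can help. It is only a sketch: the helper name describe_encodings is made up for this example, and the values it reports depend on the machine.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python uses for text on this machine."""
    return {
        # Encoding used when print() writes to the terminal
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding derived from the locale settings (LC_ALL, LANG, ...)
        "preferred": locale.getpreferredencoding(False),
        # Encoding used for file names
        "filesystem": sys.getfilesystemencoding(),
    }


if __name__ == "__main__":
    for name, value in describe_encodings().items():
        print(name, "=", value)
```

If "stdout" or "preferred" is not a UTF-8 variant after sourcing the profile, the terminal session probably needs to be restarted.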
Extracting the comments from the JSON data:
The result above is hard to read. To pull the data we want out of this JSON, we parse it with the json library.
# coding: UTF-8
import requests
import json

url = "https://www.zhihu.com/api/v4/answers/270916221/comments?include=data%5B*%5D.author%2Ccollapsed%2Creply_to_author%2Cdisliked%2Ccontent%2Cvoting%2Cvote_count%2Cis_parent_author%2Cis_author%2Calgorithm_right&order=normal&limit=20&offset=0&status=open"
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
r = requests.get(url, headers=headers)
# Parse the response body into a Python dict
json_data = json.loads(r.text)
# The comments are stored in the list under the 'data' key
comments_list = json_data['data']
for eachone in comments_list:
    message = eachone['content']
    print(message)
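The content field in the response still contains HTML tags such as <p> and <br>. The sketch below shows one way to pull out a few fields and strip those tags. The field names (content, vote_count, author.member.name) come from the response shown earlier; the SAMPLE payload, its values, and the helper names clean_html and extract_comments are made up for illustration, and the regular expression is a rough cleanup rather than a full HTML parser.

```python
import json
import re
from html import unescape

# Trimmed, made-up payload with the same shape as the real response
SAMPLE = r'''{"paging": {"is_end": false, "totals": 1123},
 "data": [
  {"content": "\u003cp\u003efirst comment\u003c/p\u003e", "vote_count": 2388,
   "author": {"member": {"name": "user_a"}}},
  {"content": "line one\u003cbr\u003eline two", "vote_count": 82,
   "author": {"member": {"name": "user_b"}}}
 ]}'''

TAG_RE = re.compile(r"<[^>]+>")
BR_RE = re.compile(r"<br\s*/?>")


def clean_html(text):
    """Turn <br> into newlines, drop other tags, decode HTML entities."""
    text = BR_RE.sub("\n", text)
    return unescape(TAG_RE.sub("", text)).strip()


def extract_comments(raw_json):
    """Return one dict per comment with author, votes and plain text."""
    payload = json.loads(raw_json)
    rows = []
    for item in payload.get("data", []):
        member = item.get("author", {}).get("member", {})
        rows.append({
            "author": member.get("name", ""),
            "votes": item.get("vote_count", 0),
            "text": clean_html(item.get("content", "")),
        })
    return rows


if __name__ == "__main__":
    for row in extract_comments(SAMPLE):
        print(row["author"], row["votes"], repr(row["text"]))
```

With a live response, extract_comments(r.text) would take the place of the SAMPLE string.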
The code above only crawls a single page. To crawl more, we need to understand the pattern in the URLs.
Real URL of the first page: https://www.zhihu.com/api/v4/answers/270916221/comments?include=data%5B*%5D.author%2Ccollapsed%2Creply_to_author%2Cdisliked%2Ccontent%2Cvoting%2Cvote_count%2Cis_parent_author%2Cis_author%2Calgorithm_right&order=normal&limit=20&offset=0&status=open
Real URL of the second page: https://www.zhihu.com/api/v4/answers/270916221/comments?include=data%5B*%5D.author%2Ccollapsed%2Creply_to_author%2Cdisliked%2Ccontent%2Cvoting%2Cvote_count%2Cis_parent_author%2Cis_author%2Calgorithm_right&order=normal&limit=20&offset=20&status=open
Comparing the two URLs, two variables stand out: offset and limit. limit is the number of comments per page (20 here) and offset is the index of the first comment on that page, so page n (counting from 0) starts at offset n * 20.
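The relationship between page number, limit, and offset can be sketched as a small URL builder. This is an illustration only: build_comments_url is a made-up helper, and the include value is shortened for readability, so whether the server accepts the shortened form is an assumption.

```python
from urllib.parse import urlencode

BASE = "https://www.zhihu.com/api/v4/answers/{answer_id}/comments"


def build_comments_url(answer_id, page, limit=20):
    """Return the comments URL for a zero-based page number."""
    params = {
        # Shortened for the example; the real request lists more fields
        "include": "data[*].author,content,vote_count",
        "order": "normal",
        "limit": limit,
        # Page 0 starts at comment 0, page 1 at comment 20, and so on
        "offset": page * limit,
        "status": "open",
    }
    return BASE.format(answer_id=answer_id) + "?" + urlencode(params)


if __name__ == "__main__":
    for page in range(3):
        print(build_comments_url(270916221, page))
```

urlencode percent-encodes the brackets, asterisk, and commas, which is the same form the server uses in its own "next" links.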
# coding: utf-8
import requests
import json

def single_page(url):
    # Fetch one page of comments and print the text of each comment
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
    }
    r = requests.get(url, headers=headers)
    json_data = json.loads(r.text)
    comments_list = json_data['data']
    for eachone in comments_list:
        message = eachone['content']
        print(message)

# Crawl the first two pages (page 0 and page 1)
for page in range(0, 2):
    link1 = "https://www.zhihu.com/api/v4/answers/270916221/comments?include=data%5B*%5D.author%2Ccollapsed%2Creply_to_author%2Cdisliked%2Ccontent%2Cvoting%2Cvote_count%2Cis_parent_author%2Cis_author%2Calgorithm_right&order=normal&limit=20&offset="
    link2 = "&status=open"
    # offset = page number * comments per page
    page_str = str(page * 20)
    link = link1 + page_str + link2
    single_page(link)
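Instead of hard-coding the number of pages, the response itself says when to stop: its paging object carries is_end and a next URL, as the output shown earlier illustrates. The sketch below follows those links until the last page. iter_comments and the fetch parameter are made up for this example, and FAKE_PAGES and fake_fetch are stand-ins so the loop can be tried without network access.

```python
import json


def iter_comments(first_url, fetch, max_pages=50):
    """Yield comment dicts page by page until paging.is_end is true.

    fetch is any function that takes a URL and returns the response text.
    max_pages is a safety limit so a bad response cannot loop forever.
    """
    url = first_url
    for _ in range(max_pages):
        payload = json.loads(fetch(url))
        for item in payload.get("data", []):
            yield item
        paging = payload.get("paging", {})
        if paging.get("is_end", True) or not paging.get("next"):
            break
        url = paging["next"]


# Made-up two-page response used only to exercise the loop offline
FAKE_PAGES = {
    "page-0": {"paging": {"is_end": False, "next": "page-1"},
               "data": [{"content": "a"}, {"content": "b"}]},
    "page-1": {"paging": {"is_end": True, "next": "page-2"},
               "data": [{"content": "c"}]},
}


def fake_fetch(url):
    return json.dumps(FAKE_PAGES[url])


if __name__ == "__main__":
    for comment in iter_comments("page-0", fake_fetch):
        print(comment["content"])
```

With requests, fetch could be something like lambda u: requests.get(u, headers=headers).text, ideally with a short time.sleep between pages to avoid hammering the server.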
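3. Crawling with Selenium by simulating a browser
For some complex websites, the method described above no longer works. Some real data URLs are very long and complicated, and some sites encrypt the address to block this kind of crawling, which makes some of the variables hard to work out.
In those cases we use Selenium to drive a real browser: the browser renders the page (HTML, JS, and CSS) and we parse what it displays.
(1) Installing Selenium and a basic introduction
Reference post on installing GeckoDriver, the Firefox browser driver: https://blog.csdn.net/hy_696/article/details/80114065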
# coding:utf-8
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

# Selenium 3 style API (firefox_binary / capabilities arguments)
caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = False
# Windows version; geckodriver must be installed
binary = FirefoxBinary(r'D:\Program Files (x86)\Mozilla Firefox\firefox.exe')
driver = webdriver.Firefox(firefox_binary=binary, capabilities=caps)
driver.get("https://www.baidu.com")
The following uses the Chrome browser as an example:
# coding:utf-8
from selenium import webdriver

# Windows version; a raw string keeps the backslashes in the path intact
driver = webdriver.Chrome(r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe')
driver.get("https://www.baidu.com")
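(2) Selenium in practice
Now we use browser rendering to crawl the same comment data as before.
First, find the HTML tag of a comment in the Inspect panel and try to fetch the first comment.
The code is as follows: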
# coding:utf-8
from selenium import webdriver

driver = webdriver.Chrome(r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe')
driver.get("https://www.zhihu.com/question/22913650")
# Fill in the CSS selector of the comment element found in the Inspect panel
comment = driver.find_element_by_css_selector('')
print(comment.text)
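(3) Fetching all comments on an article with Selenium
To fetch every comment, the script has to click controls such as "load more", "all comments", or "next page" automatically.
The code is as follows: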
# coding: utf-8
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
import time

caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = False
# Change this path to the location of Firefox on your computer
binary = FirefoxBinary(r'D:\Program Files\Mozilla Firefox\firefox.exe')
driver = webdriver.Firefox(firefox_binary=binary, capabilities=caps)
driver.get("http://www.santostang.com/2017/03/02/hello-world/")
# The comments live inside an iframe, so switch into it first
driver.switch_to.frame(driver.find_element_by_css_selector("iframe[title='livere']"))
comments = driver.find_elements_by_css_selector('div.reply-content')
for eachcomment in comments:
    content = eachcomment.find_element_by_tag_name('p')
    print(content.text)
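The automatic clicking described above could look roughly like the following loop, which keeps clicking a "load more" control until it no longer appears. This is a sketch under assumptions: click_until_gone is a made-up helper, the selector passed to it depends entirely on the target page, and the function only relies on the driver having a find_elements(by, value) method, which Selenium drivers provide.

```python
import time

# Same string value as selenium.webdriver.common.by.By.CSS_SELECTOR
CSS = "css selector"


def click_until_gone(driver, selector, max_clicks=20, pause=1.0):
    """Click the first element matching selector until none is left.

    Returns the number of clicks performed. max_clicks is a safety
    limit, and pause gives the page time to load the new content.
    """
    clicks = 0
    while clicks < max_clicks:
        buttons = driver.find_elements(CSS, selector)
        if not buttons:
            break
        buttons[0].click()
        clicks += 1
        time.sleep(pause)
    return clicks
```

With a real driver the call might be click_until_gone(driver, "button.more"), where "button.more" is a hypothetical selector to be replaced with the one found in the Inspect panel. After the loop finishes, the comments can be collected as in the code above.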
Selenium methods for selecting elements:
# text
find_element_by_css_selector('div.body_inner')
#
To find multiple elements, add an s after "element" in the method names above so that it becomes "elements". The first two are the most commonly used.
(4) Advanced Selenium operations
To speed up Selenium crawling, the following settings are commonly used:
(1) Control CSS loading (the value 2 blocks it)
fp = webdriver.FirefoxProfile()
fp.set_preference("permissions.default.stylesheet", 2)
(2) Control image display (the value 2 blocks it)
fp = webdriver.FirefoxProfile()
fp.set_preference("permissions.default.image", 2)
(3) Control whether JavaScript runs
fp = webdriver.FirefoxProfile()
fp.set_preference("javascript.enabled", False)
For the Chrome browser:
options = webdriver.ChromeOptions()
prefs = {
    'profile.default_content_setting_values': {
        'images': 2,
        'javascript': 2
    }
}
options.add_experimental_option('prefs', prefs)
browser = webdriver.Chrome(options=options)
4. Selenium crawler in practice: Shenzhen short-term rental data
(1) Site analysis
(2) Project walkthrough
A demo will be shared later.