Scraping All Zhihu Column Articles

Since this wasn't very hard and it was just a practice project, I didn't write many comments; here is a rough outline of the approach.

Generally speaking, scraping a website comes down to a few steps:

1. First browse the target site in a normal browser and look for patterns.
2. Fire off a naive request and see whether you can already get the data you want; sometimes that's all it takes.
3. If not, swap out the User-Agent field in the request headers.
Here I recommend a module for this: fake_useragent
Install: pip install fake_useragent (a plain pip install is enough)
Usage: import the UserAgent class from the module, create an instance, then read its random attribute:
ua = UserAgent()
headers = {'User-Agent': ua.random}
The module is simple and convenient: every read of the random attribute returns a random User-Agent string (see the first sketch after this list).
4. If changing the User-Agent still isn't enough, copy all of the browser's headers into a dict.
5. If that still fails, it's probably a login-state problem: open the browser settings and check which cookie stores the login state for the site.
6. If it still doesn't work, try a proxy IP (see the second sketch after this list).
7. If even that fails, then it's JS encryption.
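
To make step 3 concrete, here is a minimal sketch of sending the naive request from step 2 with a random User-Agent; the target URL is just a placeholder:

import requests
from fake_useragent import UserAgent

ua = UserAgent()
headers = {'User-Agent': ua.random}  # a fresh random User-Agent string on every access
resp = requests.get('https://example.com/some/page', headers=headers)  # placeholder URL
print(resp.status_code)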
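
For step 6, requests accepts a proxies dict; the address below is a placeholder, not a real proxy:

proxies = {
    'http': 'http://127.0.0.1:8888',   # placeholder proxy address
    'https': 'http://127.0.0.1:8888',
}
resp = requests.get('https://example.com/some/page', headers=headers, proxies=proxies, timeout=10)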

Of course, whatever you try, always check what request the browser itself actually sends first, GET or POST; otherwise one wrong step will throw off everything that follows.
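For instance, if the Network panel shows the data coming back from a POST with form data, reproduce it with requests.post rather than requests.get (the URL and payload keys here are placeholders):

payload = {'keyword': 'python', 'page': 1}  # placeholder form fields
resp = requests.post('https://example.com/api/search', data=payload, headers=headers)  # reusing the headers from the sketch above

With that out of the way, here is the full script for the Zhihu columns: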
import requests
from lxml import etree
import MySQLdb

# Connect to the local MySQL database that will store the results
conn = MySQLdb.connect(
    user='root',
    password='123456',
    host='localhost',
    db='spider',
    charset='utf8',
    port=3306
)
cursor = conn.cursor()

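# Zhihu's recommended-columns API; limit/offset control how many columns are returned per request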
url = 'https://zhuanlan.zhihu.com/api/recommendations/columns?limit=200&offset=0'

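# Headers copied straight from the browser (steps 4 and 5 above); the cookie carries the logged-in session, so replace it with your own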
headers = {'cookie': '_zap=f092cdcc-4278-44d3-b29d-da3fc28eaa66; _xsrf=R4iF43mZlYxCoqegJS8rHNJG49iMaO9J; d_c0="AGAZ6gs-pRCPTuJB6zc5XZ1R4LwViDxtKsk=|1578747134"; capsion_ticket="2|1:0|10:1578792144|14:capsion_ticket|44:ZDk4MmQ0MDQyOGExNDRjYThhZWMyZjk3MzJiZGVlOTA=|51c98794abaa756f4960896970098d77c9ce4e071745b3807edb3d9289df8f98"; Hm_lvt_98beee57fd2ef70ccdd5ca52b9740c49=1578747134,1578747151,1578791652,1578792161; Hm_lpvt_98beee57fd2ef70ccdd5ca52b9740c49=1578794724; KLBRSID=cdfcc1d45d024a211bb7144f66bda2cf|1578794725|1578791651',
           'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'
           }

resp = requests.get(url=url, headers=headers).json()
for column in resp['data']:  # every recommended column
    try:
        column_key = column['url'].split('.com/')[1]  # column slug after zhuanlan.zhihu.com/
        articles_url = 'https://zhuanlan.zhihu.com/api/columns/' + column_key + '/articles?include=data%5B*%5D.admin_closed_comment%2Ccomment_count%2Csuggest_edit%2Cis_title_image_full_screen%2Ccan_comment%2Cupvoted_followees%2Ccan_open_tipjar%2Ccan_tip%2Cvoteup_count%2Cvoting%2Ctopics%2Creview_info%2Cauthor.is_following%2Cis_labeled%2Clabel_info&limit=100&offset=0'
        articles = requests.get(url=articles_url, headers=headers).json()  # all articles in this column
        for article in articles['data']:  # title and url of every article in the column
            title = article['title']
            article_url = article['url']
            page = requests.get(url=article_url, headers=headers).text
            html = etree.HTML(page)
            column_name = html.xpath('//a[@class="ColumnLink ColumnPageHeader-TitleColumn"]/text()')[0]
            content = ''.join(html.xpath('//div[@class="RichText ztext Post-RichText"]//text()'))
            sql = 'insert into zhihu_zhuanlan(column_title, article_title, content) values(%s, %s, %s)'
            rows = cursor.execute(sql, [column_name, title, content])
            if rows:
                conn.commit()
                print('saved to database')
    except Exception as e:  # skip this column/article on any error instead of crashing the whole run
        print('error:', e)
        continue
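
The INSERT above assumes a zhihu_zhuanlan table already exists in the spider database. A minimal one-off setup sketch that would create a compatible table (column types and lengths are my own assumptions, not from the original script):

import MySQLdb

setup_conn = MySQLdb.connect(user='root', password='123456', host='localhost',
                             db='spider', charset='utf8', port=3306)
setup_cursor = setup_conn.cursor()
setup_cursor.execute('''
    CREATE TABLE IF NOT EXISTS zhihu_zhuanlan (
        id INT AUTO_INCREMENT PRIMARY KEY,
        column_title VARCHAR(255),
        article_title VARCHAR(255),
        content LONGTEXT
    ) CHARACTER SET utf8
''')
setup_conn.commit()
setup_conn.close()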
