Scraping Romance of the Three Kingdoms (三国演义) Again with Python requests

Once again, we turn the magical requests library loose on Romance of the Three Kingdoms.


First, open the novel's table-of-contents page in Chrome and press F12 to inspect it. The chapter links turn out to be very regular: every one of them can be reached with the following CSS selector.

'#middlediv > #mulu > ul > li > a'

indexUrl = "http://www.shicimingju.com/book/sanguoyanyi.html"
base_url = 'http://www.shicimingju.com'
r = requests.get(indexUrl, proxies=proxies)  # the proxies dict is defined in the full script below
soup = BS(r.text, "lxml")
book_lists = soup.select('#middlediv > #mulu > ul > li > a')
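As a quick sanity check (a throwaway snippet, not part of the original script), the first few matches can be printed to confirm the selector works:

# print the first three chapter links found by the selector
for a in book_lists[:3]:
    print a.get('href'), a.get_text()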

With the link to each chapter in hand, the actual scraping is straightforward.
First, each chapter title will serve as that chapter's file name.


Its CSS selector:

'#alldiv > #main > #chaptercontent > #con > h2'
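For illustration, a minimal fragment, assuming soup already holds a parsed chapter page as in the full script below:

# pull the <h2> chapter heading and build a file name from it
title_lists = soup.select('#alldiv > #main > #chaptercontent > #con > h2')
file_name = title_lists[0].get_text().strip() + ".txt"  # strip() guards against stray whitespace in the heading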

Then scrape the chapter body:


Its CSS selector:

'#alldiv > #main > #chaptercontent > #con > #con2 > p'
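The paragraphs can then be collected in one pass; a compact alternative to the accumulation loop used in the full script:

# join every <p> under #con2 into one string, one paragraph per line
content_lists = soup.select('#alldiv > #main > #chaptercontent > #con > #con2 > p')
content_of_novel = "\n".join(p.get_text() for p in content_lists)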

One thing to watch out for: when handling Chinese pages, encoding problems appeared, sure enough. Some chapters come back reported as ISO-8859-1 and others as utf-8, so the two cases must be handled separately:

if r.encoding == "ISO-8859-1":
    # requests guessed ISO-8859-1 (no charset in the response headers), but the
    # bytes are really UTF-8: re-encode to recover the raw bytes, then decode properly
    soup = BS(r.text.encode('ISO-8859-1', 'ignore').decode('utf-8'), "lxml")
else:
    soup = BS(r.text, "lxml")
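Why this works: when a response carries no charset header, requests falls back to ISO-8859-1, so the server's UTF-8 bytes get mis-decoded into mojibake. Because ISO-8859-1 maps every byte to a code point, encoding the text back recovers the original bytes, which can then be decoded as UTF-8. A self-contained demonstration with made-up data:

# -*- coding: utf-8 -*-
raw = u'三国演义'.encode('utf-8')      # the UTF-8 bytes the server actually sends
mojibake = raw.decode('ISO-8859-1')    # the garbled text requests returns when it guesses wrong
fixed = mojibake.encode('ISO-8859-1').decode('utf-8')
print fixed == u'三国演义'              # True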

The complete code:

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import requests
import sys
import os
from bs4 import BeautifulSoup as BS

# Python 2 hack: force the default encoding to UTF-8 so implicit
# unicode <-> str conversions (e.g. when writing files) don't raise
reload(sys)
sys.setdefaultencoding("utf-8")

sub_folder = os.path.join(os.getcwd(), "sanguoyanyi")
if not os.path.exists(sub_folder):
    os.mkdir(sub_folder)

# placeholder proxy settings; drop the proxies argument below if you connect directly
proxies = {
  "http": "http://yourproxy.com:8080/",
  "https": "https://yourproxy.com:8080/",
}

indexUrl="http://www.shicimingju.com/book/sanguoyanyi.html"
base_url = 'http://www.shicimingju.com'
r = requests.get(indexUrl, proxies=proxies)
soup = BS(r.text, "lxml")
book_lists = soup.select('#middlediv > #mulu > ul > li > a')

for book in book_lists:
    real_url = base_url + book.get('href')
    print real_url

    r = requests.get(real_url, proxies=proxies)
    print r.encoding
    try:
        if r.encoding == "ISO-8859-1":
            # mis-detected encoding: recover the raw bytes and decode as UTF-8 (see above)
            soup = BS(r.text.encode('ISO-8859-1', 'ignore').decode('utf-8'), "lxml")
        else:
            soup = BS(r.text, "lxml")
        title_lists = soup.select('#alldiv > #main > #chaptercontent > #con > h2')
        # print title_lists[0].get_text()

        file_name = title_lists[0].get_text() + ".txt"
        print file_name

        filename = os.path.join(sub_folder, file_name)
        print filename

        content_lists = soup.select('#alldiv > #main > #chaptercontent > #con > #con2 > p')
        content_of_novel = ""

        for content in content_lists:
            content_of_novel += content.get_text()
            # print content.get_text()

        # the with statement closes the file automatically; no explicit close() needed
        with open(filename, "wb") as f:
            f.write(content_of_novel)
    except UnicodeDecodeError:
        print "you need re check the encoding:" + real_url
        continue
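After a successful run, each chapter lands in the sanguoyanyi folder as its own .txt file. A quick count (a throwaway check, not part of the script; the standard edition of the novel has 120 chapters):

import os
print len(os.listdir("sanguoyanyi"))  # expect 120 files, one per chapter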
