Course Assignment - Crawler Intro 03 - Crawler Basics - WilliamZeng - 20170716

Class Assignment

  • On August 9, following teacher Zeng's explanation in Crawler Intro 04, I made some additions: the code and its execution were changed to first crawl the article links under the "Decoding Big Data" (解密大数据) collection, and then crawl the article pages behind those links
  • Chose the pages of the first two assignments in the "Decoding Big Data" collection, Crawler Intro 01 and Crawler Intro 02, as the pages to crawl
  • Crawled every element on those pages that could be crawled; I chose the main body text of each article, plus the images and text links inside the article body, including their captions and labels
  • Tried crawling with lxml as well
References
  • Beautiful Soup 4.2.0 documentation (Chinese edition)
  • Requests
  • urllib2
  • re
  • lxml reference 1
  • lxml reference 2
  • xpath

Thanks to teacher Zeng for sharing and introducing these tools, which saved us a lot of time. Getting to know them takes focused time spent reading documentation and practicing; I hope the teacher has not overestimated how quickly we can absorb this and how much time we can invest, and will be patient when some of us fall behind.


Code Part 1: beautifulsoup4 implementation
  1. Import the modules
  2. Basic download function: download
  3. Function that crawls the article body from an article page: crawl_page
  4. Function that crawls image info and links inside an article: crawl_article_images
  5. Function that crawls text-link info and URLs inside an article: crawl_article_text_link
  6. Function that crawls the (article) title links on the collection page: crawl_links

All results are written to files named after the article title. The code that grabs the title, creates the file, and writes the crawled content is shared by the last three functions above. Within the limited assignment time, and lacking formal training, I did not refactor these shared statements into a single function. A small amount of redundant code remains that might be useful later, so I left it unchanged.

Import the modules
import os
import time
import urllib2
import urlparse # needed by crawl_links below for urljoin
from bs4 import BeautifulSoup # used to parse the downloaded pages; install with: pip install beautifulsoup4

I have seen few questions or discussions about where to run pip; perhaps most classmates install the needed modules through a Python IDE rather than calling pip or easy_install directly. On Windows I found that pip must be run from the command line, and only from the directory where pip is installed, e.g. D:\Python27\Scripts. I did not have time to look into configuring the PATH environment variable this time. Because the module was already installed another way, running pip install beautifulsoup4 returned Requirement already satisfied: beautifulsoup4 in d:\python27\lib\site-packages, so I am not entirely sure whether the command-line installation itself works correctly.

The download function
def download(url, retry=2):
    """
    Download a page, returning the full page content
    :param url: the URL to download
    :param retry: number of retries
    :return: raw html
    """
    print "downloading: ", url
    # set header info to imitate a browser request
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'
    }
    try: # the request may fail, so catch errors with try-except
        request = urllib2.Request(url, headers=header) # build the request
        html = urllib2.urlopen(request).read() # fetch the url
    except urllib2.URLError as e: # error handling
        print "download error: ", e.reason
        html = None
        if retry > 0: # retries remaining, keep trying
            if hasattr(e, 'code') and 500 <= e.code < 600: # only retry on 5xx server errors
                print e.code
                return download(url, retry - 1)
    time.sleep(1) # wait 1s to avoid stressing the server and getting blocked
    return html

Apart from changing the header, this part copies the teacher's code verbatim. I did not study urllib2 closely, so I will not elaborate on it.
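The same logic can be carried over to Python 3, where urllib2 is split into urllib.request and urllib.error. A minimal sketch under that assumption; the names download3 and should_retry are mine, not from the lesson code:

```python
# A sketch of the same download-with-retry policy in Python 3 syntax.
import time
import urllib.request
import urllib.error

def should_retry(code, retries_left):
    """Retry only on 5xx server errors while retries remain."""
    return retries_left > 0 and code is not None and 500 <= code < 600

def download3(url, retry=2):
    header = {'User-Agent': 'Mozilla/5.0'}
    try:
        request = urllib.request.Request(url, headers=header)
        html = urllib.request.urlopen(request).read()
    except urllib.error.URLError as e:
        print("download error:", e.reason)
        html = None
        # HTTPError carries a status code; plain URLError (e.g. DNS failure) does not
        if should_retry(getattr(e, 'code', None), retry):
            return download3(url, retry - 1)
    time.sleep(1)  # be polite to the server
    return html
```

Splitting the retry decision into a tiny pure function also makes the 5xx-only policy easy to test in isolation.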

The crawl_page function
def crawl_page(crawled_url):
    """
    Crawl the article content
    :param crawled_url: set of page addresses to crawl
    """
    for link in crawled_url: # crawl the articles one address at a time
        html = download(link)
        soup = BeautifulSoup(html, "html.parser")
        title = soup.find('h1', {'class': 'title'}).text # get the article title
        """
        Replace special characters, otherwise creating a file
        named after the article title will fail
        """
        title = title.replace('|', ' ')
        title = title.replace('"', ' ')
        title = title.replace('/', ',')
        title = title.replace('<', ' ')
        title = title.replace('>', ' ')
        title = title.replace('\x08', '')
        # print (title)
        content = soup.find('div', {'class': 'show-content'}).text # get the article content

        if not os.path.exists('spider_output/'): # make sure the output directory exists
            os.mkdir('spider_output/')

        file_name = 'spider_output/' + title + '.txt' # name of the file to save
        if os.path.exists(file_name):
            # os.remove(file_name) # delete the existing file
            continue  # skip files that already exist
        file = open(file_name, 'wb') # write the file
        content = unicode(content).encode('utf-8', errors='ignore')
        file.write(content)
        file.close()

This part is also the teacher's code with minor deletions and changes.
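The chain of title.replace calls above can be collapsed into a single helper built on one regular-expression substitution. A small sketch (the function name sanitize_title is mine); it applies the same mapping as the code above: '/' becomes a comma, the other forbidden characters become spaces, and the backspace character is dropped:

```python
import re

def sanitize_title(title):
    """Make an article title safe to use as a file name,
    using the same character mapping as the replace() chain above."""
    title = title.replace('/', ',')        # '/' -> ','
    title = re.sub(r'[|"<>]', ' ', title)  # other forbidden chars -> space
    return title.replace('\x08', '')       # drop backspace characters

print(sanitize_title('a/b|c"d'))  # a,b c d
```

Collecting the rules in one place also makes it easy to extend the helper later, e.g. for ':' or '?' on Windows.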

The crawl_article_images function
def crawl_article_images(post_url):
    """
    Crawl the image links in an article
    :param post_url: article page
    """
    image_url = set()  # image links crawled so far
    flag = True # whether to keep crawling
    while flag:
        html = download(post_url) # download the page
        if html is None:
            break

        soup = BeautifulSoup(html, "html.parser") # parse the downloaded page
        title = soup.find('h1', {'class': 'title'}).text  # get the article title
        image_div = soup.find_all('div', {'class': 'image-package'}) # get the image div elements
        if len(image_div) == 0: # no image divs on the page, stop crawling
            break

        i = 1
        image_content = ''
        for image in image_div:
            image_link = image.img.get('data-original-src') # original link of the image
            image_caption = image.find('div', {'class': 'image-caption'}).text # caption of the image
            image_content += str(i) + '. ' + (unicode(image_caption).encode('utf-8', errors='ignore')) + ' : '+ (unicode(image_link).encode('utf-8', errors='ignore')) + '\n'
            image_url.add(image_link)  # record each distinct image link
            i += 1

        if not os.path.exists('spider_output/'):  # make sure the output directory exists
            os.mkdir('spider_output')

        file_name = 'spider_output/' + title + '_images.txt'  # name of the file to save
        if not os.path.exists(file_name):
            file = open(file_name, 'wb')  # write the file
            file.write(image_content)
            file.close()
        flag = False

    image_num = len(image_url)
    print 'total number of images in the article: ', image_num

This part was written from the teacher's demo code and from inspecting the target page. To grab only the specific image information in the article, I first select the image div elements, then extract the link and caption they contain. If anyone has a quicker, clearer approach, suggestions are welcome.
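One somewhat more compact variant is BeautifulSoup's CSS-selector interface, select() / select_one(), which can address the nested caption div in a single expression. A sketch on a static snippet; the HTML below is made up to imitate the page structure crawled above:

```python
from bs4 import BeautifulSoup

# A made-up fragment imitating the image-package structure of the article page.
html = '''
<div class="image-package"><img data-original-src="http://example.com/a.png"/>
  <div class="image-caption">figure one</div></div>
<div class="image-package"><img data-original-src="http://example.com/b.png"/>
  <div class="image-caption">figure two</div></div>
'''
soup = BeautifulSoup(html, 'html.parser')
# one (link, caption) pair per image-package div
pairs = [(div.img['data-original-src'], div.select_one('div.image-caption').text)
         for div in soup.select('div.image-package')]
for i, (link, caption) in enumerate(pairs, 1):
    print('%d. %s : %s' % (i, caption, link))
```

Whether this is clearer than find_all is a matter of taste; it does keep the link and its caption paired in one pass.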

The crawl_article_text_link function
def crawl_article_text_link(post_url):
    """
    Crawl the text links in an article
    :param post_url: article page
    """
    text_link_url = set()  # text-link URLs crawled so far
    flag = True # whether to keep crawling
    while flag:
        html = download(post_url) # download the page
        if html is None:
            break

        soup = BeautifulSoup(html, "html.parser") # parse the downloaded page
        title = soup.find('h1', {'class': 'title'}).text  # get the article title
        article_content = soup.find('div', {'class': 'show-content'}) # get the article content div
        text_links = article_content.find_all('a', {'target': '_blank'})
        if len(text_links) == 0: # no text links on the page, stop crawling
            break

        i = 1
        text_links_content = ''
        for link in text_links:
            link_url = link.get('href') # URL of the text link
            link_label = link.text # label of the text link
            text_links_content += str(i) + '. ' + (unicode(link_label).encode('utf-8', errors='ignore')) + ' : '+ (unicode(link_url).encode('utf-8', errors='ignore')) + '\n'
            text_link_url.add(link_url)  # record each distinct link URL
            i += 1

        if not os.path.exists('spider_output/'):  # make sure the output directory exists
            os.mkdir('spider_output')

        file_name = 'spider_output/' + title + '_article_text_links.txt'  # name of the file to save
        if not os.path.exists(file_name):
            file = open(file_name, 'wb')  # write the file
            file.write(text_links_content)
            file.close()
        flag = False

    link_num = len(text_link_url)
    print 'total number of text links in the article: ', link_num

First grab the article body, then grab the link elements inside it. Again, suggestions for a simpler, clearer approach are welcome.
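The numbering counter and the deduplicating set above can also be combined in one pass that only numbers the first occurrence of each URL. A self-contained sketch (the helper name and the sample data are made up):

```python
def number_unique_links(links):
    """Turn (label, url) pairs into a numbered text block plus the list of
    unique URLs, keeping only the first occurrence of each URL."""
    seen = []   # unique URLs, in first-seen order
    lines = []  # numbered "label : url" lines
    for label, url in links:
        if url in seen:
            continue  # duplicate link inside the article body
        seen.append(url)
        lines.append('%d. %s : %s' % (len(seen), label, url))
    return '\n'.join(lines), seen

text, urls = number_unique_links([
    ('intro 01', 'http://www.jianshu.com/p/10b429fd9c4d'),
    ('intro 02', 'http://www.jianshu.com/p/faf2f4107b9b'),
    ('intro 01 again', 'http://www.jianshu.com/p/10b429fd9c4d'),
])
print(text)
```

With this shape the numbering in the output file always matches the count reported at the end, even when an article links to the same page twice.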

The crawl_links function
def crawl_links(url_seed, url_root):
    """
    Crawl the article links
    :param url_seed: seed page address to download
    :param url_root: root URL of the crawled site
    :return: set of page links to crawl
    """
    crawled_url = set()  # pages to crawl
    i = 1
    flag = True  # whether to keep crawling
    while flag:
        url = url_seed % i  # the page actually crawled
        i += 1  # next page to crawl

        html = download(url)  # download the page
        if html is None:  # an empty page means we have reached the end
            break

        soup = BeautifulSoup(html, "html.parser")  # parse the downloaded page
        links = soup.find_all('a', {'class': 'title'})  # get the title elements
        if len(links) == 0:  # no valid data left on the page, stop crawling
            flag = False

        for link in links:  # collect the valid article addresses
            # compare the absolute URL, so revisiting a link actually ends the crawl
            realUrl = urlparse.urljoin(url_root, link.get('href'))
            if realUrl not in crawled_url:
                crawled_url.add(realUrl)  # record each distinct page to crawl
            else:
                print 'end'
                flag = False  # stop crawling

    paper_num = len(crawled_url)
    print 'total paper num: ', paper_num
    return crawled_url

This follows the code teacher Zeng provided in class.
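The urljoin call is what turns the relative href from each title element into an absolute address. In Python 2 it lives in the urlparse module, in Python 3 in urllib.parse; a quick sketch of its behavior:

```python
try:
    from urlparse import urljoin        # Python 2
except ImportError:
    from urllib.parse import urljoin    # Python 3

url_root = 'http://www.jianshu.com/'
# a relative path is resolved against the root
print(urljoin(url_root, '/p/10b429fd9c4d'))         # http://www.jianshu.com/p/10b429fd9c4d
# an already-absolute link is returned unchanged
print(urljoin(url_root, 'http://example.com/p/x'))  # http://example.com/p/x
```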

Calling the functions to crawl the pages
crawl_article_images('http://www.jianshu.com/p/10b429fd9c4d')
crawl_article_images('http://www.jianshu.com/p/faf2f4107b9b')
crawl_article_images('http://www.jianshu.com/p/111')
crawl_article_text_link('http://www.jianshu.com/p/10b429fd9c4d')
crawl_article_text_link('http://www.jianshu.com/p/faf2f4107b9b')
crawl_page(['http://www.jianshu.com/p/10b429fd9c4d'])
crawl_page(['http://www.jianshu.com/p/faf2f4107b9b'])

I tried crawling the pages of the previous two crawler assignments, plus a nonexistent page with crawl_article_images.

The output in the Python console:

downloading:  http://www.jianshu.com/p/10b429fd9c4d
total number of images in the article:  2
downloading:  http://www.jianshu.com/p/faf2f4107b9b
total number of images in the article:  0
downloading:  http://www.jianshu.com/p/111
download error:  Not Found
total number of images in the article:  0
downloading:  http://www.jianshu.com/p/10b429fd9c4d
total number of text links in the article:  2
downloading:  http://www.jianshu.com/p/faf2f4107b9b
total number of text links in the article:  2
downloading:  http://www.jianshu.com/p/10b429fd9c4d
downloading:  http://www.jianshu.com/p/faf2f4107b9b

The crawled result files are shown below:
[Figure 1: the crawled result files]
[Figure: result file content, sample 1]
[Figure: result file content, sample 2]
[Figure 2: result file content, sample 3]

After crawler lesson 04 I understood that the lesson-03 assignment was meant to crawl in two steps: first call crawl_links to collect the article links in the "Decoding Big Data" collection, then call crawl_page to crawl the article content on the pages behind those links. The new driver code:

url_root = 'http://www.jianshu.com/'
url_seed = 'http://www.jianshu.com/c/9b4685b6357c/?page=%d'
crawled_url = crawl_links(url_seed, url_root)
crawl_page(crawled_url)
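The %d placeholder in url_seed is what lets the loop inside crawl_links walk through the collection pages; each iteration fills in the page number with the % operator:

```python
url_seed = 'http://www.jianshu.com/c/9b4685b6357c/?page=%d'
for i in range(1, 4):
    print(url_seed % i)
# http://www.jianshu.com/c/9b4685b6357c/?page=1
# http://www.jianshu.com/c/9b4685b6357c/?page=2
# http://www.jianshu.com/c/9b4685b6357c/?page=3
```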

The output in the Python console:

downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=1
downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=2
downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=3
downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=4
downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=5
downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=6
downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=7
downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=8
downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=9
downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=10
downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=11
downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=12
downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=13
downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=14
downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=15
downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=16
downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=17
downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=18
downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=19
downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=20
downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=21
downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=22
downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=23
downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=24
downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=25
downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=26
downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=27
downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=28
downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=29
downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=30
downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=31
downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=32
total paper num:  305
downloading:  http://www.jianshu.com/p/45df7e3ecc78
downloading:  http://www.jianshu.com/p/99ae5b28a51f
downloading:  http://www.jianshu.com/p/d6243f087bd9
downloading:  http://www.jianshu.com/p/ea40c6da9fec
downloading:  http://www.jianshu.com/p/59e0da43136e
downloading:  http://www.jianshu.com/p/e71e5d7223bb
downloading:  http://www.jianshu.com/p/dc07545c6607
downloading:  http://www.jianshu.com/p/99fd951a0b8b
downloading:  http://www.jianshu.com/p/02f33063c258
downloading:  http://www.jianshu.com/p/ad10d79255f8
downloading:  http://www.jianshu.com/p/062b8dfca144
downloading:  http://www.jianshu.com/p/cb4f8ab1b380
downloading:  http://www.jianshu.com/p/2c557a1bfa04
downloading:  http://www.jianshu.com/p/8f7102c74a4f
downloading:  http://www.jianshu.com/p/77876ef45ab4
downloading:  http://www.jianshu.com/p/e5475131d03f
downloading:  http://www.jianshu.com/p/e0bd6bfad10b
downloading:  http://www.jianshu.com/p/a425acdaf77e
downloading:  http://www.jianshu.com/p/729edfc613aa
downloading:  http://www.jianshu.com/p/e50c863bb465
downloading:  http://www.jianshu.com/p/7107b67c47bc
downloading:  http://www.jianshu.com/p/6585d58f582a
downloading:  http://www.jianshu.com/p/4f38600dae7c
downloading:  http://www.jianshu.com/p/1292d7a3805e
downloading:  http://www.jianshu.com/p/7cb84cfa56fa
downloading:  http://www.jianshu.com/p/41c14ef3e59a
downloading:  http://www.jianshu.com/p/1a2a07611fd8
downloading:  http://www.jianshu.com/p/217a4578f9ab
downloading:  http://www.jianshu.com/p/d234a015fa90
downloading:  http://www.jianshu.com/p/e08d1a03045f
downloading:  http://www.jianshu.com/p/41b1ee54d766
downloading:  http://www.jianshu.com/p/6f4a7a1ef85c
downloading:  http://www.jianshu.com/p/faf2f4107b9b
downloading:  http://www.jianshu.com/p/9dee9886b140
downloading:  http://www.jianshu.com/p/e2ee86a8a32b
downloading:  http://www.jianshu.com/p/9258b0495021
downloading:  http://www.jianshu.com/p/7e2fccb4fad9
downloading:  http://www.jianshu.com/p/f21f01a92521
downloading:  http://www.jianshu.com/p/d882831868fb
downloading:  http://www.jianshu.com/p/872a67eed7af
downloading:  http://www.jianshu.com/p/2e64c2045be5
downloading:  http://www.jianshu.com/p/565500cfb5a4
downloading:  http://www.jianshu.com/p/1729787990e7
downloading:  http://www.jianshu.com/p/8ca518b3b2d5
downloading:  http://www.jianshu.com/p/9c7fbcac3461
downloading:  http://www.jianshu.com/p/13d76e7741c0
downloading:  http://www.jianshu.com/p/81d17436f29e
downloading:  http://www.jianshu.com/p/148b7cc83bcd
downloading:  http://www.jianshu.com/p/70b7505884e9
downloading:  http://www.jianshu.com/p/ba4100af215a
downloading:  http://www.jianshu.com/p/333dacb0e1b2
downloading:  http://www.jianshu.com/p/ff2d4eadebde
downloading:  http://www.jianshu.com/p/eb01f9002091
downloading:  http://www.jianshu.com/p/ba43beaa186a
downloading:  http://www.jianshu.com/p/14967ec6e954
downloading:  http://www.jianshu.com/p/d44cc7e9a0a9
downloading:  http://www.jianshu.com/p/d0de8ee83ea1
downloading:  http://www.jianshu.com/p/b4670cb9e998
downloading:  http://www.jianshu.com/p/9f9fb337be0c
downloading:  http://www.jianshu.com/p/542f41879879
downloading:  http://www.jianshu.com/p/e9f6b15318be
downloading:  http://www.jianshu.com/p/f1ef93a6c033
downloading:  http://www.jianshu.com/p/92a66ccc8998
downloading:  http://www.jianshu.com/p/f0063d735a5c
downloading:  http://www.jianshu.com/p/856c8d648e20
downloading:  http://www.jianshu.com/p/b9407b2c22a4
downloading:  http://www.jianshu.com/p/a36e997b8e59
downloading:  http://www.jianshu.com/p/c28207b3c71d
downloading:  http://www.jianshu.com/p/8448ac374dc1
downloading:  http://www.jianshu.com/p/4a3fbcb06981
downloading:  http://www.jianshu.com/p/d7267956035a
downloading:  http://www.jianshu.com/p/b1a9daef3423
downloading:  http://www.jianshu.com/p/5eb037498c48
downloading:  http://www.jianshu.com/p/f756bf0beb26
downloading:  http://www.jianshu.com/p/673b768c6084
downloading:  http://www.jianshu.com/p/6233788a8abb
downloading:  http://www.jianshu.com/p/087ce1951647
downloading:  http://www.jianshu.com/p/7240db1ba0af
downloading:  http://www.jianshu.com/p/289e51eb6446
downloading:  http://www.jianshu.com/p/39d6793a6554
downloading:  http://www.jianshu.com/p/0565cd673282
downloading:  http://www.jianshu.com/p/873613065502
downloading:  http://www.jianshu.com/p/605644d688ff
downloading:  http://www.jianshu.com/p/1ea730c97aae
downloading:  http://www.jianshu.com/p/bab0c09416ee
downloading:  http://www.jianshu.com/p/c6591991d1ca
downloading:  http://www.jianshu.com/p/fd9536a0acfb
downloading:  http://www.jianshu.com/p/ed8dc3802927
downloading:  http://www.jianshu.com/p/f89c4032a0b2
downloading:  http://www.jianshu.com/p/1fa23219270d
downloading:  http://www.jianshu.com/p/defeeb920c3a
downloading:  http://www.jianshu.com/p/412f8eab2599
downloading:  http://www.jianshu.com/p/05c15b9f16f1
downloading:  http://www.jianshu.com/p/4931d66276c3
downloading:  http://www.jianshu.com/p/b5165468a32b
downloading:  http://www.jianshu.com/p/2c02a7b0b382
downloading:  http://www.jianshu.com/p/dffdaf11bd4c
downloading:  http://www.jianshu.com/p/71c02ef761ac
downloading:  http://www.jianshu.com/p/6920d5e48b31
downloading:  http://www.jianshu.com/p/71b968bd8abb
downloading:  http://www.jianshu.com/p/6450dce856fd
downloading:  http://www.jianshu.com/p/c1163e39a42e
downloading:  http://www.jianshu.com/p/bd9a27c4e2a8
downloading:  http://www.jianshu.com/p/88d0addf64fa
downloading:  http://www.jianshu.com/p/6a7afc98c868
downloading:  http://www.jianshu.com/p/733475b6900d
downloading:  http://www.jianshu.com/p/f75128ec3ea3
downloading:  http://www.jianshu.com/p/9ee12067f35e
downloading:  http://www.jianshu.com/p/c41624a83b71
downloading:  http://www.jianshu.com/p/8318f5b722cf
downloading:  http://www.jianshu.com/p/b5c292e093a2
downloading:  http://www.jianshu.com/p/0a6977eb686d
downloading:  http://www.jianshu.com/p/456ab3a6ef71
downloading:  http://www.jianshu.com/p/d578d5e2755f
downloading:  http://www.jianshu.com/p/616642976ded
downloading:  http://www.jianshu.com/p/c9e1dffad756
downloading:  http://www.jianshu.com/p/81819f27a7d8
downloading:  http://www.jianshu.com/p/a4beefd8cfc2
downloading:  http://www.jianshu.com/p/799c51fbe5f1
downloading:  http://www.jianshu.com/p/5e4a86f8025c
downloading:  http://www.jianshu.com/p/7acf291b2a5e
downloading:  http://www.jianshu.com/p/6ef6b9a56b50
downloading:  http://www.jianshu.com/p/210aacd31ef7
downloading:  http://www.jianshu.com/p/9a9280de68f8
downloading:  http://www.jianshu.com/p/d5bc50d8e0a2
downloading:  http://www.jianshu.com/p/39eb230e6f15
downloading:  http://www.jianshu.com/p/c0c0a3ed35d4
downloading:  http://www.jianshu.com/p/74db357c7252
downloading:  http://www.jianshu.com/p/6a91f948b62d
downloading:  http://www.jianshu.com/p/bc75ab89fac0
downloading:  http://www.jianshu.com/p/8088d1bede8d
downloading:  http://www.jianshu.com/p/8ca88a90ea17
downloading:  http://www.jianshu.com/p/a8037a38e219
downloading:  http://www.jianshu.com/p/979b4c5c1857
downloading:  http://www.jianshu.com/p/3dfedf60de62
downloading:  http://www.jianshu.com/p/ada67bd7c56f
downloading:  http://www.jianshu.com/p/486afcd4c36c
downloading:  http://www.jianshu.com/p/2841c81d57fc
downloading:  http://www.jianshu.com/p/e492d3acfe38
downloading:  http://www.jianshu.com/p/b4e2e5e31154
downloading:  http://www.jianshu.com/p/75fc36aec98e
downloading:  http://www.jianshu.com/p/545581b0c7dd
downloading:  http://www.jianshu.com/p/a015b756a803
downloading:  http://www.jianshu.com/p/29062bca16aa
downloading:  http://www.jianshu.com/p/3a95a09cda40
downloading:  http://www.jianshu.com/p/8fbe3a7b4764
downloading:  http://www.jianshu.com/p/0329f87c9ae4
downloading:  http://www.jianshu.com/p/e1b28de0a1e4
download error:  Gateway Time-out
504
downloading:  http://www.jianshu.com/p/e1b28de0a1e4
downloading:  http://www.jianshu.com/p/b5c31a2eeb8b
downloading:  http://www.jianshu.com/p/7e556f17021a
downloading:  http://www.jianshu.com/p/23144099e9f8
downloading:  http://www.jianshu.com/p/a91c54f96ded
downloading:  http://www.jianshu.com/p/74ef104a9f45
downloading:  http://www.jianshu.com/p/afa17bc391b7
downloading:  http://www.jianshu.com/p/90914aef3636
downloading:  http://www.jianshu.com/p/0c0e3ace0da1
downloading:  http://www.jianshu.com/p/b7eef4033a09
downloading:  http://www.jianshu.com/p/7b2e81589a4f
downloading:  http://www.jianshu.com/p/2f7d10b2e508
downloading:  http://www.jianshu.com/p/ed499f4ecdd1
downloading:  http://www.jianshu.com/p/11c103c03d4a
downloading:  http://www.jianshu.com/p/97ff0beca873
downloading:  http://www.jianshu.com/p/7c54cd046d4b
downloading:  http://www.jianshu.com/p/cfaf85b24281
downloading:  http://www.jianshu.com/p/356a579062aa
downloading:  http://www.jianshu.com/p/460a8eed5cfa
downloading:  http://www.jianshu.com/p/46e82e4fe324
downloading:  http://www.jianshu.com/p/ba00a9852a02
downloading:  http://www.jianshu.com/p/b6359185fc26
downloading:  http://www.jianshu.com/p/a1a2dabb4bc2
downloading:  http://www.jianshu.com/p/4077cbc4dd37
downloading:  http://www.jianshu.com/p/90efe88727fe
downloading:  http://www.jianshu.com/p/17f99100525a
downloading:  http://www.jianshu.com/p/01385e2dd129
downloading:  http://www.jianshu.com/p/ec3c57d6a4c7
downloading:  http://www.jianshu.com/p/9632ba906ca2
downloading:  http://www.jianshu.com/p/85da47fddad7
downloading:  http://www.jianshu.com/p/3b47b36cc8e8
downloading:  http://www.jianshu.com/p/29e304a61d32
downloading:  http://www.jianshu.com/p/649167e0e2f4
downloading:  http://www.jianshu.com/p/13840057782d
downloading:  http://www.jianshu.com/p/11b3dbb05c39
downloading:  http://www.jianshu.com/p/3a5975d6ac55
downloading:  http://www.jianshu.com/p/394856545ab0
downloading:  http://www.jianshu.com/p/0ee1f0bfc8cb
downloading:  http://www.jianshu.com/p/2364064e0bc9
downloading:  http://www.jianshu.com/p/09b19b8f8886
downloading:  http://www.jianshu.com/p/50a2ba489685
downloading:  http://www.jianshu.com/p/f0436668cb72
downloading:  http://www.jianshu.com/p/c0f3d36d0c7a
downloading:  http://www.jianshu.com/p/be0192aa6486
downloading:  http://www.jianshu.com/p/ee43c55123f8
downloading:  http://www.jianshu.com/p/af4765b703f0
downloading:  http://www.jianshu.com/p/ff772050bd96
downloading:  http://www.jianshu.com/p/e121b1a420ad
downloading:  http://www.jianshu.com/p/ed93f7f344d0
downloading:  http://www.jianshu.com/p/8f6ee3b1efeb
downloading:  http://www.jianshu.com/p/3f06c9f69142
downloading:  http://www.jianshu.com/p/887889c6daee
downloading:  http://www.jianshu.com/p/ce0e0773c6ec
downloading:  http://www.jianshu.com/p/be384fd73bdb
downloading:  http://www.jianshu.com/p/acc47733334f
downloading:  http://www.jianshu.com/p/bf5984fb299a
downloading:  http://www.jianshu.com/p/1a935c2dc911
downloading:  http://www.jianshu.com/p/8982ad63eb85
downloading:  http://www.jianshu.com/p/d1acbed69f45
downloading:  http://www.jianshu.com/p/98cc73755a22
downloading:  http://www.jianshu.com/p/bb736600b483
downloading:  http://www.jianshu.com/p/3c71839bc660
downloading:  http://www.jianshu.com/p/23a905cf936b
downloading:  http://www.jianshu.com/p/169403f7e40c
downloading:  http://www.jianshu.com/p/a9c7970bc949
downloading:  http://www.jianshu.com/p/ed9ec88e71e4
downloading:  http://www.jianshu.com/p/5057ab6f9ad5
downloading:  http://www.jianshu.com/p/1b42a12dac14
downloading:  http://www.jianshu.com/p/5dc5dfe26148
downloading:  http://www.jianshu.com/p/c88a4453dd6d
downloading:  http://www.jianshu.com/p/cd971afcb207
downloading:  http://www.jianshu.com/p/2ccd37ae73e2
downloading:  http://www.jianshu.com/p/926013888e3e
downloading:  http://www.jianshu.com/p/888a580b2384
downloading:  http://www.jianshu.com/p/8a0479f55b21
downloading:  http://www.jianshu.com/p/e72c8ef71e49
downloading:  http://www.jianshu.com/p/bb4a81624af1
downloading:  http://www.jianshu.com/p/4b944b22fe83
downloading:  http://www.jianshu.com/p/b3e8e9cb0141
downloading:  http://www.jianshu.com/p/bfd9b3954038
downloading:  http://www.jianshu.com/p/f6c26ef0f4cc
downloading:  http://www.jianshu.com/p/56967004f8c4
downloading:  http://www.jianshu.com/p/ae5f78b40f17
downloading:  http://www.jianshu.com/p/aed64f7e647b
downloading:  http://www.jianshu.com/p/a32f27199846
downloading:  http://www.jianshu.com/p/4b4e0c343d3e
downloading:  http://www.jianshu.com/p/8f6b5a1bb3fa
downloading:  http://www.jianshu.com/p/f7354d1c5abf
downloading:  http://www.jianshu.com/p/1fe31cbddc78
downloading:  http://www.jianshu.com/p/f7dc92913f33
downloading:  http://www.jianshu.com/p/296ae7538d1f
downloading:  http://www.jianshu.com/p/d43125a4ff44
downloading:  http://www.jianshu.com/p/0b0b7c33be57
downloading:  http://www.jianshu.com/p/b4ac4473a55d
downloading:  http://www.jianshu.com/p/4b57424173a0
downloading:  http://www.jianshu.com/p/e0ae002925bd
downloading:  http://www.jianshu.com/p/5250518f5cc5
downloading:  http://www.jianshu.com/p/de3455ed089c
downloading:  http://www.jianshu.com/p/7b946e6d6861
downloading:  http://www.jianshu.com/p/62e127dbb73c
downloading:  http://www.jianshu.com/p/430b5bea974d
downloading:  http://www.jianshu.com/p/e5d13e351320
downloading:  http://www.jianshu.com/p/5d8a3205e28e
downloading:  http://www.jianshu.com/p/1099c3a74336
downloading:  http://www.jianshu.com/p/761a73b7eea2
downloading:  http://www.jianshu.com/p/83cc892eb24a
downloading:  http://www.jianshu.com/p/b223e54fe5ee
downloading:  http://www.jianshu.com/p/366c2594f24b
downloading:  http://www.jianshu.com/p/cc3b5d76c587
downloading:  http://www.jianshu.com/p/6dbadc78d231
downloading:  http://www.jianshu.com/p/d32d7ab5063a
downloading:  http://www.jianshu.com/p/020f0281f1df
downloading:  http://www.jianshu.com/p/f26085aadd47
downloading:  http://www.jianshu.com/p/df7b35249975
downloading:  http://www.jianshu.com/p/68423bfc4c4e
downloading:  http://www.jianshu.com/p/601d3a488a58
downloading:  http://www.jianshu.com/p/1d6fc1a9406b
downloading:  http://www.jianshu.com/p/76238014a03f
downloading:  http://www.jianshu.com/p/9e7cfcc85a57
downloading:  http://www.jianshu.com/p/819a202adecd
downloading:  http://www.jianshu.com/p/4a8749704ebf
downloading:  http://www.jianshu.com/p/d2dc5aa9bf8f
downloading:  http://www.jianshu.com/p/4dda2425314a
downloading:  http://www.jianshu.com/p/8baa664ea613
downloading:  http://www.jianshu.com/p/cbfab5db7f6f
downloading:  http://www.jianshu.com/p/bd78a49c9d23
downloading:  http://www.jianshu.com/p/cf2edecdba77
downloading:  http://www.jianshu.com/p/3b3bca4281aa
downloading:  http://www.jianshu.com/p/f382741c2736
downloading:  http://www.jianshu.com/p/4ffca0a43476
downloading:  http://www.jianshu.com/p/e04bcac99c8d
downloading:  http://www.jianshu.com/p/5a6c4b8e7700
downloading:  http://www.jianshu.com/p/37e927476dfe
downloading:  http://www.jianshu.com/p/67ae9d87cf3c
downloading:  http://www.jianshu.com/p/4981df2eefe7
downloading:  http://www.jianshu.com/p/86117613b7a6
downloading:  http://www.jianshu.com/p/233ff48d668e
downloading:  http://www.jianshu.com/p/13a68ac7afdd
downloading:  http://www.jianshu.com/p/aa1121232dfd
downloading:  http://www.jianshu.com/p/e99dacbf5c44
downloading:  http://www.jianshu.com/p/74042ba10c0d
downloading:  http://www.jianshu.com/p/40cc7d239513
downloading:  http://www.jianshu.com/p/5a8b8ce0a395
downloading:  http://www.jianshu.com/p/59ca82a11f87
downloading:  http://www.jianshu.com/p/8266f0c736f9
downloading:  http://www.jianshu.com/p/fa7dd359d7a8
downloading:  http://www.jianshu.com/p/87f36332b707
downloading:  http://www.jianshu.com/p/10b429fd9c4d
downloading:  http://www.jianshu.com/p/9086d0300d1a
downloading:  http://www.jianshu.com/p/e76c242c7d6a
downloading:  http://www.jianshu.com/p/910662d6e881
downloading:  http://www.jianshu.com/p/f68d28d3b862
downloading:  http://www.jianshu.com/p/9457100d8763
downloading:  http://www.jianshu.com/p/62c0a5122fa8
downloading:  http://www.jianshu.com/p/f6420cce3040
downloading:  http://www.jianshu.com/p/27a78b2016e0
downloading:  http://www.jianshu.com/p/0c007dbbf728
downloading:  http://www.jianshu.com/p/f20bc50ad0e8

294 files were generated in total, most of them articles from the "Decoding Big Data" collection. The log above records one run in which a single 504 page-access error occurred.


Code Part 2: lxml implementation
  1. Import the modules
  2. Basic download function: download
  3. Function that crawls the article body from an article page: crawl_page
  4. Function that crawls image info and links inside an article: crawl_article_images
  5. Function that crawls text-link info and URLs inside an article: crawl_article_text_link

The overall structure matches the beautifulsoup4 code. Two kinds of problems came up during the implementation that I hope a teacher or someone more experienced can comment on someday. The official English lxml documentation is extensive but not easy to search for the problems we actually hit this time; I mostly relied on web searches, so the fixes are not systematic.

  1. Decoding garbled Chinese. I have not yet found a way to make lxml.html.fromstring(), which teacher Zeng demonstrated in class, return results that display Chinese characters correctly, so I resorted to the related etree functions instead.
  2. lxml functions essentially always return lists. When grabbing text that sits at several levels and places under one element, how do you merge the crawled list of text fragments into a single string variable that is convenient to write to a file? I have solved it now, but I am still curious how the teacher handles the list return values of lxml functions.
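For point 2, the pattern I ended up with is joining the list returned by xpath() with str.join. A minimal sketch on a made-up fragment that imitates the article structure:

```python
import lxml.etree

# A made-up fragment with text spread across nested elements.
html = '<div class="show-content"><p>first <b>bold</b> part</p><p>second part</p></div>'
tree = lxml.etree.HTML(html)
fragments = tree.xpath('//div[@class="show-content"]//text()')  # a list of strings
content = ''.join(fragments)  # merge into one string variable for writing to a file
print(content)  # first bold partsecond part
```

Note that join concatenates the fragments exactly as they appear; '\n'.join would instead put each fragment on its own line.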

The lxml implementation, including the code that calls the functions to run the crawl, is shown in full below rather than split into blocks.

# coding: utf-8
"""
lxml version of the crawler-class practice code
Course Assignment - Crawler Intro 03 - Crawler Basics - WilliamZeng - 20170716
"""

import os
import time
import urllib2
import lxml.html # lxml module for parsing HTML results
import lxml.etree # lxml module pulled in specifically to work around garbled Chinese

def download(url, retry=2):
    """
    Download a page, returning the full page content
    :param url: the URL to download
    :param retry: number of retries
    :return: raw html
    """
    print "downloading: ", url
    # set header info to imitate a browser request
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'
    }
    try: # the request may fail, so catch errors with try-except
        request = urllib2.Request(url, headers=header) # build the request
        html = urllib2.urlopen(request).read() # fetch the url
    except urllib2.URLError as e: # error handling
        print "download error: ", e.reason
        html = None
        if retry > 0: # retries remaining, keep trying
            if hasattr(e, 'code') and 500 <= e.code < 600: # only retry on 5xx server errors
                print e.code
                return download(url, retry - 1)
    time.sleep(1) # wait 1s to avoid stressing the server and getting blocked
    return html

def crawl_article_images(post_url):
    """
    Crawl the image links in an article
    :param post_url: article page
    """
    image_link = []
    flag = True # whether to keep crawling
    while flag:
        page = download(post_url) # download the page
        if page is None:
            break
        my_parser = lxml.etree.HTMLParser(encoding="utf-8")
        html_content = lxml.etree.HTML(page, parser=my_parser) # parse the downloaded page
        # html_content = lxml.html.fromstring(page) # parsing with fromstring left Chinese garbled; no fix found
        title = html_content.xpath('//h1[@class="title"]/text()')  # get the article title
        image_link = html_content.xpath('//div/img/@data-original-src') # original links of the images
        image_caption = html_content.xpath('//div[@class="image-caption"]/text()') # captions of the images
        if len(image_link) == 0: # no image elements on the page, stop crawling
            break

        image_content = ''
        for i in range(len(image_link)):
            image_content += str(i + 1) + '. ' + (unicode(image_caption[i]).encode('utf-8', errors='ignore')) + ' : '+ image_link[i] + '\n'

        if not os.path.exists('spider_output/'):  # make sure the output directory exists
            os.mkdir('spider_output')

        file_name = 'spider_output/' + title[0] + '_images_by_lxml.txt'  # name of the file to save
        if not os.path.exists(file_name):
            file = open(file_name, 'wb')  # write the file
            file.write(image_content)
            file.close()
        flag = False

    image_num = len(image_link)
    print 'total number of images in the article: ', image_num

def crawl_article_text_link(post_url):
    """
    Crawl the text links in an article
    :param post_url: article page
    """
    flag = True # whether to keep crawling
    while flag:
        page = download(post_url) # download the page
        if page is None:
            break

        my_parser = lxml.etree.HTMLParser(encoding="utf-8")
        html_content = lxml.etree.HTML(page, parser=my_parser)  # parse the downloaded page
        title = html_content.xpath('//h1[@class="title"]/text()')  # get the article title
        text_links = html_content.xpath('//div[@class="show-content"]//a/@href')
        text_links_label = html_content.xpath('//div[@class="show-content"]//a/text()')
        if len(text_links) == 0: # no text links on the page, stop crawling
            break

        text_links_content = ''
        for i in range(len(text_links)):
            text_links_content += str(i + 1) + '. ' + (unicode(text_links_label[i]).encode('utf-8', errors='ignore')) + ' : '+ text_links[i] + '\n'

        if not os.path.exists('spider_output/'):  # make sure the output directory exists
            os.mkdir('spider_output')

        file_name = 'spider_output/' + title[0] + '_article_text_links_by_lxml.txt'  # name of the file to save
        if not os.path.exists(file_name):
            file = open(file_name, 'wb')  # write the file
            file.write(text_links_content)
            file.close()
        flag = False

    link_num = len(text_links)
    print 'total number of text links in the article: ', link_num

def crawl_page(crawled_url):
    """
    Crawl the article content
    :param crawled_url: set of page addresses to crawl
    """
    for link in crawled_url: # crawl the articles one address at a time
        page = download(link)
        my_parser = lxml.etree.HTMLParser(encoding="utf-8")
        html_content = lxml.etree.HTML(page, parser=my_parser)
        title = html_content.xpath('//h1[@class="title"]/text()') # get the article title
        contents = html_content.xpath('//div[@class="show-content"]//text()') # get the article content
        content = ''.join(contents)

        if not os.path.exists('spider_output/'): # make sure the output directory exists
            os.mkdir('spider_output/')

        file_name = 'spider_output/' + title[0] + '_by_lxml.txt' # name of the file to save
        if os.path.exists(file_name):
            # os.remove(file_name) # delete the existing file
            continue  # skip files that already exist
        file = open(file_name, 'wb') # write the file
        content = unicode(content).encode('utf-8', errors='ignore')
        file.write(content)
        file.close()


crawl_article_images('http://www.jianshu.com/p/10b429fd9c4d')
crawl_article_images('http://www.jianshu.com/p/faf2f4107b9b')
crawl_article_images('http://www.jianshu.com/p/111')
crawl_article_text_link('http://www.jianshu.com/p/10b429fd9c4d')
crawl_article_text_link('http://www.jianshu.com/p/faf2f4107b9b')
crawl_page(['http://www.jianshu.com/p/10b429fd9c4d'])
crawl_page(['http://www.jianshu.com/p/faf2f4107b9b'])

The crawl_article_images and crawl_article_text_link functions return the same results as the beautifulsoup4 versions. The output of crawl_page differs slightly in format, as shown below.

[Figure 3: result file content, sample 1]

I did not write a new driver that calls crawl_links and crawl_page with lxml to crawl the "Decoding Big Data" article links and their pages in two steps. Apart from the xpath element selection, the body would differ little from the code above; the likely remaining problem is again handling special characters in article titles.


There is a lot of content and many code comments this time, so there may be some textual mistakes or spots I forgot to revise, but the code runs without problems.
