Web Crawler Homework 3 (0706)

Thanks to Teacher Zeng for the patient explanations and detailed answers.

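This homework is mainly meant to get us familiar with crawler code, so I only changed the parts needed to run it.
Although I adapted and annotated the whole script based on the homework of 汤尧 and joe, I have to admit that as a beginner I still don't understand it deeply, and I look forward to Teacher Zeng's patient explanation in tonight's class.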

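The homework for this class:

Use the URL chosen in the second homework

Crawl every element on that page that can be crawled; at minimum, the main body of each article

Optionally, try crawling with lxml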
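
As a rough illustration of the lxml option, here is a minimal sketch in Python 3. It assumes the list page marks article links as `<a class="title">`, which is what the BeautifulSoup code further down looks for. The sample HTML, the hrefs in it, and the helper name `extract_title_links` are made up for the example.

```python
from lxml import html as lxml_html  # third-party: pip install lxml

# Made-up snippet standing in for a downloaded list page (assumption).
SAMPLE_PAGE = """
<ul>
  <li><a class="title" href="/p/aaa111">First post</a></li>
  <li><a class="title" href="/p/bbb222">Second post</a></li>
  <li><a class="avatar" href="/u/someone">not an article</a></li>
</ul>
"""


def extract_title_links(page_source):
    """Return (href, text) pairs for every <a> whose class is exactly "title"."""
    tree = lxml_html.fromstring(page_source)
    anchors = tree.xpath('//a[@class="title"]')
    return [(a.get("href"), a.text_content().strip()) for a in anchors]


title_links = extract_title_links(SAMPLE_PAGE)
print(title_links)
```

Note that the XPath test `@class="title"` only matches when the class attribute is exactly `title`; an element carrying several classes would need a different expression.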

URL from the previous homework
http://www.jianshu.com/u/6eca8e1506ce

Code:

# coding: utf-8
# Import the libraries used below

import os
import time
import urllib2
import urlparse
from bs4 import BeautifulSoup  # used to parse the downloaded HTML
# Define the page download function, with error handling

def download(url, retry=2):  # download a page, retrying up to `retry` times on server errors
    print ("downloading:", url)  # log the URL being fetched

    # Set the request header to mimic a browser
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'}

    # Fetching may fail, so wrap it in try/except
    try:
        request = urllib2.Request(url, headers=header)  # build the request with the browser-like header
        html = urllib2.urlopen(request).read()  # fetch the page body
    except urllib2.URLError as e:  # handle the failure
        print ("download error: ", e.reason)  # print the reason for the failure
        html = None  # return None if the download ultimately fails
        # Retry only on 5xx server errors, which may be temporary;
        # a 4xx error (for example a mistyped URL) is not retried
        if retry > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
            print (e.code)  # print the status code
            return download(url, retry - 1)
    time.sleep(1)  # wait 1 s to avoid loading the server and to avoid being blocked
    return html
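
The retry rule described in the comments (retry on 5xx, give up on anything else) can be sketched in Python 3 as follows. This is not the code used for the homework: `urllib.request` replaces `urllib2`, the one-second pause is left out, and the `opener` parameter and the `make_flaky_opener` helper are hypothetical additions so the logic can be tried without touching the network.

```python
import io
import urllib.error
import urllib.request


def download(url, retries=2, opener=urllib.request.urlopen):
    """Return the body of `url` as bytes, or None if it cannot be fetched.

    Only HTTP 5xx responses are retried. `opener` is a hypothetical hook
    that lets the function be exercised with a fake fetcher.
    """
    try:
        with opener(url) as response:
            return response.read()
    except urllib.error.URLError as err:
        code = getattr(err, "code", None)  # only HTTPError carries a status code
        if retries > 0 and code is not None and 500 <= code < 600:
            return download(url, retries - 1, opener)
        return None


def make_flaky_opener(failures, status):
    """Build a fake opener that fails `failures` times with HTTP `status`."""
    calls = {"count": 0}

    def opener(url):
        calls["count"] += 1
        if calls["count"] <= failures:
            raise urllib.error.HTTPError(url, status, "simulated", None, None)
        return io.BytesIO(b"<html>ok</html>")

    return opener, calls


# A 503 is retried until the fake server recovers.
opener_503, calls_503 = make_flaky_opener(failures=2, status=503)
body_after_retries = download("http://example.com/", retries=2, opener=opener_503)

# A 404 is not retried at all.
opener_404, calls_404 = make_flaky_opener(failures=1, status=404)
body_not_found = download("http://example.com/", retries=2, opener=opener_404)
```

Passing the fetcher in as an argument is only a convenience for experimenting; the real script can keep calling `urlopen` directly.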

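url_root = 'http://www.jianshu.com'  # site root, used to turn relative article links into absolute URLs
url_seed = 'http://www.jianshu.com/u/6eca8e1506ce?page=%d'  # seed list page; the "page=%d" placeholder lets the loop below step through the pages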

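# Collect the article pages that actually need to be crawled

crawled_url = set()  # article URLs to crawl
i = 1
flag = True  # whether to keep crawling
while flag:
    url = url_seed % i  # fill the page number into "page=%d"
    i += 1  # move on to the next list page

    html = download(url)  # download the list page
    if html is None:  # nothing was downloaded, so treat this as the end
        break

    soup = BeautifulSoup(html, "html.parser")  # parse the downloaded page
    links = soup.find_all('a', {'class': 'title'})  # title links (<a> tags whose class is "title")
    if len(links) == 0:  # no articles on this page, stop crawling
        flag = False

    for link in links:  # build the absolute article URLs
        realUrl = urlparse.urljoin(url_root, link.get('href'))
        if realUrl not in crawled_url:  # compare absolute URLs, since that is what the set stores
            crawled_url.add(realUrl)  # record each article URL once
        else:
            print ('end')
            flag = False  # a repeated URL means the list has started repeating, stop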
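
The membership check in this loop only does its job if the value being tested has the same form as the values stored in the set, that is, the absolute URL produced by `urljoin`. Below is a minimal Python 3 sketch of the paging and de-duplication logic on its own. The function name `collect_article_urls` and the `fake_pages` data are hypothetical; each inner list stands in for the hrefs parsed from one list page.

```python
from urllib.parse import urljoin

URL_ROOT = "http://www.jianshu.com"


def collect_article_urls(pages):
    """Walk list pages in order and return absolute article URLs without repeats.

    `pages` is an iterable of href lists, one list per page. Collection stops
    at the first empty page, or after the first page that contains a URL
    already seen.
    """
    seen = set()
    ordered = []
    for hrefs in pages:
        if not hrefs:  # an empty page means there is nothing left
            break
        repeated = False
        for href in hrefs:
            absolute = urljoin(URL_ROOT, href)
            if absolute in seen:  # compare the same form that is stored
                repeated = True
                continue
            seen.add(absolute)
            ordered.append(absolute)
        if repeated:
            break
    return ordered


fake_pages = [
    ["/p/aaa111", "/p/bbb222"],
    ["/p/ccc333"],
    ["/p/ccc333"],  # the site serves the last page again
    ["/p/ddd444"],  # never reached
]
article_urls = collect_article_urls(fake_pages)
print(article_urls)
```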

# Output

('downloading:', 'http://www.jianshu.com/u/6eca8e1506ce?page=1')
('downloading:', 'http://www.jianshu.com/u/6eca8e1506ce?page=2')
('downloading:', 'http://www.jianshu.com/u/6eca8e1506ce?page=3')
('downloading:', 'http://www.jianshu.com/u/6eca8e1506ce?page=4')
('downloading:', 'http://www.jianshu.com/u/6eca8e1506ce?page=5')
('downloading:', 'http://www.jianshu.com/u/6eca8e1506ce?page=6')
('downloading:', 'http://www.jianshu.com/u/6eca8e1506ce?page=7')
('downloading:', 'http://www.jianshu.com/u/6eca8e1506ce?page=8')
('downloading:', 'http://www.jianshu.com/u/6eca8e1506ce?page=9')

# Count the articles that can be fetched

paper_num = len(crawled_url)
print('total paper num: ',paper_num)

# Output

('total paper num: ', 273)

# Fetch each article and save it under its title

if not os.path.exists('spider_res'):  # make sure the output directory exists
    os.mkdir('spider_res')

for link in crawled_url:  # fetch the articles one by one
    html = download(link)
    if html is None:  # skip articles that failed to download
        continue
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find('h1', {'class': 'title'}).text  # article title: text of the <h1 class="title"> tag
    content = soup.find('div', {'class': 'show-content'}).text  # article body: text of the <div class="show-content"> tag

    # Replace characters in the title that are not allowed in file names
    title = title.strip()
    for ch in '|":?<>/\\*':
        title = title.replace(ch, ' ')

    file_name = 'spider_res/' + title + '.txt'  # output path and file extension
    if os.path.exists(file_name):
        # os.remove(file_name)  # alternatively, delete the old file and rewrite it
        continue  # already saved, move on to the next article
    print (title)  # print the cleaned title to see what it looks like

    with open(file_name, 'wb') as f:  # write the file as bytes
        # encode as UTF-8, dropping any characters that cannot be encoded
        f.write(unicode(content).encode('utf-8', errors='ignore'))
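
The title clean-up is easier to reuse, and to check before the file name is tested with `os.path.exists`, when it lives in a small helper. A minimal Python 3 sketch follows. The helper name `safe_file_name`, the `untitled` fallback and the sample title are assumptions for illustration; the character set is the one Windows rejects in file names, which is slightly wider than the list replaced in the script.

```python
import re

# Characters that Windows rejects in file names, including both path separators.
_UNSAFE = re.compile(r'[\\/:*?"<>|]')


def safe_file_name(title, extension=".txt"):
    """Turn an article title into a file name that is safe to create."""
    cleaned = _UNSAFE.sub(" ", title.strip())
    cleaned = re.sub(r"\s+", " ", cleaned).strip()  # collapse runs of whitespace
    return (cleaned or "untitled") + extension  # fallback name is an assumption


example_name = safe_file_name('  What is "lxml"? A/B test: <notes>  ')
print(example_name)
```

Building the final name first and only then checking whether the file exists avoids comparing against a name that still contains the raw characters.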

The crawled files are saved here:

spider_res

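References:
汤尧 - Crawler Intro Homework 03

joe's homework has not been published yet, but this time I really did copy from joe's work, so I owe joe a very big thank-you.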
