Introduction
I got into The Witcher 3 around the middle of the year, and by now I've basically dropped every other game and spend my days wandering the map clearing question marks. Out of love for the game and the novel series, plus that lingering, slightly dated middle-aged notion that "even play should be for learning", I'm planning a small English-study tool to help me memorize in-game vocabulary and improve my English reading.
The features I plan to implement:
- Crawl and store: scrape all the in-game entries (characters, books, monsters and so on) into a database. I'll start with just the books to get the pipeline running, and add the other material at the end.
- A small app for organizing and reading the entries. I haven't settled on the UI or the implementation yet.
The crawler comes first.
Data source
After looking around I found this site: http://witcher.wikia.com/wiki/Category:Books_in_the_games
Many thanks to the fans for their enthusiastic curation.
The code
crawl_helper.py is a small crawler helper module. For a requirement this simple, I really don't feel like pulling in a framework:
#! /usr/bin/python
# -*- coding: UTF-8 -*-
"""
Uses requests, to sharpen my crawling skills.
Uses an IP proxy pool, see this article: http://ju.outofmemory.cn/entry/246458
FIXME:
1. Cookies could be kept in the database, so that after the first visit each request automatically carries the latest cookie.
Author: 萌萌哒小肥他爹
: https://www.jianshu.com/u/db796a501972
"""
import random
import requests
import string
from bs4 import BeautifulSoup
import logging
UA_LIST = [
    'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:56.0) Gecko/20100101 Firefox/56.0',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
]
def get_header(crawl_for):
    headers = {
        # 'Host': 'www.zhihu.com',
        'User-Agent': random.choice(UA_LIST),
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
        'Accept-Encoding': 'gzip, deflate',
        'Referer': 'http://www.baidu.com',
        'Connection': 'keep-alive',
        'Cache-Control': 'max-age=0',
    }
    if crawl_for == 'douban':
        headers['Host'] = 'www.douban.com'
        headers['Cookie'] = "bid=%s" % "".join(random.sample(string.ascii_letters + string.digits, 11))
    elif crawl_for == 'zhihu_answer':
        headers['Host'] = 'www.zhihu.com'
        headers['Cookie'] = 'd_c0="ACCCRm0uzAuPTn3djjdlWBFiQWJ0oQUIhpU=|1495460939"; _zap=3b7aeef8-23a0-4de9-a16b-5fece66e5498; q_c1=cb07a3b06a6e4efaa2b78015c6c2243f|1507801443000|1493628570000; r_cap_id="MjI5OTUyMTk2MzgyNDYwODg1N2RjNWE0ZTEzN2FlNDI=|1510280548|f6d71498966574559ce3f3a64ee848f9b148ffbe"; cap_id="ODJlZjNmOTg5YmQ0NDM0MWJjMDM1M2M0NjgzYWY0MmU=|1510280548|3bd34d6d0f9672659fbd3845ce08a78ca2fd634f"; z_c0=Mi4xaU1jRkFBQUFBQUFBSUlKR2JTN01DeGNBQUFCaEFsVk5jbHZ5V2dDd2Z2Sk1YZXZoVGNLUFRqcVFTY1ExMFhJNjhn|1510280562|94f746f7f48dab3490583fdc65f18ec4df358782; _xsrf=f6fff6c2cd1f55e60b61b29d098f8342; q_c1=cb07a3b06a6e4efaa2b78015c6c2243f|1510311164000|1493628570000; aliyungf_tc=AQAAAGpRUB+dlAoApSHeb6jmDrXePee1; __utma=155987696.1981921582.1510908314.1510908314.1510911826.2; __utmc=155987696; __utmz=155987696.1510908314.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); _xsrf=f6fff6c2cd1f55e60b61b29d098f8342'
    elif crawl_for == 'zhihu_question':
        headers['Host'] = 'www.zhihu.com'
        headers['Cookie'] = '_zap=6b9be63d-3724-40c4-9bd2-3e2c6c533472; d_c0="ACCCRm0uzAuPTn3djjdlWBFiQWJ0oQUIhpU=|1495460939"; _zap=3b7aeef8-23a0-4de9-a16b-5fece66e5498; q_c1=cb07a3b06a6e4efaa2b78015c6c2243f|1507801443000|1493628570000; r_cap_id="MjI5OTUyMTk2MzgyNDYwODg1N2RjNWE0ZTEzN2FlNDI=|1510280548|f6d71498966574559ce3f3a64ee848f9b148ffbe"; cap_id="ODJlZjNmOTg5YmQ0NDM0MWJjMDM1M2M0NjgzYWY0MmU=|1510280548|3bd34d6d0f9672659fbd3845ce08a78ca2fd634f"; z_c0=Mi4xaU1jRkFBQUFBQUFBSUlKR2JTN01DeGNBQUFCaEFsVk5jbHZ5V2dDd2Z2Sk1YZXZoVGNLUFRqcVFTY1ExMFhJNjhn|1510280562|94f746f7f48dab3490583fdc65f18ec4df358782; _xsrf=f6fff6c2cd1f55e60b61b29d098f8342; q_c1=cb07a3b06a6e4efaa2b78015c6c2243f|1510311164000|1493628570000; aliyungf_tc=AQAAAGpRUB+dlAoApSHeb6jmDrXePee1; __utma=155987696.1981921582.1510908314.1510911826.1510921296.3; __utmc=155987696; __utmz=155987696.1510908314.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); _xsrf=f6fff6c2cd1f55e60b61b29d098f8342'
    else:
        # No site-specific cookie configured for this value; fall back to the default headers.
        logging.error(u'---- unrecognized crawl_for value, using the default headers')
    return headers
def do_get(url, crawl_for="", is_json=False):
    """
    Without verify=False, the following exception is raised:
        File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 385, in send
            raise SSLError(e)
        requests.exceptions.SSLError: [Errno 1] _ssl.c:510: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
    From the maintainers' discussion, the cause is the old OpenSSL shipped with Ubuntu 14.04. See: https://github.com/PHPMailer/PHPMailer/issues/1022
    """
    headers = get_header(crawl_for)
    resp = requests.get(url, headers=headers, verify=False)
    # print resp.apparent_encoding
    # There is a good article on garbled text when crawling: http://liguangming.com/python-requests-ge-encoding-from-headers
    # real_encoding = requests.utils.get_encodings_from_content(resp.content)[0]
    #
    # content = resp.content.decode(real_encoding).encode('utf8')
    if is_json:
        return resp.json()
    soup = BeautifulSoup(resp.content, 'html.parser')
    return soup
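The proxy pool mentioned in the docstring isn't actually wired in yet. requests accepts a proxies dict, so a rough sketch could look like this (PROXY_POOL and get_with_proxy are placeholders of mine, not part of the module above; the real entries would come from the proxy-pool database):
import random
import requests

# Placeholder pool; the real entries would be loaded from the proxy-pool database.
PROXY_POOL = ['127.0.0.1:8118']

def get_with_proxy(url, headers=None):
    # Route the request through one randomly chosen proxy for both schemes.
    proxy = random.choice(PROXY_POOL)
    proxies = {'http': 'http://%s' % proxy, 'https': 'http://%s' % proxy}
    return requests.get(url, headers=headers, proxies=proxies, verify=False, timeout=10)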
The actual crawler script is below; it can already locate each book's title and content.
#! /usr/bin/python
# -*- coding: UTF-8 -*-
"""
Links to the Witcher 3 books.
The original page is here: http://witcher.wikia.com/wiki/Category:Books_in_the_games
That page shows the data structure more clearly; the URL used in the code is the paginated ajax endpoint, which is easier to iterate over and returns exactly the same structure.
Author: 萌萌哒小肥他爹
: https://www.jianshu.com/u/db796a501972
"""
from bs4 import BeautifulSoup
from crawler import crawl_helper
import time
witcher3_books_url_template = 'http://witcher.wikia.com/index.php?action=ajax&articleId=The+Witcher+3+books&method=axGetArticlesPage&rs=CategoryExhibitionAjax&page=%d'
test_url = witcher3_books_url_template % 1
g_domain = 'http://witcher.wikia.com'
# print(soup)
# main_books = soup.find_all('div', {'id': 'mw-pages'})[0]
def do_it():
    soup = crawl_helper.do_get(test_url, '', True)
    main_books = BeautifulSoup(soup['page'], 'html.parser')
    main_books = main_books.find_all('div', {'class': 'category-gallery-item'})
    for div in main_books:
        a_tag = div.find_all('a')[0]
        title = a_tag['title']
        book_url = g_domain + a_tag['href']
        print('---- book: %s, url: %s' % (title, book_url))
        time.sleep(1.17)
        find_book_detail(book_url)
def find_book_detail(book_url):
    """
    For the page layout, see for example: http://witcher.wikia.com/wiki/Hieronymus%27_notes
    :param book_url:
    :return:
    """
    book_html = crawl_helper.do_get(book_url, '', False)
    article_div = book_html.find_all('div', {'class': 'WikiaArticle'})[0]
    # The wiki sometimes wraps the text in dl and sometimes in p, ahem...
    content_tag_list = article_div.find_all('dl')
    if not content_tag_list:
        # find_all returns an empty list (never None) when nothing matches, so fall back to p tags.
        content_tag_list = article_div.find_all('p')
    for dl_tag in content_tag_list:
        print(dl_tag.text)
        # print(dl_tag.encode_contents())
if __name__ == '__main__':
    do_it()
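For now do_it() only pulls page 1 (test_url). Since the ajax URL already takes a page number, the whole category could be walked with a loop along these lines (just a sketch, under the assumption that a page past the end simply returns no category-gallery-item divs):
def do_all_pages():
    page = 1
    while True:
        page_json = crawl_helper.do_get(witcher3_books_url_template % page, '', True)
        items = BeautifulSoup(page_json['page'], 'html.parser').find_all('div', {'class': 'category-gallery-item'})
        if not items:
            # Assumed stop condition: an empty gallery means we are past the last page.
            break
        for div in items:
            a_tag = div.find_all('a')[0]
            find_book_detail(g_domain + a_tag['href'])
            time.sleep(1.17)
        page += 1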
That's about it for now; I'll implement the database part tomorrow. The comments in the code explain what each piece does, so no further explanation is needed.
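For the record, the storage step I have in mind is roughly this shape (only a sketch, assuming sqlite3 and a hypothetical books table; the final choice of database is still open):
import sqlite3

def save_book(title, url, content):
    # One row per book entry scraped from the wiki; the unique url keeps re-runs from inserting duplicates.
    conn = sqlite3.connect('witcher_books.db')
    conn.execute('CREATE TABLE IF NOT EXISTS books (title TEXT, url TEXT UNIQUE, content TEXT)')
    conn.execute('INSERT OR IGNORE INTO books (title, url, content) VALUES (?, ?, ?)', (title, url, content))
    conn.commit()
    conn.close()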
Going forward I'll post an installment whenever a major feature is done, both to push myself to finish and in the hope that some readers will enjoy it.