Preparation
System: Windows 7
Install MySQL
Tips:
During installation, choose the "Server only" setup option
If an installer screen has no Next button, you can drive it from the keyboard
Keyboard shortcuts
b - back, n - next, x - execute, f - finish, c - cancel
Finish the installation following the wizard, go into the installation directory, run `mysqld --initialize` to initialize, then use 'mysql -uroot -p' to enter the shell
Start the MySQL service with `net start mysql`; if it complains that the service name is invalid,
open cmd in the mysql/bin directory and run `mysqld --install`. You can also open the Services panel in Control Panel and start the MySQL service from there. It may take a few tries.
Install PyCharm
Activate PyCharm Professional
Crawling all articles on Jobbole (伯乐在线)
Install modules
scrapy, pymysql, pillow, pypiwin32
pymysql is the module used for inserting into the database
scrapy's built-in ImagesPipeline requires the pillow module
After creating the spider, running `scrapy crawl jobbole` on Windows raises an error unless pypiwin32 is installed
Spider project structure
items: the fields for the information the spider parses
contains the field names and the input/output processors
pipelines: the spider's pipelines, used to persist the parsed data
includes image storage, JSON file storage, and database storage
settings: all kinds of spider settings
includes whether to obey robots.txt, the download delay, the directory for downloaded images, the log file location, and which pipelines are enabled along with their priorities
spiders: the spider itself
the main crawling logic
Basic commands
# create the crawler project
scrapy startproject jobbole_article
# go into the spiders directory and generate the spider
scrapy genspider jobbole blog.jobbole.com
# run the spider
scrapy crawl jobbole
Final directory layout; right after the commands above the images folder does not exist yet
伯乐在线爬虫目录.png
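Roughly, the layout produced by the commands above looks like this (a sketch; the screenshot is not reproduced here, and images/ only appears once covers have been downloaded):
jobbole_article/
    scrapy.cfg
    jobbole_article/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            jobbole.py
        images/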
jobbole.py
# -*- coding: utf-8 -*-
import scrapy
from urllib import parse
from jobbole_article.items import ArticleItemLoader, JobboleArticleItem
from scrapy.http import Request
class JobboleSpider(scrapy.Spider):
name = 'jobbole'
allowed_domains = ['blog.jobbole.com']
start_urls = ['http://blog.jobbole.com/all-posts']
@staticmethod
def add_num(value):
return value if value else [0]
    def parse_detail(self, response):
response_url = response.url
front_image_url = response.meta.get('front_image_url', '')
item_loader = ArticleItemLoader(item=JobboleArticleItem(), response=response)
item_loader.add_xpath('title', "//div[@class='entry-header']/h1/text()")
item_loader.add_value('url', response_url)
item_loader.add_value('url_object_id', response_url)
item_loader.add_value('front_image_url', front_image_url)
item_loader.add_xpath('content', "//div[@class='entry']//text()")
# span_loader = loader.nested_path('//span[@class='href-style'])
        # upvotes
        item_loader.add_xpath('praise_nums', "//span[contains(@class,'vote-post-up')]/h10/text()", self.add_num)
        # comments
        item_loader.add_xpath('comment_nums', "//span[contains(@class, 'hide-on-480')]/text()", self.add_num)
        # favorites (bookmarks)
        item_loader.add_xpath('fav_nums', "//span[contains(@class, 'bookmark-btn')]/text()", self.add_num)
item_loader.add_xpath('tags', "//p[@class='entry-meta-hide-on-mobile']/a[not(@href='#article-comment')]/text()")
return item_loader.load_item()
def parse(self, response):
post_nodes = response.xpath("//div[@class='post floated-thumb']")
for post_node in post_nodes:
post_url = post_node.xpath(".//a[@title]/@href").extract_first("")
img_url = post_node.xpath(".//img/@src").extract_first("")
yield Request(url=parse.urljoin(response.url, post_url), meta={'front_image_url': img_url},
                          callback=self.parse_detail)
next_url = response.xpath('//a[@class="next page-numbers"]/@href').extract_first('')
if next_url:
yield Request(url=parse.urljoin(response.url, next_url), callback=self.parse)
Modules
from urllib import parse
This module is mainly used to complete partial URLs
url = parse.urljoin('http://blog.jobbole.com/', '10000')
# url becomes the joined 'http://blog.jobbole.com/10000'; if the second argument is already a full URL it is returned unchanged
from jobbole_article.items import ArticleItemLoader, JobboleArticleItem
These are the classes defined in items.py
from scrapy.http import Request
Constructs a scrapy request for a URL that needs to be followed.
The meta parameter is a dict. It is mainly used to pass extra values from the Request through to the response; they can be read with response.meta.get().
The callback parameter is the parse function to call once the requested page has been downloaded.
For example, on http://blog.jobbole.com/all-posts/ we need each article's content, so we construct a request for the URL the arrow points to in the image below. After it is downloaded, parse_detail is called to process it, and that handler can read img_url from the Request's front_image_url key.
The corresponding code:
yield Request(url=parse.urljoin(response.url, post_url), meta={'front_image_url': img_url},
              callback=self.parse_detail)
Request.png
The JobboleSpider class
This class inherits from scrapy.Spider; check the documentation for its other attributes.
@staticmethod
def add_num(value):
Can be skipped for now.
A static method of the class, used in the following line as an extra input processor; its job is to return a default value when the extracted field is empty:
item_loader.add_xpath('comment_nums', "//span[contains(@class, 'hide-on-480')]/text()", self.add_num)
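A standalone sketch of how add_num behaves with the list it receives (the example lists are made up):
def add_num(value):        # same logic as the static method above
    return value if value else [0]

print(add_num([]))         # [0]  -- a placeholder so get_num in items.py later yields 0
print(add_num(['12']))     # ['12']  -- non-empty lists pass through unchanged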
The custom parse_detail method
Purpose: parses an article detail page (for example http://blog.jobbole.com/114420/), extracts the field values, and returns the populated item.
Some of the variables explained
response_url is the URL of the response, e.g. http://blog.jobbole.com/114420/
front_image_url is the cover-image link carried over from http://blog.jobbole.com/all-posts
item_loader is an object with methods for populating the item; the common ones are add_xpath and add_value. Note that a populated value such as item['title'] is a list.
add_xpath
Parses the response with XPath. The first argument, e.g. 'title', is the item key (field name), the second is the XPath expression, and the third is an optional processor.
add_value
Assigns a value to a field directly.
load_item
Actually populates the item and returns it (see the sketch below).
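A minimal standalone sketch of add_value and load_item (DemoItem and its values are made up, not part of this project):
import scrapy
from scrapy.loader import ItemLoader

class DemoItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()

loader = ItemLoader(item=DemoItem())
loader.add_value('title', 'Hello')              # values are collected into a list
loader.add_value('url', 'http://example.com')
item = loader.load_item()                       # runs the output processors and fills the item
print(item['title'])                            # ['Hello'] -- a list unless TakeFirst() is the output processor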
JobboleSpider's built-in parse method
Purpose: like parse_detail it parses a response; the difference is that parse is the default callback the spider uses.
response.xpath
Applies an XPath expression and returns Selector objects; extract() returns a list of all matched text values, extract_first() returns just the first one
XPath rules
Some rules
Rules can be chained much like URLs are joined, but note that the second (relative) rule must start with '.':
post_nodes = response.xpath("//div[@class='post floated-thumb']")
for post_node in post_nodes:
# .//a[@title]/@href
post_url = post_node.xpath(".//a[@title]/@href").extract_first("")
img_url = post_node.xpath(".//img/@src").extract_first("")
- XPath excluding a given attribute value: "//div[not(@class='xx')]"
- XPath where an attribute contains a value: "//div[contains(@class, 'xx')]"
- @href extracts the value of the href attribute; text() extracts the element's text
- // selects descendants at any depth, / selects direct children only (see the sketch below)
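A small sketch of these rules using scrapy's Selector on a made-up HTML snippet:
from scrapy.selector import Selector

html = """
<div class="post floated-thumb">
  <a title="t1" href="/10000">read more</a>
  <span class="vote-post-up hover">8 likes</span>
</div>
"""
sel = Selector(text=html)
node = sel.xpath("//div[contains(@class, 'post')]")[0]         # contains(): class attribute contains 'post'
print(node.xpath(".//a[@title]/@href").extract_first(""))      # relative rule starts with '.': '/10000'
print(sel.xpath("//span[not(@class='xx')]/text()").extract())  # not(): exclude a given attribute value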
Debugging methods
You can test selectors in the browser's console, but there you have to write CSS rules
css规则浏览器.png
Testing with the scrapy shell command
scrapy shell http://blog.jobbole.com/all-posts
# then type expressions and inspect what they return
response.xpath("...").extract()
# fetch(url) downloads a different response into the shell
fetch('http://blog.jobbole.com/10000')
Or set breakpoints and run the spider from PyCharm to inspect values (see the sketch below)
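A common way to do that (a sketch, assuming a main.py placed next to scrapy.cfg): start the crawl from a small script so PyCharm breakpoints inside the spider are hit:
import os
import sys
from scrapy.cmdline import execute

# make the project importable, then run the equivalent of `scrapy crawl jobbole`
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
execute(['scrapy', 'crawl', 'jobbole'])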
items.py
import scrapy
import re
import hashlib
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join, Identity, Compose
def get_md5(value):
if isinstance(value, str):
value = value.encode(encoding='utf-8')
# print('value--------------------------', value)
m = hashlib.md5()
m.update(value)
return m.hexdigest()
def get_num(value):
# print(value)
if value:
        num = re.match(r".*?(\d+)", value)  # (\d+) rather than (\d+?) so multi-digit counts are captured in full
try:
# print("----------------",num.group(1), int(num.group(1)))
return int(num.group(1))
except (AttributeError, TypeError):
return 0
else:
return 0
# redundant
def return_num(value):
# return value[0] if value else 0
if value:
return value
else:
return "1"
class JobboleArticleItem(scrapy.Item):
# define the fields for your item here like:
title = scrapy.Field()
url = scrapy.Field()
url_object_id = scrapy.Field(
input_processor=MapCompose(get_md5)
)
front_image_url = scrapy.Field(
output_processor=Identity()
)
front_image_path = scrapy.Field()
content = scrapy.Field(
output_processor=Join()
)
praise_nums = scrapy.Field(
input_processor=MapCompose(get_num),
# output_processor=MapCompose(return_num)
)
fav_nums = scrapy.Field(
input_processor=MapCompose(get_num),
# output_processor=MapCompose(return_num)
# input_processor=Compose(get_num, stop_on_none=False)
)
comment_nums = scrapy.Field(
input_processor=MapCompose(get_num),
# output_processor=MapCompose(return_num)
# input_processor=Compose(get_num, stop_on_none=False)
)
tags = scrapy.Field(
output_processor=Join()
)
def get_insert_sql(self):
insert_sql = """
insert into jobbole(title, url, url_object_id, front_image_url, front_image_path,praise_nums, fav_nums,
comment_nums, tags, content)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
"""
params = (
self['title'], self['url'], self['url_object_id'], self['front_image_url'], self['front_image_path'],
self['praise_nums'], self['fav_nums'], self['comment_nums'], self['tags'], self['content']
)
return insert_sql, params
class ArticleItemLoader(ItemLoader):
default_output_processor = TakeFirst()
Modules
import re
The regular-expression module
# match() matches from the beginning of the string
num = re.match(".*(\d)", 'xx')
# if the match above fails, num is None and the next line raises AttributeError
num = num.group(1)
# also, int([]) raises TypeError
import hashlib
Converts a string into an md5 hex string; it must be encoded as utf-8 first, since scrapy's values are unicode strings
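For example, a small sketch of turning an article URL into a fixed-length key (the URL is just an illustration):
import hashlib

url = 'http://blog.jobbole.com/114420/'
print(hashlib.md5(url.encode('utf-8')).hexdigest())   # a 32-character hex string, used as url_object_id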
from scrapy.loader import ItemLoader
We subclass scrapy's ItemLoader to define the custom ArticleItemLoader.
from scrapy.loader.processors import TakeFirst, MapCompose, Join, Identity, Compose
A set of processor classes provided by scrapy: TakeFirst takes the first non-empty value of a list; MapCompose takes several functions, runs every value of the list through a function, collects the results into a new list, and feeds that list to the next function; Join joins the list into a single string; Identity returns it unchanged; Compose also takes several functions but, unlike MapCompose, passes the whole list to each function (see the sketch below)
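A quick sketch of how these processors behave, called directly on a made-up list of extracted values:
from scrapy.loader.processors import TakeFirst, MapCompose, Join, Identity, Compose

values = [' 10 ', ' 20 ']
print(TakeFirst()(values))                 # ' 10 '  (the first non-empty value)
print(MapCompose(str.strip, int)(values))  # [10, 20]  (each value through strip, then int)
print(Join(',')(values))                   # ' 10 , 20 '
print(Identity()(values))                  # [' 10 ', ' 20 ']  (unchanged)
print(Compose(len)(values))                # 2  (the whole list is passed to len)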
The get_num function
Since add_num is already applied in add_xpath() in jobbole.py, the if-check here is redundant.
The TypeError handling is also somewhat redundant; I was too lazy to change it:
def get_num(value):
    num = re.match(r".*?(\d+)", value)
try:
return int(num.group(1))
except AttributeError:
return 0
The JobboleArticleItem class
Defines the item's fields and their input/output processors; the two kinds of processor run at different times.
A point that confused me: the docs say an input processor is applied as soon as values are parsed, while the output processor runs only after the whole list has been collected, so putting Compose (which works on a whole list) in the output processor felt contradictory. It isn't: both processors receive an iterable of values; the input processor gets the values from each add_xpath/add_value call, and the output processor gets the entire collected list when load_item() is called, which is exactly what Compose needs.
Note that if a scrapy.Field() specifies an output_processor, the default_output_processor no longer applies to that field.
Also, the functions in MapCompose() are not applied to None values, and if the extracted list is empty they never run at all.
In the scrapy source you can see a for loop applying the function to every value of the list:
for v in values:
next_values += arg_to_iter(func(v))
The get_insert_sql method
Builds the SQL statement and parameters for inserting into MySQL; it is used later in pipelines.py
The ArticleItemLoader class
Gives every item field a default output processor
pipelines.py
import pymysql
from twisted.enterprise import adbapi
from scrapy.pipelines.images import ImagesPipeline
class JobboleArticlePipeline(object):
def process_item(self, item, spider):
return item
class JobboleMysqlPipeline(object):
def __init__(self, dbpool):
self.dbpool = dbpool
@classmethod
def from_settings(cls, settings):
params = dict(
host=settings['MYSQL_HOST'],
db=settings['MYSQL_DBNAME'],
user=settings['MYSQL_USER'],
passwd=settings['MYSQL_PASSWORD'],
charset='utf8',
cursorclass=pymysql.cursors.DictCursor,
use_unicode=True
)
dbpool = adbapi.ConnectionPool('pymysql', **params)
return cls(dbpool)
    def process_item(self, item, spider):
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)
        return item  # return the item so that later pipelines still receive it
def do_insert(self, cursor, item):
insert_sql, params = item.get_insert_sql()
cursor.execute(insert_sql, params)
def handle_error(self, failure, item, spider):
print(failure)
class ArticleImagePipeline(ImagesPipeline):
def item_completed(self, results, item, info):
        # note this check: front_image_url may be empty
if 'front_image_url' in item:
for _, value in results:
# print(value)
image_file_path = value['path']
item['front_image_path'] = image_file_path
return item
Modules
import pymysql
The module used to connect to and write into the database
import pymysql
# connect with pymysql
db = pymysql.connect(host='localhost', user='root', password='123456', database='jobbole')
# get a cursor with cursor()
cursor = db.cursor()
# the SQL insert statement
insert_sql = "insert into jobbole (column) values ('value')"
# execute the insert
try:
    cursor.execute(insert_sql)
    # commit the transaction
    db.commit()
except:
    # roll back on error
    db.rollback()
# close the connection
db.close()
from twisted.enterprise import adbapi
Asynchronous database access; I don't fully understand it yet and am just memorizing it for now (see the sketch below)
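A hedged sketch of what adbapi does: it wraps the blocking pymysql driver in Twisted's thread pool so queries run off the event loop, and runInteraction hands the function a cursor-like transaction object and commits automatically when it returns (the connection values and SQL below are only illustrative):
from twisted.enterprise import adbapi

dbpool = adbapi.ConnectionPool('pymysql', host='127.0.0.1', user='root',
                               passwd='123456', db='jobbole', charset='utf8')

def insert(cursor, title):
    # runs in a worker thread; committed automatically if no exception is raised
    cursor.execute("insert into jobbole(title) values (%s)", (title,))

d = dbpool.runInteraction(insert, 'some title')   # returns a Deferred immediately
d.addErrback(lambda failure: print(failure))      # failures are handled asynchronously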
from scrapy.pipelines.images import ImagesPipeline
scrapy's image-storage pipeline; the pillow module has to be installed for it
The JobboleArticlePipeline class
The pipeline class scrapy generates automatically
The JobboleMysqlPipeline class: a custom pipeline that writes to MySQL asynchronously
Its settings values are defined in settings.py
Creating the asynchronous MySQL connection pool:
dbpool = adbapi.ConnectionPool('pymysql', **params)
Creating the pipeline instance:
# from_settings calls __init__(dbpool) and returns the instance
return cls(dbpool)
process_item, the method in which the pipeline handles each item
# runs the insert asynchronously on Twisted's thread pool
# no explicit db.commit() is needed; runInteraction commits when do_insert returns
query = self.dbpool.runInteraction(self.do_insert, item)
# attach an error handler to the Deferred returned by runInteraction
# (process_item should also return the item so later pipelines receive it; added in the code above)
query.addErrback(self.handle_error, item, spider)
do_insert
The cursor argument is supplied by runInteraction, which takes a connection from the ConnectionPool and passes in a cursor-like transaction object (see the adbapi sketch above)
ArticleImagePipeline
item_completed
Parameters: results, item, info
Its main job here is to record front_image_path
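A hedged sketch of what results looks like (values are illustrative): a list of (success, value) tuples, one per requested image URL, which is why the loop above reads value['path']:
results = [
    (True, {'url': 'http://example.com/cover.jpg',                            # the requested image URL
            'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',      # path relative to IMAGES_STORE
            'checksum': 'b9628c4ab9b595f72f280b90c4fd093d'}),
]
image_paths = [value['path'] for ok, value in results if ok]   # keep only the successful downloads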
settings.py
General
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 1
MySQL settings
MYSQL_HOST = '127.0.0.1'
MYSQL_USER = 'root'
MYSQL_DBNAME = 'jobbole'
MYSQL_PASSWORD = '123456'
Enabling pipelines and their priorities
Lower numbers mean higher priority; the names correspond to the pipelines written in pipelines.py
ITEM_PIPELINES = {
# 'jobbole_article.pipelines.JobboleArticlePipeline': 300,
'jobbole_article.pipelines.ArticleImagePipeline': 1,
'jobbole_article.pipelines.JobboleMysqlPipeline': 2,
}
Image storage directory
import os
# the item field that holds the image download URLs
IMAGES_URLS_FIELD = 'front_image_url'
# parent directory for the images, which is also the directory containing settings.py (__file__ here is settings.py itself)
# abspath gives the absolute path, dirname the parent directory
image_dir = os.path.abspath(os.path.dirname(__file__))
# the folder where images are stored
IMAGES_STORE = os.path.join(image_dir, 'images')
MySQL commands needed
# list databases
show databases;
# list tables
show tables;
# create the database
create database jobbole;
# switch to the database
use jobbole;
# create the table (columns chosen to match the insert statement in items.py)
create table jobbole(
title varchar(200) not null,
url varchar(300) not null,
url_object_id varchar(50) primary key not null,
front_image_url varchar(200),
front_image_path varchar(200),
praise_nums int(11) not null,
fav_nums int(11) not null,
comment_nums int(11) not null,
tags varchar(200),
content longtext not null
);
# check the database character set
show variables like 'character_set_database';
# show the first record of the table
select * from jobbole limit 1;
# count the records in the table
select count(title) from jobbole;
# check the size of the table
use information_schema;
select concat(round(sum(DATA_LENGTH/1024/1024),2),'MB') as data from TABLES where table_schema='jobbole' and table_name='jobbole';
# empty the table
truncate table jobbole;
# drop a column
alter table <table_name> drop column <column_name>;
Problems
The first run stopped after only about 1300 articles; the exact reason is unclear
There are noticeably fewer cover images than records: the database holds 9000+ rows but only 6000+ images were saved
An empty cover-image URL raises an error (a workaround sketch follows after this list):
'fav_nums': 2,
'front_image_url': [''],
'praise_nums': 2,
'tags': '职场 产品经理 程序员 职场',
'title': '程序员眼里的 PM 有两种:有脑子的和没脑子的。后者占 90%',
'url': 'http://blog.jobbole.com/92328/',
'url_object_id': 'f74aa62b6a79fcf8f294173ab52f4459'}
Traceback (most recent call last):
File "g:\py3env\bole2\venv\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "g:\py3env\bole2\venv\lib\site-packages\scrapy\pipelines\media.py", line 79, in process_item
requests = arg_to_iter(self.get_media_requests(item, info))
File "g:\py3env\bole2\venv\lib\site-packages\scrapy\pipelines\images.py", line 155, in get_media_requests
return [Request(x) for x in item.get(self.images_urls_field, [])]
File "g:\py3env\bole2\venv\lib\site-packages\scrapy\pipelines\images.py", line 155, in
return [Request(x) for x in item.get(self.images_urls_field, [])]
File "g:\py3env\bole2\venv\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
self._set_url(url)
File "g:\py3env\bole2\venv\lib\site-packages\scrapy\http\request\__init__.py", line 62, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url:
If an article contains emoji, an encoding error occurs.
No log file was configured for the crawl
When the XPath in add_xpath() extracts an empty list, the MapCompose() input/output processors never run.
The workaround is to pass an extra processor directly as an add_xpath argument
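One hedged workaround for the empty cover-image URL problem (a sketch, not what this project currently does): override get_media_requests in the image pipeline and skip blank URLs, so Request() is never built from an empty string:
from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline

class SafeImagePipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for url in item.get('front_image_url', []):
            if url:                      # '' would raise "Missing scheme in request url"
                yield Request(url)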
Summary
If you can follow it, the English documentation beats machine-translated Chinese documentation
When something doesn't make sense, read the source code
Write the logs to a file
If some part is hard to understand in isolation, go through the whole thing and then come back to it
Logging in to Zhihu with selenium and crawling questions and answers
Encoding issues
Converting between str and bytes (binary) in Python
Because response.body in scrapy is bytes, it has to be converted to a string before writing it to a file
s = 'abc'
# errors can be 'strict' or 'ignore'
b = s.encode(encoding='utf8', errors='strict')
# bytes -> str
s = b.decode(encoding='utf8', errors='ignore')
Encoding pitfalls when writing bytes to a file
On Windows a newly created file defaults to gbk, so the Python interpreter tries to decode the network data with gbk, which often fails. Specify the encoding explicitly when opening the file.
with open('c:test.txt', 'w', encoding='utf8') as f:
f.write(response.body.decode('utf8', errors='ignore'))
Base64-encoded images
from PIL import Image
from io import BytesIO
import base64
img_src = "data:image/jpg;base64,R0lGODdh.."
img_src = img_src.split(',')[1]
img_data = base64.b64decode(img_src)   # decode (not encode) to get the raw image bytes
img = Image.open(BytesIO(img_data))
img.show()
Small crawler tricks
Constructing a response by hand
from scrapy.http import HtmlResponse
body = open("example.html").read()
response = HtmlResponse(url='http://example.com', body=body.encode('utf-8'))
Joining and following URLs in a spider
def parse(self, response):
yield {}
for url in response.xpath().extract():
yield scrapy.Request(url=response.urljoin(url), callback=self.parse)
    # further simplification: drop extract() in the for loop and response.urljoin
    # if the extracted url needs processing, use url.extract()?
for url in response.xpath():
yield response.follow(url, callback=self.parse)
Crawler logging
See the scrapy documentation on logging
Log levels are the same as Python's: debug, info, warning, error, critical
The Spider class has a built-in logger attribute
class ZhihuSpider(scrapy.Spider):
    def func(self):
        self.logger.warning('this is a log')
Outside a Spider class you can use:
import logging
logging.warning('this is a log')
# or create a separately named logger
logger = logging.getLogger('mycustomlogger')
logger.warning('this is a log')
You can also set the console log level and a log file in settings.py
LOG_FILE = 'dir'
LOG_LEVEL = logging.WARNING
# or on the command line
--logfile FILE
--loglevel LEVEL
Regex: matching strings that do not contain a substring
Note that (?=...) is a zero-width assertion (it does not consume characters)
s = 'sda'
re.match('s(?=d)$', s)  # no match: the lookahead does not consume 'd', so $ cannot match there
# the pattern below fails to match whenever 's' is followed by 'da'
re.match('s(?!da)', s)
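A fuller sketch of "does not contain", using a negative lookahead over the whole string:
import re

pattern = r'^(?!.*da).*$'              # match only strings that contain no 'da' anywhere
print(bool(re.match(pattern, 'sxa')))  # True
print(bool(re.match(pattern, 'sda')))  # False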
Using selenium
Download the browser driver
Chrome driver version 6.0 was used; with the latest version you get a "missing arguments granttype" error
selenium methods
from selenium import webdriver
driver = webdriver.Chrome(executable_path="path to the driver")
# driver.page_source is the page source
# selenium explicit waits
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as ec
# 10 is the timeout in seconds; the argument to until is a function that receives the driver and returns a truthy/falsy value
element = WebDriverWait(driver, 10).until(lambda x:x.find_element_by_xpath(
"//div[@class='SignContainer-switch']/span"))
# same as above, but using ec, selenium's built-in expected conditions
WebDriverWait(driver, 10).until(ec.text_to_be_present_in_element(
(By.XPATH, "//div[@class='SignContainer-switch']/span"), '注册'))
The complete crawler code
settings.py
# -*- coding: utf-8 -*-
import logging
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'zhihu'
SPIDER_MODULES = ['zhihu.spiders']
NEWSPIDER_MODULE = 'zhihu.spiders'
SQL_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"
MYSQL_HOST = '127.0.0.1'
MYSQL_DBNAME = 'zhihuSpider'
MYSQL_USER = 'root'
MYSQL_PASSWORD = '123456'
LOG_LEVEL = logging.WARNING
LOG_FILE = r'G:\py3env\bole2\zhihu\zhihu\zhihu_spider.log'  # raw string so backslashes are not treated as escapes
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
COOKIES_ENABLED = True
# Override the default request headers:
# required (no trailing comma, which would turn the value into a tuple)
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0"
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'zhihu.pipelines.ZhihuPipeline': 300,
}
zhihu_login.py
# -*- coding: utf-8 -*-
import scrapy
from zhihu.items import ZhihuQuestionItem, ZhihuAnswerItem, ZhihuItem
import re
import json
import datetime
from selenium import webdriver
# to make raw text parseable with scrapy selectors
#from scrapy.selector import Selector
# usage: Selector(text=driver.page_source).css(...).extract()
# for opening base64-encoded images
#import base64
#from io import BytesIO, StringIO
import logging
# modules for selenium explicit waits
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as ec
class ZhihuLoginSpider(scrapy.Spider):
name = 'zhihu_login'
allowed_domains = ['www.zhihu.com']
    # initial URL handled by start_requests
start_urls = ['https://www.zhihu.com/signup?next=%2F']
    # API endpoint for fetching a question's answers
start_answer_url = ["https://www.zhihu.com/api/v4/questions/{0}/answers?include=data[*].is_normal,admin_closed_comment,reward_info,is_collapsed,annotation_action,annotation_detail,collapse_reason,is_sticky,collapsed_by,suggest_edit,comment_count,can_comment,content,editable_content,voteup_count,reshipment_settings,comment_permission,created_time,updated_time,review_info,relevant_info,question,excerpt,relationship.is_authorized,is_author,voting,is_thanked,is_nothelp;data[*].mark_infos[*].url;data[*].author.follower_count,badge[*].topics&offset={2}&limit={1}&sort_by=default"]
def start_requests(self):
driver = webdriver.Chrome(executable_path='C:/Users/Administrator/Desktop/chromedriver.exe')
        # open the signup page
        driver.get(self.start_urls[0])
        # wait for the login switch element to appear, 10 second timeout
element = WebDriverWait(driver, 10).until(lambda x:x.find_element_by_xpath(
"//div[@class='SignContainer-switch']/span"))
        # click "log in"
element.click()
        # wait until the switch shows the text '注册' (register) after the click
WebDriverWait(driver, 10).until(ec.text_to_be_present_in_element(
(By.XPATH, "//div[@class='SignContainer-switch']/span"), '注册'))
        # type in the account name and password
        driver.find_element_by_css_selector("div.SignFlow-account input").send_keys("your account")
        driver.find_element_by_css_selector("div.SignFlow-password input").send_keys("your password")
driver.find_element_by_css_selector("button.SignFlow-submitButton").click()
        # wait for an element of the landing page to finish loading
WebDriverWait(driver, 10).until(lambda x:x.find_element_by_xpath(
"//div[@class='GlobalWrite-navTitle']"))
        # grab the cookies
Cookies = driver.get_cookies()
cookie_dict = {}
for cookie in Cookies:
cookie_dict[cookie['name']] = cookie['value']
        # close the driver
driver.close()
return [scrapy.Request('https://www.zhihu.com/', cookies=cookie_dict, callback=self.parse)]
def parse(self, response):
        # collect all links on the page
all_urls = response.css("a::attr(href)").extract()
for url in all_urls:
            # the regex does not match URLs like https://www.zhihu.com/question/13413413/log
match_obj = re.match('.*zhihu.com/question/(\d+)(/|$)(?!log)', url)
if match_obj:
yield scrapy.Request(response.urljoin(url), callback=self.parse_question)
else:
yield scrapy.Request(response.urljoin(url), callback=self.parse)
def parse_question(self, response):
if "QuestionHeader-title" in response.text:
match_obj = re.match(".*zhihu.com/question/(\d+)(/|$)", response.url)
self.logger.warning('Parse function called on {}'.format(response.url))
if match_obj:
self.logger.warning('zhihu id is {}'.format(match_obj.group(1)))
question_id = int(match_obj.group(1))
item_loader = ZhihuItem(item=ZhihuQuestionItem(), response=response)
                # ::text with no preceding space selects the text of direct children only
item_loader.add_css("title", "h1.QuestionHeader-title::text")
item_loader.add_css("content", ".QuestionHeader-detail ::text")
item_loader.add_value("url", response.url)
item_loader.add_value("zhihu_id", question_id)
                # the CSS rule for answer_num differs depending on whether "view all answers" was clicked
                # so both CSS rules are added here
item_loader.add_css("answer_num", "h4.List-headerText span ::text")
item_loader.add_css("answer_num", "a.QuestionMainAction::text")
item_loader.add_css("comments_num", "div.QuestionHeader-Comment button::text")
item_loader.add_css("watch_user_num", "strong.NumberBoard-itemValue::text")
item_loader.add_css("topics", ".QuestionHeader-topics ::text")
item_loader.add_value("crawl_time", datetime.datetime.now())
question_item = item_loader.load_item()
"""没用
else:
match_obj = re.match(".*zhihu.com/question/(\d+)(/|$)", response.url)
if match_obj:
question_id = int(match_obj.group(1))
item_loader = ItemLoader(item=ZhihuQuestionItem(), response=response)
item_loader.add_css("title",
"//*[id='zh-question-title']/h2/a/text()|//*[@id='zh-question-title']/h2/span/text()")
item_loader.add_css("content", ".QuestionHeader-detail")
item_loader.add_value("url", response.url)
item_loader.add_value("zhihu_id", question_id)
item_loader.add_css("answer_num", "#zh-question-answer-num::text")
item_loader.add_css("comment_num", "#zh-question-meta-wrap a[name='addcomment']::text")
item_loader.add_css("watch_user_num", "//*[@id='zh-question-side-header-wrap']/text()|"
"//*[@class='zh-question-followers-sidebar]/div/a/strong/text()")
item_loader.add_css("topics", ".zm-tag-editor-labels a::text")
question_item = item_loader.load_item()
"""
                # format(*args, **kwargs)
                # print("{1} {degree} {0}".format("happy", "today", degree="very"))
                # -> "today very happy"
yield scrapy.Request(self.start_answer_url[0].format(question_id, 20, 0),
callback=self.parse_answer)
yield question_item
def parse_answer(self, response):
        # the endpoint returns a JSON string; convert it to a dict
ans_json = json.loads(response.text)
is_end = ans_json["paging"]['is_end']
next_url = ans_json["paging"]["next"]
for answer in ans_json["data"]:
            # assigning to the item directly is simpler, but then processors cannot be used
answer_item =ZhihuAnswerItem()
answer_item["zhihu_id"] = answer["id"]
answer_item["url"] = answer["url"]
answer_item["question_id"] = answer["question"]["id"]
answer_item["author_id"] = answer["author"]["id"] if "id" in answer["author"] else None
answer_item["content"] = answer["content"] if "content" in answer else None
answer_item["parise_num"] = answer["voteup_count"]
answer_item["comments_num"] = answer["comment_count"]
answer_item["create_time"] = answer["created_time"]
answer_item["update_time"] = answer["updated_time"]
answer_item["crawl_time"] = datetime.datetime.now()
yield answer_item
if not is_end:
yield scrapy.Request(next_url, callback=self.parse_answer)
The screenshots below show how to find the answers API in the browser's network panel
图片.png
图片.png
图片.png
pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql
from twisted.enterprise import adbapi
class ZhihuPipeline(object):
def __init__(self, dbpool):
self.dbpool = dbpool
    def process_item(self, item, spider):
        query = self.dbpool.runInteraction(self.do_insert_sql, item)
        query.addErrback(self.handle_error, item, spider)
        return item  # return the item so that any later pipeline still receives it
def do_insert_sql(self, cursor, item):
insert_sql, params = item.get_insert_sql()
cursor.execute(insert_sql, params)
def handle_error(self, failure, item, spider):
print(failure)
@classmethod
def from_settings(cls, settings):
params = dict(
host=settings['MYSQL_HOST'],
db=settings['MYSQL_DBNAME'],
user=settings['MYSQL_USER'],
passwd=settings['MYSQL_PASSWORD'],
charset='utf8',
cursorclass=pymysql.cursors.DictCursor,
use_unicode=True,
)
dbpool = adbapi.ConnectionPool("pymysql", **params)
return cls(dbpool)
items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import logging
import datetime
import re
import scrapy
from scrapy.loader.processors import TakeFirst, Join, Compose, MapCompose
from scrapy.loader import ItemLoader
# extract the number from the follower-count / answer-count / comment-count text
def extract_num(value):
    # log the raw value for debugging
logging.warning('this is function extract_num value:{}'.format(value))
for val in value:
if val is not None:
            # strip the thousands separators from the number
val = ''.join(val.split(','))
match_obj = re.match(".*?(\d+)", val)
if match_obj:
logging.warning('this is one of value:{}'.format(match_obj.group(1)))
return int(match_obj.group(1))
break
# subclass ItemLoader to set a default output processor
class ZhihuItem(ItemLoader):
    # take the first element of the list
default_output_processor = TakeFirst()
class ZhihuQuestionItem(scrapy.Item):
topics = scrapy.Field(
        # join the topics into one string
output_processor=Join(',')
)
url = scrapy.Field()
title = scrapy.Field()
content = scrapy.Field()
answer_num = scrapy.Field(
        # extract the number
output_processor=Compose(extract_num)
)
comments_num = scrapy.Field(
output_processor=Compose(extract_num)
)
    # number of followers
watch_user_num = scrapy.Field(
output_processor=Compose(extract_num)
)
zhihu_id = scrapy.Field()
crawl_time = scrapy.Field()
def get_insert_sql(self):
# on duplicate key update col_name=value(col_name)
insert_sql = """
insert into zhihu_question(zhihu_id, topics, url, title, content, answer_num, comments_num,
watch_user_num, crawl_time
)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)
on duplicate key update content=values(content), answer_num=values(answer_num), comments_num=values(
comments_num), watch_user_num=values(watch_user_num)
"""
# [Failure instance: Traceback: : Use item['crawl_time'] = '2018-10-29 19:16:24' to set field value
# self.crawl_time = datetime.datetime.now()
        # use get() to handle keys that may be missing
        # the value returned by datetime.datetime.now() can be inserted into the database directly
params = (self.get('zhihu_id'), self.get('topics','null'), self.get('url'), self.get('title'), self.get('content','null'), self.get('answer_num',0), self.get('comments_num',0),
self.get('watch_user_num',0), self.get('crawl_time'))
return insert_sql, params
class ZhihuAnswerItem(scrapy.Item):
zhihu_id = scrapy.Field()
url = scrapy.Field()
question_id = scrapy.Field()
author_id = scrapy.Field()
content = scrapy.Field()
    # upvotes
parise_num = scrapy.Field()
comments_num = scrapy.Field()
    # creation time
create_time = scrapy.Field()
update_time = scrapy.Field()
crawl_time = scrapy.Field()
def get_insert_sql(self):
insert_sql = """
insert into zhihu_answer(zhihu_id, url, question_id, author_id, content, parise_num, comments_num,
create_time, update_time, crawl_time
)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
on duplicate key update content=values(content), comments_num=values(comments_num), parise_num=values(
parise_num), update_time=values(update_time)
"""
        # fromtimestamp converts the unix timestamp into a datetime object
params = (
self.get("zhihu_id"), self.get("url"), self.get("question_id"), self.get("author_id"), self.get("content"), self.get("parise_num", 0),
self.get("commennts_num", 0), datetime.datetime.fromtimestamp(self.get("create_time")), datetime.datetime.fromtimestamp(self.get("update_time")), self.get("crawl_time"),
)
return insert_sql, params
Summary
Some problems come up again and again
Write the program step by step, and remember to add comments
Note down useful links as you go
Zhihu logins that don't use selenium have all stopped working; if you know of a way to log in without selenium, please let me know