Getting Started with Scrapy

Preparation

System: Windows 7

Install MySQL

Tips:

During installation, choose the "Server only" setup type.

If an installer screen has no clickable Next button, you can drive it with the keyboard.

Keyboard shortcuts

b = back, n = next, x = execute, f = finish, c = cancel

Follow the installer to the end, then go into the install directory, run mysqld --initialize to initialize the data directory, and enter the shell with mysql -uroot -p.

Start the MySQL service with net start mysql. If Windows says the service name is invalid, open cmd in the mysql/bin directory, run mysqld --install, and then start the MySQL service from the Services panel in Control Panel. It may take a few tries.

Install PyCharm

Activate PyCharm's professional (paid) features.

Crawling all articles on Jobbole (伯乐在线)

Install the required packages

scrapy, pymysql, pillow, pypiwin32

pymysql is the module used to insert data into MySQL.

Scrapy's built-in ImagesPipeline requires the pillow module.

On Windows, running scrapy crawl jobbole after creating the spider raises an error unless pypiwin32 is installed.

Spider structure

items: the fields for the data the spider extracts

Defines the field names and configures their input/output processors.

pipelines: the item pipelines, which persist the parsed data

Covers image storage, JSON file export, and database storage.

settings: all the project-level configuration

Includes whether to obey robots.txt, the download delay, the directory images are downloaded to, the log file location, and which pipelines are enabled and with what priority.

spiders: the spider itself

Contains the main crawling logic.

Basic commands

# create the project
scrapy startproject jobbole_article

# inside the spiders directory, generate a spider
scrapy genspider jobbole blog.jobbole.com

# run the spider
scrapy crawl jobbole

The final directory layout (right after the commands above the images folder does not exist yet):

伯乐在线爬虫目录.png

jobbole.py

# -*- coding: utf-8 -*-
import scrapy
from urllib import parse
from scrapy.http import Request

from jobbole_article.items import ArticleItemLoader, JobboleArticleItem


class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/all-posts']

    @staticmethod
    def add_num(value):
        # return a default of [0] when the extracted value is empty
        return value if value else [0]

    def parse_detail(self, response):
        response_url = response.url
        front_image_url = response.meta.get('front_image_url', '')
        item_loader = ArticleItemLoader(item=JobboleArticleItem(), response=response)
        item_loader.add_xpath('title', "//div[@class='entry-header']/h1/text()")
        item_loader.add_value('url', response_url)
        item_loader.add_value('url_object_id', response_url)
        item_loader.add_value('front_image_url', front_image_url)
        item_loader.add_xpath('content', "//div[@class='entry']//text()")
        # span_loader = item_loader.nested_xpath("//span[@class='href-style']")
        # upvotes
        item_loader.add_xpath('praise_nums', "//span[contains(@class,'vote-post-up')]/h10/text()", self.add_num)
        # comments
        item_loader.add_xpath('comment_nums', "//span[contains(@class, 'hide-on-480')]/text()", self.add_num)
        # bookmarks
        item_loader.add_xpath('fav_nums', "//span[contains(@class, 'bookmark-btn')]/text()", self.add_num)
        item_loader.add_xpath('tags', "//p[@class='entry-meta-hide-on-mobile']/a[not(@href='#article-comment')]/text()")
        return item_loader.load_item()

    def parse(self, response):
        post_nodes = response.xpath("//div[@class='post floated-thumb']")
        for post_node in post_nodes:
            post_url = post_node.xpath(".//a[@title]/@href").extract_first("")
            img_url = post_node.xpath(".//img/@src").extract_first("")
            yield Request(url=parse.urljoin(response.url, post_url), meta={'front_image_url': img_url},
                          callback=self.parse_detail)
        next_url = response.xpath('//a[@class="next page-numbers"]/@href').extract_first('')
        if next_url:
            yield Request(url=parse.urljoin(response.url, next_url), callback=self.parse)

Modules

from urllib import parse

This module is mainly used to complete partial URLs.

url = parse.urljoin('http://blog.jobbole.com/', '10000')
# url becomes the joined 'http://blog.jobbole.com/10000';
# if the second argument is already a complete URL, nothing is joined

from jobbole_article.items import ArticleItemLoader, JobboleArticleItem

These are classes defined in items.py.

from scrapy.http import Request

Builds a Scrapy request for a URL that should be followed.

The meta argument is a dict. It is mainly used to pass extra data from the Request along to the response; it can be read later with response.meta.get().

The callback argument is the parse function to call once the requested content has been downloaded.

For example, to get the article content from http://blog.jobbole.com/all-posts/, build a request for the URL the arrow points at in the screenshot below. When the page has been downloaded, parse_detail is called to handle it, and that handler can read img_url back via the Request's front_image_url meta key.

The corresponding code:

yield Request(url=parse.urljoin(response.url, post_url), meta={'front_image_url': img_url},
              callback=self.parse_detail)

Request.png

The JobboleSpider class

The class inherits from scrapy.Spider; see the documentation for its other attributes.

@staticmethod
def add_num(value):

You can ignore this for now. It is a static method of the class, used in the line below as an input processor. Its main job is to return a default value when the extracted field is empty.

item_loader.add_xpath('comment_nums', "//span[contains(@class, 'hide-on-480')]/text()", self.add_num)

The custom parse_detail method

Purpose: parses an article detail page such as http://blog.jobbole.com/114420/, extracts the field values, and returns the populated item.

Some of the variables:

response_url is the URL of the response, e.g. http://blog.jobbole.com/114420/.

front_image_url is the cover-image URL taken from the http://blog.jobbole.com/all-posts listing page.

item_loader is an instance with methods for populating the item; the commonly used ones are add_xpath and add_value. Note that the values collected for a field such as item['title'] form a list.

add_xpath

Parses the response with XPath. The first argument (e.g. 'title') is the item key, i.e. the field; the second is the XPath expression; the third is an optional processor.

add_value

Assigns a value directly.

load_item

Performs the actual population and returns the item.
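A minimal sketch of that loader cycle outside a spider, assuming Scrapy is installed; DemoItem, DemoLoader and the HTML snippet are made up for illustration:

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst
from scrapy.selector import Selector

class DemoItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()

class DemoLoader(ItemLoader):
    default_output_processor = TakeFirst()

selector = Selector(text="<h1 class='entry-header'>Hello Scrapy</h1>")
loader = DemoLoader(item=DemoItem(), selector=selector)
loader.add_xpath('title', "//h1[@class='entry-header']/text()")  # collected as ['Hello Scrapy']
loader.add_value('url', 'http://example.com/1')                  # collected as ['http://example.com/1']
item = loader.load_item()  # output processors run here; TakeFirst() turns each list into a single value
print(item['title'])       # Hello Scrapy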

JobboleSpider's built-in parse method

Purpose: like parse_detail it parses a response, but parse is the default callback the spider invokes (for the start URLs).

response.xpath

Takes an XPath expression and returns Selector objects; use extract() to get the list of all matched text values, or extract_first() to get just the first one.
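A quick sketch of the difference, using a made-up HTML snippet:

from scrapy.selector import Selector

sel = Selector(text="<ul><li>a</li><li>b</li></ul>")
sel.xpath("//li/text()").extract()         # ['a', 'b']
sel.xpath("//li/text()").extract_first()   # 'a'
sel.xpath("//p/text()").extract_first('')  # no match, so the default '' is returned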

XPath rules

A few rules

Expressions can be chained the way URLs are joined, but note that the second expression must start with a dot to make it relative to the node:

post_nodes = response.xpath("//div[@class='post floated-thumb']")
for post_node in post_nodes:
    # .//a[@title]/@href
    post_url = post_node.xpath(".//a[@title]/@href").extract_first("")
    img_url = post_node.xpath(".//img/@src").extract_first("")

- An element whose attribute does not have a given value: "//div[not(@class='xx')]"
- An element whose attribute contains a value: "//div[contains(@class, 'xx')]"
- @href extracts the value of the href attribute; text() extracts the element's text
- // selects descendants at any depth; / selects direct children only

Debugging

You can test selectors in the browser's dev tools, but there you have to write CSS rules.

css规则浏览器.png

Or test with the scrapy shell command:

scrapy shell http://blog.jobbole.com/all-posts
# then type an expression and inspect what comes back
response.xpath("...").extract()
# fetch(url) replaces the downloaded response
fetch('http://blog.jobbole.com/10000')

Or set a breakpoint and run the spider from PyCharm to inspect things.

items.py

import scrapy
import re
import hashlib
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join, Identity, Compose


def get_md5(value):
    # md5 needs bytes, so encode str values as utf-8 first
    if isinstance(value, str):
        value = value.encode(encoding='utf-8')
    m = hashlib.md5()
    m.update(value)
    return m.hexdigest()


def get_num(value):
    if value:
        # use a greedy \d+ so multi-digit counts are captured in full
        num = re.match(r".*?(\d+)", value)
        try:
            return int(num.group(1))
        except (AttributeError, TypeError):
            return 0
    else:
        return 0


# redundant
def return_num(value):
    # return value[0] if value else 0
    if value:
        return value
    else:
        return "1"


class JobboleArticleItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    url = scrapy.Field()
    url_object_id = scrapy.Field(
        input_processor=MapCompose(get_md5)
    )
    front_image_url = scrapy.Field(
        output_processor=Identity()
    )
    front_image_path = scrapy.Field()
    content = scrapy.Field(
        output_processor=Join()
    )
    praise_nums = scrapy.Field(
        input_processor=MapCompose(get_num),
        # output_processor=MapCompose(return_num)
    )
    fav_nums = scrapy.Field(
        input_processor=MapCompose(get_num),
        # output_processor=MapCompose(return_num)
        # input_processor=Compose(get_num, stop_on_none=False)
    )
    comment_nums = scrapy.Field(
        input_processor=MapCompose(get_num),
        # output_processor=MapCompose(return_num)
        # input_processor=Compose(get_num, stop_on_none=False)
    )
    tags = scrapy.Field(
        output_processor=Join()
    )

    def get_insert_sql(self):
        insert_sql = """
            insert into jobbole(title, url, url_object_id, front_image_url, front_image_path, praise_nums,
                fav_nums, comment_nums, tags, content)
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
        """
        params = (
            self['title'], self['url'], self['url_object_id'], self['front_image_url'], self['front_image_path'],
            self['praise_nums'], self['fav_nums'], self['comment_nums'], self['tags'], self['content']
        )
        return insert_sql, params


class ArticleItemLoader(ItemLoader):
    default_output_processor = TakeFirst()

模块

`import re'

正则匹配模块

# match函数是从字符串开始处匹配

num = re.match(".*(\d)", 'xx')

# 如果上面没有匹配成功, 会出现AttributeError

num = num.group(1)

# 另外int([])会出现TypeError

import hashlib

将字符串转换为 md5字符串, 必须经过utf-8编码.scrapy的值都是unicode编码
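For example (a small sketch; the URL is just an example), hashing an article URL gives a fixed-length key that can be used as the primary key:

import hashlib

url_object_id = hashlib.md5('http://blog.jobbole.com/114420/'.encode('utf-8')).hexdigest()
print(len(url_object_id))  # 32 -- an md5 hexdigest is always 32 hex characters, hence varchar(50) in the table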

from scrapy.loader import ItemLoader

The custom ArticleItemLoader subclasses Scrapy's ItemLoader.

from scrapy.loader.processors import TakeFirst, MapCompose, Join, Identity, Compose

A set of processor classes that Scrapy provides. TakeFirst returns the first non-empty value of a list. MapCompose takes several functions and runs every value of the list through them, collecting the results into a new list that is fed to the next function. Join joins the list into a single string. Identity returns the input unchanged. Compose also takes several functions but, unlike MapCompose, passes the whole list to each function.
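A quick sketch of what each processor does when called directly (the values are made up):

from scrapy.loader.processors import TakeFirst, MapCompose, Join, Identity, Compose

values = ['  2 comments', '10 likes']

TakeFirst()(['', None, 'first'])   # 'first' -- the first value that is neither None nor ''
MapCompose(str.strip)(values)      # ['2 comments', '10 likes'] -- each value goes through the functions
Join(',')(values)                  # '  2 comments,10 likes'
Identity()(values)                 # returned unchanged
Compose(len)(values)               # 2 -- Compose hands the whole list to len()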

The get_num function

Since add_num is already passed to add_xpath() in jobbole.py, the if check here is redundant.

Catching TypeError is also somewhat redundant; I haven't bothered to change it.

def get_num(value):
    num = re.match(r".*?(\d+)", value)
    try:
        return int(num.group(1))
    except AttributeError:
        return 0

The JobboleArticleItem class

Defines the item's fields together with their input and output processors; the two kinds of processor run at different times.

One thing puzzled me here at first: the docs say an input processor runs as soon as a value is extracted, while an output processor runs once the whole list has been collected. If I put Compose in the output processor, doesn't Compose operate on the whole list? In fact there is no contradiction: the input processor runs on each add_xpath/add_value call as values are collected, and the output processor runs once on the complete collected list when load_item() is called, so a Compose output processor really does receive the entire list.

Note that if a scrapy.Field() declares its own output_processor, the default_output_processor no longer applies to that field.

Also, the functions inside MapCompose() are never given empty input: if the extracted value is an empty list, they simply don't run.

In the Scrapy source you can see a for loop that applies each function to the values of the list:

for v in values:
    next_values += arg_to_iter(func(v))

The get_insert_sql method

Builds the MySQL insert statement and its parameters; it is used later in pipelines.py.

The ArticleItemLoader class

Gives every item field a default output processor.

pipelines.py

import pymysql
from twisted.enterprise import adbapi
from scrapy.pipelines.images import ImagesPipeline


class JobboleArticlePipeline(object):
    def process_item(self, item, spider):
        return item


class JobboleMysqlPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        params = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWORD'],
            charset='utf8',
            cursorclass=pymysql.cursors.DictCursor,
            use_unicode=True
        )
        dbpool = adbapi.ConnectionPool('pymysql', **params)
        return cls(dbpool)

    def process_item(self, item, spider):
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)
        return item  # return the item so any later pipelines still receive it

    def do_insert(self, cursor, item):
        insert_sql, params = item.get_insert_sql()
        cursor.execute(insert_sql, params)

    def handle_error(self, failure, item, spider):
        print(failure)


class ArticleImagePipeline(ImagesPipeline):
    def item_completed(self, results, item, info):
        # note this check: front_image_url may be empty
        if 'front_image_url' in item:
            for _, value in results:
                image_file_path = value['path']
                item['front_image_path'] = image_file_path
        return item

Modules

import pymysql

The module for connecting to MySQL and writing to it.

import pymysql

# connect
db = pymysql.connect('localhost', 'root', '123456', 'jobbole')
# get a cursor with cursor()
cursor = db.cursor()
# the insert statement
insert_sql = "insert into jobbole (column) values ('value')"
# run the insert
try:
    cursor.execute(insert_sql)
    # commit the transaction
    db.commit()
except:
    # roll back on error (rollback is a method of the connection, not of the cursor)
    db.rollback()
# close the connection
db.close()

from twisted.enterprise import adbapi

Asynchronous database access. I don't fully understand it yet, so for now I'm just memorizing the pattern.
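Roughly, adbapi.ConnectionPool wraps a DB-API module (pymysql here) in a thread pool so queries don't block the Twisted reactor that Scrapy runs on. runInteraction runs a function in a worker thread, hands it a cursor-like transaction object, commits automatically if the function returns normally and rolls back if it raises. A minimal sketch outside Scrapy; the connection parameters and the demo table are placeholders:

import pymysql
from twisted.enterprise import adbapi
from twisted.internet import reactor

# connection parameters and the demo table are placeholders -- adjust for your setup
dbpool = adbapi.ConnectionPool('pymysql', host='127.0.0.1', user='root',
                               passwd='123456', db='jobbole',
                               cursorclass=pymysql.cursors.DictCursor)

def insert_row(cursor, title):
    # runInteraction supplies this cursor-like transaction object and
    # commits automatically when the function returns without raising
    cursor.execute("insert into demo(title) values (%s)", (title,))

def main():
    d = dbpool.runInteraction(insert_row, 'hello')  # returns a Deferred
    d.addErrback(lambda failure: print(failure))    # failures arrive here, already rolled back
    d.addBoth(lambda _: reactor.stop())

reactor.callWhenRunning(main)
reactor.run()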

from scrapy.pipelines.images import ImagesPipeline

Scrapy's image-storage pipeline; it needs the pillow module installed.

The JobboleArticlePipeline class

The pipeline class generated automatically by the project template.

The JobboleMysqlPipeline class: a custom pipeline that writes to MySQL asynchronously

Its settings are read from settings.py.

Create the asynchronous MySQL connection pool:

dbpool = adbapi.ConnectionPool('pymysql', **params)

Create the instance:

# calls __init__(dbpool) and returns the instance
return cls(dbpool)

process_item, the method the pipeline uses to handle each item:

# the insert runs asynchronously in a worker thread; no explicit db.commit() is needed,
# runInteraction commits for us
query = self.dbpool.runInteraction(self.do_insert, item)
# register an error handler on the Deferred that runInteraction returns
query.addErrback(self.handle_error, item, spider)
# process_item should also return the item so later pipelines still receive it
return item

do_insert

The cursor argument is supplied by the ConnectionPool: runInteraction calls do_insert with a cursor-like transaction object (see the adbapi sketch above).

ArticleImagePipeline

item_completed

Parameters: results, item, info.

Its main job here is to record front_image_path, the path where the downloaded cover image was saved.
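For reference, results is a list of (success, info) tuples, one per image request; when success is True, info is a dict containing the image url, the storage path relative to IMAGES_STORE (Scrapy saves the files under a full/ subdirectory named after the SHA1 hash of the URL), and a checksum. A sketch of the kind of value item_completed receives (the URL and hashes below are illustrative, not real output):

results = [
    (True, {
        'url': 'http://example.com/cover.jpg',                        # made-up URL
        'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',  # relative to IMAGES_STORE
        'checksum': 'b9628c4ab9b595f72f280b90c4fd093d',               # md5 of the downloaded file
    }),
]

for ok, info in results:
    if ok:
        print(info['path'])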

settings.py

General

ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 1

MySQL settings

MYSQL_HOST = '127.0.0.1'
MYSQL_USER = 'root'
MYSQL_DBNAME = 'jobbole'
MYSQL_PASSWORD = '123456'

Enabling pipelines and their priority

Lower numbers run first; the entries correspond to the pipelines written in pipelines.py.

ITEM_PIPELINES = {
    # 'jobbole_article.pipelines.JobboleArticlePipeline': 300,
    'jobbole_article.pipelines.ArticleImagePipeline': 1,
    'jobbole_article.pipelines.JobboleMysqlPipeline': 2,
}

Image storage directory

import os

# the item field that holds the image download URLs
IMAGES_URLS_FIELD = 'front_image_url'
# parent directory for the images; __file__ here is settings.py itself,
# dirname gives its directory and abspath makes that absolute
image_dir = os.path.abspath(os.path.dirname(__file__))
# the folder the images are stored in
IMAGES_STORE = os.path.join(image_dir, 'images')

MySQL commands you'll need

# list the databases
show databases;

# list the tables
show tables;

# create the database
create database jobbole;

# switch to it
use jobbole;

# create the table (the columns must match the insert statement in items.py)
create table jobbole(
    title varchar(200) not null,
    url varchar(300) not null,
    url_object_id varchar(50) primary key not null,
    front_image_url varchar(200),
    front_image_path varchar(200),
    praise_nums int(11) not null,
    fav_nums int(11) not null,
    comment_nums int(11) not null,
    tags varchar(200),
    content longtext not null
);

# check the database character set
show variables like 'character_set_database';

# show the first row of the table
select * from jobbole limit 1;

# count the rows in the table
select count(title) from jobbole;

# check the size of the table
use information_schema;
select concat(round(sum(DATA_LENGTH/1024/1024),2),'MB') as data from TABLES
    where table_schema='jobbole' and table_name='jobbole';

# empty the table
truncate table jobbole;

# drop a column
alter table drop column ;

Problems

The first run stopped after only about 1,300 articles and I'm not sure exactly why.

There are noticeably fewer cover images than records: the database has 9,000+ rows but only 6,000+ images were saved.

An empty cover-image URL raises an error (item dump and traceback below; a possible workaround sketch follows the traceback):

{'fav_nums': 2,
 'front_image_url': [''],
 'praise_nums': 2,
 'tags': '职场 产品经理 程序员 职场',
 'title': '程序员眼里的 PM 有两种:有脑子的和没脑子的。后者占 90%',
 'url': 'http://blog.jobbole.com/92328/',
 'url_object_id': 'f74aa62b6a79fcf8f294173ab52f4459'}

Traceback (most recent call last):
  File "g:\py3env\bole2\venv\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "g:\py3env\bole2\venv\lib\site-packages\scrapy\pipelines\media.py", line 79, in process_item
    requests = arg_to_iter(self.get_media_requests(item, info))
  File "g:\py3env\bole2\venv\lib\site-packages\scrapy\pipelines\images.py", line 155, in get_media_requests
    return [Request(x) for x in item.get(self.images_urls_field, [])]
  File "g:\py3env\bole2\venv\lib\site-packages\scrapy\pipelines\images.py", line 155, in
    return [Request(x) for x in item.get(self.images_urls_field, [])]
  File "g:\py3env\bole2\venv\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
    self._set_url(url)
  File "g:\py3env\bole2\venv\lib\site-packages\scrapy\http\request\__init__.py", line 62, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url:
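One possible workaround (a sketch, not what the original code does) is to override get_media_requests in ArticleImagePipeline so that empty URLs are skipped before Scrapy tries to build a Request for them:

from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline


class ArticleImagePipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # only request URLs that are non-empty strings
        for image_url in item.get('front_image_url', []):
            if image_url:
                yield Request(image_url)

    def item_completed(self, results, item, info):
        for ok, value in results:
            if ok:
                item['front_image_path'] = value['path']
        return item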

If an article contains emoji, an encoding error occurs when the row is inserted into MySQL.
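This is usually because MySQL's utf8 charset stores at most 3 bytes per character while emoji need 4. A common fix, not applied in the code above, is to create the database and table with utf8mb4 and connect with charset='utf8mb4', along these lines:

import pymysql

# assumes the jobbole database and its text columns were created with utf8mb4
db = pymysql.connect(host='127.0.0.1', user='root', password='123456',
                     db='jobbole', charset='utf8mb4')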

I didn't configure a log file for this crawl.

When the XPath in add_xpath() extracts an empty list, the MapCompose() input/output processors never run.

The fix is to pass an extra processor directly as an add_xpath argument, as in the short example below.
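This is exactly what the add_num static method in jobbole.py does. A self-contained sketch of the same trick (DemoItem and the HTML snippet are made up):

import scrapy
from scrapy.loader import ItemLoader
from scrapy.selector import Selector

class DemoItem(scrapy.Item):
    praise_nums = scrapy.Field()

sel = Selector(text="<div>no vote span here</div>")
loader = ItemLoader(item=DemoItem(), selector=sel)
# the extra processor runs on the raw extracted value, so an empty
# extraction [] can be replaced with a default before MapCompose would see it
loader.add_xpath('praise_nums', "//span[contains(@class,'vote-post-up')]/h10/text()",
                 lambda value: value if value else [0])
print(loader.load_item())  # {'praise_nums': [0]}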

Summary

As long as you can follow it, the English documentation is better than machine-translated Chinese documentation.

When something isn't clear, look at the source code.

Send the log output to a file.

If you can't fully understand every part, go through the whole thing and then come back to it.

Logging into Zhihu with Selenium and crawling questions and answers

Encoding issues

Converting between str and bytes (binary) in Python

Because response.body in Scrapy is bytes, it has to be converted to a string before being written to a text file.

s = 'abc'
# errors can be 'strict', 'ignore', ...
byt = s.encode(encoding='utf8', errors='strict')
# bytes -> str
s = byt.decode(encoding='utf8', errors='ignore')

Encoding to watch when writing bytes to a file

On Windows a newly created file defaults to the GBK encoding, so the Python interpreter will try to encode the downloaded data as GBK, which often fails. Specify the encoding when opening the file:

with open('c:test.txt', 'w', encoding='utf8') as f:
    f.write(response.body.decode('utf8', errors='ignore'))

Base64-encoded images

from PIL import Image
from io import BytesIO
import base64

img_src = "data:image/jpg;base64,R0lGODdh.."
img_src = img_src.split(',')[1]
# decode (not encode) the base64 payload back into bytes
img_data = base64.b64decode(img_src)
img = Image.open(BytesIO(img_data))
img.show()

Small spider tricks

Building a response by hand

from scrapy.http import HtmlResponse

body = open("example.html").read()
response = HtmlResponse(url='http://example.com', body=body.encode('utf-8'))

Joining and following URLs

def parse(self, response):
    yield {}
    for url in response.xpath().extract():
        yield scrapy.Request(url=response.urljoin(url), callback=self.parse)

# simpler: response.follow removes the need for extract() and response.urljoin,
# since it accepts relative URLs and selectors directly
# (if you need to post-process the extracted URL, call url.extract() first)
def parse(self, response):
    for url in response.xpath():
        yield response.follow(url, callback=self.parse)

Spider logging

See the Scrapy documentation on logging.

The log levels are the same as Python's: debug, info, warning, error, critical.

The Spider class has a built-in logger attribute:

class ZhihuSpider(scrapy.Spider):
    def func(self):
        self.logger.warning('this is a log')

Outside a Spider class you can do:

import logging

logging.warning('this is a log')
# or create your own logger
logger = logging.getLogger('mycustomlogger')
logger.warning('this is a log')

You can also set the console log level and the log file in settings.py:

LOG_FILE = 'dir'
LOG_LEVEL = logging.WARNING

# or on the command line
--logfile FILE
--loglevel LEVEL

Regex: matching "does not contain"

Note that (?=...) is a zero-width lookahead; it does not consume any characters.

import re

s = 'sda'
# fails: the lookahead doesn't consume 'd', so $ cannot match right after 's'
re.match('s(?=d)$', s)
# matches 's' only when it is NOT followed by 'da', so it fails for 'sda'
re.match('s(?!da)', s)

Using Selenium

Download the browser driver.

I used Chrome version 6.0; with the newest version there is a "missing arguments granttype" error.

Selenium methods

from selenium import webdriver

driver = webdriver.Chrome(executable_path="path/to/chromedriver")
# driver.page_source is the page source

# explicit waits
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as ec

# 10 is the timeout in seconds; until() takes a function that receives the driver
# and returns a truthy value when the condition is met
element = WebDriverWait(driver, 10).until(lambda x: x.find_element_by_xpath(
    "//div[@class='SignContainer-switch']/span"))
# same idea, using Selenium's built-in expected conditions
WebDriverWait(driver, 10).until(ec.text_to_be_present_in_element(
    (By.XPATH, "//div[@class='SignContainer-switch']/span"), '注册'))

The full spider code

settings.py

# -*- coding: utf-8 -*-
import logging

# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'zhihu'

SPIDER_MODULES = ['zhihu.spiders']
NEWSPIDER_MODULE = 'zhihu.spiders'

SQL_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"
MYSQL_HOST = '127.0.0.1'
MYSQL_DBNAME = 'zhihuSpider'
MYSQL_USER = 'root'
MYSQL_PASSWORD = '123456'

LOG_LEVEL = logging.WARNING
LOG_FILE = r'G:\py3env\bole2\zhihu\zhihu\zhihu_spider.log'  # raw string so the backslashes aren't treated as escapes

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

COOKIES_ENABLED = True

# Override the default request headers:
# required
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0"

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'zhihu.pipelines.ZhihuPipeline': 300,
}

zhihu_login.py

# -*- coding: utf-8 -*-
import scrapy
from zhihu.items import ZhihuQuestionItem, ZhihuAnswerItem, ZhihuItem
import re
import json
import datetime
import logging

from selenium import webdriver
# to parse plain text with Scrapy selectors:
# from scrapy.selector import Selector
# usage: Selector(text=driver.page_source).css(...).extract()

# for opening base64-encoded images:
# import base64
# from io import BytesIO, StringIO

# modules for Selenium's explicit waits
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as ec


class ZhihuLoginSpider(scrapy.Spider):
    name = 'zhihu_login'
    allowed_domains = ['www.zhihu.com']
    # the initial url used by start_requests
    start_urls = ['https://www.zhihu.com/signup?next=%2F']
    # the api that returns a question's answers
    start_answer_url = ["https://www.zhihu.com/api/v4/questions/{0}/answers?include=data[*].is_normal,admin_closed_comment,reward_info,is_collapsed,annotation_action,annotation_detail,collapse_reason,is_sticky,collapsed_by,suggest_edit,comment_count,can_comment,content,editable_content,voteup_count,reshipment_settings,comment_permission,created_time,updated_time,review_info,relevant_info,question,excerpt,relationship.is_authorized,is_author,voting,is_thanked,is_nothelp;data[*].mark_infos[*].url;data[*].author.follower_count,badge[*].topics&offset={2}&limit={1}&sort_by=default"]

    def start_requests(self):
        driver = webdriver.Chrome(executable_path='C:/Users/Administrator/Desktop/chromedriver.exe')
        # open the signup/login page
        driver.get(self.start_urls[0])
        # wait up to 10 seconds for the login switch to appear
        element = WebDriverWait(driver, 10).until(lambda x: x.find_element_by_xpath(
            "//div[@class='SignContainer-switch']/span"))
        # click "log in"
        element.click()
        # wait until the switch shows the text "注册" after the click
        WebDriverWait(driver, 10).until(ec.text_to_be_present_in_element(
            (By.XPATH, "//div[@class='SignContainer-switch']/span"), '注册'))
        # type the account and password
        driver.find_element_by_css_selector("div.SignFlow-account input").send_keys("your-account")
        driver.find_element_by_css_selector("div.SignFlow-password input").send_keys("your-password")
        driver.find_element_by_css_selector("button.SignFlow-submitButton").click()
        # wait for an element of the logged-in page to load
        WebDriverWait(driver, 10).until(lambda x: x.find_element_by_xpath(
            "//div[@class='GlobalWrite-navTitle']"))
        # grab the cookies
        Cookies = driver.get_cookies()
        cookie_dict = {}
        for cookie in Cookies:
            cookie_dict[cookie['name']] = cookie['value']
        # close the driver
        driver.close()
        return [scrapy.Request('https://www.zhihu.com/', cookies=cookie_dict, callback=self.parse)]

    def parse(self, response):
        # collect every link on the page
        all_urls = response.css("a::attr(href)").extract()
        for url in all_urls:
            # do not match urls like https://www.zhihu.com/question/13413413/log
            match_obj = re.match(r'.*zhihu.com/question/(\d+)(/|$)(?!log)', url)
            if match_obj:
                yield scrapy.Request(response.urljoin(url), callback=self.parse_question)
            else:
                yield scrapy.Request(response.urljoin(url), callback=self.parse)

    def parse_question(self, response):
        if "QuestionHeader-title" in response.text:
            match_obj = re.match(r".*zhihu.com/question/(\d+)(/|$)", response.url)
            self.logger.warning('Parse function called on {}'.format(response.url))
            if match_obj:
                self.logger.warning('zhihu id is {}'.format(match_obj.group(1)))
                question_id = int(match_obj.group(1))
                item_loader = ZhihuItem(item=ZhihuQuestionItem(), response=response)
                # "::text" with no leading space takes the node's own text; " ::text" also takes descendants' text
                item_loader.add_css("title", "h1.QuestionHeader-title::text")
                item_loader.add_css("content", ".QuestionHeader-detail ::text")
                item_loader.add_value("url", response.url)
                item_loader.add_value("zhihu_id", question_id)
                # the css rule for answer_num differs depending on whether "view all answers"
                # has been clicked, so both rules are added here
                item_loader.add_css("answer_num", "h4.List-headerText span ::text")
                item_loader.add_css("answer_num", "a.QuestionMainAction::text")
                item_loader.add_css("comments_num", "div.QuestionHeader-Comment button::text")
                item_loader.add_css("watch_user_num", "strong.NumberBoard-itemValue::text")
                item_loader.add_css("topics", ".QuestionHeader-topics ::text")
                item_loader.add_value("crawl_time", datetime.datetime.now())
                question_item = item_loader.load_item()

                """unused branch for the old page layout
                else:
                    match_obj = re.match(".*zhihu.com/question/(\d+)(/|$)", response.url)
                    if match_obj:
                        question_id = int(match_obj.group(1))
                        item_loader = ItemLoader(item=ZhihuQuestionItem(), response=response)
                        item_loader.add_css("title",
                            "//*[id='zh-question-title']/h2/a/text()|//*[@id='zh-question-title']/h2/span/text()")
                        item_loader.add_css("content", ".QuestionHeader-detail")
                        item_loader.add_value("url", response.url)
                        item_loader.add_value("zhihu_id", question_id)
                        item_loader.add_css("answer_num", "#zh-question-answer-num::text")
                        item_loader.add_css("comment_num", "#zh-question-meta-wrap a[name='addcomment']::text")
                        item_loader.add_css("watch_user_num", "//*[@id='zh-question-side-header-wrap']/text()|"
                            "//*[@class='zh-question-followers-sidebar]/div/a/strong/text()")
                        item_loader.add_css("topics", ".zm-tag-editor-labels a::text")
                        question_item = item_loader.load_item()
                """

                # format(*args, **kwargs), e.g.
                # "{1}{how}{0}".format("happy", "today ", how="very ")  ->  "today very happy"
                yield scrapy.Request(self.start_answer_url[0].format(question_id, 20, 0),
                                     callback=self.parse_answer)
                yield question_item

    def parse_answer(self, response):
        # the page returns a json string; turn it into a dict
        ans_json = json.loads(response.text)
        is_end = ans_json["paging"]['is_end']
        next_url = ans_json["paging"]["next"]
        for answer in ans_json["data"]:
            # assigning to the item directly is simpler, but processors can't be used this way
            answer_item = ZhihuAnswerItem()
            answer_item["zhihu_id"] = answer["id"]
            answer_item["url"] = answer["url"]
            answer_item["question_id"] = answer["question"]["id"]
            answer_item["author_id"] = answer["author"]["id"] if "id" in answer["author"] else None
            answer_item["content"] = answer["content"] if "content" in answer else None
            answer_item["parise_num"] = answer["voteup_count"]
            answer_item["comments_num"] = answer["comment_count"]
            answer_item["create_time"] = answer["created_time"]
            answer_item["update_time"] = answer["updated_time"]
            answer_item["crawl_time"] = datetime.datetime.now()
            yield answer_item
        if not is_end:
            yield scrapy.Request(next_url, callback=self.parse_answer)

The screenshots below show how to find the answers API in the browser's dev tools.

(screenshots: 图片.png ×3)

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql
from twisted.enterprise import adbapi


class ZhihuPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    def process_item(self, item, spider):
        query = self.dbpool.runInteraction(self.do_insert_sql, item)
        query.addErrback(self.handle_error, item, spider)
        return item  # return the item so any later pipelines still receive it

    def do_insert_sql(self, cursor, item):
        insert_sql, params = item.get_insert_sql()
        cursor.execute(insert_sql, params)

    def handle_error(self, failure, item, spider):
        print(failure)

    @classmethod
    def from_settings(cls, settings):
        params = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWORD'],
            charset='utf8',
            cursorclass=pymysql.cursors.DictCursor,
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool("pymysql", **params)
        return cls(dbpool)

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import logging
import datetime
import re

import scrapy
from scrapy.loader.processors import TakeFirst, Join, Compose, MapCompose
from scrapy.loader import ItemLoader


# extract the number from the follower/answer/comment count text
def extract_num(value):
    # log what the processor receives
    logging.warning('this is function extract_num value:{}'.format(value))
    for val in value:
        if val is not None:
            # strip the thousands separators from the number
            val = ''.join(val.split(','))
            match_obj = re.match(r".*?(\d+)", val)
            if match_obj:
                logging.warning('this is one of value:{}'.format(match_obj.group(1)))
                return int(match_obj.group(1))


# subclass ItemLoader to set a default output processor
class ZhihuItem(ItemLoader):
    # take the first element of the list
    default_output_processor = TakeFirst()


class ZhihuQuestionItem(scrapy.Item):
    topics = scrapy.Field(
        # join the topics into one string
        output_processor=Join(',')
    )
    url = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
    answer_num = scrapy.Field(
        # extract the number
        output_processor=Compose(extract_num)
    )
    comments_num = scrapy.Field(
        output_processor=Compose(extract_num)
    )
    # number of followers
    watch_user_num = scrapy.Field(
        output_processor=Compose(extract_num)
    )
    zhihu_id = scrapy.Field()
    crawl_time = scrapy.Field()

    def get_insert_sql(self):
        # on duplicate key update col_name=values(col_name)
        insert_sql = """
            insert into zhihu_question(zhihu_id, topics, url, title, content, answer_num, comments_num,
                watch_user_num, crawl_time)
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)
            on duplicate key update content=values(content), answer_num=values(answer_num),
                comments_num=values(comments_num), watch_user_num=values(watch_user_num)
        """
        # [Failure instance: Traceback: : Use item['crawl_time'] = '2018-10-29 19:16:24' to set field value
        # self.crawl_time = datetime.datetime.now()
        # use get() to handle keys that may be missing
        # a datetime.datetime.now() value can be inserted into the database directly
        params = (self.get('zhihu_id'), self.get('topics', 'null'), self.get('url'), self.get('title'),
                  self.get('content', 'null'), self.get('answer_num', 0), self.get('comments_num', 0),
                  self.get('watch_user_num', 0), self.get('crawl_time'))
        return insert_sql, params


class ZhihuAnswerItem(scrapy.Item):
    zhihu_id = scrapy.Field()
    url = scrapy.Field()
    question_id = scrapy.Field()
    author_id = scrapy.Field()
    content = scrapy.Field()
    # upvotes
    parise_num = scrapy.Field()
    comments_num = scrapy.Field()
    # creation time
    create_time = scrapy.Field()
    update_time = scrapy.Field()
    crawl_time = scrapy.Field()

    def get_insert_sql(self):
        insert_sql = """
            insert into zhihu_answer(zhihu_id, url, question_id, author_id, content, parise_num, comments_num,
                create_time, update_time, crawl_time)
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
            on duplicate key update content=values(content), comments_num=values(comments_num),
                parise_num=values(parise_num), update_time=values(update_time)
        """
        # fromtimestamp() converts a unix timestamp into a datetime
        params = (
            self.get("zhihu_id"), self.get("url"), self.get("question_id"), self.get("author_id"),
            self.get("content"), self.get("parise_num", 0), self.get("comments_num", 0),
            datetime.datetime.fromtimestamp(self.get("create_time")),
            datetime.datetime.fromtimestamp(self.get("update_time")), self.get("crawl_time"),
        )
        return insert_sql, params

Summary

Some problems come up again and again.

Write the program step by step and remember to add comments.

Note down useful links as you go.

The ways of logging into Zhihu without Selenium have all stopped working; if you know of one that still works without Selenium, please let me know.
