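This post records my own step-by-step walkthrough after finishing Dazhuang's (大壮老师) video course "Python最火爬虫框架Scrapy入门与实践", an introduction to Scrapy, Python's most popular crawling framework. If you are a beginner like me, I recommend working through every step yourself.
Install command: pip install Scrapy
Step 1: Create a project named scrapy_douban in PyCharm.
Step 2: Go to your project directory and initialize a Scrapy project named douban: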
scrapy startproject douban
Step 3: Edit the settings file settings.py:
ROBOTSTXT_OBEY = False  # do not obey the site's robots.txt rules
# Download delay, in seconds
DOWNLOAD_DELAY = 0.5
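Step 4: Generate the initial spider file douban_spider.py:
scrapy genspider douban_spider http://movie.douban.com/
The general form is: scrapy genspider <spider name> <URL> (the generated file ends up in the project's spiders directory)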
# -*- coding: utf-8 -*-
import scrapy


class DoubanSpiderSpider(scrapy.Spider):
    name = 'douban_spider'
    allowed_domains = ['movie.douban.com']
    start_urls = ['http://movie.douban.com/']

    def parse(self, response):
        pass
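- name: identifies the spider and distinguishes it from other spiders; it must be unique.
- parse(response): a callback of the spider, called when the Downloader returns a Response. The response object produced for each start URL is passed to it as its only argument. The function is responsible for parsing the response, extracting data (producing items), and producing Request objects for any URLs that need further crawling.
- start_urls: effectively shorthand for the start_requests() method. If we define only the start_urls class attribute, a list of URLs, we do not have to build scrapy.Request objects ourselves in start_requests(); the default start_requests() creates the spider's initial requests from start_urls.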
# Inside the spider class:
headers = {}

def start_requests(self):
    urls = [
        'url1',
        'url2',
    ]
    for url in urls:
        yield scrapy.Request(url=url, headers=self.headers, callback=self.parse)

# is equivalent to (apart from the custom headers)
start_urls = [
    'url1',
    'url2',
]
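To make the equivalence concrete, here is a minimal plain-Python sketch of what a default start_requests() does with start_urls. FakeRequest and MiniSpider are hypothetical stand-ins so that the snippet runs without Scrapy installed; they are not Scrapy's real classes, and the real implementation does more than this.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List


@dataclass
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request (not the real class)."""
    url: str
    callback: Callable


class MiniSpider:
    """Hypothetical stand-in for scrapy.Spider."""
    start_urls: List[str] = ['url1', 'url2']

    def start_requests(self) -> Iterator[FakeRequest]:
        # Default behaviour, simplified: one request per start URL,
        # each routed to parse().
        for url in self.start_urls:
            yield FakeRequest(url=url, callback=self.parse)

    def parse(self, response):
        pass


initial_requests = list(MiniSpider().start_requests())
print([r.url for r in initial_requests])
```

Defining start_urls alone is enough when every initial request can use the defaults; writing start_requests() yourself is only needed when the initial requests need something extra, such as the custom headers above.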
Step 5: Edit the data model file items.py and declare one field for each piece of data you want to scrape (rank, title, description, rating, and so on).
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # rank
    serial_number = scrapy.Field()
    # movie title
    movie_name = scrapy.Field()
    # introduction
    introduce = scrapy.Field()
    # star rating
    star = scrapy.Field()
    # number of reviews
    evaluate = scrapy.Field()
    # description
    describle = scrapy.Field()
Step 6: Edit the spider file douban_spider.py:
# -*- coding: utf-8 -*-
import scrapy


class DoubanSpiderSpider(scrapy.Spider):
    # Name of the spider
    name = 'douban_spider'
    # Domains the spider is allowed to crawl
    allowed_domains = ['movie.douban.com']
    # Start URLs, handed to the scheduler
    start_urls = ['http://movie.douban.com/top250']

    def parse(self, response):
        # Print the response body
        print(response.text)
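Step 7: Run the Scrapy project:
scrapy crawl douban_spider
- douban_spider is the spider's name value
- Run the command from inside the project (for example, from the spiders directory)
The output shows an error.
We need to go back to the project's settings.py and set USER_AGENT; otherwise the request is rejected.
What should it be set to?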
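Step 8: Set the User-Agent request header (USER_AGENT)
(1) Open settings.py in PyCharm and set USER_AGENT there (for example, to the User-Agent string of your own browser).
(2) Open a terminal and run scrapy crawl douban_spider again from the spiders directory. If the log now contains a large amount of HTML, the request succeeded.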
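As an illustration of item (1), the line in settings.py might look like the following. The exact string is an assumption: any current browser's User-Agent should do, and this one is simply copied from the User-Agent list in step 17 of this post.

```python
# settings.py (fragment)
# Assumed example value; replace it with the User-Agent of your own browser.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
```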
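Step 9: So far we have run the spider from a terminal. For convenience, let's run it from PyCharm instead.
First create a launcher file, for example main.py.
Then write the following in main.py:
from scrapy import cmdline

# Run the spider; this prints the raw, unfiltered page output
cmdline.execute('scrapy crawl douban_spider'.split())
Right-click the file and run it; the output is the same as in the terminal.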
Step 10: Back in the spider file douban_spider.py, use XPath to get the list of movies on the page:
# -*- coding: utf-8 -*-
import scrapy


class DoubanSpiderSpider(scrapy.Spider):
    # Name of the spider
    name = 'douban_spider'
    # Domains the spider is allowed to crawl
    allowed_domains = ['movie.douban.com']
    # Start URLs, handed to the scheduler
    start_urls = ['http://movie.douban.com/top250']

    def parse(self, response):
        movie_list = response.xpath("//div[@class='article']//ol[@class='grid_view']/li")
        for i_item in movie_list:
            print(i_item)
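To see what an XPath like this selects without hitting the site, here is a small sketch that uses the standard library's xml.etree.ElementTree in place of Scrapy's selectors. ElementTree supports only a subset of XPath, so the expression is shortened, and the markup is a made-up, heavily simplified stand-in for the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for the Top 250 page markup
SAMPLE = """
<html><body>
  <div class="article">
    <ol class="grid_view">
      <li><div class="item"><em>1</em><span class="title">Movie A</span></div></li>
      <li><div class="item"><em>2</em><span class="title">Movie B</span></div></li>
    </ol>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# Same idea as the spider's XPath: every <li> directly under the grid_view list
movie_list = root.findall(".//ol[@class='grid_view']/li")

titles = []
for li in movie_list:
    # The leading "." keeps the search relative to the current <li>
    title = li.find(".//span[@class='title']")
    titles.append(title.text)

print(len(movie_list), titles)
```

The relative "./" prefix matters in the real spider too: inside the loop, an expression starting with "//" would search the whole page again instead of the current list entry.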
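Step 11: Complete douban_spider.py (parse the detailed attributes)
Next we break each entry down further and extract the detailed fields.
The changes are:
- Import the model:
from ..items import DoubanItem
This imports the DoubanItem model from items.py in the douban package.
- Modify the loop:
  - DoubanItem(): instantiates the model.
  - douban_item['field name']: sets the value of the corresponding field declared in DoubanItem.
- The extraction uses Scrapy selectors: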
  - response.xpath(): takes an XPath expression and returns a selector list of all matching nodes.
  - response.css(): takes a CSS expression and returns a selector list of all matching nodes.
  - .extract_first(): returns the first result as a string.
  - .extract(): serializes all matched nodes to Unicode strings and returns them as a list.
  - .re(): extracts data with the given regular expression and returns a list of Unicode strings.
  - .re_first(): returns only the first matching string.
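The difference between .re() and .re_first() can be illustrated with the standard re module. The two helpers below are rough analogues written for this post, not Scrapy's implementation, and the sample text is a made-up value shaped like the review-count span.

```python
import re
from typing import List, Optional


def re_all(pattern: str, text: str) -> List[str]:
    """Rough analogue of .re(): every match, as a list."""
    return re.findall(pattern, text)


def re_first(pattern: str, text: str) -> Optional[str]:
    """Rough analogue of .re_first(): the first match, or None."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


sample = "2345678人评价"  # hypothetical text of the review-count span
digits = re_first(r"\d+", sample)
nothing = re_first(r"\d+", "no digits here")
both = re_all(r"\d+", "9.7 / 2345678")
print(digits, nothing, both)
```

A pattern like this is one way to keep only the number from the evaluate field instead of storing the whole text.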
# -*- coding: utf-8 -*-
import scrapy
from ..items import DoubanItem


class DoubanSpiderSpider(scrapy.Spider):
    # Name of the spider
    name = 'douban_spider'
    # Domains the spider is allowed to crawl
    allowed_domains = ['movie.douban.com']
    # Start URLs, handed to the scheduler
    start_urls = ['http://movie.douban.com/top250']

    def parse(self, response):
        # Get the list of movies
        movie_list = response.xpath("//div[@class='article']//ol[@class='grid_view']/li")
        for i_item in movie_list:
            # Instantiate the model
            douban_item = DoubanItem()
            # Extract the movie's details and assign them to the matching fields
            douban_item['serial_number'] = i_item.xpath(".//div[@class='item']//em/text()").extract_first()
            douban_item['movie_name'] = i_item.xpath(".//div[@class='item']//span[@class='title']/text()").extract_first()
            descs = i_item.xpath(".//div[@class='info']//div[@class='bd']/p[1]/text()").extract()
            # Strip the whitespace inside every line and join the lines;
            # assigning inside a loop would keep only the last line
            douban_item['introduce'] = "".join("".join(i_desc.split()) for i_desc in descs)
            douban_item['star'] = i_item.xpath(".//div[@class='star']//*[@class='rating_num']/text()").extract_first()
            douban_item['evaluate'] = i_item.xpath(".//div[@class='star']//span[4]/text()").extract_first()
            douban_item['describle'] = i_item.xpath(".//p[@class='quote']/span/text()").extract_first()
            print(douban_item)
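The whitespace cleanup in the spider relies on the idiom "".join(s.split()): split() with no arguments cuts on any run of whitespace (spaces, tabs, newlines), and joining with an empty string glues the pieces back together. A small sketch, where clean_lines is a helper made up for this post and the input strings are hypothetical text nodes:

```python
from typing import List


def clean_lines(lines: List[str]) -> List[str]:
    """Remove all whitespace inside each line and drop lines left empty."""
    cleaned = ["".join(line.split()) for line in lines]
    return [line for line in cleaned if line]


# Hypothetical text nodes, shaped like what p[1]/text() returns
descs = [
    "\n      Director: Frank Darabont   Cast: Tim Robbins\n",
    "\n      1994 / USA / Crime Drama\n    ",
]
cleaned_descs = clean_lines(descs)
print(cleaned_descs)
```

Note that this removes the spaces between words as well, which suits Chinese text better than English; " ".join(s.split()) would collapse whitespace to single spaces instead.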
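Step 12: The yield statement and the Scrapy framework
Replace the last line of the code, print(douban_item),
with yield douban_item.
This hands each item to the Item Pipeline for processing, which is how data flows through Scrapy's architecture.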
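The effect of yield can be seen in plain Python. In this sketch, parse() is a generator that produces items one at a time, and a loop plays the role of the engine by passing each item on; pipeline() is a hypothetical stand-in for a pipeline's process_item(), not Scrapy code.

```python
from typing import Dict, Iterator, List


def parse(rows: List[str]) -> Iterator[Dict[str, str]]:
    """Yield one item per row instead of building a full list first."""
    for number, name in enumerate(rows, start=1):
        yield {"serial_number": str(number), "movie_name": name}


def pipeline(item: Dict[str, str]) -> Dict[str, str]:
    """Hypothetical stand-in for an item pipeline's process_item()."""
    item["movie_name"] = item["movie_name"].strip()
    return item


# The engine's role, simplified: pull items one by one and pass each on
processed = [pipeline(item) for item in parse([" Movie A ", "Movie B"])]
print(processed)
```

Because parse() is a generator, each item can be handed on as soon as it is produced, and the same function can later yield requests as well as items.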
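Step 13: Follow the "next page" link to get all the data
Up to this point we have only scraped the current page. Next we need to handle pagination and follow every page link.
We need the href of the link element under the <span class="next"> tag, as the XPath in the code shows: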
# -*- coding: utf-8 -*-
import scrapy
from ..items import DoubanItem


class DoubanSpiderSpider(scrapy.Spider):
    # Name of the spider
    name = 'douban_spider'
    # Domains the spider is allowed to crawl
    allowed_domains = ['movie.douban.com']
    # Start URLs, handed to the scheduler
    start_urls = ['http://movie.douban.com/top250']

    def parse(self, response):
        # Get the list of movies
        movie_list = response.xpath("//div[@class='article']//ol[@class='grid_view']/li")
        for i_item in movie_list:
            # Instantiate the model
            douban_item = DoubanItem()
            # Extract the movie's details and assign them to the matching fields
            douban_item['serial_number'] = i_item.xpath(".//div[@class='item']//em/text()").extract_first()
            douban_item['movie_name'] = i_item.xpath(".//div[@class='item']//span[@class='title']/text()").extract_first()
            descs = i_item.xpath(".//div[@class='info']//div[@class='bd']/p[1]/text()").extract()
            # Strip the whitespace inside every line and join the lines;
            # assigning inside a loop would keep only the last line
            douban_item['introduce'] = "".join("".join(i_desc.split()) for i_desc in descs)
            douban_item['star'] = i_item.xpath(".//div[@class='star']//*[@class='rating_num']/text()").extract_first()
            douban_item['evaluate'] = i_item.xpath(".//div[@class='star']//span[4]/text()").extract_first()
            douban_item['describle'] = i_item.xpath(".//p[@class='quote']/span/text()").extract_first()
            yield douban_item
        # Parse the next-page link (after the loop, once per page)
        next_link = response.xpath("//span[@class='next']/link/@href").extract()
        if next_link:
            next_link = next_link[0]
            # Build the full URL of the next page
            next_url = "https://movie.douban.com/top250" + next_link
            yield scrapy.Request(next_url, callback=self.parse)
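1. After each for loop finishes (the current page has been scraped), get the link to the next page: next_link.
2. The last page has no next page, so we need to check for that.
3. callback=self.parse: the request callback; the same function is called again to scrape the next page.
Result of running main.py: the last item scraped has rank 250.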
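The spider builds the next-page URL by string concatenation, which works because the href is a relative query string. As a sketch of a more general alternative, urllib.parse.urljoin resolves a relative link against the current page URL; the href value below is an assumed example of what the page returns. Scrapy's response object also offers response.urljoin() for the same purpose.

```python
from urllib.parse import urljoin

page_url = "https://movie.douban.com/top250"
next_link = "?start=25&filter="  # assumed example href of the next-page link

# Concatenation, as in the spider above
by_concat = "https://movie.douban.com/top250" + next_link

# Resolving the relative link against the current page URL
by_join = urljoin(page_url, next_link)

# Still correct when the current page already carries a query string
from_page_2 = urljoin(by_join, "?start=50&filter=")

print(by_concat == by_join, from_page_2)
```

Joining against the current URL avoids hard-coding the base address a second time and keeps working if the link format changes to a path.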
Step 14: Save the data to a JSON or CSV file
Run the following from the douban directory:
scrapy crawl <spider name> -o <output file name>.<format>
scrapy crawl douban_spider -o movielist.json
scrapy crawl douban_spider -o movielist.csv
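One thing to be aware of with JSON output: Chinese text may show up in the file as \uXXXX escapes. The sketch below uses the standard json module (not Scrapy's exporter) to show the two forms, with a made-up sample item. In Scrapy, setting FEED_EXPORT_ENCODING = 'utf-8' in settings.py is the usual way to get readable output.

```python
import json

item = {"serial_number": "1", "movie_name": "肖申克的救赎"}  # made-up sample item

escaped = json.dumps(item)                       # non-ASCII written as \uXXXX
readable = json.dumps(item, ensure_ascii=False)  # characters kept as they are

print(escaped)
print(readable)

# Both forms decode back to the same data
same_data = json.loads(escaped) == json.loads(readable) == item
```

Either form is valid JSON; the difference only matters when a person reads the file directly.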
Step 15: Store the data in a MySQL database
- Configure settings.py
# (1) Uncomment the following block in settings.py:
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}
# (2) Append the database settings to the end of settings.py:
# MySQL connection settings
mysql_host = 'localhost'
mysql_port = 3306
mysql_user_name = 'root'
mysql_user_password = '111111'
mysql_database_name = 'douban'
- Change your pipelines.py as follows:
# -*- coding: utf-8 -*-
import mysql.connector
from .settings import mysql_host, mysql_port, mysql_user_name, mysql_user_password, mysql_database_name


class DoubanPipeline:
    # Connect to the database
    def __init__(self):
        self.conn = mysql.connector.connect(host=mysql_host, port=mysql_port, database=mysql_database_name, user=mysql_user_name, password=mysql_user_password, charset='utf8')
        self.cs = self.conn.cursor()

    # Check whether the table exists
    def tableExists(self):
        stmt = 'SHOW TABLES LIKE "{}"'.format("movies")
        print(stmt)
        self.cs.execute(stmt)
        return self.cs.fetchone()

    # Database operations
    def process_item(self, item, spider):
        try:
            if self.tableExists():
                print("Table already exists, skipping creation")
            else:
                print("Creating table")
                creat_sql = "CREATE TABLE movies (serial_number VARCHAR(10),movie_name VARCHAR(150),introduce text,star VARCHAR(10),evaluate VARCHAR(100),describle text)"
                self.cs.execute(creat_sql)
                print("Table created")
            # Insert the data
            sql = "insert into movies (serial_number,movie_name,introduce,star,evaluate,describle) values(%s,%s,%s,%s,%s,%s)"
            print(sql)
            # Execute the statement and commit
            self.cs.execute(sql, (item['serial_number'], item['movie_name'], item['introduce'], item['star'], item['evaluate'], item['describle']))
            self.conn.commit()
        except Exception as error:
            # Print the error when something goes wrong
            print(error)
        return item
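For trying out the pipeline logic without a MySQL server, here is a sketch of the same idea using the standard library's sqlite3 with an in-memory database. This swaps the database on purpose and is not the setup used in this post: SqlitePipeline is a name made up here, sqlite3 uses ? placeholders where mysql.connector uses %s, and CREATE TABLE IF NOT EXISTS replaces the per-item table check. open_spider() and close_spider() are the hooks Scrapy calls on a pipeline when the spider starts and finishes.

```python
import sqlite3

FIELDS = ("serial_number", "movie_name", "introduce", "star", "evaluate", "describle")


class SqlitePipeline:
    """Sketch of the same pipeline idea, using sqlite3 instead of MySQL."""

    def open_spider(self, spider):
        # Called once when the spider starts: connect and ensure the table exists
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS movies ("
            "serial_number TEXT, movie_name TEXT, introduce TEXT, "
            "star TEXT, evaluate TEXT, describle TEXT)"
        )

    def process_item(self, item, spider):
        # Parameterized insert: values are passed separately from the SQL text
        placeholders = ", ".join("?" for _ in FIELDS)
        sql = f"INSERT INTO movies ({', '.join(FIELDS)}) VALUES ({placeholders})"
        self.conn.execute(sql, tuple(item.get(name) for name in FIELDS))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.conn.close()


pipeline = SqlitePipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({"serial_number": "1", "movie_name": "Movie A", "star": "9.7"}, spider=None)
row_count = pipeline.conn.execute("SELECT COUNT(*) FROM movies").fetchone()[0]
stored_name = pipeline.conn.execute("SELECT movie_name FROM movies").fetchone()[0]
print(row_count, stored_name)
pipeline.close_spider(spider=None)
```

Creating the table once at startup avoids running a SHOW TABLES query for every item, and item.get() stores NULL for a missing field instead of raising an error.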
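Run main.py and the data will be stored in the database.
Step 16: Write an IP proxy middleware (disguising the crawler's IP address)
Edit the middleware file middlewares.py:
(1) Import base64 at the top of the file:
import base64
(2) Add the following class at the end of the file: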
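class my_proxy(object):
    def process_request(self, request, spider):
        # Scrapy's HttpProxyMiddleware reads the lowercase key request.meta['proxy'];
        # a key spelled 'Proxy' is ignored, so no proxy would be used.
        # The value must include the scheme, otherwise Scrapy raises
        # twisted.web.error.SchemeNotSupported: Unsupported scheme: b''
        request.meta['proxy'] = 'http://http-cla.abuyun.com:9030'
        proxy_name_pass = b'H622272STYB666BW:F78990HJSS7'
        enconde_pass_name = base64.b64encode(proxy_name_pass)
        request.headers['Proxy-Authorization'] = 'Basic ' + enconde_pass_name.decode()
- Explanation (the values come from the HTTP tunnel purchased after registering with Abuyun (阿布云)):
  - request.meta['proxy']: 'http://server address:port'
  - proxy_name_pass: b'license ID:secret key'; the b prefix makes it a bytes literal, which base64.b64encode() requires
  - base64.b64encode(): Base64-encodes the value
  - 'Basic ': there must be a space after Basic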
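The header construction can be checked on its own with the standard library. In this sketch, basic_proxy_auth is a helper name made up for this post and the credentials are placeholders; it shows the bytes-to-Base64-to-string round trip that the middleware performs.

```python
import base64


def basic_proxy_auth(username: str, password: str) -> str:
    """Build the value of a Proxy-Authorization header (HTTP Basic scheme)."""
    raw = f"{username}:{password}".encode("utf-8")  # b64encode needs bytes
    token = base64.b64encode(raw).decode("ascii")   # back to str for the header
    return "Basic " + token                         # note the space after Basic


header_value = basic_proxy_auth("user", "secret")   # placeholder credentials
print(header_value)
```

Base64 is an encoding, not encryption: anyone who sees the header can decode the credentials, so real keys should be kept out of shared code.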
Edit settings.py:
(3) Uncomment the following block and change it to:
DOWNLOADER_MIDDLEWARES = {
    'douban.middlewares.my_proxy': 543,
}
(4) Run main.py:
The run output confirmed that the IP address was hidden successfully.
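Step 17: Disguise the User-Agent header
We already set the User-Agent once in an earlier step, but that value was hard-coded.
Next we pick a User-Agent at random for each request as a simple disguise.
Again, edit the middleware file middlewares.py:
(1) Import the random module at the top of the file:
import random
(2) Add the following class at the end of the file:
class my_useragent(object):
    def process_request(self, request, spider):
        UserAgentList = [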
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
"Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
"Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
"Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
"Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
"Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
"Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
"Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
]
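        agent = random.choice(UserAgentList)
        # The header name is 'User-Agent' (with a hyphen); 'User_Agent' would add an
        # unrelated header and leave the real User-Agent unchanged
        request.headers['User-Agent'] = agent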
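The selection logic can be tried outside Scrapy. In this sketch, RandomUserAgent has the same method shape as a downloader middleware, while fake_request is a hypothetical stand-in for a Scrapy request that carries only a headers mapping; the two User-Agent strings are taken from the list above.

```python
import random
from types import SimpleNamespace

USER_AGENTS = [
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
]


class RandomUserAgent:
    """Same method shape as a downloader middleware's process_request()."""

    def process_request(self, request, spider):
        # Pick a User-Agent independently for each outgoing request
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Returning None lets the request continue through the remaining middlewares
        return None


# Hypothetical stand-in for a Scrapy request: only a headers mapping
fake_request = SimpleNamespace(headers={})
RandomUserAgent().process_request(fake_request, spider=None)
print(fake_request.headers["User-Agent"])
```

Keeping the list at module level, as here, also avoids rebuilding it on every request.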
(3) Edit settings.py as follows,
adding one entry: 'douban.middlewares.my_useragent': 544
DOWNLOADER_MIDDLEWARES = {
    'douban.middlewares.my_proxy': 543,
    'douban.middlewares.my_useragent': 544,
}
(4) Run main.py:
The User-Agent is now set successfully.