Contents
I. Creating a Scrapy project
II. Parsing data with XPath
III. Saving data through pipelines
IV. Middleware
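I. Creating a Scrapy project
1. Create a folder named C06.
Then run the following commands in a terminal:
2. Install Scrapy: pip install scrapy
3. Enter the folder: cd C06
4. Create the project: scrapy startproject C06L02 (C06L02 is the project name)
5. Change into the inner project package: cd C06L02/C06L02
Then change into the spiders directory: cd spiders
6. Generate the spider, giving its name and the URL to crawl: scrapy genspider app https://product.cheshi.com/rank/2-0-0-0-1/
7. Open the newly generated spider file (app.py) and check that its URL matches the one you entered.
8. Run the spider: scrapy crawl app
9. To suppress log output below the error level, add this line to settings.py: LOG_LEVEL = "ERROR"
To ignore the site's robots.txt rules, set this in settings.py: ROBOTSTXT_OBEY = False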
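10. A minimal app.py for this project looks like this:
import scrapy


class AppSpider(scrapy.Spider):
    name = "app"
    allowed_domains = ["product.cheshi.com"]
    # The attribute must be named start_urls for Scrapy to schedule these requests
    start_urls = ["http://product.cheshi.com/rank/2-0-0-0-1/"]

    def parse(self, response):
        # Print the raw HTML of the downloaded page
        print(response.text)
11. Configure the User-Agent: in settings.py, uncomment the USER_AGENT line and set it to the value you need.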
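Taken together, the settings mentioned in steps 9 and 11 might look like this in settings.py. This is a sketch only: the User-Agent string is an illustrative placeholder, not a value from the original project.

```python
# settings.py (fragment): a sketch of the options discussed above

# Only log messages at ERROR level or higher
LOG_LEVEL = "ERROR"

# Do not download or obey the target site's robots.txt
ROBOTSTXT_OBEY = False

# Assumption: placeholder browser identity; substitute the one you need
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
```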
II. Parsing data with XPath
Modify the parse function in app.py:
import scrapy


class AppSpider(scrapy.Spider):
    name = "app"
    allowed_domains = ["product.cheshi.com"]
    start_urls = ["http://product.cheshi.com/rank/2-0-0-0-1/"]

    def parse(self, response):
        # Each <li> under the ranking list is one car
        cars = response.xpath('//ul[@class="condition_list_con"]/li')
        for car in cars:
            # .get() returns the first match, or None if nothing matches
            title = car.xpath('./div[@class="m_detail"]//a/text()').get()
            price = car.xpath('./div[@class="m_detail"]//b/text()').get()
            print(title, price)
III. Saving data through pipelines
1. Define the data model in items.py:
import scrapy


class C06L04Item(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
2. Update app.py as follows:
import scrapy

from ..items import C06L04Item


class AppSpider(scrapy.Spider):
    name = "app"
    allowed_domains = ["product.cheshi.com"]
    start_urls = ["http://product.cheshi.com/rank/2-0-0-0-1/"]

    def parse(self, response):
        cars = response.xpath('//ul[@class="condition_list_con"]/li')
        for car in cars:
            # Create a fresh item per car so yielded items do not share state
            item = C06L04Item()
            item["title"] = car.xpath('./div[@class="m_detail"]//a/text()').get()
            item["price"] = car.xpath('./div[@class="m_detail"]//b/text()').get()
            yield item
3. In settings.py, uncomment the ITEM_PIPELINES setting; its contents need no changes.
4. Edit pipelines.py:
from itemadapter import ItemAdapter


class C06L04Pipeline:
    def process_item(self, item, spider):
        # Uncomment to check that items are reaching the pipeline
        # print(item["title"], item["price"])
        return item
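To save the items to a file, change the pipeline to:
from itemadapter import ItemAdapter


class C06L04Pipeline:
    def open_spider(self, spider):
        # Called once when the spider starts
        self.f = open("data.txt", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # An f-string avoids a TypeError when a field is None
        self.f.write(f'{item["title"]} {item["price"]}\n')
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes; more reliable than __del__
        self.f.close()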
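A plain text file loses the field boundaries, so a common variation is to write one JSON object per line. The following is a minimal sketch, not the original author's code: the class name `JsonLinesPipeline` and the file name `cars.jsonl` are made up for illustration, and the class would still need an entry in ITEM_PIPELINES to run inside Scrapy. The last lines drive the pipeline by hand with plain dicts to show the order in which Scrapy calls the methods.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical pipeline that writes each item as one JSON line."""

    def __init__(self, path="cars.jsonl"):
        self.path = path
        self.f = None

    def open_spider(self, spider):
        self.f = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # dict(item) works for both scrapy.Item objects and plain dicts
        line = json.dumps(dict(item), ensure_ascii=False)
        self.f.write(line + "\n")
        return item

    def close_spider(self, spider):
        self.f.close()


# Simulate the lifecycle: open_spider, process_item per item, close_spider
demo_path = os.path.join(tempfile.mkdtemp(), "cars.jsonl")
pipeline = JsonLinesPipeline(demo_path)
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "Car A", "price": "10.00"}, spider=None)
pipeline.process_item({"title": "Car B", "price": None}, spider=None)
pipeline.close_spider(spider=None)

with open(demo_path, encoding="utf-8") as f:
    saved = [json.loads(line) for line in f]
print(saved)
```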
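To store the items in MongoDB instead, use:
import pymongo
from itemadapter import ItemAdapter


class C06L04Pipeline:
    def open_spider(self, spider):
        # Connect once when the spider starts
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.db = self.client["cheshi"]
        self.col = self.db["cars"]

    def process_item(self, item, spider):
        # insert_one expects a plain dict
        res = self.col.insert_one(dict(item))
        print(res.inserted_id)
        return item

    def close_spider(self, spider):
        # Release the connection when the spider finishes
        self.client.close()
        print("end")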
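Hard-coding the connection string works for a local test, but the address can also be read from settings.py through the `from_crawler` hook. This sketch assumes two custom setting names, `MONGO_URI` and `MONGO_DATABASE`, which are not built into Scrapy and do not appear in the original project; the class name is likewise hypothetical. It needs pymongo and a reachable MongoDB server to do real work.

```python
class MongoPipeline:
    """Hypothetical pipeline that takes its MongoDB address from settings."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.client = None
        self.col = None

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI and MONGO_DATABASE are assumed custom settings;
        # the fallbacks mirror the values used in the text
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "cheshi"),
        )

    def open_spider(self, spider):
        import pymongo  # imported here so the class can be defined without it

        self.client = pymongo.MongoClient(self.mongo_uri)
        self.col = self.client[self.mongo_db]["cars"]

    def process_item(self, item, spider):
        # insert_one expects a plain dict
        self.col.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
```

Keeping the address in settings.py means the same pipeline can point at a different server without code changes.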
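IV. Middleware
1. Common uses of middleware: random User-Agent, proxy IPs, Selenium integration, adding cookies
2. Dynamic User-Agent
In settings.py, uncomment the DOWNLOADER_MIDDLEWARES setting.
In middlewares.py, add the following (only the changed parts are shown):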
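import random

    # Inside the downloader middleware class generated in middlewares.py
    def process_request(self, request, spider):
        # Placeholder strings: replace each with a full browser User-Agent value,
        # without a "User-Agent:" prefix (the header name is set below)
        uas = [
            "Mxxxxxxxxxxxxxxxxxxxxxxxx",
            "Mxxxxxxxxxxxxxxxxxxxxxxxx",
            "Mxxxxxxxxxxxxxxxxxxxxxxxx",
            "Mxxxxxxxxxxxxxxxxxxxxxxxx",
        ]
        request.headers["User-Agent"] = random.choice(uas)
        # Returning None lets Scrapy continue processing the request
        return None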
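For reference, a complete, self-contained version of such a middleware could look like the sketch below. The class name `RandomUserAgentMiddleware`, the module path in the comment, and the two User-Agent strings are illustrative assumptions; a real project would register whatever class it actually defines in DOWNLOADER_MIDDLEWARES.

```python
import random


class RandomUserAgentMiddleware:
    """Hypothetical downloader middleware: picks a User-Agent per request."""

    # Illustrative values only; replace with the browser strings you need
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (X11; Linux x86_64)",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        # None tells Scrapy to keep handling the request normally
        return None


# To enable it in settings.py (the module path is an assumption):
# DOWNLOADER_MIDDLEWARES = {
#     "C06L02.middlewares.RandomUserAgentMiddleware": 543,
# }


class FakeRequest:
    """Stand-in for scrapy.Request, just enough for a quick check."""

    def __init__(self):
        self.headers = {}


request = FakeRequest()
result = RandomUserAgentMiddleware().process_request(request, spider=None)
print(request.headers["User-Agent"])
```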
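3. Proxy IPs
The details are omitted here. As one example, the Kuaidaili (快代理) documentation center describes the exact setup under tunnel proxy > Python > Scrapy.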
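Vendor specifics aside, Scrapy's built-in HttpProxyMiddleware honours a `proxy` key in `request.meta`, so a downloader middleware can route requests through a proxy by setting it. The sketch below shows only that generic mechanism; the class name and the proxy address `http://127.0.0.1:8888` are placeholders, and authentication details for any particular provider should be taken from that provider's documentation.

```python
class ProxyMiddleware:
    """Hypothetical downloader middleware that assigns a proxy to requests."""

    # Placeholder address; substitute the endpoint given by your provider
    PROXY_URL = "http://127.0.0.1:8888"

    def process_request(self, request, spider):
        # Scrapy's HttpProxyMiddleware reads request.meta["proxy"]
        request.meta["proxy"] = self.PROXY_URL
        return None


class FakeRequest:
    """Stand-in for scrapy.Request, just enough for a quick check."""

    def __init__(self):
        self.meta = {}


request = FakeRequest()
ProxyMiddleware().process_request(request, spider=None)
print(request.meta["proxy"])
```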