Building a distributed scrapy-redis spider that crawls new-home and second-hand-home listings for every domestic city on Fang.com (房天下)
Crawl strategy
1. Start from https://www.fang.com/SoufunFamily.htm, parse out every province and its cities, and collect each city's homepage link.
2. Inspection shows that each city's new-home listing URL is the homepage URL with "newhouse" inserted into the domain and "house/s/" appended to the path, while the second-hand (esf) listing URL inserts "esf" into the domain.
Taking Shanghai as an example:
Homepage:    https://sh.fang.com/
New homes:   https://sh.newhouse.fang.com/house/s/
Second-hand: https://sh.esf.fang.com
With these two derived URLs we can crawl both markets for every city.
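The URL rewriting above can be sketched as a small helper. It only uses the string transformation described in the article; the function name is illustrative:

```python
# Sketch of the URL transformation described above. The hostname style
# (e.g. https://sh.fang.com/) is taken from the Shanghai example.
def city_listing_urls(city_url):
    """Derive the new-home and second-hand (esf) listing URLs from a
    city homepage URL by rewriting the first dot in the hostname."""
    newhouse_url = city_url.replace(".", ".newhouse.", 1) + "house/s/"
    esf_url = city_url.replace(".", ".esf.", 1)
    return newhouse_url, esf_url

print(city_listing_urls("https://sh.fang.com/"))
# ('https://sh.newhouse.fang.com/house/s/', 'https://sh.esf.fang.com/')
```

Replacing only the first "." is important: it rewrites `sh.fang.com` into `sh.newhouse.fang.com` without touching the later dots in the domain.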
1. Create the project
scrapy startproject fang
cd fang
scrapy genspider fangtianxia "fang.com"
2. Define the item fields to crawl
import scrapy


class FangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    province = scrapy.Field()
    city_name = scrapy.Field()
    house_name = scrapy.Field()
    size = scrapy.Field()
    address = scrapy.Field()
    tel = scrapy.Field()
    price = scrapy.Field()
    type = scrapy.Field()
3. Write the spider: parse the data and forward requests
# -*- coding: utf-8 -*-
import scrapy
from fang.items import FangItem


class FangtianxiaSpider(scrapy.Spider):
    name = 'fangtianxia'
    allowed_domains = ['fang.com']
    start_urls = ['https://www.fang.com/SoufunFamily.htm']

    def parse(self, response):
        tr_id = None
        province = None
        trs = response.xpath("//div[@class='outCont']//tr")
        # Collect the new-home and esf links for every city in every province.
        # Rows belonging to the same province share an id attribute; only the
        # first row of a province carries the province name in td[2].
        for tr in trs:
            new_tr_id = tr.xpath("@id").get()
            if tr_id != new_tr_id:
                tr_id = new_tr_id
                province = tr.xpath("./td[2]//text()").get()
            citys = tr.xpath("./td[3]/a")
            for city in citys:
                city_name = city.xpath("text()").get()
                city_url = city.xpath("@href").get()
                city_newhouse_url = city_url.replace(".", ".newhouse.", 1) + "house/s/"
                city_esf_url = city_url.replace(".", ".esf.", 1)
                yield scrapy.Request(city_newhouse_url, callback=self.parse_newhouse,
                                     meta={"info": (province, city_name)})
                yield scrapy.Request(city_esf_url, callback=self.parse_esf,
                                     meta={"info": (province, city_name)})

    def parse_newhouse(self, response):
        province, city_name = response.meta["info"]
        type = "新房"  # "new home"
        houses = response.xpath("//div[@id='newhouse_loupai_list']/ul/li[@id]")
        for house in houses:
            house_name = house.xpath(".//div[@class='nlcd_name']/a/text()").get().strip()
            size = house.xpath(".//div[@class='house_type clearfix']/a/text()").getall()
            size = ",".join(size)
            address = house.xpath(".//div[@class='address']/a/@title").get()
            tel = house.xpath(".//div[@class='tel']/p//text()").getall()
            tel = "".join(tel)
            price = house.xpath(".//div[@class='nhouse_price']/*/text()").getall()
            price = " ".join(price)
            item = FangItem(province=province, city_name=city_name, house_name=house_name,
                            size=size, address=address, tel=tel, price=price, type=type)
            yield item
        # Follow the next page of results
        next_url = response.xpath("//a[@class='active']/following-sibling::a[1]/@href").get()
        if next_url:
            next_url = response.urljoin(next_url)
            yield scrapy.Request(next_url, callback=self.parse_newhouse,
                                 meta={"info": (province, city_name)})

    def parse_esf(self, response):
        # Crawling second-hand listings works the same way as parse_newhouse above
        pass
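As a hedged sketch, `parse_esf` could mirror `parse_newhouse` as below. Every XPath selector here is a hypothetical placeholder that must be verified against the real esf page markup; only the overall shape (extract fields, yield an item, follow the next page) is taken from the article. The sketch yields a plain dict (which Scrapy accepts) to stay self-contained; in the project you would build a FangItem instead:

```python
# Sketch of parse_esf mirroring parse_newhouse. All XPath class names below
# are guesses -- check them against the live esf page before relying on them.

def join_clean(parts, sep=" "):
    """Join text fragments, dropping empty and whitespace-only pieces."""
    return sep.join(p.strip() for p in parts if p and p.strip())

def parse_esf(self, response):
    province, city_name = response.meta["info"]
    for house in response.xpath("//div[@class='houseList']/dl"):  # selector is a guess
        yield {
            "province": province,
            "city_name": city_name,
            "house_name": join_clean(house.xpath(".//p[@class='title']/a/text()").getall()),
            "price": join_clean(house.xpath(".//span[@class='price']/text()").getall()),
            "type": "二手房",  # "second-hand home"
        }
    next_url = response.xpath("//a[text()='下一页']/@href").get()  # "next page" link, a guess
    if next_url:
        yield response.follow(next_url, callback=self.parse_esf,
                              meta={"info": (province, city_name)})

print(join_clean(["  120㎡ ", "", "\n3室2厅"]))  # -> 120㎡ 3室2厅
```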
4. Save the crawled data to a JSON file
from scrapy.exporters import JsonLinesItemExporter


class FangPipeline:
    # Called when the spider is opened
    def open_spider(self, spider):
        print("Spider starting...")
        file_name = "fang.json"
        self.fp = open(file_name, "wb")  # the file must be opened in binary mode
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding="utf-8")

    # Called for every item the spider yields
    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    # Called when the spider is closed
    def close_spider(self, spider):
        self.fp.close()
        print("Spider finished")
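JsonLinesItemExporter writes one JSON object per line ("JSON lines"), which is why the file is opened in binary mode and why `ensure_ascii=False` matters for Chinese text. A stdlib sketch of the same output format (the file name and sample items are illustrative):

```python
import json

# What the exporter produces: one JSON object per line.
# ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes.
items = [
    {"city_name": "上海", "type": "新房", "price": "5 万元/平方米"},
    {"city_name": "上海", "type": "二手房", "price": "600万"},
]
with open("fang_demo.json", "wb") as fp:  # hypothetical demo file
    for item in items:
        line = json.dumps(item, ensure_ascii=False) + "\n"
        fp.write(line.encode("utf-8"))  # binary mode, hence the explicit encode

with open("fang_demo.json", encoding="utf-8") as fp:
    lines = fp.readlines()
print(len(lines))  # 2
```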
5. Edit the settings file settings.py
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
'fang.pipelines.FangPipeline': 300,
}
6. Start the spider
scrapy crawl fangtianxia
Extension: converting the single-machine spider into a distributed one
- Reference: https://www.jianshu.com/p/5cd97ca134ef
1. Install scrapy-redis
pip3 install scrapy-redis -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
2. Change the spider's base class from scrapy.Spider to scrapy_redis.spiders.RedisSpider
3. Delete start_urls = ['https://www.fang.com/SoufunFamily.htm'] and add a redis_key instead
# start_urls = ['https://www.fang.com/SoufunFamily.htm']
# The start URL must be pushed into redis as a list entry:
# LPUSH sfw:start_url https://www.fang.com/SoufunFamily.htm
redis_key = "sfw:start_url"
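Putting steps 2 and 3 together, the top of the spider would look roughly like this; the parse methods stay unchanged (this assumes scrapy-redis is installed and exposes `RedisSpider` under `scrapy_redis.spiders`):

```python
from scrapy_redis.spiders import RedisSpider


class FangtianxiaSpider(RedisSpider):
    name = 'fangtianxia'
    allowed_domains = ['fang.com']
    # Instead of start_urls, the spider blocks on this redis list
    # and pops start URLs that were pushed with LPUSH
    redis_key = "sfw:start_url"
```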
4. Add the scrapy-redis settings to the config file
# 1: Use scrapy-redis's duplicate filter instead of Scrapy's built-in one
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# 2: Use the scheduler reimplemented by scrapy-redis
#    instead of Scrapy's internal scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# 3: Persist the queue so crawls can resume (similar to Scrapy's JOBDIR):
#    pending requests are kept in the redis database, the redis queue is not
#    cleared on shutdown, and the next run continues from where it stopped
SCHEDULER_PERSIST = True
# 4: Choose the task-queue mode (pick one of three):
# SpiderPriorityQueue is the scrapy-redis default; requests carry priorities
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
# FIFO queue: tasks are processed first in, first out
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
# LIFO stack: tasks are processed first in, last out
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"
# 5: This pipeline collects the items scraped by every worker
#    into the shared redis database
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
# 6: Host IP of the redis server (the remote machine's address)
REDIS_HOST = '127.0.0.1'
# Port of the redis server
REDIS_PORT = 6379
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
5. Push the start URL into redis
# The start URL must be pushed into redis as a list entry
LPUSH sfw:start_url https://www.fang.com/SoufunFamily.htm
6. Start the spider on each worker machine; the scraped items will appear in redis
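With RedisPipeline enabled, items land in a redis list named `<spidername>:items` by default, so (assuming the default key and a reachable redis server) they can be inspected with redis-cli:

```shell
# Default scrapy-redis item key is "<spidername>:items"
redis-cli LLEN fangtianxia:items          # how many items have been scraped so far
redis-cli LRANGE fangtianxia:items 0 4    # peek at the first five items
```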