Scrapy Crawler Part 1: Listing Data from a Real Estate Website
A friend of mine was planning to list an apartment for sale at a real estate agency. To quote a price that best matched the market, the plan was to: 1. collect the latest second-hand housing prices for the neighborhood; 2. train a machine-learning model on the collected price data to obtain a price-prediction model; 3. feed the friend's own apartment details, suitably processed, into the model to get a predicted price.
There are several crawling tools to choose from. Beautiful Soup, for example, provides simple, Pythonic functions for navigating, searching and modifying the parse tree. It is a toolbox that parses a document and hands you the data you want to scrape; because it is so simple, a complete application takes very little code. Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8, so you normally do not have to think about encodings at all; only when a document does not declare its encoding and Beautiful Soup cannot detect it do you need to specify the original encoding yourself. It also works on top of excellent parsers such as lxml and html5lib, letting you choose between flexible parsing strategies and raw speed.
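As a quick illustration, here is a minimal Beautiful Soup sketch; the HTML fragment and tag contents are made up purely for the example:

from bs4 import BeautifulSoup

html = '<div class="house-details"><span>2 rooms</span><span>89 sqm</span></div>'
soup = BeautifulSoup(html, 'html.parser')   # built-in parser; 'lxml' also works if installed
for span in soup.find_all('span'):          # iterate over all <span> tags in the fragment
    print(span.get_text())                  # prints "2 rooms", then "89 sqm"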
But if what you need is a stable, long-lived web crawling framework, Scrapy is the obvious choice. Its main components are the following:
Scrapy Engine: handles the data flow between all the other components and triggers events (the core of the framework).
Scheduler: accepts requests from the engine, queues them, and hands them back when the engine asks for the next one. Think of it as a priority queue of URLs (the addresses of the pages to crawl) that decides which URL to fetch next and removes duplicates.
Downloader: downloads page content and passes it back to the spiders (the Scrapy downloader is built on Twisted, an efficient asynchronous networking framework).
Spiders: do the actual extraction work, pulling the required information (the so-called Items) out of specific pages. They can also extract links for Scrapy to follow to the next page.
Item Pipeline: processes the Items extracted by the spiders; its main jobs are persisting items, validating them, and cleaning out unwanted data. After a page has been parsed by a spider, its items are sent to the pipeline and processed through a fixed sequence of steps (a minimal pipeline sketch follows after this list).
Downloader Middlewares: sit between the engine and the downloader and process the requests and responses that pass between them.
Spider Middlewares: sit between the engine and the spiders and process the spiders' response input and request output.
Scheduler Middlewares: sit between the engine and the scheduler and process the requests and responses that pass between them.
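To make the Item Pipeline component concrete, here is a minimal sketch of a pipeline that validates items and drops incomplete ones. The field name 'zj' (total price) matches the items yielded by the spider later in this article; the class name and the exact check are only illustrative:

from scrapy.exceptions import DropItem

class PriceValidationPipeline(object):
    """Drop listings that arrive without a total price."""
    def process_item(self, item, spider):
        if not item.get('zj'):                      # 'zj' is the total-price field
            raise DropItem('missing total price in %r' % item)
        return item                                 # keep the item, pass it downstream

Such a class would only take effect if registered under ITEM_PIPELINES in settings.py (left commented out in the settings shown later in this article).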
A few points to note when installing Scrapy:
1. It is best to start with a virtual environment. On Windows you can create one with Python's venv module.
2. Install Twisted.
3. With the virtual environment activated, run pip install scrapy.
4. Create a Scrapy project (the corresponding commands are sketched below).
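On a Windows CMD prompt the steps above look roughly like this; the environment name env and the project name ajk are only examples, and on some Windows setups Twisted may need to be installed from a pre-built wheel instead:

python -m venv env
env\Scripts\activate
pip install Twisted
pip install scrapy
scrapy startproject ajk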
I browsed several real estate websites and settled on a well-known listing platform, scraping the second-hand listings for the area where my friend's apartment is located.
Opening the site in the browser's developer tools shows the HTML behind each piece of information. The key fields to capture are:
community name, layout, floor, year built, total price, and unit price.
To make debugging easier, start with Scrapy's shell, which lets you experiment interactively from a CMD terminal.
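For example, a shell session for one listing page might look like this; the URL and CSS selectors are the ones used in the spider below, and the exact results depend on the page's current markup:

scrapy shell "https://shanghai.anjuke.com/sale/biyun/p1/"
>>> fangwu = response.css('div.house-details')                 # one node per listing
>>> fangwu[0].css('div.details-item span::text').extract()     # layout, area, floor, year, ...
>>> response.css('span.unit-price ::text').extract()           # unit prices
>>> response.css('span.price-det strong::text').extract()      # total prices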
Debugging in the shell went smoothly, but the crawl failed when run as a script: the site has anti-crawling measures, recognized Scrapy's default request headers, and blocked the crawler's requests.
The cure is to rotate random User-Agent headers and proxy IPs. The code is as follows:
Spiders.py
import scrapy


class fjSpider(scrapy.Spider):
    name = "fj"
    start_urls = [
        'https://shanghai.房产网站.com/sale/biyun/p1/',
    ]

    def parse(self, response):
        # Each listing on the page corresponds to one div.house-details node.
        fangwu = response.css('div.house-details')
        # Unit price and total price live outside the details block.
        djs = response.css('span.unit-price ::text').extract()
        zjs = response.css('span.price-det strong::text').extract()

        for i in range(len(fangwu)):
            yield {
                'xq': fangwu[i].css('div.details-item span::text')[5].extract(),  # community name
                'fx': fangwu[i].css('div.details-item span::text')[0].extract(),  # layout
                'mj': fangwu[i].css('div.details-item span::text')[1].extract(),  # area
                'lc': fangwu[i].css('div.details-item span::text')[2].extract(),  # floor
                'nf': fangwu[i].css('div.details-item span::text')[3].extract(),  # year built
                'zj': zjs[i],  # total price
                'dj': djs[i],  # unit price
            }

        # Queue pages 2 to 41; each response comes back to parse() for extraction.
        for i in range(2, 42):
            url = 'https://shanghai.anjuke.com/sale/biyun/p' + str(i)
            yield response.follow(url, callback=self.parse)
ajk middlewares.py
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

import base64
import random

from scrapy import signals
class RandomUserAgent(object):
    """Randomly rotate user agents based on a list of predefined ones."""

    def __init__(self, agents):
        self.agents = agents

    @classmethod
    def from_crawler(cls, crawler):
        # Read the USER_AGENTS list from the project settings.
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        # Pick a random user agent from the configured list for every request.
        request.headers.setdefault('User-Agent', random.choice(self.agents))
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Route each request through a randomly chosen proxy from PROXIES.
        proxy = random.choice(PROXIES)
        request.meta['proxy'] = "http://%s" % proxy['ip_port']
        if proxy['user_pass']:
            # Proxy requires authentication: send a Basic Proxy-Authorization header.
            encoded_user_pass = base64.b64encode(proxy['user_pass'].encode()).decode()
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
            print("**************ProxyMiddleware have pass************" + proxy['ip_port'])
        else:
            print("**************ProxyMiddleware no pass************" + proxy['ip_port'])
class AjkSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
# Note: RandomUserAgent reads this list via crawler.settings.getlist('USER_AGENTS'),
# so the same list must also be defined in settings.py for the rotation to work.
USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
]
PROXIES = [
    {'ip_port': '221.1.200.242:38652', 'user_pass': ''},
    {'ip_port': '113.117.193.203:9999', 'user_pass': ''},
    {'ip_port': '112.87.68.184:9999', 'user_pass': ''},
    {'ip_port': '36.7.128.146:5222', 'user_pass': ''},
    {'ip_port': '116.113.27.170:47849', 'user_pass': ''},
    #{'ip_port': '122.224.249.122:8088', 'user_pass': ''},
]
class AjkDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
ajk settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for ajk project
#
# For simplicity, this file contains only
settings considered important or
# commonly used. You can find more settings
consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = '<crawler name>'

SPIDER_MODULES = ['<project name>.spiders']
NEWSPIDER_MODULE = '<project name>.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)'
# Obey robots.txt rules
#ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'ajk.middlewares.AjkSpiderMiddleware': 543,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    '***.middlewares.RandomUserAgent': 543,
}
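# Note: only RandomUserAgent is registered above. For the ProxyMiddleware defined in
# middlewares.py to take effect, it would also need an entry in DOWNLOADER_MIDDLEWARES,
# for example (priority value chosen only for illustration):
#    '***.middlewares.ProxyMiddleware': 544,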
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# '***.pipelines.Pipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Enter the crawl command in the CMD window (the -o option writes the scraped items to the named file, and the .csv extension selects CSV output):
scrapy crawl fj -o <output file>.csv
The crawl ran successfully, covering all 41 pages of listings (up to p41), and the output was saved as a CSV file.
After the crawler finished, opening the CSV file showed a total of 2,453 second-hand listings for the whole area! The next step is to process this data; see the next article: Scrapy Crawler Part 2: Data Processing and Visualization with Altair.