Scrapy Crawler Part 1: Listing Data from a Real Estate Website
A friend of mine was planning to list an apartment for sale at a real estate agency. To quote a price that best matched the market, the plan was to: 1. collect the latest second-hand housing prices for the neighborhood; 2. train a machine-learning model on the collected price data to obtain a price-prediction model; 3. feed the friend's own apartment details, suitably processed, into the model to get a predicted price.
There are several crawling tools to choose from. Beautiful Soup, for example, provides simple, Pythonic functions for navigating, searching and modifying the parse tree. It is a toolbox that parses a document and hands you the data you want to scrape; because it is so simple, a complete application takes very little code. Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8, so you normally do not have to think about encodings at all; only when a document does not declare its encoding and Beautiful Soup cannot detect it do you need to specify the original encoding yourself. It also works on top of excellent parsers such as lxml and html5lib, letting you choose between flexible parsing strategies and raw speed.
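As a quick illustration, here is a minimal Beautiful Soup sketch; the HTML fragment and tag contents are made up purely for the example:

from bs4 import BeautifulSoup

html = '<div class="house-details"><span>2 rooms</span><span>89 sqm</span></div>'
soup = BeautifulSoup(html, 'html.parser')   # built-in parser; 'lxml' also works if installed
for span in soup.find_all('span'):          # iterate over all <span> tags in the fragment
    print(span.get_text())                  # prints "2 rooms", then "89 sqm"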
But if what you need is a stable, long-lived web crawling framework, Scrapy is the obvious choice. Its main components are the following:
Scrapy Engine: handles the data flow between all the other components and triggers events (the core of the framework).
Scheduler: accepts requests from the engine, queues them, and hands them back when the engine asks for the next one. Think of it as a priority queue of URLs (the addresses of the pages to crawl) that decides which URL to fetch next and removes duplicates.
Downloader: downloads page content and passes it back to the spiders (the Scrapy downloader is built on Twisted, an efficient asynchronous networking framework).
Spiders: do the actual extraction work, pulling the required information (the so-called Items) out of specific pages. They can also extract links for Scrapy to follow to the next page.
Item Pipeline: processes the Items extracted by the spiders; its main jobs are persisting items, validating them, and cleaning out unwanted data. After a page has been parsed by a spider, its items are sent to the pipeline and processed through a fixed sequence of steps (a minimal pipeline sketch follows after this list).
Downloader Middlewares: sit between the engine and the downloader and process the requests and responses that pass between them.
Spider Middlewares: sit between the engine and the spiders and process the spiders' response input and request output.
Scheduler Middlewares: sit between the engine and the scheduler and process the requests and responses that pass between them.
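To make the Item Pipeline component concrete, here is a minimal sketch of a pipeline that validates items and drops incomplete ones. The field name 'zj' (total price) matches the items yielded by the spider later in this article; the class name and the exact check are only illustrative:

from scrapy.exceptions import DropItem

class PriceValidationPipeline(object):
    """Drop listings that arrive without a total price."""
    def process_item(self, item, spider):
        if not item.get('zj'):                      # 'zj' is the total-price field
            raise DropItem('missing total price in %r' % item)
        return item                                 # keep the item, pass it downstream

Such a class would only take effect if registered under ITEM_PIPELINES in settings.py (left commented out in the settings shown later in this article).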
A few points to note when installing Scrapy:
1. It is best to start with a virtual environment. On Windows you can create one with Python's venv module.
2. Install Twisted.
3. With the virtual environment activated, run pip install scrapy.
4. Create a Scrapy project (the corresponding commands are sketched below).
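On a Windows CMD prompt the steps above look roughly like this; the environment name env and the project name ajk are only examples, and on some Windows setups Twisted may need to be installed from a pre-built wheel instead:

python -m venv env
env\Scripts\activate
pip install Twisted
pip install scrapy
scrapy startproject ajk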
I browsed several real estate websites and settled on a well-known listing platform, scraping the second-hand listings for the area where my friend's apartment is located.
Opening the site in the browser's developer tools shows the HTML behind each piece of information. The key fields to capture are:
community name, layout, floor, year built, total price, and unit price.
To make debugging easier, start with Scrapy's shell, which lets you experiment interactively from a CMD terminal.
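For example, a shell session for one listing page might look like this; the URL and CSS selectors are the ones used in the spider below, and the exact results depend on the page's current markup:

scrapy shell "https://shanghai.anjuke.com/sale/biyun/p1/"
>>> fangwu = response.css('div.house-details')                 # one node per listing
>>> fangwu[0].css('div.details-item span::text').extract()     # layout, area, floor, year, ...
>>> response.css('span.unit-price ::text').extract()           # unit prices
>>> response.css('span.price-det strong::text').extract()      # total prices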
Debugging in the shell went smoothly, but the crawl failed when run as a script: the site has anti-crawling measures, recognized Scrapy's default request headers, and blocked the crawler's requests.
The cure is to rotate random User-Agent headers and proxy IPs. The code is as follows:
Spiders.py
import scrapy


class fjSpider(scrapy.Spider):
    name = "fj"
    start_urls = [
        'https://shanghai.房产网站.com/sale/biyun/p1/',
    ]

    def parse(self, response):
        # Each listing on the page corresponds to one div.house-details node.
        fangwu = response.css('div.house-details')
        # Unit price and total price live outside the details block.
        djs = response.css('span.unit-price ::text').extract()
        zjs = response.css('span.price-det strong::text').extract()

        for i in range(len(fangwu)):
            yield {
                'xq': fangwu[i].css('div.details-item span::text')[5].extract(),  # community name
                'fx': fangwu[i].css('div.details-item span::text')[0].extract(),  # layout
                'mj': fangwu[i].css('div.details-item span::text')[1].extract(),  # area
                'lc': fangwu[i].css('div.details-item span::text')[2].extract(),  # floor
                'nf': fangwu[i].css('div.details-item span::text')[3].extract(),  # year built
                'zj': zjs[i],  # total price
                'dj': djs[i],  # unit price
            }

        # Queue pages 2 to 41; each response comes back to parse() for extraction.
        for i in range(2, 42):
            url = 'https://shanghai.anjuke.com/sale/biyun/p' + str(i)
            yield response.follow(url, callback=self.parse)
ajk middlewares.py
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

import base64
import random

from scrapy import signals
class RandomUserAgent(object):
    """Randomly rotate user agents based on a list of predefined ones."""

    def __init__(self, agents):
        self.agents = agents

    @classmethod
    def from_crawler(cls, crawler):
        # Read the USER_AGENTS list from the project settings.
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        # Pick a random user agent from the configured list for every request.
        request.headers.setdefault('User-Agent', random.choice(self.agents))
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Route each request through a randomly chosen proxy from PROXIES.
        proxy = random.choice(PROXIES)
        request.meta['proxy'] = "http://%s" % proxy['ip_port']
        if proxy['user_pass']:
            # Proxy requires authentication: send a Basic Proxy-Authorization header.
            encoded_user_pass = base64.b64encode(proxy['user_pass'].encode()).decode()
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
            print("**************ProxyMiddleware have pass************" + proxy['ip_port'])
        else:
            print("**************ProxyMiddleware no pass************" + proxy['ip_port'])
class AjkSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
# Note: RandomUserAgent reads this list via crawler.settings.getlist('USER_AGENTS'),
# so the same list must also be defined in settings.py for the rotation to work.
USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
]
PROXIES = [
    {'ip_port': '221.1.200.242:38652', 'user_pass': ''},
    {'ip_port': '113.117.193.203:9999', 'user_pass': ''},
    {'ip_port': '112.87.68.184:9999', 'user_pass': ''},
    {'ip_port': '36.7.128.146:5222', 'user_pass': ''},
    {'ip_port': '116.113.27.170:47849', 'user_pass': ''},
    #{'ip_port': '122.224.249.122:8088', 'user_pass': ''},
]
class AjkDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
ajk settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for ajk project
#
# For simplicity, this file contains only
settings considered important or
# commonly used. You can find more settings
consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = '<crawler name>'

SPIDER_MODULES = ['<project name>.spiders']
NEWSPIDER_MODULE = '<project name>.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)'
# Obey robots.txt rules
#ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'ajk.middlewares.AjkSpiderMiddleware': 543,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    '***.middlewares.RandomUserAgent': 543,
}
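# Note: only RandomUserAgent is registered above. For the ProxyMiddleware defined in
# middlewares.py to take effect, it would also need an entry in DOWNLOADER_MIDDLEWARES,
# for example (priority value chosen only for illustration):
#    '***.middlewares.ProxyMiddleware': 544,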
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# '***.pipelines.Pipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Enter the crawl command in the CMD window (the -o option writes the scraped items to the named file, and the .csv extension selects CSV output):
scrapy crawl fj -o <output file>.csv
The crawl ran successfully, covering all 41 pages of listings (up to p41), and the output was saved as a CSV file.
After the crawler finished, opening the CSV file showed a total of 2,453 second-hand listings for the whole area! The next step is to process this data; see the next article: Scrapy Crawler Part 2: Data Processing and Visualization with Altair.