2019-01-14

Scrapy Crawler, Part 1: Scraping Listings from a Real Estate Website

A friend of mine plans to list an apartment for sale through a local agency. To quote a price that tracks the market as closely as possible, the plan is to: 1. collect the latest second-hand listing prices for the surrounding area; 2. train a machine-learning model on the collected price data to obtain a price-prediction model; 3. feed the friend's apartment details into that model and get a predicted price.


There are several scraping tools to choose from. Beautiful Soup, for example, provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that extracts the data you need from a parsed document, and because it is so simple, a complete application takes very little code. Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8, so you normally do not need to think about encodings at all; only when a document does not declare its encoding (and Beautiful Soup cannot detect it) do you have to specify the original encoding yourself. Beautiful Soup sits on top of excellent Python parsers such as lxml and html5lib, letting you choose between flexible parsing strategies and raw speed.
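For example, a minimal Beautiful Soup snippet might look like this; the HTML fragment and the values in it are made up purely for illustration:

# Minimal Beautiful Soup usage, just for comparison with Scrapy below
from bs4 import BeautifulSoup

html = '<div class="house-details"><span>2室1厅</span><span>72平米</span></div>'
soup = BeautifulSoup(html, 'lxml')  # the parser can also be 'html.parser' or 'html5lib'
spans = soup.select('div.house-details span')
print([s.get_text() for s in spans])  # ['2室1厅', '72平米']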

But if what you need is a stable, long-lived crawling framework, nothing beats Scrapy. Its architecture consists of the following components:

Engine (Scrapy Engine): handles the data flow between all the other components and triggers events (the core of the framework).

Scheduler: accepts requests from the engine, pushes them into a queue, and returns them when the engine asks again. You can think of it as a priority queue of URLs (the addresses of the pages to crawl); it decides which URL to fetch next and removes duplicate URLs.

Downloader: fetches page content and hands it back to the spiders (the downloader is built on Twisted, an efficient asynchronous networking framework).

Spiders: do the main work. They extract the information you need, the so-called Items, from specific pages, and they can also extract links for Scrapy to follow on subsequent pages.

Item Pipeline: processes the items extracted by the spiders. Its main jobs are persisting items, validating them, and discarding unwanted data. After a spider parses a page, the items are sent through the pipeline and processed by several stages in a defined order (a minimal pipeline sketch follows this list).

Downloader middlewares: a hook framework between the engine and the downloader that processes the requests and responses passing between them.

Spider middlewares: a hook framework between the engine and the spiders that processes the spiders' input (responses) and output (requests and items).

Scheduler middlewares: sit between the engine and the scheduler and process the requests sent from the engine to the scheduler and the responses coming back.
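As a concrete illustration of the pipeline stage, here is a minimal sketch of an item pipeline that validates and cleans the listing items produced later in this post. The 'zj' (total price) key matches the spider below; the class name ListingCleanPipeline is just an example, and it would need to be registered under ITEM_PIPELINES in settings.py to take effect.

# pipelines.py -- minimal sketch of a validating/cleaning pipeline
from scrapy.exceptions import DropItem

class ListingCleanPipeline(object):

    def process_item(self, item, spider):
        # Drop listings that are missing the total price ('zj').
        if not item.get('zj'):
            raise DropItem('missing total price in %r' % item)
        # Strip surrounding whitespace from every string field.
        for key, value in item.items():
            if isinstance(value, str):
                item[key] = value.strip()
        return item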

A few things to note about installing Scrapy:

1. It is best to set up a virtual environment first. On Windows you can create one with Python's venv module.

2. Install Twisted, which Scrapy depends on.

3. Activate the virtual environment and run pip install scrapy.

4. Create a Scrapy project (the full command sequence is sketched below).
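Put together, the setup could look roughly like this in a Windows command prompt. The project name ajk and the spider name fj are the ones used later in this post, and on some Windows setups Twisted may need to be installed from a pre-built wheel instead of pip.

python -m venv env
env\Scripts\activate
pip install Twisted
pip install scrapy
scrapy startproject ajk
cd ajk
scrapy genspider fj shanghai.anjuke.com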



I browsed several property websites and settled on a well-known agency listing platform, scraping the second-hand listings for the sub-district where my friend's apartment is located.

Opening the site in the browser's developer tools shows the HTML elements that carry the listing data. The key fields to extract are:

complex (xiaoqu) name, layout, floor, year built, total price, and unit price.
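The spider below simply yields plain dicts, but if you prefer declared fields, an items.py along these lines would cover them; the ListingItem class name is illustrative, and the short keys mirror the ones used in the spider.

# items.py -- optional Item declaration for the listing fields
import scrapy

class ListingItem(scrapy.Item):
    xq = scrapy.Field()  # complex (xiaoqu) name
    fx = scrapy.Field()  # layout, e.g. "2室1厅"
    mj = scrapy.Field()  # floor area
    lc = scrapy.Field()  # floor
    nf = scrapy.Field()  # year built
    zj = scrapy.Field()  # total price
    dj = scrapy.Field()  # unit price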

To make debugging easier, start with Scrapy's shell, which gives you an interactive session in the CMD terminal.
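A session along the following lines lets you test CSS selectors before putting them into the spider; the URL and selectors are the ones used in the code below, and the exact output of course depends on the live page.

# In CMD:
scrapy shell "https://shanghai.anjuke.com/sale/biyun/p1/"

# Then, at the shell prompt:
fangwu = response.css('div.house-details')  # one node per listing card
fangwu[0].css('div.details-item span::text').extract()  # layout, area, floor, year, ...
response.css('span.unit-price ::text').extract_first()  # unit price of the first listing
response.css('span.price-det strong::text').extract_first()  # total price of the first listing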


Debugging in the shell went smoothly, but running the spider as a script failed: the site has anti-crawling measures, recognizes Scrapy's default request headers, and blocks the crawler's requests.

The workaround is to rotate random User-Agent headers and use proxy IPs. The code follows.

Spiders.py

import scrapy


class fjSpider(scrapy.Spider):
    name = "fj"
    start_urls = [
        'https://shanghai.anjuke.com/sale/biyun/p1/',
    ]

    def parse(self, response):
        # One div.house-details node per listing card on the page.
        fangwu = response.css('div.house-details')

        # Unit price and total price sit outside the details block, so
        # collect them page-wide and index them per listing.
        djs = response.css('span.unit-price ::text').extract()
        zjs = response.css('span.price-det strong::text').extract()

        for i in range(0, len(fangwu)):
            yield {
                'xq': fangwu[i].css('div.details-item span::text')[5].extract(),  # complex name
                'fx': fangwu[i].css('div.details-item span::text')[0].extract(),  # layout
                'mj': fangwu[i].css('div.details-item span::text')[1].extract(),  # floor area
                'lc': fangwu[i].css('div.details-item span::text')[2].extract(),  # floor
                'nf': fangwu[i].css('div.details-item span::text')[3].extract(),  # year built
                'zj': zjs[i],  # total price
                'dj': djs[i],  # unit price
            }

        # The listing section has 41 pages; queue pages 2..41 and hand each
        # of them back to parse() for extraction.
        for i in range(2, 42):
            url = 'https://shanghai.anjuke.com/sale/biyun/p' + str(i)
            yield response.follow(url, callback=self.parse)

        # Alternative: follow the "next page" link instead of hard-coding
        # the page range.
        # next_page = response.css('div.page_al p a::attr(href)')[2].extract()
        # if next_page is not None:
        #     next_page = response.urljoin(next_page)
        #     yield scrapy.Request(next_page, callback=self.parse)

ajk middlewares.py

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

import base64
import random

from scrapy import signals


class RandomUserAgent(object):
    """Randomly rotate user agents based on a list of predefined ones."""

    def __init__(self, agents):
        self.agents = agents

    @classmethod
    def from_crawler(cls, crawler):
        # Reads the USER_AGENTS list from the project settings, so the
        # list further down should also be declared in settings.py.
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        # Pick a random user agent for every outgoing request.
        request.headers.setdefault('User-Agent', random.choice(self.agents))


class ProxyMiddleware(object):

    def process_request(self, request, spider):
        # PROXIES is the list defined further down in this module.
        proxy = random.choice(PROXIES)

        if proxy['user_pass']:
            request.meta['proxy'] = "http://%s" % proxy['ip_port']
            encoded_user_pass = base64.b64encode(proxy['user_pass'].encode()).decode()
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
            print("**************ProxyMiddleware have pass************" + proxy['ip_port'])
        else:
            print("**************ProxyMiddleware no pass************" + proxy['ip_port'])
            request.meta['proxy'] = "http://%s" % proxy['ip_port']

class AjkSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
]


PROXIES = [
    {'ip_port': '221.1.200.242:38652', 'user_pass': ''},
    {'ip_port': '113.117.193.203:9999', 'user_pass': ''},
    {'ip_port': '112.87.68.184:9999', 'user_pass': ''},
    {'ip_port': '36.7.128.146:5222', 'user_pass': ''},
    {'ip_port': '116.113.27.170:47849', 'user_pass': ''},
    #{'ip_port': '122.224.249.122:8088', 'user_pass': ''},
]


class AjkDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

ajk settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for ajk project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'ajk'

SPIDER_MODULES = ['ajk.spiders']
NEWSPIDER_MODULE = 'ajk.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)'

# Obey robots.txt rules
#ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'ajk.middlewares.AjkSpiderMiddleware': 543,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# RandomUserAgent reads the USER_AGENTS list via settings.getlist(), so the
# list shown in middlewares.py should also be declared here.
DOWNLOADER_MIDDLEWARES = {
    'ajk.middlewares.RandomUserAgent': 543,
    'ajk.middlewares.ProxyMiddleware': 544,
}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'ajk.pipelines.Pipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


Run the crawl from the CMD prompt:

scrapy crawl fj -o <output file>.csv

The run succeeded. The spider happened to start scraping from the last page, p41, and the results were saved to the CSV file.
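As a quick sanity check, the exported CSV can be loaded and counted, for example with pandas; the file name listings.csv is just a stand-in for whatever was passed to -o, and the columns are the keys yielded by the spider.

import pandas as pd

df = pd.read_csv('listings.csv')  # placeholder for the file passed to -o
print(len(df))  # number of scraped listings
print(df[['xq', 'fx', 'mj', 'lc', 'nf', 'zj', 'dj']].head())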


After the spider finished, I opened the CSV file and found a total of 2,453 records covering all the second-hand listings in the area! The next step is to process the data; see the next post: Scrapy Crawler, Part 2: Data Processing and Visualization with Altair.
