Crawling Chengdu Lianjia Second-Hand Listings and Transaction Data with Python + Scrapy

Crawler Design

Crawl Targets

  1. Second-hand listing and transaction data from Chengdu Lianjia.
  2. The web version hides the latest transaction prices, so the data has to come from the mobile version.
  3. Transaction records should be deduplicated; this can be done as a daily incremental crawl.
  4. The crawl should run once a day on a schedule.


Technical Approach

  1. Use the Scrapy framework, implementing the spider and the pipeline.
  2. Intercept the mobile site's network requests to capture the Cookie, then simulate a mobile request to obtain the unhidden transaction data. A captured request looks like this (a sketch of replaying it in Scrapy follows the capture):
GET /cd/chengjiao/ HTTP/1.1
Host: m.lianjia.com
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Connection: keep-alive
Cookie: _staticData=%7B%0A%20%20%22deviceId%22%20%3A%20%22156E5777-0F30-402D-88B9-D62EB3B9233D%22%2C%0A%20%20%22appVersion%22%20%3A%20%228.3.2%22%2C%0A%20%20%22scheme%22%20%3A%20%22lianjia%22%2C%0A%20%20%22appName%22%20%3A%20%22%E9%93%BE%E5%AE%B6%22%2C%0A%20%20%22extraData%22%20%3A%20%7B%0A%20%20%20%20%22cityId%22%20%3A%20%22510100%22%2C%0A%20%20%20%20%22cityName%22%20%3A%20%22%E6%88%90%E9%83%BD%22%0A%20%20%7D%2C%0A%20%20%22sysModel%22%20%3A%20%22iPhone%22%2C%0A%20%20%22deviceInfo%22%20%3A%20%7B%0A%20%20%20%20%22uuid%22%20%3A%20%2235F36686-461C-4BD1-A904-DA63EC64E6EC%22%2C%0A%20%20%20%20%22udid%22%20%3A%20%22156E5777-0F30-402D-88B9-D62EB3B9233D%22%2C%0A%20%20%20%20%22ssid%22%20%3A%20%223C2F87B9-CC71-4769-A766-7EDC2582802D%22%0A%20%20%7D%2C%0A%20%20%22sysVersion%22%20%3A%20%229.2%22%2C%0A%20%20%22network%22%20%3A%20%22WIFI%22%2C%0A%20%20%22userInfo%22%20%3A%20%7B%0A%0A%20%20%7D%0A%7D; lianjia_ssid=3C2F87B9-CC71-4769-A766-7EDC2582802D; lianjia_token=; lianjia_udid=156E5777-0F30-402D-88B9-D62EB3B9233D; lianjia_uuid=35F36686-461C-4BD1-A904-DA63EC64E6EC; CNZZDATA1253491255=481059203-1514046308-%7C1514046308; CNZZDATA1254525948=972699380-1514042713-%7C1514042713; lj-ss=9fc6cee08e4d99ced4584517044e1242; Hm_lpvt_9152f8221cb6243a53c83b956842be8a=1514046865; Hm_lvt_9152f8221cb6243a53c83b956842be8a=1514046657,1514046865; UM_distinctid=160843690e4f6-091b6dfa-d313861-2c600-160843690e56d; _ga=GA1.2.12607153.1514046656; _gat=1; _gat_global=1; _gat_new=1; _gat_new_global=1; _gat_past=1; _gid=GA1.2.1980141963.1514046656; gr_session_id_a1a50f141657a94e=00c98fea-0a57-4701-bfc6-4593a17a8509; gr_user_id=8e2fc122-9c57-4ee4-9ad5-a972aed85d48; lianjia_ssid=3C2F87B9-CC71-4769-A766-7EDC2582802D; lianjia_token=; lianjia_udid=156E5777-0F30-402D-88B9-D62EB3B9233D; lianjia_uuid=35F36686-461C-4BD1-A904-DA63EC64E6EC; select_city=510100; select_nation=1
User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 9_2 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Mobile/13C75 GrowingIO/lianjia_0.1-20170215211354 shlianjia/shlianjia Lianjia/8.3.2.5
Accept-Language: zh-cn
Referer: https://m.lianjia.com/cd/fangjia
Accept-Encoding: gzip, deflate
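
Scrapy can replay this capture by attaching the mobile User-Agent and the captured cookies to its requests. Below is a minimal sketch, assuming the Cookie header above has been saved to a local file named mobile_cookie.txt (a made-up name) and using an illustrative spider name; it is not the exact spider from this project.

# -*- coding: utf-8 -*-
import scrapy

MOBILE_UA = ('Mozilla/5.0 (iPhone; CPU iPhone OS 9_2 like Mac OS X) '
             'AppleWebKit/601.1.46 (KHTML, like Gecko) Mobile/13C75 Lianjia/8.3.2.5')

def parse_cookie_string(cookie_str):
    # Turn a raw "k1=v1; k2=v2" Cookie header into the dict Scrapy expects.
    cookies = {}
    for pair in cookie_str.split(';'):
        if '=' in pair:
            key, _, value = pair.strip().partition('=')
            cookies[key] = value
    return cookies

class ChengjiaoSpider(scrapy.Spider):
    name = 'chengjiao'  # hypothetical spider name

    def start_requests(self):
        # mobile_cookie.txt (hypothetical) holds the Cookie header captured above
        with open('mobile_cookie.txt') as f:
            cookies = parse_cookie_string(f.read().strip())
        yield scrapy.Request(
            'https://m.lianjia.com/cd/chengjiao/',
            headers={'User-Agent': MOBILE_UA,
                     'Referer': 'https://m.lianjia.com/cd/fangjia'},
            cookies=cookies,
            callback=self.parse_chengjiao)

    def parse_chengjiao(self, response):
        pass  # extract the transaction fields with XPath, as in the next step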
  3. Use lxml with XPath to parse the page content.
# Parse a second-hand listing page, e.g. http://cd.lianjia.com/ershoufang/dongcheng/pg2/
    # (requires "from lxml import etree" and the item classes from items.py)
    def detail_parse(self, response):
        try:
            contents = etree.HTML(response.body)
            houselist = contents.xpath('/html/body/div[4]/div[1]/ul/li')
            self.logger.info(houselist)
            for house in houselist:
                try:
                    item = CdlianjiaspiderItem()
                    item['title'] = house.xpath('div[1]/div[1]/a/text()').pop()
                    item['community'] = house.xpath('div[1]/div[2]/div/a/text()').pop()
                    # model and area share one "a | b | c" text node, split on '|'
                    item['model'] = house.xpath('div[1]/div[2]/div/text()').pop().split('|')[1]
                    item['area'] = house.xpath('div[1]/div[2]/div/text()').pop().split('|')[2]
                    # focus count, viewing count and listing time share one node, split on '/'
                    item['focus_num'] = house.xpath('div[1]/div[4]/text()').pop().split('/')[0]
                    item['watch_num'] = house.xpath('div[1]/div[4]/text()').pop().split('/')[1]
                    item['time'] = house.xpath('div[1]/div[4]/text()').pop().split('/')[2]
                    item['price'] = house.xpath('div[1]/div[6]/div[1]/span/text()').pop()
                    item['average_price'] = house.xpath('div[1]/div[6]/div[2]/span/text()').pop()
                    item['link'] = house.xpath('div[1]/div[1]/a/@href').pop()
                    item['city'] = response.meta["id1"]
                    self.url_detail = item['link']
                    #item['Latitude'] = self.get_latitude(self.url_detail)
                    self.logger.info("CdlianjiaspiderItem: %s" % item)
                    yield item
                except Exception:
                    # skip listings whose markup does not match the expected layout
                    continue
        except Exception as e:
            self.logger.info(e)
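
The parser assumes a CdlianjiaspiderItem whose fields match the keys filled in above. The project's items.py is not shown in this post; a minimal definition consistent with those fields would be:

import scrapy

# items.py: one Field per key the parser fills in
class CdlianjiaspiderItem(scrapy.Item):
    title = scrapy.Field()
    community = scrapy.Field()
    model = scrapy.Field()
    area = scrapy.Field()
    focus_num = scrapy.Field()
    watch_num = scrapy.Field()
    time = scrapy.Field()
    price = scrapy.Field()
    average_price = scrapy.Field()
    link = scrapy.Field()
    city = scrapy.Field()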
  4. Write each day's results out through the pipeline as JSON files, ershoufang_yyyy-mm-dd.json and chengjiao_yyyy-mm-dd.json respectively; mind the handling of Chinese text.
# pipelines.py
import json
import time
import codecs

# module path assumed from the project directory name CDLianjiaSpider
from CDLianjiaSpider.items import CdlianjiaspiderItem, ChengjiaoItem

class CdlianjiaspiderPipeline(object):

    def __init__(self):
        fileName = "ershoufang_" + time.strftime("%Y-%m-%d") + ".json"
        chengjiaoFileName = "chengjiao_" + time.strftime("%Y-%m-%d") + ".json"
        self.file = codecs.open(fileName, 'ab', encoding='utf-8')
        self.chengjiaoFile = codecs.open(chengjiaoFileName, 'ab', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        if isinstance(item, CdlianjiaspiderItem):
            self.file.write(line)
        elif isinstance(item, ChengjiaoItem):
            self.chengjiaoFile.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()
        self.chengjiaoFile.close()
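
The pipeline only runs if it is enabled in settings.py. Assuming the project module is named CDLianjiaSpider (inferred from the deployment path used later in this post), the entry would look like this:

# settings.py
ITEM_PIPELINES = {
    'CDLianjiaSpider.pipelines.CdlianjiaspiderPipeline': 300,
}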
  5. Deduplication of the crawl results is not handled yet. The two candidate approaches are incremental crawling and deduplication via MongoDB or Redis, planned for the next iteration; a minimal Redis sketch follows below.
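The Redis approach can be prototyped as one more pipeline: keep a Redis set of already-seen listing links and drop any item whose link is already in the set. This is only a sketch, assuming a local Redis instance, the redis-py package, and a made-up set key chengjiao:seen:

import redis
from scrapy.exceptions import DropItem

class DedupPipeline(object):

    def __init__(self):
        # assumes Redis is running locally on the default port
        self.client = redis.StrictRedis(host='localhost', port=6379, db=0)

    def process_item(self, item, spider):
        # SADD returns 1 when the member is new, 0 when it was already present
        if self.client.sadd('chengjiao:seen', item['link']) == 0:
            raise DropItem("duplicate item: %s" % item['link'])
        return item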
  6. Use crond to schedule the crawl once a day.
crontab -e

# entry to add: run every day at 8pm
0 20 * * * nohup python2.7 /home/nodejs/python/CDLianjiaSpider/run.py &

The entry above fails because python2.7 cannot be found, so the command has to be wrapped in a shell script instead; remember to make the script executable.
Every command run from cron needs an absolute path.
scrapy crawl must be run from the spider project's root directory, otherwise it reports that the crawl command does not exist, so cd into the spider directory first.

$ vim run.sh

#!/bin/sh
cd /home/nodejs/python/CDLianjiaSpider
nohup /usr/local/bin/python2.7 /home/nodejs/python/CDLianjiaSpider/run.py &

$ crontab -e

# entry to add: run every day at 8pm
0 20 * * * /home/nodejs/python/CDLianjiaSpider/run.sh
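
run.py itself is not listed in this post. A minimal version that starts both spiders in one process could use Scrapy's CrawlerProcess; the spider names ershoufang and chengjiao below are assumptions, not the project's actual names:

# run.py: run both spiders in a single process
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('ershoufang')  # assumed spider name
process.crawl('chengjiao')   # assumed spider name
process.start()  # blocks until both spiders have finished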

Summary: Installing Python 2.7 + Scrapy on macOS

Background

macOS actually ships with Python 2.7, but without pip, IPython, and other essential tools. Normally they could be installed directly, but since upgrading to 10.12, SIP (System Integrity Protection) denies permission to install components, because Python lives in a system directory. Some workaround is therefore needed.

Solutions

There are two options:
1. Reinstall Python so that it lands under /usr/local/bin, where you do have write permission; this avoids all subsequent permission problems.
2. Disable SIP. This is strongly discouraged, as it exposes the system to considerable risk.

Reinstalling Python into /usr/local/bin

  1. Install Python:
brew install python

This formula installs a python2 executable to /usr/local/bin.
If you wish to have this formula's python executable in your PATH then add
the following to ~/.zshrc:
  export PATH="/usr/local/opt/python/libexec/bin:$PATH"

Pip and setuptools have been installed. To update them
  pip2 install --upgrade pip setuptools

You can install Python packages with
  pip2 install <package>

They will install into the site-package directory
  /usr/local/lib/python2.7/site-packages
  2. Update the configuration: as the hint above says, adjust PATH so the default python is the one under /usr/local; otherwise the old system Python keeps being used.
  3. From then on, install packages with pip2 rather than pip, so they end up under /usr/local/lib/python2.7.
  4. Install Scrapy:
pip2 install scrapy

Summary: Installing Python 2.7 + Scrapy on CentOS 6

Background

I develop the crawler locally and deploy it to an Aliyun server, which runs CentOS 6.x.

Reference Articles

  1. CentOs 6安装python2.7.13及异常解决 (installing Python 2.7.13 on CentOS 6 and fixing errors)
  2. CENTOS 6.5 安装 Python 2.7 总结 (summary of installing Python 2.7 on CentOS 6.5); the setuptools + pip download URL given in this article is dead, so I used the installation method from the first article instead.
  3. 安装python爬虫scrapy踩过的那些坑和编程外的思考 (pitfalls hit while installing Scrapy); consult this if the Scrapy install reports errors. I ran into one or two of the problems it covers, though not all of them.

Installation Steps

1. Install the Python build dependencies (I did not run this step in my own install):

yum groupinstall "Development tools"
yum install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel

2. Download and compile the Python 2.7.13 source:

wget https://www.python.org/ftp/python/2.7.13/Python-2.7.13.tgz
tar xf Python-2.7.13.tgz
cd Python-2.7.13
./configure --prefix=/usr/local
make
make install

Once the install succeeds, Python 2.7 can be found at /usr/local/bin/python2.7.

3. Install SQLite 3; without it Python reports errors at runtime. (Note that the sqlite3 module is only built if sqlite-devel is present when Python is compiled, so rebuild Python if it was compiled first.)

sudo yum install sqlite-devel

4. Install setuptools + pip:

# install setuptools first, then use it to install pip
wget --no-check-certificate https://pypi.python.org/packages/source/s/setuptools/setuptools-1.4.2.tar.gz
tar -vxf setuptools-1.4.2.tar.gz 
cd setuptools-1.4.2
python2.7 setup.py install
easy_install-2.7 pip

5. Install Scrapy:

pip2 install scrapy

Crawler Source Code

Source code on GitHub
