GET /cd/chengjiao/ HTTP/1.1
Host: m.lianjia.com
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Connection: keep-alive
Cookie: _staticData=%7B%0A%20%20%22deviceId%22%20%3A%20%22156E5777-0F30-402D-88B9-D62EB3B9233D%22%2C%0A%20%20%22appVersion%22%20%3A%20%228.3.2%22%2C%0A%20%20%22scheme%22%20%3A%20%22lianjia%22%2C%0A%20%20%22appName%22%20%3A%20%22%E9%93%BE%E5%AE%B6%22%2C%0A%20%20%22extraData%22%20%3A%20%7B%0A%20%20%20%20%22cityId%22%20%3A%20%22510100%22%2C%0A%20%20%20%20%22cityName%22%20%3A%20%22%E6%88%90%E9%83%BD%22%0A%20%20%7D%2C%0A%20%20%22sysModel%22%20%3A%20%22iPhone%22%2C%0A%20%20%22deviceInfo%22%20%3A%20%7B%0A%20%20%20%20%22uuid%22%20%3A%20%2235F36686-461C-4BD1-A904-DA63EC64E6EC%22%2C%0A%20%20%20%20%22udid%22%20%3A%20%22156E5777-0F30-402D-88B9-D62EB3B9233D%22%2C%0A%20%20%20%20%22ssid%22%20%3A%20%223C2F87B9-CC71-4769-A766-7EDC2582802D%22%0A%20%20%7D%2C%0A%20%20%22sysVersion%22%20%3A%20%229.2%22%2C%0A%20%20%22network%22%20%3A%20%22WIFI%22%2C%0A%20%20%22userInfo%22%20%3A%20%7B%0A%0A%20%20%7D%0A%7D; lianjia_ssid=3C2F87B9-CC71-4769-A766-7EDC2582802D; lianjia_token=; lianjia_udid=156E5777-0F30-402D-88B9-D62EB3B9233D; lianjia_uuid=35F36686-461C-4BD1-A904-DA63EC64E6EC; CNZZDATA1253491255=481059203-1514046308-%7C1514046308; CNZZDATA1254525948=972699380-1514042713-%7C1514042713; lj-ss=9fc6cee08e4d99ced4584517044e1242; Hm_lpvt_9152f8221cb6243a53c83b956842be8a=1514046865; Hm_lvt_9152f8221cb6243a53c83b956842be8a=1514046657,1514046865; UM_distinctid=160843690e4f6-091b6dfa-d313861-2c600-160843690e56d; _ga=GA1.2.12607153.1514046656; _gat=1; _gat_global=1; _gat_new=1; _gat_new_global=1; _gat_past=1; _gid=GA1.2.1980141963.1514046656; gr_session_id_a1a50f141657a94e=00c98fea-0a57-4701-bfc6-4593a17a8509; gr_user_id=8e2fc122-9c57-4ee4-9ad5-a972aed85d48; lianjia_ssid=3C2F87B9-CC71-4769-A766-7EDC2582802D; lianjia_token=; lianjia_udid=156E5777-0F30-402D-88B9-D62EB3B9233D; lianjia_uuid=35F36686-461C-4BD1-A904-DA63EC64E6EC; select_city=510100; select_nation=1
User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 9_2 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Mobile/13C75 GrowingIO/lianjia_0.1-20170215211354 shlianjia/shlianjia Lianjia/8.3.2.5
Accept-Language: zh-cn
Referer: https://m.lianjia.com/cd/fangjia
Accept-Encoding: gzip, deflate
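Most of the Cookie header above is one percent-encoded JSON value (`_staticData`) describing the app and the device. Below is a minimal sketch of decoding such a value with the standard library. The sample string is an assumption: a shortened, flattened excerpt that keeps only the cityId field from the capture. `decode_static_data` is a hypothetical helper name, not part of the spider.

```python
import json

try:
    from urllib.parse import unquote  # Python 3
except ImportError:
    from urllib import unquote  # Python 2.7


def decode_static_data(raw_value):
    """Percent-decode a cookie value and parse the JSON it contains."""
    return json.loads(unquote(raw_value))


# Shortened sample (assumption): only the cityId field of the captured cookie.
sample = "%7B%22cityId%22%20%3A%20%22510100%22%7D"
static_data = decode_static_data(sample)
print(static_data["cityId"])
```

Decoding the cookie this way makes it easier to see which fields (city, device IDs, app version) the mobile app sends with each request.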
# Second-hand listing details
# Requires `from lxml import etree` at the top of the spider module.
def detail_parse(self, response):
    # e.g. 'http://cd.lianjia.com/ershoufang/dongcheng/pg2/'
    try:
        contents = etree.HTML(response.body)
        houselist = contents.xpath('/html/body/div[4]/div[1]/ul/li')
        self.logger.info(houselist)
        for house in houselist:
            try:
                # Split each shared text node once and reuse the parts.
                info = house.xpath('div[1]/div[2]/div/text()').pop().split('|')
                follow = house.xpath('div[1]/div[4]/text()').pop().split('/')
                item = CdlianjiaspiderItem()
                item['title'] = house.xpath('div[1]/div[1]/a/text()').pop()
                item['community'] = house.xpath('div[1]/div[2]/div/a/text()').pop()
                item['model'] = info[1]
                item['area'] = info[2]
                item['focus_num'] = follow[0]
                item['watch_num'] = follow[1]
                item['time'] = follow[2]
                item['price'] = house.xpath('div[1]/div[6]/div[1]/span/text()').pop()
                item['average_price'] = house.xpath('div[1]/div[6]/div[2]/span/text()').pop()
                item['link'] = house.xpath('div[1]/div[1]/a/@href').pop()
                item['city'] = response.meta["id1"]
                self.url_detail = item['link']
                #item['Latitude'] = self.get_latitude(self.url_detail)
                # Pass the item as a log argument; "str" + item raises TypeError.
                self.logger.info("CdlianjiaspiderItem: %s", item)
            except Exception as e:
                # Skip rows that do not match the XPath layout instead of
                # yielding a partial or stale item.
                self.logger.info(e)
                continue
            yield item
    except Exception as e:
        self.logger.info(e)
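The parser above pulls several values out of a single text node by splitting on '|' or '/', so any row with fewer parts raises an IndexError. Here is a small sketch of a more forgiving split. The sample strings are hypothetical (the real page text is not shown in this post), and `pick_part` is a hypothetical helper name.

```python
# -*- coding: utf-8 -*-


def pick_part(text, sep, index, default=u""):
    """Return the stripped part at `index` after splitting on `sep`,
    or `default` when the text has too few parts."""
    parts = [part.strip() for part in text.split(sep)]
    if -len(parts) <= index < len(parts):
        return parts[index]
    return default


# Hypothetical text nodes, shaped like what the XPath queries might return.
house_info = u" | 2室1厅 | 65.3平米 | 南"
follow_info = u"12人关注 / 共3次带看 / 5天以前发布"

print(pick_part(house_info, u"|", 1))
print(pick_part(follow_info, u"/", 2))
print(pick_part(follow_info, u"/", 5, default=u"n/a"))
```

With a helper like this, a malformed row yields an empty field rather than dropping the whole item.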
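import codecs
import json
import time

# CdlianjiaspiderItem and ChengjiaoItem come from the project's items module.

class CdlianjiaspiderPipeline(object):
    def __init__(self):
        # One output file per day for each item type, opened in append mode.
        fileName = "ershoufang_" + time.strftime("%Y-%m-%d") + ".json"
        chengjiaoFileName = "chengjiao_" + time.strftime("%Y-%m-%d") + ".json"
        self.file = codecs.open(fileName, 'ab', encoding='utf-8')
        self.chengjiaoFile = codecs.open(chengjiaoFileName, 'ab', encoding='utf-8')

    def process_item(self, item, spider):
        # Python 2 only: dump as escaped ASCII, then unescape so that
        # Chinese text stays readable in the output file.
        line = json.dumps(dict(item), encoding='utf-8') + '\n'
        line = line.decode('unicode_escape')
        if isinstance(item, CdlianjiaspiderItem):
            self.file.write(line)
        elif isinstance(item, ChengjiaoItem):
            self.chengjiaoFile.write(line)
        return item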
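The pipeline's `json.dumps` followed by `decode('unicode_escape')` is a Python 2 idiom for keeping Chinese text readable in the output file. As a hedged alternative sketch (not the author's method), `ensure_ascii=False` together with `io.open` writes readable UTF-8 directly. `append_json_line` and the sample record are hypothetical.

```python
# -*- coding: utf-8 -*-
import io
import json
import os
import tempfile


def append_json_line(path, record):
    """Append one record to `path` as a single line of UTF-8 JSON."""
    text = json.dumps(record, ensure_ascii=False) + u"\n"
    with io.open(path, "a", encoding="utf-8") as handle:
        handle.write(text)


# Hypothetical record standing in for dict(item).
record = {u"title": u"成都二手房", u"price": u"120"}
out_path = os.path.join(tempfile.mkdtemp(), "ershoufang_sample.json")
append_json_line(out_path, record)

with io.open(out_path, encoding="utf-8") as handle:
    first_line = handle.readline()
print(json.loads(first_line) == record)
```

This avoids the unescape step, which can mangle values that happen to contain literal backslashes.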
crontab -e
# Edit the crontab: run every day at 8 p.m.
0 20 * * * nohup python2.7 /home/nodejs/python/CDLianjiaSpider/run.py &
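The entry above fails because cron, which runs jobs with a minimal PATH, cannot find python2.7. The command has to go into a shell script that cron calls instead; remember to give the script execute permission.
Every command used in cron needs an absolute path.
scrapy crawl must be run from the root directory of the spider project, otherwise it reports that the crawl command does not exist, so cd into the spider directory first.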
$ vim run.sh
cd /home/nodejs/python/CDLianjiaSpider
nohup /usr/local/bin/python2.7 /home/nodejs/python/CDLianjiaSpider/run.py &
$ crontab -e
# Edit the crontab: run every day at 8 p.m.
0 20 * * * /home/nodejs/python/CDLianjiaSpider/run.sh
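The notes above say that `scrapy crawl` only works from the project root. As an illustration, a launcher script could locate that directory itself before starting the crawl. This is a sketch that assumes the root is the directory holding `scrapy.cfg`. `find_project_root` is a hypothetical helper; the post does not show the contents of run.py.

```python
import os
import tempfile


def find_project_root(start_dir, marker="scrapy.cfg"):
    """Walk upwards from start_dir until a directory containing marker is found.

    Returns the directory path, or None if the filesystem root is reached.
    """
    current = os.path.abspath(start_dir)
    while True:
        if os.path.isfile(os.path.join(current, marker)):
            return current
        parent = os.path.dirname(current)
        if parent == current:
            return None
        current = parent


# Demo on a throwaway directory tree instead of a real project.
demo_root = os.path.realpath(tempfile.mkdtemp())
open(os.path.join(demo_root, "scrapy.cfg"), "w").close()
nested = os.path.join(demo_root, "CDLianjiaSpider", "spiders")
os.makedirs(nested)

root = find_project_root(nested)
print(root == demo_root)
# A launcher would then call os.chdir(root) before starting the crawl.
```

Resolving the root in the script makes the cron entry less dependent on the working directory cron happens to use.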
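macOS actually ships with Python 2.7, but without essential tools such as pip and IPython. Normally these can be installed directly, but after upgrading to 10.12, SIP (System Integrity Protection) no longer permits installing components there, because the bundled Python lives in a system directory by default. So a workaround is needed.
There are two options:
1. Reinstall Python so that it lives under /usr/local/bin, where you have permission to make changes. This avoids all later permission problems.
2. Disable SIP. This is strongly discouraged, as it exposes the system to significant risk.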
brew install python
This formula installs a python2 executable to /usr/local/bin.
If you wish to have this formula's python executable in your PATH then add
the following to ~/.zshrc:
export PATH="/usr/local/opt/python/libexec/bin:$PATH"
Pip and setuptools have been installed. To update them
pip2 install --upgrade pip setuptools
You can install Python packages with
pip2 install <package>
They will install into the site-package directory
/usr/local/lib/python2.7/site-packages
pip2 install scrapy
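After the install it is worth confirming that the interpreter actually being run lives under /usr/local rather than in the SIP-protected system directory. A minimal sketch, with `lives_under` as a hypothetical helper name:

```python
import os
import sys


def lives_under(path, prefix):
    """Return True if `path` is `prefix` itself or is located inside it."""
    path = os.path.normpath(path)
    prefix = os.path.normpath(prefix)
    return path == prefix or path.startswith(prefix + os.sep)


# Show which interpreter is running this script and whether it is
# the one installed under /usr/local.
print(sys.executable)
print(lives_under(sys.executable, "/usr/local"))
```

If the second line prints False, the shell is probably still picking up the system Python first, and the PATH change suggested in the Homebrew output above may be needed.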
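I develop the spider on my local machine and deploy it to Alibaba Cloud to run. The Alibaba Cloud server runs CentOS 6.x, which ships with Python 2.6, so Python 2.7 is built from source: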
yum groupinstall "Development tools"
yum install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel
wget https://www.python.org/ftp/python/2.7.13/Python-2.7.13.tgz
tar xf Python-2.7.13.tgz
cd Python-2.7.13
./configure --prefix=/usr/local
make
make install
After a successful installation, Python 2.7 is available at /usr/local/bin/python2.7.
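The -devel packages installed before the build matter because Python's build skips optional modules whose headers are missing. Below is a quick sketch for checking the result with the new interpreter. The module list is an assumption chosen to match some of the packages above, and `check_modules` is a hypothetical helper name.

```python
import importlib


def check_modules(names):
    """Try to import each module and report whether it is available."""
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = True
        except ImportError:
            results[name] = False
    return results


# Modules that depend on zlib-devel, bzip2-devel, openssl-devel, sqlite-devel.
status = check_modules(["zlib", "bz2", "ssl", "sqlite3"])
for name in sorted(status):
    print("%-8s %s" % (name, "ok" if status[name] else "MISSING"))
```

Run it with /usr/local/bin/python2.7. A module reported as MISSING usually means the matching -devel package was not installed when `make` ran, and Python needs to be rebuilt after installing it.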
sudo yum install sqlite-devel
# Install pip
wget --no-check-certificate https://pypi.python.org/packages/source/s/setuptools/setuptools-1.4.2.tar.gz
tar -vxf setuptools-1.4.2.tar.gz
cd setuptools-1.4.2
python2.7 setup.py install
easy_install-2.7 pip
pip2 install scrapy
Source code on GitHub