Scraping JD.com with Splash

Tools: Ubuntu, PyCharm
Open JD.com and search for "python": https://search.jd.com/Search?keyword=python&enc=utf-8&wq=python&pvid=24be3f6bbd364413aa0b8d9cdac5f468
This is the target URL.

Step 1: start Docker in a terminal (if Docker is not installed yet, look up the installation instructions first):

	~$ sudo service docker start

Step 2: check your local images and run the Splash container:

	~$ sudo docker images
	~$ sudo docker run -p 8050:8050 scrapinghub/splash
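Once the container is up, Splash exposes an HTTP API on port 8050: you pass the page you want rendered as the `url` query parameter of `render.html`. Because the JD search URL itself contains `?` and `&`, it must be percent-encoded, or Splash will mistake JD's `page` parameter for one of its own. A minimal sketch of building such a request URL with the standard library (the page number 3 here is just an example value):

```python
from urllib.parse import urlencode

# Example target: the JD search URL for page 3.
target = 'https://search.jd.com/Search?keyword=python&page=3'

# urlencode percent-encodes the nested '?' and '&', so Splash sees the
# entire JD URL as the value of its 'url' parameter.
params = urlencode({'url': target, 'wait': 1, 'images': 0})
render_url = 'http://localhost:8050/render.html?' + params
print(render_url)
```

Requesting `render_url` with `requests.get` then returns the fully rendered HTML of the JD results page.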

Step 3: create a new .py file and run the following code:

	import re

	import requests
	import pymongo
	from pyquery import PyQuery as pq

	client = pymongo.MongoClient('localhost', port=27017)
	db = client['JD']

	def page_parse(html):
	    doc = pq(html, parser='html')
	    items = doc('#J_goodsList .gl-item').items()
	    for item in items:
	        # Below-the-fold images are lazy-loaded: the URL may sit in
	        # 'src' or in 'data-lazy-img'.
	        if item('.p-img img').attr('src'):
	            image = item('.p-img img').attr('src')
	        else:
	            image = item('.p-img img').attr('data-lazy-img')
	        texts = {
	            'image': 'https:' + image,
	            'price': item('.p-price').text()[:6],
	            'title': re.sub('\n', '', item('.p-name').text()),
	            'commit': item('.p-commit').text()[:-3],  # drop the "条评价" suffix
	        }
	        yield texts

	def save_to_mongo(data):
	    # insert() was removed in pymongo 4; use insert_one() instead.
	    if db['jd_collection'].insert_one(data):
	        print('Saved to MongoDB:', data)
	    else:
	        print('MongoDB save error:', data)

	def main(number):
	    # Pass the JD search URL through requests' params so its nested
	    # '?' and '&' get percent-encoded; otherwise Splash would read
	    # 'page' as one of its own parameters.
	    target = 'https://search.jd.com/Search?keyword=python&page={}'.format(number)
	    params = {'url': target, 'wait': 1, 'images': 0}
	    response = requests.get('http://localhost:8050/render.html', params=params)
	    for i in page_parse(response.text):
	        save_to_mongo(i)

	if __name__ == '__main__':
	    # JD uses odd page numbers (1, 3, 5, ...) for search result pages.
	    for number in range(1, 200, 2):
	        print('Scraping page {}'.format(number))
	        main(number)
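The field cleanups inside `page_parse` are plain string operations on the text pyquery extracts. A small standalone sketch, using hypothetical sample values shaped like JD's listing text, shows what each one does:

```python
import re

# Hypothetical raw field values, as pyquery's .text() might return them.
price_text = '¥59.00'                    # .p-price text: sign + price
commit_text = '5万+条评价'                # .p-commit text ends with "条评价"
title_text = 'Python编程\n从入门到实践'   # .p-name text with an embedded newline

price = price_text[:6]           # keep the first 6 chars (currency sign + digits)
commit = commit_text[:-3]        # strip the 3-character "条评价" suffix
title = re.sub('\n', '', title_text)  # remove embedded newlines

print(price, commit, title)
```

Note that these slices assume a fixed text layout; if JD changes the suffix or price format, the offsets `[:6]` and `[:-3]` need adjusting.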

Then start mongo and inspect the collection: the data has been saved into the MongoDB database.
