Prerequisites
1. The Scrapy framework is installed
2. Elasticsearch is installed
Create a project named scrapyes
scrapy startproject scrapyes
Directory structure
.
|____scrapy.cfg
|____scrapyes
| |______init__.py
| |____items.py
| |____middlewares.py
| |____pipelines.py
| |____settings.py
| |____spiders
| | |______init__.py
Install ScrapyElasticSearch
pip install ScrapyElasticSearch
Configure settings.py
...
ITEM_PIPELINES = {
    'scrapyelasticsearch.scrapyelasticsearch.ElasticSearchPipeline': 300,
}
ELASTICSEARCH_SERVERS = ['192.168.4.215']
ELASTICSEARCH_PORT = 9200 # If port 80 leave blank
ELASTICSEARCH_USERNAME = ''
ELASTICSEARCH_PASSWORD = ''
ELASTICSEARCH_INDEX = 'scrapy.course'
ELASTICSEARCH_TYPE = 'course'
ELASTICSEARCH_UNIQ_KEY = 'url'
...
For a description of these settings, see https://github.com/knockrentals/scrapy-elasticsearch
Write a spider for online courses
import scrapy


class ESCourseSpider(scrapy.Spider):
    name = 'es_course'

    def start_requests(self):
        # Request course pages 1 through 29 of the EduSoho demo site.
        for i in range(1, 30):  # range replaces Python 2's xrange
            url = 'http://demo.edusoho.com/course/' + str(i)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Each yielded dict is an item that the pipeline indexes into Elasticsearch.
        yield {
            'title': response.css('span.course-detail-heading::text').extract_first(),
            'price': response.css('b.pirce-num::text').extract_first(),
            'url': response.url,
        }
Run the spider
scrapy crawl es_course -o es_course.json
The scraped items are written to a newly created file, es_course.json:
[
{"url": "http://demo.edusoho.com/course/1", "price": "免费", "title": "\n 课程功能体验\n "},
{"url": "http://demo.edusoho.com/course/20", "price": "0.01", "title": "\n 官方主题\n "},
{"url": "http://demo.edusoho.com/course/24", "price": "999.00", "title": "\n 会员专区\n "},
{"url": "http://demo.edusoho.com/course/22", "price": "免费", "title": "\n 第三方主题\n "},
{"url": "http://demo.edusoho.com/course/27", "price": "0.01", "title": "\n 优惠码\n "}
]
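The titles in the output above still carry the newlines and spaces that surround the text in the page markup. One way to tidy them before they are indexed is sketched below. The helper name clean_item is hypothetical and not part of Scrapy or ScrapyElasticSearch; in a real project this logic could live in the spider's parse method or in a separate item pipeline.

```python
def clean_item(item):
    """Return a copy of item with surrounding whitespace stripped from string values.

    Non-string values (for example None, which extract_first() returns
    when a selector matches nothing) are passed through unchanged.
    """
    return {
        key: value.strip() if isinstance(value, str) else value
        for key, value in item.items()
    }


# Sample item copied from the es_course.json output above.
raw = {
    "url": "http://demo.edusoho.com/course/1",
    "price": "免费",
    "title": "\n 课程功能体验\n ",
}
cleaned = clean_item(raw)
print(repr(cleaned["title"]))
```

Stripping at this stage keeps the stored title free of layout whitespace, which makes exact-match queries and display simpler later on.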
Check the data in Elasticsearch with the following query:
GET scrapy.course*/_search
{
  "query": {
    "match_all": {}
  },
  "from": 0,
  "size": 50
}
Result
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 5,
"max_score": 1,
"hits": [
{
"_index": "scrapy.course",
"_type": "course",
"_id": "6306093149d91c35eabc1c59f28d68355cc4de9d",
"_score": 1,
"_source": {
"url": "http://demo.edusoho.com/course/1",
"price": "免费",
"title": "\n 课程功能体验\n "
}
},
{
"_index": "scrapy.course",
"_type": "course",
"_id": "6a090cfe8f9dbf3d21248d64d9907eab4b31bc4d",
"_score": 1,
"_source": {
"url": "http://demo.edusoho.com/course/24",
"price": "999.00",
"title": "\n 会员专区\n "
}
},
...
This shows that the data has been stored in Elasticsearch.
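The _id values in the result are 40-character hexadecimal strings, which is consistent with a SHA-1 digest of the field named in ELASTICSEARCH_UNIQ_KEY (here, url). The sketch below illustrates that idea only. It assumes the ID is a SHA-1 hash of the UTF-8 encoded key; the function name doc_id_from_unique_key is hypothetical, and the pipeline's actual derivation should be checked against its source.

```python
import hashlib


def doc_id_from_unique_key(value):
    """Derive a deterministic document ID from a unique-key value.

    Assumption: the ID is the SHA-1 hex digest of the UTF-8 encoded key.
    A list or tuple of key parts is joined with '-' first (also an assumption).
    """
    if isinstance(value, (list, tuple)):
        value = "-".join(value)
    return hashlib.sha1(value.encode("utf-8")).hexdigest()


doc_id = doc_id_from_unique_key("http://demo.edusoho.com/course/1")
print(len(doc_id))  # a SHA-1 hex digest is always 40 characters long
```

Because the ID depends only on the unique key, crawling the same URL again would overwrite the existing document instead of creating a duplicate.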