Picking up from last time: we decided to store the scraped data in Elasticsearch, so the first thing we need is a working Elasticsearch install. I won't cover the installation here; assuming Elasticsearch is already set up, let's get started on the spider.
(baidunewsspider) C:\Users\LiTangMM\PycharmProjects>scrapy startproject baiduNewsSpider
(baidunewsspider) C:\Users\LiTangMM\PycharmProjects\baiduNewsSpider>scrapy genspider newsbaidu http://news.baidu.com/
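For orientation, startproject and genspider leave us with the usual Scrapy layout (models.py is the one extra file we will add ourselves later on):

baiduNewsSpider/
    scrapy.cfg
    baiduNewsSpider/
        __init__.py
        items.py
        middlewares.py
        models.py        <- added by us later
        pipelines.py
        settings.py
        spiders/
            __init__.py
            newsbaidu.py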
First we have to tell Scrapy which data we want, so edit the baiduNewsSpider/items.py file:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class BaidunewsspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    news_url = scrapy.Field()  # the news link
    pass
The trailing pass is only there to keep the code block looking complete; feel free to delete it.
OK! Now for the crawling itself. Edit the baiduNewsSpider/spiders/newsbaidu.py file:
# -*- coding: utf-8 -*-
import scrapy
import json
import re

from baiduNewsSpider.items import BaidunewsspiderItem


class NewsbaiduSpider(scrapy.Spider):
    name = 'newsbaidu'
    allowed_domains = ['news.baidu.com']
    start_urls = []

    def start_requests(self):
        # override the request-construction method
        baseurl = 'http://news.baidu.com/widget?id={}&ajax=json&t=1565919256596'
        widget_ids = ['civilnews',
                      'InternationalNews',
                      'EnterNews',
                      'SportNews',
                      'FinanceNews',
                      'TechNews',
                      'MilitaryNews',
                      'InternetNews',
                      'DiscoveryNews',
                      'LadyNews',
                      'HealthNews',
                      'PicWall']
        return [scrapy.Request(baseurl.format(wid)) for wid in widget_ids]

    def parse(self, response):  # parse callback
        widgetid = re.match(r'.*?id=(.*?)&', response.url).group(1)  # pull the widget id out of the request url with a regex
        jsondata = json.loads(response.text)  # parse the response body as JSON
        news_list = jsondata.get('data').get(widgetid).get('focusNews')  # extract the focusNews list
        for news in news_list:
            yield BaidunewsspiderItem(news_url=news['m_url'])  # yield one item per news link
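For reference, the widget endpoint returns JSON shaped roughly like this (a simplified sketch; only the data -> widget id -> focusNews -> m_url path is actually used, everything else is omitted here):

{
    "data": {
        "civilnews": {
            "focusNews": [
                {"m_url": "http://..."},
                ...
            ]
        }
    }
}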
Open a shell and run scrapy crawl newsbaidu -o urls.csv to start the spider.
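The exported urls.csv has a header row with the field name followed by one link per line, roughly like this (the rows below are placeholders; what you actually get depends on what the widgets return that day):

news_url
http://baijiahao.baidu.com/...
http://news.cctv.com/...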
That command crawls the links and saves them for us. Next we need a second spider that downloads each news page and extracts its title and body, so first add another item class to baiduNewsSpider/items.py:
class NewsItem(scrapy.Item):
    url = scrapy.Field()      # the news link
    title = scrapy.Field()    # the news title
    content = scrapy.Field()  # the news body
First, the new spider has to read the news urls back from the file, so we override the start_requests method again. Here is baiduNewsSpider/spiders/newsspider.py:
# -*- coding: utf-8 -*-
import scrapy

from baiduNewsSpider.items import NewsItem


class NewsspiderSpider(scrapy.Spider):
    name = 'newsspider'
    allowed_domains = []
    start_urls = []

    def start_requests(self):
        # read the news urls collected by the first spider (skip the csv header row)
        with open('urls.csv') as urlfile:
            urllist = [line.strip() for line in urlfile.readlines()[1:]]
        return [scrapy.Request(url) for url in urllist]

    def parse(self, response):
        url = response.url
        urlname = url.split('/')[2]  # split the url to find out which site it belongs to
        if urlname == '3w.huanqiu.com':  # huanqiu.com
            title = response.xpath('//h1[@class="a-title"]/strong/text()').extract()[0]  # parse the page with the response's selector
            content = ''.join(response.xpath('//div[@class="a-con"]/p/text()').extract())
        elif urlname == 'baijiahao.baidu.com':  # Baijiahao
            title = response.xpath('//div[@class="article-title"][1]/h2/text()').extract()[0]
            content = ''.join(response.xpath('//div[@id="article"]//text()').extract())
        elif urlname == 'news.cctv.com':  # CCTV news
            title = response.xpath('//div[@class="cnt_bd"][1]/h1/text()').extract()[0]
            content = ''.join(response.xpath('//div[@class="cnt_bd"][1]/p/text()').extract())
        else:  # everything else
            title = response.xpath('//h1/text()').extract()[0]
            content = ''.join(response.xpath('//p/text()').extract())
        yield NewsItem(title=title, url=url, content=content)
To store the data in Elasticsearch, we first need to install elasticsearch-dsl:
pip install elasticsearch-dsl
Then create a models.py file inside the baiduNewsSpider package (next to items.py, so it can be imported as baiduNewsSpider.models); it will hold the document mapping:
from elasticsearch_dsl import DocType, Keyword, Text
from elasticsearch_dsl.connections import connections

connections.create_connection(hosts=['localhost'])


class NewsType(DocType):
    title = Keyword()                       # the title is not analyzed
    content = Text(analyzer='ik_max_word')  # the content uses the ik_max_word analyzer

    class Meta:
        index = 'baidunews'
        doc_type = 'news'


if __name__ == '__main__':
    NewsType.init()
Finally, implement a pipeline class in baiduNewsSpider/pipelines.py:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
from baiduNewsSpider.models import NewsType


class ElasticsearchPipeline(object):
    def process_item(self, item, spider):
        if spider.name == 'newsspider':  # only the news-content spider's items go into elasticsearch
            news = NewsType()
            news.title = item['title']
            news.content = item['content']
            news.save()
        return item
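As the generated comment above reminds us, the pipeline also has to be enabled in the ITEM_PIPELINES setting. A minimal sketch of the settings.py entry (the priority value 300 is just the conventional example value):

# baiduNewsSpider/settings.py (excerpt)
ITEM_PIPELINES = {
    'baiduNewsSpider.pipelines.ElasticsearchPipeline': 300,
}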
To make sure everything runs correctly, check that Elasticsearch is up and that the ik analyzer plugin is installed. Note that the Elasticsearch version and the ik plugin version must match exactly, otherwise Elasticsearch will refuse to start.
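If you want a quick sanity check first, here is a small optional sketch using the elasticsearch client (a dependency of elasticsearch-dsl); it assumes Elasticsearch is listening on localhost:

from elasticsearch import Elasticsearch

es = Elasticsearch(['localhost'])
print(es.info())  # prints cluster info; raises a connection error if elasticsearch is not running
# ask the ik analyzer to tokenize a sample string; this fails if the ik plugin is missing
print(es.indices.analyze(body={'analyzer': 'ik_max_word', 'text': '百度新闻'}))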
Run the models.py file to create the index.
Then run the spiders.
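That is, from the project directory, run the link spider first and the content spider second:

scrapy crawl newsbaidu -o urls.csv
scrapy crawl newsspider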
OK! That wraps up the crawler part. Next time we will use Django to build a REST service on top of it.