Recommender System 1: Creating a Simple Spider with Scrapy

Create the project

Change into the directory where you want the project to live, then run:

scrapy startproject zhihuscrapy
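
This generates a project skeleton roughly like the following (the exact files vary slightly across Scrapy versions):

zhihuscrapy/
    scrapy.cfg            # deploy configuration file
    zhihuscrapy/          # the project's Python package
        __init__.py
        items.py          # item definitions (used later)
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spiders live here
            __init__.py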

Create the spider

Create a file named zhihu_spider.py under the spiders directory.
The file contents are as follows:

import scrapy

class ZhihuSpider(scrapy.Spider):
    name = "zhihu"
    allowed_domains = ["zhihu.com"]
    start_urls = [
        "https://zhuanlan.zhihu.com/p/38198729",
        "https://zhuanlan.zhihu.com/p/38235624"
    ]

    def parse(self, response):
        for sel in response.xpath('//head'):
            # Note: all three XPaths read the page <title> for now; swap in
            # the real expressions for link and description when known.
            title = sel.xpath('title/text()').extract()
            link = sel.xpath('title/text()').extract()
            desc = sel.xpath('title/text()').extract()
            print(title, link, desc)
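
Before running the full spider, it helps to test the selectors interactively; scrapy shell fetches a page and opens a Python prompt with the response object ready:

scrapy shell "https://zhuanlan.zhihu.com/p/38198729"
>>> response.xpath('//head/title/text()').extract()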

Set request headers

Add the following to settings.py:

# Request header: identify as a regular browser
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
# Do not obey robots.txt
ROBOTSTXT_OBEY = False
# Disable cookie tracking
COOKIES_ENABLED = False
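
From inside the project directory you can verify that a setting has been picked up:

scrapy settings --get USER_AGENT
scrapy settings --get ROBOTSTXT_OBEY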

Start the crawl

From the project root directory, run:

scrapy crawl zhihu

Improve the code

The improved spider follows the links extracted from each page and yields structured items instead of printing them:

import scrapy

from zhihuscrapy.items import ZhihuscrapyItem

class ZhihuSpider(scrapy.Spider):
    name = "zhihu"
    allowed_domains = ["zhihu.com"]
    start_urls = [
        "https://zhuanlan.zhihu.com/p/38198729",
        "https://zhuanlan.zhihu.com/p/38235624"
    ]

    def parse(self, response):
        for href in response.css(".UserLink-link > a::attr(href)"):
            # response.urljoin() resolves a possibly relative href
            # against the current page URL
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath('//head'):
            item = ZhihuscrapyItem()
            # As above, all three XPaths read the page <title>; replace
            # them with real expressions for link and description.
            item['title'] = sel.xpath('title/text()').extract()
            item['link'] = sel.xpath('title/text()').extract()
            item['desc'] = sel.xpath('title/text()').extract()
            yield item
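
The import at the top assumes an item class in zhihuscrapy/items.py; a minimal sketch with the three fields used above:

import scrapy

class ZhihuscrapyItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()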

Run and export the results

Run the crawl again, this time writing the items to a JSON file:

scrapy crawl zhihu -o items.json
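
With a .json extension, the -o flag writes the yielded items as a JSON array, so the results can be loaded back for a quick check (field names as defined in the item class above; note that older Scrapy versions append to an existing file with -o, so delete items.json between runs):

import json

with open('items.json') as f:
    items = json.load(f)

for item in items:
    print(item['title'])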

Reference: Scrapy爬虫(1)-知乎
Reference: Scrapy入门教程
