Crawler project: scrape Douban Movies and store the data in MongoDB

Project brief: crawl http://movie.douban.com and collect each movie's title, details, synopsis, and rating.

Stack: the Scrapy framework + a MongoDB database

Prerequisites: install the Scrapy framework and the pymongo driver (and have a MongoDB server available).

Create the project: scrapy startproject doubaner

Enter the project directory: cd doubaner

Generate the spider: scrapy genspider douban movie.douban.com
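
After these two commands, Scrapy generates a project scaffold roughly like the following (exact files vary slightly by Scrapy version); the later steps edit items.py, spiders/douban.py, pipelines.py, and settings.py:

```
doubaner/
├── scrapy.cfg
└── doubaner/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── douban.py      # created by genspider
```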

1. Configure items

In the project's items.py (the class must be named DoubanerItem, since that is what the spider imports):

import scrapy

class DoubanerItem(scrapy.Item):
    title = scrapy.Field()    # movie title
    content = scrapy.Field()  # movie details
    info = scrapy.Field()     # movie synopsis
    score = scrapy.Field()    # movie rating

2. Write douban.py

import scrapy
from doubaner.items import DoubanerItem

class DoubanSpider(scrapy.Spider):
    name = "douban"
    allowed_domains = ["movie.douban.com"]
    start = 0
    url = 'https://movie.douban.com/top250?start='
    end = '&filter='
    start_urls = [url + str(start) + end]

    def parse(self, response):
        movies = response.xpath("//div[@class='info']")

        for each in movies:
            # Create a fresh item per movie, and use relative XPath ('.//')
            # so each expression is scoped to the current movie node
            # instead of matching the whole page
            item = DoubanerItem()
            title = each.xpath('.//div[@class="hd"]/a/span[1]/text()').extract()
            score = each.xpath('.//div[@class="bd"]/div/span[2]/text()').extract()
            content = each.xpath('.//div[@class="bd"]/p/text()').extract()
            info = each.xpath('.//div[@class="bd"]/p[2]/span/text()').extract()

            item['title'] = title[0]
            item['score'] = score[0]
            item['content'] = ';'.join(content)
            item['info'] = info[0]

            yield item

        # Request the next page until all 250 entries (10 pages of 25) are crawled
        if self.start < 225:
            self.start += 25
            yield scrapy.Request(url=self.url + str(self.start) + self.end, callback=self.parse)
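
The pagination logic simply steps `start` through 0, 25, …, 225. A standalone sketch of the URLs the spider ends up requesting (pure Python, no Scrapy required):

```python
# Douban's Top 250 list shows 25 movies per page, selected by the ?start=N parameter
base = 'https://movie.douban.com/top250?start='
end = '&filter='

urls = [base + str(start) + end for start in range(0, 250, 25)]

print(len(urls))   # 10 pages
print(urls[0])     # https://movie.douban.com/top250?start=0&filter=
print(urls[-1])    # https://movie.douban.com/top250?start=225&filter=
```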

3. Configure pipelines.py

# coding=utf-8
import pymongo

class DoubanerPipeline(object):
    def open_spider(self, spider):
        # scrapy.conf was removed in modern Scrapy; read the settings
        # through spider.settings instead
        settings = spider.settings
        host = settings['MONGODB_HOST']
        port = settings['MONGODB_PORT']
        dbname = settings['MONGODB_DBNAME']

        # pymongo.MongoClient(host, port) creates the MongoDB connection
        self.client = pymongo.MongoClient(host=host, port=port)

        # Select the target database
        mdb = self.client[dbname]

        # Select the collection that will store the scraped data
        self.post = mdb[settings['MONGODB_DBCNAME']]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        data = dict(item)
        # insert() is deprecated in pymongo 3+; use insert_one()
        self.post.insert_one(data)
        return item
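
process_item converts the Scrapy item to a plain dict before inserting it, because pymongo stores dict-like documents. A minimal sketch with a plain dict standing in for the item (sample values are made up; no MongoDB required):

```python
# A plain dict standing in for a DoubanerItem (Scrapy Items also support dict())
item = {
    'title': '肖申克的救赎',    # movie title (sample value)
    'score': '9.7',             # rating (sample value)
    'content': '导演: 弗兰克·德拉邦特',  # details (sample value)
    'info': '希望让人自由。',    # synopsis (sample value)
}

data = dict(item)
# In the pipeline this would become: self.post.insert_one(data)
print(sorted(data.keys()))
```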

4. Edit the settings.py configuration

ITEM_PIPELINES = {
    'doubaner.pipelines.DoubanerPipeline': 300,
}

# USER_AGENT must be a string, not a set
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'

# MongoDB host (loopback address 127.0.0.1)
MONGODB_HOST = '127.0.0.1'

# Port number; the default is 27017
MONGODB_PORT = 27017

# Database name
MONGODB_DBNAME = 'DouBan'

# Collection that stores this run's data
MONGODB_DBCNAME = 'DouBanMovies'

5. Start the MongoDB server: mongod

6. Run the spider: scrapy crawl douban
