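Project documentation: scrape Douban Movie (http://movie.douban.com) for each movie's title, details, synopsis, and rating.
Stack: the Scrapy framework plus the MongoDB database.
Prerequisites: install the Scrapy framework and pymongo, the Python driver for MongoDB.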
Create the project: scrapy startproject doubaner
Enter the project directory: cd doubaner
Create the spider: scrapy genspider douban 'movie.douban.com'
1. Configure items
In the project's items.py file:
import scrapy


class DoubanerItem(scrapy.Item):
    title = scrapy.Field()    # movie title
    content = scrapy.Field()  # movie details
    info = scrapy.Field()     # movie synopsis
    score = scrapy.Field()    # movie rating
2. Configure douban.py
import scrapy
from doubaner.items import DoubanerItem


class DoubanSpider(scrapy.Spider):
    name = "douban"
    allowed_domains = ["movie.douban.com"]
    start = 0
    url = 'https://movie.douban.com/top250?start='
    end = '&filter='
    start_urls = [url + str(start) + end]

    def parse(self, response):
        movies = response.xpath("//div[@class='info']")
        for each in movies:
            # Build a fresh item per movie; the leading "." keeps each XPath
            # relative to the current node instead of searching the whole page
            item = DoubanerItem()
            title = each.xpath('.//div[@class="hd"]/a/span[1]/text()').extract()
            score = each.xpath('.//div[@class="bd"]/div/span[2]/text()').extract()
            content = each.xpath('.//div[@class="bd"]/p/text()').extract()
            info = each.xpath('.//div[@class="bd"]/p[2]/span/text()').extract()
            item['title'] = title[0]
            item['score'] = score[0]
            item['content'] = ';'.join(content)
            # An entry may lack a synopsis line, so fall back to an empty string
            item['info'] = info[0] if info else ''
            yield item
        # Offsets run 0, 25, ..., 225; stop once the last page has been requested
        if self.start < 225:
            self.start += 25
            yield scrapy.Request(url=(self.url + str(self.start) + self.end), callback=self.parse)
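The pagination above steps the start offset by 25 per page. As a minimal sketch of the same URL scheme outside Scrapy, the helper below lists every page URL up front. The function name page_urls and the total of 250 entries are assumptions made for illustration; only the URL prefix, the suffix, and the step of 25 come from the spider.

```python
BASE_URL = 'https://movie.douban.com/top250?start='
SUFFIX = '&filter='
PAGE_SIZE = 25   # step used by the spider
TOTAL = 250      # assumed number of entries in the list


def page_urls(total=TOTAL, page_size=PAGE_SIZE):
    """Return the list-page URLs in crawl order (hypothetical helper)."""
    return [BASE_URL + str(offset) + SUFFIX
            for offset in range(0, total, page_size)]


urls = page_urls()
```

A list like this could also be assigned to start_urls, which would let Scrapy schedule all pages without the spider tracking self.start; whether that suits the project is a design choice, not something the original notes prescribe.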
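3. Configure pipelines.py
# coding=utf-8
import pymongo


class DoubanerPipeline(object):
    def __init__(self, host, port, dbname, collection):
        # pymongo.MongoClient(host, port) opens the MongoDB connection
        self.client = pymongo.MongoClient(host=host, port=port)
        # Select the target database
        mdb = self.client[dbname]
        # Select the collection that stores the scraped data
        self.post = mdb[collection]

    @classmethod
    def from_crawler(cls, crawler):
        # scrapy.conf is no longer available in current Scrapy releases,
        # so the settings are read from the crawler instead
        settings = crawler.settings
        return cls(
            host=settings['MONGODB_HOST'],
            port=settings.getint('MONGODB_PORT'),
            dbname=settings['MONGODB_DBNAME'],
            collection=settings['MONGODB_DBCNAME'],
        )

    def process_item(self, item, spider):
        data = dict(item)
        # insert() was removed in PyMongo 4; insert_one() replaces it
        self.post.insert_one(data)
        return item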
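The process_item step above copies the item into a plain dict, hands it to the collection, and returns the item so later pipelines can still see it. A rough way to check that logic without a running MongoDB server is sketched below. InMemoryCollection and StoragePipeline are hypothetical stand-ins written for this example, not part of Scrapy or PyMongo, and the sample item values are placeholders.

```python
class InMemoryCollection:
    """Hypothetical stand-in for a MongoDB collection; stores documents in a list."""

    def __init__(self):
        self.documents = []

    def insert_one(self, document):
        # Mirrors only the call shape of a single-document insert
        self.documents.append(document)


class StoragePipeline:
    """Simplified pipeline with the same process_item logic; the collection is passed in."""

    def __init__(self, collection):
        self.post = collection

    def process_item(self, item, spider):
        data = dict(item)            # copy the item's fields into a plain dict
        self.post.insert_one(data)   # store the copy
        return item                  # pass the original item along


collection = InMemoryCollection()
pipeline = StoragePipeline(collection)
item = {'title': 'Example Movie', 'score': '9.0',
        'content': 'placeholder details', 'info': 'placeholder synopsis'}
returned = pipeline.process_item(item, spider=None)
```

Passing the collection in from outside is what makes the swap possible; the pipeline in the project builds its own client, so this sketch only exercises the copy-and-store step.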
4. Configure the settings file (settings.py)
ITEM_PIPELINES = {
    'doubaner.pipelines.DoubanerPipeline': 300
}
# USER_AGENT must be a plain string, not a set
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'
# MongoDB host: the loopback address 127.0.0.1
MONGODB_HOST = '127.0.0.1'
# Port number; the default is 27017
MONGODB_PORT = 27017
# Database name
MONGODB_DBNAME = 'DouBan'
# Name of the collection that stores this crawl's data
MONGODB_DBCNAME = 'DouBanMovies'
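The pipeline depends on all four MongoDB keys being present and spelled exactly as above; a misspelled key would silently yield None when looked up. The following is a minimal sketch of a check for that, using a plain dict in place of the Scrapy settings object. The helper missing_settings and the misspelled key MONGODB_POST in the second mapping are assumptions made purely for illustration.

```python
REQUIRED_KEYS = ('MONGODB_HOST', 'MONGODB_PORT', 'MONGODB_DBNAME', 'MONGODB_DBCNAME')


def missing_settings(settings, required=REQUIRED_KEYS):
    """Return the required keys that are absent or None (hypothetical helper)."""
    return [key for key in required if settings.get(key) is None]


good = {
    'MONGODB_HOST': '127.0.0.1',
    'MONGODB_PORT': 27017,
    'MONGODB_DBNAME': 'DouBan',
    'MONGODB_DBCNAME': 'DouBanMovies',
}

# Same values, but with the port key misspelled on purpose
typo = dict(good)
typo['MONGODB_POST'] = typo.pop('MONGODB_PORT')

problems_good = missing_settings(good)
problems_typo = missing_settings(typo)
```

A check like this could run once when the pipeline is constructed, so a bad key name fails early instead of surfacing as a connection error.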
5. Start the MongoDB database (for example, by running mongod)
6. Start the spider: scrapy crawl douban