For example: scrapy startproject douban
cd douban
scrapy genspider <spider-name> <start-url>
For example: scrapy genspider doubanmovie https://movie.douban.com/chart
Scrapy selectors provide four extraction methods:
1. xpath(): takes an XPath expression and returns the list of matching nodes. (Commonly used, must know, easy.)
2. css(): takes a CSS expression and returns the list of matching nodes. (Commonly used, must know, easy, similar to xpath.)
3. extract(): serializes the matched nodes to Unicode strings and returns them as a list. (Less common, not essential, easy.)
4. re(): takes a regular expression and returns the list of Unicode strings it extracts. (Commonly used, must know, hard.)
Here we use the first one. XPath is a language for selecting nodes in XML documents, and it works on HTML as well.
import scrapy


class DoubanmovieSpider(scrapy.Spider):
    # Spider id, must be unique within the project
    name = 'doubanmovie'
    # Allowed domains (scraping is restricted to these domains)
    allowed_domains = ['movie.douban.com']
    # URLs the spider starts crawling from
    start_urls = ['https://movie.douban.com/chart/']

    # Parse the response data; think of it as the response to one HTTP request
    def parse(self, response):
        # One <div class="pl2"> block per movie
        divResultList = response.xpath("//div[@class='pl2']")
        for i, result in enumerate(divResultList, start=1):
            print("Rank " + str(i))
            # extract_first(default='') avoids an AttributeError when a node is missing
            name = result.xpath(".//a/text()").extract_first(default='').replace('/', '').strip()
            aliasName = result.xpath(".//a/span/text()").extract_first()
            info = result.xpath(".//p/text()").extract_first()
            rank = result.xpath(".//span[@class='rating_nums']/text()").extract_first()
            rankPeople = result.xpath(".//span[@class='pl']/text()").extract_first()
            yield {
                'name': name,
                'aliasName': aliasName,
                'info': info,
                'rank': rank,
                'rankPeople': rankPeople,
            }
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
import pymongo
from itemadapter import ItemAdapter


class MoviedoubanspiderPipeline:
    def __init__(self):
        # Connect once when the pipeline is instantiated
        client = pymongo.MongoClient("mongodb://192.168.10.15:28017/")
        db = client['doubanmovie']
        self.table = db['doubanmovie']

    def process_item(self, item, spider):
        # Insert a copy: insert_one() mutates its argument by adding an _id field
        self.table.insert_one(ItemAdapter(item).asdict())
        return item
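As the generated comment above says, the pipeline only runs once it is enabled in settings.py. The module path below assumes the project package is named moviedoubanspider to match the class prefix; adjust it to your own project name (e.g. douban if you used scrapy startproject douban):

```python
# settings.py -- the module path is an assumption, adjust to your project name
ITEM_PIPELINES = {
    'moviedoubanspider.pipelines.MoviedoubanspiderPipeline': 300,
}
```

The number (0–1000) is the pipeline's execution order; lower values run first.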
scrapy crawl doubanmovie