Based on Python 2.7, we use Scrapy to crawl the Wandoujia app market for fields such as the app name, size, and download count, and store them in a MongoDB database. The steps are as follows:
Create a new crawler project with the scrapy command:
scrapy startproject ChannelCrawler
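The generated project layout looks roughly like the following (it varies slightly with the Scrapy version; middlewares.py is only generated by newer releases and can simply be created by hand otherwise):

ChannelCrawler/
    scrapy.cfg
    ChannelCrawler/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py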
After the project is generated, define the Item describing the scraped data structure in items.py. Based on the four fields to be crawled, items.py is:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class AppInfo(scrapy.Item):
    name = scrapy.Field()            # app name
    size = scrapy.Field()            # app size
    downloadTimes = scrapy.Field()   # download count
    description = scrapy.Field()     # app description
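Scrapy Items behave like dictionaries but only accept the fields declared above, which helps catch typos in field names early. A quick sketch of how AppInfo is used (not part of the project code):

from ChannelCrawler.items import AppInfo

item = AppInfo()
item['name'] = u'Example App'     # assign a declared field
print(item['name'])               # fields are read back like dict keys

# Assigning an undeclared field raises an error:
# item['author'] = 'someone'      # KeyError: 'AppInfo does not support field: author'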
Then write the spider in __init__.py under the spiders folder. The code is below; the approach is to crawl all app category pages first and then crawl the detailed app data:
# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.
import scrapy

from ChannelCrawler.items import AppInfo


class wandoujiaAppCrawler(scrapy.Spider):
    name = "wandoujiaAppCrawler"

    def start_requests(self):
        urls = [
            "http://www.wandoujia.com/apps",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parseCategory)

    # Crawl all app category pages
    def parseCategory(self, response):
        for pageUrl in response.css('li.parent-cate a::attr(href)').extract():
            yield scrapy.Request(url=pageUrl, callback=self.parse)

    def parse(self, response):
        for app in response.css('li.card'):
            item = AppInfo()
            # Note: no trailing commas here, otherwise each field becomes a tuple
            item['name'] = app.css('div.app-desc h2 a::text').extract_first()
            item['downloadTimes'] = app.css('div.app-desc div.meta span::text').extract_first()
            # Relative XPath (leading '.'), so it matches within this card only
            item['size'] = app.xpath('.//div[@class="app-desc"]/div/span[3]/text()').extract_first()
            yield item
        # Follow the links to the next pages
        next_pages = response.css('div.page-wp a::attr(href)').extract()
        for page in next_pages:
            yield scrapy.Request(url=page, callback=self.parse)
The main content crawled consists of four fields on the app page: the app name, the download count and size, and the app description, as shown on the Wandoujia page.
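The spider above fills in name, downloadTimes, and size; the description field would have to come from each app's detail page. One possible extension of the parse method is sketched below; the detail-page selector ('div.desc-info .con::text') is only an assumption and would need to be checked against the actual Wandoujia markup:

    def parse(self, response):
        for app in response.css('li.card'):
            item = AppInfo()
            item['name'] = app.css('div.app-desc h2 a::text').extract_first()
            item['downloadTimes'] = app.css('div.app-desc div.meta span::text').extract_first()
            item['size'] = app.xpath('.//div[@class="app-desc"]/div/span[3]/text()').extract_first()
            # Follow the app's detail page to pick up its description
            detailUrl = app.css('div.app-desc h2 a::attr(href)').extract_first()
            if detailUrl:
                yield scrapy.Request(url=response.urljoin(detailUrl),
                                     callback=self.parseDetail,
                                     meta={'item': item})
            else:
                yield item

    # Hypothetical detail-page callback; the CSS selector is an assumption
    def parseDetail(self, response):
        item = response.meta['item']
        item['description'] = response.css('div.desc-info .con::text').extract_first()
        yield item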
A proxy middleware can be used for the crawl here. To configure it, first write the proxy middleware in middlewares.py; the middlewares.py code is as follows:
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/spider-middleware.html


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Route every request through the configured proxy
        request.meta['proxy'] = "http://proxy.yourproxy:8001"
Then register the middleware in Scrapy's settings by adding the following to settings.py:
DOWNLOADER_MIDDLEWARES = {
    'ChannelCrawler.middlewares.ProxyMiddleware': 100,
}
The proxy is now configured. For more detail on proxy configuration, see my other blog post on configuring proxies in Scrapy.
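If a single proxy is not enough, the same middleware hook can pick a proxy per request from a list. A minimal sketch, assuming a hypothetical PROXY_LIST entry in settings.py (register it in DOWNLOADER_MIDDLEWARES the same way as above):

# -*- coding: utf-8 -*-
import random

from scrapy.conf import settings


class RandomProxyMiddleware(object):
    def process_request(self, request, spider):
        # PROXY_LIST is an assumed setting, e.g.
        # PROXY_LIST = ["http://proxy1:8001", "http://proxy2:8001"]
        request.meta['proxy'] = random.choice(settings['PROXY_LIST'])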
To store the data, first configure the database in settings.py, then write the MongoDB code in pipelines.py. Start by adding the following to settings.py:
ITEM_PIPELINES = {
    'ChannelCrawler.pipelines.MongoPipeline': 300,
}

MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_COLLECTION = "wandoujiaApps"
MONGODB_DB = "scrapyTest"
This registers the MongoDB pipeline, and the settings below it specify the MongoDB database name and collection name the data will be stored in. Then add the MongoDB code to pipelines.py as follows:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo

from scrapy.conf import settings

# class ChannelcrawlerPipeline(object):
#     def process_item(self, item, spider):
#         return item


class MongoPipeline(object):
    def __init__(self):
        # Connect to MongoDB using the values configured in settings.py
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # Store each scraped item as a document (insert_one() in pymongo 3+)
        self.collection.insert(dict(item))
        return item
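As a variant (a sketch, not the code used above), the connection can be opened and closed together with the spider so it is released cleanly when the crawl finishes:

class MongoPipeline(object):
    def open_spider(self, spider):
        # Called when the spider starts: open the MongoDB connection
        self.connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = self.connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def close_spider(self, spider):
        # Called when the spider finishes: release the connection
        self.connection.close()

    def process_item(self, item, spider):
        self.collection.insert(dict(item))
        return item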
With this in place, the crawler automatically stores the scraped content in MongoDB as it runs. Run scrapy crawl wandoujiaAppCrawler to start the crawler and save the data to MongoDB.
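To check that the data actually arrived, you can query the collection with pymongo (a quick check, using the database and collection names from the settings.py snippet above):

import pymongo

client = pymongo.MongoClient("localhost", 27017)
collection = client["scrapyTest"]["wandoujiaApps"]

print(collection.count())      # number of apps stored so far
print(collection.find_one())   # a sample document with name, size, downloadTimes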