Scrapy - Connecting to a Database

Through the previous articles we have learned to write some common web crawlers with the Scrapy framework. In this chapter we will use Scrapy to store the scraped data in a database.

Just as with writing data to a file, writing to a database is also done in the pipelines.py file.
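
Remember that a pipeline only runs if it is enabled in settings.py. A minimal sketch of that entry, assuming the Scrapy project is named blog (adjust the module path to your own project):

# settings.py - enable the pipeline so Scrapy calls process_item for every item
ITEM_PIPELINES = {
    "blog.pipelines.BlogPipeline": 300,  # lower numbers run earlier
}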

Storing to MySQL

After modifying pipelines.py, the code is as follows:

import pymysql

class BlogPipeline(object):

    def __init__(self):
        # Connect to the local MySQL server; adjust host/user/passwd/db to your environment
        self.conn = pymysql.connect(host='127.0.0.1', user='root', passwd='123456', db='colin-test', charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Each item field holds a list of parallel values; insert one row per index
        for i in range(len(item['title'])):
            title = item['title'][i]
            page_views = item['page_views'][i]
            published_date = item['published_date'][i]
            # Use a parameterized query instead of string concatenation to avoid SQL injection
            sql = "insert into blog(title, page_views, published_date) values(%s, %s, %s)"
            self.cursor.execute(sql, (title, page_views, published_date))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
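
The insert above assumes the blog table already exists in the colin-test database. A one-off creation script could look like the sketch below; the column types are an assumption, so adapt them to your data:

# create_table.py - run once before the first crawl (assumes colin-test already exists)
import pymysql

conn = pymysql.connect(host='127.0.0.1', user='root', passwd='123456', db='colin-test', charset='utf8mb4')
with conn.cursor() as cursor:
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS blog (
            id INT AUTO_INCREMENT PRIMARY KEY,
            title VARCHAR(255),
            page_views VARCHAR(32),
            published_date VARCHAR(64)
        ) CHARACTER SET utf8mb4
    """)
conn.commit()
conn.close()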

Storing to SQLite

After modifying pipelines.py, the code is as follows:

import sqlite3

class BlogPipeline(object):

    def __init__(self):
        # Open (or create) the local SQLite database file
        self.conn = sqlite3.connect('blog.db')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Each item field holds a list of parallel values; insert one row per index
        for i in range(len(item['title'])):
            title = item['title'][i]
            page_views = item['page_views'][i]
            published_date = item['published_date'][i]
            # Use ? placeholders instead of string concatenation to avoid SQL injection
            sql = "insert into blog(title, page_views, published_date) values(?, ?, ?)"
            self.cursor.execute(sql, (title, page_views, published_date))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
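
sqlite3.connect creates blog.db automatically if it does not exist, but the blog table still has to be created before the first insert. A minimal sketch (column types are assumptions) that can be run once, or placed in __init__:

# create_table.py - create the SQLite table once before crawling
import sqlite3

conn = sqlite3.connect('blog.db')
conn.execute("""
    CREATE TABLE IF NOT EXISTS blog (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        title TEXT,
        page_views TEXT,
        published_date TEXT
    )
""")
conn.commit()
conn.close()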

Storing to MongoDB

After modifying pipelines.py, the code is as follows:

import pymongo

class BlogPipeline(object):

    def __init__(self):
        # Connect to the local MongoDB server and select the database and collection
        self.client = pymongo.MongoClient("mongodb://localhost:27017/")
        database = self.client["sport"]
        self.collection = database["blog"]

    def process_item(self, item, spider):
        # Each item field holds a list of parallel values; insert one document per index
        for i in range(len(item['title'])):
            title = item['title'][i]
            page_views = item['page_views'][i]
            published_date = item['published_date'][i]
            self.collection.insert_one({"title": title, "page_views": page_views, "published_date": published_date})
        return item

    def close_spider(self, spider):
        # Collections have no close() method; close the client connection instead
        self.client.close()
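
Unlike MySQL and SQLite, MongoDB needs no schema up front: the sport database and the blog collection are created on the first insert. A quick sketch for checking the stored documents afterwards, assuming the same local server:

# check_data.py - verify what the pipeline wrote to MongoDB
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017/")
collection = client["sport"]["blog"]

# Print how many documents were stored and show the first few
print(collection.count_documents({}))
for doc in collection.find().limit(3):
    print(doc)

client.close()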
