
Python Web Scraping (Part 13): Scraping Douban Books with Scrapy

This time we crawl the Douban Books Top 250 list, then follow each entry to its book page and scrape the tags shown there.

Steps

Create the project and the Spider template

Run the following commands:

scrapy startproject demo
cd demo
scrapy genspider book book.douban.com

Write the Spider

The genspider command created demo/spiders/book.py, which is the file we edit next. First we scrape the URL of every book from the Top 250 page. Open the page and inspect its HTML.

Looking at the source, each book's information sits in a tr element with the class item, and the book's URL is inside an a tag within it. The anchor that wraps the cover image has no text, so the check on href.string keeps only the links with text (the title links). We yield a Request for each URL so that Scrapy fetches the corresponding book page:

def parse(self, response):
    soup = BeautifulSoup(response.text, 'html.parser')
    for item in soup.find_all('tr', attrs={'class': 'item'}):
        for href in item.find_all('a'):
            if href.string is not None:
                url = href.attrs['href']
                yield scrapy.Request(url, callback=self.parse_book)

Next, open the source of a book's detail page and search for "tag" to find where the book's tags appear.

We can use the regular expression 'tag/.*?"' to pull out the book's tags, and then yield the collected book information:

def parse_book(self, response):
    infoDict = {}
    booksoup = BeautifulSoup(response.text, 'html.parser')
    # the page <title> is the book name followed by ' (豆瓣)'; the slice trims that suffix
    infoDict.update({'bookname': booksoup.title.string[:-4]})
    tagInfo = re.findall('tag/.*?"', response.text)
    tag = []
    for i in tagInfo:
        # each match looks like 'tag/xxx"'; keep what sits between 'tag/' and the quote
        tag.append(i[4:-1])
    infoDict['tag'] = tag
    yield infoDict

Write the Pipeline

In pipelines.py we add a filename attribute that holds the output file name, open that file when the spider starts, and write every item we receive into it:

filename = 'book.txt'

def open_spider(self, spider):
    # open the output file once, when the spider starts
    self.f = open(self.filename, 'w', encoding='utf-8')

def close_spider(self, spider):
    self.f.close()

def process_item(self, item, spider):
    # write each item as one line of its dict representation
    try:
        line = str(dict(item)) + '\n'
        self.f.write(line)
    except:
        pass
    return item

Configure settings

Open settings.py and point ITEM_PIPELINES at the class we just wrote:
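A minimal sketch of the relevant lines, assuming the default project name demo and the DemoPipeline class from pipelines.py above (300 is just an ordinary priority value between 0 and 1000):

ITEM_PIPELINES = {
    'demo.pipelines.DemoPipeline': 300,
}

Depending on how the site responds to Scrapy's default user agent, you may also need to set a browser-like USER_AGENT (and review ROBOTSTXT_OBEY) in the same file.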

Run the program

Finally, run this from the command line:

scrapy crawl book

When the crawl finishes, a book.txt file appears in the project folder:
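Each line of book.txt is simply the str() of one item dict. A hypothetical line (the book name and tags below are placeholders, not real scraped values) looks like:

{'bookname': 'BOOK_TITLE', 'tag': ['TAG_1', 'TAG_2', 'TAG_3']}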

Complete code

book.py

# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup
import re


class BookSpider(scrapy.Spider):
    name = 'book'
    start_urls = ['https://book.douban.com/top250?icn=index-book250-all']

    def parse(self, response):
        # parse the Top 250 list page and follow every book link
        soup = BeautifulSoup(response.text, 'html.parser')
        for item in soup.find_all('tr', attrs={'class': 'item'}):
            for href in item.find_all('a'):
                # the anchor wrapping the cover image has no text; keep only text links
                if href.string is not None:
                    url = href.attrs['href']
                    yield scrapy.Request(url, callback=self.parse_book)

    def parse_book(self, response):
        # parse a single book page: book name plus its tags
        infoDict = {}
        booksoup = BeautifulSoup(response.text, 'html.parser')
        # the page <title> is the book name followed by ' (豆瓣)'; the slice trims that suffix
        infoDict.update({'bookname': booksoup.title.string[:-4]})
        tagInfo = re.findall('tag/.*?"', response.text)
        tag = []
        for i in tagInfo:
            # each match looks like 'tag/xxx"'; keep what sits between 'tag/' and the quote
            tag.append(i[4:-1])
        infoDict['tag'] = tag
        yield infoDict
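Note that start_urls only covers the first page of the list (25 books). If you want the whole Top 250, one possible extension is to also follow the list's "next page" link at the end of parse. A minimal sketch, assuming the paginator's next link sits inside a span with class next (check the actual markup before relying on this):

        # inside parse(), after the loop over the tr.item rows:
        next_span = soup.find('span', attrs={'class': 'next'})
        if next_span is not None and next_span.a is not None:
            # queue the next page of the Top 250 list with the same callback
            yield scrapy.Request(next_span.a.attrs['href'], callback=self.parse)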

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class DemoPipeline(object):
    filename = 'book.txt'

    def open_spider(self, spider):
        # open the output file once, when the spider starts
        self.f = open(self.filename, 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        # write each item as one line of its dict representation
        try:
            line = str(dict(item)) + '\n'
            self.f.write(line)
        except:
            pass
        return item

Acknowledgements

Finally, I want to thank Professor Song Tian of Beijing Institute of Technology for his MOOC course "Python网络爬虫与信息提取" (Python Web Crawling and Information Extraction). These study notes are a record of that course and of my own practice; without Professor Song's course I could not have learned so much about Python web scraping. Thanks to Professor Song and his team! (course link)
