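Scraping Douban Books with rvest

A quick Baidu search turns up plenty of Douban Books scraping examples. This post mainly follows the Zhihu article 爬虫利器Rvest包 ("Rvest, a handy scraping package").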

[Image: Douban Books Top 250.png]

### Scraping the first page

Using the first page as an example, let's walk through how Douban Books is scraped.

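#### title

Installing the SelectorGadget extension for Chrome makes it much easier to find the right selector in the HTML.

[Image: SelectorGadget extension.png]

Elements that match the current selector are highlighted in yellow, which makes picking a selector much faster.

[Image: SelectorGadget extension in use.png]

Pass the selector for the content you want to the `html_nodes` function.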

title<-web %>% html_nodes("div.pl2 a") %>% html_text()

[Image: HTML corresponding to title.png]
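To make the selector step concrete, here is a minimal sketch that runs offline. The HTML fragment is a hypothetical imitation of one list entry, not the real page source, and the names `html`, `page`, `links`, `raw_title`, `clean_title` and `link_url` are made up for illustration. It shows what `"div.pl2 a"` matches and how `html_text()` and `html_attr()` pull values out of the matched nodes.

```r
library(rvest)

# Hypothetical fragment imitating one book entry (assumption, not the real page).
html <- '
<div class="pl2">
  <a href="https://book.douban.com/subject/0000000/" title="Book A">
    Book A
  </a>
</div>'

page <- read_html(html)

# "div.pl2 a" selects every <a> that sits inside a <div class="pl2">.
links <- page %>% html_nodes("div.pl2 a")

raw_title   <- links %>% html_text()             # keeps the surrounding whitespace
clean_title <- links %>% html_text(trim = TRUE)  # leading/trailing whitespace removed
link_url    <- links %>% html_attr("href")       # the link target, if it is needed
```

The untrimmed result is why the scraped titles later need whitespace cleanup; `trim = TRUE` is one way to avoid that at the source.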

#### Book information

The other book fields can be scraped the same way. The full code is shown below.

library(rvest)
# Read the first page of the Top 250 list
web <- read_html("https://book.douban.com/top250?icn=index-book250-all", encoding = "UTF-8")
#------------------------------------------------------------>books' title
title <- web %>% html_nodes("div.pl2 a") %>% html_text()
title <- as.data.frame(title)
#------------------------------------------------------------>books' information
information <- web %>% html_nodes("p.pl") %>% html_text()
information <- as.data.frame(information)
douban <- cbind(title, information)
#------------------------------------------------------------>books' quote
# Note: cbind() only works here if every book on the page has a quote;
# it stops with an error when the row counts differ.
quote <- web %>% html_nodes("p.quote") %>% html_text()
quote <- as.data.frame(quote)
douban <- cbind(douban, quote)
#------------------------------------------------------------>books' rating
rating <- web %>% html_nodes("span.rating_nums") %>% html_text()
rating <- as.data.frame(rating)
douban <- cbind(douban, rating)
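One thing to watch in the column-by-column approach above: if some book has no quote, the `quote` column ends up shorter than the others and `cbind()` fails. A possible workaround, sketched here under assumptions, is to select one container node per book first and then use `html_node()` (singular), which yields `NA` where a field is missing. The `tr.item` container and the HTML fragment are assumptions made for illustration; check the real page with SelectorGadget before relying on them.

```r
library(rvest)

# Hypothetical two-book fragment; the second book deliberately has no quote.
html <- '
<table>
  <tr class="item"><td>
    <div class="pl2"><a href="#">Book A</a></div>
    <p class="pl">Author A / Publisher A / 2001 / 20.00</p>
    <span class="rating_nums">9.1</span>
    <p class="quote"><span class="inq">A one-line quote</span></p>
  </td></tr>
  <tr class="item"><td>
    <div class="pl2"><a href="#">Book B</a></div>
    <p class="pl">Author B / Publisher B / 2002 / 30.00</p>
    <span class="rating_nums">8.8</span>
  </td></tr>
</table>'

# One node per book (the "tr.item" selector is an assumption).
items <- read_html(html) %>% html_nodes("tr.item")

# html_node() returns exactly one result per item, NA when nothing matches,
# so all four columns have the same length.
douban <- data.frame(
  title       = items %>% html_node("div.pl2 a") %>% html_text(trim = TRUE),
  information = items %>% html_node("p.pl") %>% html_text(trim = TRUE),
  quote       = items %>% html_node("p.quote") %>% html_text(trim = TRUE),
  rating      = items %>% html_node("span.rating_nums") %>% html_text(trim = TRUE),
  stringsAsFactors = FALSE
)
```

Here `douban` has two rows, and the missing quote shows up as `NA` instead of shifting the other rows out of alignment.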

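### Scraping all Top 250 books

To scrape every book, we need a loop over the pages.

#### Analyzing the Douban URLs

The page URLs end in 0, 25, 50, 75, 100, 125, 150, 175, 200, 225, in that order, so the URLs have to be built first.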

https://book.douban.com/top250?start=0
https://book.douban.com/top250?start=25
...
https://book.douban.com/top250?start=225
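The ten URLs listed above can also be generated rather than typed out. This is a minimal sketch using only base R; the variable names `starts` and `urls` are illustrative.

```r
# Offsets 0, 25, ..., 225: ten pages of 25 books each.
starts <- seq(0, 225, by = 25)

# paste0() is vectorized, so this yields one URL per offset.
urls <- paste0("https://book.douban.com/top250?start=", starts)
```

Each element of `urls` can then be passed to `read_html()` in turn.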

#### Building the loop

The full loop is shown below.

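library(rvest)
library(stringr)
# Page offsets used in the "start" query parameter
ind <- c(0, 25, 50, 75, 100, 125, 150, 175, 200, 225)
book <- data.frame()
for (i in seq_along(ind)) {
  # Build the URL of the current page and read it
  url <- str_c("https://book.douban.com/top250?start=", ind[i])
  web <- read_html(url, encoding = "UTF-8")
  # Title, publication information and rating of each book on the page
  title <- web %>% html_nodes("div.pl2 a") %>% html_text()
  title <- as.data.frame(title)
  information <- web %>% html_nodes("p.pl") %>% html_text()
  information <- as.data.frame(information)
  douban <- cbind(title, information)
  rating <- web %>% html_nodes("span.rating_nums") %>% html_text()
  rating <- as.data.frame(rating)
  douban <- cbind(douban, rating)
  # Append this page's rows to the overall result
  book <- rbind(book, douban)
}
# Remove the spaces from the titles, then save the result
book$title <- gsub(" ", "", book$title)
write.csv(book, "E:/RStudio/爬虫/豆瓣250/douban.csv")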

The scraped book titles contain a lot of spaces, so `book$title<-gsub(" ","",book$title)` is used to remove the spaces from the titles.
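Note that `gsub(" ", "", x)` removes only the space character. If the scraped text also contains line breaks, they stay in place. The sketch below compares three options on a hypothetical raw title (the string and the variable names are made up for illustration); which one fits depends on whether spaces inside a title should be kept.

```r
# Hypothetical raw title as html_text() might return it, with
# line breaks and indentation around the actual text.
raw_title <- "\n                Book A\n\n                \n              "

spaces_only <- gsub(" ", "", raw_title)     # removes spaces, keeps line breaks
all_ws      <- gsub("\\s+", "", raw_title)  # removes every whitespace character
trimmed     <- trimws(raw_title)            # removes only leading/trailing whitespace
```

`trimws()` keeps the space inside "Book A", which matters for titles that contain several words.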

### Top ten books by Douban rating

![Image.png](http://upload-images.jianshu.io/upload_images/5616135-f03507b2a9bc915c.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)


![Top ten books.png](http://upload-images.jianshu.io/upload_images/5616135-7210f9f912c59fb9.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
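The post does not show how the top ten was produced. One plausible way, sketched here with made-up data, is to convert the scraped rating text to numbers and sort in descending order. The `book` data frame below is a hypothetical stand-in with invented titles and ratings, not real scraped results.

```r
# Hypothetical stand-in for the scraped data; ratings arrive as text.
book <- data.frame(
  title  = paste("Book", LETTERS[1:12]),
  rating = c("9.0", "9.7", "8.8", "9.4", "9.1", "9.6",
             "8.9", "9.3", "9.5", "9.2", "8.7", "9.8"),
  stringsAsFactors = FALSE
)

# as.character() first, so the conversion is also safe if the column is a factor.
book$rating <- as.numeric(as.character(book$rating))

# Sort by rating, highest first, and keep the first ten rows.
top10 <- head(book[order(book$rating, decreasing = TRUE), ], 10)
```

Without the numeric conversion the ratings would be sorted as text, which happens to work for values like "9.1" but is fragile in general.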