R Web Scraping in Practice: Scraping NSFC Grant Information (Part 1)

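I first started scraping National Natural Science Foundation of China (NSFC) grant data because of a work requirement: I needed to compile NSFC funding information to help identify research hotspots. Early on I did this mechanically: search by keyword, select all the results by eye, then copy and paste, which was very time-consuming. Many WeChat public accounts also compile and share this information, but most of them require something like sharing a post to your Moments first. With an R scraper you can get NSFC information quickly and far more efficiently: a job that used to take 2-3 days can be done in a few minutes.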

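Sources of NSFC information

A scraper extracts the information visible on a web page; in principle, anything you can see, a scraper can fetch. So to scrape NSFC information, we first need to know which websites host it. A search turns up the following main sources: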

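1-National Natural Science Foundation of China (NSFC)

URL: https://isisn.nsfc.gov.cn/egrantindex/funcindex/prjsearch-list

This is the official source. Its search form is very fine-grained, so you cannot get a broad set of results from a single query, and it requires multiple CAPTCHAs, which makes it unfriendly to scrapers. It is fine for simple lookups, but I don't recommend it for scraping. For scraping experts, of course, none of this is an obstacle.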

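2-National Natural Science Foundation Results Search - NSFC

URL: http://nsfc.biomart.cn/

A search platform provided by 丁香通 (Biomart). Its data is not updated promptly and currently stops at 2015, so skip it. The site does host a large collection of NSFC grant proposals, though; I don't know whether those can be scraped, haha.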

3-medsci

URL: https://www.medsci.cn/sci/nsfc.do

This site can be ignored; its coverage is very incomplete.

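4-ScienceNet (科学网)

URL: http://fund.sciencenet.cn/

This site's search form is simple, a single keyword returns plenty of results, and there is no CAPTCHA, so it is scraper-friendly. However, it shows only 200 records for any keyword search, so the results are incomplete. More importantly, it now charges a fee! Skip this one as well.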

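5-National Natural Science Foundation Project Search (V2.0 official release)

URL: http://fund.zsci.com.cn/

I strongly recommend this site for scraping NSFC information. Its search form is simple, a single keyword returns plenty of results, and there is no CAPTCHA. It also displays complete information, which makes it the best choice for scraping NSFC data!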

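How do we scrape NSFC information?

Let's pose a concrete task: how do we scrape the lncRNA-related grants approved between 2016 and 2019?

First, open http://fund.zsci.com.cn/ and enter the keyword, as shown below: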

image.png

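This returns the results page below, with more than 1,200 records in total. Each record includes the title, principal investigator, applicant institution, research type, funding amount, and so on.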

image

Looking at the URL pattern, each URL consists of three parts: [http://fund.zsci.com.cn/Index/index/title/lncRNA/start_year/2016/end_year/2019/xmid/0/search/1/p/]+[page number]+[.html]

http://fund.zsci.com.cn/Index/index/title/lncRNA/start_year/2016/end_year/2019/xmid/0/search/1/p/1.html
http://fund.zsci.com.cn/Index/index/title/lncRNA/start_year/2016/end_year/2019/xmid/0/search/1/p/2.html
http://fund.zsci.com.cn/Index/index/title/lncRNA/start_year/2016/end_year/2019/xmid/0/search/1/p/3.html
http://fund.zsci.com.cn/Index/index/title/lncRNA/start_year/2016/end_year/2019/xmid/0/search/1/p/4.html
.......
http://fund.zsci.com.cn/Index/index/title/lncRNA/start_year/2016/end_year/2019/xmid/0/search/1/p/lastpage_number.html
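
The three-part pattern above can be sketched in base R as follows. This is only a minimal illustration: `build_page_urls` is a hypothetical helper name that does not appear in the original code, and the page range `1:3` is an arbitrary example.

```r
# Base part of the search URL, copied from the pattern above.
site <- "http://fund.zsci.com.cn/Index/index/title/lncRNA/start_year/2016/end_year/2019/xmid/0/search/1/p/"

# Hypothetical helper: glue base + page number + ".html" for each page.
build_page_urls <- function(site, pages) {
  paste0(site, pages, ".html")
}

urls <- build_page_urls(site, 1:3)
print(urls)
```

Because `paste0()` is vectorized, one call produces the URLs for every page number passed in.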

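Now let's start scraping in earnest.

Install and load the required R packages

# install.packages(c("rvest", "stringr"))  # run once if the packages are not installed
rm(list = ls())   # clear the workspace
library(rvest)    # HTML parsing; also provides the %>% pipe
library(stringr)  # string helpers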

Scrape the information from the first page

url1 <- "http://fund.zsci.com.cn/Index/index/title/lncRNA/start_year/2016/end_year/2019/xmid/0/search/1/p/1.html"
web <- read_html(url1)
#--- grant titles ---
Title <- web %>% html_nodes('ul a li h3') %>% html_text() # parse the title text
Title
#--- all remaining fields ---
Information <- web %>% html_nodes('ul a li span') %>% html_text()  # every span: PI, institution, type, etc.
Information
# The labels below are matched in Chinese, exactly as they appear on the page.
#--- principal investigator ---
Author <-  Information[grep("负责人", Information)]
Author
#--- applicant institution ---
Department <-  Information[grep("申请单位", Information)]
Department
#--- research type ---
jijintype <-  Information[grep("研究类型", Information)]
jijintype
#--- project approval number ---
Project <-  Information[grep("项目批准号", Information)]
Project
#--- approval year ---
Date <-  Information[grep("批准年度", Information)]
Date
#--- funding amount ---
Money <- Information[grep("金额", Information)]
Money
result <- data.frame(Title=Title,Author=Author,Department=Department,
                     Type=jijintype,Project=Project,Date=Date,Money=Money)
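
One thing to watch for in the step above: when its column vectors differ in length, `data.frame()` either raises an error or silently recycles the shorter vector, which could happen if a record on the page lacks one of the fields. Below is a minimal sketch of a guard for that case, assuming every record should contribute exactly one value per field. `check_field_lengths` and the sample vectors are hypothetical and only illustrate the idea.

```r
# Hypothetical helper: confirm that all field vectors have the same length.
check_field_lengths <- function(fields) {
  lens <- vapply(fields, length, integer(1))
  if (length(unique(lens)) != 1) {
    stop("Field lengths differ: ",
         paste(names(lens), lens, sep = "=", collapse = ", "))
  }
  invisible(TRUE)
}

# Made-up sample values standing in for one scraped page.
fields_ok  <- list(Title = c("t1", "t2"), Author = c("a1", "a2"))
fields_bad <- list(Title = c("t1", "t2"), Author = c("a1"))

check_field_lengths(fields_ok)  # passes silently

# Capture the error message instead of stopping the script.
bad_msg <- tryCatch(check_field_lengths(fields_bad),
                    error = function(e) conditionMessage(e))
print(bad_msg)
```

In the scraper, the list would hold `Title`, `Author`, `Department`, and the other vectors, checked just before the `data.frame()` call.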

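To fetch every page, we need to know the largest page number for this search. Get the last page number first (you could of course just click through to the last page and read it off).

# open the first page of the search results
url1<-"http://fund.zsci.com.cn/Index/index/title/lncRNA/start_year/2016/end_year/2019/xmid/0/search/1/p/1.html"
# read the page content
web <- read_html(url1,encoding = "utf-8")
# the last link in the pagination box points to the last page
lastpage_link <- web %>% html_nodes("div.layui-box a") %>% html_attr("href")
lastpage_link <- paste0("http://fund.zsci.com.cn/",lastpage_link[length(lastpage_link)])

# open the last page and read the highlighted (current) page number
lastpage_web <- read_html(lastpage_link,encoding = "utf-8")
lastpage_number <-  lastpage_web %>% html_nodes("div.layui-box span.current") %>% html_text() %>% as.integer()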
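
Because the page number is embedded in the URL itself, it may also be possible to read the last page number straight from the last pagination link, without a second request. The sketch below assumes that link ends in `/p/<number>.html`, as the URL pattern earlier suggests. The sample `href` value (including the number 42) and the helper name `page_from_href` are made up for illustration.

```r
# Hypothetical helper: pull the page number out of a link ending in /p/<n>.html.
page_from_href <- function(href) {
  m <- regmatches(href, regexec("/p/([0-9]+)\\.html$", href))[[1]]
  if (length(m) < 2) return(NA_integer_)  # pattern not found
  as.integer(m[2])                        # first capture group = page number
}

# Made-up example link; the real href comes from html_attr("href").
sample_href <- "/Index/index/title/lncRNA/start_year/2016/end_year/2019/xmid/0/search/1/p/42.html"
print(page_from_href(sample_href))
```

If the site's last pagination link does not follow this pattern, the function returns `NA` and the two-request approach above is still needed.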

Use a loop to fetch the information from all pages

site <- 'http://fund.zsci.com.cn/Index/index/title/lncRNA/start_year/2016/end_year/2019/xmid/0/search/1/p/'

# Seed the results data frame with one row of Chinese field labels;
# this row is kept as the first data row of the output.
results <- data.frame(Title="题目",Author="负责人",Department="申请单位",
                      Type="研究类型",Project="项目批准号",Date="批准年度",Money="金额")

for(i in 1:lastpage_number){
  url <- paste0(site,i,".html")
  web <- read_html(url,encoding = "utf-8")
  #--- grant titles ---
  Title <- web %>% html_nodes('ul a li h3') %>% html_text() # parse the title text
  #--- all remaining fields ---
  Information <- web %>% html_nodes('ul a li span') %>% html_text()  # every span: PI, institution, type, etc.
  #--- principal investigator ---
  Author <-  Information[grep("负责人", Information)]
  #--- applicant institution ---
  Department <-  Information[grep("申请单位", Information)]
  #--- research type ---
  jijintype <-  Information[grep("研究类型", Information)]
  #--- project approval number ---
  Project <-  Information[grep("项目批准号", Information)]
  #--- approval year ---
  Date <-  Information[grep("批准年度", Information)]
  #--- funding amount ---
  Money <- Information[grep("金额", Information)]
  result <- data.frame(Title=Title,Author=Author,Department=Department,
                       Type=jijintype,Project=Project,Date=Date,Money=Money)
  # append this page's rows to the combined data frame
  results <- rbind(results,result)
}
write.csv(results,file = "2016-2019_lncRNA国自然.csv")
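
Growing a data frame with `rbind()` inside the loop works, but an alternative is to collect each page in a list and bind once at the end, pausing between requests and skipping pages that fail. The sketch below is one way to do that, not the method used in the original code. `scrape_pages`, `fetch_page`, and `delay` are hypothetical names, and `fake_fetch` only imitates a page parser so the example runs offline.

```r
# Hypothetical helper: fetch every URL, skip failures, bind the results once.
scrape_pages <- function(urls, fetch_page, delay = 0) {
  pages <- lapply(urls, function(u) {
    Sys.sleep(delay)  # pause before each request to be polite to the server
    tryCatch(
      fetch_page(u),
      error = function(e) {
        message("Skipping ", u, ": ", conditionMessage(e))
        NULL          # a failed page contributes nothing
      }
    )
  })
  pages <- Filter(Negate(is.null), pages)  # drop the failed pages
  do.call(rbind, pages)
}

# Stand-in for a real page parser: fails on page 2, returns one row otherwise.
fake_fetch <- function(url) {
  if (grepl("2\\.html$", url)) stop("simulated network failure")
  data.frame(Title = paste("grant from", basename(url)),
             stringsAsFactors = FALSE)
}

urls <- paste0("http://example.invalid/p/", 1:3, ".html")
results <- scrape_pages(urls, fake_fetch)
print(results)
```

In real use, `fetch_page` would wrap the `read_html()` and `html_nodes()` steps from the loop above, and `delay` would be set to a second or so.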

The scraped results are shown below, more than 1,200 records in total.