R Package Recommendations ① easyPubMed: A Handy Tool for PubMed


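1.1 About easyPubMed

The author is Damiano Fantini, and the package was published on 2019-03-29 (link in the References below).

Its Title: search and retrieve publication records for articles on PubMed.
Its self-description: easyPubMed queries NCBI Entrez and retrieves PubMed records in XML or text format. It can extract and aggregate data, and it makes downloading a large batch of records easy, for example pulling out author affiliations, titles, keywords, abstracts and publication dates as separate fields.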

2.1 What the package contains

Twelve exported objects in total (EPMsamples and PubMed_stopwords are datasets, the rest are functions):

articles_to_list, article_to_df, batch_pubmed_download, custom_grep, EPMsamples, fetch_all_pubmed_ids, fetch_pubmed_data, get_pubmed_ids, get_pubmed_ids_by_fulltitle, PubMed_stopwords, table_articles_byAuth, trim_address

3.1 Beginner tutorial

The first step, of course:

library(easyPubMed)

Then load the other packages used below:

library(XML)
library(dplyr)
library(kableExtra)
library(parallel)
library(foreach)
library(doParallel)

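3.1.1 Simple data retrieval

Take the package author as an example and fetch all of his articles. The [AU] tag restricts the search to the author field.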

query <- "Damiano Fantini[AU]"  
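PubMed queries are plain strings in which each term can carry a field tag in square brackets, such as [AU] for author or [TI] for title. As a minimal sketch, the helper below shows one way to assemble such strings in base R. Note that tag_terms is a hypothetical name introduced for illustration; it is not part of easyPubMed.

```r
# Hypothetical helper, not part of easyPubMed: append a PubMed field tag
# to each term and join the tagged terms with a boolean operator.
tag_terms <- function(terms, field, op = "OR") {
  tagged <- paste0(terms, "[", field, "]")
  paste0("(", paste(tagged, collapse = paste0(" ", op, " ")), ")")
}

# Single author term, wrapped in parentheses
author_query <- tag_terms("Damiano Fantini", "AU")

# Two title terms combined with a publication date range
gene_query <- paste(tag_terms(c("APE1", "OGG1"), "TI"),
                    "AND (2012[PDAT]:2016[PDAT])")

print(author_query)
print(gene_query)
```

Either string could then be passed to get_pubmed_ids() in place of a hand-typed query.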

get_pubmed_ids() returns a list containing the information needed for the following steps.

A PubMed query submitted with the get_pubmed_ids() function has two effects:

  • the query results are posted on the Entrez History Server, ready for retrieval
  • the function returns a list containing all the information needed to access and download results from the server

entrez_id <- get_pubmed_ids(query)

fetch_pubmed_data() retrieves the PubMed data.

abstracts_txt <- fetch_pubmed_data(entrez_id, format = "abstract")
print(abstracts_txt[1:16])

The format argument accepts these options: "asn.1", "xml", "medline", "uilist", "abstract".

fetch_pubmed_data() can also be combined with the XML package to get the titles of all the articles.

abstracts_xml <- fetch_pubmed_data(entrez_id,format = "xml")
class(abstracts_xml)

The vignette example gives this result:

## [1] "XMLInternalDocument" "XMLAbstractDocument"

after which you can happily carry on with the following:

my_titles <- unlist(xpathApply(abstracts_xml, "//ArticleTitle", saveXML))
my_titles <- gsub("(^.{5,10}Title>)|(<\\/.*$)", "", my_titles)
my_titles[nchar(my_titles)>75] <- paste(substr(my_titles[nchar(my_titles)>75], 1, 70), 
                                        "...", sep = "")
print(my_titles)

and finally, just as happily, get the titles of all the articles.

However... what I got was:

abstracts_xml <- fetch_pubmed_data(entrez_id,format = "xml") 
## "xml" is the default format, so this argument can be omitted
class(abstracts_xml)
## [1] "character"
titles <- unlist(xpathApply(abstracts_xml, "//ArticleTitle", saveXML))
## Error in UseMethod("xpathApply") : 
  no applicable method for 'xpathApply' applied to an object of class "character"
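The error arises because fetch_pubmed_data() returned a character string here rather than a parsed XML document, and xpathApply() has no method for plain strings. One possible fix is to parse the string first with xmlParse(asText = TRUE) from the XML package. The sketch below does this on a small made-up record so that it runs offline. With real data, toy_xml would be replaced by abstracts_xml, assuming that object is a single string.

```r
library(XML)

# Made-up miniature of a PubMed XML reply (illustration only)
toy_xml <- paste0(
  "<PubmedArticleSet>",
  "<PubmedArticle><MedlineCitation><PMID>1</PMID>",
  "<Article><ArticleTitle>First toy title</ArticleTitle></Article>",
  "</MedlineCitation></PubmedArticle>",
  "<PubmedArticle><MedlineCitation><PMID>2</PMID>",
  "<Article><ArticleTitle>Second toy title</ArticleTitle></Article>",
  "</MedlineCitation></PubmedArticle>",
  "</PubmedArticleSet>"
)

print(class(toy_xml))   # a plain character string, like abstracts_xml above

# Parse the string into an XML document, then query it with XPath
doc <- xmlParse(toy_xml, asText = TRUE)
toy_titles <- xpathSApply(doc, "//ArticleTitle", xmlValue)
print(toy_titles)
```

Using xmlValue instead of saveXML returns the text inside each node directly, so the gsub() clean-up step is not needed in this variant.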

Another way around it: the custom_grep function can also extract the titles of all the articles.

titles <- custom_grep(abstracts_xml, "ArticleTitle", "char")
## the third argument is format, one of c("list", "char")
print(titles)

Success!

3.1.2 Downloading and saving records in TXT or XML format

batch_pubmed_download() saves the data to txt or xml files.

## Search for articles published in 2012-2016 whose title contains either of the two genes, APE1 or OGG1
new_query <- "(APE1[TI] OR OGG1[TI]) AND (2012[PDAT]:2016[PDAT])"

## Set the output format and the file name prefix
outfile <- batch_pubmed_download(pubmed_query_string = new_query, 
                                 format = "xml", 
                                 batch_size = 150,
                                 dest_file_prefix = "easyPM_example")
outfile
## [1] "easyPM_example01.txt" "easyPM_example02.txt"

No idea why I always end up with .txt files... the author's example shows this output instead:

## [1] "easyPM_example01.xml" "easyPM_example02.xml"
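The extension does not change what is inside the file: with format = "xml" the saved content should still be XML even when the file name ends in .txt. Below is a minimal offline sketch of how one might check this and, if desired, keep a copy with an .xml extension. The file written here is a made-up stand-in for a downloaded batch, not real easyPubMed output.

```r
# Stand-in for a downloaded batch: XML content saved under a .txt name
txt_file <- file.path(tempdir(), "toy_batch01.txt")
writeLines(c('<?xml version="1.0" ?>',
             "<PubmedArticleSet></PubmedArticleSet>"), txt_file)

# Peek at the first line to see whether the content looks like XML
first_line <- readLines(txt_file, n = 1)
looks_like_xml <- grepl("^[[:space:]]*<", first_line)
print(looks_like_xml)

# Optionally keep a copy with an .xml extension
xml_file <- sub("\\.txt$", ".xml", txt_file)
copied <- file.copy(txt_file, xml_file, overwrite = TRUE)
print(basename(xml_file))
```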

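3.1.3 Extracting information from a single PubMed record

articles_to_list() splits the XML into one string per record. custom_grep() then extracts the requested field from a given PubMed record and returns it as a list or a character vector.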

PM_list <- articles_to_list(abstracts_xml)
## pick any one of the records
custom_grep(PM_list[13], tag = "DateCompleted")
## [[1]]
## [1] " 2011 01 31 "

Do check, though, that the selected record actually contains the tag you want to extract; if it does not, the result is empty:

custom_grep(PM_list[8], tag = "DateCompleted")
## list()
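One way to avoid this surprise is to check for the tag before extracting it. The sketch below uses base R on made-up record strings. Here has_tag is a hypothetical helper, and with real data the records would be the elements returned by articles_to_list().

```r
# Made-up records standing in for elements returned by articles_to_list()
toy_records <- list(
  "<PubmedArticle><PMID>1</PMID><DateCompleted><Year>2011</Year></DateCompleted></PubmedArticle>",
  "<PubmedArticle><PMID>2</PMID></PubmedArticle>"
)

# Hypothetical helper: TRUE when the record contains an opening <tag> or <tag ...>
has_tag <- function(record, tag) {
  grepl(paste0("<", tag, "[ >]"), record)
}

tag_present <- vapply(toy_records, has_tag, logical(1), tag = "DateCompleted")
print(tag_present)

# Indices of the records worth passing to custom_grep() for this tag
print(which(tag_present))
```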

An example with another tag:

custom_grep(PM_list[13], tag = "LastName", format = "char")
## [1] "Fantini"      "Vascotto"     "Marasco"      "D'Ambrosio"   "Romanello"    "Vitagliano"  
## [7] "Pedone"       "Poletto"      "Cesaratto"    "Quadrifoglio" "Scaloni"      "Radicella"   
## [13] "Tell"   

article_to_df() converts a record string, in particular an element produced by articles_to_list(), into a data.frame whose columns are "pmid", "doi", "title", "abstract", "year", "month", "day", "jabbrv", "journal", "keywords", "lastname", "firstname", "address", "email".

df <- article_to_df(PM_list[13], max_chars = 18)
colnames(df)
##  [1] "pmid"      "doi"       "title"     "abstract"  "year"      "month"     "day"      
##  [8] "jabbrv"    "journal"   "keywords"  "lastname"  "firstname" "address"   "email"   

Tidy up the table:

df$title <- substr(df$title, 1, 15)
df$address <- substr(df$address, 1, 19)
df$jabbrv <- substr(df$jabbrv, 1, 10)
df[,c("pmid", "title", "jabbrv", "firstname", "address")] %>% kable() %>% 
  kable_styling(bootstrap_options = 'striped')

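If several authors share the same affiliation, set the autofill argument to TRUE so that missing addresses are filled in from the other authors of the same record.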

df2 <- article_to_df(PM_list[13], autofill = TRUE)
df2$title <- substr(df2$title, 1, 15)
df2$address <- substr(df2$address, 1, 19)
df2$jabbrv <- substr(df2$jabbrv, 1, 10)
df2[,c("pmid", "title", "jabbrv", "firstname", "address")] %>% kable() %>% 
  kable_styling(bootstrap_options = 'striped')

3.1.4 Automatically extracting data from XML PubMed records

table_articles_byAuth() quickly extracts author information and publication data from multiple XML records. The function takes 5 arguments:

  • pubmed_data: an XML file or an XML object with PubMed records
  • max_chars and autofill: same as discussed in the previous example
  • included_authors: one of the following options: c("first", "last", "all"). The function can return data corresponding to the first, the last or all the authors for each PubMed record.
  • dest_file: if not NULL, the function attempts writing its output to the selected file. Existing files will be overwritten.

new_PM_query <- "(APEX1[TI] OR OGG1[TI]) AND (2010[PDAT]:2013[PDAT])"
outfile2 <- batch_pubmed_download(pubmed_query_string = new_PM_query, dest_file_prefix = "apex1_sample")
## [1] "PubMed data batch 1 / 1 downloaded..."
new_PM_file <- outfile2[1]
new_PM_df <- table_articles_byAuth(pubmed_data = new_PM_file, included_authors = "first", max_chars = 0)
## Processing PubMed data .................................................. done!

The result is a data.frame; tidy up the table.

new_PM_df$address <- substr(new_PM_df$address, 1, 28)
new_PM_df$jabbrv <- substr(new_PM_df$jabbrv, 1, 9)
new_PM_df[1:10, c("pmid", "year", "jabbrv", "lastname", "address")] %>%
  kable() %>% kable_styling(bootstrap_options = 'striped')

4.1 Advanced usage for old hands

4.1.1 Preparing the data

my_query <- 'Damiano Fantini[AU] AND '
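my_query <- get_pubmed_ids(my_query)
## retstart is a 0-based offset and retmax is the batch size, so the two must match;
## as.numeric() guards against Count being returned as text
batch_size <- 10
my_batches <- seq(from = 0, to = as.numeric(my_query$Count) - 1, by = batch_size)
my_abstracts_xml <- lapply(my_batches,  function(i) {
  fetch_pubmed_data(my_query, retmax = batch_size, retstart = i)  
})
## collect all the records in a single list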
all_xml <- list()
for(x in my_abstracts_xml) {
  xx <- articles_to_list(x)
  for(y in xx) {
    all_xml[[(1 + length(all_xml))]] <- y
  }  
}
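The batching logic can be checked without touching the network. Below is a small sketch, assuming Entrez-style paging in which retstart is a 0-based offset and retmax is the number of records per request. The helper batch_offsets is hypothetical and the record count is made up.

```r
# Hypothetical helper: 0-based start offsets for fetching `total` records
# in batches of `batch_size`
batch_offsets <- function(total, batch_size) {
  if (total <= 0) return(integer(0))
  seq(from = 0, to = total - 1, by = batch_size)
}

total_records <- 27    # made-up record count
per_batch <- 10

offsets <- batch_offsets(total_records, per_batch)
print(offsets)

# Each batch covers records offset .. offset + per_batch - 1, capped at the last record
last_in_batch <- pmin(offsets + per_batch - 1, total_records - 1)
print(data.frame(retstart = offsets, last_record = last_in_batch))
```

Because consecutive batches start exactly where the previous one ends, no record is fetched twice and none is skipped.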

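4.1.2 Quickly extracting PMID, title and abstract

Calling article_to_df(..., getAuthors = FALSE) skips the author information, so all the other fields are obtained quickly.

## time the whole process
t.start <- Sys.time()

## max_chars = -1 extracts the full abstract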
final_df <- do.call(rbind, lapply(all_xml, article_to_df, 
                                  max_chars = -1, getAuthors = FALSE))
t.stop <- Sys.time()
print(t.stop - t.start)
## Time difference of 0.4308472 secs

Select the columns of interest and display them as a table:

interested_df <- final_df[,c("pmid","title","year","jabbrv")]
interested_df %>%
  head(n=4) %>% kable() %>% kable_styling(bootstrap_options = 'striped')

4.1.3 Extracting everything, including keywords

Pass getKeywords = TRUE to article_to_df() to get the article keywords as well.

t.start <- Sys.time()
keyword_df <- do.call(rbind, lapply(all_xml, 
                                    article_to_df, autofill = TRUE, 
                                    max_chars = 100, getKeywords = TRUE))
t.stop <- Sys.time()
print(t.stop - t.start)
## Time difference of 4.216758 secs
print(keyword_df$keywords[seq(1, 150, by = 15)])

4.1.4 Splitting the job and extracting data in parallel

keyword_df[seq(1, 100, by = 10), c("lastname", "keywords", "abstract")] %>%
  kable() %>% kable_styling(bootstrap_options = 'striped')

The author notes here:

Load required packages (available from CRAN).
This will work on UNIX/LINUX systems.
Windows systems may not support the following code.

However, running on Windows, I noticed nothing unusual and no difference from the previous step (⊙ˍ⊙)
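The idea of this section is to split the list of records into chunks and process the chunks on several worker processes. A minimal sketch with the base parallel package follows. The records and the extract_pmid worker are made up so that the example runs offline; in real use the worker would call article_to_df() on each record instead. A socket cluster created with makeCluster() should also work on Windows.

```r
library(parallel)

# Made-up records standing in for the elements of all_xml
toy_records <- sprintf("<PubmedArticle><PMID>%d</PMID></PubmedArticle>", 101:108)

# Hypothetical worker: pull the PMID out of one record using base R only
extract_pmid <- function(record) {
  sub(".*<PMID>([0-9]+)</PMID>.*", "\\1", record)
}

# Start two worker processes, spread the records across them, then clean up
cl <- makeCluster(2)
pmids <- parLapply(cl, toy_records, extract_pmid)
stopCluster(cl)

pmids <- unlist(pmids)
print(pmids)
```

If the worker used easyPubMed functions, the package would first have to be loaded on every worker, for example with clusterEvalQ(cl, library(easyPubMed)). For a handful of records the cost of starting the workers can outweigh any gain, which may be why no speed-up is visible on a small query.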

4.1.5 Faster retrieval with an NCBI/Entrez API key

I don't have an API key, so everything below is copied from the vignette:

# define a PubMed Query: this should return 40 results
my_query <- '"immune checkpoint" AND 2010[DP]:2012[DP]'

# Monitor time, and proceed with record download -- USING API_KEY!
t_key1 <- Sys.time()
set_01 <- batch_pubmed_download(my_query, 
                                api_key = "NNNNNNNNNNe9108aee96ace507af23a4eb09", 
                                batch_size = 2, dest_file_prefix = "TMP_api_")
t_key2 <- Sys.time()

# Monitor time, and proceed with record download -- DO NOT USE API_KEY!
t_nok1 <- Sys.time()
set_02 <- batch_pubmed_download(my_query, 
                                batch_size = 2, dest_file_prefix = "TMP_no_")
t_nok2 <- Sys.time()
# Compute time differences
# The use of a key makes the process faster
print(paste("With key:", t_key2 - t_key1))
## [1] "With key: 20.7291417121887"
print(paste("W/o key:", t_nok2 - t_nok1))
## [1] "W/o key: 24.6694004535675"

In short, using a key is faster: about 20.7 versus 24.7 in the run above.
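Rather than hard-coding a key in a script, one option is to read it from an environment variable. The sketch below rests on assumptions: get_api_key is a hypothetical helper and EPM_DEMO_KEY is a made-up variable name used only for the demonstration. The helper returns NULL when no key is set, which can be passed straight to the api_key argument so that the download simply runs without a key.

```r
# Hypothetical helper: read the key from an environment variable,
# returning NULL when the variable is unset or empty
get_api_key <- function(var = "NCBI_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (nzchar(key)) key else NULL
}

# Demonstration with a throwaway variable and a placeholder value (not a real key)
Sys.setenv(EPM_DEMO_KEY = "my-demo-key")
key_when_set <- get_api_key("EPM_DEMO_KEY")
print(key_when_set)

Sys.unsetenv("EPM_DEMO_KEY")
key_when_unset <- get_api_key("EPM_DEMO_KEY")
print(is.null(key_when_unset))

# Intended use (not run here, needs network access):
# batch_pubmed_download(my_query, api_key = get_api_key(), dest_file_prefix = "TMP_api_")
```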

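4.1.6 Exact matching on the full article title

get_pubmed_ids_by_fulltitle(), a new function in version 2.11, takes the article title directly; a few more lines of code then give you the information you want.

Let's try it on the homework assigned by Boss Jimmy: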

ftitle_query <- "Identification of hub genes and outcome in colon cancer based on bioinformatics analysis"
my_field <- "[Title]"
fullti <- get_pubmed_ids_by_fulltitle(ftitle_query, field = my_field)
print(as.numeric(fullti$IdList$Id[1]))
## [1] 30643458

And just like that, we have the PMID of this article.

References

  • easyPubMed official website including news, vignettes, and further information https://www.data-pulse.com/dev_site/easypubmed/

