rvest Web Scraping in Practice: Batch Screening of Protein Subcellular Localization Results

Use case

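Given a set of genes, we want to look up the subcellular localization that UniProt reports for each gene or protein. Take an actin gene as an example: from our proteomics data we know that its UniProtKB accession is P07830, and searching UniProt for P07830 brings up the entry page uniprot-P07830.
In the UniProt annotation, the only localization listed is the cytoskeleton. There is also a GO cellular component annotation, which lists other locations in addition to the cytoskeleton.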

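Searching one entry at a time would take a long time for more than 5,000 proteins, and the more entries there are, the more error-prone manual searching becomes. So, after inspecting the web page, it is worth writing a scraper that runs the searches in batch.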

Page characteristics

First, click Subcellular location in the left navigation bar. The URL in the address bar then changes to

https://www.uniprot.org/uniprot/P07830#subcellular_location

So for any gene, the URL of the subcellular location section follows the pattern

https://www.uniprot.org/uniprot/ + UniprotKB + #subcellular_location
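
As a small illustration of this URL rule, the sketch below builds the address for several accessions at once. The helper name `build_uniprot_url` and the vector `acc_example` are made up for this example; only `paste0` from base R is used.

```r
# Build the subcellular-location URL for each accession.
# `build_uniprot_url` is a hypothetical helper name, not part of rvest.
build_uniprot_url <- function(acc) {
  paste0("https://www.uniprot.org/uniprot/", acc, "#subcellular_location")
}

# The two accessions mentioned in this post
acc_example <- c("P07830", "P19366")
urls <- build_uniprot_url(acc_example)
print(urls)
```

Because `paste0` is vectorized, the same call works for one accession or for thousands.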

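The overall approach to extracting the information is now clear:

  1. Use the URL rule above to scrape the subcellular localization information for each gene from UniProt, one at a time.
  2. Use a regular expression or some other method to detect the localization information we need.
  3. Tidy the results into a table and write it out.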
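
The three steps can be sketched offline as follows. This is only an outline under stated assumptions: `fetch_page_text` is a hypothetical stand-in that returns made-up page text instead of downloading anything, and `screen_keyword` is an illustrative name, not a function from any package.

```r
# Step 1 (stand-in): return page text for an accession.
# The strings below are placeholders, not real scraped pages.
fetch_page_text <- function(acc) {
  fake_pages <- c(
    P19366 = "ATP synthase subunit beta, chloroplastic",
    P07830 = "actin, cytoskeleton"
  )
  unname(fake_pages[acc])
}

# Steps 2 and 3: test each page for a keyword and collect a table.
screen_keyword <- function(acc, keyword, fetch = fetch_page_text) {
  hit <- vapply(
    acc,
    function(a) grepl(keyword, fetch(a), fixed = TRUE),
    logical(1)
  )
  data.frame(Accession = acc, Hit = unname(hit), stringsAsFactors = FALSE)
}

res <- screen_keyword(c("P19366", "P07830"), "chloroplast")
print(res)
```

Passing the fetcher in as an argument keeps the keyword test separate from the network access, so the logic can be checked without hitting the website.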

Running it

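Here we check whether a set of proteins localize to the chloroplast, so the information we need is simple: does the subcellular localization result contain the chloroplast keyword (chloroplastic)?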

We do this with the R package rvest.

First, let's see what scraping with rvest returns.

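## Load rvest (provides read_html, html_text and the %>% pipe)
library(rvest)
## Test first with UniProtKB accession P19366
url_test = "https://www.uniprot.org/uniprot/P19366#subcellular_location"
read_html(url_test) %>% 
      rvest::html_text() 
## Result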
[1] "atpB - ATP synthase subunit beta, chloroplastic - Arabidopsis thaliana (Mouse-ear cress) - atpB gene & protein\n\t\t\tvar BASE = '/';\n\t\t\n\t\t\t\tuniprot.isInternal = false;\n\t\t\t\tuniprot.namespace = 'uniprot';\n\t\t\t\tuniprot.releasedate = '2021_03';\n\t\t\t\n\t\t\t;\n\t\t\n\t\t\t\t// variable to store annotation data\n\t\t\t\tvar annotations = [];\n\t\t\t\tvar entryId = 'P19366';\n\t\t\t\tvar isObsolete = false || !true;\n\t\t\t\r\n                                    

An evidence describes the source of an annotation, e.g. an experiment that has been published in the scientific literature, an orthologous protein, a record from another database, etc.

\r\n\r\n

More...

\r\n Skip Header UniProtKBxUniProtKBProtein knowledgebaseUniParcSequence archiveHelpHelp pages, FAQs, UniProtKB manual, documents, news archive and Biocuration projects.UniRefSequence clustersProteomesProtein sets from fully sequenced genomesAnnotation systemsSystems used to automatically annotate proteins with high accuracy:UniRule (Expertly curated rules)ARBA (System generated rules)Supporting dataSelect one of the options below to target your search:Literature citationsTaxonomyKeywordsSubcellular locationsCross-referenced databasesHuman diseasesAdvancedSearchxHomeBLASTAlignRetrieve/ID mappingPeptide searchSPARQLContactHelpYou are using a version of browser that may not display all the features of this website. Please consider upgrading your browser.\n\t\tThe new UniProt website is here! \n\t\tTake me to UniProt BETAxUniProtKB - P19366\n\t\t\t(ATPB_ARATH)Basket 0(max 400 entries)x\n\t\t\t\t\tYou...

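The result contains many occurrences of the chloroplast keyword in the annotation of this gene. Looking at where they appear: first in the name of an .svg image, which is the subcellular localization figure; next in the [pubmed] literature citations; and finally in the descriptions of homologous genes. So we can be fairly confident that P19366 localizes to the chloroplast, and its gene annotation also shows that it is encoded by the chloroplast genome. The next step is therefore simply to use grepl to test whether this messy scraped string contains chloroplast. Note that this tests the text of the whole page, not only the subcellular location section. The overall flow is as follows: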
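
To make the grepl test concrete, here is a minimal offline sketch. The first string is taken from the page title shown in the output above; the other two strings are invented for illustration. It also shows that grepl is case-sensitive by default, which matters if a page only contains the capitalized form.

```r
# Example strings: the first comes from the scraped title above,
# the other two are made up for illustration.
page_text <- c(
  "atpB - ATP synthase subunit beta, chloroplastic - Arabidopsis thaliana (Mouse-ear cress)",
  "Chloroplast thylakoid membrane",
  "cytoskeleton"
)

# Default matching is case-sensitive: "Chloroplast" is not matched.
hits_default <- grepl("chloroplast", page_text)

# ignore.case = TRUE also catches the capitalized form.
hits_nocase <- grepl("chloroplast", page_text, ignore.case = TRUE)

print(hits_default)
print(hits_nocase)
```

Since "chloroplastic" contains "chloroplast", the shorter pattern covers both spellings mentioned in this post.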

flow
st=>start: UniprotKB ID
op=>operation: rvest web spider
cond=>condition: success or failed?
op2=>operation: grepl
cond1=>condition: TRUE or FALSE?
e=>end: output
st->op->cond
cond(yes)->op2->cond1
cond(no)->op
cond1(yes)->e
cond1(no)->e
Workflow
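# Load packages
library(rvest)
library(tidyverse)
## Read the UniProtKB information
test.p = read.delim("~/15.PostDoc/02.Project/13.cyl/result.txt",header = T,sep = "\t")
head(test.p)
## Extract the accessions
acc = test.p$Accession
## Build a data frame with acc as the first column; the second column is filled
## with 0 as a placeholder and will hold the results written by the loop below
df = data.frame(
  Accession = acc,
  Chlo_TorF = 0
)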

## Start scraping
for (i in seq_along(acc)) {
  ## Pause between requests to avoid hammering the server
  Sys.sleep(0.5)
  tryCatch({
    url = paste0("https://www.uniprot.org/uniprot/",acc[i],"#subcellular_location")
    torf = read_html(url) %>% 
      rvest::html_text() %>% 
      grepl(pattern = "chloroplast",x = .)
    ## Store "TRUE"/"FALSE"; rows that fail keep the placeholder 0
    df[i,2] = as.character(torf)
    print(paste0(acc[i]," finish"))
  },error = function(e) {
    print(paste0(acc[i]," has no search result in uniprot"))}
  )
}
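## Because of network problems, some genes may fail to download. The loop is not
## wrapped in a function, so we find the failed rows and run them again; the new
## results are written back into df. If some are still missing, repeat once more.
## Give every gene a row number
df$seq = seq_len(nrow(df))
## Find the row numbers of the genes whose search failed (still holding the placeholder 0)
df %>% 
  filter(Chlo_TorF == "0") %>% 
  select(seq) -> xx
## Convert to a vector
xx = xx$seq
## Fill in the failed rows again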
for (i in xx) {
  Sys.sleep(0.5)
  tryCatch({
    url = paste0("https://www.uniprot.org/uniprot/",acc[i],"#subcellular_location")
    torf = read_html(url) %>% 
      rvest::html_text() %>% 
      grepl(pattern = "chloroplast",x = .)
    df[i,2] = as.character(torf)
    print(paste0(acc[i]," finish"))
  },error = function(e) {
    print(paste0(acc[i]," has no search result in uniprot"))}
  )
}
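
The manual "find the failures and loop again" step could also be wrapped in a function that retries automatically. The sketch below is one possible shape, not the author's method: `check_with_retry`, `max_tries`, `pause` and the fake `flaky_fetch` are all hypothetical names introduced here, and the fetcher is simulated so the example runs without network access. In real use, `fetch` would be something like a function that calls read_html and html_text on the accession's URL.

```r
# Try `fetch(acc)` up to `max_tries` times; return NA if every attempt fails.
# All names here are illustrative assumptions, not part of rvest.
check_with_retry <- function(acc, fetch, keyword = "chloroplast",
                             max_tries = 3, pause = 0) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(
      grepl(keyword, fetch(acc), fixed = TRUE),
      error = function(e) NA
    )
    if (!is.na(result)) return(result)
    Sys.sleep(pause)  # wait before the next attempt
  }
  NA
}

# Fake fetcher for demonstration: fails on the first call, then succeeds.
calls <- 0
flaky_fetch <- function(acc) {
  calls <<- calls + 1
  if (calls < 2) stop("simulated network error")
  "ATP synthase subunit beta, chloroplastic"
}

res_retry <- check_with_retry("P19366", flaky_fetch)
print(res_retry)
```

Returning NA for entries that never succeed keeps "download failed" distinct from "keyword not found", which the 0 placeholder in the loops above also aims to do.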

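The result is a table whose first column is the UniProtKB ID and whose second column indicates whether the page text contains the chloroplast keyword. This gives a batch check of whether chloroplast appears in the subcellular localization results for the genes in our query.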
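
The post does not show how the table is saved, so the following is only a sketch of one way to write it out as a tab-separated file with base R. `df_example` holds made-up values in the same two-column layout as `df`, and the file goes to a temporary path chosen for this example.

```r
# Made-up results in the same layout as the df built above
df_example <- data.frame(
  Accession = c("P19366", "P07830"),
  Chlo_TorF = c("TRUE", "FALSE"),
  stringsAsFactors = FALSE
)

# Write a tab-separated file without quotes or row names
out_file <- tempfile(fileext = ".txt")
write.table(df_example, out_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back as character columns to confirm the round trip
check <- read.delim(out_file, colClasses = "character")
print(check)
```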
