Rosalind Toolkit: Searching NCBI Databases with Entrez

Introduction to Protein Databases

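UniProt, the central protein database, provides detailed annotation for each protein, such as function descriptions, functional and structural features, and post-translational modifications. It also supports protein similarity search, taxonomy analysis, and literature citations.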

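Given a UniProt ID, the detailed description of that entry can be fetched from "http://www.uniprot.org/uniprot/uniprot_id.txt" or "http://www.uniprot.org/uniprot/uniprot_id". The task is to retrieve, programmatically, the biological processes that the protein with a given UniProt ID takes part in.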

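I use Go's os.Args to read the ID from the command line, "net/http" to fetch the response, and "ioutil.ReadAll" to read the response body, which comes back as a byte slice. The body is parsed with a regular expression, and a for loop then prints the captured target fields.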

package main

import (
    "net/http"
    "fmt"
    "os"
    "log"
    "io/ioutil"
    "regexp"
)

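func main() {
    // The UniProt ID is the first command-line argument.
    if len(os.Args) < 2 {
        log.Fatal("usage: <program> <uniprot_id>")
    }
    id := os.Args[1]
    link := "http://www.uniprot.org/uniprot/" + id + ".txt"
    resp, err := http.Get(link)
    if err != nil {
        log.Fatal(err)
    }
    // Read the whole flat-file record, then release the connection.
    content, err := ioutil.ReadAll(resp.Body)
    resp.Body.Close()
    if err != nil {
        log.Fatal(err)
    }
    // Biological-process terms appear as "P:<name>;" in the record.
    re := regexp.MustCompile("P:(.*?);")
    BP := re.FindAllStringSubmatch(string(content), -1)
    // Index 1 of each match holds the captured process name.
    for i, n := 0, len(BP); i < n; i++ {
        fmt.Println(BP[i][1])
    }
}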
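
The pattern "P:(.*?);" matches anywhere in the record, so it could in principle pick up text outside the GO cross-references. As a rough sketch of a stricter variant, the function below only looks at lines of the form "DR   GO; GO:<number>; P:<name>;". The name extractProcesses and the sample record are hypothetical and only serve the illustration; the sample is not real UniProt data.

```go
package main

import (
	"fmt"
	"regexp"
)

// goProcessRe matches "DR   GO;" cross-reference lines whose term starts
// with "P:" (biological process) and captures the term name.
var goProcessRe = regexp.MustCompile(`(?m)^DR\s+GO;\s*GO:\d+;\s*P:([^;]+);`)

// extractProcesses returns the biological process names found in a
// UniProt flat-file record (hypothetical helper, not part of any library).
func extractProcesses(record string) []string {
	var names []string
	for _, m := range goProcessRe.FindAllStringSubmatch(record, -1) {
		names = append(names, m[1])
	}
	return names
}

func main() {
	// Made-up excerpt in the flat-file layout, for illustration only.
	sample := "DR   GO; GO:0005737; C:cytoplasm; IEA:UniProtKB-SubCell.\n" +
		"DR   GO; GO:0003677; F:DNA binding; IEA:UniProtKB-KW.\n" +
		"DR   GO; GO:0006355; P:regulation of DNA-templated transcription; IEA:UniProtKB-KW.\n" +
		"CC   -!- NOTE: P:not a GO line;\n"
	for _, name := range extractProcesses(sample) {
		fmt.Println(name)
	}
}
```

Keeping the parsing in its own function also makes it possible to test the regular expression without touching the network.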

 
 GenBank Introduction 
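The largest integrated database available to molecular biologists is GenBank, which contains nearly all publicly available DNA and protein sequences. GenBank was established in 1982 and is now maintained by NCBI; more than 30 years on, the amount of data it stores is beyond what you would imagine.
Every GenBank entry has a unique identifier that can be used to retrieve the full sequence, for example CAA79696, NP_778203, 263191547, BC043443, NM_002020. You can also search for a whole class of sequences by keyword.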
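Problem: GenBank comprises several subdivisions, such as Nucleotide, GSS (Genome Survey Sequence), and EST (Expressed Sequence Tags). To find your target precisely in these databases you need some search syntax; for example, (Drosophila[All Fields]) searches for Drosophila in all fields. Given an organism name and two dates, find the number of Nucleotide entries for that organism that were published in GenBank between those dates.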
Solution: NCBI provides Entrez for searching all the data it hosts. The searchable databases are:
  
   
    
Entrez Database	UID common name	E-utility Database Name
BioProject	BioProject ID	bioproject
BioSample	BioSample ID	biosample
Biosystems	BSID	biosystems
Books	Book ID	books
Conserved Domains	PSSM-ID	cdd
dbGaP	dbGaP ID	gap
dbVar	dbVar ID	dbvar
Epigenomics	Epigenomics ID	epigenomics
EST	GI number	nucest
Gene	Gene ID	gene
Genome	Genome ID	genome
GEO Datasets	GDS ID	gds
GEO Profiles	GEO ID	geoprofiles
GSS	GI number	nucgss
HomoloGene	HomoloGene ID	homologene
MeSH	MeSH ID	mesh
NCBI C++ Toolkit	Toolkit ID	toolkit
NCBI Web Site	Web Site ID	ncbisearch
NLM Catalog	NLM Catalog ID	nlmcatalog
Nucleotide	GI number	nuccore
OMIA	OMIA ID	omia
PopSet	PopSet ID	popset
Probe	Probe ID	probe
Protein	GI number	protein
Protein Clusters	Protein Cluster ID	proteinclusters
PubChem BioAssay	AID	pcassay
PubChem Compound	CID	pccompound
PubChem Substance	SID	pcsubstance
PubMed	PMID	pubmed
PubMed Central	PMCID	pmc
SNP	rs number	snp
SRA	SRA ID	sra
Structure	MMDB-ID	structure
Taxonomy	TaxID	taxonomy
UniGene	UniGene Cluster ID	unigene
UniSTS	STS ID	unists
    
   
  
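NCBI also provides a corresponding web API at https://eutils.ncbi.nlm.nih.gov/entrez/eutils/. Build a URL according to its rules, send the request, and the response comes back ready to parse. To avoid putting too much load on the servers, NCBI limits requests to no more than 3 per second unless the request carries an API key.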
  
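An API key can be requested at https://www.ncbi.nlm.nih.gov/account/ and raises the limit to 10 requests per second. Given network speeds from within China, I doubt I will ever need it.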
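
As a small sketch of how the two limits might be respected in code: the helper names requestInterval and withAPIKey are hypothetical, and the spacing between requests is simply derived from the 3 and 10 requests per second mentioned above. The api_key URL parameter is the one the E-utilities accept for passing a key.

```go
package main

import (
	"fmt"
	"net/url"
	"time"
)

// requestInterval returns the minimum pause to keep between two requests:
// 3 requests per second without a key, 10 per second with one.
func requestInterval(apiKey string) time.Duration {
	if apiKey == "" {
		return time.Second / 3
	}
	return time.Second / 10
}

// withAPIKey appends the api_key parameter to an E-utilities URL that
// already has a query string. An empty key leaves the URL unchanged.
func withAPIKey(link, apiKey string) string {
	if apiKey == "" {
		return link
	}
	return link + "&api_key=" + url.QueryEscape(apiKey)
}

func main() {
	key := "" // put your own key here; empty means anonymous access
	link := withAPIKey("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccore&term=Anthoxanthum[Organism]", key)
	fmt.Println(link)
	fmt.Println("pause between requests:", requestInterval(key))
}
```

A loop that issues several searches could call time.Sleep(requestInterval(key)) between requests.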
  
 Entrez search syntax:
  
  Boolean operators: AND OR NOT
  Field restriction: [], e.g. horse[Organism]
  Dates or other ranges: use a colon, e.g. the date range 2015/3/1:2016/4/30[Publication Date] or the sequence length range 110:500[Sequence Length]
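
To make the syntax concrete, here is a minimal sketch that combines a field restriction, a Boolean operator, and a date range into one search term. The function name buildTerm is made up for this example.

```go
package main

import "fmt"

// buildTerm combines an organism name and a publication date range into a
// single Entrez search term (hypothetical helper for illustration).
func buildTerm(organism, from, to string) string {
	return fmt.Sprintf("%s[Organism] AND %s:%s[Publication Date]", organism, from, to)
}

func main() {
	fmt.Println(buildTerm("horse", "2015/3/1", "2016/4/30"))
	// prints: horse[Organism] AND 2015/3/1:2016/4/30[Publication Date]
}
```

The result is still a plain search term; it has to be URL-encoded before it goes into a request.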
  
 Building the URL: taking the sample data Anthoxanthum 2003/7/25 2005/12/27 as an example, the search term is Anthoxanthum[Organism] AND 2003/7/25:2005/12/27[Publication Date], which as a URL becomes
 https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccore&term=Anthoxanthum[Organism]+AND+2003/7/25:2005/12/27[Publication%20Date]
 
  
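  If the search term contains " or #, they must be converted to "%22" and "%23". Spaces outside [] are replaced with "+", and spaces inside [] and () are replaced with "%20".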
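
These rules can be written down as a small function. The sketch below is one possible hand-rolled version under the assumption that brackets are balanced; encodeTerm is a hypothetical name. Go's standard library also offers url.QueryEscape in "net/url", which escapes many more characters than the rules listed here.

```go
package main

import (
	"fmt"
	"strings"
)

// encodeTerm applies the escaping rules above: " becomes %22, # becomes
// %23, a space becomes + outside brackets and %20 inside [] or ().
func encodeTerm(term string) string {
	var b strings.Builder
	depth := 0 // number of [ or ( currently open
	for _, r := range term {
		switch r {
		case '[', '(':
			depth++
			b.WriteRune(r)
		case ']', ')':
			if depth > 0 {
				depth--
			}
			b.WriteRune(r)
		case ' ':
			if depth > 0 {
				b.WriteString("%20")
			} else {
				b.WriteString("+")
			}
		case '"':
			b.WriteString("%22")
		case '#':
			b.WriteString("%23")
		default:
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(encodeTerm("Anthoxanthum[Organism] AND 2003/7/25:2005/12/27[Publication Date]"))
	// prints: Anthoxanthum[Organism]+AND+2003/7/25:2005/12/27[Publication%20Date]
}
```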
  
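 There is no need to overthink this problem. We assume the input already has the form "Anthoxanthum[Organism]+AND+2003/7/25:2005/12/27[Publication%20Date]" rather than Anthoxanthum 2003/7/25 2005/12/27, so all that remains is to build the URL, send the request, and parse the XML in the response.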
 package main

import (
        "fmt"
        "io/ioutil"
        "log"
        "net/http"
        "os"
        "encoding/xml"
)

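// BASE is the ESearch endpoint of the Entrez E-utilities API.
const BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

// buildLink assembles the request URL from a database name and an
// already URL-encoded query.
func buildLink(db, query string) string {
        link := BASE + "?db=" + db + "&term=" + query
        return link
}

// eSearch sends the request and returns the raw XML response body.
func eSearch(link string) []byte {
        resp, err := http.Get(link)
        if err != nil {
                log.Fatal(err)
        }
        body, err := ioutil.ReadAll(resp.Body)
        resp.Body.Close()
        if err != nil {
                log.Fatal(err)
        }
        return body
}

// eSearchResult holds the fields of the ESearch XML response used here.
type eSearchResult struct {
        Count  string   `xml:"Count"`
        RetMax string   `xml:"RetMax"`
        IdList []string `xml:"IdList>Id"`
}

func main() {
        // Expected arguments: <database> <query>
        if len(os.Args) < 3 {
                log.Fatal("usage: <program> <database> <query>")
        }
        db := os.Args[1]
        query := os.Args[2]
        link := buildLink(db, query)
        fmt.Println(link)
        content := eSearch(link)
        // Decode the XML into the struct; unknown elements are ignored.
        result := eSearchResult{}
        err := xml.Unmarshal(content, &result)
        if err != nil {
                fmt.Printf("err: %v", err)
                return
        }
        fmt.Printf("Count:%s\n", result.Count)
        fmt.Printf("Idlist:%v\n", result.IdList)
}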
 
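 The program reads two strings from the command line: the first is the database, the second is the query. The arguments are combined into a link, "net/http" sends the request, and ioutil.ReadAll reads the response into a byte slice. A struct is then defined to hold the result of parsing the XML. Using xml.Unmarshal requires some understanding of Go interfaces.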
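
The XML step can be tried without any network access. The sketch below decodes a hand-written response body; the sample XML and its IDs are made up for illustration, and searchResult and parseSearch are hypothetical names. Declaring Count as int, rather than string, is a variation that lets encoding/xml do the number conversion during decoding.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// searchResult mirrors the parts of an ESearch response used here.
// Count is an int, so encoding/xml parses the number while decoding.
type searchResult struct {
	Count  int      `xml:"Count"`
	IdList []string `xml:"IdList>Id"`
}

// parseSearch decodes an ESearch XML document into a searchResult.
func parseSearch(data []byte) (searchResult, error) {
	var r searchResult
	err := xml.Unmarshal(data, &r)
	return r, err
}

func main() {
	// Hypothetical response body; the IDs are not real GenBank records.
	sample := []byte(`<eSearchResult>
  <Count>2</Count>
  <RetMax>2</RetMax>
  <IdList>
    <Id>111</Id>
    <Id>222</Id>
  </IdList>
</eSearchResult>`)
	r, err := parseSearch(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(r.Count, r.IdList)
}
```

xml.Unmarshal takes a pointer to the destination value, which is why &r is passed; elements without a matching struct field, such as RetMax here, are skipped.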

