Learning Bioinformatics by Writing Code [0]: Bionode

Bionode


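Bionode is a collection of modular JavaScript APIs and shell commands for bioinformatics developers. Put simply, it is a set of Node modules plus command-line tools. Because Node is built around streams, wrapping a module as a pipe-friendly shell command is both easy and efficient.

A typical use looks like this: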

# Parse sequences in a fasta file into one JSON object per line, collect the ones that match chr11, and write them back out as fasta
$ cat genome.fasta | bionode-fasta | grep chr11 | bionode-fasta --write
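
The pipeline works because every stage reads and writes one JSON object per line. As a rough illustration only, here is a minimal Node sketch of a grep-like stage. The `id` and `seq` field names and the sample records are assumptions made up for this example. The post does not show what bionode-fasta actually emits.

```javascript
// Sketch of a filter over line-delimited JSON, similar in spirit to `grep chr11`.
// Assumption: every non-empty line is a JSON object with an `id` field
// (a hypothetical schema, not bionode-fasta's documented output).
function filterJsonLines(text, needle) {
  return text
    .split('\n')
    .filter((line) => line.trim() !== '')      // ignore blank lines
    .map((line) => JSON.parse(line))           // one JSON object per line
    .filter((record) => String(record.id).includes(needle));
}

// Made-up sample input: two records, one per line.
const sampleJsonLines = [
  '{"id":"chr11","seq":"ACGT"}',
  '{"id":"chr2","seq":"TTGA"}',
].join('\n');

for (const record of filterJsonLines(sampleJsonLines, 'chr11')) {
  console.log(JSON.stringify(record));
}
```

Unlike plain `grep`, which matches anywhere in the line, this version matches only on one field, which is the kind of thing structured records make easy.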

The project is supported by Repositive.io, a company that offers a search service for human genomic data and indexes more than a thousand databases.


Examples

The examples on the project page threw me straight away.

# Download all bacteria gff files
$ bionode-ncbi download gff bacteria

"bacteria gff files" sounds like data about some kind of bacteria, but what exactly would bacterial data be...

GFF File

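GFF is a format defined by the Sanger Institute: a simple, convenient way to describe features of DNA, RNA, and protein sequences, such as which span of a sequence is a gene. It has become a common format for sequence annotation, for example gene prediction on genomes, and many tools accept or emit GFF. The latest version of the format definition is version 3. For the original definition, see the SONG website.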
博耘生物

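After some searching, it turns out this field is very fond of GFF, which also comes in v2 and v3 flavors. It is still tab-separated text plus a spec, though, and I am starting to get a sense of where Bionode's value lies =.=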
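
To make the contrast between tab-separated text and JSON concrete, here is a minimal sketch that turns one GFF3 line into a JavaScript object. It relies on the nine-column GFF3 layout (seqid, source, type, start, end, score, strand, phase, attributes). The sample line is made up, and this is not Bionode's own parser.

```javascript
// The nine tab-separated columns of a GFF3 feature line.
const GFF_COLUMNS = [
  'seqid', 'source', 'type', 'start', 'end',
  'score', 'strand', 'phase', 'attributes',
];

// Parse one GFF3 line into an object; returns null for comments and blank lines.
function parseGffLine(line) {
  if (line.startsWith('#') || line.trim() === '') return null;
  const fields = line.split('\t');
  if (fields.length !== GFF_COLUMNS.length) {
    throw new Error(`expected 9 tab-separated columns, got ${fields.length}`);
  }
  const record = {};
  GFF_COLUMNS.forEach((name, i) => { record[name] = fields[i]; });
  record.start = Number(record.start);
  record.end = Number(record.end);
  // The attributes column looks like "ID=gene1;Name=demo".
  record.attributes = Object.fromEntries(
    record.attributes.split(';').filter(Boolean).map((pair) => {
      const idx = pair.indexOf('=');
      return idx === -1 ? [pair, ''] : [pair.slice(0, idx), pair.slice(idx + 1)];
    })
  );
  return record;
}

// Made-up example line, not taken from a real annotation file.
const exampleGffLine = [
  'chr1', 'example', 'gene', '1000', '2000', '.', '+', '.', 'ID=gene1;Name=demo',
].join('\t');

console.log(JSON.stringify(parseGffLine(exampleGffLine)));
```

The sketch skips details a real parser would need, such as URL-escaped characters in attribute values and the `##` directives defined by the spec.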

Installation

Install Node.js (obviously), then install bionode globally. It is worth installing bionode-ncbi as well, since it gives access to the NCBI API (e-utils).

npm install bionode -g
npm install bionode-ncbi -g

NCBI API (e-utils)

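NCBI stands for The National Center for Biotechnology Information, a US institution. The website is as "understated" as ever, but the data looks comprehensive.
There is a mirror in China hosted by PKU.

While searching, I also came across people scraping NCBI very cautiously, including a post on scraping NCBI at scale. In fact NCBI has long published API documentation: Accessing and using data in ClinVar.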

ClinVar

ClinVar aggregates information about genomic variation and its relationship to human health.

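ClinVar is a star-based database that records the relationship between genetic variants and human health. In other words, if a particular locus of yours is mutated, there might be, say, an 80% chance that you are red-green color-blind. The data is submitted voluntarily, though, so it is sparse.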
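
Since ClinVar is reachable through the E-utilities API, a search is just an HTTP GET against the `esearch.fcgi` endpoint. The sketch below only builds the URL and makes no network call. The search term `BRCA1[gene]` is an illustrative assumption, and `buildSearchUrl` is a hypothetical helper, not part of bionode-ncbi.

```javascript
// Base URL of the NCBI E-utilities endpoints.
const EUTILS_BASE = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils';

// Build an esearch URL for a database and search term, asking for JSON output.
// Hypothetical helper for illustration; bionode-ncbi wraps e-utils in its own way.
function buildSearchUrl(db, term) {
  const params = new URLSearchParams({ db, term, retmode: 'json' });
  return `${EUTILS_BASE}/esearch.fcgi?${params}`;
}

// Illustrative term only; adjust to whatever you actually want to look up.
const clinvarSearchUrl = buildSearchUrl('clinvar', 'BRCA1[gene]');
console.log(clinvarSearchUrl);
```

`URLSearchParams` takes care of percent-encoding the term, so spaces and brackets in a query do not need manual escaping.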

bionode xxx

With bionode installed, you can run the examples as a command-line tool:

$ bionode ncbi search genome solenopsis invicta
{"uid":"2938","organism_name":"Solenopsis invicta","organism_kingdom":"Eukaryota","organism_group":"","organism_subgroup":"Insects","defline":"Solenopsis invicta overview","projectid":49663,"project_accession":"PRJNA49663","status":"Draft","number_of_chromosomes":"0","number_of_plasmids":"0","number_of_organelles":"1","assembly_name":"Si_gnG","assembly_accession":"GCA_000188075.1","assemblyid":244018,"create_date":"2011/02/03 00:00","options":"","weight":"","chromosome_assemblies":"0","scaffold_assemblies":"1","sra_genomes":"0","taxid":13686}

The return value is plain JSON, which should make any JS developer happy! But wait, what does this mean... solenopsis invicta???


So it seems we just did a GET for this creature's genome information.

{
"uid":"2938",
"organism_name":"Solenopsis invicta",
"organism_kingdom":"Eukaryota",
"organism_group":"",
"organism_subgroup":"Insects",
"defline":"Solenopsis invicta overview",
"projectid":49663,
"project_accession":"PRJNA49663",
"status":"Draft",
"number_of_chromosomes":"0",
"number_of_plasmids":"0",
"number_of_organelles":"1",
"assembly_name":"Si_gnG",
"assembly_accession":"GCA_000188075.1",
"assemblyid":244018,
"create_date":"2011/02/03 00:00",
"options":"",
"weight":"",
"chromosome_assemblies":"0",
"scaffold_assemblies":"1",
"sra_genomes":"0",
"taxid":13686
}
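
One thing worth noticing in this record is that the count fields arrive as strings (`"number_of_chromosomes":"0"`), while `projectid` and `taxid` are real numbers. A minimal sketch of reading one output line in Node and coercing the counts follows. The `summarizeGenomeRecord` helper is made up for this example, and the input is a trimmed copy of the record above.

```javascript
// One line of search output, trimmed to a few of the fields shown above.
const genomeLine = '{"uid":"2938","organism_name":"Solenopsis invicta",' +
  '"status":"Draft","number_of_chromosomes":"0","number_of_organelles":"1","taxid":13686}';

// Count fields that arrive as numeric strings in the output above.
const COUNT_FIELDS = ['number_of_chromosomes', 'number_of_plasmids', 'number_of_organelles'];

// Hypothetical helper: pick a few fields and turn the string counts into numbers.
function summarizeGenomeRecord(jsonLine) {
  const record = JSON.parse(jsonLine);
  const summary = {
    organism: record.organism_name,
    status: record.status,
    taxid: record.taxid,
  };
  for (const field of COUNT_FIELDS) {
    if (field in record) summary[field] = Number(record[field]);
  }
  return summary;
}

console.log(summarizeGenomeRecord(genomeLine));
```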

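No chromosomes? What is going on... Let me search for humans instead... bionode ncbi search genome human...

And that went badly... it... is... more than just a full-text search... even anthrax turned up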

{
"organism_name":"Bacillus anthracis",
"projectid":12333,
"project_accession":"PRJNA12333",
"status":"Complete",
"number_of_chromosomes":"1",
...
}

Fine, let's search only for Homo sapiens, the species we can actually fall in love with.


{
"uid":"51",
"organism_name":"Homo sapiens",
"organism_kingdom":"Eukaryota",
"organism_group":"",
"organism_subgroup":"Mammals",
"defline":"Human genome projects have generated an unprecedented amount of knowledge about human genetics and health.",
"projectid":9558,
"project_accession":"PRJNA9558",
"status":"Complete",
"number_of_chromosomes":"48",
"number_of_plasmids":"0",
"number_of_organelles":"1",
"assembly_name":"GRCh38.p8",
"assembly_accession":"GCA_000001405.23",
"assemblyid":763971,
"create_date":"2001/02/15 00:00",
"options":"",
"weight":1000,
"chromosome_assemblies":"10",
"scaffold_assemblies":"31",
"sra_genomes":"0",
"taxid":9606
}

Argh, call me an idiot. Why is number_of_chromosomes 48? Shouldn't it be 46!!!

That's it for now, I need a moment to calm down...
