Partitioning the Whole Genome for Analysis

Source paper: Landscape of somatic mutations in 560 breast cancer whole genome sequences

An analysis approach worth imitating:

The genome was partitioned according to different sets of regulatory elements/gene features, with a separate analysis performed for each set of elements, including

  • exons (n=20,245 genes)
  • core promoters (n=20,245 genes, where a core promoter is the interval [-250,+250] bp from any transcription start site (TSS) of a coding transcript of the gene, excluding any overlap with coding regions)
  • 5’ UTR (n=9,576 genes)
  • 3’ UTR (n=19,502 genes)
  • intronic regions flanking exons (n=20,212 genes, representing any intronic sequence within 75 bp of an exon, excluding any base overlapping with any of the above elements)
  • ncRNAs (n=10,684, full length lincRNAs, miRNAs or rRNAs)
  • enhancers (n=194,054)
  • ultra-conserved regions (n=187,057, a collection of regions under negative selection based on 1,000 genomes data)
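
As a minimal sketch of how one of these element sets could be built, the core promoter definition above ([-250,+250] bp around a TSS) can be turned into BED intervals with awk. The input layout (tab-separated chrom, 1-based TSS position, gene) and the function name `tss_to_core_promoter` are assumptions made for illustration; they do not come from the paper.

```shell
# Hypothetical input on stdin: tab-separated chrom, TSS position (1-based), gene.
# Output: BED intervals (0-based, half-open) covering [-250,+250] bp around each TSS.
tss_to_core_promoter() {
    awk -v flank=250 'BEGIN { FS = OFS = "\t" }
    {
        start = $2 - 1 - flank        # 0-based start of the window
        if (start < 0) start = 0      # clip at the start of the chromosome
        end = $2 + flank              # half-open end, so base TSS+250 is included
        print $1, start, end, $3
    }'
}
```

The window is symmetric, so the strand does not matter here. This sketch does not clip at chromosome ends and does not remove overlap with coding regions; the latter would probably need an extra step such as `bedtools subtract -a core_promoter.bed -b exon_probe.hg19.gene.bed` (file names assumed).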

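Clearly, all you need to do is download the annotation file (GTF or GFF format both work) for the reference genome build of your species of interest from the relevant database; the files listed above can then be built from the annotated coordinates.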
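
A rough sketch of that idea for the exon set, assuming a standard nine-column GTF whose attribute column carries a `gene_name "..."` entry (as Ensembl and GENCODE files usually do). GTF coordinates are 1-based and inclusive while BED is 0-based and half-open, so the start is shifted by one. The function name `gtf_exons_to_bed` is made up for this example.

```shell
# Read a GTF on stdin and write unique exon intervals as BED6
# (chrom, start, end, gene name, score placeholder, strand).
gtf_exons_to_bed() {
    tab=$(printf '\t')
    awk 'BEGIN { FS = OFS = "\t" }
    /^#/ { next }                              # skip header comment lines
    $3 == "exon" {
        name = "."
        # extract the value of gene_name "XXX" from the attribute column, if present
        if (match($9, /gene_name "[^"]+"/)) {
            name = substr($9, RSTART + 11, RLENGTH - 12)
        }
        print $1, $4 - 1, $5, name, ".", $7    # 1-based inclusive -> 0-based half-open
    }' |
    LC_ALL=C sort -u |                         # exons shared by several transcripts appear once
    LC_ALL=C sort -t "$tab" -k1,1 -k2,2n -k3,3n
}
```

It could be run as `gtf_exons_to_bed < annotation.gtf > exons.bed`, where `annotation.gtf` stands for whatever annotation file you downloaded.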

Of course, at this stage most software exchanges coordinates in BED format.

 # Split the bracketed interval list of each CCDS record into one line per interval: chrom, start, end, gene
 cat CCDS.20110907.txt | perl -alne '{/\[(.*?)\]/;next unless $1;$gene=$F[2];$exons=$1;$exons=~s/\s//g;$exons=~s/-/\t/g;print "$F[0]\t$_\t$gene" foreach split/,/,$exons;}' | sort -u | bedtools sort -i - > exon_probe.hg19.gene.bed
 cat CCDS.20160908.txt | perl -alne '{/\[(.*?)\]/;next unless $1;$gene=$F[2];$exons=$1;$exons=~s/\s//g;$exons=~s/-/\t/g;print "$F[0]\t$_\t$gene" foreach split/,/,$exons;}' | sort -u | bedtools sort -i - > exon_probe.hg38.gene.bed

For example, open the exon coordinate file produced above, exon_probe.hg19.gene.bed, which has nearly 200,000 lines:

1   69090   70007   OR4F5
1   367658  368596  OR4F29
1   621095  622033  OR4F16
1   801942  802433  LINC00115
1   861321  861392  SAMD11
1   865534  865715  SAMD11
1   866418  866468  SAMD11
1   871151  871275  SAMD11
1   874419  874508  SAMD11
1   874654  874839  SAMD11
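
One quick sanity check on a file like this is to count the intervals per gene and report the span they cover. The following is only a sketch: the function name `bed_gene_summary` is invented here, and it groups by gene name alone, so a name that occurs on more than one chromosome would be merged.

```shell
# Read a four-column BED-like file (chrom, start, end, gene) on stdin and print,
# per gene: gene, number of intervals, smallest start, largest end.
bed_gene_summary() {
    awk 'BEGIN { OFS = "\t" }
    {
        gene = $4; s = $2 + 0; e = $3 + 0    # force numeric comparison
        if (!(gene in n) || s < lo[gene]) lo[gene] = s
        if (!(gene in n) || e > hi[gene]) hi[gene] = e
        n[gene]++
    }
    END { for (g in n) print g, n[g], lo[g], hi[g] }' |
    LC_ALL=C sort                            # deterministic order by gene name
}
```

For instance, `bed_gene_summary < exon_probe.hg19.gene.bed | head` would list the first few genes with their interval counts.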

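The teacher can only lead you to the door; the practice is up to you!

You may need several hours to understand this tutorial, and then a dozen more to imitate it and produce the remaining coordinate files.

But at least you have found the path. Keep going!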
