Human reference genome: key facts (work in progress)

1. How large is the human genome?

  • The sizes below come from the hg38 release provided by UCSC, currently the most commonly used human reference genome:
    http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
     chr   size(bp) approx
1   chr1  248956422  249M
2   chr2  242193529  242M
3   chr3  198295559  198M
4   chr4  190214555  190M
5   chr5  181538259  182M
6   chr6  170805979  171M
7   chr7  159345973  159M
8   chrX  156040895  156M
9   chr8  145138636  145M
10  chr9  138394717  138M
11 chr11  135086622  135M
12 chr10  133797422  134M
13 chr12  133275309  133M
14 chr13  114364328  114M
15 chr14  107043718  107M
16 chr15  101991189  102M
17 chr16   90338345   90M
18 chr17   83257441   83M
19 chr18   80373285   80M
20 chr20   64444167   64M
21 chr19   58617616   59M
22  chrY   57227415   57M
23 chr22   50818468   51M
24 chr21   46709983   47M
25   SUM 3088269832 3088M
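# chrM (the mitochondrial genome) is not included; it is short, only 16,569 bp (about 16.6 kbp).
  • As the table shows, lower-numbered chromosomes are generally longer, with a few exceptions (chr11 > chr10, chr20 > chr19, chr22 > chr21). Lengths range from about 47 Mbp to 249 Mbp.
  • Humans are diploid, so a somatic cell carries about 6 billion base pairs in total; the reference sequence is a single (haploid) copy of about 3.1 billion.
  • A reference genome is usually stored as plain text (FASTA), that is, as the ASCII characters "A", "T", "C" and "G".
  • One ASCII character takes 1 byte, so storing 3 billion letters (one strand) as plain text takes 3,000,000,000 B = 3 GB.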
    [Figure omitted; source: NCBI]
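The SUM row and the 3 GB estimate can be rechecked with a few lines of shell. This is a minimal sketch, not part of the original workflow: `sum_primary` and `plain_text_gb` are helper names invented here, the input is assumed to be a two-column table in the `hg38.chrom.sizes` layout, and the inline sample uses only three rows from the table above plus chrM.

```shell
# Sum the lengths of chr1-22, chrX and chrY from "name<TAB>size" lines on stdin.
# chrM and any *_random, chrUn_* or *_alt entries are skipped by the name filter.
sum_primary() {
  awk '$1 ~ /^chr([0-9]+|X|Y)$/ { total += $2 }
       END { printf "%.0f\n", total }'
}

# Size in GB of N bases stored as one ASCII byte each (1 GB = 10^9 B here).
plain_text_gb() {
  awk -v n="$1" 'BEGIN { printf "%.2f GB\n", n / 1e9 }'
}

# Small inline sample: three rows from the table above, plus chrM (ignored).
printf 'chr21\t46709983\nchr22\t50818468\nchrY\t57227415\nchrM\t16569\n' | sum_primary

# The SUM value from the table above, stored as plain text.
plain_text_gb 3088269832

# With the real file downloaded (assumed to be in the current directory):
# sum_primary < hg38.chrom.sizes
```

Run against the full `hg38.chrom.sizes`, the first helper should reproduce the SUM row, since it applies the same "chr1-22, X, Y only" rule as the table.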

2. Odd chromosome names (chrUn, random, alt)

  • Again using the UCSC hg38 release as the example:
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
gunzip hg38.fa.gz
# extract the sequence IDs (FASTA header lines)
grep "^>" hg38.fa > chr.id
wc -l chr.id
#455 chr.id
head chr.id
####
>chr1
>chr10
>chr11
>chr11_KI270721v1_random
>chr12
>chr13
>chr14
>chr14_GL000009v2_random
>chr14_GL000225v1_random
>chr14_KI270722v1_random
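  • As the output shows, the file holds more than the 25 expected sequences (22 autosomes + X + Y + M): there are 455 in total. The extra sequences fall into three classes.
  • Before going through them, it helps to know that assembling raw sequencing reads into chromosome sequences passes through two intermediate stages, contigs and scaffolds, as illustrated below. Contigs are built by joining overlapping reads (a few kbp long) and contain no N bases. Scaffolds join contigs further, mainly using read-pair information, and therefore contain N bases that mark gaps of unknown sequence (a few hundred kbp). Finally, scaffolds are assembled into chromosome sequences.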


    [Figure: read → contigs → scaffolds]

    [Figure: read → chromosomes]
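The contig/scaffold relationship can be illustrated by going the other way: cutting a scaffold at its runs of N gives back the contigs it was built from. The sketch below is only an illustration; `split_contigs` is a name invented here and the sequence is a made-up toy string, not real genome data.

```shell
# Split a scaffold sequence (read from stdin) into contigs by cutting at runs of N.
# Line breaks inside the sequence are removed first; each contig is printed
# on its own line and empty pieces are dropped.
split_contigs() {
  tr -d '\n' | tr -s 'Nn' '\n\n' | grep .
}

# Toy scaffold: three contigs separated by two gaps.
printf 'ACGTACGTNNNNNNGGCCTTAANNNTTGA\n' | split_contigs
```

The function expects bare sequence, so a FASTA header line would have to be stripped before piping a real record through it.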

2.1 Unlocalized scaffolds (*_random)

  • a sequence found in an assembly that is associated with a specific chromosome but cannot be ordered or oriented on that chromosome.
  • In short: the chromosome this scaffold belongs to is known, but its exact position and orientation on that chromosome are not.
  • format: chr{chromosome number or name}_{sequence_accession}v{sequence_version}_random
grep "random" chr.id > chr.random
wc -l chr.random
#42 chr.random
head chr.random
###
>chr11_KI270721v1_random
>chr14_GL000009v2_random
>chr14_GL000225v1_random
>chr14_KI270722v1_random
>chr14_GL000194v1_random
>chr14_KI270723v1_random
>chr14_KI270724v1_random
>chr14_KI270725v1_random
>chr14_KI270726v1_random
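The naming format above can be taken apart mechanically. The following is a small sketch under the assumption that names follow the pattern exactly; `parse_scaffold_name` is a helper name made up for this example, and the same layout with an `_alt` suffix is accepted as well.

```shell
# Split a UCSC-style scaffold name into: chromosome, accession, version, suffix.
# Prints nothing if the name does not follow the pattern.
parse_scaffold_name() {
  printf '%s\n' "$1" |
    sed -n -E 's/^>?(chr[^_]+)_([A-Z]+[0-9]+)v([0-9]+)_(random|alt)$/\1 \2 \3 \4/p'
}

parse_scaffold_name '>chr14_GL000009v2_random'   # prints: chr14 GL000009 2 random
parse_scaffold_name 'chr1_KI270762v1_alt'        # prints: chr1 KI270762 1 alt
```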

2.2 Unplaced scaffolds (chrUn_*)

  • a sequence found in an assembly that is not associated with any chromosome.
  • In short: it is not known which chromosome this scaffold belongs to.
  • format: chrUn_{sequence_accession}v{sequence_version}
grep "chrUn" chr.id > chr.chrUn
wc -l chr.chrUn
#127 chr.chrUn
head chr.chrUn
###
>chrUn_KI270302v1
>chrUn_KI270304v1
>chrUn_KI270303v1
>chrUn_KI270305v1
>chrUn_KI270322v1
>chrUn_KI270320v1
>chrUn_KI270310v1
>chrUn_KI270316v1
>chrUn_KI270315v1
>chrUn_KI270312v1

2.3 Alternate loci scaffolds (*_alt)

  • a scaffold that provides an alternate representation of a locus found in the primary assembly. These sequences do not represent a complete chromosome sequence although there is no hard limit on the size of the alternate locus; currently these are less than 1 Mb. These could either be NOVEL patch sequences, added through patch releases, or present in the initial assembly release.
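  • In short: a single reference genome is workable because about 99.9% of the human sequence is identical between individuals. Some regions, however, differ between populations. For example, at a given locus 49% of people may carry sequence A and another 49% sequence B, and both are normal. Choosing either one as the reference would be unsatisfactory, so such regions are additionally represented by alternate loci scaffolds.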
  • format: chr{chromosome number or name}_{sequence_accession}v{sequence_version}_alt
  • The _alt naming is new in hg38. hg19 contained only nine alternate haplotype sequences, named with a _hap suffix (for example chr6_cox_hap2).
grep "alt" chr.id > chr.alt
wc -l chr.alt
#261 chr.alt
head chr.alt
###
>chr1_KI270762v1_alt
>chr1_KI270766v1_alt
>chr1_KI270760v1_alt
>chr1_KI270765v1_alt
>chr1_GL383518v1_alt
>chr1_GL383519v1_alt
>chr1_GL383520v2_alt
>chr1_KI270764v1_alt
>chr1_KI270763v1_alt
>chr1_KI270759v1_alt

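Note: the chromosome names shown above are those used in UCSC's hg release. They differ slightly from the names in GRCh38, but the sequence types are essentially the same.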
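The three grep commands above can be folded into one pass that labels each name by type. This is a minimal sketch assuming UCSC-style names as listed in this section; `classify_seq` is a helper name introduced here, and anything that matches none of the three patterns is simply reported as primary.

```shell
# Label a UCSC hg38 sequence name by type, following the three naming formats above.
classify_seq() {
  name=${1#">"}            # drop a leading ">" if present
  case "$name" in
    chrUn_*)  echo unplaced ;;      # 2.2: chromosome unknown
    *_random) echo unlocalized ;;   # 2.1: chromosome known, position unknown
    *_alt)    echo alt ;;           # 2.3: alternate locus
    *)        echo primary ;;       # anything else, e.g. chr1 ... chrM
  esac
}

# Names taken from the listings above.
for n in '>chr1' '>chr11_KI270721v1_random' '>chrUn_KI270302v1' '>chr1_KI270762v1_alt'; do
  printf '%s\t%s\n' "${n#">"}" "$(classify_seq "$n")"
done

# With chr.id from the steps above, a per-class tally could be obtained with:
# while read -r n; do classify_seq "$n"; done < chr.id | sort | uniq -c
```

Matching on the prefix and suffix of the name, rather than on a substring anywhere in the line, avoids accidental hits if a name ever contains one of the keywords in another position.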

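3. What fraction of the genome is protein-coding?

  • Of the roughly 3 billion bases in the genome, protein-coding genes are said to account for only about 5% of the total length, and the exons of their transcripts for only about 1.5%. The exact figures depend on the annotation used and on whether introns are counted.
  • The human chromosomes together encode roughly 20,000 to 30,000 protein-coding genes, spread across the chromosomes. The average gene is about 10 kbp long, but the length distribution is very wide (from a few hundred bases to more than 2 million bases).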
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.refGene.gtf.gz
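gunzip hg38.refGene.gtf.gz
# column 1 = chromosome, column 10 = gene_id value; refGene also contains non-coding RNA genes
awk '{print $1, $10}' hg38.refGene.gtf | sort -u | grep -v -e alt -e random -e fix | sort -k 1 > chr.gene
# number of distinct gene IDs per chromosome
cut -d" " -f 1 chr.gene | uniq -c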
###
   1113 chr10
   1676 chr11
   1392 chr12
    632 chr13
    946 chr14
   1010 chr15
   1146 chr16
   1574 chr17
    434 chr18
   1791 chr19
   2832 chr1
    780 chr20
    414 chr21
    644 chr22
   1817 chr2
   1563 chr3
   1088 chr4
   1313 chr5
   1453 chr6
   1341 chr7
   1029 chr8
   1114 chr9
      1 chrM
   1157 chrX
    143 chrY
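A slightly more defensive variant of the per-chromosome count reads the gene name from the `gene_id` attribute instead of relying on its column position, and filters on the chromosome column only. This is a sketch, not the method used for the numbers above: `genes_per_chrom` and `gtf_line` are names invented here, and the gene names in the sample are hypothetical.

```shell
# Count distinct gene_id values per chromosome from GTF lines on stdin,
# skipping alt, random, fix and chrUn sequences.
genes_per_chrom() {
  awk -F'\t' '
    $1 ~ /_(alt|random|fix)$/ || $1 ~ /^chrUn_/ { next }
    match($9, /gene_id "[^"]+"/) {
      gene = substr($9, RSTART + 9, RLENGTH - 10)   # text between the quotes
      if (!(($1, gene) in seen)) { seen[$1, gene] = 1; count[$1]++ }
    }
    END { for (c in count) print c, count[c] }
  ' | sort
}

# Hypothetical helper: emit one minimal GTF exon line for a chromosome and gene name.
gtf_line() {
  printf '%s\trefGene\texon\t1\t100\t.\t+\t.\tgene_id "%s"; transcript_id "%s_tx";\n' "$1" "$2" "$2"
}

# Toy input: GENE_A appears twice on chr1 and is counted once;
# the gene on the alt scaffold is ignored.
{
  gtf_line chr1 GENE_A
  gtf_line chr1 GENE_A
  gtf_line chr1 GENE_B
  gtf_line chr2 GENE_C
  gtf_line chr1_KI270762v1_alt GENE_D
} | genes_per_chrom

# With the real annotation (assumed to be unpacked in the current directory):
# genes_per_chrom < hg38.refGene.gtf
```

The output is sorted lexically, so chr10 comes before chr2, as in the listing above.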

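4. Downloading the reference genome

  • The commonly used assemblies are GRCh38/GRCh37 and hg38/hg19. The former are distributed by NCBI and Ensembl, the latter by UCSC. As the figure below shows, GRCh38 can be treated as equivalent to hg38, and GRCh37 as equivalent to hg19.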


    [Figure: human genome version]
  • Taking GRCh38/hg38 as the example:

4.1 NCBI

  • https://www.ncbi.nlm.nih.gov/projects/genome/guide/human/
wget -c ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.fna.gz

4.2 Ensembl

  • Ensembl release 103 is built on the GRCh38 assembly.
  • https://asia.ensembl.org/Homo_sapiens/Info/Index
  • http://ftp.ensembl.org/pub/release-103/fasta/homo_sapiens/dna/
  • https://www.ensembl.org/info/data/ftp/index.html (also provides cDNA transcript sequences)
wget -c http://ftp.ensembl.org/pub/release-103/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz
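Ensembl names its primary chromosomes `1` to `22`, `X`, `Y` and `MT`, whereas UCSC uses `chr1` to `chr22`, `chrX`, `chrY` and `chrM`, so files from the two sources cannot be mixed without renaming. The sketch below converts only those primary names in FASTA headers; `ensembl_to_ucsc` is a helper name made up here, scaffold names are left untouched, and the sample headers are simplified.

```shell
# Rewrite Ensembl-style primary chromosome names in FASTA headers to UCSC style.
# Only header lines (starting with ">") can match; sequence lines pass through.
ensembl_to_ucsc() {
  sed -E -e 's/^>MT( |$)/>chrM\1/' \
         -e 's/^>([0-9]+|X|Y)( |$)/>chr\1\2/'
}

# Toy FASTA with two simplified Ensembl-style headers.
printf '>1 dna:chromosome\nACGT\n>MT dna:chromosome\nGGCC\n' | ensembl_to_ucsc
```

For a full genome this would be applied as a stream, for example `gzip -cd file.fa.gz | ensembl_to_ucsc`, so the uncompressed file never has to be edited in place.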

4.3 UCSC

  • http://hgdownload.soe.ucsc.edu/downloads.html
  • http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/
wget -c http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
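After a download, a quick sanity check is to count the FASTA records without unpacking the file to disk. This is a small sketch; `count_fasta_records` is a name introduced here, and the demo builds a tiny temporary file instead of touching the real genome.

```shell
# Count FASTA records (header lines) in a gzip-compressed file.
count_fasta_records() {
  gzip -cd < "$1" | grep -c '^>'
}

# Toy file with two records.
tmp=$(mktemp)
printf '>seq1\nACGT\n>seq2\nGGCC\n' | gzip -c > "$tmp"
count_fasta_records "$tmp"
rm -f "$tmp"

# With the real download (assumed to be in the current directory):
# count_fasta_records hg38.fa.gz
```

For `hg38.fa.gz` the count should agree with the 455 header lines found in section 2.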

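5. To be continued

  • Corrections are welcome.
  • Other common questions about the human reference genome in bioinformatics research are also welcome in the comments, so that we can work them out together.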
