有效基因组(Effective Genome Size)大小计算

macs2 callpeak其中的一个参数-g,设置有效基因组大小,

  -g GSIZE, --gsize GSIZE
                        Effective genome size. It can be 1.0e+9 or 1000000000,
                        or shortcuts:'hs' for human (2.7e9), 'mm' for mouse
                        (1.87e9), 'ce' for C. elegans (9e7) and 'dm' for
                        fruitfly (1.2e8), Default:hs

根据教程Introduction to ChIP-Seq using high-performance computing跳转 https://deeptools.readthedocs.io/en/develop/content/feature/effectiveGenomeSize.html,
Option 1 faCount,下载后文件改为可执行,即可使用;原理是除了N之外的碱基数统计,适用于multimapping reads存在;
Option 2 unique-kmers.py,pip install可安装,但发现集群里有,直接cp到我的目录下使用;MAPQ等方法过滤后适用。
也可在khmer github下载 https://github.com/dib-lab/khmer/tree/master/scripts
【测试1】玉米参考基因组

[Option 1] Chr1-10 ATCG和为2,075,605,238,约为2.1e+9
$ ./faCount ./genome/b73_v4/Zea_mays.B73_RefGen_v4.dna.toplevel.fa
#seq    len A   C   G   T   N   cpg
1   307041717   80478275    71219187    71150610    80507052    3686593 13339729
2   244442276   63816849    56413504    56401383    63899356    3911184 10584323
3   235667834   61901922    54630954    54743629    62016611    2374718 10187767
4   246994605   64920679    56483306    56568849    64972046    4049725 10363139
5   223902240   58581754    51738109    51824187    58686601    3071589 9668744
6   174033170   44854954    39770359    39762960    44889331    4755566 7477887
7   182381542   47921757    42102677    42136861    47989908    2230339 7821653
8   181122637   47395194    41973473    42016725    47335775    2401470 7901594
9   159769782   41740561    36929012    36963464    41696724    2440021 6943031
10  150982314   39591853    34987130    34981695    39609962    1811674 6554095
Pt  140384  43281   26908   27087   43108   0   4429
Mt  569630  159353  125604  124652  160021  0   20332
......
total   2135083061  558869376   492941406   493261213   559278187   30732879    92095961

[Option 2] 约为1.8e+9
$ ./unique-kmers.py -k 150 ./genome/b73_v4/Zea_mays.B73_RefGen_v4.dna.toplevel.fa

|| This is the script unique-kmers.py in khmer.
|| You are running khmer version 2.1.1
|| You are also using screed version 1.0.4
||
|| If you use this script in a publication, please cite EACH of the following:
||
||   * MR Crusoe et al., 2015. http://dx.doi.org/10.12688/f1000research.6924.1
||   * A. Döring et al. http://dx.doi.org:80/10.1186/1471-2105-9-11
||   * Irber and Brown. http://dx.doi.org/10.1101/056846
||
|| Please see http://khmer.readthedocs.io/en/latest/citations.html for details.

Estimated number of unique 150-mers in ./genome/b73_v4/Zea_mays.B73_RefGen_v4.dna.toplevel.fa: 1767458681
Total estimated number of unique 150-mers: 1767458681

【测试2】GRCh38,给的参考是Option 1为2,913,022,398,Option 2为GRCh38(Read length 150)为2,862,010,578
和我计算的差别不太大。

[Option 1] Chr和为2,934,860,425
$ ./faCount ./genome/Homo_sapiens.GRCh38.dna.toplevel.fa
#seq    len A   C   G   T   N   cpg
1   248956422   67070277    48055043    48111528    67244164    18475410    2375159
2   242193529   71791213    48318180    48450903    71987932    1645301 2192670
3   198295559   59689091    39233483    39344259    59833302    195424  1673293
4   190214555   58561236    36236976    36331025    58623430    461888  1503429
5   181538259   54699094    35731600    35879674    54955010    272881  1566535
6   170805979   51345477    33646690    33713330    51373025    727457  1511189
7   159345973   47058248    32317984    32378859    47215040    375842  1622825
8   145138636   43333530    29030173    29103787    43300646    370500  1338200
9   138394717   35736329    25099811    25170662    35783748    16604167    1255728
10  133797422   38875926    27639505    27719976    39027555    534460  1388978
11  135086622   39286730    27903257    27981801    39361954    552880  1333114
12  133275309   39370109    27092804    27182678    39492225    137493  1315968
13  114364328   30047611    18839192    18933605    30162717    16381203    842469
14  107043718   26673415    18423758    18559033    26911943    16475569    895881
15  101991189   24508669    17752941    17825903    24553812    17349864    906026
16  90338345    22558319    18172742    18299976    22774906    8532402 1150891
17  83257441    22639499    18723944    18851500    22705261    337237  1248328
18  80373285    24050680    15794455    16061651    24182819    283680  756014
19  58617616    15142293    13954580    14061132    15282753    176858  1105620
20  64444167    17867246    13916133    14094472    18066406    499910  773477
21  46709983    11820664    8185244 8226381 11856330    6621364 462299
22  50818468    10382214    9160652 9246186 10370725    11658691    634646
X   156040895   46754807    30523780    30697741    46916701    1147866 1322709
Y   57227415    7155845 4632232 4630489 7217789 33591060    169449
......
total   63147197748 913598440   635211771   638005906   916191480   60044190151 31647338

[Option 2] 
$ ./unique-kmers.py -k 150 ./genome/Homo_sapiens.GRCh38.dna.toplevel.fa 

|| This is the script unique-kmers.py in khmer.
|| You are running khmer version 2.1.1
|| You are also using screed version 1.0.4
||
|| If you use this script in a publication, please cite EACH of the following:
||
||   * MR Crusoe et al., 2015. http://dx.doi.org/10.12688/f1000research.6924.1
||   * A. Döring et al. http://dx.doi.org:80/10.1186/1471-2105-9-11
||   * Irber and Brown. http://dx.doi.org/10.1101/056846
||
|| Please see http://khmer.readthedocs.io/en/latest/citations.html for details.

Estimated number of unique 150-mers in ./genome/Homo_sapiens.GRCh38.dna.toplevel.fa: 2882789054
Total estimated number of unique 150-mers: 2882789054

你可能感兴趣的:(有效基因组(Effective Genome Size)大小计算)