Whole-Genome Sequencing Data Analysis: Study Notes 1, Basic Concepts

Follow Jimmy:Gene Analysis
Exploring whole-genome sequencing data
1. Reference article
Systematic Characteristic Exploration of the Chimeras Generated in Multiple Displacement Amplification through Next Generation Sequencing Data Reanalysis
Note:
Chimera (genetics)
A genetic chimerism or chimera (also spelled chimaera) is a single organism composed of cells from different zygotes. This can result in male and female organs, two blood types, or subtle variations in form.[1] Animal chimeras are produced by the merger of multiple fertilized eggs. In plant chimeras, however, the distinct types of tissue may originate from the same zygote, and the difference is often due to mutation during ordinary cell division. Normally, genetic chimerism is not visible on casual inspection; however, it has been detected in the course of proving parentage.[2]
Another way that chimerism can occur in animals is by organ transplantation, giving one individual tissues that developed from two genomes. For example, a bone marrow transplant can change someone’s blood type.
[1]Norton, Aaron; Ozzie Zehner (2008). “Which Half Is Mommy?: Tetragametic Chimerism and Trans-Subjectivity”. Women’s Studies Quarterly. Fall/Winter: 106–127.
[2] Friedman, Lauren. “The Stranger-Than-Fiction Story Of A Woman Who Was Her Own Twin”. Retrieved 4 August 2014.

Method

We mined a series of Next Generation Sequencing (NGS) reads.
We established a bioinformatics pipeline based on a subsection alignment strategy to discover all the chimeras in the reads and visualize their structure.
Then we defined two statistical indexes (the chimeric distance and the overlap length), whose regular abundance distributions helped illustrate the structural characteristics of the chimeras.
Finally, we analyzed the relationship between chimera type and average insert size, which suggests a way to decrease the proportion of wasted data during DNA library construction.
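
To make the two indexes concrete, here is a minimal sketch that takes two adjacent aligned parts of one chimeric read and computes an overlap on the read and a distance on the reference. The `Segment` layout, the half-open coordinates and both formulas are assumptions made for these notes (same chromosome, same strand); the paper's exact definitions may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One aligned part of a chimeric read (hypothetical layout, half-open coordinates)."""
    read_start: int
    read_end: int
    ref_start: int
    ref_end: int


def overlap_length(first: Segment, second: Segment) -> int:
    """Number of read bases claimed by both parts (0 if they do not overlap)."""
    shared = min(first.read_end, second.read_end) - max(first.read_start, second.read_start)
    return max(0, shared)


def chimeric_distance(first: Segment, second: Segment) -> int:
    """Assumed definition: gap on the reference between the two parts (0 if they touch)."""
    gap = max(first.ref_start, second.ref_start) - min(first.ref_end, second.ref_end)
    return max(0, gap)


# Illustrative values only, not taken from the paper's data.
a = Segment(read_start=0, read_end=60, ref_start=1000, ref_end=1060)
b = Segment(read_start=54, read_end=100, ref_start=1330, ref_end=1376)
print(overlap_length(a, b))     # 6
print(chimeric_distance(a, b))  # 270
```

Both functions are symmetric in their arguments, so the order in which the two parts are passed does not matter.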

Results/Conclusion

131.4 Gb of paired-end (PE) sequence data was reanalyzed for chimeras. In total, 40,259,438 read pairs (6.19%) with chimerism were discovered among 650,430,811 read pairs.
The chimeric sequences consist of two or more parts that are located non-contiguously but close together on the chromosome. The chimeric distance between the locations of adjacent parts on the chromosome followed an approximately bimodal distribution ranging from 0 to over 5,000 nt, with a peak at about 250 to 300 nt.
The overlap length of adjacent parts followed an approximately Poisson distribution with a peak at 6 nt.
Moreover, a linear correlation analysis showed that unmapped chimeras, which were classified as wasted data, could be reduced by appropriately increasing the insert size.
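
The headline numbers can be cross-checked with a few lines of arithmetic. This is only a sanity check on the figures quoted above, taking 1 Gb as 1e9 bases (an assumption about the unit).

```python
chimeric_pairs = 40_259_438
total_pairs = 650_430_811
total_bases = 131.4e9  # 131.4 Gb, assuming 1 Gb = 1e9 bases

chimeric_pct = 100 * chimeric_pairs / total_pairs
bases_per_pair = total_bases / total_pairs

print(f"chimeric read pairs: {chimeric_pct:.2f}%")   # 6.19%
print(f"bases per read pair: {bases_per_pair:.0f}")  # 202
```

About 202 bases per pair would correspond to roughly 101 bases per read if both mates have the same length; that read length is an inference from the totals, not something stated in the abstract.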

Significance

This study profiled phi29 MDA chimeras using tens of millions of chimeric sequences, and helps in understanding the amplification mechanism of the phi29 DNA polymerase.
Our work also illustrates the importance of NGS data reanalysis, not only for improving data utilization efficiency but also for uncovering more potential genomic information.

2. Downloading the raw data from the NCBI SRA database
SRX247249: Whole genome haplotyping, CEPH female, Corriell NA12878 (Kaper et al)
18 ILLUMINA (Illumina HiSeq 2000) runs: 1G spots, 207.9G bases, 123.6Gb downloads
https://www.ncbi.nlm.nih.gov/sra/?term=SRX247249

SRX252522: Whole genome haplotyping, NA18506
7 ILLUMINA (Illumina HiSeq 2000) runs: 715.1M spots, 144.5G bases, 83.2Gb downloads
https://www.ncbi.nlm.nih.gov/sra/?term=SRX252522

3. Research analysis workflow
Raw data -> filtering -> QC -> clean data -> alignment -> variant calling -> SNV, INDEL, SV, CNV -> annotation -> statistics/visualization
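
As an illustration of the filtering step at the start of this workflow, the sketch below drops FASTQ records whose mean base quality is too low. It assumes Phred+33 quality encoding and a threshold of 20, both chosen here for illustration; real projects normally use dedicated QC and trimming tools rather than hand-written code.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of one quality string (Phred+33 encoding assumed)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def filter_fastq(lines, min_mean_quality: float = 20.0):
    """Yield 4-line FASTQ records whose mean base quality passes the threshold."""
    records = zip(*[iter(lines)] * 4)  # group the lines four at a time
    for header, seq, plus, qual in records:
        if qual and mean_phred(qual) >= min_mean_quality:
            yield header, seq, plus, qual


# Two toy records: 'I' encodes Phred 40, '#' encodes Phred 2.
raw = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "ACGT", "+", "####",
]
clean = list(filter_fastq(raw))
print([record[0] for record in clean])  # ['@read1']
```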

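Whole-genome sequencing at 30X depth (30X means that, on average, each of the roughly 3 billion bases of the genome is sequenced 30 times), which comes to about 90 Gb of raw data. The sequencing strategy is PE150 on an Illumina HiSeq X, with a small-fragment DNA library (350 bp).
Note:
Sequencing depth is the ratio of the total number of sequenced bases (bp) to the genome size; it is one of the metrics used to evaluate sequencing output.
Sequencing depth and genome coverage are positively correlated, and the error rate or false-positive results introduced by sequencing decrease as depth increases.
For a sequenced individual, when a paired-end or mate-pair scheme is used and the depth reaches 50X to 100X or more, both genome coverage and control of the sequencing error rate are secured, which makes the subsequent assembly of sequences into chromosomes easier and more accurate.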
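
The depth definition above is a single division. The sketch below uses a genome size of 3e9 bases, the rounded figure from these notes; the function names are made up for illustration.

```python
GENOME_SIZE = 3e9  # approximate human genome size in bases, as used in these notes


def sequencing_depth(total_bases: float, genome_size: float = GENOME_SIZE) -> float:
    """Depth = total sequenced bases / genome size."""
    return total_bases / genome_size


def bases_needed(target_depth: float, genome_size: float = GENOME_SIZE) -> float:
    """Total bases required to reach a target depth."""
    return target_depth * genome_size


print(sequencing_depth(90e9))  # 30.0
print(bases_needed(30) / 1e9)  # 90.0 (Gb)
```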

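Sequencing coverage: the fraction of the genome covered by sequenced bases.
Coverage is one of the indicators of how random the sequencing is; the relationship between sequencing depth and coverage can be determined with the Lander-Waterman model (1988). At a depth of 5X, about 99.3% of the genome is expected to be covered (1 - e^(-5) ≈ 0.9933).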
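
In its simplest form, the Lander-Waterman model gives the expected covered fraction as 1 - e^(-c) for depth c. The sketch below evaluates that idealized formula, which assumes reads land uniformly at random and ignores read-length edge effects, so real coverage is usually somewhat lower.

```python
import math


def expected_coverage(depth: float) -> float:
    """Expected fraction of the genome covered at a given depth (idealized Poisson model)."""
    return 1.0 - math.exp(-depth)


for depth in (1, 5, 10, 30):
    print(f"{depth:>2}X -> {100 * expected_coverage(depth):.4f}%")
```

At 5X the formula gives about 99.33%, and by 30X the uncovered fraction is negligible under this model.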

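High-throughput sequencing has three read modes (in practice mostly two are used now; single-end has become rare, although 454 and the older Illumina GA were single-end):
single-end: only one end of a fragment is sequenced;
paired-end: both ends of a fragment are sequenced;
mate-pair: the fragment is circularized, the biotin-labeled circularization junction is enriched, and the sequence across the junction is read.
PE30 and PE100 are the second mode: however long the fragment is, only 30 bp (2 x 30 bp) or 100 bp (2 x 100 bp) is read from each end, i.e. 60 bp or 200 bp per fragment. By the same logic, PE150 reads only 150 bp from each end (2 x 150 bp, 300 bp per fragment).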
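
The paired-end arithmetic can be written out as a small sketch. The function names are hypothetical, and the 90 Gb target is the 30X figure used earlier in these notes.

```python
import math


def bases_per_fragment(read_length: int) -> int:
    """Paired-end sequencing reads `read_length` bases from each end of a fragment."""
    return 2 * read_length


def read_pairs_needed(target_bases: float, read_length: int) -> int:
    """Number of read pairs required to produce `target_bases` in total."""
    return math.ceil(target_bases / bases_per_fragment(read_length))


for read_length in (30, 100, 150):
    print(f"PE{read_length}: {bases_per_fragment(read_length)} bp per fragment")

print(read_pairs_needed(90e9, 150))  # 300000000
```

If the 350 bp library size refers to the insert length, PE150 reads 300 of those 350 bp and leaves about 50 bp in the middle of each fragment unsequenced.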

QC reference article:
Three-stage quality control strategies for DNA re-sequencing data.
The advance in NGS technologies has also created significant challenges in bioinformatics. One of the major challenges is quality control of the sequencing data. In this review, we discuss the proper quality control procedures and parameters for Illumina technology-based human DNA re-sequencing at three different stages of sequencing: raw data, alignment and variant calling.

Reference genome
De novo assembly and phasing of a Korean human genome
Here we report the de novo assembly and haplotype phasing of the Korean individual AK1 (ref. 1) using single-molecule real-time sequencing, next-generation mapping, microfluidics-based linked reads, and bacterial artificial chromosome (BAC) sequencing approaches.
Single-molecule sequencing coupled with next-generation mapping generated a highly contiguous assembly, with a contig N50 size of 17.9 Mb and a scaffold N50 size of 44.8 Mb, resolving 8 chromosomal arms into single scaffolds. The de novo assembly, along with local assemblies and spanning long reads, closes 105 and extends into 72 out of 190 euchromatic gaps in the reference genome, adding 1.03 Mb of previously intractable sequence. High concordance between the assembly and paired-end sequences from 62,758 BAC clones provides strong support for the robustness of the assembly.
We identify 18,210 structural variants by direct comparison of the assembly with the human reference, identifying thousands of breakpoints that, to our knowledge, have not been reported before.
Many of the insertions are reflected in the transcriptome and are shared across the Asian population. We performed haplotype phasing of the assembly with short reads, long reads and linked reads from whole-genome sequencing and with short reads from 31,719 BAC clones, thereby achieving phased blocks with an N50 size of 11.6 Mb. Haplotigs assembled from single-molecule real-time reads assigned to haplotypes on phased blocks covered 89% of genes. The haplotigs accurately characterized the hypervariable major histocompatibility complex region as well as demonstrating allele configuration in clinically relevant genes such as CYP2D6.
This work presents the most contiguous diploid human genome assembly so far, with extensive investigation of unreported and Asian-specific structural variants, and high-quality haplotyping of clinically relevant alleles for precision medicine.

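Databases expected to be used for variant calling
dbsnp147 (provided by NCBI, the most authoritative)
cgi69ExAC.vcf.gz (Exome Aggregation Consortium, provided by the Broad Institute)
Cosmic_v73.ann.vcf.gz (catalogue of cancer mutations)
finalTCGA.vcf.gz (cancer-related, from the TCGA project)
1000g-ph3v5.gff.gz (1000 Genomes Project)
ESP6500 (Variants from the Exome Sequencing Project (ESP))
Plus data from various national genome projects (SCLP, SSM, SSI, GONL, UK10K)
Three mainstream annotation tools: VEP, ANNOVAR, SnpEff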

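4. Clinical analysis workflow
Most disease assessments annotate variant sites against GWAS databases to estimate an individual's disease risk; medication advice is based on the PharmGKB website, and inherited-disease risk is annotated with the HGMD database.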
