The figures in this article come from a lecture video: Next-Generation Sequencing Data Analysis, Lecture 3, DNA-seq
Alignment strategies
Smith-Waterman (too slow to use)
Fast alignment
Hash table
Seed and extension
Mask(for mismatches)
Suffix tree/prefix tree
Suffix array
Burrows-Wheeler Transformation
Usually stored in a compressed manner and can be indexed
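As a minimal sketch of the last two items, the snippet below builds a suffix array naively and derives the Burrows-Wheeler transform from it. The function names and the `$` sentinel are assumptions made for illustration; real aligners use far more efficient construction and compressed index structures.

```python
def suffix_array(text):
    """Start positions of all suffixes of text, in sorted order (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform of text, built from the suffix array of text + sentinel."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    # For each sorted suffix, emit the character that precedes it in s
    # (index -1 wraps around for the suffix starting at position 0).
    return "".join(s[i - 1] for i in suffix_array(s))


def inverse_bwt(transformed, sentinel="$"):
    """Recover the original text by repeatedly prepending and sorting (naive, for illustration)."""
    n = len(transformed)
    table = [""] * n
    for _ in range(n):
        table = sorted(transformed[i] + table[i] for i in range(n))
    # The row that ends with the sentinel is the original text + sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


print(suffix_array("banana$"))
print(bwt("banana"))
print(inverse_bwt(bwt("GATTACA")))
```

The transform tends to group identical characters together, which is what makes compressed storage practical, and it stays reversible, so the reference sequence can still be searched.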
QUAL: Phred-scaled quality of the variant call
Higher QUAL value: lower probability that the call is a mistake
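The Phred scale can be made concrete with a small conversion sketch. It assumes the usual definition Q = -10 log10(p), where p is the probability that the call is wrong; the function names are illustrative only.

```python
import math


def phred_to_error_prob(qual):
    """Convert a Phred-scaled quality to the probability that the call is wrong."""
    return 10 ** (-qual / 10)


def error_prob_to_phred(prob):
    """Convert an error probability (0 < prob <= 1) to a Phred-scaled quality."""
    if not 0 < prob <= 1:
        raise ValueError("prob must be in (0, 1]")
    return -10 * math.log10(prob)


for q in (10, 20, 30):
    print(q, phred_to_error_prob(q))
```

Each step of 10 in QUAL corresponds to a tenfold drop in the error probability.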
Filter
PASS: this position passed all the filters defined in the file header
q10;s50: semicolon-separated list of the filters that are not met
INFO: additional information
Optional
18 predefined options
Examples:
DB: dbSNP membership
DP: combined depth across all the samples
NS: number of samples with data
AF: estimated allele frequency
SB: strand bias at this position
AA: ancestral allele
Genotype fields: individual samples
Examples:
GT: genotype (0 reference, 1 first alternative, 2 second alternative…)
GQ: conditional genotype quality
-10 log10[p(GT call is wrong | variant exists)]
DP: read depth at this position in this sample
HQ: haplotype qualities
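To show how the QUAL, FILTER, INFO, and genotype fields fit together, here is a rough sketch that parses one tab-separated VCF data line with the standard library only. The record, the sample names, and the function names are made up for illustration; a real pipeline would normally use a dedicated VCF library.

```python
def parse_info(info):
    """Split the INFO column into a dict; flag-style keys without '=' map to True."""
    parsed = {}
    if info in ("", "."):
        return parsed
    for entry in info.split(";"):
        key, has_value, value = entry.partition("=")
        parsed[key] = value if has_value else True
    return parsed


def parse_vcf_record(line, sample_names):
    """Parse one VCF data line into a dict (values inside INFO and samples stay strings)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]
    record = {
        "CHROM": chrom,
        "POS": int(pos),
        "ID": variant_id,
        "REF": ref,
        "ALT": alt.split(","),
        "QUAL": None if qual == "." else float(qual),
        # Empty list when FILTER is PASS or missing, else the failed filter codes.
        "failed_filters": [] if filt in ("PASS", ".") else filt.split(";"),
        "INFO": parse_info(info),
        "samples": {},
    }
    if len(fields) > 9:
        format_keys = fields[8].split(":")  # e.g. GT:GQ:DP:HQ
        for name, column in zip(sample_names, fields[9:]):
            record["samples"][name] = dict(zip(format_keys, column.split(":")))
    return record


# Hypothetical record and sample names, for illustration only.
example_line = "\t".join([
    "20", "14370", "rs0000001", "G", "A", "29", "PASS",
    "NS=2;DP=14;AF=0.5;DB", "GT:GQ:DP:HQ", "0|0:48:1:51,51", "1/1:43:5:.,.",
])
record = parse_vcf_record(example_line, ["sample1", "sample2"])
print(record["INFO"])
print(record["samples"]["sample2"]["GT"])
```

Flag entries such as DB carry no value, which is why they are stored as True here.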
Bring together genome data and additional annotation data for viewing in a single browser of the genome
Genome Browsers provide context
Organize data based on chromosomal locations
Search for or navigate to genomic areas of interest to select and view annotation track for the region
EBI (Ensembl) genome browser
NCBI (Map Viewer)
UCSC Genome Browser
http://genome.ucsc.edu
Use this Gateway to search by
Gene names, symbols, IDs
Chromosome number (chr7) or region (chr11:1038475-1075482)
Keywords: kinase, receptor…
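The chromosome and region search strings follow a simple pattern, sketched below with a regular expression. The pattern, the function names, and the assumption that both coordinates are 1-based and inclusive are illustrative choices, not the browser's own parsing code.

```python
import re

# Chromosome name, optionally followed by ":start-end" (commas in numbers allowed).
REGION_RE = re.compile(r"^(chr[0-9A-Za-z_]+)(?::([\d,]+)-([\d,]+))?$")


def parse_region(text):
    """Parse 'chr7' or 'chr11:1038475-1075482' into (chrom, start, end)."""
    match = REGION_RE.match(text.strip())
    if match is None:
        raise ValueError("not a recognised region: %r" % text)
    chrom, start, end = match.groups()
    if start is None:
        return chrom, None, None  # whole chromosome
    start_pos = int(start.replace(",", ""))
    end_pos = int(end.replace(",", ""))
    if start_pos > end_pos:
        raise ValueError("start is greater than end")
    return chrom, start_pos, end_pos


def region_length(start, end):
    """Length in bp, assuming 1-based coordinates with both ends included."""
    return end - start + 1


chrom, start, end = parse_region("chr11:1038475-1075482")
print(chrom, start, end, region_length(start, end))
```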
Viewing NGS data
Text files
Upload data/files to GENOME BROWSER sites
BED, GFF, GTF, WIG, MAF, BED detail, Personal Genome SNP, PSL
Binary files
Only portions of the files needed for display are transferred to UCSC
Makes it possible to display very large files
BAM, bigBed, bigWig, …
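For the text-file route, a small BED custom track can be written by hand or generated. The sketch below assumes the usual BED convention (tab-separated chrom, start, end, name, with a 0-based start and an exclusive end) and converts from 1-based inclusive coordinates; the track name and the function names are hypothetical.

```python
def region_to_bed_line(chrom, start_1based, end_1based, name):
    """Convert a 1-based inclusive region to one BED line (0-based start, exclusive end)."""
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("invalid coordinates")
    return "\t".join([chrom, str(start_1based - 1), str(end_1based), name])


def make_bed_track(regions, track_name="my_regions"):
    """Build the text of a small BED custom track from (chrom, start, end, name) tuples."""
    lines = ["track name=%s" % track_name]
    for chrom, start, end, name in regions:
        lines.append(region_to_bed_line(chrom, start, end, name))
    return "\n".join(lines) + "\n"


print(make_bed_track([("chr11", 1038475, 1075482, "region1")]))
```

The resulting text can be pasted or uploaded as a custom track; larger datasets are better served by the indexed binary formats listed above.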
Viewing options
Hide: removes a track from view
Dense: all items collapsed into a single line
Squish: each item on a separate line, but at 50% height and packed
Pack: each item separate, but efficiently stacked (full height)
Full: each item on separate line
Supports a wide variety of data including sequence alignments, microarrays and genomic annotations
Java-based
SNP (single nucleotide polymorphism)
1 in every few hundred bp
Mutation rate ≈ 10^-9
Short indels (insertions/deletions)
1 in every few kb
Mutation rate: variable
Microsatellite (STR) repeat number
1 in every few kb
2-6 bp repeat units
Mutation rate < 10^-3
Minisatellites
1 in every few kb
10-100 bp repeat units
Mutation rate < 10^-1
Repeated genes
rRNA, histones
Large structural variations
Insertions/deletions
Duplications
Inversions
Copy number variations
…
Types of SNP
Transition: A↔G or C↔T (purine to purine, or pyrimidine to pyrimidine)
Transversion: substitution between a purine and a pyrimidine
For the whole human genome, a ts/tv ratio of about 2.0-2.1 is typical; in exons it is about 2.8-3.0
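The ts/tv ratio is often used as a quick sanity check on a call set. A minimal sketch of the classification and the ratio follows; the function names and the toy SNP list are assumptions for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_snp(ref, alt):
    """Return 'transition' or 'transversion' for a single-base substitution."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError("expected two different bases from A, C, G, T")
    # Both bases in the same chemical class means a transition.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def ts_tv_ratio(snps):
    """Ratio of transitions to transversions over (ref, alt) pairs."""
    kinds = [classify_snp(ref, alt) for ref, alt in snps]
    transversions = kinds.count("transversion")
    if transversions == 0:
        raise ValueError("no transversions, ratio is undefined")
    return kinds.count("transition") / transversions


toy_snps = [("A", "G"), ("C", "T"), ("T", "C"), ("G", "A"), ("A", "C"), ("G", "T")]
print(ts_tv_ratio(toy_snps))
```

A ratio far below the values quoted above may suggest that a call set contains many false positives, since random errors do not favour transitions.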
SNPs and haplotype
Haplotypes are ‘blocks’ of associated SNPs
Structural variations
Traditionally defined as deletions, insertions, or inversions > 1 kb
Often involves repetitive regions of the genome and complex rearrangements
No optimal method for SV discovery (before NGS)
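The size-based definitions can be sketched as a simple classifier over REF/ALT allele strings. It uses the > 1 kb cutoff from the notes; the function name and category labels are hypothetical, and it cannot recognise inversions or other rearrangements, which need more than allele lengths.

```python
SV_MIN_LENGTH = 1000  # the "> 1 kb" cutoff used in the notes


def classify_variant(ref, alt):
    """Roughly classify a variant from its REF and ALT allele sequences."""
    if ref == alt:
        raise ValueError("REF and ALT are identical")
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    size = abs(len(ref) - len(alt))
    if size == 0:
        return "multi-base substitution"
    kind = "insertion" if len(alt) > len(ref) else "deletion"
    if size > SV_MIN_LENGTH:
        return "structural " + kind
    return "short " + kind


print(classify_variant("A", "G"))
print(classify_variant("A", "ATG"))
print(classify_variant("A" * 1502, "A"))
```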
Underlying hypothesis for GWAS
Common disease, common variants
Common variants present in more than 1-5% of the population contribute to common disease
GWAS generally do not capture rare variants
Successful GWAS stories
Significant associations reported through March 2010 (Manolio, N Engl J Med, 2010)
~800 SNPs, 545 studies, 150 diseases/traits
GWAS limitations: lack of functional information
Disease/trait-associated SNPs are not necessarily causative variants
Statistical power
Reduce false positives and improve reproducibility of results
Missing heritability
Median odds ratio per copy of the risk allele: 1.33
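To give a feel for how modest an odds ratio of 1.33 is, the sketch below computes an allelic odds ratio from a 2x2 table of allele counts in cases and controls. The counts are invented for illustration and the function name is an assumption.

```python
def allelic_odds_ratio(case_risk, case_other, control_risk, control_other):
    """Odds of carrying the risk allele in cases divided by the same odds in controls."""
    if min(case_risk, case_other, control_risk, control_other) <= 0:
        raise ValueError("all allele counts must be positive")
    return (case_risk * control_other) / (case_other * control_risk)


# Invented allele counts: 40% risk allele in cases, about 33% in controls.
odds_ratio = allelic_odds_ratio(
    case_risk=400, case_other=600, control_risk=500, control_other=1000
)
print(round(odds_ratio, 2))
```

An allele frequency difference of only a few percentage points between cases and controls is enough to produce an effect of this size, which is part of why large sample sizes are needed.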
NGS breakthrough in genetics of complex disease
Whole-genome sequencing following GWAS (Holm et al. Nat Genet 2011): sick sinus syndrome
Exome sequencing (Ng et al. Nat Genet 2011): Miller syndrome
Pooled sequencing (Calvo et al. Nat Genet 2011): human complex I disorder