1. Sequence names in the data to be analyzed should generally not contain spaces.
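One way to check this before indexing is to scan the FASTA header lines. The sketch below assumes a plain-text FASTA file; the helper names list_spaced_headers and strip_header_descriptions are made up for illustration and are not part of samtools, Picard or GATK.

```shell
#!/bin/sh
# Print every FASTA header line (starting with ">") that contains a space or tab.
# list_spaced_headers is a hypothetical helper name.
list_spaced_headers() {
    grep '^>' "$1" | grep '[[:space:]]'
}

# One possible fix: keep only the first whitespace-delimited word of each header.
# Sequence lines are passed through unchanged.
strip_header_descriptions() {
    awk '/^>/ { print $1; next } { print }' "$1"
}

# Usage (file names are placeholders):
#   list_spaced_headers ref.fasta
#   strip_header_descriptions ref.fasta > ref.clean.fasta
```

Truncating the header at the first space is only one option; replacing spaces with underscores would keep the full description in the name.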
2. Prepare the reference file (must be in FASTA format) and align the reads
1) Index the reference; this generates a ref.fasta.fai file:
samtools faidx ref.fasta
2) Generate the .dict file:
java -jar picardtools/CreateSequenceDictionary.jar R=ref.fasta O=ref.dict
3) Build the bwa index and align:
bwa index -a bwtsw ref.fasta
bwa mem -t 16 -M ref.fasta read.fq mates.fq >sample.sam
Convert the alignment output to BAM format:
java -jar picardtools/SamFormatConverter.jar I=xx.sam O=xx.bam
or
samtools view -bS xx.sam -o xx.bam
3. Prepare the sample BAM file
1) Sort the aligned reads by coordinate order
2) Mark duplicates
3) Add read group information (this tool can also convert SAM to BAM and sort, so it can be merged with step 1)
4) Index the BAM file
java -jar picardtools/SortSam.jar INPUT=unsorted_reads.bam OUTPUT=sorted_reads.bam SORT_ORDER=coordinate
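(INPUT may be a SAM file and OUTPUT a BAM file, which makes the separate format conversion above unnecessary.)
java -jar picardtools/MarkDuplicates.jar INPUT=sorted_reads.bam OUTPUT=dedup_reads.bam METRICS_FILE=sample01.dedup.metrics MAX_FILE_HANDLES=1000
Note: MAX_FILE_HANDLES takes an integer; its upper bound is the per-process limit reported by "ulimit -n".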
During the sequencing process, the same DNA molecules can be sequenced several times. The resulting duplicate reads are not informative and should not be counted as additional evidence for or against a putative variant. The duplicate marking process (sometimes called dedupping in bioinformatics slang) identifies these reads as such so that the GATK tools know to ignore them.
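The MAX_FILE_HANDLES value can be derived from the current shell limit instead of being hard-coded. This is a minimal sketch: suggest_max_file_handles is a hypothetical helper, and both the safety margin of 50 and the fallback of 8000 for an "unlimited" setting are assumptions chosen for illustration.

```shell
#!/bin/sh
# Suggest a MAX_FILE_HANDLES value a little below the per-process limit.
# The margin (default 50) and the 8000 fallback are illustrative assumptions.
suggest_max_file_handles() {
    limit=$1
    margin=${2:-50}
    # "unlimited" (or anything non-numeric) cannot be used in arithmetic.
    case "$limit" in
        ''|*[!0-9]*) echo 8000; return ;;
    esac
    if [ "$limit" -gt "$margin" ]; then
        echo $((limit - margin))
    else
        echo "$limit"
    fi
}

handles=$(suggest_max_file_handles "$(ulimit -n)")
echo "MAX_FILE_HANDLES=$handles"
```

The printed value can then be pasted into the MarkDuplicates command line.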
java -jar picardtools/AddOrReplaceReadGroups.jar I=dedup_reads.bam O=addrg_reads.bam ID=group1 LB=lib1 PL=illumina PU=unit1 SM=sample1
ID=String Read Group ID. Default value: 1. This option can be set to 'null' to clear the default value.
LB=String Read Group Library. Required.
PL=String Read Group platform (e.g. illumina, solid). Required.
PU=String Read Group platform unit (e.g. run barcode). Required.
SM=String Read Group sample name. Required.
java -jar picardtools/BuildBamIndex.jar I=addrg_reads.bam
or
samtools index addrg_reads.bam
For example:
1) Align with bwa
bwa index -a bwtsw ref.fasta
bwa mem -t 16 -M ref.fasta read.fq mates.fq >sample.sam
2) Convert SAM to BAM
samtools view -bS sample.sam -o sample.bam
or
java -jar picardtools/SamFormatConverter.jar I=sample.sam O=sample.bam
3) Sort
java -jar picardtools/SortSam.jar I=sample.bam O=sample.sorted.bam SORT_ORDER=coordinate
4) Mark duplicates
java -jar picardtools/MarkDuplicates.jar INPUT=sample.sorted.bam OUTPUT=sample.dedup.bam METRICS_FILE=sample.dedup.metrics MAX_FILE_HANDLES=1000
5) Add read groups
java -jar picardtools/AddOrReplaceReadGroups.jar I=sample.dedup.bam O=group.bam ID=group1 LB=lib1 PL=illumina PU=unit1 SM=sample1
6) Index the BAM file
java -jar ~/my_bin/picardtools1.94/BuildBamIndex.jar I=group.bam
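The steps above only work if each one reads the file written by the previous one. The sketch below prints the commands for one sample with the file names derived from the sample name, so the chain cannot drift. Nothing is executed; the PICARD directory, the FASTQ naming scheme and the function name print_prep_commands are assumptions for illustration. SAM-to-BAM conversion is folded into SortSam, as noted earlier.

```shell
#!/bin/sh
# Print the preprocessing commands for one sample, each step reading the
# previous step's output. Nothing is run; pipe the output to sh to execute it.
# PICARD and the file-naming scheme are assumptions for illustration.
PICARD=${PICARD:-picardtools}

print_prep_commands() {
    s=$1      # sample name, e.g. sample01
    ref=$2    # reference FASTA, already indexed with "bwa index"
    echo "bwa mem -t 16 -M $ref $s.read.fq $s.mates.fq > $s.sam"
    echo "java -jar $PICARD/SortSam.jar INPUT=$s.sam OUTPUT=$s.sorted.bam SORT_ORDER=coordinate"
    echo "java -jar $PICARD/MarkDuplicates.jar INPUT=$s.sorted.bam OUTPUT=$s.dedup.bam METRICS_FILE=$s.dedup.metrics MAX_FILE_HANDLES=1000"
    echo "java -jar $PICARD/AddOrReplaceReadGroups.jar I=$s.dedup.bam O=$s.group.bam ID=group1 LB=lib1 PL=illumina PU=unit1 SM=$s"
    echo "java -jar $PICARD/BuildBamIndex.jar I=$s.group.bam"
}

print_prep_commands sample01 ref.fasta
```

Printing first and running later makes it easy to review the file names before committing hours of compute.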
4. Usage parameters
--------------------------------------------------------------------------------
The Genome Analysis Toolkit (GATK) v2.6-4-g3e5ff60, Compiled 2013/06/24 14:48:56
Copyright (c) 2010 The Broad Institute
For support and documentation go to http://www.broadinstitute.org/gatk
All command line parameters accepted by all tools in the GATK
--analysis_type / -T ( required String )
Type of analysis to run.
-I,--input_file input file(s), SAM or BAM
-rbs,--read_buffer_size Number of reads per SAM file to buffer in memory
--BQSR / -BQSR ( File )
The input covariates table file, which enables on-the-fly base quality score recalibration (intended for use with BaseRecalibrator and PrintReads). The covariates tables are produced by the BaseQualityScoreRecalibrator tool. Please be aware that one should only run recalibration with the covariates file created on the same input bam(s).
-K,--gatk_key GATK Key file. Required if running with -et NO_ET. Please see -home-and-how-does-it-affect-me#latest for details.
--intervals / -L ( List[IntervalBinding[Feature]] )
One or more genomic intervals over which to operate. Can be explicitly specified on the command line or in a file (including a rod file). Using this option one can instruct the GATK engine to traverse over only part of the genome. This argument can be specified multiple times. One may use samtools-style intervals either explicitly (e.g. -L chr1 or -L chr1:100-200) or listed in a file (e.g. -L myFile.intervals). Additionally, one may specify a rod file to traverse over the positions for which there is a record in the file (e.g. -L file.vcf). To specify the completely unmapped reads in the BAM file (i.e. those without a reference contig) use -L unmapped.
-XL,--excludeIntervals One or more genomic intervals to exclude from processing. Can be explicitly specified on the command line or in a file (including a rod file)
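An intervals file for -L can be generated from the reference index rather than typed by hand. The sketch below assumes the standard .fai layout (contig name in column 1, length in column 2) and writes one samtools-style interval per contig; fai_to_intervals is a hypothetical helper name and the output file name is a placeholder.

```shell
#!/bin/sh
# Turn a samtools .fai index (contig name in column 1, length in column 2)
# into one samtools-style interval per contig, e.g. chr1:1-1000.
# fai_to_intervals is a hypothetical helper name.
fai_to_intervals() {
    awk -F '\t' '{ print $1 ":1-" $2 }' "$1"
}

# Usage (file names are placeholders):
#   fai_to_intervals ref.fasta.fai > whole_contigs.intervals
#   then pass "-L whole_contigs.intervals" to the GATK tool being run
```

Splitting this file by line gives a simple way to process contigs in separate jobs.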
--reference_sequence / -R ( File )
Reference sequence file.
--num_threads / -nt ( Integer with default value 1 )
How many data threads should be allocated to running this analysis. Each data thread contains N CPU threads and acts as completely data-parallel processing, increasing the memory usage of GATK by M data threads. Data threads generally scale extremely effectively, up to 24 cores.
......
5. Analysis workflow
1) Mapping and Duplicate Marking
2) Local Realignment
3) Base Quality Recalibration
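This step requires known SNP/indel sites as a reference. If no known sites are available, first call variants with both GATK and samtools and use the SNPs/indels on which the two agree as the known set. For details, see this blog post: http://blog.sina.com.cn/s/blog_6721167201018jik.html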
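Taking the sites on which two call sets agree can be approximated with a plain-text comparison of CHROM, POS, REF and ALT. This is a rough sketch, not the method from the linked post: concordant_sites is a hypothetical helper, it does not normalise indel representation or split multi-allelic records, and SelectVariants with --concordance (described further down) is the GATK-native route.

```shell
#!/bin/sh
# Print the records of the first VCF whose CHROM, POS, REF and ALT also occur
# in the second VCF. Header lines (starting with "#") of the first file are kept.
# concordant_sites is a hypothetical helper; it compares text only.
concordant_sites() {
    awk -F '\t' '
        # First pass (second VCF): remember every site key.
        !pass2 { if ($0 !~ /^#/) seen[$1 FS $2 FS $4 FS $5] = 1; next }
        # Second pass (first VCF): keep headers and matching records.
        /^#/   { print; next }
        ($1 FS $2 FS $4 FS $5) in seen { print }
    ' "$2" pass2=1 "$1"
}

# Usage (file names are placeholders):
#   concordant_sites gatk_raw.vcf samtools_raw.vcf > known_sites.vcf
```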
4) Data Compression with Reduce Reads
5) Calling Variants
=============================The following steps require additional external resources=================================
6) Variant Quality Score Recalibration
7) Genotype_refinement
8) Functional_annotation
9) Variant_analysis
===================================Key tool parameters======================================
1. SelectVariants parameters
Overview
Selects variants from a VCF source. Often, a VCF containing many samples and/or variants will need to be subset in order to facilitate certain analyses (e.g. comparing and contrasting cases vs. controls; extracting variant or non-variant loci that meet certain requirements, displaying just a few samples in a browser like IGV, etc.). SelectVariants can be used for this purpose. Given a single VCF file, one or more samples can be extracted from the file (based on a complete sample name or a pattern match). Variants can be further selected by specifying criteria for inclusion, i.e. "DP > 1000" (depth of coverage greater than 1000x), "AF < 0.25" (sites with allele frequency less than 0.25). These JEXL expressions are documented in the Using JEXL expressions section (http://www.broadinstitute.org/gatk/guide/article?id=1255). One can optionally include concordance or discordance tracks for use in selecting overlapping variants.
--concordance / -conc ( RodBinding[VariantContext] with default value none )
Output variants that were also called in this comparison track. A site is considered concordant if (1) we are not looking for specific samples and there is a variant called in both the variant and concordance tracks or (2) every sample present in the variant track is present in the concordance track and they have the same genotype call. --concordance binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3
--discordance / -disc ( RodBinding[VariantContext] with default value none )
Output variants that were not called in this comparison track. A site is considered discordant if there exists some sample in the variant track that has a non-reference genotype and either the site isn't present in this track, the sample isn't present in this track, or the sample is called reference in this track. --discordance binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3
--exclude_sample_file / -xl_sf ( Set[File] with default value [] )
File containing a list of samples (one per line) to exclude. Can be specified multiple times. Note that sample exclusion takes precedence over inclusion, so that if a sample is in both lists it will be excluded.
--exclude_sample_name / -xl_sn ( Set[String] with default value [] )
Exclude genotypes from this sample. Can be specified multiple times. Note that sample exclusion takes precedence over inclusion, so that if a sample is in both lists it will be excluded.
--excludeFiltered / -ef ( boolean with default value false )
Don't include filtered loci in the analysis.
--excludeNonVariants / -env ( boolean with default value false )
Don't include loci found to be non-variant after the subsetting procedure.
--keepIDs / -IDs ( File )
Only emit sites whose ID is found in this file (one ID per line). If provided, we will only include variants whose ID field is present in this list of ids. The matching is exact string matching. The file format is just one ID per line
--out / -o ( VariantContextWriter with default value stdout )
File to which variants should be written.
--sample_expressions / -se ( Set[String] )
Regular expression to select many samples from the ROD tracks provided. Can be specified multiple times.
--sample_file / -sf ( Set[File] )
File containing a list of samples (one per line) to include. Can be specified multiple times.
--sample_name / -sn ( Set[String] with default value [] )
Include genotypes from this sample. Can be specified multiple times.
--select_expressions / -select ( ArrayList[String] with default value [] )
One or more criteria to use when selecting the data. Note that these expressions are evaluated after the specified samples are extracted and the INFO field annotations are updated.
--selectTypeToInclude / -selectType ( List[Type] with default value [] )
Select only a certain type of variants from the input file. Valid types are INDEL, SNP, MIXED, MNP, SYMBOLIC, NO_VARIATION. Can be specified multiple times. This argument selects particular kinds of variants out of a list. If left empty, there is no type selection and all variant types are considered for other selection criteria. When specified one or more times, only variants of the given types are selected.
--variant / -V ( required RodBinding[VariantContext] )
Input VCF file. Variants from this VCF file are used by this tool as input. The file must at least contain the standard VCF header lines, but can be empty (i.e., no variants are contained in the file). --variant binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3
Examples
Select two samples out of a VCF with many samples:
java -Xmx2g -jar GenomeAnalysisTK.jar -R ref.fasta -T SelectVariants --variant input.vcf -o output.vcf -sn SAMPLE_A_PARC -sn SAMPLE_B_ACTG
Select two samples and any sample that matches a regular expression:
java -Xmx2g -jar GenomeAnalysisTK.jar -R ref.fasta -T SelectVariants --variant input.vcf -o output.vcf -sn SAMPLE_1_PARC -sn SAMPLE_1_ACTG -se 'SAMPLE.+PARC'
Select any sample that matches a regular expression and sites where the QD annotation is more than 10:
java -Xmx2g -jar GenomeAnalysisTK.jar -R ref.fasta -T SelectVariants --variant input.vcf -o output.vcf -se 'SAMPLE.+PARC' -select "QD > 10.0"
Select a sample and exclude non-variant loci and filtered loci:
java -Xmx2g -jar GenomeAnalysisTK.jar -R ref.fasta -T SelectVariants --variant input.vcf -o output.vcf -sn SAMPLE_1_ACTG -env -ef
Select a sample and restrict the output vcf to a set of intervals:
java -Xmx2g -jar GenomeAnalysisTK.jar -R ref.fasta -T SelectVariants --variant input.vcf -o output.vcf -L /path/to/my.interval_list -sn SAMPLE_1_ACTG
Select all calls missed in my vcf, but present in HapMap (useful to take a look at why these variants weren't called by this dataset):
java -Xmx2g -jar GenomeAnalysisTK.jar -R ref.fasta -T SelectVariants --variant hapmap.vcf --discordance myCalls.vcf -o output.vcf -sn mySample
Select all calls made by both myCalls and hisCalls (useful to take a look at what is consistent between the two callers):
java -Xmx2g -jar GenomeAnalysisTK.jar -R ref.fasta -T SelectVariants --variant myCalls.vcf --concordance hisCalls.vcf -o output.vcf -sn mySample
Generating a VCF of all the variants that are mendelian violations:
java -Xmx2g -jar GenomeAnalysisTK.jar -R ref.fasta -T SelectVariants --variant input.vcf -ped family.ped -mv -mvq 50 -o violations.vcf
Creating a set with 50% of the total number of variants in the variant VCF:
java -Xmx2g -jar GenomeAnalysisTK.jar -R ref.fasta -T SelectVariants --variant input.vcf -o output.vcf -fraction 0.5
Select only indels from a VCF:
java -Xmx2g -jar GenomeAnalysisTK.jar -R ref.fasta -T SelectVariants --variant input.vcf -o output.vcf -selectType INDEL
Select only multi-allelic SNPs and MNPs from a VCF (i.e. SNPs with more than one allele listed in the ALT column):
java -Xmx2g -jar GenomeAnalysisTK.jar -R ref.fasta -T SelectVariants --variant input.vcf -o output.vcf -selectType SNP -selectType MNP -restrictAllelesTo MULTIALLELIC
===============================JEXL expressions====================================
JEXL expressions (used mainly with VariantFiltration and SelectVariants)
JEXL stands for Java EXpression Language. It's not a part of the GATK as such; it's a software library that can be used by Java-based programs like the GATK. It can be used for many things, but in the context of the GATK, it has one very specific use: making it possible to operate on subsets of variants from VCF files based on one or more annotations, using a single command. This is typically done with walkers such as VariantFiltration and SelectVariants.
2. Basic structure of JEXL expressions for use with the GATK
In this context, a JEXL expression is a string (in the computing sense, i.e. a series of characters) that tells the GATK which annotations to look at and what selection rules to apply.
JEXL expressions contain three basic components: keys and values, connected by operators. For example, in this simple JEXL expression which selects variants whose quality score is greater than 30:
"QUAL > 30.0"
(1) QUAL is a key: the name of the annotation we want to look at
(2) 30.0 is a value: the threshold that we want to use to evaluate variant quality against
(3) > is an operator: it determines which "side" of the threshold we want to select
In short: the whole expression must be enclosed in double quotes, and a string value must additionally be enclosed in single quotes.
The complete expression must be framed by double quotes. Within this, keys are strings (typically written in uppercase or CamelCase), and values can be either strings, numbers or booleans (TRUE or FALSE) -- but if they are strings the values must be framed by single quotes, as in the following example:
(4) "MY_STRING_KEY == 'foo'"
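On the command line, the outer quotes are consumed by the shell and the tool receives the expression as a single argument with the inner quotes intact. The sketch below makes that visible; show_args is a hypothetical helper that just prints each argument in brackets.

```shell
#!/bin/sh
# Print each argument on its own line, wrapped in brackets, to show exactly
# what a program would receive. show_args is a hypothetical helper.
show_args() {
    for a in "$@"; do
        printf '[%s]\n' "$a"
    done
}

# Outer double quotes, inner single quotes around the string value.
show_args -select "MY_STRING_KEY == 'foo'"

# If the expression itself needs double quotes, single quotes can go outside.
show_args -select 'vc.getGenotype("NA12878").isHomRef()'
```

Without the outer quotes the shell would split the expression at each space and would treat characters such as > and && as redirection and control operators.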
3. Evaluation on multiple annotations
You can build expressions that calculate a metric based on two separate annotations, for example if you want to select variants for which quality (QUAL) divided by depth of coverage (DP) is below a certain threshold value:
(1) "QUAL / DP < 10.0"
(2) i.e. QUAL divided by DP is less than 10.0
You can also join multiple conditional statements with logical operators, for example if you want to select variants that have both sufficient quality (QUAL) and a certain depth of coverage (DP):
(3) "QUAL > 30.0 && DP == 10" (logical AND)
where && is the logical "AND".
Or if you want to select variants that have at least one of several conditions fulfilled:
(4) "QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0" (logical OR)
where || is the logical "OR".
4. Important caveats
Missing annotations
It is very important to note that the JEXL evaluation subprogram cannot correctly handle cases where the annotations requested by the JEXL expression are missing for some variants in a VCF record. It will throw an exception (i.e. fail with an error) when it encounters this scenario. The default behavior of the GATK is to handle this by having the entire expression evaluate to FALSE in such cases (although some tools provide options to change this behavior). This is extremely important especially when constructing complex expressions, because it affects how you should interpret the result.
Pay particular attention to missing annotations.
For example, looking again at that last expression:
For a VCF record with INFO fields QD=10.0; FS=300.0; ReadPosRankSum=-10.0:
(1) "QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0"
(2) the expression above evaluates to TRUE
But for a VCF record with QD=10.0; FS=300.0 and no ReadPosRankSum value:
(3) "QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0"
(4) the same expression evaluates to FALSE
For this reason, we highly recommend that complex expressions involving OR operations be split up into separate expressions whenever possible. For example, the previous example would have 3 distinct expressions: "QD < 2.0", "ReadPosRankSum < -20.0", and "FS > 200.0". This way, although the ReadPosRankSum expression evaluates to FALSE when the annotation is missing, the record can still get filtered (again using the example of VariantFiltration) when the FS value is greater than 200.0.
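The split can be written as one --filterExpression/--filterName pair per condition on a single VariantFiltration command line. The sketch below only prints such a command; the filter names (LowQD and so on), the file names and the function name print_filter_command are placeholders chosen for illustration.

```shell
#!/bin/sh
# Print a VariantFiltration command with one --filterExpression/--filterName
# pair per condition instead of a single OR-ed expression.
# Filter names and file names are placeholders.
print_filter_command() {
    in_vcf=$1
    out_vcf=$2
    printf '%s' "java -Xmx2g -jar GenomeAnalysisTK.jar -T VariantFiltration -R ref.fasta --variant $in_vcf -o $out_vcf"
    # printf repeats the format for every (expression, name) pair.
    printf ' --filterExpression "%s" --filterName %s' \
        'QD < 2.0' LowQD \
        'ReadPosRankSum < -20.0' LowReadPosRankSum \
        'FS > 200.0' HighFS
    printf '\n'
}

print_filter_command input.vcf filtered.vcf
```

A side benefit of separate expressions is that the FILTER column then records which condition a record failed.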
Sensitivity to case and type
(1) Case
Currently, VCF INFO field keys are case-sensitive. That means that if you have a QUAL field in uppercase in your VCF record, the system will not recognize it if you write it differently (Qual,qual or whatever) in your JEXL expression.
(2) Type
The types (i.e. string, integer, non-integer or boolean) used in your expression must be exactly the same as that of the value you are trying to evaluate. In other words, if you have a QUAL field with non-integer values (e.g. 45.3) and your filter expression is written as an integer (e.g. "QUAL < 50"), the system will throw a hissy fit (aka a Java exception).
In short: expressions are case-sensitive and also sensitive to the value type (string, integer, non-integer or boolean).
5. More complex JEXL magic
Note that this last part is fairly advanced and not for the faint of heart. To be frank, it's also explained rather more briefly than the topic deserves. But if there's enough demand for this level of usage we'll consider producing a full-length tutorial.
Accessing the underlying VariantContext directly
If you are familiar with the VariantContext, Genotype and its associated classes and methods, you can directly access the full range of capabilities of the underlying objects from the command line. The underlying VariantContext object is available through the vc variable.
For example, suppose I want to use SelectVariants to select all of the sites where sample NA12878 is homozygous-reference. This can be accomplished by assessing the underlying VariantContext as follows:
java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").isHomRef()'
Groovy, right? Now here's a more sophisticated example of JEXL expression that finds all novel variants in the total set with allele frequency > 0.25 but not 1, is not filtered, and is non-reference in 01-0263 sample:
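java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select '! vc.getGenotype("01-0263").isHomRef() && (vc.getID() == null || vc.getID().equals(".")) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP()' -o 01-0263.high_freq_novels.vcf -sn 01-0263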
Using the VariantContext to evaluate boolean values
The classic way of evaluating a boolean goes like this:
java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'DB'
But you can also use the VariantContext object like this:
java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.hasAttribute("DB")'
6. Using JEXL to evaluate arrays
Sometimes you might want to write a JEXL expression to evaluate e.g. the AD (allelic depth) field in the FORMAT column. However, the AD is technically not an integer; rather it is a list (array) of integers. One can evaluate the array data using the "." operator. Here's an example:
java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").getAD().0 > 10'
=================================================================================================
CombineVariants
CombineVariants combines VCF records from different sources. Any (unique) name can be used to bind your rod data and any number of sources can be input. This tool currently supports two different combination types for each of variants (the first 8 fields of the VCF) and genotypes (the rest).
Merge: combines multiple records into a single one; if sample names overlap then they are uniquified.
Union: assumes each rod represents the same set of samples (although this is not enforced); using the priority list (if provided), it emits a single record instance at every position represented in the rods.
CombineVariants will include a record at every site in all of your input VCF files, and annotate in the set attribute of the INFO field which input ROD bindings the record is present, pass, or filtered in. In effect, CombineVariants always produces a union of the input VCFs. However, any part of the Venn of the N merged VCFs can be extracted using JEXL expressions on the set attribute using SelectVariants. If you want to extract just the records in common between two VCFs, you would first run CombineVariants on the two files to generate a single VCF and then run SelectVariants to extract the common records with -select 'set == "Intersection"', as worked out in the detailed example in the documentation guide.
Note that CombineVariants supports multi-threaded parallelism (as of 8/15/12). This is particularly useful when converting from VCF to BCF2, which can be expensive. In this case each thread spends CPU time doing the conversion, and the GATK engine is smart enough to merge the partial BCF2 blocks together efficiently. However, since this merge runs in only one thread, you can quickly reach diminishing returns with the number of parallel threads. -nt 4 works well but -nt 8 may be too much.
Some fine details about the merging algorithm:
· As of GATK 2.1, when merging multiple VCF records at a site, the combined VCF record has the QUAL of the first VCF record with a non-MISSING QUAL value. The previous behavior was to take the max QUAL, which sometimes resulted in strange downstream confusion.
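The two-step intersection described above can be sketched as a pair of commands. The function below only prints them; the binding names a and b, the intermediate file combined.vcf and the function name print_intersection_commands are placeholders, and UNIQUIFY is used simply because it appears in this document's own CombineVariants example.

```shell
#!/bin/sh
# Print the two commands for extracting records common to two VCFs:
# first merge with CombineVariants, then keep set == "Intersection".
# Binding names and file names are placeholders.
print_intersection_commands() {
    vcf_a=$1
    vcf_b=$2
    out_vcf=$3
    echo "java -Xmx2g -jar GenomeAnalysisTK.jar -R ref.fasta -T CombineVariants --variant:a $vcf_a --variant:b $vcf_b -o combined.vcf -genotypeMergeOptions UNIQUIFY"
    echo "java -Xmx2g -jar GenomeAnalysisTK.jar -R ref.fasta -T SelectVariants --variant combined.vcf -select 'set == \"Intersection\"' -o $out_vcf"
}

print_intersection_commands myCalls.vcf hisCalls.vcf common.vcf
```

Other regions of the Venn diagram can be pulled out the same way by changing the value compared against the set attribute.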
Input
One or more variant sets to combine.
Output
A combined VCF.
Examples
java -Xmx2g -jar GenomeAnalysisTK.jar -R ref.fasta -T CombineVariants --variant input1.vcf --variant input2.vcf -o output.vcf -genotypeMergeOptions UNIQUIFY
java -Xmx2g -jar GenomeAnalysisTK.jar -R ref.fasta -T CombineVariants --variant:foo input1.vcf --variant:bar input2.vcf -o output.vcf -genotypeMergeOptions PRIORITIZE -priority foo,bar
Parameters
--assumeIdenticalSamples / -assumeIdenticalSamples ( boolean with default value false )
If true, assume input VCFs have identical sample sets and disjoint calls. This option allows the user to perform a simple merge (concatenation) to combine the VCFs, drastically reducing the runtime.
--filteredAreUncalled / -filteredAreUncalled ( boolean with default value false )
If true, then filtered VCFs are treated as uncalled, so that filtered set annotations don't appear in the combined VCF.
--filteredrecordsmergetype / -filteredRecordsMergeType ( FilteredRecordMergeType with default value KEEP_IF_ANY_UNFILTERED )
Determines how we should handle records seen at the same site in the VCF, but with different FILTER fields. The --filteredrecordsmergetype argument is an enumerated type (FilteredRecordMergeType), which can have one of the following values:
KEEP_IF_ANY_UNFILTERED
Union - leaves the record if any record is unfiltered.
KEEP_IF_ALL_UNFILTERED
Requires all records present at site to be unfiltered. VCF files that don't contain the record don't influence this.
KEEP_UNCONDITIONAL
If any record is present at this site (regardless of possibly being filtered), then all such records are kept and the filters are reset.
--genotypemergeoption / -genotypeMergeOptions ( GenotypeMergeType )
Determines how we should merge genotype records for samples shared across the ROD files. The --genotypemergeoption argument is an enumerated type (GenotypeMergeType), which can have one of the following values:
UNIQUIFY
Make all sample genotypes unique by file. Each sample shared across RODs gets named sample.ROD.
PRIORITIZE
Take genotypes in priority order (see the priority argument).
UNSORTED
Take the genotypes in any order.
REQUIRE_UNIQUE
Require that all samples/genotypes be unique between all inputs.
--mergeInfoWithMaxAC / -mergeInfoWithMaxAC ( boolean with default value false )
If true, when VCF records overlap the info field is taken from the one with the max AC instead of only taking the fields which are identical across the overlapping records.
--minimalVCF / -minimalVCF ( boolean with default value false )
If true, then the output VCF will contain no INFO or genotype FORMAT fields. Used to generate a sites-only file.
--minimumN / -minN ( int with default value 1 )
Combine variants and output a site only if the variant is present in at least N input files.
--out / -o ( VariantContextWriter with default value stdout )
File to which variants should be written.
--printComplexMerges / -printComplexMerges ( boolean with default value false )
Print out interesting sites requiring complex compatibility merging.
--rod_priority_list / -priority ( String )
A comma-separated string describing the priority ordering for the genotypes as far as which record gets emitted. Used when taking the union of variants that contain genotypes. A complete priority list MUST be provided.
--setKey / -setKey ( String with default value set )
Key used in the INFO key=value tag emitted describing which set the combined VCF record came from. Set to 'null' if you don't want the set field emitted.
--suppressCommandLineHeader / -suppressCommandLineHeader ( boolean with default value false )
If true, do not output the header containing the command line used. This option allows the suppression of the command line in the VCF header. This is most often useful when combining variants from dozens or hundreds of smaller VCFs.
--variant / -V ( required List[RodBinding[VariantContext]] )
Input VCF file. The VCF files to merge together. --variant can be given any number of times on the command line, and each -V argument will be included in the final merged output VCF. If no explicit name is provided, the -V arguments will be named using the default algorithm: variants, variants2, variants3, etc. The user can override this by providing an explicit name -V:name,vcf for each -V argument, and each named argument will be labeled as such in the output (i.e., set=name rather than set=variants2). The order of arguments does not matter except for the naming, so if you provide a rod priority list and no explicit names then variants, variants2, etc. are technically order dependent. It is strongly recommended to provide explicit names when a rod priority list is provided. --variant binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3
==============================================
VariantEval
General-purpose tool for variant evaluation (% in dbSNP, genotype concordance, Ti/Tv ratios, and a lot more)
Given a variant callset, it is common to calculate various quality control metrics. These metrics include the number of raw or filtered SNP counts; ratio of transition mutations to transversions; concordance of a particular sample's calls to a genotyping chip; number of singletons per sample; etc. Furthermore, it is often useful to stratify these metrics by various criteria like functional class (missense, nonsense, silent), whether the site is CpG site, the amino acid degeneracy of the site, etc. VariantEval facilitates these calculations in two ways: by providing several built-in evaluation and stratification modules, and by providing a framework that permits the easy development of new evaluation and stratification modules.
Input
One or more variant sets to evaluate plus any number of comparison sets.
Output
Evaluation tables detailing the results of the eval modules which were applied.
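As a reminder of what one of these metrics means, the transition/transversion ratio can be computed by hand from the REF and ALT columns. This is only a sanity-check sketch for biallelic SNPs, not how VariantEval computes or stratifies it; titv is a hypothetical helper name.

```shell
#!/bin/sh
# Count transitions (A<->G, C<->T) and transversions among biallelic SNPs
# in a VCF and print "Ti Tv ratio". Indels and multi-allelic sites are skipped.
# titv is a hypothetical helper name.
titv() {
    awk -F '\t' '
        /^#/ { next }
        $4 ~ /^[ACGT]$/ && $5 ~ /^[ACGT]$/ {
            pair = $4 $5
            if (pair == "AG" || pair == "GA" || pair == "CT" || pair == "TC") ti++
            else tv++
        }
        END {
            if (tv > 0) printf "%d %d %.2f\n", ti, tv, ti / tv
            else        printf "%d %d NA\n", ti, tv
        }
    ' "$1"
}

# Usage (file name is a placeholder):
#   titv output.vcf
```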
Parameters
--ancestralAlignments / -aa ( File )
Fasta file with ancestral alleles.
--comp / -comp ( List[RodBinding[VariantContext]] with default value [] )
Input comparison file(s). The variant file(s) to compare against. --comp binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3
--dbsnp / -D ( RodBinding[VariantContext] with default value none )
dbSNP file. dbSNP comparison VCF. By default, the dbSNP file is used to specify the set of "known" variants. Other sets can be specified with the -knownName (--known_names) argument. --dbsnp binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3
--doNotUseAllStandardModules / -noEV ( Boolean with default value false )
Do not use the standard modules by default (instead, only those that are specified with the -EV option).
--eval / -eval ( required List[RodBinding[VariantContext]] )
Input evaluation file(s). The variant file(s) to evaluate. --eval binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3
--evalModule / -EV ( String[] with default value [] )
One or more specific eval modules to apply to the eval track(s) (in addition to the standard modules, unless -noEV is specified). See the -list argument to view available modules.
--goldStandard / -gold ( RodBinding[VariantContext] with default value none )
Evaluations that count calls at sites of true variation (e.g., indel calls) will use this argument as their gold standard for comparison. Some analyses want to count overlap not with dbSNP (which is in general very open) but actually want to itemize their overlap specifically with a set of gold standard sites such as HapMap, OMNI, or the gold standard indels. This argument provides a mechanism for communicating which file to use --goldStandard binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3
--keepAC0 / -keepAC0 ( boolean with default value false )
If provided, modules that track polymorphic sites will not require that a site have AC > 0 when the input eval has genotypes.
--known_names / -knownName ( HashSet[String] with default value [] )
Name of ROD bindings containing variant sites that should be treated as known when splitting eval rods into known and novel subsets. List of rod tracks to be used for specifying "known" variants other than dbSNP.
--knownCNVs / -knownCNVs ( IntervalBinding[Feature] )
File containing tribble-readable features describing a known list of copy number variants (CNVs). For use with the VariantSummary table.
--list / -ls ( Boolean with default value false )
List the available eval modules and exit. Note that the --list argument requires a fully resolved and correct command-line to work.
--mergeEvals / -mergeEvals ( boolean with default value false )
If provided, all -eval tracks will be merged into a single eval track. If true, VariantEval will treat -eval 1 -eval 2 as separate tracks from the same underlying variant set, and evaluate the union of the results. Useful when you want to do -eval chr1.vcf -eval chr2.vcf etc.
--minPhaseQuality / -mpq ( double with default value 10.0 )
Minimum phasing quality.
--mendelianViolationQualThreshold / -mvq ( double with default value 50.0 )
Minimum genotype QUAL score for each trio member required to accept a site as a violation. Default is 50.
--doNotUseAllStandardStratifications / -noST ( Boolean with default value false )
Do not use the standard stratification modules by default (instead, only those that are specified with the -ST option).
--out / -o ( PrintStream with default value stdout )
An output file created by the walker. Will overwrite contents if file exists.
--requireStrictAlleleMatch / -strict ( boolean with default value false )
If provided only comp and eval tracks with exactly matching reference and alternate alleles will be counted as overlapping.
--sample / -sn ( Set[String] )
Derive eval and comp contexts using only these sample genotypes, when genotypes are available in the original context.
--samplePloidy / -ploidy ( int with default value 2 )
Per-sample ploidy (number of chromosomes per sample).
--select_exps / -select ( ArrayList[String] with default value [] )
One or more stratifications to use when evaluating the data.
--select_names / -selectName ( ArrayList[String] with default value [] )
Names to use for the list of stratifications (must be a 1-to-1 mapping).
--stratificationModule / -ST ( String[] with default value [] )
One or more specific stratification modules to apply to the eval track(s) (in addition to the standard stratifications, unless -noST is specified).
--stratIntervals / -stratIntervals ( IntervalBinding[Feature] )
File containing tribble-readable features for the IntervalStratification.
To be continued