lj695242104

VCF (Variant Call Format) version 4.1

Please note the specification is now being maintained in github at https://github.com/samtools/hts-specs

0. Example

VCF is a text file format (most likely stored in a compressed manner). It contains meta-information lines, a header line, and then data lines each containing information about a position in the genome.

There is an option whether to contain genotype information on samples for each position or not.

Example:

##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS     ID        REF    ALT     QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
20     14370   rs6054257 G      A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20     17330   .         T      A       3    q10    NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3   0/0:41:3
20     1110696 rs6040355 A      G,T     67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2   2/2:35:4
20     1230237 .         T      .       47   PASS   NS=3;DP=13;AA=T                   GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20     1234567 microsat1 GTC    G,GTCT  50   PASS   NS=3;DP=9;AA=G                    GT:GQ:DP    0/1:35:4       0/2:17:2       1/1:40:3

This example shows in order a good simple SNP, a possible SNP that has been filtered out because its quality is below 10, a site at which two alternate alleles are called, with one of them (T) being ancestral (possibly a reference sequencing error), a site that is calledmonomorphic reference (i.e. with no alternate alleles), and a microsatellite with two alternative alleles, one a deletion of 2 bases (TC), and the other an insertion of one base (T). Genotype data are given for three samples, two of which are phased and the third unphased, with per sample genotype quality, depth and haplotype qualities (the latter only for the phased samples) given as well as the genotypes. The microsatellite calls are unphased.

1. Meta-information lines

File meta-information is included after the ## string and must be key=value pairs.

A single 'fileformat' field is always required, must be the first line in the file, and details the VCF format version number. For example, forVCF version 4.1, this line should read:

##fileformat=VCFv4.1

It is strongly encouraged that information lines describing the INFO, FILTER and FORMAT entries used in the body of the VCF file be included in the meta-information section. Although they are optional, if these lines are present then they must be completely well-formed.

INFO fields should be described as follows (all keys are required):

##INFO=<ID=ID,Number=number,Type=type,Description=”description”>

Possible Types for INFO fields are: Integer, Float, Flag, Character, and String.

The Number entry is an Integer that describes the number of values that can be included with the INFO field. For example, if the INFO field contains a single number, then this value should be 1; if the INFO field describes a pair of numbers, then this value should be 2 and so on. If the field has one value per alternate allele then this value should be 'A'; if the field has one value for each possible genotype (more relevant to the FORMAT tags) then this value should be 'G'. If the number of possible values varies, is unknown, or is unbounded, then this value should be '.'. The 'Flag' type indicates that the INFO field does not contain a Value entry, and hence the Number should be 0 in this case. The Description value must be surrounded by double-quotes. Double-quote character can be escaped with backslash (\") and backslash as \\.

FILTERs that have been applied to the data should be described as follows:

##FILTER=<ID=ID,Description=”description”>

Likewise, Genotype fields specified in the FORMAT field should be described as follows:

##FORMAT=<ID=ID,Number=number,Type=type,Description=”description”>

Possible Types for FORMAT fields are: Integer, Float, Character, and String (this field is otherwise defined precisely as the INFO field).

Symbolic alternate alleles for imprecise structural variants:

##ALT=<ID=type,Description=description>

The ID field indicates the type of structural variant, and can be a colon-separated list of types and subtypes. ID values are case sensitive strings and may not contain whitespace or angle brackets. The first level type must be one of the following:

DEL Deletion relative to the reference
INS Insertion of novel sequence relative to the reference
DUP Region of elevated copy number relative to the reference
INV Inversion of reference sequence
CNV Copy number variable region (may be both deletion and duplication)

The CNV category should not be used when a more specific category can be applied. Reserved subtypes include:

DUP:TANDEM Tandem duplication
DEL:ME Deletion of mobile element relative to the reference
INS:ME Insertion of a mobile element relative to the reference

In addition, it is highly recommended (but not required) that the header include tags describing the reference and contigs backing the data contained in the file. These tags are based on the SQ field from the SAM spec; all tags are optional (see the VCF example above).

Breakpoint assemblies for structural variations may use an external file:

##assembly=url

The URL field specifies the location of a fasta file containing breakpoint assemblies referenced in the VCF records for structural variants via the BKPTID INFO key.

As with chromosomal sequences it is highly recommended (but not required) that the header include tags describing the contigs referred to in the VCF file. This furthermore allows these contigs to come from different files. The format is identical to that of a reference sequence, but with an additional URL tag to indicate where that sequence can be found. For example:.

##contig=<ID=ctg1,URL=ftp://somewhere.org/assembly.fa,...>

As explained below it is possible to define sample to genome mappings:

##SAMPLE=<ID=S_ID,Genomes=G1_ID;G2_ID; ...;GK_ID,Mixture=N1;N2; ...;NK,Description=S1;S2; ...; SK >

As well as derivation relationships between genomes using the following syntax:

##PEDIGREE=<Name_0=G0-ID,Name_1=G1-ID,...,Name_N=GN-ID>

or a link to a database:

##pedigreeDB=<url>

2. The header line syntax

The header line names the 8 fixed, mandatory columns. These columns are as follows:

#CHROM
POS
ID
REF
ALT
QUAL
FILTER
INFO

If genotype data is present in the file, these are followed by a FORMAT column header, then an arbitrary number of sample IDs. The header line is tab-delimited.

3. Data lines

Fixed fields

There are 8 fixed fields per record. All data lines are tab-delimited. In all cases, missing values are specified with a dot (”.”). Fixed fields are:

CHROM chromosome: an identifier from the reference genome or an angle-bracketed ID String ("<ID>") pointing to a contig in the assembly file (cf. the ##assembly line in the header). All entries for a specific CHROM should form a contiguous block within the VCFfile. The colon symbol (:) must be absent from all chromosome names to avoid parsing errors when dealing with breakends. (String, no white-space permitted, Required).
POS position: The reference position, with the 1st base having position 1. Positions are sorted numerically, in increasing order, within each reference sequence CHROM. It is permitted to have multiple records with the same POS. Telomeres are indicated by using positions 0 or N+1, where N is the length of the corresponding chromosome or contig. (Integer, Required)
ID semi-colon separated list of unique identifiers where available. If this is a dbSNP variant it is encouraged to use the rs number(s). No identifier should be present in more than one data record. If there is no identifier available, then the missing value should be used. (String, no white-space or semi-colons permitted)
REF reference base(s): Each base must be one of A,C,G,T,N (case insensitive). Multiple bases are permitted. The value in the POS field refers to the position of the first base in the String. For simple insertions and deletions in which either the REF or one of the ALT alleles would otherwise be null/empty, the REF and ALT Strings must include the base before the event (which must be reflected in the POS field), unless the event occurs at position 1 on the contig in which case it must include the base after the event; this padding base is not required (although it is permitted) for e.g. complex substitutions or other events where all alleles have at least one base represented in their Strings. If any of the ALT alleles is a symbolic allele (an angle-bracketed ID String "<ID>") then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism. Tools processing VCF files are not required to preserve case in the allele Strings. (String, Required).
ALT comma separated list of alternate non-reference alleles called on at least one of the samples. Options are base Strings made up of the bases A,C,G,T,N, (case insensitive) or an angle-bracketed ID String (”<ID>”) or a breakend replacement string as described in the section on breakends. If there are no alternative alleles, then the missing value should be used. Tools processing VCF files are not required to preserve case in the allele String, except for IDs, which are case sensitive. (String; no whitespace, commas, or angle-brackets are permitted in the ID String itself)
QUAL phred-scaled quality score for the assertion made in ALT. i.e. -10log_10 prob(call in ALT is wrong). If ALT is ”.” (no variant) then this is -10log_10 p(variant), and if ALT is not ”.” this is -10log_10 p(no variant). High QUAL scores indicate high confidence calls. Although traditionally people use integer phred scores, this field is permitted to be a floating point to enable higher resolution for low confidence calls if desired. If unknown, the missing value should be specified. (Numeric)
FILTER : PASS if this position has passed all filters, i.e. a call is made at this position. Otherwise, if the site has not passed all filters, a semicolon-separated list of codes for filters that fail. e.g. “q10;s50” might indicate that at this site the quality is below 10 and the number of samples with data is below 50% of the total number of samples. “0” is reserved and should not be used as a filter String. If filters have not been applied, then this field should be set to the missing value. (String, no white-space or semi-colons permitted)
INFO additional information: (String, no white-space, semi-colons, or equals-signs permitted; commas are permitted only as delimiters for lists of values) INFO fields are encoded as a semicolon-separated series of short keys with optional values in the format: <key>=<data>[,data]. Arbitrary keys are permitted, although the following sub-fields are reserved (albeit optional):
- AA : ancestral allele
- AC : allele count in genotypes, for each ALT allele, in the same order as listed
- AF : allele frequency for each ALT allele in the same order as listed: use this when estimated from primary data, not called genotypes
- AN : total number of alleles in called genotypes
- BQ : RMS base quality at this position
- CIGAR : cigar string describing how to align an alternate allele to the reference allele
- DB : dbSNP membership
- DP : combined depth across samples, e.g. DP=154
- END : end position of the variant described in this record (for use with symbolic alleles)
- H2 : membership in hapmap2
- H3 : membership in hapmap3
- MQ : RMS mapping quality, e.g. MQ=52
- MQ0 : Number of MAPQ == 0 reads covering this record
- NS : Number of samples with data
- SB : strand bias at this position
- SOMATIC : indicates that the record is a somatic mutation, for cancer genomics
- VALIDATED : validated by follow-up experiment
- 1000G : membership in 1000 Genomes

The exact format of each INFO sub-field should be specified in the meta-information (as described above).

Example for an INFO field: DP=154;MQ=52;H2. Keys without corresponding values are allowed in order to indicate group membership (e.g. H2 indicates the SNP is found in HapMap 2). It is not necessary to list all the properties that a site does NOT have, by e.g. H2=0.

See below for additional reserved INFO sub-fields used to encode structural variants.

Genotype fields

If genotype information is present, then the same types of data must be present for all samples. First a FORMAT field is given specifying the data types and order (colon-separated alphanumeric String). This is followed by one field per sample, with the colon-separated data in this field corresponding to the types specified in the format. The first sub-field must always be the genotype (GT) if it is present. There are no required sub-fields.

As with the INFO field, there are several common, reserved keywords that are standards across the community:

GT : genotype, encoded as allele values separated by either of ”/” or “|”. The allele values are 0 for the reference allele (what is in the REF field), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on. For diploid calls examples could be 0/1, 1|0, or 1/2, etc. For haploid calls, e.g. on Y, male non-pseudoautosomal X, or mitochondrion, only one allele value should be given; a triploid call might look like 0/0/1. If a call cannot be made for a sample at a given locus, ”.”should be specified for each missing allele in the GT field (for example "./." for a diploid genotype and "." for haploid genotype). The meanings of the separators are as follows (see the PS field below for more details on incorporating phasing information into the genotypes):
- / : genotype unphased
- | : genotype phased
DP : read depth at this position for this sample (Integer)
FT : sample genotype filter indicating if this genotype was “called” (similar in concept to the FILTER field). Again, use PASS to indicate that all filters have been passed, a semi-colon separated list of codes for filters that fail, or ”.” to indicate that filters have not been applied. These values should be described in the meta-information in the same way as FILTERs (String, no white-space or semi-colons permitted)
GL : genotype likelihoods comprised of comma separated floating point log10-scaled likelihoods for all possible genotypes given the set of alleles defined in the REF and ALT fields. In presence of the GT field the same ploidy is expected and the canonical order is used; without GT field, diploidy is assumed. If A is the allele in REF and B,C,... are the alleles as ordered in ALT, the ordering of genotypes for the likelihoods is given by: F(j/k) = (k*(k+1)/2)+j. In other words, for biallelic sites the ordering is: AA,AB,BB; for triallelicsites the ordering is: AA,AB,BB,AC,BC,CC, etc. For example: GT:GL 0/1:-323.03,-99.29,-802.53 (Floats)
GLE : genotype likelihoods of heterogeneous ploidy, used in presence of uncertain copy number. For example:GLE=0:-75.22,1:-223.42,0/0:-323.03,1/0:-99.29,1/1:-802.53 (String)
PL : the phred-scaled genotype likelihoods rounded to the closest integer (and otherwise defined precisely as the GL field) (Integers)
GP : the phred-scaled genotype posterior probabilities (and otherwise defined precisely as the GL field); intended to store imputed genotype probabilities (Floats)
GQ : conditional genotype quality, encoded as a phred quality -10log_10p(genotype call is wrong, conditioned on the site's being variant) (Integer)
HQ : haplotype qualities, two comma separated phred qualities (Integers)
PS : phase set. A phase set is defined as a set of phased genotypes to which this genotype belongs. Phased genotypes for an individual that are on the same chromosome and have the same PS value are in the same phased set. A phase set specifies multi-marker haplotypes for the phased genotypes in the set. All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set. If the genotype in the GT field is unphased, the corresponding PS field is ignored. The recommended convention is to use the position of the first variant in the set as the PS identifier (although this is not required). (Non-negative 32-bit Integer)
PQ : phasing quality, the phred-scaled probability that alleles are ordered incorrectly in a heterozygote (against all other members in the phase set). We note that we have not yet included the specific measure for precisely defining "phasing quality"; our intention for now is simply to reserve the PQ tag for future use as a measure of phasing quality. (Integer)
EC : comma separated list of expected alternate allele counts for each alternate allele in the same order as listed in the ALT field (typically used in association analyses) (Integers)
MQ : RMS mapping quality, similar to the version in the INFO field. (Integer)

If any of the fields is missing, it is replaced with the missing value. For example if the FORMAT is GT:GQ:DP:HQ then 0|0:.:23:23,34 indicates that GQ is missing. Trailing fields can be dropped (with the exception of the GT field, which should always be present if specified in the FORMAT field).

See below for additional genotype fields used to encode structural variants.

Additional Genotype fields can be defined in the meta-information. However, software support for such fields is not guaranteed.

4. Understanding the VCF format and the haplotype representation

VCF records use a single general system for representing genetic variation data composed of:

Allele: representing single genetic haplotypes (A, T, ATC).
Genotype: an assignment of alleles for each chromosome of a single named sample at a particular locus.
VCF record: a record holding all segregating alleles at a locus (as well as genotypes, if appropriate, for multiple individuals containing alleles at that locus).

VCF records use a simple haplotype representation for REF and ALT alleles to describe variant haplotypes at a locus. ALT haplotypes are constructed from the REF haplotype by taking the REF allele bases at the POS in the reference genotype and replacing them with the ALT bases. In essence, the VCF record specifies a-REF-t and the alternative haplotypes are a-ALT-t for each alternative allele.

5. INFO keys used for structural variants

When the INFO keys reserved for encoding structural variants are used for imprecise variants, the values should be best estimates. When a key reflects a property of a single alt allele (e.g. SVLEN), then when there are multiple alt alleles there will be multiple values for the key corresponding to each alelle (e.g. SVLEN=-100,-110 for a deletion with two distinct alt alleles).

The following INFO keys are reserved for encoding structural variants. In general, when these keys are used by imprecise variants, the values should be best estimates. When a key reflects a property of a single alt allele (e.g. SVLEN), then when there are multiple alt alleles there will be multiple values for the key corresponding to each alelle (e.g. SVLEN=-100,-110 for a deletion with two distinct alt alleles).

##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description=“Imprecise structural variation”>

##INFO=<ID=NOVEL,Number=0,Type=Flag,Description=“Indicates a novel structural variation”>

##INFO=<ID=END,Number=1,Type=Integer,Description=“End position of the variant described in this record”>

For precise variants, END is POS + length of REF allele - 1, and the for imprecise variants the corresponding best estimate.

##INFO=<ID=SVTYPE,Number=1,Type=String,Description=“Type of structural variant”>

Value should be one of DEL, INS, DUP, INV, CNV, BND. This key can be derived from the REF/ALT fields but is useful for filtering.

##INFO=<ID=SVLEN,Number=.,Type=Integer,Description=“Difference in length between REF and ALT alleles”>

One value for each ALT allele. Longer ALT alleles (e.g. insertions) have positive values, shorter ALT alleles (e.g. deletions) have negative values.

##INFO=<ID=CIPOS,Number=2,Type=Integer,Description=“Confidence interval around POS for imprecise variants”>

##INFO=<ID=CIEND,Number=2,Type=Integer,Description=“Confidence interval around END for imprecise variants”>

##INFO=<ID=HOMLEN,Number=.,Type=Integer,Description=“Length of base pair identical micro-homology at event breakpoints”>

##INFO=<ID=HOMSEQ,Number=.,Type=String,Description=“Sequence of base pair identical micro-homology at event breakpoints”>

##INFO=<ID=BKPTID,Number=.,Type=String,Description=“ID of the assembled alternate allele in the assembly file”>

For precise variants, the consensus sequence the alternate allele assembly is derivable from the REF and ALT fields. However, the alternate allele assembly file may contain additional information about the characteristics of the alt allele contigs.

##INFO=<ID=MEINFO,Number=4,Type=String,Description=“Mobile element info of the form NAME,START,END,POLARITY”>

##INFO=<ID=METRANS,Number=4,Type=String,Description=“Mobile element transduction info of the formCHR,START,END,POLARITY”>

##INFO=<ID=DGVID,Number=1,Type=String,Description=“ID of this element in Database of Genomic Variation”>

##INFO=<ID=DBVARID,Number=1,Type=String,Description=“ID of this element in DBVAR”>

##INFO=<ID=DBRIPID,Number=1,Type=String,Description=“ID of this element in DBRIP”>

##INFO=<ID=MATEID,Number=.,Type=String,Description="ID of mate breakends">

##INFO=<ID=PARID,Number=1,Type=String,Description="ID of partner breakend">

##INFO=<ID=EVENT,Number=1,Type=String,Description="ID of event associated to breakend">

##INFO=<ID=CILEN,Number=2,Type=Integer,Description="Confidence interval around the length of the inserted material betweenbreakends">

##INFO=<ID=DP,Number=1,Type=Integer,Description="Read Depth of segment containing breakend">

##INFO=<ID=DPADJ,Number=.,Type=Integer,Description="Read Depth of adjacency">

##INFO=<ID=CN,Number=1,Type=Integer,Description="Copy number of segment containing breakend">

##INFO=<ID=CNADJ,Number=.,Type=Integer,Description="Copy number of adjacency">

##INFO=<ID=CICN,Number=2,Type=Integer,Description="Confidence interval around copy number for the segment">

##INFO=<ID=CICNADJ,Number=.,Type=Integer,Description="Confidence interval around copy number for the adjacency">

6. FORMAT keys used for structural variants

##FORMAT=<ID=CN,Number=1,Type=Integer,Description=“Copy number genotype for imprecise events”>

##FORMAT=<ID=CNQ,Number=1,Type=Float,Description=“Copy number genotype quality for imprecise events”>

##FORMAT=<ID=CNL,Number=.,Type=Float,Description=“Copy number genotype likelihood for imprecise events”>

##FORMAT=<ID=NQ,Number=1,Type=Integer,Description=“Phred style probability score that the variant is novel with respect to the genome's ancestor”>

##FORMAT=<ID=HAP,Number=1,Type=Integer,Description=“Unique haplotype identifier”>

##FORMAT=<ID=AHAP,Number=1,Type=Integer,Description=“Unique identifier of ancestral haplotype”>

These keys are analogous to GT/GQ/GL and are provided for genotyping imprecise events by copy number (either because there is an unknown number of alternate alleles or because the haplotypes cannot be determined). CN specifies the integer copy number of the variant in this sample. CNQ is encoded as a phred quality -10log_10p(copy number genotype call is wrong). CNL specifies a list of log10likelihoods for each potential copy number, starting from zero. When possible, GT/GQ/GL should be used instead of (or in addition to) these keys.

7. How do I represent example variation in VCF records?

SNPs and Small Indels

For example, suppose we are looking at a locus in the genome:

Ref: a t C g a // C is the reference base
   : a t G g a // C base is a G in some individuals
   : a t - g a // C base is deleted w.r.t. the reference
   : a t CAg a // A base is inserted w.r.t. the reference sequence

In the above cases, what are the alleles and how would they be represented as a VCF record?

* First is a SNP polymorphism of C/G → { C , G } → C is the reference allele

20     3 .         C      G       .   PASS  DP=100

* Second, 1 base deletion of C → { tC , t } → tC is the reference allele

20     2 .         TC      T      .   PASS  DP=100

* Third, 1 base insertion of A → { tC ; tCA } → tC is the reference allele

20     2 .         TC      TCA    .   PASS  DP=100

Suppose I see a the following in a population of individuals and want to represent these three segregating alleles:

Ref: a t C g a // C is the reference base
   : a t G g a // C base is a G in some individuals
   : a t - g a // C base is deleted w.r.t. the

How do I represent this? There are three segregating alleles: { tC , tG , t } with a corresponding VCF record:

20     2 .         TC      TG,T    .   PASS  DP=100

Now suppose I have this more complex example:

Ref: a t C g a // C is the reference base
   : a t - g a
   : a t - - a
   : a t CAg a

There are actually four segregating alleles: { tCg , tg, t, and tCAg } over bases 2-4. This complex set of allele is represented in VCF as:

20     2 .         TCG      TG,T,TCAG    .   PASS  DP=100

Note that in VCF records, the molecular equivalence explicitly listed above in the per-base alignment is discarded, so the actual placement of equivalent g isn't retained.

For completeness, VCF records are dynamically typed, so whether a VCF record is a SNP, Indel, Mixed, or Reference site depends on the properties of the alleles in the record.

What do example VCF records indicate as variation from the reference?

SNP VCF record

Suppose I receive the following VCF record:

20     3 .         C      T    .   PASS  DP=100

This is a SNP since its only single base substitution and there are only two alleles so I have the two following segregating haplotypes:

Ref: a t C g a // C is the reference base
   : a t T g a // C base is a T in some individuals

Insertion VCF record

Suppose I receive the following VCF record:

20     3 .         C      CTAG    .   PASS  DP=100

This is a insertion since the reference base C is being replaced by C [the reference base] plus three insertion bases TAG. Again there are only two alleles so I have the two following segregating haplotypes:

Ref: a t C - - - g a // C is the reference base
   : a t C T A G g a // following the C base is an insertion of 3 bases

Deletion VCF record

Suppose I receive the following VCF record:

20     2 .         TCG      T    .   PASS  DP=100

This is a deletion of two reference bases since the reference allele TCG is being replaced by just the T [the reference base]. Again there are only two alleles so I have the two following segregating haplotypes:

Ref: a t C g a // C is the reference base
   : a t - - a // following the C base is a deletion of 2 bases

Mixed VCF record for a microsatellite

Suppose I receive the following VCF record:

20     4 .         GCG      G,GCGCG    .   PASS  DP=100

This is a mixed type record containing a 2 base insertion and a 2 base deletion. There are are three segregating alleles so I have the three following haplotypes:

Ref: a t c g c g - - a  // C is the reference base
   : a t c g - - - - a  // following the C base is a deletion of 2 bases
   : a t c g c g c g a  // following the C base is a insertion of 2 bases


Note that in all of these examples dashes have been added to make the haplotypes clearer but of course the equivalence among bases isn't provided by the VCF. Technically the following is an equivalent alignment:

Ref: a t c g - - c g a  // C is the reference base
   : a t c g - - - - a  // following the C base is a deletion of 2 bases
   : a t c g c g c g a  // following the C base is a insertion of 2 bases

Encoding Structural Variants in VCF

This section describes additional rules for encoding structural variation in VCF format.

The encoding of structural variants in VCF is guided by two principles:

a) When breakpoints / alleles of structural variants are precisely known, then the format should be completely compatible with the format used for smaller indels. b) When the position, length and/or base composition of the variant is not known, we want to encode as much useful information as possible about the variant.

For precisely known variants, the REF and ALT fields should contain the full sequences for the alleles, following the usual VCFconventions. For imprecise variants, the REF field may contain a single base and the ALT fields should contain symbolic alleles (e.g. <ID>), described in more detail below. Imprecise variants should also be marked by the presence of an IMPRECISE flag in the INFO field.

In both cases, the POS field should specify the 1-based coordinate of the base before the variant or the best estimate thereof. When the position is ambiguous due to identical reference sequence, the POS coordinate is based on the leftmost possible position of the variant.

Structural Variant Example

Examples of structural variants encoded in VCF:

##fileformat=VCFv4.1
##fileDate=20100501
##reference=1000GenomesPilot-NCBI36
##assembly=ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/sv/breakpoint_assemblies.fasta
##INFO=<ID=BKPTID,Number=.,Type=String,Description="ID of the assembled alternate allele in the assembly file">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=HOMLEN,Number=.,Type=Integer,Description="Length of base pair identical micro-homology at event breakpoints">
##INFO=<ID=HOMSEQ,Number=.,Type=String,Description="Sequence of base pair identical micro-homology at event breakpoints">
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
##INFO=<ID=MEINFO,Number=4,Type=String,Description="Mobile element info of the form NAME,START,END,POLARITY">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DEL:ME:ALU,Description="Deletion of ALU element">
##ALT=<ID=DEL:ME:L1,Description="Deletion of L1 element">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=DUP:TANDEM,Description="Tandem Duplication">
##ALT=<ID=INS,Description="Insertion of novel sequence">
##ALT=<ID=INS:ME:ALU,Description="Insertion of ALU element">
##ALT=<ID=INS:ME:L1,Description="Insertion of L1 element">
##ALT=<ID=INV,Description="Inversion">
##ALT=<ID=CNV,Description="Copy number variable region">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype quality">
##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Copy number genotype for imprecise events">
##FORMAT=<ID=CNQ,Number=1,Type=Float,Description="Copy number genotype quality for imprecise events">
#CHROM  POS ID REF ALT           QUAL  FILTER  INFO                                                                                          FORMAT        NA00001
1 2827694   rs2376870  CGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA  C . PASS  SVTYPE=DEL;END=2827762;BKPTID=Pindel_LCS_D1099159;HOMLEN=1;HOMSEQ=G;SVLEN=-68 GT:GQ 1/1:13.9
2 321682    .  T   <DEL>         6     PASS    IMPRECISE;SVTYPE=DEL;END=321887;SVLEN=-205;CIPOS=-56,20;CIEND=-10,62                          GT:GQ         0/1:12
2 14477084  .  C   <DEL:ME:ALU>  12    PASS    IMPRECISE;SVTYPE=DEL;END=14477381;SVLEN=-297;MEINFO=AluYa5,5,307,+;CIPOS=-22,18;CIEND=-12,32  GT:GQ         0/1:12
3 9425916   .  C   <INS:ME:L1>   23    PASS    IMPRECISE;SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22;MIINFO=L1HS,1,6025,-                 GT:GQ         1/1:15
3 12665100  .  A   <DUP>         14    PASS    IMPRECISE;SVTYPE=DUP;END=12686200;SVLEN=21100;CIPOS=-500,500;CIEND=-500,500                   GT:GQ:CN:CNQ  ./.:0:3:16.2
4 18665128  .  T   <DUP:TANDEM>  11    PASS    IMPRECISE;SVTYPE=DUP;END=18665204;SVLEN=76;CIPOS=-10,10;CIEND=-10,10                          GT:GQ:CN:CNQ  ./.:0:5:8.3

The example shows in order:

A precise deletion with known breakpoint, a one base micro-homology, and a sample that is homozygous for the deletion.
An imprecise deletion of approximately 105 bp.
An imprecise deletion of an ALU element relative to the reference.
An imprecise insertion of an L1 element relative to the reference.
An imprecise duplication of approximately 21Kb. The sample genotype is copy number 3 (one extra copy of the duplicated sequence).
An imprecise tandem duplication of 76bp. The sample genotype is copy number 5 (but the two haplotypes are not known).

Specifying Complex Rearrangements with Breakends

An arbitrary rearrangement event can be summarized as a set of novel adjacencies.

Each adjacency ties together 2 breakends. The two breakends at either end of a novel adjacency are called mates.

There is one line of VCF (i.e. one record) for each of the two breakends in a novel adjacency. A breakend record is identified with the tagSYTYPE=BND" in the INFO field. The REF field of a breakend record indicates a base or sequence s of bases beginning at position POS, as in all VCF records. The ALT field of a breakend record indicates a replacement for s. This "breakend replacement" has three parts:

the string t that replaces places s. The string t may be an extended version of s if some novel bases are inserted during the formation of the novel adjacency.
The position p of the mate breakend, indicated by a string of the form "chr:pos". This is the location of the first mapped base in the piece being joined at this novel adjacency.
The direction that the joined sequence continues in, starting from p. This is indicated by the orientation of square brackets surrounding p.

These 3 elements are combined in 4 possible ways to create the ALT. In each of the 4 cases, the assertion is that s is replaced with t, and then some piece starting at position p is joined to t. The cases are:

REF   ALT    Meaning
s     t[p[   piece extending to the right of p is joined after t
s     t]p]   reverse comp piece extending left of p is joined after t
s     ]p]t   piece extending to the left of p is joined before t
s     [p[t   reverse comp piece extending right of p is joined before t

The following example shows a 3-break operation involving 6 breakends. It exemplifies all possible orientations of breakends in adjacencies. Notice how the ALT field expresses the orientation of the breakends.

#CHROM POS    ID     REF ALT           QUAL FILT INFO
2      321681 bnd_W  G   G]17:198982]  6    PASS SVTYPE=BND
2      321682 bnd_V  T   ]13:123456]T  6    PASS SVTYPE=BND
13     123456 bnd_U  C   C[2:321682[   6    PASS SVTYPE=BND
13     123457 bnd_X  A   [17:198983[A  6    PASS SVTYPE=BND
17     198982 bnd_Y  A   A]2:321681]   6    PASS SVTYPE=BND
17     198983 bnd_Z  C   [13:123457[C  6    PASS SVTYPE=BND

On the same data, it is possible to explicitly state the pairing between mate breakends:

#CHROM POS    ID     REF ALT           QUAL FILT INFO
2      321681 bnd_W  G   G]17:198982]  6    PASS SVTYPE=BND;MATEID=bnd_Y
2      321682 bnd_V  T   ]13:123456]T  6    PASS SVTYPE=BND;MATEID=bnd_U
13     123456 bnd_U  C   C[2:321682[   6    PASS SVTYPE=BND;MATEID=bnd_V
13     123457 bnd_X  A   [17:198983[A  6    PASS SVTYPE=BND;MATEID=bnd_Z
17     198982 bnd_Y  A   A]2:321681]   6    PASS SVTYPE=BND;MATEID=bnd_W
17     198983 bnd_Z  C   [13:123457[C  6    PASS SVTYPE=BND;MATEID=bnd_X

Inserted Sequence

Sometimes, as shown below, some bases are inserted between the two breakends, this information is also carried in the ALT column:

#CHROM POS    ID     REF ALT                     QUAL FILT INFO
2      321682 bnd_V  T   ]13:123456]AGTNNNNNCAT     6    PASS SVTYPE=BND;MATEID=bnd_U
13     123456 bnd_U  C   CAGTNNNNNCA[2:321682[      6    PASS SVTYPE=BND;MATEID=bnd_V

Large Insertions

If the insertion is too long to be conveniently stored in the ALT column, as in the 329 base insertion below, it can be represented by acontig from the assembly file:

#CHROM POS    ID     REF ALT           QUAL FILT INFO
13     123456 bnd_U  C   C[<ctg1>:1[   6    PASS SVTYPE=BND
13     123457 bnd_V  A   ]<ctg1>:329]A 6    PASS SVTYPE=BND

Note: In the special case of the complete insertion of a sequence between two base pairs, it is recommended to use the shorthand notation described above:

#CHROM POS    ID     REF ALT           QUAL FILT INFO
13     123456 INS0   C   C<ctg1>       6    PASS SVTYPE=INS

If only a portion of <ctg1>, say from position 7 to position 214, is inserted, the VCF would be:

#CHROM POS    ID     REF ALT           QUAL FILT INFO
13     123456 bnd_U  C   C[<ctg1>:7[   6    PASS SVTYPE=BND
13     123457 bnd_V  A   ]<ctg1>:214]A 6    PASS SVTYPE=BND

If <ctg1> is circular and a segment from position 229 to position 45 is inserted, i.e. continuing from position 329 on to position 1, this is represented by adding a circular adjacency:

#CHROM POS    ID     REF ALT           QUAL FILT INFO
13     123456 bnd_U  C   C[<ctg1>:229[ 6    PASS SVTYPE=BND
13     123457 bnd_V  A   ]<ctg1>:45]A  6    PASS SVTYPE=BND
<ctg1> 1      bnd_X  A   ]<ctg1>:329]A 6    PASS SVTYPE=BND
<ctg1> 329    bnd_Y  T   T[<ctg1>:1[   6    PASS SVTYPE=BND

Multiple mates

If a breakend has multiple mates (either because of breakend reuse or of uncertainty in the measurement), these alternate adjacencies are treated as alternate alleles:

#CHROM POS    ID     REF ALT                        QUAL FILT INFO
2      321682 bnd_V  T   ]13:123456]T               6    PASS SVTYPE=BND;MATEID=bnd_U
13     123456 bnd_U  C   C[2:321682[,C[17:198983[   6    PASS SVTYPE=BND;MATEID=bnd_V,bnd_Z
17     198983 bnd_Z  A   ]13:123456]A               6    PASS SVTYPE=BND;MATEID=bnd_U

Explicit partners

Two breakends which are connected in the reference genome but disconnected in the variants are called partners. Each breakend only has one partner, typically one basepair left or right. However, it is not uncommon to observe loss of a few basepairs during the rearrangement. It is then possible to explicitly name a breakend's partner. E.g.:

#CHROM  POS        ID     REF  ALT            INFO 
2       321681     bnd_W  G    G[13:123460[   PARID=bnd_V;MATEID=bnd_X
2       321682     bnd_V  T    ]13:123456]T   PARID=bnd_W;MATEID=bnd_U
13      123456     bnd_U  C    C[2:321682[    PARID=bnd_X;MATEID=bnd_V
13      123460     bnd_X  A    ]2:321681]A    PARID=bnd_U;MATEID=bnd_W

Telomeres

For a rearrangement involving the telomere end of a reference chromosome, we define a virtual telomeric breakend that serves as abreakend partner for the breakend at the telomere. That way every breakend has a partner. If the chromosome extends from position 1 to N, then the virtual telomeric breakends are at positions 0 and N+1.

For example, to describe the reciprocal translocation of the entire chromosome 1 into chromosome 13, as illustrated here:

the records would look like:

#CHROM POS    ID     REF ALT              QUAL FILT INFO
1      0      bnd_X  N   .[13:123457[     6    PASS SVTYPE=BND;MATEID=bnd_V
1      1      bnd_Y  T   ]13:123456]T     6    PASS SVTYPE=BND;MATEID=bnd_U
13     123456 bnd_U  C   C[1:1[           6    PASS SVTYPE=BND;MATEID=bnd_Y
13     123457 bnd_V  A   ]1:0]A           6    PASS SVTYPE=BND;MATEID=bnd_X

Event Identifiers

As mentionned previously, a single rearrangement event can be described as a set of novel adjacencies. For example, a reciprocal rearrangement such as:

would be described as:

#CHROM POS    ID     REF ALT           QUAL FILT INFO
2      321681 bnd_W  G   G[13:123457[  6    PASS SVTYPE=BND;MATEID=bnd_X;EVENT=RR0
2      321682 bnd_V  T   ]13:123456]T  6    PASS SVTYPE=BND;MATEID=bnd_U;EVENT=RR0
13     123456 bnd_U  C   C[2:321682[   6    PASS SVTYPE=BND;MATEID=bnd_V;EVENT=RR0
13     123457 bnd_X  A   ]2:321681]A   6    PASS SVTYPE=BND;MATEID=bnd_W;EVENT=RR0

Inversion

Similarly an inversion such as:

can be described equivalently in two ways. Either one uses the short hand notation described previously (recommended for simple cases):

#CHROM POS    ID     REF ALT           QUAL FILT INFO
2      321682 INV0   T   <INV>	       6    PASS SVTYPE=INV;END=421681

or one describes the breakends:

#CHROM POS    ID     REF ALT           QUAL FILT INFO
2      321681 bnd_W  G   G]2:421681]   6    PASS SVTYPE=BND;MATEID=bnd_U;EVENT=INV0
2      321682 bnd_V  T   [2:421682[T   6    PASS SVTYPE=BND;MATEID=bnd_X;EVENT=INV0
2      421681 bnd_U  A   A]2:321681]   6    PASS SVTYPE=BND;MATEID=bnd_W;EVENT=INV0
2      421682 bnd_X  C   [2:321682[C   6    PASS SVTYPE=BND;MATEID=bnd_V;EVENT=INV0

Uncertainty around breakend location

It sometimes is difficult to determine the exact position of a break, generally because of homologies between the sequences being modified. The breakend is then placed arbitrarily at the left most position, and the uncertainty is represented with the CIPOS tag. The ALT string is then constructed assuming this arbitrary breakend choice.

The figure above represents a nonreciprocal translocation with microhomology. Even if we know that breakend U is rearranged withbreakend V, actually placing these breaks can be extremely difficult. The red and green dashed lines represent the most extreme possible recombination events which are allowed by the sequence evidence available. We therefore place both U and V arbitrarily within the interval of possibility:

CHROM POS    ID     REF ALT           QUAL FILT INFO
2      321681 bnd_V  T   T]13:123462]  6    PASS SVTYPE=BND;MATEID=bnd_U;CIPOS=0,6
13     123456 bnd_U  A   A]2:321687]   6    PASS SVTYPE=BND;MATEID=bnd_V;CIPOS=0,6

Note that the coordinate in breakend U's ALT string does not correspond to the designated position of breakend V, but to the position that V would take if U's position were fixed (and vice-versa). The CIPOS tags describe the uncertainty around the positions of U and V.

The fact that breakends U and V are mates is preserved thanks to the MATEID tags. If this were a reciprocal translocation, then there would be additional breakends X and Y, say with X the partner of V on Chr 2 and Y the partner of U on Chr 13, and there would be two more lines of VCF for the XY novel adjacency. Depending on which positions are chosen for the breakends X and Y, it might not be obvious that X is the partner of V and Y is the partner of U from their locations alone. This partner relation ship can be specified explicitly with the tag PARID=bnd_X in the VCF line for breakend V and PARID=bnd_Y in the VCF line for breakend U, and vice versa.

Single breakends

We allow for the definition of a breakend that is not part of a novel adjacency, also identified by the tag SVTYPE=BND. We call these singlebreakends, because they lack a mate. Breakends that are unobserved partners of breakends in observed novel adjacencies are one kind of single breakend. For example, if the true situation is known to be either as depicted in the figure below, and we only observe the adjacency (U,V), and no adjacencies for W, X, Y, or Z, then we cannot be sure whether we have a simple reciprocal translocation or a more complex 3-break operation. Yet we know the partner X of U and the partner W of V exist and are breakends. In this case we can specify these as single breakends, with unknown mates. The 4 lines of VCF representing this situation would be:

#CHROM POS    ID     REF ALT           QUAL FILT INFO
2      321681 bnd_W  G   G.            6    PASS SVTYPE=BND
2      321682 bnd_V  T   ]13:123456]T  6    PASS SVTYPE=BND;MATEID=bnd_U
13     123456 bnd_U  C   C[2:321682[   6    PASS SVTYPE=BND;MATEID=bnd_V	
13     123457 bnd_X  A   .A            6    PASS SVTYPE=BND

On the other hand, if we know a simple reciprocal translocation has occurred as in below, then even if we have no evidence for the (W,X) adjacency, for accounting purposes an adjacency between W and X may also be recorded in the VCF file. These two breakends W and X can still be crossed-referenced as mates. The 4 VCF records describing this situation would look exactly as below, but perhaps with a special quality or filter value for the breakends W and X.

Another possible reason for calling single breakends is an observed but unexplained change in copy number along a chromosome.

#CHROM POS   ID    REF ALT   QUAL FILT INFO
3      12665 bnd_X A   .A    6    PASS SVTYPE=BND;CIPOS=-50,50
3      12665 .     A   <DUP> 14   PASS IMPRECISE;SVTYPE=DUP;END=13686;CIPOS=-50,50;CIEND=-50,50
3      13686 bnd_Y T   T.    6    PASS SVTYPE=BND;CIPOS=-50,50

Finally, if an insertion is detected but only the first few base-pairs provided by overhanging reads could be assembled, then this inserted sequence can be provided on that line, in analogy to paired breakends:

#CHROM POS   ID    REF ALT   QUAL FILT INFO
3      12665 bnd_X A   .TGCA 6    PASS SVTYPE=BND;CIPOS=-50,50
3      12665 .     A   <DUP> 14   PASS IMPRECISE;SVTYPE=DUP;END=13686;CIPOS=-50,50;CIEND=-50,50
3      13686 bnd_Y T   TCC.  6    PASS SVTYPE=BND;CIPOS=-50,50

8. Sample Mixtures

It may be extremely difficult to obtain clinically perfect samples, with only one type of cell. Let's imagine that two samples are taken from a cancer patient: healthy blood, and some tumor tissue with an estimated 30% stromal contamination. This would then be expressed in the header as:

##SAMPLE=<ID=Blood,Genomes=Germline,Mixture=1.,Description="Patient germline genome">
##SAMPLE=<ID=TissueSample,Genomes=Germline;Tumor,Mixture=.3,.7,Description="Patient germline genome;Patient tumor genome">

Because of this distinction between sample and genome, it is possible to express the data along both distinctions. For example, in a first pass, a structural variant caller would simply report counts per sample. Using the example of the inversion just above, the VCF code could become:

#CHROM POS    ID     REF ALT           QUAL FILT INFO                                FORMAT     Blood  TissueSample
2      321681 bnd_W  G   G]2:421681]   6    PASS SVTYPE=BND;MATEID=bnd_U;EVENT=INV0  GT:DPADJ   0:32   0|1:9|21
2      321682 bnd_V  T   [2:421682[T   6    PASS SVTYPE=BND;MATEID=bnd_X;EVENT=INV0  GT:DPADJ   0:29   0|1:11|25
13     421681 bnd_U  A   A]2:321681]   6    PASS SVTYPE=BND;MATEID=bnd_W;EVENT=INV0  GT:DPADJ   0:34   0|1:10|23
13     421682 bnd_X  C   [2:321682[C   6    PASS SVTYPE=BND;MATEID=bnd_V;EVENT=INV0  GT:DPADJ   0:31   0|1:8|20

However, a more evolved algorithm could attempt actually deconvolving the two genomes and generating copy number estimates based on the raw data:

#CHROM POS    ID     REF ALT           QUAL FILT INFO                                FORMAT     Blood  Tumor
2      321681 bnd_W  G   G]2:421681]   6    PASS SVTYPE=BND;MATEID=bnd_U;EVENT=INV0  GT:CNADJ   0:1    1:1
2      321682 bnd_V  T   [2:421682[T   6    PASS SVTYPE=BND;MATEID=bnd_X;EVENT=INV0  GT:CNADJ   0:1    1:1
13     421681 bnd_U  A   A]2:321681]   6    PASS SVTYPE=BND;MATEID=bnd_W;EVENT=INV0  GT:CNADJ   0:1    1:1
13     421682 bnd_X  C   [2:321682[C   6    PASS SVTYPE=BND;MATEID=bnd_V;EVENT=INV0  GT:CNADJ   0:1    1:1

9. Clonal derivation relationships

In cancer, each VCF file represents several genomes from a patient, but one genome is special in that it represents the germlinegenome of the patient. This genome is contrasted to a second genome, the cancer tumor genome. In the simplest case the VCF file for a single patient contains only these two genomes. This is assumed in most of the discussion of the sections below.

In general there may be several tumor genomes from the same patient in the VCF file. Some of these may be secondary tumors derived from an original primary tumor. We suggest the derivation relationships between genomes in a cancer VCF file be represented in the header with PEDIGREE tags.

Analogously, there might also be several normal genomes from the same patient in the VCF (typically double normal studies with blood and solid tissue samples). These normal genomes are then considered to be derived from the original germline genome, which has to be inferred by parsimony.

The general format of a PEDIGREE line describing asexual, clonal derivation is:

PEDIGREE=<Derived=ID2,Original=ID1>

This line asserts that the DNA in genome is asexually or clonally derived with mutations from the DNA in genome . This is the asexual analog of the VCF format that has been proposed for family relationships between genomes, i.e. there is one entry per of the form:

PEDIGREE=<Child=CHILD-GENOME-ID,Mother=MOTHER-GENOME-ID,Father=FATHER-GENOME-ID>

Let's consider a cancer patient VCF file with 4 genomes: germline, primary_tumor, secondary_tumor1, and secondary_tumor2. The primary_tumor is derived from the germline and the secondary tumors are each derived independently from the primary tumor, in all cases by clonal derivation with mutations. The PEDIGREE lines would look like:

##PEDIGREE=<Derived=PRIMARY-TUMOR-GENOME-ID,Original=GERMLINE-GENOME-ID> 
##PEDIGREE=<Derived=SECONDARY1-TUMOR-GENOME-ID,Original=PRIMARY-TUMOR-GENOME-ID>
##PEDIGREE=<Derived=SECONDARY2-TUMOR-GENOME-ID,Original=PRIMARY-TUMOR-GENOME-ID>

Alternately, if data on the genomes is compiled in a database, a simple pointer can be provided:

##pedigreeDB=<url>

The most general form of a pedigree line is:

##PEDIGREE=<Name_0=G0-ID,Name_1=G1-ID,...,Name_N=GN-ID>

which means that the genome Name_0 is derived from the N >= 1 genomes Name_1, ..., Name_N. Based on these derivation relationships two new pieces of information can be specified.

Firstly, we wish to express the knowledge that a variant is novel to a genome, with respect to its parent genome. Ideally, this could be derived by simply comparing the features on either genomes. However, insufficient data or sample mixtures might prevent us from clearly determining at which stage a given variant appeared. This would be represented by a mutation quality score.

Secondly, we define a haplotype as a set of variants which are known to be on the same chromosome in the germline genome.Haplotype identifiers must be unique across the germline genome, and are conserved along clonal lineages, regardless of mutations, rearrangements, or recombination. In the case of the duplication of a region within a haplotype, one copy retains the original haplotypeidentifier, and the others are considered to be novel haplotypes with their own unique identifiers. All these novel haplotypes have in common their haplotype ancestor in the parent genome.

10. Phasing adjacencies in an aneuploid context

In a cancer genome, due to duplication followed by mutation, there can in principle exist any number of haplotypes in the sampled genome for a given location in the reference genome. We assume each haplotype that the user chooses to name is named with a numerical haplotype identifier. Although it is difficult with current technologies to associate haplotypes with novel adjacencies, it might be partially possible to deconvolve these connections in the near future. We therefore propose the following notation to allow haplotype-ambiguous as well as haplotype-unambiguous connections to be described. The general term for these haplotype-specific adjacencies is bundles.

The following diagram will be used to support examples below:

In this example, we know that in the sampled genome

a reference bundle connects breakend U, haplotype 5 on chr13 to its partner, breakend X, haplotype 5 on chr13,
a novel bundle connects breakend U, haplotype 1 on chr13 to its mate breakend V, haplotype 11 on chr2, and finally,
a novel bundle connects breakend U, haplotypes 2, 3 and 4 on chr13 to breakend V, haplotypes 12, 13 or 14 on chr2 without any explicit pairing.

These three are the bundles for breakend U. Each such bundle is referred to as a haplotype of the breakend U. Each allele of a breakendcorresponds to one or more haplotypes. In the above case there are two alleles: the 0 allele, corresponding to the adjacency to the partner X, which has haplotype (1), and the 1 allele, corresponding to the two haplotypes (2) and (3) with adjacency to the mate V.

For each haplotype of a breakend, say the haplotype (2) of breakend U above, connecting the end of haplotype 1 on a segment of Chr 13 to a mate on Chr 2 with haplotype 11, in addition to the list of haplotype-specific adjacencies that define it, we can also specify in VCFseveral other quantities. These include:

the depth of reads on the segment where the breakend occurs that support the haplotype, e.g. the depth of reads supportinghaplotype 1 in the segment containing breakend U,
the estimated copy number of the haplotype on the segment where the breakend occurs,
the depth of paired-end or split reads that support the haplotype-specific adjacencies, e.g. that support the adjacency betweenhaplotype 1 on Chr 13 to haplotype 11 on Chr 2
the estimated copy number of the haplotype-specific adjacencies, and
an overall quality score indicating how confident we are in this asserted haplotype.

These are specified using the using the DP, CN, BDP, BCN, and HQ subfields, respectively. The total information available about the three haplotypes of breakend U in the figure above may be visualized in a table as follows.

Allele 	                1	1		0
Haplotype 	        1>11	2,3,4>12,13,14	5>5
Segment Depth	        5	17		4
Segment Copy Number	1	3		1
Bundle Depth	        4	0		3
Bundle Copy Number	1	3		1
Haplotype qual.	        30	40		40

你可能感兴趣的:(格式,vcf,遗传数据)

机器学习与深度学习间关系与区别 ℒℴѵℯ心·动ꦿ໊ོ꫞ 人工智能学习深度学习 python
一、机器学习概述定义机器学习（MachineLearning,ML）是一种通过数据驱动的方法，利用统计学和计算算法来训练模型，使计算机能够从数据中学习并自动进行预测或决策。机器学习通过分析大量数据样本，识别其中的模式和规律，从而对新的数据进行判断。其核心在于通过训练过程，让模型不断优化和提升其预测准确性。主要类型1.监督学习（SupervisedLearning）监督学习是指在训练数据集中包含输入
【iOS】MVC设计模式 Magnetic_h ios mvc 设计模式 objective-c 学习 ui
MVC前言如何设计一个程序的结构，这是一门专门的学问，叫做"架构模式"（architecturalpattern），属于编程的方法论。MVC模式就是架构模式的一种。它是Apple官方推荐的App开发架构，也是一般开发者最先遇到、最经典的架构。MVC各层controller层Controller/ViewController/VC（控制器）负责协调Model和View，处理大部分逻辑它将数据从Mod
微服务下功能权限与数据权限的设计与实现 nbsaas-boot 微服务 java 架构
在微服务架构下，系统的功能权限和数据权限控制显得尤为重要。随着系统规模的扩大和微服务数量的增加，如何保证不同用户和服务之间的访问权限准确、细粒度地控制，成为设计安全策略的关键。本文将讨论如何在微服务体系中设计和实现功能权限与数据权限控制。1.功能权限与数据权限的定义功能权限：指用户或系统角色对特定功能的访问权限。通常是某个用户角色能否执行某个操作，比如查看订单、创建订单、修改用户资料等。数据权限：
c++ 的iostream 和 c++的stdio的区别和联系黄卷青灯77 c++算法开发语言 iostream stdio
在C++中，iostream和C语言的stdio.h都是用于处理输入输出的库，但它们在设计、用法和功能上有许多不同。以下是两者的区别和联系：区别1.编程风格iostream（C++风格）：C++标准库中的输入输出流类库，支持面向对象的输入输出操作。典型用法是cin（输入）和cout（输出），使用>操作符来处理数据。更加类型安全，支持用户自定义类型的输入输出。#includeintmain(){in
《投行人生》读书笔记小蘑菇的树洞
《投行人生》----作者詹姆斯-A-朗德摩根斯坦利副主席40年的职业洞见-很短小精悍的篇幅，比较适合初入职场的新人。第一部分成功的职业生涯需要规划1.情商归为适应能力分享与协作同理心适应能力，更多的是自我意识，你有能力识别自己的情并分辨这些情绪如何影响你的思想和行为。2.对于初入职场的人的建议，细节，截止日期和数据很重要截止日期，一种有效的方法是请老板为你所有的任务进行优先级排序。和老板喝咖啡的好
Long类型前后端数据不一致 igotyback 前端
响应给前端的数据浏览器控制台中response中看到的Long类型的数据是正常的到前端数据不一致前后端数据类型不匹配是一个常见问题，尤其是当后端使用Java的Long类型（64位）与前端JavaScript的Number类型（最大安全整数为2^53-1，即16位）进行数据交互时，很容易出现精度丢失的问题。这是因为JavaScript中的Number类型无法安全地表示超过16位的整数。为了解决这个问
LocalDateTime 转 String igotyback java 开发语言
importjava.time.LocalDateTime;importjava.time.format.DateTimeFormatter;publicclassMain{publicstaticvoidmain(String[]args){//获取当前时间LocalDateTimenow=LocalDateTime.now();//定义日期格式化器DateTimeFormatterformat
【一起学Rust | 设计模式】习惯语法——使用借用类型作为参数、格式化拼接字符串、构造函数广龙宇一起学Rust #Rust设计模式 rust 设计模式开发语言
提示：文章写完后，目录可以自动生成，如何生成可参考右边的帮助文档文章目录前言一、使用借用类型作为参数二、格式化拼接字符串三、使用构造函数总结前言Rust不是传统的面向对象编程语言，它的所有特性，使其独一无二。因此，学习特定于Rust的设计模式是必要的。本系列文章为作者学习《Rust设计模式》的学习笔记以及自己的见解。因此，本系列文章的结构也与此书的结构相同（后续可能会调成结构），基本上分为三个部分
Python数据分析与可视化实战指南 William数据分析 python python 数据
在数据驱动的时代，Python因其简洁的语法、强大的库生态系统以及活跃的社区，成为了数据分析与可视化的首选语言。本文将通过一个详细的案例，带领大家学习如何使用Python进行数据分析，并通过可视化来直观呈现分析结果。一、环境准备1.1安装必要库在开始数据分析和可视化之前，我们需要安装一些常用的库。主要包括pandas、numpy、matplotlib和seaborn等。这些库分别用于数据处理、数学
WPF中的ComboBox控件几种数据绑定的方式互联网打工人no1 wpf c#
一、用字典给ItemsSource赋值（此绑定用的地方很多，建议熟练掌握）在XMAL中：在CS文件中privatevoidBindData(){DictionarydicItem=newDictionary();dicItem.add(1,"北京");dicItem.add(2,"上海");dicItem.add(3,"广州");cmb_list.ItemsSource=dicItem;cmb_l
Pyecharts数据可视化大屏：打造沉浸式数据分析体验我的运维人生信息可视化数据分析数据挖掘运维开发技术共享
Pyecharts数据可视化大屏：打造沉浸式数据分析体验在当今这个数据驱动的时代，如何将海量数据以直观、生动的方式展现出来，成为了数据分析师和企业决策者关注的焦点。Pyecharts，作为一款基于Python的开源数据可视化库，凭借其丰富的图表类型、灵活的配置选项以及高度的定制化能力，成为了构建数据可视化大屏的理想选择。本文将深入探讨如何利用Pyecharts打造数据可视化大屏，并通过实际代码案例
Python教程：一文了解使用Python处理XPath 旦莫 Python进阶 python 开发语言
目录1.环境准备1.1安装lxml1.2验证安装2.XPath基础2.1什么是XPath？2.2XPath语法2.3示例XML文档3.使用lxml解析XML3.1解析XML文档3.2查看解析结果4.XPath查询4.1基本路径查询4.2使用属性查询4.3查询多个节点5.XPath的高级用法5.1使用逻辑运算符5.2使用函数6.实战案例6.1从网页抓取数据6.1.1安装Requests库6.1.2代
Google earth studio 简介陟彼高冈yu 旅游
GoogleEarthStudio是一个基于Web的动画工具，专为创作使用GoogleEarth数据的动画和视频而设计。它利用了GoogleEarth强大的三维地图和卫星影像数据库，使用户能够轻松地创建逼真的地球动画、航拍视频和动态地图可视化。网址为https://www.google.com/earth/studio/。GoogleEarthStudio是一个基于Web的动画工具，专为创作使用G
LLM 词汇表落难Coder LLMs NLP 大语言模型大模型 llama 人工智能
Contextwindow“上下文窗口”是指语言模型在生成新文本时能够回溯和参考的文本量。这不同于语言模型训练时所使用的大量数据集，而是代表了模型的“工作记忆”。较大的上下文窗口可以让模型理解和响应更复杂和更长的提示，而较小的上下文窗口可能会限制模型处理较长提示或在长时间对话中保持连贯性的能力。Fine-tuning微调是使用额外的数据进一步训练预训练语言模型的过程。这使得模型开始表示和模仿微调数
将cmd中命令输出保存为txt文本文件落难Coder Windows cmd window
最近深度学习本地的训练中我们常常要在命令行中运行自己的代码，无可厚非，我们有必要保存我们的炼丹结果，但是复制命令行输出到txt是非常麻烦的，其实Windows下的命令行为我们提供了相应的操作。其基本的调用格式就是：运行指令>输出到的文件名称或者具体保存路径测试下，我打开cmd并且ping一下百度：pingwww.baidu.com>./data.txt看下相同目录下data.txt的输出：如果你再
关于提高复杂业务逻辑代码可读性的思考编程经验分享开发经验 java 数据库开发语言
目录前言需求场景常规写法拆分方法领域对象总结前言实际工作中大部分时间都是在写业务逻辑，一般都是三层架构，表示层（Controller）接收客户端请求，并对入参做检验，业务逻辑层（Service）负责处理业务逻辑，一般开发都是在这一层中写具体的业务逻辑。数据访问层（Dao）是直接和数据库交互的，用于查数据给业务逻辑层，或者是将业务逻辑层处理后的数据写入数据库。简单的增删改查接口不用多说，基本上写好一
SQL Server_查询某一数据库中的所有表的内容 qq_42772833 SQL Server 数据库 sqlserver
1.查看所有表的表名要列出CrabFarmDB数据库中的所有表（名），可以使用以下SQL语句：USECrabFarmDB;--切换到目标数据库GOSELECTTABLE_NAMEFROMINFORMATION_SCHEMA.TABLESWHERETABLE_TYPE='BASETABLE';对这段SQL脚本的解释：SELECTTABLE_NAME：这个语句的作用是从查询结果中选择TABLE_NAM
使用LLaVa和Ollama实现多模态RAG示例 llzwxh888 python 人工智能开发语言
本文将详细介绍如何使用LLaVa和Ollama实现多模态RAG（检索增强生成），通过提取图像中的结构化数据、生成图像字幕等功能来展示这一技术的强大之处。安装环境首先，您需要安装以下依赖包：!pipinstallllama-index-multi-modal-llms-ollama!pipinstallllama-index-readers-file!pipinstallunstructured!p
python是什么意思中文-在python中%是什么意思编程大乐趣
Python中%有两种：1、数值运算：%代表取模，返回除法的余数。如：>>>7%212、%操作符（字符串格式化，stringformatting），说明如下：%[(name)][flags][width].[precision]typecode(name)为命名flags可以有+，-，''或0。+表示右对齐。-表示左对齐。''为一个空格，表示在正数的左侧填充一个空格，从而与负数对齐。0表示使用0填
使用Apify加载Twitter消息以进行微调的完整指南 nseejrukjhad twitter easyui 前端 python
#使用Apify加载Twitter消息以进行微调的完整指南##引言在自然语言处理领域，微调模型以适应特定任务是提升模型性能的常见方法。本文将介绍如何使用Apify从Twitter导出聊天信息，以便进一步进行微调。##主要内容###使用Apify导出推文首先，我们需要从Twitter导出推文。Apify可以帮助我们做到这一点。通过Apify的强大功能，我们可以批量抓取和导出数据，适用于各类应用场景。
深入理解 MultiQueryRetriever：提升向量数据库检索效果的强大工具 nseejrukjhad 数据库 python
深入理解MultiQueryRetriever：提升向量数据库检索效果的强大工具引言在人工智能和自然语言处理领域，高效准确的信息检索一直是一个关键挑战。传统的基于距离的向量数据库检索方法虽然广泛应用，但仍存在一些局限性。本文将介绍一种创新的解决方案：MultiQueryRetriever，它通过自动生成多个查询视角来增强检索效果，提高结果的相关性和多样性。MultiQueryRetriever的工
如何部分格式化提示模板:LangChain中的高级技巧 nseejrukjhad langchain java 服务器 python
标题:如何部分格式化提示模板:LangChain中的高级技巧内容:如何部分格式化提示模板:LangChain中的高级技巧引言在使用大型语言模型(LLM)时,提示工程是一个关键环节。LangChain提供了强大的提示模板功能,让我们能更灵活地构建和管理提示。本文将介绍LangChain中一个高级特性-部分格式化提示模板,这个技巧可以让你的提示管理更加高效和灵活。什么是部分格式化提示模板?部分格式化提
数组去重好奇的猫猫猫
整理自js中基础数据结构数组去重问题思考？如何去除数组中重复的项例如数组：[1,3,4,3,5]我们在做去重的时候，一开始想到的肯定是，逐个比较，外面一层循环，内层后一个与前一个一比较，如果是久不将当前这一项放进新的数组，挨个比较完之后返回一个新的去过重复的数组不好的实践方式上述方法效率极低，代码量还多，思考？有没有更好的方法这时候不禁一想当然有了！！！hashtable啊，通过对象的hash办法
Day1笔记-Python简介&标识符和关键字&输入输出 ~在杰难逃~ Python python 开发语言大数据数据分析数据挖掘
大家好，从今天开始呢，杰哥开展一个新的专栏，当然，数据分析部分也会不定时更新的，这个新的专栏主要是讲解一些Python的基础语法和知识，帮助0基础的小伙伴入门和学习Python，感兴趣的小伙伴可以开始认真学习啦！一、Python简介【了解】1.计算机工作原理编程语言就是用来定义计算机程序的形式语言。我们通过编程语言来编写程序代码，再通过语言处理程序执行向计算机发送指令，让计算机完成对应的工作，编程
【目标检测数据集】卡车数据集1073张VOC+YOLO格式熬夜写代码的平头哥∰ 目标检测 YOLO 人工智能
数据集格式：PascalVOC格式+YOLO格式(不包含分割路径的txt文件，仅仅包含jpg图片以及对应的VOC格式xml文件和yolo格式txt文件)图片数量(jpg文件个数)：1073标注数量(xml文件个数)：1073标注数量(txt文件个数)：1073标注类别数：1标注类别名称:["truck"]每个类别标注的框数：truck框数=1120总框数：1120使用标注工具：labelImg标注
MongoDB Oplog 窗口喝醉酒的小白 MongoDB 运维
在MongoDB中，oplog（操作日志）是一个特殊的日志系统，用于记录对数据库的所有写操作。oplog允许副本集成员（通常是从节点）应用主节点上已经执行的操作，从而保持数据的一致性。它是MongoDB副本集实现数据复制的基础。MongoDBOplog窗口oplog窗口是指在MongoDB副本集中，从节点可以用来同步数据的时间范围。这个窗口通常由以下因素决定：Oplog大小：oplog的大小是有限
Faiss Tips：高效向量搜索与聚类的利器焦习娜Samantha
FaissTips：高效向量搜索与聚类的利器faiss_tipsSomeusefultipsforfaiss项目地址:https://gitcode.com/gh_mirrors/fa/faiss_tips项目介绍Faiss是由FacebookAIResearch开发的一个用于高效相似性搜索和密集向量聚类的库。它支持多种硬件平台，包括CPU和GPU，能够在海量数据集上实现快速的近似最近邻搜索（AN
pyecharts——绘制柱形图折线图 2224070247 信息可视化 python java 数据可视化
一、pyecharts概述自2013年6月百度EFE(ExcellentFrontEnd）数据可视化团队研发的ECharts1.0发布到GitHub网站以来，ECharts一直备受业界权威的关注并获得广泛好评，成为目前成熟且流行的数据可视化图表工具，被应用到诸多数据可视化的开发领域。Python作为数据分析领域最受欢迎的语言，也加入ECharts的使用行列，并研发出方便Python开发者使用的数据
番茄西红柿叶子病害分类数据集12882张11类别 futureflsl 数据集分类数据挖掘人工智能
数据集类型：图像分类用，不可用于目标检测无标注文件数据集格式：仅仅包含jpg图片，每个类别文件夹下面存放着对应图片图片数量(jpg文件个数)：12882分类类别数：11类别名称:["Bacterial_Spot_Bacteria","Early_Blight_Fungus","Healthy","Late_Blight_Water_Mold","Leaf_Mold_Fungus","Powdery
钢筋长度超限检测检数据集VOC+YOLO格式215张1类别 futureflsl 数据集 YOLO 深度学习机器学习
数据集格式：PascalVOC格式+YOLO格式(不包含分割路径的txt文件，仅仅包含jpg图片以及对应的VOC格式xml文件和yolo格式txt文件)图片数量(jpg文件个数)：215标注数量(xml文件个数)：215标注数量(txt文件个数)：215标注类别数：1标注类别名称:["iron"]每个类别标注的框数：iron框数=215总框数：215使用标注工具：labelImg标注规则：对类别进
java的(PO,VO,TO,BO,DAO,POJO) Cb123456 VO TO BO POJO DAO
转: http://www.cnblogs.com/yxnchinahlj/archive/2012/02/24/2366110.html ------------------------------------------------------------------- O/R Mapping 是 Object Relational Mapping（对象关系映
spring ioc原理（看完后大家可以自己写一个spring） aijuans spring
最近，买了本Spring入门书：spring In Action 。大致浏览了下感觉还不错。就是入门了点。Manning的书还是不错的，我虽然不像哪些只看Manning书的人那样专注于Manning,但怀着崇敬的心情和激情通览了一遍。又一次接受了IOC 、DI、AOP等Spring核心概念。先就IOC和DI谈一点我的看法。IO
MyEclipse 2014中Customize Persperctive设置无效的解决方法 Kai_Ge MyEclipse2014
高高兴兴下载个MyEclipse2014，发现工具条上多了个手机开发的按钮，心生不爽就想弄掉他！结果发现Customize Persperctive失效！！有说更新下就好了，可是国内Myeclipse访问不了，何谈更新... so~这里提供了更新后的一下jar包，给大家使用！ 1、将9个jar复制到myeclipse安装目录\plugins中 2、删除和这9个jar同包名但是版本号较
SpringMvc上传 120153216 springMVC
@RequestMapping(value = WebUrlConstant.UPLOADFILE) @ResponseBody public Map<String, Object> uploadFile(HttpServletRequest request,HttpServletResponse httpresponse) { try { //
Javascript----HTML DOM 事件何必如此 JavaScript html Web
HTML DOM 事件允许Javascript在HTML文档元素中注册不同事件处理程序。事件通常与函数结合使用，函数不会在事件发生前被执行！注：DOM：指明使用的 DOM 属性级别。 1.鼠标事件属性
动态绑定和删除onclick事件 357029540 JavaScript jquery
因为对JQUERY和JS的动态绑定事件的不熟悉，今天花了好久的时间才把动态绑定和删除onclick事件搞定!现在分享下我的过程。在我的查询页面，我将我的onclick事件绑定到了tr标签上同时传入当前行(this值)参数，这样可以在点击行上的任意地方时可以选中checkbox，但是在我的某一列上也有一个onclick事件是用于下载附件的，当
HttpClient|HttpClient请求详解 7454103 apache 应用服务器网络协议网络应用 Security
HttpClient 是 Apache Jakarta Common 下的子项目，可以用来提供高效的、最新的、功能丰富的支持 HTTP 协议的客户端编程工具包，并且它支持 HTTP 协议最新的版本和建议。本文首先介绍 HTTPClient，然后根据作者实际工作经验给出了一些常见问题的解决方法。HTTP 协议可能是现在 Internet 上使用得最多、最重要的协议了，越来越多的 Java 应用程序需
递归逐层统计树形结构数据 darkranger 数据结构
将集合递归获取树形结构: /** * * 递归获取数据 * @param alist:所有分类 * @param subjname:对应统计的项目名称 * @param pk:对应项目主键 * @param reportList: 最后统计的结果集 * @param count:项目级别 */ public void getReportVO(Arr
访问WEB-INF下使用frameset标签页面出错的原因 aijuans struts2
<frameset rows="61,*,24" cols="*" framespacing="0" frameborder="no" border="0">
MAVEN常用命令 avords
Maven库： http://repo2.maven.org/maven2/ Maven依赖查询： http://mvnrepository.com/ Maven常用命令： 1. 创建Maven的普通java项目： mvn archetype:create -DgroupId=packageName
PHP如果自带一个小型的web服务器就好了 houxinyou apache 应用服务器 Web PHP 脚本
最近单位用PHP做网站，感觉PHP挺好的，不过有一些地方不太习惯，比如，环境搭建。PHP本身就是一个网站后台脚本，但用PHP做程序时还要下载apache，配置起来也不太很方便，虽然有好多配置好的apache+php+mysq的环境，但用起来总是心里不太舒服，因为我要的只是一个开发环境，如果是真实的运行环境，下个apahe也无所谓，但只是一个开发环境，总有一种杀鸡用牛刀的感觉。如果php自己的程序中
NoSQL数据库之Redis数据库管理(list类型) bijian1013 redis 数据库 NoSQL
3.list类型及操作 List是一个链表结构，主要功能是push、pop、获取一个范围的所有值等等，操作key理解为链表的名字。Redis的list类型其实就是一个每个子元素都是string类型的双向链表。我们可以通过push、pop操作从链表的头部或者尾部添加删除元素，这样list既可以作为栈，又可以作为队列。 &nbs
谁在用Hadoop？ bingyingao hadoop 数据挖掘公司应用场景
Hadoop技术的应用已经十分广泛了，而我是最近才开始对它有所了解，它在大数据领域的出色表现也让我产生了兴趣。浏览了他的官网，其中有一个页面专门介绍目前世界上有哪些公司在用Hadoop，这些公司涵盖各行各业，不乏一些大公司如alibaba,ebay,amazon,google,facebook,adobe等，主要用于日志分析、数据挖掘、机器学习、构建索引、业务报表等场景,这更加激发了学习它的热情。
【Spark七十六】Spark计算结果存到MySQL bit1129 mysql
package spark.examples.db import java.sql.{PreparedStatement, Connection, DriverManager} import com.mysql.jdbc.Driver import org.apache.spark.{SparkContext, SparkConf} object SparkMySQLInteg
Scala: JVM上的函数编程 bookjovi scala erlang haskell
说Scala是JVM上的函数编程一点也不为过，Scala把面向对象和函数型编程这两种主流编程范式结合了起来，对于熟悉各种编程范式的人而言Scala并没有带来太多革新的编程思想，scala主要的有点在于Java庞大的package优势，这样也就弥补了JVM平台上函数型编程的缺失，MS家.net上已经有了F#，JVM怎么能不跟上呢？对本人而言
jar打成exe bro_feng java jar exe
今天要把jar包打成exe，jsmooth和exe4j都用了。遇见几个问题。记录一下。两个软件都很好使，网上都有图片教程，都挺不错。首先肯定是要用自己的jre的，不然不能通用，其次别忘了把需要的lib放到classPath中。困扰我很久的一个问题是，我自己打包成功后，在一个同事的没有装jdk的电脑上运行，就是不行，报错jvm.dll为无效的windows映像，如截图最后发现
读《研磨设计模式》-代码笔记-策略模式-Strategy bylijinnan java 设计模式
声明：本文只为方便我个人查阅和理解，详细的分析以及源代码请移步原作者的博客http://chjavach.iteye.com/ /* 策略模式定义了一系列的算法，并将每一个算法封装起来，而且使它们还可以相互替换。策略模式让算法独立于使用它的客户而独立变化简单理解： 1、将不同的策略提炼出一个共同接口。这是容易的，因为不同的策略，只是算法不同，需要传递的参数
cmd命令值cvfM命令 chenyu19891124 cmd
cmd命令还真是强大啊。今天发现jar -cvfM aa.rar @aaalist 就这行命令可以根据aaalist取出相应的文件例如：在d：\workspace\prpall\test.java 有这样一个文件，现在想要将这个文件打成一个包。运行如下命令即可比如在d：\wor
OpenJWeb(1.8) Java Web应用快速开发平台 comsci java 框架 Web 项目管理企业应用
OpenJWeb(1.8) Java Web应用快速开发平台的作者是我们技术联盟的成员，他最近推出了新版本的快速应用开发平台 OpenJWeb(1.8)，我帮他做做宣传 OpenJWeb快速开发平台以快速开发为核心，整合先进的java 开源框架，本着自主开发+应用集成相结合的原则，旨在为政府、企事业单位、软件公司等平台用户提供一个架构透
Python 报错：IndentationError: unexpected indent daizj python tab 空格缩进
IndentationError: unexpected indent 是缩进的问题，也有可能是tab和空格混用啦 Python开发者有意让违反了缩进规则的程序不能通过编译，以此来强制程序员养成良好的编程习惯。并且在Python语言里，缩进而非花括号或者某种关键字，被用于表示语句块的开始和退出。增加缩进表示语句块的开
HttpClient 超时设置 dongwei_6688 httpclient
HttpClient中的超时设置包含两个部分： 1. 建立连接超时，是指在httpclient客户端和服务器端建立连接过程中允许的最大等待时间 2. 读取数据超时，是指在建立连接后，等待读取服务器端的响应数据时允许的最大等待时间在HttpClient 4.x中如下设置： HttpClient httpclient = new DefaultHttpC
小鱼与波浪 dcj3sjt126com
一条小鱼游出水面看蓝天，偶然间遇到了波浪。　　小鱼便与波浪在海面上游戏，随着波浪上下起伏、汹涌前进。　　小鱼在波浪里兴奋得大叫：“你每天都过着这么刺激的生活吗？简直太棒了。”　　波浪说：“岂只每天过这样的生活，几乎每一刻都这么刺激！还有更刺激的，要有潮汐变化，或者狂风暴雨，那才是兴奋得心脏都会跳出来。”　　小鱼说：“真希望我也能变成一个波浪，每天随着风雨、潮汐流动，不知道有多么好！”　　很快，小鱼
Error Code: 1175 You are using safe update mode and you tried to update a table dcj3sjt126com mysql
快速高效用：SET SQL_SAFE_UPDATES = 0；下面的就不要看了！今日用MySQL Workbench进行数据库的管理更新时，执行一个更新的语句碰到以下错误提示： Error Code: 1175 You are using safe update mode and you tried to update a table without a WHERE that
枚举类型详细介绍及方法定义 gaomysion enum javaee
转发 http://developer.51cto.com/art/201107/275031.htm 枚举其实就是一种类型，跟int, char 这种差不多，就是定义变量时限制输入的，你只能够赋enum里面规定的值。建议大家可以看看，这两篇文章，《java枚举类型入门》和《C++的中的结构体和枚举》，供大家参考。枚举类型是JDK5.0的新特征。Sun引进了一个全新的关键字enum
Merge Sorted Array hcx2013 array
Given two sorted integer arrays nums1 and nums2, merge nums2 into nums1 as one sorted array. Note:You may assume that nums1 has enough space (size that is
Expression Language 3.0新特性 jinnianshilongnian el 3.0
Expression Language 3.0表达式语言规范最终版从2013-4-29发布到现在已经非常久的时间了；目前如Tomcat 8、Jetty 9、GlasshFish 4已经支持EL 3.0。新特性包括：如字符串拼接操作符、赋值、分号操作符、对象方法调用、Lambda表达式、静态字段/方法调用、构造器调用、Java8集合操作。目前Glassfish 4/Jetty实现最好，对大多数新特性
超越算法来看待个性化推荐 liyonghui160com 超越算法来看待个性化推荐
一提到个性化推荐，大家一般会想到协同过滤、文本相似等推荐算法，或是更高阶的模型推荐算法，百度的张栋说过，推荐40%取决于UI、30%取决于数据、20%取决于背景知识，虽然本人不是很认同这种比例，但推荐系统中，推荐算法起的作用起的作用是非常有限的。就像任何
写给Javascript初学者的小小建议 pda158 JavaScript
　　一般初学JavaScript的时候最头痛的就是浏览器兼容问题。在Firefox下面好好的代码放到IE就不能显示了，又或者是在IE能正常显示的代码在firefox又报错了。　　如果你正初学JavaScript并有着一样的处境的话建议你：初学JavaScript的时候无视DOM和BOM的兼容性，将更多的时间花在了解语言本身（ECMAScript）。只在特定浏览器编写代码（Chrome/Fi
Java 枚举 ShihLei java enum 枚举
注：文章内容大量借鉴使用网上的资料，可惜没有记录参考地址，只能再传对作者说声抱歉并表示感谢！一基础 1）语法枚举类型只能有私有构造器（这样做可以保证客户代码没有办法新建一个enum的实例）枚举实例必须最先定义 2）特性 &nb
Java SE 6 HotSpot虚拟机的垃圾回收机制 uuhorse java HotSpot GC 垃圾回收 VM
官方资料，关于Java SE 6 HotSpot虚拟机的garbage Collection，非常全，英文。 http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html Java SE 6 HotSpot[tm] Virtual Machine Garbage Collection Tuning &