SAM格式详解

SAM

Table of contents

  • SAM format specification
    • Using SAM to store various types of alignments
      • Clipped alignment
      • Spliced alignment
      • Multi-part alignment
      • Padded alignment
    • Working on an example
    • Parsing SAM files using Perl
    • SAM Format Specification


SAM format specification


The Sequence Alignment/Map (SAM) format is a generic nucleotide alignment format that describes the alignment of query sequences or sequencing reads to a reference sequence or assembly. Importantly:

  • It is flexible enough to store all the alignment information generated by various alignment programs;
  • It is simple enough to be easily generated by alignment programs or converted from existing alignment formats;
  • It is compact in file size;
  • It allows most of the operations on the alignment to work on a stream without loading the whole alignment into memory;
  • It allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus.

SAM is a tab-delimited te xt format. SAM is a bit slow to parse; so there is a binary equivalent to SAM, called BAM.

SAM allows optional fields to be stored. In SAM, each alignment must contain a fi xed number of mandatory fields that describe the key information about the alignment (such as coordinate detailed alignment and sequences) and may contain a variable number of optional fields which are less important or aligner specific.

Using SAM to store various types of alignments


SAM is able to store clipped alignments, spliced alignments, multi-part alignments, padded alignments and alignments in colour space. The e xtended CIGAR string is the key to describing these types of alignments.

Clipped alignment


In Smith-Waterman alignment, a sequence may not be aligned from the first residue to the last one. Subsequences at the ends may be clipped off. We introduce operation 'S' to describe clipped alignment. Suppose the clipped alignment is:

clipped_alignment
REF: AGCTAGCATCGTGTCGCCCGTCTAGCATACGCATGATCGACTGTCAGCTAGTCAGACTAGTCGATCGATGTG
READ:          gggGTGTAACC-GACTAGgggg


where on the read sequence, bases in uppercase are matches and bases in lowercase are clipped off. The CIGAR for this alignment is: 3S8M1D6M4S (which I interpret as 3 soft, 8 match, 1 deletion, 6 match and 4 soft).

Spliced alignment


In cDNA-to-genome alignment, we may want to distinguish introns from deletions in e xons. We introduce openation 'N' to represent long skip on the reference sequence. Suppose the spliced alignment is:

clipped_alignment
REF: AGCTAGCATCGTGTCGCCCGTCTAGCATACGCATGATCGACTGTCAGCTAGTCAGACTAGTCGATCGATGTG
READ:          GTGTAACCC................................TCAGAATA


where '...' on the read sequence indicates intron. The CIGAR for this alignment is : 9M32N8M.

Multi-part alignment


One query sequence may be aligned to multiple places on the reference genome, either with or without overlaps. In SAM, we keep multiple hits as multiple alignment records. To avoid presenting the full query sequence multiple times for non-overlapping hits, we introduce operation 'H' to describe hard clipped alignment. Hard clipping (H) is similar to soft clipping (S). They are different in that hard clipped subsequence is not present in the alignment record. The e xample alignment in "clipped alignment" can also be represented with CIGAR: 3H8M1D6M4H, but in this case, the sequence stored in SAM is "GTGTAACCGACTAG", instead of "GGGGTGTAACCGACTAGGGGG" if soft clipping is in use.

Padded alignment


Most sequence aligners only give the sequences inserted to the reference genome, but do not present how these inserted sequences are aligned against each other. Alignment with inserted sequences fully aligned is called padded alignment. Padded alignment is always produced by de novo assemblers and is important for an alignment viewer to display the alignment properly. To store padded alignment, we introduce operation 'P' which can be considered as a silent deletion from padded reference sequence. In the following e xample, GA on READ1 and A on READ2 are inserted to the reference. With unpadded CIGAR, we would not be able to distinguish the following padded multi-alignments:

clipped_alignment
REF:  CACGATCA**GACCGATACGTCCGA        REF:  CACGATCA**GACCGATACGTCCGA
READ1:  CGATCAGAGACCGATA               READ1:  CGATCAGAGACCGATA
READ2:    ATCA*AGACCGATAC              READ2:    ATCAA*GACCGATAC
READ3:   GATCA**GACCG                  READ3:   GATCA**GACCG

The padded CIGAR are different:
READ1: 6M2I8M                          READ1: 6M2I8M
READ2: 4M1P1I9M                        READ2: 4M1I1P9M
READ3: 5M2P5M                          READ3: 5M2P5M


Note that it is hard to convert unpadded CIGAR to padded one. Fully resolving the alignment between inserted sequences would essentially require a de novo assembler. However, it is easy vice versa. By simply removing all P operations we get the CIGAR without padding.

Alignments in colour space Colour alignments are stored as normal nucleotide alignments with additional tags describing the raw colour sequences, qualities and colour-specific properties.

Working on an example


Downloaded from the 1000 genome project ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA06984/alignment/NA06984.chrom16.ILLUMINA.bwa.CEU.low_coverage.20100517.bam

samtools view NA06984.chrom16.ILLUMINA.bwa.CEU.low_coverage.20100517.bam | head -1

SRR035022.2621862 163 16 59999 37 22S54M = 60102 179 CCAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCGACCCTCACCCTCACCC >AAA=>?AA>@@B@B?AABAB?AABAB?AAC@B?@AB@A?A>A@A?AAAAB??ABAB?79A?AAB;B?@?@<=8:8 XT:A:M XN:i:2 SM:i:37 AM:i:37 XM:i:0 XO:i:0 XG:i:0 RG: Z:SRR035022 NM:i:2 MD: Z:0N0N52 OQ: Z:CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCBCCCCCCBBCC@CCCCCCCCCCACCCCC;CCCBBC?CCCACCACA@

The alignment section consists of multiple TAB-delimited lines with each line describing an alignment. Each line is:
sequence_string.sam
        ZE>   [:: [...]]




1. QNAME and FLAG are required for all alignments. If the mapping position of the query is not available, RNAME and CIGAR are set as “*”, and POS and MAPQ as 0. If the query is unpaired or pairing information is not available, MRNM equals “*”, and MPOS and ISI ZE equal 0. SEQ and QUAL can both be absent, represented as a star “*”. If QUAL is not a star, it must be of the same length as SEQ.
2. The name of a pair/read is required to be unique in the SAM file, but one pair/read may appear multiple times in different alignment records, representing multiple or split hits. The ma ximum string length is 254.
3. If SQ is present in the header, RNAME and MRNM must appear in an SQ header record.
4. Field MAPQ considers pairing in calculation if the read is paired. Providing MAPQ is recommended. If such a calculation is difficult, 255 should be applied, indicating the mapping quality is not available.
5. If the two reads in a pair are mapped to the same reference, ISI ZE equals the difference between the coordinate of the 5ʼ-end of the mate and of the 5ʼ-end of the current read; otherwise ISI ZE equals 0 (by the “5ʼ-end” we mean the 5ʼ-end of the original read, so for Illumina short-insert paired end reads this calculates the difference in mapping coordinates of the outer edges of the original sequenced fragment). ISI ZE is negative if the mate is mapped to a smaller coordinate than the current read.
6. Color alignments are stored as normal nucleotide alignments with additional tags describing the raw color sequences, qualities and color-specific properties (see also Note 5 in section 2.2.4).
7. All mapped reads are represented on the forward genomic strand. The bases are reverse complemented from the unmapped read sequence and the quality scores and cigar strings are recorded consistently with the bases. This applies to information in the mate tags (R2, Q2, S2, etc.) and any other tags that are strand sensitive. The strand bits in the flag simply indicates whether this reverse complement transform was applied from the original read sequence to obtain the bases listed in the SAM file.

So in our e xample:

QNAME SRR035022.2621862
FLAG 163
RNAME 16
POS 59999
MAQ 37
CIGAR 22S54M
MRNM =
MPOS 60102
ISIZE 179
SEQ CCAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCGACCCTCACCCTCACCC
QUAL >AAA=>?AA>@@B@B?AABAB?AABAB?AAC@B?@AB@A?A>A@A?AAAAB??ABAB?79A?AAB;B?@?@<=8:8
TAG XT:A:M
TAG XN:i:2
TAG SM:i:37
TAG AM:i:37
TAG XM:i:0
TAG XO:i:0
TAG XG:i:0
TAG RG:Z:SRR035022
TAG NM:i:2
TAG MD:Z:0N0N52
TAG OQ:Z:CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCBCCCCCCBBCC@CCCCCCCCCCACCCCC;CCCBBC?CCCACCACA


The QNAME is the query name. For the FLAG of 163 we transform this into a binary string: 10100011. So accordingly to the flag table:



1 the read is paired in sequencing, no matter whether it is mapped in a pair
1 the read is mapped in a proper pair
0 not unmapped
0 mate is not unmapped
0 forward strand
1 mate strand is negative
0 the read is not the first read in a pair
1 the read is the second read in a pair


The RNAME is chromosome 16; the POS is the 59_999; the MAQ is 37.

And now the CIGAR format:

A CIGAR string is comprised of a series of operation lengths plus the operations. The conventional CIGAR format allows for three types of operations: M for match or mismatch, I for insertion and D for deletion. The e xtended CIGAR format futher allows four more operations, as is shown in the following table, to describe clipping, padding and splicing:

op Description
M Alignment match (can be a sequence match or mismatch
I Insertion to the reference
D Deletion from the reference
N Skipped region from the reference
S Soft clip on the read (clipped sequence present in )
H Hard clip on the read (clipped sequence NOT present in )
P Padding (silent deletion from the padded reference sequence)


Optional fields are in the format: ::. Each tag is encoded in two alphanumeric characters and appears only once for an alignment. The follows Perl's rule (see perldoc -f pack). Valid types in SAM are:

Type Description
A Printable character
i Signed 32-bin interger
f Single-precision float number
Z Printable string
H Hex string (high nybble first)


There are a bunch of predefined tags, please see the SAM manual for more information. For the tags used in this e xample:

Any tags that start with X? are reserved fields for end users: XT:A:M, XN:i:2, XM:i:0, XO:i:0, XG:i:0

SM:i:37
- Mapping quality if the read is mapped as a single read rather than as a read pair
AM:i:37 - Smaller single-end mapping quality of the two reads in a pair
RG:Z:SRR035022 - Read group. Value matches the header RG-ID tag if @RG is present in the header
NM:i:2 - Number of nucleotide differences (i.e. edit distance to the reference sequence)
MD:Z:0N0N52 - String for mismatching positions in the format of [0-9]+(([ACGTN]|\^[ACGTN]+)[0-9]+)*
OQ:Z:CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCBCCCCCCBBCC@CCCCCCCCCCACCCCC;CCCBBC?CCCACCACA@ - Original base quality. Encoded in the same wasy as QUAL



Parsing SAM files using Perl


http://search.cpan.org/~lds/Bio-SamTools/lib/Bio/DB/Sam.pm

SAM Format Specification


http://samtools.sourceforge.net/SAM1.pdf

你可能感兴趣的:(Linux,bioinfo,巨人肩上)