samtools软件从bam文件提取目标序列

可以使用samtools fastq命令从bam文件中提取特定的fastq序列。
使用帮助

samtools fastq
Usage: samtools fastq [options...] 
Options:
  -0 FILE              write reads designated READ_OTHER to FILE
  -1 FILE              write reads designated READ1 to FILE
  -2 FILE              write reads designated READ2 to FILE
                       note: if a singleton file is specified with -s, only
                       paired reads will be written to the -1 and -2 files.
  -f INT               only include reads with all  of the FLAGs in INT present [0]
  -F INT               only include reads with none of the FLAGS in INT present [0]
  -G INT               only EXCLUDE reads with all  of the FLAGs in INT present [0]
  -n                   don't append /1 and /2 to the read name
  -N                   always append /1 and /2 to the read name
  -O                   output quality in the OQ tag if present
  -s FILE              write singleton reads designated READ1 or READ2 to FILE
  -t                   copy RG, BC and QT tags to the FASTQ header line
  -T TAGLIST           copy arbitrary tags to the FASTQ header line
  -v INT               default quality score if not given in file [1]
  -i                   add Illumina Casava 1.8 format entry to header (eg 1:N:0:ATCACG)
  -c                   compression level [0..9] to use when creating gz or bgzf fastq files
  --i1 FILE            write first index reads to FILE
  --i2 FILE            write second index reads to FILE
  --barcode-tag TAG    Barcode tag [default: BC]
  --quality-tag TAG    Quality tag [default: QT]
  --index-format STR   How to parse barcode and quality tags

      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]
  -@, --threads INT
               Number of additional threads to use [0]

Reads are designated READ1 if FLAG READ1 is set and READ2 is not set.
Reads are designated READ2 if FLAG READ1 is not set and READ2 is set.
Reads are designated READ_OTHER if FLAGs READ1 and READ2 are either both set
or both unset.
Run 'samtools flags' for more information on flag codes and meanings.

The index-format string describes how to parse the barcode and quality tags, for example:
   i14i8       the first 14 characters are index 1, the next 8 characters are index 2
   n8i14       ignore the first 8 characters, and use the next 14 characters for index 1
If the tag contains a separator, then the numeric part can be replaced with '*' to mean
'read until the separator or end of tag', for example:
   n*i*        ignore the left part of the tag until the separator, then use the second part
               of the tag as index 1

Examples:
 To get just the paired reads in separate files, use:
   samtools fastq -1 paired1.fq -2 paired2.fq -0 /dev/null -s /dev/null -n -F 0x900 in.bam

 To get all non-supplementary/secondary reads in a single file, redirect the output:
   samtools fastq -F 0x900 in.bam > all_reads.fq

Flags对应的意义如下:
BAM比对文件中flags对应的意义 - (jianshu.com)

Flags对应的意义.png

以上可以知道数字4可以将未比对上序列的和比对上序列分开。
数字64或128可以将read1和read2区分开。
这是示例序列的比对结果:

36599658 reads; of these:
36599658 (100.00%) were paired; of these:
36567223 (99.91%) aligned concordantly 0 times
32435 (0.09%) aligned concordantly exactly 1 time
0 (0.00%) aligned concordantly >1 times
----
36567223 pairs aligned concordantly 0 times; of these:
182 (0.00%) aligned discordantly 1 time
----
36567041 pairs aligned 0 times concordantly or discordantly; of these:
73134082 mates make up the pairs; of these:
73133169 (100.00%) aligned 0 times
913 (0.00%) aligned exactly 1 time
0 (0.00%) aligned >1 times
0.09% overall alignment rate

提取所有reads

双末端测序

ls *.bam |while read id
do
samtools fastq -@ 4 -N $id -1 ${id%%_*}_1.fq -2 ${id%%_*}_2.fq
gzip ${id%%_*}_1.fq ${id%%_*}_2.fq
done

单末端测序

ls *.bam |while read id
do
samtools fastq -@ 2 -N $id > ${id%%_*}.fq
gzip ${id%%_*}.fq
done

如果要提取特定状态的reads,需要使用f(提取)或F(过滤)参数+flags(BAM比对文件中flags对应的意义 - (jianshu.com)
),例如:-f 2为提取两条reads都比对上的reads;-F 4:提取所有比对上的reads(包含双末端中只有单条比对上的reads);-F 8过滤两条reads都没比对上的reads。
有些状态的reads不能通过一个步骤就将其提取;例如配对的两条reads只有其中一条比对上,我想提取另一条。可以分为两步提取:

samtools view -f 4 -b in.bam | samtools fastq -F 8 -1 read1-1.fq -2 read2-2.fq 

或者两条reads只有其中一条比对上,我想提取比对上的这一条

samtools view -F 4 -b in.bam | samtools fastq -F 2 -1 read1-1.fq -2 read2-2.fq

你可能感兴趣的:(samtools软件从bam文件提取目标序列)