1.Samtools的help:
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ samtools --help Program: samtools (Tools for alignments in the SAM format) Version: 1.3-20-gd49c73b (using htslib 1.3-35-g26b3085) Usage: samtools <command> [options] Commands: -- Indexing dict create a sequence dictionary file faidx index/extract FASTA index index alignment -- Editing calmd recalculate MD/NM tags and '=' bases fixmate fix mate information reheader replace BAM header rmdup remove PCR duplicates targetcut cut fosmid regions (for fosmid pool only) addreplacerg adds or replaces RG tags -- File operations collate shuffle and group alignments by name cat concatenate BAMs merge merge sorted alignments mpileup multi-way pileup sort sort alignment file split splits a file by read group quickcheck quickly check if SAM/BAM/CRAM file appears intact fastq converts a BAM to a FASTQ fasta converts a BAM to a FASTA -- Statistics bedcov read depth per BED region depth compute the depth flagstat simple stats idxstats BAM index stats phase phase heterozygotes stats generate stats (former bamcheck) -- Viewing flags explain BAM flags tview text alignment viewer view SAM<->BAM<->CRAM conversion depad convert padded BAM to unpadded BAM
2.Bam和Sam相互转换:
samtools view -h file.bam > file.sam samtools view -b -S file.sam > file.bam
示例:
samtools view -h NA12878.bam >NA12878_2.sam samtools view -b -S NA12878.sam > NA12878_2.bam
3.构建索引index:
samtools index sorted.bam实例:
samtools index NA12878.bam生成NA12878.bam.bai,具体:
不知道如何查看.bai文件???
4.统计信息flagstat:
samtools flagstat NA12878Merge.bam
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ samtools flagstat NA12878Merge.bam 1058 + 72 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 0 + 0 supplementary 118 + 32 duplicates 1058 + 72 mapped (100.00% : 100.00%) 1058 + 72 paired in sequencing 516 + 26 read1 542 + 46 read2 1048 + 64 properly paired (99.05% : 88.89%) 1048 + 64 with itself and mate mapped 10 + 8 singletons (0.95% : 11.11%) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)
hadoop@Mcnode6:~/cloud/adam/xubo/1000genomes/GIH/NA21144/alignment$ samtools flagstat NA21144.alt_bwamem_GRCh38DH.20150718.GIH.low_coverage.cram 247384356 + 0 in total (QC-passed reads + QC-failed reads) 2090839 + 0 secondary 9082475 + 0 supplementary 5797777 + 0 duplicates 246212003 + 0 mapped (99.53% : N/A) 236211042 + 0 paired in sequencing 118105521 + 0 read1 118105521 + 0 read2 226096416 + 0 properly paired (95.72% : N/A) 233890591 + 0 with itself and mate mapped 1148098 + 0 singletons (0.49% : N/A) 4379358 + 0 with mate mapped to a different chr 1561618 + 0 with mate mapped to a different chr (mapQ>=5)
5.合并文件merge:
bam文件:
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ samtools merge NA12878Merge.bam NA12878.bam NA12878_2.bam hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ ls aln.sorted.bam cramtools-3.0.jar NA12878_2.bam NA12878_2.sam NA12878.bam NA12878.bam.bai NA12878Merge.bam NA12878.sam hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ ll total 4676 drwxrwxr-x 2 hadoop hadoop 4096 3月 9 20:41 ./ drwx------ 13 hadoop hadoop 4096 3月 7 22:07 ../ -rw-rw-r-- 1 hadoop hadoop 64321 3月 9 20:29 aln.sorted.bam -rw-rw-r-- 1 hadoop hadoop 3978747 3月 8 22:01 cramtools-3.0.jar -rw-rw-r-- 1 hadoop hadoop 64321 3月 9 20:27 NA12878_2.bam -rw-rw-r-- 1 hadoop hadoop 255818 3月 9 16:06 NA12878_2.sam -rw-rw-r-- 1 hadoop hadoop 64321 3月 9 16:00 NA12878.bam -rw-rw-r-- 1 hadoop hadoop 200 3月 9 16:04 NA12878.bam.bai -rw-r--r-- 1 hadoop hadoop 4096 3月 9 16:01 .NA12878.bam.swp -rw-rw-r-- 1 hadoop hadoop 75781 3月 9 20:41 NA12878Merge.bam -rw-r--r-- 1 hadoop hadoop 255818 11月 16 2014 NA12878.sam
bam统计信息:
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ samtools flagstat NA12878.bam 529 + 36 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 0 + 0 supplementary 59 + 16 duplicates 529 + 36 mapped (100.00% : 100.00%) 529 + 36 paired in sequencing 258 + 13 read1 271 + 23 read2 524 + 32 properly paired (99.05% : 88.89%) 524 + 32 with itself and mate mapped 5 + 4 singletons (0.95% : 11.11%) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5) hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ samtools flagstat NA12878Merge.bam 1058 + 72 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 0 + 0 supplementary 118 + 32 duplicates 1058 + 72 mapped (100.00% : 100.00%) 1058 + 72 paired in sequencing 516 + 26 read1 542 + 46 read2 1048 + 64 properly paired (99.05% : 88.89%) 1048 + 64 with itself and mate mapped 10 + 8 singletons (0.95% : 11.11%) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ samtools merge NA12878Merge.sam NA12878.sam NA12878_2.sam hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ ls aln.sorted.bam cramtools-3.0.jar NA12878_2.bam NA12878_2.sam NA12878.bam NA12878.bam.bai NA12878Merge.bam NA12878Merge.sam NA12878.sam hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ ll total 5188 drwxrwxr-x 2 hadoop hadoop 4096 3月 9 20:44 ./ drwx------ 13 hadoop hadoop 4096 3月 7 22:07 ../ -rw-rw-r-- 1 hadoop hadoop 64321 3月 9 20:29 aln.sorted.bam -rw-rw-r-- 1 hadoop hadoop 3978747 3月 8 22:01 cramtools-3.0.jar -rw-rw-r-- 1 hadoop hadoop 64321 3月 9 20:27 NA12878_2.bam -rw-rw-r-- 1 hadoop hadoop 255818 3月 9 16:06 NA12878_2.sam -rw-rw-r-- 1 hadoop hadoop 64321 3月 9 16:00 NA12878.bam -rw-rw-r-- 1 hadoop hadoop 200 3月 9 16:04 NA12878.bam.bai -rw-r--r-- 1 hadoop hadoop 4096 3月 9 16:01 .NA12878.bam.swp -rw-rw-r-- 1 hadoop hadoop 75781 3月 9 20:41 NA12878Merge.bam -rw-rw-r-- 1 hadoop hadoop 521677 3月 9 20:44 NA12878Merge.sam -rw-r--r-- 1 hadoop hadoop 255818 11月 16 2014 NA12878.sam
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ samtools flagstat NA12878Merge.sam 1058 + 72 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 0 + 0 supplementary 118 + 32 duplicates 1058 + 72 mapped (100.00% : 100.00%) 1058 + 72 paired in sequencing 516 + 26 read1 542 + 46 read2 1048 + 64 properly paired (99.05% : 88.89%) 1048 + 64 with itself and mate mapped 10 + 8 singletons (0.95% : 11.11%) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5) hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ samtools flagstat NA12878.sam 529 + 36 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 0 + 0 supplementary 59 + 16 duplicates 529 + 36 mapped (100.00% : 100.00%) 529 + 36 paired in sequencing 258 + 13 read1 271 + 23 read2 524 + 32 properly paired (99.05% : 88.89%) 524 + 32 with itself and mate mapped 5 + 4 singletons (0.95% : 11.11%) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)
samtools sort -o NA12878.Sorted.bam NA12878.bam
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ samtools index aln.sorted.bam hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ samtools idxstats aln.sorted.bam 20 63025520 565 0 * 0 0 0 hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ ls aln.sorted.bam aln.sorted.bam.bai cramtools-3.0.jar NA12878_2.bam NA12878_2.sam NA12878.bam NA12878.bam.bai NA12878Merge.bam NA12878Merge.sam NA12878.sam NA12878.Sorted.bam
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam/resources$ samtools view -b -S small.sam >small.bam hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam/resources$ ls artificial.fa fastq_sample1.fq interleaved_fastq_sample1.ifq proper_pairs_2.fq small_realignment_targets.pileup artificial.README.txt fastq_sample2.fq interleaved_fastq_sample1.ifq.output reads-0-2-0 small_realignment_targets_README.txt artificial.realigned.sam fastq_sample3.fq interleaved_fastq_sample2.ifq reads12_diff1.sam small_realignment_targets.sam artificial.sam fastq_sample4.fq interleaved_fastq_sample2.ifq.output reads12.sam small.sam bqsr1-ref.observed features interleaved_fastq_sample3.ifq reads21.sam small.vcf bqsr1.sam gencode.v19.pc_transcripts.250k.fa.gz interleaved_fastq_sample3.ifq.output single_fastq_sample1.fq.output sorted.sam bqsr1.snps hg19.chrM.2bit interleaved_fastq_sample4.ifq single_fastq_sample2.fq.output tags.sam bqsr1.vcf Hs_Ensembl_example_genes.gtf interleaved_fastq_sample4.ifq.output single_fastq_sample3.fq.output test.conf chr20.250k.fa.gz human_g1k_v37_chr1_59kb.2bit multiline_fastq.fq single_fastq_sample4.fq.output test_rowgroup_rangeindex.1.txt dict_with_accession.dict human_g1k_v37_chr1_59kb.fasta ordered.sam small.bam unmapped.sam example_intervals.list improper_pairs_1.fq parquet_lister_dir_empty small_missing.vcf unordered.sam fastq_noqual.fq improper_pairs_2.fq proper_pairs_1.fq small_realignment_targets.intervals unsorted.sam hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam/resources$ samtools sort -o smallSort.bam small.bam hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam/resources$ samtools index smallSort.bam hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam/resources$ samtools idxstats smallSort.bam 1 249250621 20 0 2 243199373 0 0 * 0 0 0
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam/resources$ samtools depth --help depth: unrecognized option '--help' Usage: samtools depth [options] in1.bam [in2.bam [...]] Options: -a output all positions (including zero depth) -a -a (or -aa) output absolutely all positions, including unused ref. sequences -b <bed> list of positions or regions -f <list> list of input BAM filenames, one per line [null] -l <int> read length threshold (ignore reads shorter than <int>) -d/-m <int> maximum coverage depth [8000] -q <int> base quality threshold -Q <int> mapping quality threshold -r <chr:from-to> region --input-fmt-option OPT[=VAL] Specify a single input file format option in the form of OPTION or OPTION=VALUE --reference FILE Reference sequence FASTA FILE [null] The output is a simple tab-separated table with three columns: reference name, position, and coverage depth. Note that positions with zero coverage may be omitted by default; see the -a option. hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam/resources$ samtools depth small.bam 1 26472784 1 1 26472785 1 1 26472786 1 1 26472787 1 1 26472788 1 1 26472789 1 1 26472790 1 1 26472791 1 1 26472792 1 1 26472793 1 1 26472794 1 1 26472795 1 1 26472796 1 1 26472797 1 1 26472798 1 1 26472799 1 1 26472800 1 1 26472801 1 1 26472802 1 1 26472803 1 1 26472804 1 1 26472805 1 1 26472806 1 1 26472807 1 1 26472808 1 1 26472809 1 1 26472810 1 1 26472811 1 1 26472812 1 1 26472813 1 1 26472814 1 1 26472815 1 1 26472816 1 1 26472817 1 1 26472818 1 1 26472819 1 1 26472820 1 1 26472821 1 1 26472822 1 1 26472823 1 1 26472824 1 1 26472825 1 1 26472826 1 1 26472827 1 1 26472828 1 1 26472829 1 1 26472830 1 1 26472831 1 1 26472832 1 1 26472833 1 1 26472834 1 1 26472835 1 1 26472836 1 1 26472837 1 1 26472838 1 1 26472839 1 1 26472840 1 1 26472841 1 1 26472842 1 1 26472843 1 1 26472844 1 1 26472845 1 1 26472846 1 1 26472847 1 1 26472848 1 1 26472849 1 1 26472850 1 1 26472851 1 1 26472852 1 1 26472853 1 1 26472854 1 1 26472855 1 1 26472856 1 1 26472857 1 1 26472858 1
faidx samtools faidx <ref.fasta> [region1 [...]] Index reference sequence in the FASTA format or extract subsequence from indexed reference sequence. If no region is specified, faidx will index the file and create <ref.fasta>.fai on the disk. If regions are specified, the subsequences will be retrieved and printed to stdout in the FASTA format. The input file can be compressed in the BGZF format. The sequences in the input file should all have different names. If they do not, indexing will emit a warning about duplicate sequences and retrieval will only produce subsequences from the first sequence with the duplicated name.
hadoop@Mcnode1:~/cloud/adam/xubo/code/samtools/examples$ ls 00README.txt ex1.fa ex1.sam.gz toy.fa toy.sam hadoop@Mcnode1:~/cloud/adam/xubo/code/samtools/examples$ samtools faidx ex1.fa hadoop@Mcnode1:~/cloud/adam/xubo/code/samtools/examples$ ls 00README.txt ex1.fa ex1.fa.fai ex1.sam.gz toy.fa toy.sam
samtools tview [-p chr:pos] [-s STR] [-d display] <in.sorted.bam> [ref.fasta] Text alignment viewer (based on the ncurses library). In the viewer, press `?' for help and press `g' to check the alignment start from a region in the format like `chr10:10,000,000' or `=10,000,000' when viewing the same reference sequence. Options: -d display Output as (H)tml or (C)urses or (T)ext -p chr:pos Go directly to this position -s STR Display only alignments from this sample or read group
hadoop@Mcnode1:~/cloud/adam/xubo/code/samtools/examples$ samtools view -h -S toy.sam >toy.bam hadoop@Mcnode1:~/cloud/adam/xubo/code/samtools/examples$ samtools sort -o toy.sorted.bam toy.bam hadoop@Mcnode1:~/cloud/adam/xubo/code/samtools/examples$ samtools tview toy.bam toy.fa Cannot read index for 'toy.bam'. hadoop@Mcnode1:~/cloud/adam/xubo/code/samtools/examples$ samtools index toy.sorted.bam hadoop@Mcnode1:~/cloud/adam/xubo/code/samtools/examples$ ls 00README.txt ex1.fa ex1.fa.fai ex1.sam.gz toy.bam toy.fa toy.fa.fai toy.sam toy.sorted.bam toy.sorted.bam.bai toy.sorted.sam hadoop@Mcnode1:~/cloud/adam/xubo/code/samtools/examples$ samtools tview toy.sbam toy.fa toy.sam toy.sorted.bam toy.sorted.bam.bai toy.sorted.sam hadoop@Mcnode1:~/cloud/adam/xubo/code/samtools/examples$ samtools tview toy.sbam toy.fa toy.sam toy.sorted.bam toy.sorted.bam.bai toy.sorted.sam hadoop@Mcnode1:~/cloud/adam/xubo/code/samtools/examples$ samtools tview toy.sorted.bam toy.fa 1 11 21 31 41 51 61 71 81 91 101 111 121 131 141 151 161 171 AGCATGTTAGATAA****GATA**GCTGTGCTAGTAGGCAG*TCAGCGCCATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN ........ .... ......K.K......K. .......... ........AGAG....***... ,,,,, ,,,,,,,,, ......GG**....AA ..C...**** ...**...>>>>>>>>>>>>>>T.....
10.将bam/sam/cram文件转换成fasta/fastq文件:
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ samtools fasta -s NA12878.fa NA12878.sam [M::bam2fq_mainloop_singletontrack] discarded 565 singletons [M::bam2fq_mainloop_singletontrack] processed 565 reads hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ ls aln.sorted.bam cramtools-3.0.jar NA12878_2.sam NA12878.bam.bai NA12878Merge.bam NA12878.sam resources aln.sorted.bam.bai NA12878_2.bam NA12878.bam NA12878.fa NA12878Merge.sam NA12878.Sorted.bam hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ vi NA12878.fa hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ samtools fastq -s NA12878.fq NA12878.sam [M::bam2fq_mainloop_singletontrack] discarded 565 singletons [M::bam2fq_mainloop_singletontrack] processed 565 reads hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ more NA12878.fq @20FUKAAXX100202:3:6:15018:84106/2 ACCCAAATCTAATCAAGGCTCCCACTCTAACTCCCAAGCTCTAGGATATACCAAGGACAAAGGAAGATCATGAAATACCACCATGGGGATTCAATCAGCAA + ?@BBBCEEDFEFEEEFDEEFEEEEBFEDEFCFDDEEFEDFDFEEEFEEEECEEFEEFCEFDEEFFEFEDEEEFFFDECEDCEFEEDDFFBFEFGEAEDCCC @20GAVAAXX100126:8:62:5578:2527/1 TTGCTGATTGAATCCCCATGGTGGTATTTCATGATCTTCCTTTGTCCTTGGTATATCCTAGAGCTTGGGAGTTAGAGTGGGAGCCTTGATTAGCTTTGGGT + 1;@3@?<>>?<;:=;==?=>9=?9>9===>=<<;;@@BAABAA>B>4?><8<:=<==;>9:@?>=;<7+/;:/80?5>;-(-;89<::8::8:(1,/:438 @20FUKAAXX100202:4:47:20584:49257/2 CCAAATCTAATCAAGGCTCCCACTCTAACTCCCAAGCTCTAGGATATACCAAGGACAAAGGAAGATCATGAAATACCACCATGGGGATTCAATCAGCAAAT + ?ACDBBCEDFEDEFEEEFEDBECFBFEFCFDEEEFEDFDFEEEFEEEECEEFEEFCEFFEEFFEFEDEAEFFFAECEFCDFEEFBFFDBEEC:@6A?C4>B @20GAVAAXX100126:7:47:4730:37293/2 CCAAATCTAATCAAGGCTCCCACTCTAACTCCCAAGCTCTAGGATATACCAAGGACAAAGGAAGATCATGAAATACCACCATGGGGATTCAATCAGCAAAT + ?BB@BCBFDDECC=E@@DB;BDCFDE<<AEB@B>BADD>?C?EDEB>@AC=<?=DAE?E=CAC?;<C=@ADD?ACACCAC>:>4=B676<17@@<:AA<;6 @20GAVAAXX100126:5:46:21151:39489/1 ATTTGCTGATTGAATCCCCATGGTGGTATTTCATGATCTTCCTTTGTCCTTGGTATATCCTAGAGCTTGGGAGTTAGAGTGGGAGCCTTGATTAGATTTGG + A=DED=FEFEFEFFEEEEEFEEEEEEEEEFFDEEDFEDFFDDFFFDDDDFFEEDEEEEDDFEEFEEFFEEEFEDFEEFECEEEFEEEFFE>BB>BBB=<9> @20FUKAAXX100202:6:2:2264:85470/1 CAAATCTAATCAAGGCTCCCACTCTAACTCCCAAGCTCTAGGATATACCAAGGACAAAGGAAGATCATGAAATACCACCATGGGGATTCAATCAGCAAATT + ABDFEEFEGEEFFEFEFEEEFCCDFEFCFEEEEFEDFDF?EEFCEEEBCEFECFCEFFEDFFBE=CF=CFEFDEB@D=CDDC>D:?;;=@DACECAD5A>9 @20FUKAAXX100202:7:24:21050:192748/1 CAAATCTAATCAAGGCTCCCACTCTAACTCCCAAGCTCTAGGATATACCAAGGACAAAGGAAGATCATGAAATACCACCATGGGGATTCAATCAGCAAATT + ABEFEEFFGFEFFEEEFEEEFCFDFEFCFEEEEFEDFDFEEEFEEEECDEFEEFCEFFEEFFEFCCFBE@FGDEAAEBCDC@:FBFBB?EDEEDCCDCB=< @20FUKAAXX100202:6:48:15418:77607/1 AAATCTAATCAAGGCTCCCACTCTAACTCCCAAGCTCTAGGATATACCAAGGACAAAGGAAGATCATGAAATACCACCATGGGGATTCAATCAGCAAATTC + ACDEDFEGEEFGEEEFEEEECFEFEFCFEEEEFEDFDFEEEFEEEECEEFEEFCEFFEEFFEFEDEEEFFFEECEFCEFFEFFFDDFEFGFEFFEEFECCB @20GAVAAXX100126:3:65:13772:1294/1 AAATCTAATCAAGGCTCCCACTCTAACTCCCAAGCTCTAGGATATACCAAGGACAAAGGAAGATCATGAAATACCACCATGGGGATTCAATCAGCAAATTC +
详细操作请参考【1】
参考资料:
【1】 http://www.bbioo.com/lifesciences/40-113338-1.html
【2】 http://www.htslib.org/doc/samtools.html
【3】