HTS Notes

      生物学重复:指样本重复,比如3只小鼠,同时做一种处理,就是三个生物学重复。

      技术重复:一般是三次实验,比如对一块组织,提了三次RNA,做三次real time。

##HISAT Indexs

genome: HFM index for reference

genome_snp: Hierarchical Graph FM index (HGFM) index for reference plus SNPs

genome_tran: HGFM index for reference plus transcripts

genome_snp_tran: HGFM index for reference plus SNPs and transcripts


1. 统计 reads length

1) 统计fastq reads length

awk 'NR%4 == 2 {lengths[length($0)]++ ; counter++} END {for (l in lengths) {print l, lengths[l]}; print "total reads: " counter}' file.fastq

2) 统计bam中mapped reads length

samtools view ${id%%.*}_mapped.bam|awk 'BEGIN{OFS="\t"}{a[length($10)]++}END{for(i in a) print i,a[i]}' >${id%%.*}_mapped_read_length.txt

2. samtools view

提取比对到参考序列上的比对结果

$ samtools view -bF 4 abc.bam > abc.F.bam

提取paired reads中两条reads都比对到参考序列上的比对结果,只需要把两个4+8的值12作为过滤参数即可

$ samtools view -bF 12 abc.bam > abc.F12.bam

提取没有比对到参考序列上的比对结果

$ samtools view -bf 4 abc.bam > abc.f.bam

提取bam文件中比对到caffold1上的比对结果,并保存到sam文件格式

$ samtools view abc.bam scaffold1 > scaffold1.sam

提取scaffold1上能比对到30k到100k区域的比对结果

$ samtools view abc.bam scaffold1:30000-100000 $gt; scaffold1_30k-100k.sam

根据fasta文件,将 header 加入到 sam 或 bam 文件中

$ samtools view -T genome.fasta -h scaffold1.sam > scaffold1.h.sam


To get theunmappedreads from a bam file use :

samtools view -f 4 file.bam > unmapped.sam, the output will be in sam

to get the output in bam use : samtools view -b -f 4 file.bam > unmapped.bam

To get only the mapped reads use the parameter 'F', which works like-v of grep and skips the alignments for a specific flag.

samtools view -b -F 4 file.bam > mapped.bam

From themanual; there are different int codes you can use with the parameter 'f', based on what you want :

-f INT    Only output alignments with all bits in INT present in

the FLAG field. INT can be in hex in the format of /^0x[0-9A-F]+/ [0]

Each bit in the FLAG field is defined as:

Flag        Chr    Description

0x0001      p      the read is paired in sequencing

0x0002      P      the read is mapped in a proper pair

0x0004      u      the query sequence itself is unmapped

0x0008      U      the mate is unmapped

0x0010      r      strand of the query (1 for reverse)

0x0020      R      strand of the mate

0x0040      1      the read is the first read in a pair

0x0080      2      the read is the second read in a pair

0x0100      s      the alignment is not primary

0x0200      f      the read fails platform/vendor quality checks

0x0400      d      the read is either a PCR or an optical duplicate

Like for getting theuniquereads (a single read mapping at one best position); I use :

-q INT    Skip alignments with MAPQ smaller than INT [0]

samtools view -bq 1 file.bam > unique.bam

你可能感兴趣的:(HTS Notes)