RNA-seq analysis (FastQC + Trimmomatic + STAR + htseq-count + DESeq2)

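I have been doing RNA-seq recently, so this is a good time to write the workflow down, share it, and learn from each other.
The pipeline used here is FastQC + Trimmomatic + STAR + htseq-count + DESeq2.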

Checking data integrity

# Loop over subdirectories only; the subshell keeps the working directory unchanged if cd fails
for dir in */; do (cd "$dir" && md5sum -c MD5_*txt); done
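The same check can be wrapped in a small function that keeps going after a failure and reports which directories did not verify. This is only a sketch: the name `check_md5_dirs` is hypothetical, and it assumes, as in the loop above, that each sample directory holds its own `MD5_*txt` manifest in the format `md5sum -c` understands.

```shell
#!/usr/bin/env bash
# Sketch: verify the MD5 manifests in every subdirectory of a root directory.
# check_md5_dirs is a hypothetical helper name, not part of any tool.
check_md5_dirs() {
    local root="${1:-.}" failed=0 dir manifest
    for dir in "$root"/*/; do
        [ -d "$dir" ] || continue              # glob matched nothing
        for manifest in "$dir"MD5_*txt; do
            if [ ! -e "$manifest" ]; then
                echo "NO_MANIFEST $dir"
                failed=1
                continue
            fi
            # Subshell: the caller's working directory is never changed.
            if (cd "$dir" && md5sum -c --quiet "$(basename "$manifest")" >/dev/null 2>&1); then
                echo "OK $dir"
            else
                echo "FAILED $dir"
                failed=1
            fi
        done
    done
    return "$failed"                            # non-zero if anything failed
}
```

Calling `check_md5_dirs .` from the download directory prints one status line per manifest and returns a non-zero exit code if any file is missing or corrupted.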

Preprocessing

FastQC + Trimmomatic

fastqc -t 5 sample_R1.fq.gz
fastqc -t 5 sample_R2.fq.gz
# Trimmomatic applies steps in the order given, so MINLEN goes last; $HOME is used because ~ is not expanded after "ILLUMINACLIP:"
java -jar ~/tools/Trimmomatic/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 20 sample_R1.fq.gz sample_R2.fq.gz -baseout sample_filtered.fq.gz ILLUMINACLIP:"$HOME"/tools/Trimmomatic/Trimmomatic-0.36/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 HEADCROP:15 MINLEN:36

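FastQC showed that some samples failed Per tile sequence quality [1], Per base sequence content, Adapter Content, and Kmer Content. The main fixes were to remove low-quality reads and, because base composition was uneven over roughly the first 15 bases, to crop those bases with HEADCROP. The libraries used TruSeq adapters, so adapter clipping with the TruSeq adapter file was added to the Trimmomatic run.
For more on Trimmomatic, see [2], [3], [4].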
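To see which FastQC modules failed without opening every HTML report, the `summary.txt` file that FastQC writes can be scanned. It is tab-separated: status (PASS, WARN or FAIL), module name, file name. The helper name `failed_modules` is hypothetical, and the zip layout in the usage comment is an assumption about your FastQC output.

```shell
# Sketch: print the FastQC modules marked FAIL.
# Input: the content of a FastQC summary.txt on stdin
# (tab-separated columns: status, module name, file name).
failed_modules() {
    awk -F'\t' '$1 == "FAIL" { print $3 "\t" $2 }'
}

# Example usage (assumes the usual <name>_fastqc.zip layout):
#   unzip -p sample_R1_fastqc.zip '*/summary.txt' | failed_modules
```

Running this before and after trimming gives a quick view of whether the Trimmomatic settings fixed the modules they were meant to fix.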

STAR

Build the index
The human and mouse genomes and reference annotations come from the iGenomes downloads listed on the TopHat site:

STAR --runThreadN 30 --runMode genomeGenerate --genomeDir STARINDEX_20180118/ --genomeFastaFiles WholeGenomeFasta/genome.fa --sjdbGTFfile ../Annotation/Genes/genes.gtf --sjdbOverhang 134
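The STAR manual suggests setting `--sjdbOverhang` to the read length minus 1. Because trimming changes read lengths, one way to choose the value is to measure the longest read in the trimmed FASTQ. A minimal sketch, assuming standard 4-line FASTQ records; `max_read_len` and `suggest_overhang` are hypothetical helper names.

```shell
# Sketch: longest read length in a FASTQ file (plain or gzipped).
max_read_len() {
    # gzip -cdf decompresses gzipped input and passes plain text through unchanged
    gzip -cdf "$1" |
        awk 'NR % 4 == 2 { if (length($0) > m) m = length($0) } END { print m + 0 }'
}

# Suggested --sjdbOverhang value: longest read length minus 1.
suggest_overhang() {
    echo $(( $(max_read_len "$1") - 1 ))
}

# Example usage:
#   suggest_overhang sample_filtered_1P.fq.gz
```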

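Run the alignment
Optionally, the SJ.out.tab file from a first alignment can be used to rebuild the genome index before aligning a second time (especially recommended when calling SNPs and indels). [7]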

STAR --runThreadN 30 --genomeDir ~/Ref/UCSC_hg19/Homo_sapiens/UCSC/hg19/Sequence/STARIndex_20180118 --readFilesIn sample_filtered_1P.fq.gz sample_filtered_2P.fq.gz --outFileNamePrefix ./Hs_treat3/Hs_treat3 --readFilesCommand zcat

References: [5], [6].
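For the optional second pass mentioned above, the first-pass junctions are sometimes filtered before being handed back to STAR through `--sjdbFileChrStartEnd`. The sketch below keeps unannotated junctions supported by a minimum number of uniquely mapped reads. The threshold and the name `filter_junctions` are illustrative assumptions, not part of the original workflow.

```shell
# Sketch: keep novel junctions with at least MIN uniquely mapped reads.
# SJ.out.tab columns used here: 6 = annotated (0 = no, 1 = yes),
# 7 = number of uniquely mapping reads crossing the junction.
filter_junctions() {
    local min_unique="${2:-3}"     # default threshold is an arbitrary example
    awk -F'\t' -v min="$min_unique" '$6 == 0 && $7 >= min' "$1"
}

# Example usage:
#   filter_junctions Hs_treat3SJ.out.tab 3 > Hs_treat3SJ.filtered.tab
```

Recent STAR releases also provide `--twopassMode Basic`, which runs both passes in a single command and may be simpler if no custom filtering is needed.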
Remove unmapped reads and sort by read name (note that -F 4 only drops unmapped reads; multi-mapped reads are still kept).

samtools view -bS -F 4 Hs_treat3Aligned.out.sam > Hs_treat3_mapped.bam
# samtools >= 1.3 syntax; old 0.1.x releases used: samtools sort -n in.bam out_prefix
samtools sort -n -o Hs_treat3_sort.bam Hs_treat3_mapped.bam
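The `-F 4` option tells samtools to skip every record whose FLAG field has bit 0x4 (read unmapped) set. The logic can be illustrated directly on SAM text. `drop_unmapped` is a hypothetical stand-in for illustration only, not a replacement for samtools.

```shell
# Sketch: emulate the effect of `samtools view -F 4` on SAM text from stdin.
drop_unmapped() {
    local line flag
    while IFS= read -r line; do
        case "$line" in
            @*) printf '%s\n' "$line"; continue ;;   # header lines pass through
        esac
        flag=$(printf '%s\n' "$line" | cut -f2)      # FLAG is column 2
        # Keep the record only if bit 0x4 (read unmapped) is NOT set.
        if [ $(( flag & 4 )) -eq 0 ]; then
            printf '%s\n' "$line"
        fi
    done
}
```

For example, a record with FLAG 77 (64 + 8 + 4 + 1) is dropped because bit 4 is set, while FLAG 99 (64 + 32 + 2 + 1) is kept.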

HTSeq

Use htseq-count to compute the read counts. [8], [9]

htseq-count -f bam -s no Hs_treat3_sort.bam ~/Ref/UCSC_hg19/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf > sample.count
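htseq-count writes one two-column file per sample (feature ID, count) and appends summary rows whose names start with `__` (for example `__no_feature`), whereas DESeq2 takes a single count matrix. One possible way to combine the per-sample files is sketched below; `merge_counts` is a hypothetical helper, and it assumes every file was produced with the same GTF.

```shell
# Sketch: merge several htseq-count outputs into one tab-separated matrix.
# Rows = genes, columns = input files; the "__" summary rows are dropped.
merge_counts() {
    awk -F'\t' '
        FNR == 1 { nfile++; names[nfile] = FILENAME }   # a new input file starts
        /^__/    { next }                               # skip htseq-count summary rows
        {
            if (!($1 in seen)) { seen[$1] = 1; order[++n] = $1 }
            count[$1, nfile] = $2
        }
        END {
            printf "gene"
            for (f = 1; f <= nfile; f++) printf "\t%s", names[f]
            printf "\n"
            for (i = 1; i <= n; i++) {
                g = order[i]
                printf "%s", g
                for (f = 1; f <= nfile; f++) printf "\t%d", count[g, f] + 0
                printf "\n"
            }
        }' "$@"
}

# Example usage (file names are placeholders):
#   merge_counts A1.count A2.count B1.count B2.count > counts.tsv
```

The resulting table could then be loaded in R with something like `read.delim("counts.tsv", row.names = 1)`, with the column order matching the sample conditions.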

Differential expression analysis with DESeq2

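library(DESeq2)
# hs: integer count matrix, genes in rows, one column per sample (4 samples here)
condition <- factor(c("A","A","B","B"))
dds <- DESeqDataSetFromMatrix(countData = hs, colData = DataFrame(condition), design = ~ condition)
dds <- dds[ rowSums(counts(dds)) > 1, ]   # drop genes with almost no counts
nrow(dds)
dds <- DESeq(dds)     # run the differential expression analysis
res <- results(dds)   # extract the results table with results()
summary(res)          # summarize the results
count_r <- counts(dds, normalized=TRUE)  # extract the normalized count matrix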

Reference: [10]
