Preface
- I was honored to take part in the RNA-seq offline training course at the Chongqing stop, the first stop of the 生信技能树 (Biotrainee) national tour.
- The notes below combine the offline class with the online course; I hope they help you.
Some small practice exercises
R exercises
Beginner level: 10 questions, try to understand and complete them with the reference code: http://www.bio-info-trainee.com/3793.html
Intermediate requirements: http://www.bio-info-trainee.com/3750.html
Advanced requirement is to finish all 20 questions: http://www.bio-info-trainee.com/3415.html
Linux exercises:
The minimum requirement is to finish my 20 Linux questions: http://www.bio-info-trainee.com/2900.html
Next, finish the exercises on bioinformatics data formats (blast/blat/fa-fq/sam-bam/vcf/bed/gtf-gff) and collect the specifications of these formats.
Shell exercises for fasta and fastq files: http://www.bio-info-trainee.com/3575.html
Shell exercises for sam and bam files: http://www.bio-info-trainee.com/3578.html
Shell exercises for VCF files: http://www.bio-info-trainee.com/3577.html
For RNA-seq there is only a small assessment: http://www.bio-info-trainee.com/3920.html
Now let's start the RNA-seq pipeline
1. Software download
1.1 Download Miniconda3
- With the Tsinghua mirror shut down, we have to download from somewhere else:
[Official site]: https://docs.conda.io/en/latest/miniconda.html - Reference tutorial: installing and using conda
# Download Miniconda3
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
## Either Miniconda2 or Miniconda3 works; you can always create an environment with the other Python version later.
# Install Miniconda3
$ bash Miniconda3-latest-Linux-x86_64.sh
$ source ~/.bashrc
(base) vip39 13:30:42 ~ # the (base) prefix shows that Miniconda is active
# Configure the channels; with the Tsinghua mirror down, use these
conda config --add channels bioconda # bioinformatics packages
conda config --add channels conda-forge
# Check the configured channels
$ vim ~/.condarc
channels:
- conda-forge
- bioconda
- defaults
show_channel_urls: true
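## A quick way to confirm the channel order without opening the file:
$ conda config --show channels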
# Create a Python 2 environment named rna
$ conda create -n rna python=2
Collecting package metadata: /
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
# Activate / enter the conda rna environment, so you don't have to add -n rna to every command
$ source activate rna
(rna) vip39 13:47:00 ~
1.2 Install software
The installation list given by Jimmy (the instructor):
- Data resources to download: reference genome and reference transcriptome (gtf, genome, fa)
- Quality control: fastqc and multiqc; trimming: trimmomatic, cutadapt, trim-galore
- Alignment: star, HISAT2, TOPHAT2, bowtie, bwa, subread
- Counting: featureCounts, htseq-counts, bedtools
- Normalization, differential analysis, etc.: DESeq2, edgeR, limma
$ conda install -y fastqc trim-galore star bwa hisat2 bowtie2 subread tophat htseq bedtools deeptools salmon cutadapt multiqc sra-tools
$ conda install -c bioconda trimmomatic
- With the Tsinghua mirror down, lean on Google more, or use https://anaconda.org/bioconda/
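After the install finishes, it is worth a quick sanity check that the key tools are callable inside the rna environment (a minimal sketch; samtools is usually pulled in as a dependency of hisat2, otherwise add it with conda install -y samtools):
$ source activate rna
$ fastqc --version
$ trim_galore --version
$ hisat2 --version
$ samtools --version | head -1
$ featureCounts -v
$ multiqc --version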
2. Download data
- Locate the SRA data [the SRP accession can be swapped for your own]
- Find the SRA dataset and export the Accession List (_List.txt)
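After exporting the Accession List from the SRA Run Selector, SRR_Acc_List.txt is simply one run accession per line; for the four runs used in this walkthrough it looks like this:
$ cat SRR_Acc_List.txt
SRR1039508
SRR1039509
SRR1039510
SRR1039514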
# With the accession list (_List.txt) in hand, download the data in a loop with prefetch
$ cat SRR_Acc_List.txt | while read id; do (prefetch ${id});done
## To put the loop above into the background:
cat SRR_Acc_List.txt | while read id; do (prefetch $id &);done
## Kill the loop above
ps -ef | grep prefetch | awk '{print $2}' | while read id; do kill $id; done
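## Alternatively, pkill can do the same in one step (assuming the procps pkill is available):
$ pkill -u $USER -f prefetch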
# After downloading, here we have 4 SRA files
$ tree
.
├── SRR1039508.sra
├── SRR1039509.sra
├── SRR1039510.sra
└── SRR1039514.sra
# Convert the SRA files to fastq.gz format
## Use a loop
$ mkdir 1.fastq_qc
$ cd 1.fastq_qc
$ ls ~/ncbi/public/sra/*|while read id ;do (nohup fastq-dump --gzip --split-3 -O ./ $id &);done
## You can also convert a single file by hand
fastq-dump --gzip --split-3 -O ~/ ./SRR1039508.sra
3. Quality control
3.1 Check data size and quality
$ ls *gz |xargs fastqc -t 10
## Aggregate the reports with multiqc
$ multiqc ./
...
[INFO ] fastqc : Found 8 reports # 8 reports, matching the 4 paired-end samples above (2 reads each)
[INFO ] multiqc : Report : multiqc_report.html # download this report and take a look
[INFO ] multiqc : MultiQC complete
- View multiqc_report.html
3.2 Filter the data
- remove adapters
- trim low-quality bases from both ends (-q 25)
- maximum allowed error rate (default -e 0.1)
- discard reads shorter than 36 bp (--length 36)
- trim bases whose adapter/index overlap is more than 3 bp (--stringency 3)
- reads are removed as pairs (--paired); these criteria map onto trim_galore flags as shown in the sketch below
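A minimal single-sample sketch using the first pair of files downloaded above (the batch script qc.sh below does the same thing for every pair listed in config):
$ mkdir -p ~/filter-data
$ trim_galore -q 25 --phred33 --length 36 -e 0.1 --stringency 3 --paired \
    -o ~/filter-data ~/1.fastq_qc/SRR1039508_1.fastq.gz ~/1.fastq_qc/SRR1039508_2.fastq.gz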
## Filter the data
### step3: filter the bad-quality reads and remove adaptors
$ cd 1.fastq_qc/
$ ls ~/1.fastq_qc/*_1.fastq.gz
/home/vip51/1.fastq_qc/SRR1039508_1.fastq.gz /home/vip51/1.fastq_qc/SRR1039509_1.fastq.gz /home/vip51/1.fastq_qc/SRR1039510_1.fastq.gz /home/vip51/1.fastq_qc/SRR1039514_1.fastq.gz
$ ls ~/1.fastq_qc/*_1.fastq.gz >1
$ ls ~/1.fastq_qc/*_2.fastq.gz >2
$ paste 1 2
$ paste 1 2 >config
$ cat config
/home/vip51/1.fastq_qc/SRR1039508_1.fastq.gz /home/vip51/1.fastq_qc/SRR1039508_2.fastq.gz
/home/vip51/1.fastq_qc/SRR1039509_1.fastq.gz /home/vip51/1.fastq_qc/SRR1039509_2.fastq.gz
/home/vip51/1.fastq_qc/SRR1039510_1.fastq.gz /home/vip51/1.fastq_qc/SRR1039510_2.fastq.gz
/home/vip51/1.fastq_qc/SRR1039514_1.fastq.gz /home/vip51/1.fastq_qc/SRR1039514_2.fastq.gz
## With config built, we can now run the quality trimming in batch
## Create the batch script
$ cat >qc.sh
source activate rna
bin_trim_galore=trim_galore
dir=/home/vip51/filter-data
mkdir -p $dir
cat config | while read id
do
    arr=($id)
    fq1=${arr[0]}
    fq2=${arr[1]}
    nohup $bin_trim_galore -q 25 --phred33 --length 36 -e 0.1 --stringency 3 --paired -o $dir $fq1 $fq2 &
done
source deactivate
## With config and qc.sh ready
$ bash qc.sh config
4. Alignment - HISAT2 mapping
4.1 Reference genome hg38 and building the index
- [Live] My genome (5): preparing the test data and the reference genome
- RNA-seq (4): downloading the reference genome and gene annotation
# Set up the HISAT2 index files
cd ~/reference
mkdir -p index/hisat && cd index/hisat
nohup wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/hg38.tar.gz &
nohup tar zxvf hg38.tar.gz &
## After extraction
(rna) vip51 00:19:08 ~/reference/index/hisat/hg38
$ ls
genome.1.ht2 genome.2.ht2 genome.3.ht2 genome.4.ht2 genome.5.ht2 genome.6.ht2 genome.7.ht2 genome.8.ht2 make_hg38.sh
# The index is now in place. In short, download the prebuilt index from the official site and extract it.
- I actually spent a lot of time on this index step, because I did not properly understand that
~/reference/index/hisat/hg38/genome
is the common prefix of the genome.*.ht2 files rather than a file itself; only after Googling "index build failed" did I spot the problem.
hisat2 -p 2 -x ~/reference/index/hisat/hg38/genome -1 ~/filter-data/${id}_1_val_1.fq -2 ~/filter-data/${id}_2_val_2.fq -S ${id}.hisat.sam 2>${id}.sam.log
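A quick way to convince yourself that `genome` is a prefix and not a file is to list the index parts and, if your HISAT2 build ships hisat2-inspect, print a short summary of the index:
$ ls ~/reference/index/hisat/hg38/genome.*.ht2
$ hisat2-inspect -s ~/reference/index/hisat/hg38/genome | head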
4.2 Alignment
- How to view a BAM file
samtools view -h sample.bam
samtools view sample.bam
## Optional: build small uncompressed *.fq test files by taking the first 1000 lines of each *.fq.gz (head -1000); this step is not required
$ ls ~/filter-data/*gz|while read id ;do (zcat $id|head -1000> $(basename $id ".gz"));done
(rna) vip51 20:30:12 ~/filter-data
$ ls -lh
total 12G
-rw-rw-r-- 1 vip51 vip51 2.9K Jun 2 15:29 SRR1039508_1.fastq.gz_trimming_report.txt
-rw-rw-r-- 1 vip51 vip51 64K Jun 6 16:53 SRR1039508_1_val_1.fq # only the first 1000 lines, so the file is tiny
-rw-rw-r-- 1 vip51 vip51 1.3G Jun 2 15:44 SRR1039508_1_val_1.fq.gz
## hisat2 alignment (single sample)
id=SRR1039508
hisat2 -p 2 -x ~/reference/index/hisat/hg38/genome -1 ~/filter-data/${id}_1_val_1.fq -2 ~/filter-data/${id}_2_val_2.fq -S ${id}.hisat.sam 2>${id}.sam.log
## Batch alignment with hisat2
$ ls *gz|cut -d "_" -f 1|sort -u
SRR1039508
SRR1039509
SRR1039510
SRR1039514
## Create hisat.sh
$ cat > hisat.sh
cd ~/filter-data/
ls *gz|cut -d "_" -f 1|sort -u |while read id ;do
cd ~/4.mapping/
ls -lh ~/filter-data/${id}_1_val_1.fq.gz ~/filter-data/${id}_2_val_2.fq.gz
hisat2 -p 10 -x ~/reference/index/hisat/hg38/genome -1 ~/filter-data/${id}_1_val_1.fq.gz -2 ~/filter-data/${id}_2_val_2.fq.gz -S ${id}.hisat.sam
done
### Run the batch alignment
$ source hisat.sh
(rna) vip51 23:15:04 ~/4.mapping
$ ls -lh
total 56G
-rw-rw-r-- 1 vip51 vip51 13G Jun 7 22:44 SRR1039508.hisat.sam
-rw-rw-r-- 1 vip51 vip51 12G Jun 7 22:51 SRR1039509.hisat.sam
-rw-rw-r-- 1 vip51 vip51 13G Jun 7 22:58 SRR1039510.hisat.sam
-rw-rw-r-- 1 vip51 vip51 21G Jun 7 23:08 SRR1039514.hisat.sam
## As you can see, the .sam files are still quite large
## If instead you run this command (on the uncompressed test .fq files):
# hisat2 -p 10 -x ~/reference/index/hisat/hg38/genome -1 ~/filter-data/${id}_1_val_1.fq -2 ~/filter-data/${id}_2_val_2.fq -S ${id}.hisat.sam
## the output will be tiny, because those .fq files only contain the first 1000 lines (head -1000)
### Convert .sam to sorted .bam
$ ls *.sam |while read id;do (samtools sort -O bam -@ 5 -o $(basename ${id} ".sam").bam ${id});done
(rna) vip51 23:34:31 ~/4.mapping
$ ls -lh
total 66G
-rw-rw-r-- 1 vip51 vip51 2.1G Jun 7 23:21 SRR1039508.hisat.bam
-rw-rw-r-- 1 vip51 vip51 13G Jun 7 22:44 SRR1039508.hisat.sam
-rw-rw-r-- 1 vip51 vip51 1.9G Jun 7 23:24 SRR1039509.hisat.bam
-rw-rw-r-- 1 vip51 vip51 12G Jun 7 22:51 SRR1039509.hisat.sam
-rw-rw-r-- 1 vip51 vip51 2.1G Jun 7 23:28 SRR1039510.hisat.bam
-rw-rw-r-- 1 vip51 vip51 13G Jun 7 22:58 SRR1039510.hisat.sam
-rw-rw-r-- 1 vip51 vip51 3.7G Jun 7 23:34 SRR1039514.hisat.bam
-rw-rw-r-- 1 vip51 vip51 21G Jun 7 23:08 SRR1039514.hisat.sam
# The .bam files are much smaller
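## Once the sorted BAMs pass an integrity check, the bulky SAM files can be removed to free disk space (optional; keep them if unsure)
$ samtools quickcheck -v *.hisat.bam && rm -i *.hisat.sam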
4.3 Summarize the alignment rate with samtools flagstat
ls *.bam |while read id ;do ( samtools flagstat -@ 1 $id > $(basename ${id} ".bam").flagstat );done
(rna) vip51 23:42:28 ~/4.mapping
$ ls -lh
total 66G
-rw-rw-r-- 1 vip51 vip51 2.1G Jun 7 23:21 SRR1039508.hisat.bam
-rw-rw-r-- 1 vip51 vip51 448 Jun 7 23:38 SRR1039508.hisat.flagstat
-rw-rw-r-- 1 vip51 vip51 13G Jun 7 22:44 SRR1039508.hisat.sam
-rw-rw-r-- 1 vip51 vip51 1.9G Jun 7 23:24 SRR1039509.hisat.bam
-rw-rw-r-- 1 vip51 vip51 448 Jun 7 23:38 SRR1039509.hisat.flagstat
-rw-rw-r-- 1 vip51 vip51 12G Jun 7 22:51 SRR1039509.hisat.sam
-rw-rw-r-- 1 vip51 vip51 2.1G Jun 7 23:28 SRR1039510.hisat.bam
-rw-rw-r-- 1 vip51 vip51 449 Jun 7 23:39 SRR1039510.hisat.flagstat
-rw-rw-r-- 1 vip51 vip51 13G Jun 7 22:58 SRR1039510.hisat.sam
-rw-rw-r-- 1 vip51 vip51 3.7G Jun 7 23:34 SRR1039514.hisat.bam
-rw-rw-r-- 1 vip51 vip51 449 Jun 7 23:39 SRR1039514.hisat.flagstat
-rw-rw-r-- 1 vip51 vip51 21G Jun 7 23:08 SRR1039514.hisat.sam
$ ls *.bam|xargs -i samtools index {} # index the BAMs so they can be loaded into IGV
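## With the .bai indexes in place, you can also slice out a small region for a quick IGV check (a sketch; assumes UCSC-style chromosome names such as chr1 in this index)
$ samtools view -b SRR1039508.hisat.bam chr1:1-2000000 > SRR1039508.chr1_2Mb.bam
$ samtools index SRR1039508.chr1_2Mb.bam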
###
(rna) vip51 23:43:56 ~/4.mapping
$ mkdir stat
$ mv *flagstat stat/
$ cd stat/
$ ls
SRR1039508.hisat.flagstat SRR1039509.hisat.flagstat SRR1039510.hisat.flagstat SRR1039514.hisat.flagstat
(rna) vip51 00:24:55 ~/4.mapping/stat
$ wc -l *
13 SRR1039508.hisat.flagstat
13 SRR1039509.hisat.flagstat
13 SRR1039510.hisat.flagstat
13 SRR1039514.hisat.flagstat
52 total
(rna) vip51 00:25:41 ~/4.mapping/stat
$ cat *|awk '{print $1}' |paste - - - - - - - - - - - - -
50094593 5350127 0 0 49304507 44744466 22372233 22372233 43122616 43495718 458662 121216 89837
45068638 4229594 0 0 44198454 40839044 20419522 20419522 39169816 39531396 437464 121706 92949
49310701 5020389 0 0 48454263 44290312 22145156 22145156 42601074 42978162 455712 143714 111364
82290684 6889352 0 0 81370594 75401332 37700666 37700666 73172032 73858618 622624 395008 343293
$ cut -d"+" -f 2 SRR1039508.hisat.flagstat |cut -d" " -f 3-90
$ cut -d"+" -f 2 SRR1039508.hisat.flagstat |cut -d" " -f 3-90
in total (QC-passed reads
secondary
supplementary
duplicates
mapped (98.42% : N/A)
paired in sequencing
read1
read2
properly paired (96.38% : N/A)
with itself and mate mapped
singletons (1.03% : N/A)
with mate mapped to a different chr
with mate mapped to a different chr (mapQ>=5)
# Matching the field labels above to the four rows of numbers gives the flagstat summary table for each sample
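## To pull just the overall mapping rate out of every sample, grep the 'mapped' line:
$ grep "mapped (" *.flagstat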
## The four flagstat files above (SRR1039508/09/10/14 .hisat.flagstat) can also be turned into an HTML report with multiqc:
$ multiqc ./
(rna) vip51 00:43:27 ~/4.mapping/stat
$ ls
multiqc_data multiqc_report.html
- View multiqc_report.html
5. Counting for differential analysis - featureCounts
5.1 featureCounts needs two input files:
- the BAM/SAM files produced by the alignment (which we already have)
$ ls *bam
SRR1039508.hisat.bam SRR1039509.hisat.bam SRR1039510.hisat.bam SRR1039514.hisat.bam
- an interval annotation file (GTF), which you need to prepare yourself
$ cd ~/reference/gtf/
## Download the latest release
$ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_30/gencode.v30.annotation.gtf.gz
- Download the GTF file from the GENCODE website; here we take the latest release.
- Decompress it to get gencode.v30.annotation.gtf
$ gunzip gencode.v30.annotation.gtf.gz
## Which gives
$ ls
gencode.v30.annotation.gtf human.gene.positions nohup.out
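## A quick sanity check on the annotation (a sketch); the exon and gene counts should be close to the Features / Meta-features numbers featureCounts reports below
$ grep -v '^#' gencode.v30.annotation.gtf | head -3
$ awk -F '\t' '$3 == "exon"' gencode.v30.annotation.gtf | wc -l
$ awk -F '\t' '$3 == "gene"' gencode.v30.annotation.gtf | wc -l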
5.2 featureCounts
$ cd ~/6.featureCounts/
$ featureCounts -T 5 -p -t exon -g gene_id -a ~/reference/gtf/gencode.v30.annotation.gtf -o ~/all.id.txt ~/4.mapping/*.bam
## The run produces the following output:
========== _____ _ _ ____ _____ ______ _____
===== / ____| | | | _ \| __ \| ____| /\ | __ \
===== | (___ | | | | |_) | |__) | |__ / \ | | | |
==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
v1.6.4
//========================== featureCounts setting ===========================\\
|| ||
|| Input files : 4 BAM files ||
|| P SRR1039508.hisat.bam ||
|| P SRR1039509.hisat.bam ||
|| P SRR1039510.hisat.bam ||
|| P SRR1039514.hisat.bam ||
|| ||
|| Output file : all.id.txt ||
|| Summary : all.id.txt.summary ||
|| Annotation : gencode.v30.annotation.gtf (GTF) ||
|| Dir for temp files : /home/vip51 ||
|| ||
|| Threads : 5 ||
|| Level : meta-feature level ||
|| Paired-end : yes ||
|| Multimapping reads : not counted ||
|| Multi-overlapping reads : not counted ||
|| Min overlapping bases : 1 ||
|| ||
|| Chimeric reads : counted ||
|| Both ends mapped : not required ||
|| ||
\\============================================================================//
//================================= Running ==================================\\
|| ||
|| Load annotation file gencode.v30.annotation.gtf ... ||
|| Features : 1279686 ||
|| Meta-features : 58870 ||
|| Chromosomes/contigs : 25 ||
|| ||
|| Process BAM file SRR1039508.hisat.bam... ||
|| Paired-end reads are included. ||
|| Assign alignments (paired-end) to features... ||
|| ||
|| WARNING: reads from the same pair were found not adjacent to each ||
|| other in the input (due to read sorting by location or ||
|| reporting of multi-mapping read pairs). ||
|| ||
|| Pairing up the read pairs. ||
|| ||
|| Total alignments : 25134856 ||
|| Successfully assigned alignments : 18588891 (74.0%) ||
|| Running time : 0.41 minutes ||
|| ||
|| Process BAM file SRR1039509.hisat.bam... ||
|| Paired-end reads are included. ||
|| Assign alignments (paired-end) to features... ||
|| ||
|| WARNING: reads from the same pair were found not adjacent to each ||
|| other in the input (due to read sorting by location or ||
|| reporting of multi-mapping read pairs). ||
|| ||
|| Pairing up the read pairs. ||
|| ||
|| Total alignments : 22612196 ||
|| Successfully assigned alignments : 17040035 (75.4%) ||
|| Running time : 0.36 minutes ||
|| ||
|| Process BAM file SRR1039510.hisat.bam... ||
|| Paired-end reads are included. ||
|| Assign alignments (paired-end) to features... ||
|| ||
|| WARNING: reads from the same pair were found not adjacent to each ||
|| other in the input (due to read sorting by location or ||
|| reporting of multi-mapping read pairs). ||
|| ||
|| Pairing up the read pairs. ||
|| ||
|| Total alignments : 24741721 ||
|| Successfully assigned alignments : 18349007 (74.2%) ||
|| Running time : 0.32 minutes ||
|| ||
|| Process BAM file SRR1039514.hisat.bam... ||
|| Paired-end reads are included. ||
|| Assign alignments (paired-end) to features... ||
|| ||
|| WARNING: reads from the same pair were found not adjacent to each ||
|| other in the input (due to read sorting by location or ||
|| reporting of multi-mapping read pairs). ||
|| ||
|| Pairing up the read pairs. ||
|| ||
|| Total alignments : 41252925 ||
|| Successfully assigned alignments : 31979144 (77.5%) ||
|| Running time : 0.54 minutes ||
|| ||
|| ||
|| Summary of counting results can be found in file "/home/vip51/all.id.txt. ||
|| summary" ||
|| ||
\\============================================================================//
- We end up with the following two result files
# all.id.txt all.id.txt.summary
## Run multiqc on all.id.txt.summary to check how the counts were assigned
$ multiqc all.id.txt.summary
...
[INFO ] multiqc : Report : multiqc_report_1.html
[INFO ] multiqc : Data : multiqc_data_1
[INFO ] multiqc : MultiQC complete
- View the multiqc_report_1.html generated above
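For the downstream differential analysis with DESeq2 / edgeR / limma (the tools listed in section 1.2), all.id.txt can be slimmed down to a plain gene-by-sample count matrix; a minimal sketch, assuming the default featureCounts column layout (Geneid, Chr, Start, End, Strand, Length, then one count column per BAM):
$ grep -v '^#' ~/all.id.txt | cut -f 1,7- > ~/all.counts.txt
$ head -3 ~/all.counts.txt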