RNA-seq Workflow: Study Notes from 生信技能树 (Bioinformatics Skill Tree)

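The general next-generation sequencing (NGS) workflow, illustrated with an Illumina sequencer:

(1) Library preparation

DNA is randomly fragmented by nebulization or sonication into pieces of a few hundred bases or shorter. A polymerase and an exonuclease repair the fragment ends into blunt ends, which are then phosphorylated and extended with a single-nucleotide overhang (a sticky end). Illumina sequencing adapters are then ligated to the fragments.

(2) Cluster generation

The template molecules are loaded onto the flow cell, where cluster generation and the sequencing cycles take place. The flow cell is a silica-based slide with 8 lengthwise lanes, and the surface of each lane carries a dense lawn of immobilized single-stranded adapter oligos. The adapter-ligated DNA fragments from the previous step are denatured into single strands and hybridize to the adapter primers in the lane, forming bridge structures that serve as templates for bridge amplification. Repeated amplification cycles yield millions of clusters of double-stranded fragments to be sequenced.

(3) Sequencing

Each cycle has three steps: DNA polymerase incorporates a fluorescently labeled reversible terminator, the clusters are imaged, and the fluorescent label and terminating group are cleaved off before the next cycle begins.

(4) Data analysis

The video 陈巍学基因视频1:Illumina测序化学原理 (Chen Wei on Genes, video 1: Illumina sequencing chemistry) explains these principles very clearly.

Data analysis workflow: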

1. Install Miniconda3

mkdir -p ~/biosoft
cd ~/biosoft
wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

2. Configure conda mirrors

source ~/.bashrc
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda
conda config --set show_channel_urls yes

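3. Install software with conda

Quality control

fastqc, multiqc, trimmomatic, cutadapt, trim-galore

Alignment

star, hisat2, bowtie2, tophat, bwa, subread

Counting

htseq, bedtools, deeptools, salmon

To avoid polluting the Linux working environment, create a separate conda environment for each workflow, for example: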

source ~/miniconda3/bin/activate
conda create -n rna python=2 # create an environment named rna
conda info --envs # list the conda environments
source activate rna # activate the rna environment
### install software
conda install -y aspera connect
conda install -y sra-tools # 38M
conda install -y trimmomatic # 175M
conda install -y cutadapt multiqc # 129.5M
conda install -y trim-galore # 9.6 MB
conda install -y star hisat2 bowtie2 # 16.2MB+5.8MB+13.7MB
conda install -y subread tophat htseq bedtools deeptools
conda install -y salmon
source deactivate # leave the rna environment
conda update -n base -c defaults conda # update conda itself

If you are unfamiliar with a package, check whether it exists at https://bioconda.github.io/recipes.html before installing it, or search for it with conda search packagename.
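After installation it can help to confirm that the commands used later in these notes are actually on PATH. The following is a minimal sketch, not part of the course material: missing_tools is a hypothetical helper name, and the list holds command names (for example trim_galore), which can differ from the conda package names (trim-galore).

```shell
# Print every command from the argument list that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Commands used in the steps of these notes; an empty result means all were found.
missing_tools fastqc multiqc trim_galore cutadapt hisat2 samtools featureCounts
```

Run it inside the activated rna environment; anything it prints still needs to be installed.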

sra-tools

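sra-tools is mainly used to convert raw NGS data from the sra format to the fastq format for downstream analysis. https://www.cnblogs.com/OA-maque/p/4799074.html

trimmomatic

Filtering raw NGS data is essential for downstream analysis; removing uninformative sequence also improves the accuracy and efficiency of later steps. Trimmomatic is a fast, multithreaded filtering tool, mainly used to remove adapters from Illumina FASTQ reads and to trim reads by base quality. It has two modes, for SE (single-end) and PE (paired-end) data, and accepts gzip- and bzip2-compressed files. See NGS 数据过滤之 Trimmomatic 详细说明 (a detailed guide to Trimmomatic) - https://www.jianshu.com/p/a8935adebaae

cutadapt

With paired-end data, adapter removal also discards reads that become too short, which can easily leave the two read files with unequal numbers of reads. cutadapt's PE mode solves this, and it is the adapter-removal tool Jimmy prefers. Thorough adapter removal matters most for de novo genome or transcriptome assembly. See 用cutadapt软件来对双端测序数据去除接头 (removing adapters from paired-end data with cutadapt) https://www.jianshu.com/p/1a3ca70fb326

multiqc

Purpose: merge the QC results of many samples into a single report. It supports output from fastqc, trimmomatic, bowtie, STAR, and many other tools. See the notes by 青山屋主, 批量显示QC结果的利器 (a handy tool for viewing QC results in bulk): http://fbb84b26.wiz03.com/share/s/3XK4IC0cm4CL22pU-r1HPcQQ1iRTvV2GwkwL2AaxYi2fXHP7
Homepage with full documentation: http://multiqc.info

trim-galore

Trim Galore first trims low-quality bases, then calls cutadapt to remove adapters (Illumina adapters by default), and can optionally run fastqc on the result. See Trim_galore使用(2018-05-25) - https://www.jianshu.com/p/1925a3356071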

STAR

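STAR (Spliced Transcripts Alignment to a Reference) is built on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays, followed by seed clustering and stitching.
STAR outperforms other aligners by more than a factor of 50 in mapping speed: on a modest 12-core server it aligns 550 million 2 x 76 bp paired-end reads per hour to the human genome, while also improving alignment sensitivity and precision. Besides unbiased de novo detection of canonical splice junctions, STAR can discover non-canonical splices and chimeric (fusion) transcripts, and can map full-length RNA sequences. https://www.plob.org/article/10220.html

HISAT2

HISAT2 is the successor to HISAT and TopHat2. It uses an improved BWT-based algorithm to run faster with lower resource usage, and its authors recommend that HISAT and TopHat2 users switch to HISAT2.
RNA-Seq基因组比对工具HISAT2 (HISAT2, an RNA-seq genome aligner): https://www.plob.org/article/10380.html

Bowtie2

Bowtie2 is an ultrafast, memory-efficient, flexible, and mature short-read aligner that is well suited to next-generation sequencing data. It indexes the genome with a full-text minute-space index (FM-index) based on the Burrows-Wheeler transform (BWT), which makes alignment very fast and memory-efficient, although this approach is poorly suited to finding longer, gapped alignments.

Bowtie2使用方法与参数详细介绍 (Bowtie2 usage and parameters in detail) https://www.plob.org/article/4540.html
A short introduction to bowtie: www.bio-info-trainee.com/398.html
Chinese Bowtie2 manual (Bowtie2-Manual-CN) https://cncbi.github.io/Bowtie2-Manual-CN/

BWA

BWA is mainly used to map the large numbers of short reads produced by NGS to a reference genome. The reference must be indexed first; like Bowtie2, BWA builds its index on BWT and the FM-index. Depending on the sequencing protocol, it performs single-end (SE) or paired-end (PE) alignment.
Reference: 陈凤珍, 李玲, 操利超, 严志祥. 四种常用的生物序列比对软件比较 (A comparison of four commonly used biological sequence alignment tools), 2016.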

4. Transcriptome workflow

step1: sra to fastq

Download the SRA data

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52778
Create a file named SRR_Acc_List.txt and save the SRR accessions in it, one per line. The file can also be downloaded from Jimmy's GitHub: https://github.com/jmzeng1314/GEO/blob/master/airway_RNAseq/SRR_Acc_List.txt

SRR1039508
SRR1039509
SRR1039510
SRR1039511
SRR1039512
SRR1039513
SRR1039514
SRR1039515
SRR1039516
SRR1039517
SRR1039518
SRR1039519
SRR1039520
SRR1039521
SRR1039522
SRR1039523

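  • Download the data with prefetch
wkd=/home/jmzeng/project/airway/ # working directory
source activate rna # activate the rna conda environment
# use the raw URL; the github.com/.../blob/... address returns an HTML page, not the plain list
wget https://raw.githubusercontent.com/jmzeng1314/GEO/master/airway_RNAseq/SRR_Acc_List.txt

cat SRR_Acc_List.txt | while read id; do (prefetch ${id} &); done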
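Because the loop passes every line of the list straight to prefetch, a stray blank line, a Windows line ending, or an HTML page saved by mistake produces confusing errors. A minimal sketch of a sanity filter follows; valid_accessions is a hypothetical helper name, and the pattern assumes run accessions of the form SRR followed by digits, as in this dataset.

```shell
# Keep only lines that consist of "SRR" followed by digits.
# tr removes carriage returns so files edited on Windows still match.
valid_accessions() {
  tr -d '\r' | grep -E '^SRR[0-9]+$'
}

# Demo on a small inline list that contains one blank and one HTML line.
printf 'SRR1039508\nSRR1039509\n\n<!DOCTYPE html>\n' | valid_accessions
```

In the download loop this would be used as valid_accessions < SRR_Acc_List.txt | while read id; do prefetch ${id}; done.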

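Aspera Connect command-line tool

Reference: RNA-seq(2)-1:原始数据下载的几种方法 (several ways to download raw data) - https://www.jianshu.com/p/8dca09077df3
First, go to the Aspera Connect download page, choose the Linux version, and copy the link address (downloading it requires a proxy).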

wget http://download.asperasoft.com/download/sw/connect/3.7.4/aspera-connect-3.7.4.147727-linux-64.tar.gz  
 
# extract
tar zxvf aspera-connect-3.7.4.147727-linux-64.tar.gz

# install
bash aspera-connect-3.7.4.147727-linux-64.sh

# check the .aspera directory
cd # go to the home directory
ls -a # if .aspera is listed, the installation succeeded

# add environment variable
echo 'export PATH=~/.aspera/connect/bin:$PATH' >> ~/.bashrc
source ~/.bashrc

# back up the key to the home directory (it is needed later; ascp fails without it)
cp ~/.aspera/connect/etc/asperaweb_id_dsa.openssh ~/

# check help file
ascp --help 

Installation is done, so start downloading and enjoy the speed. To save time and server storage, I used only 3 samples for practice: SRR1039510.sra, SRR1039511.sra, and SRR1039512.sra

for ((i=10;i<=12;i++));do ascp -QT -v -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -k 1 -T -l200m [email protected]:/sra/sra-instant/reads/ByRun/sra/SRR/SRR103/SRR10395${i}/SRR10395${i}.sra .;done
# vip05 13:26:38 /teach/project/1.rna/1.sra_data
$ ls -lh
total 5.1G
-rwxr-xr-x 1 qmcui qmcui 1.6G Feb 23 09:07 SRR1039510.sra
-rwxr-xr-x 1 qmcui qmcui 1.5G Feb 23 09:07 SRR1039511.sra
-rwxr-xr-x 1 qmcui qmcui 2.1G Feb 23 09:07 SRR1039512.sra
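The remote path in the ascp loop follows a fixed pattern: the first three characters of the accession, then the first six, then the full accession. A minimal sketch that builds the path for any accession is shown here; sra_path is a hypothetical helper name, and the directory layout is copied from the command above without checking that the server still serves it.

```shell
# Build the remote .sra path used in the ascp command from a run accession.
sra_path() {
  acc=$1
  prefix=$(printf '%s' "$acc" | cut -c1-3)   # e.g. SRR
  group=$(printf '%s' "$acc" | cut -c1-6)    # e.g. SRR103
  echo "/sra/sra-instant/reads/ByRun/sra/${prefix}/${group}/${acc}/${acc}.sra"
}

# The three runs used in these notes.
for i in 10 11 12; do
  sra_path "SRR10395${i}"
done
```

This removes the need to hard-code SRR103 and SRR10395 when the accession list changes.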

Format conversion

for ((i=10;i<=12;i++));do fastq-dump --gzip --split-3 -A SRR10395$i.sra -O .;done
ll -h
total 8.3G
-rw-rw-r-- 1 qmcui qmcui 1.3G Feb 23 09:06 SRR1039510_1.fastq.gz
-rw-rw-r-- 1 qmcui qmcui 1.3G Feb 23 09:06 SRR1039510_2.fastq.gz
-rw-rw-r-- 1 qmcui qmcui 1.3G Feb 23 09:05 SRR1039511_1.fastq.gz
-rw-rw-r-- 1 qmcui qmcui 1.3G Feb 23 09:06 SRR1039511_2.fastq.gz
-rw-rw-r-- 1 qmcui qmcui 1.7G Feb 23 09:06 SRR1039512_1.fastq.gz
-rw-rw-r-- 1 qmcui qmcui 1.7G Feb 23 09:06 SRR1039512_2.fastq.gz

Inspect the FASTQ format: the first 12 lines of SRR1039510_1.fastq.gz

#vip05 21:17:28 /teach/project/1.rna/2.raw_fq
$ zless -S /teach/project/1.rna/2.raw_fq/SRR1039510_1.fastq.gz|head -n 12
@SRR1039510.1 HWI-ST177:290:C0TECACXX:1:1101:1373:2104 length=63
TGGGAGGCTGAGGCAGGAGAATCACTTAAACCTGGGAGGCAGAGGTTACAGTGAGCCGAGATT
+SRR1039510.1 HWI-ST177:290:C0TECACXX:1:1101:1373:2104 length=63
HJJJIJJJJJJJJIJJJGHHIJIIIIIIJJEHGGIJGIJIJJIJHHHGGFFDFFFDEDDDBDC
@SRR1039510.2 HWI-ST177:290:C0TECACXX:1:1101:1340:2124 length=63
AAAGAAGGCGACAGTGAGAAGGAGTCCGAGAAGAGTGATGGAGACCCAATAGTCGATCCTGAG
+SRR1039510.2 HWI-ST177:290:C0TECACXX:1:1101:1340:2124 length=63
HJJJJJJJJJJJIJIIGIJJJJGJHJJJHHDFFFE@CEEEDDDDDDDDDDDDDDDBDDDDDDD
@SRR1039510.3 HWI-ST177:290:C0TECACXX:1:1101:1273:2183 length=63
CTGCTGGGCCCCAAGGTCCTCCTGGTCCCAGTGGTGAAGAAGGAAAGAGAGGCCCTAATGGGG
+SRR1039510.3 HWI-ST177:290:C0TECACXX:1:1101:1273:2183 length=63
HJJJJJJJJJJJJJJJGIIIJJJJJHIJJJJHIJFHGIJJJJJJJHHHHHFFFDDDEDDDDDD
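Each FASTQ record takes four lines: the read name, the bases, a separator line starting with +, and the quality string. A minimal sketch that uses this layout to count reads and compute the mean read length is shown below; fastq_stats is a hypothetical helper name, and it assumes strictly four lines per record (no wrapped sequences).

```shell
# Count reads and compute the mean read length from FASTQ text on stdin.
# Line 2 of every 4-line record holds the bases.
fastq_stats() {
  awk 'NR % 4 == 2 { n++; bases += length($0) }
       END { if (n > 0) printf "reads=%d mean_length=%.1f\n", n, bases / n }'
}

# Demo on the first two records shown above.
fastq_stats <<'EOF'
@SRR1039510.1 HWI-ST177:290:C0TECACXX:1:1101:1373:2104 length=63
TGGGAGGCTGAGGCAGGAGAATCACTTAAACCTGGGAGGCAGAGGTTACAGTGAGCCGAGATT
+SRR1039510.1 HWI-ST177:290:C0TECACXX:1:1101:1373:2104 length=63
HJJJIJJJJJJJJIJJJGHHIJIIIIIIJJEHGGIJGIJIJJIJHHHGGFFDFFFDEDDDBDC
@SRR1039510.2 HWI-ST177:290:C0TECACXX:1:1101:1340:2124 length=63
AAAGAAGGCGACAGTGAGAAGGAGTCCGAGAAGAGTGATGGAGACCCAATAGTCGATCCTGAG
+SRR1039510.2 HWI-ST177:290:C0TECACXX:1:1101:1340:2124 length=63
HJJJJJJJJJJJIJIIGIJJJJGJHJJJHHDFFFE@CEEEDDDDDDDDDDDDDDDBDDDDDDD
EOF
```

On a real file the input would come from zcat SRR1039510_1.fastq.gz | fastq_stats.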

step2: check quality of sequence reads

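Quality control

Goals of quality control:
• Assess data quality and size
• Filter the data
• Remove adapters
• Trim low-quality bases from read ends (-q 25)
• Maximum allowed error rate for adapter matching (default -e 0.1)
• Discard reads shorter than 36 bases (--length 36)
• Trim adapter sequence only when it overlaps the read by at least 3 bases (--stringency 3)
• Discard reads as pairs (--paired)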
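The -q 25 cutoff refers to Phred quality scores. With --phred33 encoding, each character in the quality line stands for the score given by its ASCII code minus 33. The sketch below only illustrates that decoding by counting bases under a cutoff; it is not Trim Galore's trimming algorithm, and low_quality_bases is a hypothetical helper name.

```shell
# For each quality string on stdin, count the bases whose Phred+33 score
# is below the cutoff given as the first argument.
low_quality_bases() {
  awk -v cutoff="$1" '
    # Build a character-to-ASCII-code table for the printable range.
    BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
    {
      low = 0
      for (j = 1; j <= length($0); j++)
        if (ord[substr($0, j, 1)] - 33 < cutoff) low++
      printf "%d of %d bases below Q%d\n", low, length($0), cutoff
    }'
}

# Demo on the quality line of the first read shown earlier in these notes.
printf '%s\n' 'HJJJIJJJJJJJJIJJJGHHIJIIIIIIJJEHGGIJGIJIJJIJHHHGGFFDFFFDEDDDBDC' | low_quality_bases 25
```

For example, the character I has ASCII code 73 and therefore encodes Q40, while # (code 35) encodes Q2.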

软件:

• fastqc
• cutadapt
• Trim Galore

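fastqc generates a QC report for each file, and multiqc merges the per-sample reports into one.

# path: vip05 21:25:33 /teach/project/1.rna/2.raw_fq
source activate rna  # activate the rna environment
fastqc -t 2 -o ~/project/1.rna ./*.fastq.gz
multiqc ./ # merge the results

Each <id>_fastqc.html is the quality report for one file, and the multiqc_report HTML file is the combined report for all samples.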

(rna) vip05 21:51:54 ~/project/1.rna/multiqcreport
$ ll -h
total 6.2M
drwxrwxr-x 2 vip05 vip05 4.0K Feb 27 21:50 ./
drwxrwxr-x 5 vip05 vip05 4.0K Feb 27 21:50 ../
-rw-rw-r-- 1 vip05 vip05 1.2M Feb 27 21:47 multiqc_report_1.html
-rw-rw-r-- 1 vip05 vip05 617K Feb 25 12:01 SRR1039510_1_100000.rawfq_fastqc.html
-rw-rw-r-- 1 vip05 vip05 647K Feb 27 21:40 SRR1039510_1_fastqc.html
-rw-rw-r-- 1 vip05 vip05 619K Feb 25 11:50 SRR1039510_2_100000.rawfq_fastqc.html
-rw-rw-r-- 1 vip05 vip05 651K Feb 27 21:40 SRR1039510_2_fastqc.html
-rw-rw-r-- 1 vip05 vip05 645K Feb 27 21:42 SRR1039511_1_fastqc.html
-rw-rw-r-- 1 vip05 vip05 645K Feb 27 21:42 SRR1039511_2_fastqc.html
-rw-rw-r-- 1 vip05 vip05 648K Feb 27 21:44 SRR1039512_1_fastqc.html
-rw-rw-r-- 1 vip05 vip05 642K Feb 27 21:44 SRR1039512_2_fastqc.html

step3: filter the bad quality reads and remove adaptors.

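The batch script needs a file named config with two columns, one line per sample: the read 1 file and the read 2 file. The notes do not include the code that generates it.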
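A minimal sketch of one way to build such a two-column file, assuming the raw files follow the <sample>_1.fastq.gz / <sample>_2.fastq.gz naming produced by fastq-dump --split-3. make_config is a hypothetical helper name, and this is not necessarily the command used in the course.

```shell
# Print one tab-separated line per sample: read 1 file, read 2 file.
# $1 is the directory that holds the raw FASTQ files.
make_config() {
  for fq1 in "$1"/*_1.fastq.gz; do
    fq2="${fq1%_1.fastq.gz}_2.fastq.gz"      # derive the mate file name
    if [ -f "$fq1" ] && [ -f "$fq2" ]; then
      printf '%s\t%s\n' "$fq1" "$fq2"
    else
      echo "no complete pair for: $fq1" >&2
    fi
  done
}

# Demo with empty placeholder files in a temporary directory.
demo=$(mktemp -d)
touch "$demo/SRR1039510_1.fastq.gz" "$demo/SRR1039510_2.fastq.gz" \
      "$demo/SRR1039511_1.fastq.gz" "$demo/SRR1039511_2.fastq.gz"
make_config "$demo" > "$demo/config"
cat "$demo/config"
rm -rf "$demo"
```

Deriving the mate name from read 1, instead of listing both globs separately, means a sample with a missing mate is reported rather than silently shifting the pairing.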

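The code I used

mkdir -p ~/project/1.rna/clean
# run from the directory that holds the raw *.fastq.gz files; with --paired the files are taken as consecutive pairs
trim_galore --phred33 -q 25 -e 0.1 --length 36 --stringency 3 --paired -o ~/project/1.rna/clean  *.fastq.gz # filter the data
cd ~/project/1.rna/clean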
# vip05 22:59:06 ~/project/1.rna/clean
$ ls -lh
total 7.9G
-rw-rw-r-- 1 vip05 vip05 1.3G Feb 27 22:15 SRR1039510_1_val_1.fq.gz
-rw-rw-r-- 1 vip05 vip05 1.3G Feb 27 22:15 SRR1039510_2_val_2.fq.gz
-rw-rw-r-- 1 vip05 vip05 1.2G Feb 27 22:32 SRR1039511_1_val_1.fq.gz
-rw-rw-r-- 1 vip05 vip05 1.2G Feb 27 22:32 SRR1039511_2_val_2.fq.gz
-rw-rw-r-- 1 vip05 vip05 1.6G Feb 27 22:59 SRR1039512_1_val_1.fq.gz
-rw-rw-r-- 1 vip05 vip05 1.6G Feb 27 22:59 SRR1039512_2_val_2.fq.gz

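Batch-filtering code from the course materials

Save the following as qc.sh and run it with the config file as its first argument (bash qc.sh config):

bin_trim_galore=trim_galore # assumed: trim_galore is on PATH
dir=~/project/1.rna/clean # assumed output directory, the same one used above
cat $1 |while read id
do
arr=(${id}) # split the config line into its two columns
fq1=${arr[0]}
fq2=${arr[1]}
$bin_trim_galore -q 25 --phred33 --length 36 --stringency 3 --paired -o $dir $fq1 $fq2
done

trim_galore removes low-quality bases and adapter sequence. fastqc and cutadapt are installed the same way, with conda. trim_galore first trims low-quality bases, then calls cutadapt to remove adapters (Illumina adapters by default), and can optionally run fastqc to check read quality.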

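step4: alignment with HISAT2

Goals of alignment:
• Map the reads from the fragmented library back to the reference genome
• Sort the alignments with samtools to get a sorted BAM for downstream analysis

Alignment strategy:
• Splice-aware alignment across exon junctions
Software:
• hisat2, samtools

Parameters:
• -p (hisat2) and -@ (samtools) set the number of threads
• -x index prefix
• -1, -2 the read 1 and read 2 fq files of a paired-end sample
• -o output file
• - read the input from standard input (the data coming through the pipe)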
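The -o and - parameters belong to a piped form of the mapping step, in which hisat2 writes SAM to standard output and samtools sort reads it from the pipe, so no intermediate SAM file is written. The sketch below only prints the command for inspection (a dry run); mapping_cmd is a hypothetical helper name, and the thread counts and file naming are taken from the commands in these notes.

```shell
# Print (do not run) a piped hisat2 | samtools sort command.
# $1 = sample id, $2 = HISAT2 index prefix
mapping_cmd() {
  printf 'hisat2 -p 10 -x %s -1 %s_1_val_1.fq.gz -2 %s_2_val_2.fq.gz | samtools sort -@ 5 -o %s.hisat.bam -\n' \
    "$2" "$1" "$1" "$1"
}

# Index prefix as used on the course server in these notes.
mapping_cmd SRR1039510 /teach/database/index/hisat/hg38/genome
```

Piping the printed line into sh would execute it once hisat2, samtools, and the index are available.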

Reference genome:
• hg38.fa

### create directories
mkdir 0.config 4.mapping
### 1. make index
# download a prebuilt hisat2 index (hg19 or hg38) or build one yourself (slow!)
wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/hg19.tar.gz
tar -zxvf hg19.tar.gz
wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/hg38.tar.gz
tar -zxvf hg38.tar.gz
(rna) vip05 09:49:43 /teach/database
$ ls -lh
total 16K
drwxrwxr-x 2 qmcui qmcui 4.0K Feb 23 09:01 GATK
drwxrwxr-x 2 qmcui qmcui 4.0K Feb 23 08:58 genome
drwxrwxr-x 2 qmcui qmcui 4.0K Feb 23 09:01 gtf
drwxrwxr-x 5 qmcui qmcui 4.0K Feb 23 09:00 index

Start with a single sample

###2.single sample mapping(nohup ... &)
cd 4.mapping
id=SRR1039510
hisat2 -p 10 -x /teach/database/index/hisat/hg38/genome -1 ${id}_1_val_1.fq.gz   -2 ${id}_2_val_2.fq.gz  -S ${id}.hisat.sam

22145156 reads; of these:
22145156 (100.00%) were paired; of these:
844619 (3.81%) aligned concordantly 0 times
19941827 (90.05%) aligned concordantly exactly 1 time
1358710 (6.14%) aligned concordantly >1 times
----
844619 pairs aligned concordantly 0 times; of these:
122345 (14.49%) aligned discordantly 1 time
----
722274 pairs aligned 0 times concordantly or discordantly; of these:
1444548 mates make up the pairs; of these:
856438 (59.29%) aligned 0 times
499198 (34.56%) aligned exactly 1 time
88912 (6.16%) aligned >1 times
98.07% overall alignment rate

Batch code. The loop runs four aligners (hisat2, subjunc, bowtie2, bwa); we went with hisat2 because it is fast.

cd $wkd/clean 
ls *gz|cut -d"_" -f 1 |sort -u |while read id;do
ls -lh ${id}_1_val_1.fq.gz   ${id}_2_val_2.fq.gz 
hisat2 -p 10 -x /public/reference/index/hisat/hg38/genome -1 ${id}_1_val_1.fq.gz   -2 ${id}_2_val_2.fq.gz  -S ${id}.hisat.sam
subjunc -T 5  -i /public/reference/index/subread/hg38 -r ${id}_1_val_1.fq.gz -R ${id}_2_val_2.fq.gz -o ${id}.subjunc.sam
bowtie2 -p 10 -x /public/reference/index/bowtie/hg38  -1 ${id}_1_val_1.fq.gz   -2 ${id}_2_val_2.fq.gz  -S ${id}.bowtie.sam
bwa mem -t 5 -M  /public/reference/index/bwa/hg38   ${id}_1_val_1.fq.gz   ${id}_2_val_2.fq.gz > ${id}.bwa.sam
done

-rw-rw-r-- 1 vip05 vip05 1.2G Feb 27 22:32 SRR1039511_1_val_1.fq.gz
-rw-rw-r-- 1 vip05 vip05 1.2G Feb 27 22:32 SRR1039511_2_val_2.fq.gz
21413393 reads; of these:
21413393 (100.00%) were paired; of these:
726325 (3.39%) aligned concordantly 0 times
19489193 (91.01%) aligned concordantly exactly 1 time
1197875 (5.59%) aligned concordantly >1 times
----
726325 pairs aligned concordantly 0 times; of these:
123425 (16.99%) aligned discordantly 1 time
----
602900 pairs aligned 0 times concordantly or discordantly; of these:
1205800 mates make up the pairs; of these:
624452 (51.79%) aligned 0 times
498730 (41.36%) aligned exactly 1 time
82618 (6.85%) aligned >1 times
98.54% overall alignment rate

-rw-rw-r-- 1 vip05 vip05 1.6G Feb 27 22:59 SRR1039512_1_val_1.fq.gz
-rw-rw-r-- 1 vip05 vip05 1.6G Feb 27 22:59 SRR1039512_2_val_2.fq.gz
26947191 reads; of these:
26947191 (100.00%) were paired; of these:
643564 (2.39%) aligned concordantly 0 times
24747121 (91.84%) aligned concordantly exactly 1 time
1556506 (5.78%) aligned concordantly >1 times
----
643564 pairs aligned concordantly 0 times; of these:
128943 (20.04%) aligned discordantly 1 time
----
514621 pairs aligned 0 times concordantly or discordantly; of these:
1029242 mates make up the pairs; of these:
531081 (51.60%) aligned 0 times
425913 (41.38%) aligned exactly 1 time
72248 (7.02%) aligned >1 times
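The overall alignment rate in these paired-end summaries can be recomputed from two numbers: the total number of pairs and the number of mates that aligned 0 times. Every pair contributes two mates, so the rate is 1 - unaligned_mates / (2 * pairs). A minimal sketch follows; hisat2_overall_rate is a hypothetical helper name, and the parsing assumes the paired-end summary layout shown above.

```shell
# Recompute the overall alignment rate (in percent) from a paired-end
# HISAT2 summary on stdin.
hisat2_overall_rate() {
  awk '
    /reads; of these:/     { total = $1 }      # number of read pairs
    /\) aligned 0 times$/  { unaligned = $1 }  # mates that never aligned
    END { if (total > 0) printf "%.2f\n", 100 * (1 - unaligned / (2 * total)) }'
}

# Demo on the summary of the third sample (SRR1039512) from above.
hisat2_overall_rate <<'EOF'
26947191 reads; of these:
26947191 (100.00%) were paired; of these:
643564 (2.39%) aligned concordantly 0 times
24747121 (91.84%) aligned concordantly exactly 1 time
1556506 (5.78%) aligned concordantly >1 times
----
643564 pairs aligned concordantly 0 times; of these:
128943 (20.04%) aligned discordantly 1 time
----
514621 pairs aligned 0 times concordantly or discordantly; of these:
1029242 mates make up the pairs; of these:
531081 (51.60%) aligned 0 times
425913 (41.38%) aligned exactly 1 time
72248 (7.02%) aligned >1 times
EOF
```

Applied to the first two summaries, the same arithmetic reproduces the reported 98.07% and 98.54%.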

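SAM format

① Columns are separated by '\t'
② Lines starting with @ are header lines
③ Each alignment line has 11 mandatory columns plus optional fields
④ The content is identical to that of the BAM file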
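These rules are easy to check on a small piece of SAM text. The sketch below counts header lines and alignment records and flags records with fewer than 11 tab-separated columns; sam_summary is a hypothetical helper name, and the demo record is synthetic, made up only for illustration.

```shell
# Summarize SAM text on stdin: header lines, alignment records,
# and records with fewer than the 11 mandatory columns.
sam_summary() {
  awk -F '\t' '
    /^@/ { header++; next }
    { records++; if (NF < 11) short++ }
    END { printf "header=%d records=%d short=%d\n", header, records, short }'
}

# Synthetic example: two header lines and one record with an optional NH tag.
printf '@HD\tVN:1.0\tSO:unsorted\n@SQ\tSN:chr1\tLN:248956422\nr1\t99\tchr1\t11869\t60\t4M\t=\t12000\t135\tACGT\tIIII\tNH:i:1\n' | sam_summary
```

On real data the input could come from samtools view -h on a BAM file, piped through head to keep it small.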

SAM files from alignment are large, so convert them to BAM!

-rw-rw-r--  1 vip05 vip05 1.3G Feb 27 22:15 SRR1039510_1_val_1.fq.gz
-rw-rw-r--  1 vip05 vip05 1.3G Feb 27 22:15 SRR1039510_2_val_2.fq.gz
-rw-rw-r--  1 vip05 vip05  13G Feb 28 11:16 SRR1039510.hisat.sam
-rw-rw-r--  1 vip05 vip05 1.2G Feb 27 22:32 SRR1039511_1_val_1.fq.gz
-rw-rw-r--  1 vip05 vip05 1.2G Feb 27 22:32 SRR1039511_2_val_2.fq.gz
-rw-rw-r--  1 vip05 vip05  12G Feb 28 11:22 SRR1039511.hisat.sam
-rw-rw-r--  1 vip05 vip05 1.6G Feb 27 22:59 SRR1039512_1_val_1.fq.gz
-rw-rw-r--  1 vip05 vip05 1.6G Feb 27 22:59 SRR1039512_2_val_2.fq.gz
-rw-rw-r--  1 vip05 vip05  15G Feb 28 11:29 SRR1039512.hisat.sam
  • Convert SAM to BAM
ls *.sam|while read id ;do (samtools sort -O bam -@ 5  -o $(basename ${id} ".sam").bam   ${id});done
rm *.sam # delete the SAM files
BAM features

① BAM is the binary version of SAM
② Takes much less disk space
③ Compatible with downstream tools

-rw-rw-r--  1 vip05 vip05 2.1G Feb 28 11:37 SRR1039510.hisat.bam
-rw-rw-r--  1 vip05 vip05  13G Feb 28 11:16 SRR1039510.hisat.sam
-rw-rw-r--  1 vip05 vip05 2.0G Feb 28 11:42 SRR1039511.hisat.bam
-rw-rw-r--  1 vip05 vip05  12G Feb 28 11:22 SRR1039511.hisat.sam
-rw-rw-r--  1 vip05 vip05 2.5G Feb 28 11:47 SRR1039512.hisat.bam
-rw-rw-r--  1 vip05 vip05  15G Feb 28 11:29 SRR1039512.hisat.sam
  • Index the BAM files
ls *.bam |xargs -i samtools index {}
-rw-rw-r--  1 vip05 vip05 2.1G Feb 28 11:37 SRR1039510.hisat.bam
-rw-rw-r--  1 vip05 vip05 2.7M Feb 28 12:00 SRR1039510.hisat.bam.bai
-rw-rw-r--  1 vip05 vip05 2.0G Feb 28 11:42 SRR1039511.hisat.bam
-rw-rw-r--  1 vip05 vip05 2.6M Feb 28 12:01 SRR1039511.hisat.bam.bai
-rw-rw-r--  1 vip05 vip05 2.5G Feb 28 11:47 SRR1039512.hisat.bam
-rw-rw-r--  1 vip05 vip05 2.8M Feb 28 12:02 SRR1039512.hisat.bam.bai
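  • Alignment statistics for the reads
    At first it looked as if nothing had happened: no output and no error. The likely reason is that the loop below redirects each report into a <sample>.flagstat file instead of printing it, so the results are in those files.
ls *.bam |xargs -i samtools flagstat -@ 2  {}  # prints the statistics to the terminal
ls *.bam |while read id ;do ( samtools flagstat -@ 1 $id >  $(basename ${id} ".bam").flagstat  );done
source deactivate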

step5: count reads

zless -S /teach/database/gtf/gencode.v29.annotation.gtf.gz |less -S 
##description: evidence-based annotation of the human genome (GRCh38), version 29 (Ensembl 94)
##provider: GENCODE
##contact: [email protected]
##format: gtf
##date: 2018-08-30
chr1    HAVANA  gene    11869   14409   .       +       .       gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; lev
chr1    HAVANA  transcript      11869   14409   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unproc
chr1    HAVANA  exon    11869   12227   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_ps
chr1    HAVANA  exon    12613   12721   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_ps
chr1    HAVANA  exon    13221   14409   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_ps
chr1    HAVANA  transcript      12010   13670   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unproc
chr1    HAVANA  exon    12010   12057   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_ps
chr1    HAVANA  exon    12179   12227   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_ps
chr1    HAVANA  exon    12613   12697   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_ps
featureCounts -T 5 -p -t exon -g gene_id -a /teach/database/gtf/gencode.v29.annotation.gtf.gz -o ~/all.id.txt *.bam
-rw-rw-r--  1 vip05 vip05   33M Feb 28 13:31 all.id.txt
-rw-rw-r--  1 vip05 vip05   491 Feb 28 13:31 all.id.txt.summary
-rw-------  1 vip05 vip05   26K Feb 28 14:31 .bash_history

Inspect the file

$ head all.id.txt
# Program:featureCounts v1.6.3; Command:"featureCounts" "-T" "5" "-p" "-t" "exon" "-g" "gene_id" "-a" "/teach/database/gtf/gencode.v29.annotation.gtf.gz" "-o" "/trainee1/vip05/all.id.txt" "SRR1039510.hisat.bam" "SRR1039511.hisat.bam" "SRR1039512.hisat.bam" 
Geneid  Chr Start   End Strand  Length  SRR1039510.hisat.bam    SRR1039511.hisat.bam    SRR1039512.hisat.bam
ENSG00000223972.5   chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1    11869;12010;12179;12613;12613;12975;13221;13221;13453   12227;12057;12227;12721;12697;13052;13374;14409;13670   +;+;+;+;+;+;+;+;+   17350
ENSG00000227232.5   chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1  14404;15005;15796;16607;16858;17233;17606;17915;18268;24738;29534   14501;15038;15947;16765;17055;17368;17742;18061;18366;24891;29570   -;-;-;-;-;-;-;-;-;-;-   1351    21  19  24
ENSG00000278267.1   chr1    17369   17436   -   68  0   3   5
ENSG00000243485.5   chr1;chr1;chr1;chr1;chr1    29554;30267;30564;30976;30976   30039;30667;30667;31109;31097   +;+;+;+;+   1021    0   0   0
ENSG00000284332.1   chr1    30366   30503   +   138 0   0   0
ENSG00000237613.2   chr1;chr1;chr1;chr1;chr1    34554;35245;35277;35721;35721   35174;35481;35481;36073;36081   -;-;-;-;-   1219    0   0   0
ENSG00000268020.3   chr1    52473   53312   +   840 0   0   0
ENSG00000240361.2   chr1;chr1;chr1;chr1 57598;58700;62916;62949 57653;58856;64116;63887 +;+;+;+ 14140

Extract the columns you need; this can also be done in R.

grep -v "^#" all.id.txt | cut -f 1,7- > all.id3.txt
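Before moving on to differential expression, summing each count column gives a quick check that every sample has a plausible number of assigned reads. A minimal sketch, assuming a tab-separated table with Geneid in the first column and one count column per sample; library_sizes is a hypothetical helper name, and the demo uses two gene rows copied from the output above.

```shell
# Sum the counts of every sample column in a featureCounts-style table.
# Comment lines starting with # are skipped; the first other line is the header.
library_sizes() {
  awk -F '\t' '
    /^#/ { next }
    !seen_header { for (i = 2; i <= NF; i++) name[i] = $i; n = NF; seen_header = 1; next }
    { for (i = 2; i <= NF; i++) total[i] += $i }
    END { for (i = 2; i <= n; i++) printf "%s\t%d\n", name[i], total[i] }'
}

# Demo with the header and two genes from the table shown earlier.
printf 'Geneid\tSRR1039510.hisat.bam\tSRR1039511.hisat.bam\tSRR1039512.hisat.bam\nENSG00000227232.5\t21\t19\t24\nENSG00000278267.1\t0\t3\t5\n' | library_sizes
```

On the real table the call would be library_sizes < all.id3.txt.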

Finally, check the file sizes and delete whatever is no longer needed so that it does not take up server space.

du -sh *

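The downstream differential expression analysis is done in R; the 生信技能树 tutorial at https://www.jianshu.com/p/a84cd44bac67 covers it.

Note: as a beginner, pay close attention to file paths. The first time I practiced, the paths were what confused me most. This exercise was also carried out on two servers, so the paths differ between steps. Please excuse any mistakes!

Thanks again to the 生信技能树 teachers for their guidance!
Teacher Cui's page, which is worth a visit: https://www.jianshu.com/u/9153eddebf9c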

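References:

  1. 原创10000+生信教程大神给你的RNA实战视频演练 (hands-on RNA-seq video walkthrough)
  2. RNA-seq转录组分析 (RNA-seq transcriptome analysis)
  3. 生信技能树
  4. https://github.com/jmzeng1314/my-R/blob/master/10-RNA-seq-3-groups/hisat2_mm10_htseq.R
  5. 陈巍学基因视频1:Illumina测序化学原理 (Chen Wei on Genes, video 1: Illumina sequencing chemistry)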

生信技能树 free video collection: the suggested learning order is Linux, R, software installation, GEO, tips and tricks, then NGS omics.