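The general workflow of next-generation sequencing (NGS), illustrated with an Illumina sequencer:
(1) Library preparation
The DNA is randomly fragmented by nebulization or sonication into pieces of a few hundred bases or shorter. A polymerase and an exonuclease blunt the fragment ends, which are then phosphorylated and given a single-nucleotide (A) overhang. Illumina sequencing adapters are then ligated to the fragments.
(2) Cluster generation
The template molecules are loaded onto the flow cell, where clonal clusters are generated and the sequencing cycles are run. The flow cell is a silica (glass) slide with eight lengthwise lanes, and the surface of each lane is densely coated with immobilized single-stranded adapter oligos. The adapter-ligated DNA fragments from the previous step are denatured into single strands and hybridize to the adapter primers on the lane surface, forming bridge structures that are used in the subsequent amplification. Repeated cycles of this bridge amplification yield millions of clusters of double-stranded fragments ready for sequencing.
(3) Sequencing
Each cycle has three steps: DNA polymerase incorporates a fluorescently labeled reversible terminator, the labeled clusters are imaged, and the fluorophore and terminating group are cleaved from the incorporated nucleotide before the next cycle begins.
(4) Data analysis
The chemistry is explained very clearly in 陈巍学基因 video 1: Illumina sequencing chemistry, which is a good way to learn it.
Data analysis workflow:
1. Install Miniconda3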
mkdir -p ~/biosoft
cd ~/biosoft
wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
2. Set up the conda mirrors
source ~/.bashrc
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda
conda config --set show_channel_urls yes
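3. Install software with conda
Quality control
fastqc , multiqc, trimmomatic, cutadapt ,trim-galore
Alignment
star, hisat2, bowtie2, tophat, bwa, subread
Counting
htseq, bedtools, deeptools, salmon
To avoid polluting the Linux working environment, it is best to create a separate conda environment for each workflow, for example:
source ~/miniconda3/bin/activate
conda create -n rna python=2 # create a software environment named rna
conda info --envs # list the conda environments
source activate rna # activate the rna environment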
### Software installation
conda install -y aspera connect
conda install -y sra-tools # 38M
conda install -y trimmomatic # 175M
conda install -y cutadapt multiqc # 129.5M
conda install -y trim-galore # 9.6 MB
conda install -y star hisat2 bowtie2 # 16.2MB+5.8MB+13.7MB
conda install -y subread tophat htseq bedtools deeptools
conda install -y salmon
source deactivate # leave the current rna environment
conda update -n base -c defaults conda # update conda itself
If you are not familiar with a package, check whether it exists at https://bioconda.github.io/recipes.html before installing, or search for it with conda search packagename.
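After installing, it can help to confirm that the tools are actually visible on PATH inside the activated environment. The sketch below is only an illustration: the function name check_tools is made up here, and the tool list is an assumption based on the programs used later in this post.

```shell
# Report which of the listed command-line tools can be found on PATH.
# check_tools is a hypothetical helper, not part of conda or any tool above.
check_tools() {
    local tool missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    # The return status is the number of missing tools (0 means all present).
    return "$missing"
}

# Tool names assumed from the steps in this post; adjust to your own workflow.
check_tools fastqc multiqc trim_galore cutadapt hisat2 samtools featureCounts \
    || echo "some tools are missing; activate the rna environment or install them"
```

Running it before and after source activate rna shows quickly whether the environment is the one that holds the software.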
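sra-tools
sra-tools is mainly used to convert raw NGS data from the sra format to fastq so that it can be analyzed downstream. https://www.cnblogs.com/OA-maque/p/4799074.html
trimmomatic
Filtering raw NGS data is essential for downstream analysis, and removing uninformative sequence also improves the accuracy and efficiency of later steps. Trimmomatic is a powerful, multithreaded and fast filtering tool, used mainly to remove adapters from Illumina FASTQ reads and to trim reads by base quality. It has two filtering modes, for SE and PE data, and supports gzip- and bzip2-compressed files. See "Trimmomatic for NGS data filtering, explained in detail" - https://www.jianshu.com/p/a8935adebaae
cutadapt
With paired-end data, adapter removal also discards reads that have become too short, which easily leaves the two mate files with different numbers of reads. The PE mode of cutadapt handles this problem well, and it is the tool Jimmy prefers for adapter removal. Removing adapters thoroughly matters most for de novo genome or transcriptome assembly. See "Removing adapters from paired-end data with cutadapt" https://www.jianshu.com/p/1a3ca70fb326
multiqc
Purpose: merge the QC results of many sequencing samples into a single report. It can aggregate output from fastqc, trimmomatic, bowtie, STAR and many other tools. See the notes by 青山屋主, "A handy tool for displaying QC results in batch" http://fbb84b26.wiz03.com/share/s/3XK4IC0cm4CL22pU-r1HPcQQ1iRTvV2GwkwL2AaxYi2fXHP7
Homepage with full documentation: http://multiqc.info
trim-galore
The tool first trims low-quality sequence, then calls cutadapt to remove adapters (Illumina adapters by default), and can optionally call fastqc at the end to check read quality. See "Using Trim_galore" (2018-05-25) - https://www.jianshu.com/p/1925a3356071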
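STAR
STAR (Spliced Transcripts Alignment to a Reference) was developed around a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays, followed by seed clustering and stitching.
STAR outperforms other aligners by more than 50-fold in mapping speed, aligning 550 million 2 x 76 bp paired-end reads per hour to the human genome on a modest 12-core server, while also improving alignment sensitivity and precision. Besides unbiased de novo detection of canonical splice junctions, STAR can discover non-canonical splices and chimeric (fusion) transcripts, and it can map full-length RNA sequences. https://www.plob.org/article/10220.html
HISAT2
HISAT2 is the successor to TopHat2/Bowtie2. It uses an improved BWT-based algorithm to achieve higher speed with lower resource usage, and the authors recommend that TopHat2/Bowtie2 and HISAT users switch to HISAT2.
HISAT2, a genome alignment tool for RNA-Seq: https://www.plob.org/article/10380.html
Bowtie2
Bowtie2 is an ultrafast, memory-efficient, flexible and mature short-read aligner that is well suited to next-generation sequencing data. Indexing the genome with a full-text minute-space index (FM-index) built on the Burrows-Wheeler transform (BWT) makes alignment very fast and memory-efficient, but an index-only approach is not well suited to finding longer, gapped alignments.
Bowtie2 usage and parameters in detail https://www.plob.org/article/4540.html
A simple bowtie example: www.bio-info-trainee.com/398.html
Bowtie2 manual in Chinese (Bowtie2-Manual) https://cncbi.github.io/Bowtie2-Manual-CN/
BWA
BWA is mainly used to map the large numbers of short reads produced by NGS to a reference genome. The reference sequence must be indexed first; BWA also builds its index on BWT and FM-index theory. Depending on the sequencing method, alignment is either single-end (SE) or paired-end (PE).
Reference: 陈凤珍, 李玲, 操利超, 严志祥. Comparison of four commonly used biological sequence alignment tools. 2016.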
4. Transcriptome workflow
step1: sra to fastq
Download the SRA data
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52778
Create a file named SRR_Acc_List.txt and save the SRR accessions in it, one accession per line. The file can also be downloaded from Jimmy's GitHub: https://github.com/jmzeng1314/GEO/blob/master/airway_RNAseq/SRR_Acc_List.txt
SRR1039508
SRR1039509
SRR1039510
SRR1039511
SRR1039512
SRR1039513
SRR1039514
SRR1039515
SRR1039516
SRR1039517
SRR1039518
SRR1039519
SRR1039520
SRR1039521
SRR1039522
SRR1039523
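Because the accessions of this dataset are consecutive, the list can also be generated instead of typed. This is just a convenience sketch that relies on the runs being numbered SRR1039508 through SRR1039523 as listed above; for other datasets the accessions are usually not contiguous.

```shell
# Write SRR1039508 ... SRR1039523 to SRR_Acc_List.txt, one accession per line.
seq 1039508 1039523 | sed 's/^/SRR/' > SRR_Acc_List.txt

# Quick sanity check: number of accessions and the first few entries.
wc -l < SRR_Acc_List.txt
head -n 3 SRR_Acc_List.txt
```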
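- Download the data with prefetch
wkd=/home/jmzeng/project/airway/ # set the working directory
source activate rna # activate the conda rna environment
# fetch the raw file; the /blob/ page URL returns an HTML page rather than the plain list
wget https://raw.githubusercontent.com/jmzeng1314/GEO/master/airway_RNAseq/SRR_Acc_List.txt
cat SRR_Acc_List.txt | while read id; do (prefetch ${id} &); done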
Aspera Connect command-line tool
Reference: RNA-seq(2)-1: several ways to download the raw data - https://www.jianshu.com/p/8dca09077df3
First, go to the Aspera Connect site, choose the Linux version and copy the link address (the download needs a proxy)
wget http://download.asperasoft.com/download/sw/connect/3.7.4/aspera-connect-3.7.4.147727-linux-64.tar.gz
# unpack
tar zxvf aspera-connect-3.7.4.147727-linux-64.tar.gz
# install
bash aspera-connect-3.7.4.147727-linux-64.sh
# check the .aspera directory
cd # go to the home directory
ls -a # if you can see .aspera, the installation is OK
# add environment variable
echo 'export PATH=~/.aspera/connect/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
# back up the key to your home directory (it is needed later; otherwise an error is raised)
cp ~/.aspera/connect/etc/asperaweb_id_dsa.openssh ~/
# check help file
ascp --help
The installation is done, so start downloading and enjoy the speed. To save time and server storage, only 3 runs were used for this exercise: SRR1039510.sra, SRR1039511.sra and SRR1039512.sra
for ((i=10;i<=12;i++));do ascp -QT -v -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -k 1 -T -l200m [email protected]:/sra/sra-instant/reads/ByRun/sra/SRR/SRR103/SRR10395${i}/SRR10395${i}.sra .;done
# vip05 13:26:38 /teach/project/1.rna/1.sra_data
$ ls -lh
total 5.1G
-rwxr-xr-x 1 qmcui qmcui 1.6G Feb 23 09:07 SRR1039510.sra
-rwxr-xr-x 1 qmcui qmcui 1.5G Feb 23 09:07 SRR1039511.sra
-rwxr-xr-x 1 qmcui qmcui 2.1G Feb 23 09:07 SRR1039512.sra
Format conversion
for ((i=10;i<=12;i++));do fastq-dump --gzip --split-3 -A SRR10395$i.sra -O .;done
ll -h
total 8.3G
-rw-rw-r-- 1 qmcui qmcui 1.3G Feb 23 09:06 SRR1039510_1.fastq.gz
-rw-rw-r-- 1 qmcui qmcui 1.3G Feb 23 09:06 SRR1039510_2.fastq.gz
-rw-rw-r-- 1 qmcui qmcui 1.3G Feb 23 09:05 SRR1039511_1.fastq.gz
-rw-rw-r-- 1 qmcui qmcui 1.3G Feb 23 09:06 SRR1039511_2.fastq.gz
-rw-rw-r-- 1 qmcui qmcui 1.7G Feb 23 09:06 SRR1039512_1.fastq.gz
-rw-rw-r-- 1 qmcui qmcui 1.7G Feb 23 09:06 SRR1039512_2.fastq.gz
Look at the fastq format: the first 12 lines of SRR1039510_1.fastq.gz
#vip05 21:17:28 /teach/project/1.rna/2.raw_fq
$ zless -S /teach/project/1.rna/2.raw_fq/SRR1039510_1.fastq.gz|head -n 12
@SRR1039510.1 HWI-ST177:290:C0TECACXX:1:1101:1373:2104 length=63
TGGGAGGCTGAGGCAGGAGAATCACTTAAACCTGGGAGGCAGAGGTTACAGTGAGCCGAGATT
+SRR1039510.1 HWI-ST177:290:C0TECACXX:1:1101:1373:2104 length=63
HJJJIJJJJJJJJIJJJGHHIJIIIIIIJJEHGGIJGIJIJJIJHHHGGFFDFFFDEDDDBDC
@SRR1039510.2 HWI-ST177:290:C0TECACXX:1:1101:1340:2124 length=63
AAAGAAGGCGACAGTGAGAAGGAGTCCGAGAAGAGTGATGGAGACCCAATAGTCGATCCTGAG
+SRR1039510.2 HWI-ST177:290:C0TECACXX:1:1101:1340:2124 length=63
HJJJJJJJJJJJIJIIGIJJJJGJHJJJHHDFFFE@CEEEDDDDDDDDDDDDDDDBDDDDDDD
@SRR1039510.3 HWI-ST177:290:C0TECACXX:1:1101:1273:2183 length=63
CTGCTGGGCCCCAAGGTCCTCCTGGTCCCAGTGGTGAAGAAGGAAAGAGAGGCCCTAATGGGG
+SRR1039510.3 HWI-ST177:290:C0TECACXX:1:1101:1273:2183 length=63
HJJJJJJJJJJJJJJJGIIIJJJJJHIJJJJHIJFHGIJJJJJJJHHHHHFFFDDDEDDDDDD
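Each FASTQ record in the listing above spans four lines (header, sequence, separator, quality string), and with Phred+33 encoding each quality character stands for its ASCII code minus 33. The following sketch shows both ideas on a tiny made-up file; the reads and the helper name phred_of are illustrative assumptions, not data or tools from this project.

```shell
# Tiny synthetic FASTQ file (made-up reads, not from the airway data).
fq=$(mktemp)
cat > "$fq" <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
@read2
ACGT
+
IIII
EOF

# Each record is 4 lines; the sequence is line 2 of every record.
n_reads=$(awk 'NR % 4 == 2 { n++ } END { print n + 0 }' "$fq")
mean_len=$(awk 'NR % 4 == 2 { n++; total += length($0) }
                END { if (n) printf "%.1f", total / n }' "$fq")

# Phred+33: quality score = ASCII code of the character minus 33.
# phred_of is a hypothetical helper defined only for this example.
phred_of() { echo $(( $(printf '%d' "'$1") - 33 )); }

echo "reads: $n_reads, mean length: $mean_len"
echo "quality of 'I': $(phred_of I), quality of '#': $(phred_of '#')"
rm -f "$fq"
```

For a real compressed file the same awk commands can read from zcat, for example zcat SRR1039510_1.fastq.gz | awk '...'.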
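step2: check quality of sequence reads
Quality control
Goals of quality control:
• Understand the quality and size of the data
• Filter the data
• Remove adapters
• Trim low-quality bases from the read ends (-q 25)
• Maximum allowed error rate (default -e 0.1)
• Discard reads shorter than 36 nt (--length 36)
• Trim adapter sequence only when it overlaps the read by at least 3 bases (--stringency 3)
• Remove reads pair-wise (--paired)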
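To make the last two goals concrete, here is a small sketch of what pair-wise length filtering means: a pair is kept only if both mates reach the minimum length, so the two output files stay in sync. This is an illustration of the idea in plain awk, not how Trim Galore is implemented; the reads are made up and the threshold of 5 is chosen only so the toy data shows both outcomes.

```shell
work=$(mktemp -d)
# Two made-up read pairs: pair A has two long mates, pair B has a short mate 2.
printf '@A/1\nACGTACGT\n+\nIIIIIIII\n@B/1\nACGTACG\n+\nIIIIIII\n' > "$work/r1.fq"
printf '@A/2\nTTGCATG\n+\nIIIIIII\n@B/2\nTTG\n+\nIII\n' > "$work/r2.fq"
min_len=5   # illustrative threshold; the post uses --length 36

# Put each 4-line record on one tab-separated line, then join mates side by side.
paste - - - - < "$work/r1.fq" > "$work/r1.tsv"
paste - - - - < "$work/r2.fq" > "$work/r2.tsv"

# Fields 1-4 belong to read 1, fields 5-8 to read 2; keep the pair only if
# both sequences (fields 2 and 6) are long enough.
paste "$work/r1.tsv" "$work/r2.tsv" |
awk -F '\t' -v min="$min_len" -v o1="$work/r1.keep.fq" -v o2="$work/r2.keep.fq" '
    length($2) >= min && length($6) >= min {
        printf "%s\n%s\n%s\n%s\n", $1, $2, $3, $4 > o1
        printf "%s\n%s\n%s\n%s\n", $5, $6, $7, $8 > o2
    }'

kept=$(awk 'NR % 4 == 1' "$work/r1.keep.fq" | wc -l | tr -d ' ')
echo "pairs kept: $kept"
```

Filtering each file independently would have kept read B/1 but dropped B/2, which is exactly the imbalance that --paired avoids.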
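Software:
• fastqc
• cutadapt
• Trim Galore
fastqc generates a QC report for each file, and multiqc merges the per-sample reports into one.
# path vip05 21:25:33 /teach/project/1.rna/2.raw_fq
source activate rna # activate the rna environment
fastqc -t 2 -o ~/project/1.rna ./*.fastq.gz
multiqc ./ # merge the results
Each <id>_fastqc.html is the quality report of one file, and multiqc_report.html is the merged report for all samples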
(rna) vip05 21:51:54 ~/project/1.rna/multiqcreport
$ ll -h
total 6.2M
drwxrwxr-x 2 vip05 vip05 4.0K Feb 27 21:50 ./
drwxrwxr-x 5 vip05 vip05 4.0K Feb 27 21:50 ../
-rw-rw-r-- 1 vip05 vip05 1.2M Feb 27 21:47 multiqc_report_1.html
-rw-rw-r-- 1 vip05 vip05 617K Feb 25 12:01 SRR1039510_1_100000.rawfq_fastqc.html
-rw-rw-r-- 1 vip05 vip05 647K Feb 27 21:40 SRR1039510_1_fastqc.html
-rw-rw-r-- 1 vip05 vip05 619K Feb 25 11:50 SRR1039510_2_100000.rawfq_fastqc.html
-rw-rw-r-- 1 vip05 vip05 651K Feb 27 21:40 SRR1039510_2_fastqc.html
-rw-rw-r-- 1 vip05 vip05 645K Feb 27 21:42 SRR1039511_1_fastqc.html
-rw-rw-r-- 1 vip05 vip05 645K Feb 27 21:42 SRR1039511_2_fastqc.html
-rw-rw-r-- 1 vip05 vip05 648K Feb 27 21:44 SRR1039512_1_fastqc.html
-rw-rw-r-- 1 vip05 vip05 642K Feb 27 21:44 SRR1039512_2_fastqc.html
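step3: filter the bad quality reads and remove adaptors.
The batch script from the course material (further down) reads a file named config with two columns: the read 1 file and the read 2 file of each sample.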
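One possible way to build such a two-column config file is to list the read 1 and read 2 files separately and paste the lists side by side. The sketch below uses empty placeholder files in a temporary directory so it can be tried safely; the intermediate names fq1.list and fq2.list are arbitrary choices made for this example.

```shell
work=$(mktemp -d)
(
    cd "$work" || exit 1
    # Empty placeholder files standing in for the real paired FASTQ files.
    touch SRR1039510_1.fastq.gz SRR1039510_2.fastq.gz \
          SRR1039511_1.fastq.gz SRR1039511_2.fastq.gz

    # ls sorts names, so line N of both lists belongs to the same sample.
    ls *_1.fastq.gz > fq1.list
    ls *_2.fastq.gz > fq2.list
    paste fq1.list fq2.list > config   # column 1: read 1, column 2: read 2
)
cat "$work/config"
```

In a real run the two ls commands would point at the directory holding the raw data, and it is worth checking that both lists have the same number of lines before pasting them.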
The code I used
mkdir -p ~/project/1.rna/clean
# run this from the directory that holds the raw *.fastq.gz files
trim_galore --phred33 -q 25 -e 0.1 --length 36 --stringency 3 --paired -o ~/project/1.rna/clean *.fastq.gz # filter the data
# vip05 22:59:06 ~/project/1.rna/clean
$ ls -lh
total 7.9G
-rw-rw-r-- 1 vip05 vip05 1.3G Feb 27 22:15 SRR1039510_1_val_1.fq.gz
-rw-rw-r-- 1 vip05 vip05 1.3G Feb 27 22:15 SRR1039510_2_val_2.fq.gz
-rw-rw-r-- 1 vip05 vip05 1.2G Feb 27 22:32 SRR1039511_1_val_1.fq.gz
-rw-rw-r-- 1 vip05 vip05 1.2G Feb 27 22:32 SRR1039511_2_val_2.fq.gz
-rw-rw-r-- 1 vip05 vip05 1.6G Feb 27 22:59 SRR1039512_1_val_1.fq.gz
-rw-rw-r-- 1 vip05 vip05 1.6G Feb 27 22:59 SRR1039512_2_val_2.fq.gz
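Batch filtering code from the course material. Open a file named qc.sh, write the following into it, and run it with the config file as its first argument (bash qc.sh config)
bin_trim_galore=trim_galore # assumed to be on PATH; otherwise give the full path
dir=~/project/1.rna/clean # output directory (assumed here; adjust to your layout)
cat $1 |while read id
do
arr=(${id})
fq1=${arr[0]}
fq2=${arr[1]}
$bin_trim_galore -q 25 --phred33 --length 36 --stringency 3 --paired -o $dir $fq1 $fq2
done
trim_galore removes low-quality sequence and adapters
fastqc and cutadapt are installed the same way and need to be available, because trim_galore first trims low-quality sequence, then calls cutadapt to remove adapters (Illumina adapters by default), and can optionally call fastqc at the end to check read quality.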
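step4: alignment with HISAT2 (mapping)
Goals of alignment:
• Map the reads from the fragmented, sequenced library back to the reference genome
• Sort the alignments with samtools to get a sorted bam for downstream analysis
Alignment strategy:
• Spliced alignment across exon junctions
Software:
• hisat2, samtools
Parameters:
• -p (hisat2) and -@ (samtools) set the number of threads
• -x index
• -1, -2 fq files of reads 1 and reads 2 for paired-end data
• -o output file
• - read the stream passed in through the pipe (standard input)
Reference genome:
• hg38.fa
### create directories
mkdir 0.config 4.mapping
### 1. make index
# download the hisat2 index or build it yourself (slow!)
# hg19 and hg38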
wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/hg19.tar.gz
tar -zxvf hg19.tar.gz
wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/hg38.tar.gz
tar -zxvf hg38.tar.gz
(rna) vip05 09:49:43 /teach/database
$ ls -lh
total 16K
drwxrwxr-x 2 qmcui qmcui 4.0K Feb 23 09:01 GATK
drwxrwxr-x 2 qmcui qmcui 4.0K Feb 23 08:58 genome
drwxrwxr-x 2 qmcui qmcui 4.0K Feb 23 09:01 gtf
drwxrwxr-x 5 qmcui qmcui 4.0K Feb 23 09:00 index
Try a single sample first
###2.single sample mapping(nohup ... &)
cd 4.mapping
id=SRR1039510
hisat2 -p 10 -x /teach/database/index/hisat/hg38/genome -1 ${id}_1_val_1.fq.gz -2 ${id}_2_val_2.fq.gz -S ${id}.hisat.sam
22145156 reads; of these:
22145156 (100.00%) were paired; of these:
844619 (3.81%) aligned concordantly 0 times
19941827 (90.05%) aligned concordantly exactly 1 time
1358710 (6.14%) aligned concordantly >1 times
----
844619 pairs aligned concordantly 0 times; of these:
122345 (14.49%) aligned discordantly 1 time
----
722274 pairs aligned 0 times concordantly or discordantly; of these:
1444548 mates make up the pairs; of these:
856438 (59.29%) aligned 0 times
499198 (34.56%) aligned exactly 1 time
88912 (6.16%) aligned >1 times
98.07% overall alignment rate
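The overall alignment rate in this summary can be recomputed from the counts: every pair contributes two mates, and only the mates reported as "aligned 0 times" in the last block are unaligned. A minimal sketch, assuming the summary has been saved to a file (hisat2 prints it on standard error, so something like 2> sample.log would capture it; the file here is a temporary copy of the output above):

```shell
log=$(mktemp)
cat > "$log" <<'EOF'
22145156 reads; of these:
22145156 (100.00%) were paired; of these:
844619 (3.81%) aligned concordantly 0 times
19941827 (90.05%) aligned concordantly exactly 1 time
1358710 (6.14%) aligned concordantly >1 times
----
844619 pairs aligned concordantly 0 times; of these:
122345 (14.49%) aligned discordantly 1 time
----
722274 pairs aligned 0 times concordantly or discordantly; of these:
1444548 mates make up the pairs; of these:
856438 (59.29%) aligned 0 times
499198 (34.56%) aligned exactly 1 time
88912 (6.16%) aligned >1 times
98.07% overall alignment rate
EOF

# Line 1 holds the number of pairs; the line ending in ") aligned 0 times"
# holds the number of individual mates that did not align at all.
rate=$(awk '
    NR == 1                { pairs = $1 }
    /\) aligned 0 times$/  { unaligned_mates = $1 }
    END { printf "%.2f", 100 * (2 * pairs - unaligned_mates) / (2 * pairs) }
' "$log")
echo "overall alignment rate: ${rate}%"   # prints 98.07%, matching the last log line
rm -f "$log"
```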
Batch code. Four aligners are shown here; we chose hisat2 because it is fast.
cd $wkd/clean
ls *gz|cut -d"_" -f 1 |sort -u |while read id;do
ls -lh ${id}_1_val_1.fq.gz ${id}_2_val_2.fq.gz
hisat2 -p 10 -x /public/reference/index/hisat/hg38/genome -1 ${id}_1_val_1.fq.gz -2 ${id}_2_val_2.fq.gz -S ${id}.hisat.sam
subjunc -T 5 -i /public/reference/index/subread/hg38 -r ${id}_1_val_1.fq.gz -R ${id}_2_val_2.fq.gz -o ${id}.subjunc.sam
bowtie2 -p 10 -x /public/reference/index/bowtie/hg38 -1 ${id}_1_val_1.fq.gz -2 ${id}_2_val_2.fq.gz -S ${id}.bowtie.sam
bwa mem -t 5 -M /public/reference/index/bwa/hg38 ${id}_1_val_1.fq.gz ${id}_2_val_2.fq.gz > ${id}.bwa.sam
done
-rw-rw-r-- 1 vip05 vip05 1.2G Feb 27 22:32 SRR1039511_1_val_1.fq.gz
-rw-rw-r-- 1 vip05 vip05 1.2G Feb 27 22:32 SRR1039511_2_val_2.fq.gz
21413393 reads; of these:
21413393 (100.00%) were paired; of these:
726325 (3.39%) aligned concordantly 0 times
19489193 (91.01%) aligned concordantly exactly 1 time
1197875 (5.59%) aligned concordantly >1 times
----
726325 pairs aligned concordantly 0 times; of these:
123425 (16.99%) aligned discordantly 1 time
----
602900 pairs aligned 0 times concordantly or discordantly; of these:
1205800 mates make up the pairs; of these:
624452 (51.79%) aligned 0 times
498730 (41.36%) aligned exactly 1 time
82618 (6.85%) aligned >1 times
98.54% overall alignment rate
-rw-rw-r-- 1 vip05 vip05 1.6G Feb 27 22:59 SRR1039512_1_val_1.fq.gz
-rw-rw-r-- 1 vip05 vip05 1.6G Feb 27 22:59 SRR1039512_2_val_2.fq.gz
26947191 reads; of these:
26947191 (100.00%) were paired; of these:
643564 (2.39%) aligned concordantly 0 times
24747121 (91.84%) aligned concordantly exactly 1 time
1556506 (5.78%) aligned concordantly >1 times
----
643564 pairs aligned concordantly 0 times; of these:
128943 (20.04%) aligned discordantly 1 time
----
514621 pairs aligned 0 times concordantly or discordantly; of these:
1029242 mates make up the pairs; of these:
531081 (51.60%) aligned 0 times
425913 (41.38%) aligned exactly 1 time
72248 (7.02%) aligned >1 times
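The batch loop derives each sample ID from the file names and then assumes that both trimmed mates exist. A small guard can catch a missing file before an aligner is started. The sketch below only checks and reports; it uses empty placeholder files with the _1_val_1.fq.gz / _2_val_2.fq.gz naming shown above, and one sample is deliberately left incomplete to show the skipped case.

```shell
work=$(mktemp -d)
# Placeholder files: sample SRR1039512 is deliberately missing its second mate.
touch "$work/SRR1039510_1_val_1.fq.gz" "$work/SRR1039510_2_val_2.fq.gz" \
      "$work/SRR1039512_1_val_1.fq.gz"

complete=""
for fq1 in "$work"/*_1_val_1.fq.gz; do
    id=$(basename "$fq1" _1_val_1.fq.gz)      # strip the suffix to get the sample ID
    fq2="$work/${id}_2_val_2.fq.gz"
    if [ -f "$fq2" ]; then
        complete="$complete $id"
        echo "ready:   $id"                   # the aligner call would go here
    else
        echo "skipped: $id (missing $(basename "$fq2"))" >&2
    fi
done
```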
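The sam format
① columns are separated by '\t'
② lines starting with @ are header lines
③ each alignment line = 11 mandatory columns + optional fields
④ same content as the bam file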
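These properties are easy to check with awk. The sketch below builds a made-up three-line SAM file (one header line, one mapped read, one unmapped read), separates header lines from alignment lines, verifies the 11 mandatory columns, and tests bit 4 of the FLAG column, which marks an unmapped read. The records are invented for illustration only.

```shell
sam=$(mktemp)
# Made-up SAM content: one header line, one mapped read, one unmapped read.
printf '@SQ\tSN:chr1\tLN:1000\n' > "$sam"
printf 'r1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tNH:i:1\n' >> "$sam"
printf 'r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n' >> "$sam"

summary=$(awk -F '\t' '
    /^@/ { header++; next }                      # header lines start with @
    {
        records++
        if (NF < 11) malformed++                 # fewer than 11 mandatory columns
        if (int($2 / 4) % 2 == 1) unmapped++     # FLAG bit 4 set: read is unmapped
    }
    END {
        printf "header=%d records=%d unmapped=%d malformed=%d",
               header, records, unmapped, malformed
    }
' "$sam")
echo "$summary"
rm -f "$sam"
```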
The sam files produced by alignment are large and need to be converted to bam!
-rw-rw-r-- 1 vip05 vip05 1.3G Feb 27 22:15 SRR1039510_1_val_1.fq.gz
-rw-rw-r-- 1 vip05 vip05 1.3G Feb 27 22:15 SRR1039510_2_val_2.fq.gz
-rw-rw-r-- 1 vip05 vip05 13G Feb 28 11:16 SRR1039510.hisat.sam
-rw-rw-r-- 1 vip05 vip05 1.2G Feb 27 22:32 SRR1039511_1_val_1.fq.gz
-rw-rw-r-- 1 vip05 vip05 1.2G Feb 27 22:32 SRR1039511_2_val_2.fq.gz
-rw-rw-r-- 1 vip05 vip05 12G Feb 28 11:22 SRR1039511.hisat.sam
-rw-rw-r-- 1 vip05 vip05 1.6G Feb 27 22:59 SRR1039512_1_val_1.fq.gz
-rw-rw-r-- 1 vip05 vip05 1.6G Feb 27 22:59 SRR1039512_2_val_2.fq.gz
-rw-rw-r-- 1 vip05 vip05 15G Feb 28 11:29 SRR1039512.hisat.sam
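- Convert sam to bam
ls *.sam|while read id ;do (samtools sort -O bam -@ 5 -o $(basename ${id} ".sam").bam ${id});done
rm *.sam # delete the sam files
Properties of bam
① a bam file is the binary form of the sam file
② takes up little disk space
③ compatible with downstream tools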
-rw-rw-r-- 1 vip05 vip05 2.1G Feb 28 11:37 SRR1039510.hisat.bam
-rw-rw-r-- 1 vip05 vip05 13G Feb 28 11:16 SRR1039510.hisat.sam
-rw-rw-r-- 1 vip05 vip05 2.0G Feb 28 11:42 SRR1039511.hisat.bam
-rw-rw-r-- 1 vip05 vip05 12G Feb 28 11:22 SRR1039511.hisat.sam
-rw-rw-r-- 1 vip05 vip05 2.5G Feb 28 11:47 SRR1039512.hisat.bam
-rw-rw-r-- 1 vip05 vip05 15G Feb 28 11:29 SRR1039512.hisat.sam
- Index the bam files
ls *.bam |xargs -i samtools index {}
-rw-rw-r-- 1 vip05 vip05 2.1G Feb 28 11:37 SRR1039510.hisat.bam
-rw-rw-r-- 1 vip05 vip05 2.7M Feb 28 12:00 SRR1039510.hisat.bam.bai
-rw-rw-r-- 1 vip05 vip05 2.0G Feb 28 11:42 SRR1039511.hisat.bam
-rw-rw-r-- 1 vip05 vip05 2.6M Feb 28 12:01 SRR1039511.hisat.bam.bai
-rw-rw-r-- 1 vip05 vip05 2.5G Feb 28 11:47 SRR1039512.hisat.bam
-rw-rw-r-- 1 vip05 vip05 2.8M Feb 28 12:02 SRR1039512.hisat.bam.bai
- Summarize how the reads aligned
Oddly, the first command below produced no result and no error; I do not know why yet.
ls *.bam |xargs -i samtools flagstat -@ 2 {} >
ls *.bam |while read id ;do ( samtools flagstat -@ 1 $id > $(basename ${id} ".bam").flagstat );done
source deactivate
step5: count reads
zless -S /teach/database/gtf/gencode.v29.annotation.gtf.gz |less -S
##description: evidence-based annotation of the human genome (GRCh38), version 29 (Ensembl 94)
##provider: GENCODE
##contact: [email protected]
##format: gtf
##date: 2018-08-30
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; lev
chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unproc
chr1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_ps
chr1 HAVANA exon 12613 12721 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_ps
chr1 HAVANA exon 13221 14409 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_ps
chr1 HAVANA transcript 12010 13670 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unproc
chr1 HAVANA exon 12010 12057 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_ps
chr1 HAVANA exon 12179 12227 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_ps
chr1 HAVANA exon 12613 12697 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_ps
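The attribute column of the GTF carries both the gene_id used for counting and the more readable gene_name, so a lookup table between the two can be pulled out of the gene lines. A minimal sketch on two lines modeled on the listing above (attributes shortened); for the real file the input would come from zcat on the .gtf.gz.

```shell
gtf=$(mktemp)
# One gene line and one exon line modeled on the listing above (attributes shortened).
printf 'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1";\n' > "$gtf"
printf 'chr1\tHAVANA\texon\t11869\t12227\t.\t+\t.\tgene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_name "DDX11L1";\n' >> "$gtf"

# Column 3 is the feature type, column 9 the attribute string.
# match() sets RSTART/RLENGTH; the offsets skip the key, the space and the quotes.
map=$(awk -F '\t' '
    $3 == "gene" {
        id = ""; name = ""
        if (match($9, /gene_id "[^"]+"/))   id   = substr($9, RSTART + 9,  RLENGTH - 10)
        if (match($9, /gene_name "[^"]+"/)) name = substr($9, RSTART + 11, RLENGTH - 12)
        print id "\t" name
    }
' "$gtf")
echo "$map"
rm -f "$gtf"
```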
featureCounts -T 5 -p -t exon -g gene_id -a /teach/database/gtf/gencode.v29.annotation.gtf.gz -o ~/all.id.txt *.bam
-rw-rw-r-- 1 vip05 vip05 33M Feb 28 13:31 all.id.txt
-rw-rw-r-- 1 vip05 vip05 491 Feb 28 13:31 all.id.txt.summary
-rw------- 1 vip05 vip05 26K Feb 28 14:31 .bash_history
Take a look at the file
$ head all.id.txt
# Program:featureCounts v1.6.3; Command:"featureCounts" "-T" "5" "-p" "-t" "exon" "-g" "gene_id" "-a" "/teach/database/gtf/gencode.v29.annotation.gtf.gz" "-o" "/trainee1/vip05/all.id.txt" "SRR1039510.hisat.bam" "SRR1039511.hisat.bam" "SRR1039512.hisat.bam"
Geneid Chr Start End Strand Length SRR1039510.hisat.bam SRR1039511.hisat.bam SRR1039512.hisat.bam
ENSG00000223972.5 chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1 11869;12010;12179;12613;12613;12975;13221;13221;13453 12227;12057;12227;12721;12697;13052;13374;14409;13670 +;+;+;+;+;+;+;+;+ 17350
ENSG00000227232.5 chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1 14404;15005;15796;16607;16858;17233;17606;17915;18268;24738;29534 14501;15038;15947;16765;17055;17368;17742;18061;18366;24891;29570 -;-;-;-;-;-;-;-;-;-;- 1351 21 19 24
ENSG00000278267.1 chr1 17369 17436 - 68 0 3 5
ENSG00000243485.5 chr1;chr1;chr1;chr1;chr1 29554;30267;30564;30976;30976 30039;30667;30667;31109;31097 +;+;+;+;+ 1021 0 0 0
ENSG00000284332.1 chr1 30366 30503 + 138 0 0 0
ENSG00000237613.2 chr1;chr1;chr1;chr1;chr1 34554;35245;35277;35721;35721 35174;35481;35481;36073;36081 -;-;-;-;- 1219 0 0 0
ENSG00000268020.3 chr1 52473 53312 + 840 0 0 0
ENSG00000240361.2 chr1;chr1;chr1;chr1 57598;58700;62916;62949 57653;58856;64116;63887 +;+;+;+ 14140
Select the columns you need (gene ID plus the count columns); this can also be done in R.
grep -v "^#" all.id.txt | cut -f 1,7- > all.id3.txt
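Before moving to R, a quick check of the total counts per sample (the library sizes) can reveal a sample that went wrong. The sketch below sums each count column of a made-up miniature matrix laid out like all.id3.txt and strips the .hisat.bam suffix from the column names; the gene names and most numbers are invented for the example.

```shell
counts=$(mktemp)
# Made-up miniature matrix in the layout of all.id3.txt: gene ID + one column per BAM.
printf 'Geneid\tSRR1039510.hisat.bam\tSRR1039511.hisat.bam\n' > "$counts"
printf 'geneA\t21\t19\n' >> "$counts"
printf 'geneB\t0\t3\n'  >> "$counts"

totals=$(awk -F '\t' '
    NR == 1 {                                   # header: remember sample names
        for (i = 2; i <= NF; i++) { name[i] = $i; sub(/\.hisat\.bam$/, "", name[i]) }
        n = NF
        next
    }
    { for (i = 2; i <= NF; i++) sum[i] += $i }  # add up counts column by column
    END { for (i = 2; i <= n; i++) print name[i] "\t" sum[i] }
' "$counts")
echo "$totals"
rm -f "$counts"
```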
Finally, check the file sizes and delete files you no longer need, so that they do not take up server space.
du -sh *
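The downstream differential expression analysis is done in R; it can be completed by following the 生信技能树 tutorial at https://www.jianshu.com/p/a84cd44bac67.
Note: as a beginner, pay close attention to file paths; on my first attempt the paths were what confused me. Also, this exercise was carried out on two servers, so watch the paths. Please excuse any mistakes.
Thanks again to all the teachers at 生信技能树 for their guidance!
Teacher Cui's page: https://www.jianshu.com/u/9153eddebf9c, which is well worth a visit.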
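References:
- A hands-on RNA-seq video walkthrough from the author of 10000+ original bioinformatics tutorials
- RNA-seq transcriptome analysis
- 生信技能树
- https://github.com/jmzeng1314/my-R/blob/master/10-RNA-seq-3-groups/hisat2_mm10_htseq.R
- 陈巍学基因 video 1: Illumina sequencing chemistry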