RNA-seq Study Notes, Part 1: Transcriptome Analysis with the TopHat + Cufflinks Suite


1. Dataset Preparation

The dataset used for this analysis is GSE132693: RNAseq of prostate cancer cells (PC3 and DU145) treated with 2.5 uM pentamidine or vehicle.

Download the data
$ for (( i=2576;i<=2587;i++)) ; do wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR929/SRR929${i}/SRR929${i}.sra;done
Convert the SRA files to FASTQ and compress them
$ ls *.sra | while read id ; do fastq-dump --split-3 $id ; done
$ ls *.fastq | while read id ; do gzip $id ; done
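
The loops in this walkthrough repeat the accession range in every command. A minimal sketch, assuming the twelve runs are SRR9292576 to SRR9292587 as in the download loop: the hypothetical helper `list_runs` (not part of SRA Toolkit) prints the accessions once and saves them to `runs.txt`, a file name chosen here for illustration. The conversion loop only echoes the commands as a dry run.

```shell
# list_runs is a hypothetical helper: it prints one run accession per line,
# SRR9292576 through SRR9292587.
list_runs() {
    i=2576
    while [ "$i" -le 2587 ]; do
        echo "SRR929${i}"
        i=$((i + 1))
    done
}

# Save the list so every later step iterates over the same samples.
list_runs > runs.txt

# Dry run: print the conversion command for each run instead of executing it.
while read -r run; do
    echo "fastq-dump --split-3 ${run}.sra"
done < runs.txt
```

Removing `echo` from the last loop would run the conversion for real; the same `runs.txt` can then drive the QC and alignment loops.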

2. Quality Control

Inspect read quality with FastQC
$ ls *.gz | while read id ; do fastqc $id -o Fastqc_results/ -q ;done
Trim adapters and low-quality bases with Trimmomatic
## ILLUMINACLIP uses $HOME because the shell does not expand ~ in the middle of an argument; its last field (keepBothReads) is a boolean
$ for (( i=576; i<=587; i++ )) ; do java -jar ~/bio_softs/Trimmomatic-0.38/trimmomatic-0.38.jar PE \
SRR9292${i}_1.fastq.gz SRR9292${i}_2.fastq.gz \
QC/Trimmomatic_results/paired_data/SRR9292${i}_1_paired.fastq.gz \
QC/Trimmomatic_results/unpaired_data/SRR9292${i}_1_unpaired.fastq.gz \
QC/Trimmomatic_results/paired_data/SRR9292${i}_2_paired.fastq.gz \
QC/Trimmomatic_results/unpaired_data/SRR9292${i}_2_unpaired.fastq.gz \
ILLUMINACLIP:$HOME/bio_softs/Trimmomatic-0.38/adapters/TruSeq3-PE-2.fa:2:30:10:2:true \
LEADING:3 TRAILING:3 MINLEN:36 HEADCROP:9 ; done
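
The QC commands write into `Fastqc_results/` and `QC/Trimmomatic_results/`. FastQC does not create its `-o` directory, and this sketch assumes Trimmomatic likewise expects its output directories to exist, so they can be created before either command is run:

```shell
# Create the output directories used by the FastQC and Trimmomatic commands.
# -p creates missing parent directories and makes the call safe to repeat.
mkdir -p Fastqc_results \
         QC/Trimmomatic_results/paired_data \
         QC/Trimmomatic_results/unpaired_data

# Report what exists so a typo in a path is caught before a long run.
for d in Fastqc_results \
         QC/Trimmomatic_results/paired_data \
         QC/Trimmomatic_results/unpaired_data; do
    [ -d "$d" ] && echo "ready: $d"
done
```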

3. Preparing the Genome and Annotation Files

Ensembl_database

## Genome files
wget ftp://ftp.ensembl.org/pub/release-97/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.*.fa.gz

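## Decompress the downloads (gunzip Homo_sapiens.GRCh38.dna.chromosome.*.fa.gz), then merge them with the Perl script below
use strict;
use warnings;
open my $out, '>', "Ensembl_hg38_genome.fa" or die "Cannot write output: $!";
for my $chr (1..22, "X", "Y", "MT") {
    my $file = "Homo_sapiens.GRCh38.dna.chromosome.$chr.fa";
    open my $in, '<', $file or die "Cannot read $file: $!";
    print $out $_ while <$in>;    # copy the chromosome sequence unchanged
    close $in;
    print "$chr---ok\n";
}
close $out;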
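
The same merge can also be done directly in the shell. A minimal sketch, assuming the files are still gzip-compressed and named as in the wget command; `merge_chromosomes` is a hypothetical helper that takes the download directory and the output file name as arguments.

```shell
# merge_chromosomes DIR OUT
# Concatenate chromosomes 1-22, X, Y, MT (in that order) into one FASTA file.
# Hypothetical helper; assumes files named
# Homo_sapiens.GRCh38.dna.chromosome.<chr>.fa.gz inside DIR.
merge_chromosomes() {
    dir=$1
    out=$2
    : > "$out"                      # start from an empty output file
    for chr in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y MT; do
        f="$dir/Homo_sapiens.GRCh38.dna.chromosome.$chr.fa.gz"
        if [ ! -f "$f" ]; then
            echo "missing: $f" >&2  # stop early rather than write a partial genome
            return 1
        fi
        gzip -dc "$f" >> "$out"     # decompress to stdout and append
        echo "$chr---ok"
    done
}

# Usage: merge_chromosomes . Ensembl_hg38_genome.fa
```

Stopping at the first missing file avoids silently building an index from an incomplete genome.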

## Annotation file
wget ftp://ftp.ensembl.org/pub/release-97/gtf/homo_sapiens/Homo_sapiens.GRCh38.97.gtf.gz

UCSC_database

## Genome file
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz

## The annotation file is obtained through the UCSC Table Browser

GENCODE_database

## Genome file
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/GRCh38.p12.genome.fa.gz

## Annotation file
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/gencode.v31.annotation.gtf.gz

4. Sequence Alignment

Build the genome index

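## This walkthrough uses the reference genome and annotation file downloaded from Ensembl
Build the index as follows (TopHat2 aligns with Bowtie2 by default, hence bowtie2-build):
$ bowtie2-build human_genome.fa index_genome/hg38_bowtie.index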

Align the reads with TopHat

for ((i=2576;i<=2587;i++));do tophat -p 10 -G \
~/data/human_genome/Ensembl_DB/Homo_sapiens.GRCh38.94.gtf \
-o align_results/tophat_output/SRR929${i}_tophat_align \
~/data/human_genome/Ensembl_DB/bowtie2_index/Ensemble_human_genome_hg38.index \
SRR929${i}_1_paired.fastq.gz SRR929${i}_2_paired.fastq.gz;done
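
Because the alignment loop runs for a long time, it can help to confirm that every sample produced an alignment before moving on. A minimal sketch, assuming the output layout of the TopHat command (`<run>_tophat_align/accepted_hits.bam`); `missing_bams` is a hypothetical helper, and its optional argument overrides the base directory.

```shell
# missing_bams [BASE_DIR]
# Print every run (SRR9292576..SRR9292587) whose TopHat output directory
# lacks a non-empty accepted_hits.bam. Hypothetical helper.
missing_bams() {
    base=${1:-align_results/tophat_output}
    i=2576
    while [ "$i" -le 2587 ]; do
        bam="$base/SRR929${i}_tophat_align/accepted_hits.bam"
        [ -s "$bam" ] || echo "SRR929${i}"
        i=$((i + 1))
    done
}

# An empty result means all twelve alignments are present.
missing_bams
```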

Assemble transcripts with Cufflinks

$ for ((i=2576;i<=2587;i++));do cufflinks -p 10 \
-g ~/data/human_genome/Ensembl_DB/Homo_sapiens.GRCh38.94.gtf \
-o cufflinks_results/SRR929${i}_cufflinks \
align_results/tophat_output/SRR929${i}_tophat_align/accepted_hits.bam;done

Create assemblies.txt, listing the path to each sample's assembled transcripts

./cufflinks_results/SRR9292576_cufflinks/transcripts.gtf
./cufflinks_results/SRR9292577_cufflinks/transcripts.gtf
./cufflinks_results/SRR9292578_cufflinks/transcripts.gtf
./cufflinks_results/SRR9292579_cufflinks/transcripts.gtf
./cufflinks_results/SRR9292580_cufflinks/transcripts.gtf
./cufflinks_results/SRR9292581_cufflinks/transcripts.gtf
./cufflinks_results/SRR9292582_cufflinks/transcripts.gtf
./cufflinks_results/SRR9292583_cufflinks/transcripts.gtf
./cufflinks_results/SRR9292584_cufflinks/transcripts.gtf
./cufflinks_results/SRR9292585_cufflinks/transcripts.gtf
./cufflinks_results/SRR9292586_cufflinks/transcripts.gtf
./cufflinks_results/SRR9292587_cufflinks/transcripts.gtf
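
Rather than typing the twelve paths by hand, the list can be generated. A minimal sketch, assuming the Cufflinks output directories are named as in the cufflinks command (`cufflinks_results/<run>_cufflinks`):

```shell
# Write one transcripts.gtf path per sample into assemblies.txt.
: > assemblies.txt                 # create or empty the file
i=2576
while [ "$i" -le 2587 ]; do
    echo "./cufflinks_results/SRR929${i}_cufflinks/transcripts.gtf" >> assemblies.txt
    i=$((i + 1))
done

# Show how many assemblies were listed.
wc -l < assemblies.txt
```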

Run cuffmerge to merge the transcripts from all samples into a single file

$ cuffmerge -g ~/data/human_genome/Ensembl_DB/Homo_sapiens.GRCh38.94.gtf \
-s ~/data/human_genome/Ensembl_DB/Ensemble_human_genome_hg38.fasta \
-p 6 -o cuffmerge_results/merged_out assemblies.txt

cuffdiff: differential expression analysis

cuffdiff -p 10 -o cuffdiff_results/DU145_diffout \
-b ~/data/human_genome/Ensembl_DB/Ensemble_human_genome_hg38.fasta \
-L con,ptm -u cuffmerge_results/merged_out/merged.gtf \
align_results/tophat_output/SRR9292582_tophat_align/accepted_hits.bam,\
align_results/tophat_output/SRR9292583_tophat_align/accepted_hits.bam,\
align_results/tophat_output/SRR9292584_tophat_align/accepted_hits.bam \
align_results/tophat_output/SRR9292585_tophat_align/accepted_hits.bam,\
align_results/tophat_output/SRR9292586_tophat_align/accepted_hits.bam,\
align_results/tophat_output/SRR9292587_tophat_align/accepted_hits.bam
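
cuffdiff takes the replicates of one condition as a comma-separated list and separates conditions with a space, which is easy to get wrong when the paths are typed by hand. A minimal sketch, assuming the TopHat output layout used in this walkthrough: the hypothetical helper `bam_list` builds the comma-separated list from run accessions, and the final command is only echoed as a dry run.

```shell
# bam_list RUN...
# Print the accepted_hits.bam paths of the given runs joined by commas.
# Hypothetical helper, not part of the Cufflinks suite.
bam_list() {
    list=""
    for run in "$@"; do
        bam="align_results/tophat_output/${run}_tophat_align/accepted_hits.bam"
        if [ -z "$list" ]; then list=$bam; else list="$list,$bam"; fi
    done
    echo "$list"
}

# Replicates grouped as in the cuffdiff command: first condition (label con),
# second condition (label ptm).
con=$(bam_list SRR9292582 SRR9292583 SRR9292584)
ptm=$(bam_list SRR9292585 SRR9292586 SRR9292587)

# Dry run: print the cuffdiff call; drop "echo" to execute it.
echo cuffdiff -p 10 -o cuffdiff_results/DU145_diffout \
    -b ~/data/human_genome/Ensembl_DB/Ensemble_human_genome_hg38.fasta \
    -L con,ptm -u cuffmerge_results/merged_out/merged.gtf \
    "$con" "$ptm"
```

The order of the two lists has to match the order of the labels given to `-L`.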
