A First Look at Using cellranger (1)

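These notes follow the code from the tutorial series published by the 单细胞天地 WeChat public account. I had learned earlier that Cell Ranger needs a lot of memory to run and my own computer was not up to it; now that I have access to a server, it is time to catch up on this part.
Reference articles:
1. Single-Cell in Practice (2): Things to Note Before Using Cell Ranger
2. Single-Cell in Practice (3): A First Look at Cell Ranger
3. Single-Cell in Practice (4): An Overview of the Cell Ranger Workflow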

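Practice data: GSE117988. First, a look at how the authors describe their data processing on the GEO page.

The raw data consist of three reads: Read 1 with 26 bases (cell barcode + UMI), the index read with 8 bases (sample index), and Read 2 with 98 bases (RNA read). The raw BCL files were converted with cellranger mkfastq, analysed with cellranger count, and aligned with STAR against GRCh38 plus HM011556.1.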

(1) Download the SRA data

#!/bin/bash
# module load sratoolkit/2.9.1  (this is how I load the software on our server; skip this line on your own machine)
while read -r i
do
  prefetch "$i" -O "$(pwd)" && echo "**${i}.sra done**"
done < SRR_Acc_List.txt
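Downloads of this size can fail partway, so it may be worth checking which accessions are still missing before moving on. The following is a minimal sketch, not part of the original tutorial: `missing_sra` is a hypothetical helper, and it assumes prefetch wrote either `<acc>.sra` or `<acc>/<acc>.sra` under the download folder (the layout differs between sratoolkit versions).

```shell
#!/bin/bash
# Print every accession in the list that has no non-empty .sra file yet.
# Assumption: prefetch wrote <dir>/<acc>.sra or <dir>/<acc>/<acc>.sra.
missing_sra() {
  acc_list=$1
  dir=${2:-.}
  # "|| [ -n "$acc" ]" keeps the last line even without a trailing newline
  while read -r acc || [ -n "$acc" ]; do
    [ -z "$acc" ] && continue            # skip blank lines
    if [ ! -s "$dir/$acc.sra" ] && [ ! -s "$dir/$acc/$acc.sra" ]; then
      echo "$acc"
    fi
  done < "$acc_list"
}
```

Running `missing_sra SRR_Acc_List.txt .` should print nothing once all six runs have been downloaded; any accession it prints can be passed to prefetch again.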

(2) Extract the FASTQ files

#!/bin/bash
# module load sratoolkit/2.9.1
for i in SRR77229*
do
  fastq-dump --gzip --split-files "./$i"
done

This step produces three FASTQ files for each SRA file, suffixed _1, _2 and _3. Let's look at what each of them contains:

$ zless -SN SRR7722942_1.fastq.gz | head  # reads in the _1 file: each one is 8 bases long, so this is the index read described above
@SRR7722942.1 SN367:911:HKMNCBCXY:2:1101:1192:1900 length=8
CGCTATGT
+SRR7722942.1 SN367:911:HKMNCBCXY:2:1101:1192:1900 length=8
GGGGGIII
@SRR7722942.2 SN367:911:HKMNCBCXY:2:1101:1112:1988 length=8
CGCTATGT
+SRR7722942.2 SN367:911:HKMNCBCXY:2:1101:1112:1988 length=8
GGGGGIII
@SRR7722942.3 SN367:911:HKMNCBCXY:2:1101:1404:1952 length=8
CGCTATGT

$ zless -SN SRR7722942_2.fastq.gz | head  # reads in the _2 file are 26 bases long: the cell barcode + UMI read described above
@SRR7722942.1 SN367:911:HKMNCBCXY:2:1101:1192:1900 length=26
CGTTGGGGTCTGCGGTAAATAGGCCA
+SRR7722942.1 SN367:911:HKMNCBCXY:2:1101:1192:1900 length=26
GAGGGIGGIIGGIIIIIGIGGGGGGI
@SRR7722942.2 SN367:911:HKMNCBCXY:2:1101:1112:1988 length=26
CTTAACTTCTCGATGAAGGGGTCTCG
+SRR7722942.2 SN367:911:HKMNCBCXY:2:1101:1112:1988 length=26
GGGGGIIIIIIIIIIGIIIIIIGIII
@SRR7722942.3 SN367:911:HKMNCBCXY:2:1101:1404:1952 length=26
AAGACCTTCGCCCTTATACCGGTCCC

$ zless -SN SRR7722942_3.fastq.gz | head  # reads in the _3 file are 98 bases long: the RNA read
@SRR7722942.1 SN367:911:HKMNCBCXY:2:1101:1192:1900 length=98
NNNNNGTGGTATCAACGCAGAGTACATGGGGCNCTTACCGCCATCTTGGCTCCTGTGGNTGNCTGCTGGGAACGGGACTTCTAAAAGNNNNTATGTCT
+SRR7722942.1 SN367:911:HKMNCBCXY:2:1101:1192:1900 length=98
#####<.<<
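Looking at the first few records is a quick check, but a length tally over more reads is a safer basis for deciding which file is which. Here is a small sketch of that idea; `read_lengths` is a hypothetical helper name, and the default of 10000 reads is an arbitrary choice.

```shell
#!/bin/bash
# Tally read lengths over the first N reads of a gzipped FASTQ file.
# Output: one "count length" pair per line.
read_lengths() {
  fastq_gz=$1
  n_reads=${2:-10000}
  gzip -dc "$fastq_gz" \
    | head -n "$((n_reads * 4))" \
    | awk 'NR % 4 == 2 { print length($0) }' \
    | sort -n | uniq -c | awk '{ print $1, $2 }'
}
```

Going by the listings above, `read_lengths SRR7722942_1.fastq.gz` should report length 8, the _2 file length 26, and the _3 file length 98.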

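(3) Rename the FASTQ files

According to the instructions on the 10x website, the FASTQ files should be renamed before any further processing.
The expected format is:

[Sample Name]_S1_L00[Lane Number]_[Read Type]_001.fastq.gz

This lines up with the read layout described above: the index file becomes I1, Read 1 (cell barcode + UMI) becomes R1, and Read 2 (RNA read) becomes R2. So the _1 file is renamed to I1, the _2 file to R1, and the _3 file to R2: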

$ cat SRR_Acc_List.txt
SRR7722937
SRR7722941
SRR7722938
SRR7722939
SRR7722940
SRR7722942
$ cat SRR_Acc_List.txt | while read i ;do (mv ${i}_1*.gz ${i}_S1_L001_I1_001.fastq.gz;mv ${i}_2*.gz ${i}_S1_L001_R1_001.fastq.gz;mv ${i}_3*.gz ${i}_S1_L001_R2_001.fastq.gz);done

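After renaming, each sample's files look like this: SRR7722937_S1_L001_I1_001.fastq.gz, SRR7722937_S1_L001_R1_001.fastq.gz and SRR7722937_S1_L001_R2_001.fastq.gz.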
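The one-line rename works, but it silently overwrites files and keeps going if a source file is missing. A more defensive variant might look like the sketch below. `rename_for_cellranger` is a hypothetical helper, and the _1=I1, _2=R1, _3=R2 mapping is the one worked out above for this dataset only; check the read lengths before reusing it on other data.

```shell
#!/bin/bash
# Rename <acc>_1/_2/_3.fastq.gz to the I1/R1/R2 names used by cellranger.
# Stops with an error instead of overwriting or skipping files.
# Assumption: _1 = index (I1), _2 = barcode + UMI (R1), _3 = RNA read (R2).
rename_for_cellranger() {
  acc=$1
  dir=${2:-.}
  for pair in 1:I1 2:R1 3:R2; do
    src="$dir/${acc}_${pair%%:*}.fastq.gz"
    dst="$dir/${acc}_S1_L001_${pair##*:}_001.fastq.gz"
    if [ ! -e "$src" ]; then
      echo "missing: $src" >&2
      return 1
    fi
    if [ -e "$dst" ]; then
      echo "exists, not overwriting: $dst" >&2
      return 1
    fi
    mv "$src" "$dst"
  done
}
```

It can be driven by the same accession list, for example `while read -r i; do rename_for_cellranger "$i" . || break; done < SRR_Acc_List.txt`.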

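(4) Quality check with FastQC

There are quite a few files, so FastQC is run in batch. First, collect the names of the R1 and R2 files:

$ find ./ -name '*R1*.gz' > P2586_4_id_1.txt  # collect the names of the R1 files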
$ cat P2586_4_id_1.txt
./SRR7722942_S1_L001_R1_001.fastq.gz
./SRR7722939_S1_L001_R1_001.fastq.gz
./SRR7722940_S1_L001_R1_001.fastq.gz
./SRR7722937_S1_L001_R1_001.fastq.gz
./SRR7722938_S1_L001_R1_001.fastq.gz
./SRR7722941_S1_L001_R1_001.fastq.gz

$ find ./ -name '*R2*.gz' > P2586_4_id_2.txt  # collect the names of the R2 files
$ cat P2586_4_id_2.txt
./SRR7722938_S1_L001_R2_001.fastq.gz
./SRR7722942_S1_L001_R2_001.fastq.gz
./SRR7722939_S1_L001_R2_001.fastq.gz
./SRR7722937_S1_L001_R2_001.fastq.gz
./SRR7722941_S1_L001_R2_001.fastq.gz
./SRR7722940_S1_L001_R2_001.fastq.gz

$ cat P2586_4_id_1.txt P2586_4_id_2.txt > P2586_4_id_all.txt  # merge the two lists of file names
$ cat P2586_4_id_all.txt
./SRR7722942_S1_L001_R1_001.fastq.gz
./SRR7722939_S1_L001_R1_001.fastq.gz
./SRR7722940_S1_L001_R1_001.fastq.gz
./SRR7722937_S1_L001_R1_001.fastq.gz
./SRR7722938_S1_L001_R1_001.fastq.gz
./SRR7722941_S1_L001_R1_001.fastq.gz
./SRR7722938_S1_L001_R2_001.fastq.gz
./SRR7722942_S1_L001_R2_001.fastq.gz
./SRR7722939_S1_L001_R2_001.fastq.gz
./SRR7722937_S1_L001_R2_001.fastq.gz
./SRR7722941_S1_L001_R2_001.fastq.gz
./SRR7722940_S1_L001_R2_001.fastq.gz

Then run FastQC:

$ mkdir -p ./fastqc  # the output folder must exist before FastQC runs
$ cat P2586_4_id_all.txt | xargs fastqc -t 20 -o ./fastqc/
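The three intermediate list files are not strictly needed. As an alternative, the R1 and R2 files can be collected in a single find call and piped straight to FastQC. This is only a sketch; `list_read_fastqs` is a hypothetical helper, and it assumes the files already carry the `_R1_`/`_R2_` names from the renaming step.

```shell
#!/bin/bash
# Print all R1 and R2 FASTQ files under a directory, sorted, one per line.
# Index files (_I1_) are left out on purpose.
list_read_fastqs() {
  find "${1:-.}" -type f \( -name '*_R1_*.fastq.gz' -o -name '*_R2_*.fastq.gz' \) | sort
}
# Example (needs FastQC installed; the output folder must exist beforehand):
#   mkdir -p fastqc && list_read_fastqs . | xargs fastqc -t 20 -o ./fastqc/
```

Sorting keeps each sample's R1 and R2 next to each other, which makes the list easier to check by eye than the unsorted find output.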

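The reports are files ending in .html and can be opened in a browser. As a first look, open the R1 and R2 reports for SRR7722937, i.e. the barcode read and the RNA read:

FastQC report for the barcode read (R1).
FastQC report for the RNA read (R2); no adapter sequences were found.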

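(5) Building the reference with cellranger

The 单细胞天地 article (Single-Cell in Practice (3): A First Look at Cell Ranger) introduces cellranger in detail, so I will not copy it here. Instead, let's look at how the authors of the paper behind this practice dataset processed their data: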

Raw base BCL files were demultiplexed using the Cell Ranger mkfastq pipeline into sample-specific FASTQ files.
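First, the BCL files are run through cellranger mkfastq, which demultiplexes them by sample index into per-sample FASTQ files. (We cannot run this step in this exercise, because the authors did not upload the raw BCL files.)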
FASTQ files were processed individually using the Cell Ranger count pipeline, which made use of the STAR software to align cDNA reads to the GRCh38 and the Merkel cell polyomavirus sequence (HM011556.1).
Each sample's FASTQ files are then run through cellranger count on their own, with GRCh38 plus HM011556.1 (the Merkel cell polyomavirus sequence) as the reference.
Tumor and PBMC samples were respectively aggregated together using the Cell Ranger aggr pipeline resulting in two gene-barcode count matrices (tumor and PBMC) to be used for downstream analyses.
Finally, the tumor samples and the PBMC samples are each combined with cellranger aggr, giving one count matrix per group for downstream analysis.

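I covered installing cellranger in an earlier post (10x single-cell sequencing analysis practice (1)).

On downloading the reference: 10x provides its own references for download. They are not the genomes we normally use for alignment, because the annotation has been filtered:

# You can download this file, a prebuilt reference from the 10x website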
$ curl -O https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz
$ tar -xzvf refdata-gex-GRCh38-2020-A.tar.gz

You can also try building one yourself:

# Download the genome
$ wget ftp://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
$ gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# Download the annotation
$ wget ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz
$ gunzip Homo_sapiens.GRCh38.84.gtf.gz
# Filter the annotation by gene biotype with cellranger mkgtf
# Usage: cellranger mkgtf <input_gtf> <output_gtf> [--attribute=KEY:VALUE...]
$ cellranger mkgtf Homo_sapiens.GRCh38.84.gtf Homo_sapiens.GRCh38.84.filtered.gtf \
                --attribute=gene_biotype:protein_coding \
                --attribute=gene_biotype:lincRNA \
                --attribute=gene_biotype:antisense \
                --attribute=gene_biotype:IG_LV_gene \
                --attribute=gene_biotype:IG_V_gene \
                --attribute=gene_biotype:IG_V_pseudogene \
                --attribute=gene_biotype:IG_D_gene \
                --attribute=gene_biotype:IG_J_gene \
                --attribute=gene_biotype:IG_J_pseudogene \
                --attribute=gene_biotype:IG_C_gene \
                --attribute=gene_biotype:IG_C_pseudogene \
                --attribute=gene_biotype:TR_V_gene \
                --attribute=gene_biotype:TR_V_pseudogene \
                --attribute=gene_biotype:TR_D_gene \
                --attribute=gene_biotype:TR_J_gene \
                --attribute=gene_biotype:TR_J_pseudogene \
                --attribute=gene_biotype:TR_C_gene
# Progress messages like these are printed:
# Writing new genes GTF file (may take 10 minutes for a 1GB input GTF file)...
# ...done

Take a look at the filtered GTF file:

$ less Homo_sapiens.GRCh38.84.filtered.gtf
#!genome-build GRCh38.p5
#!genome-version GRCh38
#!genome-date 2013-12
#!genome-build-accession NCBI:GCA_000001405.20
#!genebuild-last-updated 2015-10
1       havana  gene    29554   31109   .       +       .       gene_id "ENSG00000243485"; gene_version "3"; gene_name "RP11-34P13.3"; gene_source "havana"; gene_biotype "lincRNA"; havana_gene "OTTHUMG00000000959"; havana_gene_version "2";
1       havana  transcript      29554   31097   .       +       .       gene_id "ENSG00000243485"; gene_version "3"; transcript_id "ENST00000473358"; transcript_version "1"; gene_name "RP11-34P13.3"; gene_source "havana"; gene_biotype "lincRNA"; havana_gene "OTTHUMG00000000959"; havana_gene_version "2"; transcript_name "RP11-34P13.3-001"; transcript_source "havana"; transcript_biotype "lincRNA"; havana_transcript "OTTHUMT00000002840"; havana_transcript_version "1"; tag "basic"; transcript_support_level "5";
1       havana  exon    29554   30039   .       +       .       gene_id "ENSG00000243485"; gene_version "3"; transcript_id "ENST00000473358"; transcript_version "1"; exon_number "1"; gene_name "RP11-34P13.3"; gene_source "havana"; gene_biotype "lincRNA"; havana_gene "OTTHUMG00000000959"; havana_gene_version "2"; transcript_name "RP11-34P13.3-001"; transcript_source "havana"; transcript_biotype "lincRNA"; havana_transcript "OTTHUMT00000002840"; havana_transcript_version "1"; exon_id "ENSE00001947070"; exon_version "1"; tag "basic"; transcript_support_level "5";
1       havana  exon    30564   30667   .       +       .       gene_id "ENSG00000243485"; gene_version "3"; transcript_id "ENST00000473358"; transcript_version "1"; exon_number "2"; gene_name "RP11-34P13.3"; gene_source "havana"; gene_biotype "lincRNA"; havana_gene "OTTHUMG00000000959"; havana_gene_version "2"; transcript_name "RP11-34P13.3-001"; transcript_source "havana"; transcript_biotype "lincRNA"; havana_transcript "OTTHUMT00000002840"; havana_transcript_version "1"; exon_id "ENSE00001922571"; exon_version "1"; tag "basic"; transcript_support_level "5";
1       havana  exon    30976   31097   .       +       .       gene_id "ENSG00000243485"; gene_version "3"; transcript_id "ENST00000473358"; transcript_version "1"; exon_number "3"; gene_name "RP11-34P13.3"; gene_source "havana"; gene_biotype "lincRNA"; havana_gene "OTTHUMG00000000959"; havana_gene_version "2"; transcript_name "RP11-34P13.3-001"; transcript_source "havana"; transcript_biotype "lincRNA"; havana_transcript "OTTHUMT00000002840"; havana_transcript_version "1"; exon_id "ENSE00001827679"; exon_version "1"; tag "basic"; transcript_support_level "5";
1       havana  transcript      30267   31109   .       +       .       gene_id "ENSG00000243485"; gene_version "3"; transcript_id "ENST00000469289"; transcript_version "1"; gene_name "RP11-34P13.3"; gene_source "havana"; gene_biotype "lincRNA"; havana_gene "OTTHUMG00000000959"; havana_gene_version "2"; transcript_name "RP11-34P13.3-002"; transcript_source "havana"; transcript_biotype "lincRNA"; havana_transcript "OTTHUMT00000002841"; havana_transcript_version "2"; tag "basic"; transcript_support_level "5";

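Then, using the code from the tutorial (Single-Cell in Practice (3): A First Look at Cell Ranger), check which gene biotypes the filtered GTF contains. Note that the counts are GTF lines (gene, transcript, exon and other records), not numbers of genes: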

$ cat Homo_sapiens.GRCh38.84.filtered.gtf |grep -v "#" |awk -v FS='gene_biotype ' 'NF>1{print $2}'|awk -F ";" '{print $1}'|sort | uniq -c
    213 "IG_C_gene"
     33 "IG_C_pseudogene"
    152 "IG_D_gene"
     76 "IG_J_gene"
      9 "IG_J_pseudogene"
   1209 "IG_V_gene"
    646 "IG_V_pseudogene"
    125 "TR_C_gene"
     16 "TR_D_gene"
    316 "TR_J_gene"
     12 "TR_J_pseudogene"
    848 "TR_V_gene"
    110 "TR_V_pseudogene"
  45662 "antisense"
  58181 "lincRNA"
2337766 "protein_coding"
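Because the tally above counts every GTF line, the 2,337,766 figure for protein_coding includes transcript, exon and other rows. To count genes instead, one option is to keep only rows whose third column is `gene`. The sketch below assumes a tab-separated Ensembl-style GTF with `gene` rows and a `gene_biotype` attribute, as in the listing above; `count_genes_by_biotype` is a hypothetical helper name.

```shell
#!/bin/bash
# Count genes (not GTF lines) per gene_biotype.
# Only rows whose third column is "gene" are counted.
# Output: one "count biotype" pair per line.
count_genes_by_biotype() {
  grep -v '^#' "$1" \
    | awk -F '\t' '$3 == "gene" && match($9, /gene_biotype "[^"]+"/) {
        s = substr($9, RSTART, RLENGTH)      # e.g. gene_biotype "lincRNA"
        sub(/^gene_biotype "/, "", s)        # strip the key and opening quote
        sub(/"$/, "", s)                     # strip the closing quote
        print s
      }' \
    | sort | uniq -c | awk '{ print $1, $2 }'
}
```

For example, `count_genes_by_biotype Homo_sapiens.GRCh38.84.filtered.gtf` gives one row per biotype, which can be compared directly with gene counts reported elsewhere.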

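Next, build the reference from the filtered annotation. This step is very slow, and I now understand why people say cellranger needs a lot of memory:

# About --ref-version (a version label recorded in the reference): the original tutorial used 1.2.0, but the oldest cellranger on our university server is 2.2.0, which is the version I load, so I set 2.2.0 here.
$ cellranger mkref --genome=GRCh38 \
                --fasta=Homo_sapiens.GRCh38.dna.primary_assembly.fa \
                --genes=Homo_sapiens.GRCh38.84.filtered.gtf \
                --ref-version=2.2.0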

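Building the reference takes a long time. Once the job has finished, open the log file to see how the run went:

# For this step alone I requested 128 GB of memory on the server, and it still ran from 5:20 pm until after 11 pm. If your machine is short on memory, just download the prebuilt reference.
$ cat cellranger_8798259.log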

Jul 30 17:20:39 ..... Started STAR run
Jul 30 17:20:41 ... Starting to generate Genome files
Jul 30 17:22:28 ... starting to sort  Suffix Array. This may take a long time...
Jul 30 17:22:37 ... sorting Suffix Array chunks and saving them to disk...
Jul 30 21:26:01 ... loading chunks from disk, packing SA...
Jul 30 21:26:32 ... Finished generating suffix array
Jul 30 21:26:32 ... Generating Suffix Array index
Jul 30 21:28:43 ... Completed Suffix Array index
Jul 30 21:28:43 ..... Processing annotations GTF
Jul 30 21:30:15 ..... Inserting junctions into the genome indices
Jul 30 21:37:42 ... writing Genome to disk ...
Jul 30 22:09:57 ... writing Suffix Array to disk ...
Jul 30 23:03:34 ... writing SAindex to disk
Jul 30 23:09:07 ..... Finished successfully
Creating new reference folder at /gpfs/home/downloads/10_genomics_genome/GRCh38
...done

Writing genome FASTA file into reference folder...
...done

Computing hash of genome FASTA file...
...done

Writing genes GTF file into reference folder...
...done

Computing hash of genes GTF file...
...done

Writing genes index file into reference folder (may take over 10 minutes for a 3Gb genome)...
...done

Writing genome metadata JSON file into reference folder...
...done

Generating STAR genome index (may take over 8 core hours for a 3Gb genome)...
...done.

>>> Reference successfully created! <<<

So what files did we get after all that time?

$ tree
.
|-- fasta
|   `-- genome.fa
|-- genes
|   `-- genes.gtf
|-- pickle
|   `-- genes.pickle
|-- reference.json
`-- star
    |-- Genome
    |-- SA
    |-- SAindex
    |-- chrLength.txt
    |-- chrName.txt
    |-- chrNameLength.txt
    |-- chrStart.txt
    |-- exonGeTrInfo.tab
    |-- exonInfo.tab
    |-- geneInfo.tab
    |-- genomeParameters.txt
    |-- sjdbInfo.txt
    |-- sjdbList.fromGTF.out.tab
    |-- sjdbList.out.tab
    `-- transcriptInfo.tab

4 directories, 19 files

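(6) cellranger count

As an example, I run cellranger count on the FASTQ files obtained earlier. The authors of the paper actually aligned to both GRCh38 and the virus genome, but I have not downloaded the virus genome, so this is only an exercise.

If you want to practise the full workflow, the cellranger website explains how to add a gene to your reference. The HM011556.1 virus genome used in the paper can be downloaded from NCBI.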
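When a virus sequence is appended to the human reference, the sequence names in the virus GTF (column 1) have to match a header in the virus FASTA, otherwise the annotation points at a sequence that does not exist. A quick check before running mkref could look like the sketch below; `gtf_seqnames_missing_from_fasta` is a hypothetical helper, and the header name is assumed to be everything between `>` and the first whitespace.

```shell
#!/bin/bash
# Print sequence names used in a GTF that have no matching FASTA header.
# An empty result means every GTF record points at a sequence in the FASTA.
gtf_seqnames_missing_from_fasta() {
  gtf=$1
  fasta=$2
  awk -F '\t' '
    NR == FNR {                       # first file: the FASTA
      if ($0 ~ /^>/) {
        name = substr($0, 2)
        sub(/[ \t].*$/, "", name)     # header name ends at the first whitespace
        seen[name] = 1
      }
      next
    }
    /^#/ || NF == 0 { next }          # skip GTF comments and blank lines
    !($1 in seen) && !reported[$1]++ { print $1 }
  ' "$fasta" "$gtf"
}
```

With placeholder file names, `gtf_seqnames_missing_from_fasta virus.gtf virus.fa` should print nothing when the two files agree.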

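# For example:
cat virus.gtf >> customref-GRCh38-2020-A.gtf
cat virus.fa >> customref-GRCh38-2020-A.fa
# After adding the sequence, rerun the mkref step above on the new combined genome; try it if you have the interest and the time.

# Options used below (comments cannot follow a trailing backslash, so they are listed here):
# --id:            name of the output folder
# --transcriptome: location of the reference you built; if you downloaded the prebuilt reference instead, the path usually looks like /opt/refdata-gex-GRCh38-2020-A
# --fastqs:        folder containing the FASTQ files
# --sample:        sample name, i.e. the prefix of the FASTQ file names
# --expect-cells:  expected number of cells in your sample
# --localcores / --localmem: CPU cores and memory (GB) cellranger may use
# --nosecondary:   only produce the expression matrix and skip dimensionality reduction, clustering and visualization (these will be done later with R packages); optional
$ cellranger count \
          --id=Tumor \
          --transcriptome=/gpfs/home/practice/10_genomics_genome/GRCh38 \
          --fastqs=/gpfs/home/practice \
          --sample=SRR7722938 \
          --expect-cells=10000 \
          --localcores=64 \
          --localmem=200 \
          --nosecondary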

While it runs, a lot of status messages are printed:

Martian Runtime - '2.2.0-v2.3.3'
Serving UI at http://42700=ekX6-BNJ2NhnFQUkZ6nmX268SliZN4TUyQaHD2EFHQg

Running preflight checks (please wait)...
Checking sample info...
Checking FASTQ folder...
Checking reference...
Checking reference_path (/gpfs/home/practice/10_genomics_genome/GRCh38) on cn-0043...
Checking chemistry...
Checking optional arguments...
mrc: '2.2.0-v2.3.3'

mrp: '2.2.0-v2.3.3'

Anaconda: Python 2.7.13 :: Continuum Analytics, Inc.
numpy: 1.13.1
scipy: 0.19.1
pysam: 0.9.1
h5py: 2.7.0
pandas: 0.20.2
STAR: STAR_2.5.1b
samtools: samtools 1.6
Using htslib 1.6
Copyright (C) 2017 Genome Research Ltd.

2020-08-01 17:15:14 [runtime] (ready)           ID.Tumor937.SC_RNA_COUNTER_CS.SC_RNA_COUNTER.CHEMISTRY_DETECTOR.DETECT_CHEMISTRY
2020-08-01 17:15:14 [runtime] (run:local)       ID.Tumor937.SC_RNA_COUNTER_CS.SC_RNA_COUNTER.CHEMISTRY_DETECTOR.DETECT_CHEMISTRY.fork0.split
2020-08-01 17:15:17 [runtime] (split_complete)  ID.Tumor937.SC_RNA_COUNTER_CS.SC_RNA_COUNTER.CHEMISTRY_DETECTOR.DETECT_CHEMISTRY
2020-08-01 17:15:17 [runtime] (run:local)       ID.Tumor937.SC_RNA_COUNTER_CS.SC_RNA_COUNTER.CHEMISTRY_DETECTOR.DETECT_CHEMISTRY.fork0.chnk0.main
2020-08-01 17:15:21 [runtime] (progress)        ID.Tumor937.SC_RNA_COUNTER_CS.SC_RNA_COUNTER.CHEMISTRY_DETECTOR.DETECT_CHEMISTRY.fork0.chnk0: Indexing genome...
2020-08-01 17:15:39 [runtime] (progress)        ID.Tumor937.SC_RNA_COUNTER_CS.SC_RNA_COUNTER.CHEMISTRY_DETECTOR.DETECT_CHEMISTRY.fork0.chnk0: Building transcriptome...
2020-08-01 17:19:33 [runtime] (progress)        ID.Tumor937.SC_RNA_COUNTER_CS.SC_RNA_COUNTER.CHEMISTRY_DETECTOR.DETECT_CHEMISTRY.fork0.chnk0: Building kmer index...
2020-08-01 17:25:19 [runtime] (update)          ID.Tumor937.SC_RNA_COUNTER_CS.SC_RNA_COUNTER.CHEMISTRY_DETECTOR.DETECT_CHEMISTRY.fork0 chunks_running
...... # How long this takes depends on the number of cells and the sequencing depth, anywhere from a few hours to a few days; cellranger is very, very slow

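The cellranger website lists the output files that a successful cellranger count run should produce.

As of the end of these notes I have tried running for 12 hours and for 24 hours, and the cellranger count job still has not finished, even with 200 GB of memory and 64 cores on the server. Next I want to let it run for 3 to 5 days and see whether it completes.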
