Reconstructing immune repertoire data from bulk RNA-seq with TRUST4

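Yesterday my advisor sent me a paper from Shirley Liu, the "goddess of bioinformatics", and I got pretty excited reading it: without any dedicated immune repertoire sequencing, the method reconstructs immune repertoire information directly from bulk RNA-seq or scRNA-seq data.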



Key points of the paper

  1. Although less sensitive than TCR-seq and BCR-seq, TRUST is able to identify the abundantly expressed and potentially more clonally expanded TCRs/BCRs in the RNA-seq data that are more likely to be involved in antigen binding.
  2. Recent years have also seen other computational methods introduced for immune repertoire construction from RNA-seq data, including V’DJer, MiXCR, CATT and ImRep. These methods focus on reconstruction of complementary-determining region 3 (CDR3), with limited ability to assemble full-length V(D)J receptor sequences, although CDR1 and CDR2 on the V sequence still contribute considerably to antigen recognition and binding.

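Compared with other reconstruction algorithms, TRUST4 has these features:

  1. Accepts either FASTQ or BAM files as input
  2. Can reconstruct longer, even full-length, TCR or BCR sequences
  3. Faster and more sensitive

TRUST4 can also reconstruct repertoires from single-cell data, but today I mainly want to try it on bulk data.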

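1. Installation

git clone https://github.com/liulab-dfci/TRUST4.git
cd TRUST4
make
# I wanted to add TRUST4 to the PATH environment variable, but for some reason it kept failing,
# so I created a symlink to run-trust4 in the target folder instead
ln -s /home/user/myh/install/TRUST4/run-trust4 /home/user/myh/**/TRUST4_outs
cd /home/user/myh/**/TRUST4_outs
./run-trust4
# it works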
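
If the PATH approach is preferred over a symlink, the usual pattern looks like the sketch below. It assumes TRUST4 was built in /home/user/myh/install/TRUST4 (the directory used in the symlink command) and that the login shell is bash. It is a guess at a working setup, not a diagnosis of what failed originally.

```shell
# Assumption: TRUST4 was cloned and built here (path taken from the symlink command).
TRUST4_DIR="/home/user/myh/install/TRUST4"

# Prepend the TRUST4 directory to PATH for the current shell session.
export PATH="$TRUST4_DIR:$PATH"

# To keep it across sessions, append the same line to ~/.bashrc (assuming bash):
#   echo 'export PATH="/home/user/myh/install/TRUST4:$PATH"' >> ~/.bashrc

# Check whether the launcher is now found; prints its path or a hint.
command -v run-trust4 || echo "run-trust4 not found; check TRUST4_DIR and that make succeeded"
```

A common pitfall is running the export inside a script that is executed rather than sourced, in which case the change disappears when the script exits.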

2. Usage

Official usage

Usage: ./run-trust4 [OPTIONS]
    Required:
        -b STRING: path to bam file
        -1 STRING -2 STRING: path to paired-end read files
        -u STRING: path to single-end read file
        -f STRING: path to the fasta file coordinate and sequence of V/D/J/C genes
    Optional:
        --ref STRING: path to detailed V/D/J/C gene reference file, such as from IMGT database. (default: not used). (recommended) 
        -o STRING: prefix of output files. (default: inferred from file prefix)
        --od STRING: the directory for output files. (default: ./)
        -t INT: number of threads (default: 1)
        --barcode STRING: if -b, bam field for barcode; if -1 -2/-u, file containing barcodes (defaul: not used)
        --barcodeRange INT INT CHAR: start, end(-1 for lenght-1), strand in a barcode is the true barcode (default: 0 -1 +)
        --barcodeWhitelist STRING: path to the barcode whitelist (default: not used)
        --read1Range INT INT: start, end(-1 for length-1) in -1/-u files for genomic sequence (default: 0 -1)
        --read2Range INT INT: start, end(-1 for length-1) in -2 files for genomic sequence (default: 0 -1)
        --UMI STRING: if -b, bam field for UMI; if -1 -2/-u, file containing UMIs (default: not used)
        --umiRange INT INT CHAR: start, end(-1 for lenght-1), strand in a UMI is the true UMI (default: 0 -1 +)
        --mateIdSuffixLen INT: the suffix length in read id for mate. (default: not used)
        --skipMateExtension: do not extend assemblies with mate information, useful for SMART-seq (default: not used)
        --abnormalUnmapFlag: the flag in BAM for the unmapped read-pair is nonconcordant (default: not set)
        --noExtraction: directly use the files from provided -1 -2/-u to assemble (default: extraction first)
        --repseq: the data is from TCR-seq or BCR-seq (default: not set)
        --outputReadAssignment: output read assignment results to the prefix_assign.out file (default: no output)
        --stage INT: start TRUST4 on specified stage (default: 0)
            0: start from beginning (candidate read extraction)
            1: start from assembly
            2: start from annotation
            3: start from generating the report table

My data are from mouse, so I first tried a single sample (one pair of FASTQ files):

./run-trust4 -f /home/user/myh/install/TRUST4/mouse/GRCm38_bcrtcr.fa --ref /home/user/myh/install/TRUST4/mouse/mouse_IMGT+C.fa -1 /home/user/myh/raw_data/AEKIBULK/inputs/clean_data/KI_T/KIT11_1.clean.fq.gz -2 /home/user/myh/raw_data/AEKIBULK/inputs/clean_data/KI_T/KIT11_2.clean.fq.gz -o KIT11

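The number of threads can be set with -t.

When the log reaches this point, the run has finished.

My samples were sorted into T cells and B cells, yet running TRUST4 on the T cell data still reconstructs some BCR results. Hmm.

Note that by default TRUST4 does not create an output folder: all result files are written straight into the current directory (the --od option in the usage above sets a different output directory).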
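
To keep each sample's files together, the --od option from the official usage can be combined with a small wrapper. The sketch below reuses the paths from the command above. OUT_ROOT, THREADS, DRY_RUN and run_sample are hypothetical names introduced for illustration, and 8 threads is an arbitrary choice. By default it only prints the command, so it can be inspected before anything is run.

```shell
# Paths are the ones used in this post; OUT_ROOT, THREADS, DRY_RUN and
# run_sample are hypothetical names introduced for illustration.
TRUST4_DIR="/home/user/myh/install/TRUST4"
FQ_DIR="/home/user/myh/raw_data/AEKIBULK/inputs/clean_data/KI_T"
OUT_ROOT="./TRUST4_outs"
THREADS=8
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the command, 0 = actually run it

run_sample() {
  sample="$1"
  # Build the argument list once so the printed and executed commands match.
  set -- "$TRUST4_DIR/run-trust4" \
    -f "$TRUST4_DIR/mouse/GRCm38_bcrtcr.fa" \
    --ref "$TRUST4_DIR/mouse/mouse_IMGT+C.fa" \
    -1 "$FQ_DIR/${sample}_1.clean.fq.gz" \
    -2 "$FQ_DIR/${sample}_2.clean.fq.gz" \
    -o "$sample" --od "$OUT_ROOT/$sample" -t "$THREADS"
  if [ "$DRY_RUN" = 1 ]; then
    echo "$*"
  else
    # Create the directory up front in case TRUST4 does not create it itself.
    mkdir -p "$OUT_ROOT/$sample" && "$@"
  fi
}

run_sample KIT11
```

Setting DRY_RUN=0 before calling run_sample executes the command instead of printing it. Calling run_sample once per sample name extends this to a whole batch.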

XX_report.tsv contains the following information:

It can be loaded directly into immunarch.

An AIRR file is also generated, which immunarch can analyze as well.

  • "airr" - adaptive immune receptor repertoire (AIRR) data format. http://docs.airr-community.org/en/latest/datarep/overview.html

For the T cell results, I removed the BCR chains and then ran the downstream analysis with immunarch.
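
Dropping the BCR chains can be done with a one-line awk filter on the report file. The sketch below runs on a tiny made-up table. The column order (count, frequency, CDR3nt, CDR3aa, V, D, J, C, cid, cid_full_length) is my understanding of the TRUST4 README and should be checked against your own XX_report.tsv. The file names and rows are invented.

```shell
# Made-up two-clonotype report, for illustration only (not real data).
# Assumed columns: count frequency CDR3nt CDR3aa V D J C cid cid_full_length
{
  printf '#count\tfrequency\tCDR3nt\tCDR3aa\tV\tD\tJ\tC\tcid\tcid_full_length\n'
  printf '12\t0.6\tTGTGCCAGCAGTTTC\tCASSF\tTRBV13-2*01\t.\tTRBJ2-7*01\tTRBC2\tassemble0\t1\n'
  printf '8\t0.4\tTGTGCAAGATGG\tCARW\tIGHV1-26*01\t.\tIGHJ2*01\tIGHM\tassemble1\t0\n'
} > demo_report.tsv

# Keep the header plus rows whose V or J gene is a TCR gene (name starts with TR).
awk -F'\t' 'NR == 1 || $5 ~ /^TR/ || $7 ~ /^TR/' demo_report.tsv > demo_report.TCR.tsv

cat demo_report.TCR.tsv
```

The frequency column is left exactly as TRUST4 wrote it. The filter only removes rows.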

A few extra notes on analysis with VDJtools
After downloading VDJtools:

1. Basic analysis
1.1 CalcBasicStats

java -jar /home/user/myh/install/VDJtools/vdjtools-1.2.1/vdjtools-1.2.1.jar CalcBasicStats -m /home/user/myh/raw_data/AEKIBULK/vdjtools/inputs/metadata.txt /home/user/myh/raw_data/AEKIBULK/vdjtools/outs
# /path to vdjtools/: the VDJtools installation path
# output_prefix: the output path prefix

VDJtools input format
Note that rows whose CDR3aa is out_of_frame must be removed, otherwise VDJtools cannot parse the file.
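
The out_of_frame rows can be dropped while reshaping the report into a VDJtools-style table. The sketch below runs on made-up rows. It assumes the TRUST4 report starts with the columns count, frequency, CDR3nt, CDR3aa, V, D, J, and that VDJtools accepts a tab-separated table with the header count, freq, cdr3nt, cdr3aa, v, d, j. Both assumptions should be verified against your own files and the VDJtools documentation.

```shell
# Made-up rows for illustration only; the second clonotype is out of frame.
{
  printf '#count\tfrequency\tCDR3nt\tCDR3aa\tV\tD\tJ\tC\tcid\tcid_full_length\n'
  printf '12\t0.6\tTGTGCCAGCAGTTTC\tCASSF\tTRBV13-2*01\t.\tTRBJ2-7*01\tTRBC2\tassemble0\t1\n'
  printf '5\t0.25\tTGTGCCAGCAGTTT\tout_of_frame\tTRBV13-2*01\t.\tTRBJ2-7*01\tTRBC2\tassemble1\t0\n'
} > demo_trust4.tsv

# Write a VDJtools-style header (assumed layout), then copy the first seven
# columns of every data row whose CDR3aa (column 4) is not "out_of_frame".
awk -F'\t' -v OFS='\t' '
  NR == 1 { print "count", "freq", "cdr3nt", "cdr3aa", "v", "d", "j"; next }
  $4 != "out_of_frame" { print $1, $2, $3, $4, $5, $6, $7 }
' demo_trust4.tsv > demo_vdjtools.txt

cat demo_vdjtools.txt
```

The count and frequency values are copied unchanged, so frequencies are not renormalized after the out-of-frame rows are removed.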

1.2 CalcSegmentUsage

java -jar /home/user/myh/install/VDJtools/vdjtools-1.2.1/vdjtools-1.2.1.jar CalcSegmentUsage -p -f "group" -m /home/user/myh/raw_data/AEKIBULK/vdjtools/inputs/metadata.txt /home/user/myh/raw_data/AEKIBULK/vdjtools/outs 

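# -p: draw plots (depends on R packages)
# -f: the factor to group by; the grouping information lives in the metadata file
# --plot-type png: write the plots as PNG images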
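
The -m option expects a metadata table, and -f "group" refers to one of its columns. The sketch below builds such a file. The layout (tab-separated, a header line starting with #, file name and sample id as the first two columns) is how I recall the VDJtools documentation describing it, so treat it as an assumption. Only AE_T_5.txt appears in this post. KI_T_1.txt and the AE/KI group labels are hypothetical.

```shell
# Hypothetical metadata for two samples; only AE_T_5.txt appears in this post,
# KI_T_1.txt and the AE/KI group labels are invented for illustration.
{
  printf '#file.name\tsample.id\tgroup\n'
  printf 'AE_T_5.txt\tAE_T_5\tAE\n'
  printf 'KI_T_1.txt\tKI_T_1\tKI\n'
} > demo_metadata.txt

# Show which column of the header holds the "group" factor passed to -f.
head -n 1 demo_metadata.txt | tr '\t' '\n' | grep -n 'group'
```

Any extra column added to the header can then be named in -f in the same way.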

1.3 CalcSpectratype
Calculates spectratype, that is, histogram of read counts by CDR3 nucleotide length.

java -jar /home/user/myh/install/VDJtools/vdjtools-1.2.1/vdjtools-1.2.1.jar CalcSpectratype -a -m /home/user/myh/raw_data/AEKIBULK/vdjtools/inputs/metadata.txt /home/user/myh/raw_data/AEKIBULK/vdjtools/outs
#-a :Will use CDR3 amino acid sequences for calculation instead of nucleotide ones

1.4 PlotFancySpectratype
Plots a spectratype that also displays CDR3 lengths for the top N clonotypes in a given sample. This plot helps detect highly expanded clonotypes.

java -jar /home/user/myh/install/VDJtools/vdjtools-1.2.1/vdjtools-1.2.1.jar PlotFancySpectratype -t 5 /home/user/myh/raw_data/AEKIBULK/vdjtools/inputs/AE_T_5.txt /home/user/myh/raw_data/AEKIBULK/vdjtools/outs
# -t: number of top clonotypes to visualize; should not exceed 20, default is 10
# takes a single sample

For some reason the following command did not produce any output for me:

java -jar /home/user/myh/install/VDJtools/vdjtools-1.2.1/vdjtools-1.2.1.jar CalcPairwiseDistances -p -m /home/user/myh/raw_data/AEKIBULK/vdjtools/inputs/metadata.txt /home/user/myh/raw_data/AEKIBULK/vdjtools/outs
#-p: plot

For single-cell data:

./run-trust4 -b /home/user/myh/raw_data/***/possorted_genome_bam.bam -f /home/user/myh/install/TRUST4/human/hg38_bcrtcr.fa --ref /home/user/myh/install/TRUST4/human/human_IMGT+C.fa --barcode CB -o XXX
