1. Background

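In 10X V5 transcriptome VDJ analysis, one special gene sequence fails to align to the reference genome, so the BAM file carries no barcode information for it. The workaround is to find this sequence in R2, use the read IDs to link each matching R2 read to the barcode and UMI in its R1 mate, deduplicate by UMI, count, and then map the counts onto the cell clustering plot to see the UMI expression level of the sequence.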

2. Approach

Based on the library structure:

Image source: https://blog.csdn.net/herokoking/article/details/103629141

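Use the target sequence as the alignment reference template and BLAST the R2 reads against it. From the hits, keep those in the 3'-to-5' orientation, because R2 in the 10X V5 transcriptome library is sequenced 3' to 5'. Then keep the subset whose alignment length is at least 85 nt. This gives the IDs of the matching reads in the R2 fastq file, and seqtk uses those IDs to extract the corresponding sequences from the R1 fastq file.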

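From the extracted R1 sequences, take nt 1 to 16 as the barcode and nt 17 to 28 as the UMI, then deduplicate on barcode plus UMI, count, and map the resulting labels onto the clusters.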
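To make the coordinates concrete, here is a minimal sketch of the barcode/UMI split on three made-up R1 sequences. The sequences and the `toy_*` file names are hypothetical; only the 1-16 and 17-28 positions come from the text above.

```shell
# Toy R1 sequences (hypothetical): 16 nt barcode + 12 nt UMI + 2 nt of remaining read
printf '%s\n' \
  'AAACCTGAGAAACCATTTGGCCAATTGGCC' \
  'AAACCTGAGAAACCATTTGGCCAATTGGCC' \
  'AAACCTGAGAAACCATACGTACGTACGTGG' \
  > toy_r1_seqs.txt

# Column 1: barcode (nt 1-16); column 2: UMI (nt 17-28, i.e. 12 nt starting at position 17)
awk '{ printf "%s\t%s\n", substr($0, 1, 16), substr($0, 17, 12) }' \
  toy_r1_seqs.txt > toy_barcode_umi.tsv

cat toy_barcode_umi.tsv
```

The first two rows carry the same barcode and UMI, which is the kind of duplicate the later deduplication step collapses.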

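Flowchart:

Sequence template ---> BLAST against R2 ---> filter & extract IDs ---> extract R1 sequences with seqtk ---> deduplicate on barcode + UMI ---> count ---> map labels onto clusters ---> visualize UMI expression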

3. Code

3.1 BLAST alignment

https://www.jianshu.com/p/415cd1658157

# Convert fastq to fasta, one sample at a time (a glob cannot be used as a redirect target)
for i in 1 2 3 4; do
  gzip -dc ../sample-${i}_L3_2.fq.gz | awk 'NR%4 == 1 {print ">" substr($0, 2)} NR%4 == 2 {print}' > ./sample-${i}_L3_2.fasta
done
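A quick way to check the awk conversion is to run it on a tiny hand-written FASTQ file. The read names, sequences, and `toy_*` file names below are made up for illustration.

```shell
# Two toy FASTQ records (hypothetical): header, sequence, '+', quality
printf '%s\n' \
  '@read1 extra' 'ACGTACGT' '+' 'IIIIIIII' \
  '@read2' 'TTTTCCCC' '+' 'IIIIIIII' \
  > toy_R2.fq

# Line 1 of each record becomes the FASTA header ('@' replaced by '>'),
# line 2 is the sequence; the '+' and quality lines are dropped.
awk 'NR % 4 == 1 { print ">" substr($0, 2) }
     NR % 4 == 2 { print }' toy_R2.fq > toy_R2.fasta

cat toy_R2.fasta
```

BLAST takes the first word of each FASTA header as the query ID, so `read1 extra` is reported as `read1` in the results.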


# Build the BLAST database
makeblastdb \
 -dbtype nucl \
 -in artificial.fasta \
 -input_type fasta \
 -parse_seqids \
 -out artificial.blastdb

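# Alignment
## The * below is a placeholder for the sample number (1-4) of the four R2 files; run the command once per sample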
blastn \
 -query sample-*_L3_2.fasta \
 -db artificial.blastdb \
 -out sample-*_results2.xls \
 -outfmt 6 \
 -num_threads 8

3.2 Filter the aligned R2 reads

# Keep hits with alignment length (column 4) >= 85 and sstart > send (columns 9, 10), sorted by length in descending order
sort -k4,4nr ../sample-1_results2.xls |awk '$4>=85 && $9>$10' > map_sample-1.xls
sort -k4,4nr ../sample-2_results2.xls |awk '$4>=85 && $9>$10' > map_sample-2.xls
sort -k4,4nr ../sample-3_results2.xls |awk '$4>=85 && $9>$10' > map_sample-3.xls
sort -k4,4nr ../sample-4_results2.xls |awk '$4>=85 && $9>$10' > map_sample-4.xls
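The filter relies on the default column order of BLAST tabular output (`-outfmt 6`): column 4 is the alignment length, and columns 9 and 10 are the subject start and end, so `$9 > $10` marks a hit on the minus strand of the template. A small sketch on a made-up table (read names, coordinates, and `toy_*` file names are hypothetical):

```shell
# Columns: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
# printf reuses the 12-field format for each group of 12 arguments, giving three rows.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  readA artificial 100.000 90 0 0 1 90 120 31 1e-40 167 \
  readB artificial 100.000 90 0 0 1 90 31 120 1e-40 167 \
  readC artificial 100.000 60 0 0 1 60 90 31 1e-25 111 \
  > toy_blast.tsv

# readA: long enough and minus strand  -> kept
# readB: long enough but plus strand   -> dropped
# readC: minus strand but only 60 nt   -> dropped
awk -F '\t' '$4 >= 85 && $9 > $10 { print $1 }' toy_blast.tsv > toy_ids.txt

cat toy_ids.txt
```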

3.3 Collect the R2 read IDs

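# Column 1 of the filtered tables is the read ID; sort -u drops IDs repeated across multiple hits
awk '{print $1}' map_sample-*.xls |sort -u >query_ids2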

3.3 Merge the R1 files to simplify the later lookup

gzip -dc sample*_1.fq.gz >sample_R1.fastq

3.4 Extract the R1 subset by R2 ID with seqtk

For seqtk installation, deployment, and usage, see:

https://www.jianshu.com/p/309b79238553

https://www.jianshu.com/p/2671198ae625

# Output in fastq format
seqtk subseq sample_R1.fastq query_ids2 >seqtk_result.tsv
# -t: tab-delimited output, which makes the later barcode extraction easier
seqtk subseq -t sample_R1.fastq query_ids2 >seqtk_result2.tsv
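For readers without seqtk at hand, the ID lookup can be imitated with plain awk. This is a stand-in written for illustration, not seqtk itself. The three-column layout (name, a placeholder `1`, sequence) is an assumption chosen so that the sequence sits in column 3, as the later `cut -f3` expects, and the reads and `toy_*` file names are made up.

```shell
# IDs of interest (hypothetical), one per line
printf '%s\n' 'read1' 'read3' > toy_query_ids.txt

# Three toy R1 records (hypothetical)
printf '%s\n' \
  '@read1 1:N:0' 'AAAACCCC' '+' 'IIIIIIII' \
  '@read2 1:N:0' 'GGGGTTTT' '+' 'IIIIIIII' \
  '@read3 1:N:0' 'ACGTACGT' '+' 'IIIIIIII' \
  > toy_R1.fastq

awk 'NR == FNR { want[$1] = 1; next }                        # first file: remember wanted IDs
     FNR % 4 == 1 { id = substr($1, 2); keep = (id in want) } # header: first word without "@"
     FNR % 4 == 2 && keep { printf "%s\t1\t%s\n", id, $0 }    # sequence line of a wanted read
    ' toy_query_ids.txt toy_R1.fastq > toy_subset.tsv

cat toy_subset.tsv
```

The match is on the first word of the header only, so it works when R1 and R2 share that word for each read pair.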

3.5 Deduplicate barcodes to build the group label (expresses the target sequence or not)

# Column 3 is the R1 sequence; its first 16 characters are the barcode
cut -f3 seqtk_result2.tsv |cut -c1-16 |sort -u >uniq_barcode.tsv

3.6 Intersect with the barcodes of the cell subpopulation of interest to filter out irrelevant sequences

cellType<-readRDS('cellType.rds')
write.table(gsub('-.*', '', rownames(cellType@meta.data)), file = 'cellType_barcode.tsv', sep = '\t', quote = F, col.names = F, row.names = F)

Linux:

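# Intersection; each list is deduplicated first so that a barcode repeated within one file is not mistaken for a shared one
comm -12 <(sort -u cellType_barcode.tsv) <(sort -u uniq_barcode.tsv) >cellType_intersect.tsv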
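A toy run of the intersection, with made-up four-letter barcodes and hypothetical `toy_*` file names. Deduplicating each list first matters because a barcode that repeats within one list (for example after the `-1`/`-2` suffix is stripped from merged samples) would otherwise look like a shared barcode when duplicates are simply counted.

```shell
# GGGT appears twice in the cell list but never in the hit list
printf '%s\n' 'AAAC' 'GGGT' 'GGGT' 'CCCG' > toy_cell_barcodes.txt
printf '%s\n' 'CCCG' 'TTTA' > toy_hit_barcodes.txt

# comm needs both inputs sorted in the same collation order
LC_ALL=C sort -u toy_cell_barcodes.txt > toy_cell.sorted
LC_ALL=C sort -u toy_hit_barcodes.txt > toy_hit.sorted

# -12 suppresses lines unique to either file, leaving only the common ones
LC_ALL=C comm -12 toy_cell.sorted toy_hit.sorted > toy_intersect.txt

cat toy_intersect.txt
```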

3.7 Deduplicate on barcode plus UMI

# The first 28 characters of the R1 sequence are barcode (16 nt) + UMI (12 nt)
cut -f3 seqtk_result2.tsv |cut -c1-28 |sort -u >uniq_barcode_umi.tsv

3.8 Count the target-sequence UMIs for each cell barcode

cat uniq_barcode_umi.tsv |awk '{hash[substr($1, 1, 16)]+=1}END{for (i in hash){printf("%s\t%d\n", i, hash[i])}}' >count_umi.tsv
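A minimal sketch of deduplication followed by counting, on made-up 28 nt barcode+UMI strings (the sequences and `toy_*` file names are hypothetical). The first two rows are identical and stand for PCR duplicates of one molecule, so they should count once.

```shell
printf '%s\n' \
  'AAACCTGAGAAACCATTTGGCCAATTGG' \
  'AAACCTGAGAAACCATTTGGCCAATTGG' \
  'AAACCTGAGAAACCATACGTACGTACGT' \
  'CCCTTTGAGAAACCATACGTACGTACGT' \
  > toy_bc_umi_raw.txt

# 1) collapse identical barcode+UMI strings, 2) count the remaining UMIs per 16 nt barcode,
# 3) sort the result because awk's "for (b in n)" order is unspecified
LC_ALL=C sort -u toy_bc_umi_raw.txt \
  | awk '{ n[substr($1, 1, 16)]++ } END { for (b in n) printf "%s\t%d\n", b, n[b] }' \
  | LC_ALL=C sort > toy_count_umi.tsv

cat toy_count_umi.tsv
```

Only exact matches are collapsed here; UMIs that differ by a sequencing error are counted as separate molecules.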

3.9 Map the group label and the UMI counts onto the cluster plot of interest (data visualization)

library(plyr)
library(Seurat)
library(ggplot2)

#read barcode group label
interBarcodes<-read.table('cellType_intersect.tsv', sep = '\t', header = F)

#read UMI label
umi_count<-read.table('count_umi.tsv', sep='\t', header = F, col.names = c('barcode', 'umi'))

#mapping group label
cellType$artificial<-ifelse(gsub('-.*', '', rownames(cellType@meta.data))%in%interBarcodes$V1, 
       'artificial_gene', 'no_artificial_gene')

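#mapping UMI label: barcodes found in umi_count are replaced by their count;
#all other barcodes stay as strings and turn into NA in as.numeric() (with a coercion warning)
cellType$artificial_gene<-mapvalues(gsub('-.*', '', rownames(cellType@meta.data)), as.character(umi_count$barcode), as.character(umi_count$umi), warn_missing = FALSE)
cellType$artificial_gene<-as.numeric(cellType$artificial_gene)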

#change the no expression UMI label from NA to 0
cellType$artificial_gene<-ifelse(is.na(cellType$artificial_gene), 0, cellType$artificial_gene)
head(cellType@meta.data)

# Visualize
DimPlot(cellType, group.by = 'artificial', reduction = 'tsne')
FeaturePlot(cellType, features = 'artificial_gene', reduction = 'tsne', label=T, cols = c('grey','red'), pt.size = 1)
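As a cross-check outside R, the per-cell UMI label (count if present, 0 otherwise) can be reproduced on the command line with `join`. This is a sketch on made-up data; the barcodes, counts, and `toy_*` file names are hypothetical.

```shell
TAB=$(printf '\t')

# Cell barcodes from the clustering object, and the barcode/UMI-count table
printf '%s\n' 'AAAC' 'CCCG' 'GGGT' | LC_ALL=C sort > toy_cells.sorted
printf 'CCCG\t3\nTTTA\t5\n' | LC_ALL=C sort > toy_counts.sorted

# -a 1     : also print cells that have no entry in the count table
# -o 0,2.2 : output the barcode and the count column
# -e 0     : fill the missing count with 0
LC_ALL=C join -t "$TAB" -a 1 -e 0 -o 0,2.2 toy_cells.sorted toy_counts.sorted > toy_cell_umi.tsv

cat toy_cell_umi.tsv
```

Barcodes that appear only in the count table (`TTTA` here) are dropped, which matches restricting the plot to cells present in the clustering object.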

References

https://blog.csdn.net/herokoking/article/details/103629141
https://www.jieandze1314.com/post/cnposts/scrna-2/
https://www.jianshu.com/p/309b79238553
https://www.jianshu.com/p/2671198ae625

[Single-cell transcriptome] Mapping sequence UMIs onto cell clusters