1.数据准备
用10x数据做scVelo,输入数据分2部分:
- Seurat 分析后的数据: 聚类、基因、细胞信息。
- velocyto run 输出的loom文件,记录 spliced/unspliced 的RNA;
(1) Seurat 分析
假设你已经做好 Seurat 降维、分群、注释了。
velocyto 需要输入细胞的id。
一般多样本合并后的Seurat,需要按bam来源取子集,然后去掉cell id后缀再保存。参考一下几个文件:
> head(Cells(scObj_neu))
[1] "AAACCCAAGCCTCCAG-1_1" "AAACCCATCGGCATAT-1_1" "AAACGAATCCTGGGAC-1_1" "AAAGAACTCTTTACAC-1_1"
[5] "AAAGGATAGCGCAATG-1_1" "AAAGGATGTAGCTTTG-1_1"
tmp1=sapply(Cells(scObj_neu), function(x){
(strsplit(x, "_")[[1]])[1]
}, USE.NAMES = F, simplify = T)
head(tmp1)
# [1] "AAACCCAAGCCTCCAG-1" "AAACCCATCGGCATAT-1"
writeLines(tmp1, paste0("scVelo/velocyto/WT2_barcodes.tsv") )
在 shell 中查看:
$ head ../scVelo/velocyto/WT2_barcodes.tsv
AAACCCACACACAGCC-1
AAACCCACAGACCAGA-1
AAACGAAGTCTTCGAA-1
包装好的函数如下: 半成品,因为函数第一行需要获取的是细胞和bam文件的编号,这两个信息每个人估计有自己的命名,根据自己情况改吧。
#' get cell barcode from sce, split by sample0
#'
#' @param sce
#'
#' @return
#' @export
#'
#' @examples
getCid_list=function(sce){
a1=FetchData(sce, vars=c("cell", "sample0"))
#head(a1)
strsplit(a1$cell, "_")[[1]]
a1$cid=sapply(a1$cell, function(x){
(strsplit(x, "_")[[1]])[1]
}, USE.NAMES = F, simplify = T)
#head(a1)
#
sapply(
unique(a1$sample0),
function(x){
a1[which(a1$sample0==x),]$cid
}, USE.NAMES = T
)
}
# 测试1: 最终效果,就是一个cell id文件对应一个bam文件。
# APC是样本类型一样,但是来自不同的bam文件
sc_APC=subset(scObj_nue, origin=="APC")
sc_APC
# 默认按照 sample0 一列,也就是bam来源生成cell id 的list。
cid_APC=getCid_list(sc_APC)
str(cid_APC)
# 保存
lapply(names(cid_APC), function(x){
message(x, "\t", length(cid_APC[[x]]), "\t", head(cid_APC[[x]]) )
writeLines(cid_APC[[x]], paste0("./scVelo/velocyto/",x,"_barcodes.tsv") )
})
# 测试 2:最终效果,就是一个cell id文件对应一个bam文件。
writeLines(cid_WT[["WT1"]], "./202202newTag/scVelo/WT/WT1_barcodes.tsv")
writeLines(cid_WT[["WT2"]], "./202202newTag/scVelo/WT/WT2_barcodes.tsv")
(2) velocyto run 由bam输出loom文件(耗时)
http://velocyto.org/velocyto.py/tutorial/cli.html
Step 0: Constructing spliced and unspliced counts matrices
默认的 velocyto run10x 命令是使用的 filtered文件夹中的cell id,结果不理想。我还是喜欢原生的 velocyto run,更灵活,可控性更强。
需要准备几个输入文件
- cell id 文件,格式和可用代码见上文。
- 输出文件夹
- [可选] mask gtf 文件,一般从 UCSC 下载: expressed repeats annotation 猜测是染色体上过滤掉的区域。
- cell ranger 输出的bam文件
- 最后是 gtf 文件,和 cell range 用的gtf 基因组版本号要一致。
$ cd /home/wangjl/data/xx/scVelo/
$ velocyto run \
-@ 100 --samtools-memory 1000 \
-b WT/WT2_barcodes.tsv \
-o WT/WT2 \
-m ref/mouse_M25/mm10_rmsk.gtf \
cellranger/WT2/outs/possorted_genome_bam.bam \
ref/mouse_M25/gencode.vM25.annotation.gtf
#0:39-->02:01, 1.5h 只有600个细胞
WT/WT2/possorted_genome_bam_X3HXW.loom #22M
(3) 从 Seurat 输出数据
从Seurat对象导出细胞、基因、表达信息,供 scVelo 使用。
这是一个可用的函数,不用修改。
#' output Seurat obj info to a out dir/
#'
#' V0.1
#'
#' @param sce Seurat obj
#' @param outputRoot a dir
#'
#' @return nothing
#' @export
#'
#' @examples
Seurat2scVelo=function(sce, outputRoot){
message("output to:", outputRoot)
# save metadata table:
sce$barcode <- colnames(sce)
sce$UMAP_1 <- sce@[email protected][,1]
sce$UMAP_2 <- sce@[email protected][,2]
#write.csv(data.frame([email protected])[,c("barcode","seurat_clusters", "UMAP_1","UMAP_2")],
write.csv([email protected],
file=paste0(outputRoot, 'metadata.csv'), quote=F, row.names=F)
# write expression counts matrix
library(Matrix)
counts_matrix <- GetAssayData(sce, assay='RNA', slot='counts')
writeMM(counts_matrix, file=paste0(outputRoot, 'counts.mtx'))
# write dimesnionality reduction matrix, in this example case pca matrix
write.csv(sce@[email protected],
file=paste0(outputRoot, 'pca.csv'), quote=F, row.names=F)
# write gene names
write.table(
data.frame('gene'=rownames(counts_matrix)),
file=paste0(outputRoot, 'gene_names.csv'),
quote=F,row.names=F,col.names=F
)
return(invisible(NULL))
}
# 测试:
Seurat2scVelo(sc_WT, "202202newTag/scVelo/Seurat_dataset/WT/")
2. 使用 Python 预处理数据
使用 Jupyter notebook.
(1) 读取 Seurat 导出的数据
import os
os.chdir("/home/wangjl/data/neu/scRNA/202202newTag/scVelo/")
os.getcwd()
import scanpy as sc
import anndata
from scipy import io
from scipy.sparse import coo_matrix, csr_matrix
import numpy as np
import os
import pandas as pd
type="mergedWT"
# pip install anndata --force-reinstall
#anndata 0.7.8
#(1) load sparse matrix:
X = io.mmread(f"Seurat_dataset/{type}/counts.mtx")
# create anndata object
adata = anndata.AnnData(
X=X.transpose().tocsr()
)
#(2) load cell metadata:
cell_meta = pd.read_csv(f"Seurat_dataset/{type}/metadata.csv")
# load gene names:
with open(f"Seurat_dataset/{type}/gene_names.csv", 'r') as f:
gene_names = f.read().splitlines()
# set anndata observations and index obs by barcodes, var by gene names
adata.obs = cell_meta
adata.obs.index = adata.obs['barcode']
adata.var.index = gene_names
#(3) load dimensional reduction:
pca = pd.read_csv(f"Seurat_dataset/{type}/pca.csv")
pca.index = adata.obs.index
# set pca and umap
adata.obsm['X_pca'] = pca.to_numpy()
adata.obsm['X_umap'] = np.vstack((adata.obs['UMAP_1'].to_numpy(), adata.obs['UMAP_2'].to_numpy())).T
# plot a UMAP colored by sampleID to test:
sc.pl.umap(adata, color='seurat_clusters', frameon=False, save=True)
(2) 读取 loom 数据(unsplied/spliced data)
import scvelo as scv
import scanpy as sc
#import cellrank as cr
import numpy as np
import pandas as pd
import anndata as ad
import matplotlib.pyplot as plt
scv.settings.verbosity = 3
scv.settings.set_figure_params('scvelo', facecolor='white', dpi=100,
fontsize=6, color_map='viridis',
frameon=False)
#cr.settings.verbosity = 2
#adata = sc.read_h5ad('my_data.h5ad')
#adata=sc.read_loom(f'out/{type}.my_data.loom')
#adata.obsm['X_pca'] = pca.to_numpy()
#adata.obsm['X_umap'] = np.vstack((adata.obs['UMAP_1'].to_numpy(), adata.obs['UMAP_2'].to_numpy())).T
#adata
# 这里展示了多个样本的处理方式。
# load loom files for spliced/unspliced matrices for each sample:
sample_loom_1="./velocyto/WT1/possorted_genome_bam_HA9DP.loom"
ldata1 = scv.read( sample_loom_1, cache=True)
ldata1
sample_loom_2="./velocyto/WT2/possorted_genome_bam_SOJRI.loom"
ldata2 = scv.read( sample_loom_2, cache=True)
ldata2
(3) merge loom data
ldata1.obs.index[0:2]
加上cell id后缀,和Seurat 中的一致。
# WT_1 -1_1
barcodes = [bc.split(':')[1] for bc in ldata1.obs.index.tolist()]
barcodes = [bc[0:len(bc)-1] + '-1_1' for bc in barcodes]
ldata1.obs.index = barcodes
ldata1.obs.index[0:5]
# WT2 -1_2
barcodes = [bc.split(':')[1] for bc in ldata2.obs.index.tolist()]
barcodes = [bc[0:len(bc)-1] + '-1_2' for bc in barcodes]
ldata2.obs.index = barcodes
ldata2.obs.index[0:5]
ldata1.var.head()
ldata1.var_names_make_unique()
ldata2.var_names_make_unique()
合并2个loom文件
# merge
ldata = ldata1.concatenate([ldata2])
ldata
和Seurat数据合并
# merge matrices into the original adata object
adata = scv.utils.merge(adata, ldata)
adata
保存数据
adata.write_h5ad(f'data/{type}.adata_ldata.h5ad')
(4) scVelo 预处理
adata = sc.read(f'data/{type}.adata_ldata.h5ad')
adata
#scv.pp.filter_and_normalize(adata, min_shared_counts=5, min_shared_cells=3, log=True)
scv.pp.filter_and_normalize(adata)
可选步骤:
## clean some genes
import re
flag = [not bool(re.match('^Rp[ls]', i)) for i in adata.var_names]
adata = adata[:,flag]
#
adata = adata[:,adata.var_names != "Malat1"]
adata
3.scVelo
scv.pp.moments(adata, n_neighbors=30, n_pcs=30)
# this step will take a long while
import gc
gc.collect()
#
temp_pre= f"{type}_nue.in_process2"
if False==os.path.exists(f"{temp_pre}.velo.gz.h5ad"):
scv.tl.recover_dynamics(adata, var_names='all', n_jobs=64)
scv.tl.velocity(adata, mode='dynamical')
adata.write(f"{temp_pre}.velo.gz.h5ad", compression='gzip')
print(">>Write to file")
else:
adata = sc.read(f"{temp_pre}.velo.gz.h5ad", compression='gzip', ext="h5ad")
print(">>read from file")
scv.tl.velocity_graph(adata)
scv.tl.velocity_embedding(adata, basis="umap")
(2) 可视化
add color
## add colSet
adata.obs["cluster_names"] = adata.obs["cluster_names"].astype('category')
color = pd.read_csv(f"../merge/data/mergeR3.neutrophil.color.txt", #compression="gzip",
sep=" ", header=0, index_col=0)
#color.index=color.index.astype("str")
color.index=color["cluster_names"]
# 对数据 按分类因子 排序
adata.obs["cluster_names"].cat.reorder_categories(color["cluster_names"], inplace=True)
# 获取颜色列
color_used = list(color.loc[adata.obs["cluster_names"].cat.categories,"color"])
adata.uns['cluster_names_colors'] = color_used
adata
embedding
scv.pl.velocity_embedding(adata, basis = 'umap', title="WT",
save=f"{type}.embedding.pdf",
color="seurat_clusters")
grid
scv.settings.set_figure_params('scvelo', dpi=300, dpi_save=300)
fig, ax = plt.subplots()
ax.set_aspect(1)
scv.pl.velocity_embedding_grid(adata, basis='umap',color='cluster_names', title='WT',
arrow_size=1, arrow_length=2, arrow_color="#D2691E",
alpha=0.1,
#density=0.9,
legend_loc='right margin',legend_fontsize=5,
show=True,
save=f"{type}.grid.pdf", #figsize=(10,10),
xlim=[-10,10],ylim=[-10,10], ax=ax)
# 默认
scv.settings.set_figure_params('scvelo', dpi=300, dpi_save=300)
fig, ax = plt.subplots()
ax.set_aspect(1)
scv.pl.velocity_embedding_grid(adata, basis='umap',color='cluster_names', title='WT',
arrow_size=1, arrow_length=2, arrow_color="#D2691E",
alpha=0.1,
#density=0.9,
legend_loc='right margin',legend_fontsize=5,
show=True,
save=f"{type}.grid.pdf", #figsize=(10,10),
xlim=[-10,10],ylim=[-10,10], ax=ax)
stream
%matplotlib inline
import matplotlib.pyplot as plt
scv.settings.set_figure_params('scvelo', dpi=300, dpi_save=300)
#fig, ax = plt.subplots()
#ax.set_aspect(1)
scv.pl.velocity_embedding_stream(adata, basis='umap',color='cluster_names', title='WT',
#arrow_size=1, ##arrow_length=2,
#arrow_color="#D2691E",
#alpha=0.01, density=0.9,
legend_loc='right margin',legend_fontsize=5,
show=True,
save=f"{type}.stream.pdf")
#, #figsize=(10,10),
#xlim=[-10,10],ylim=[-10,10],
# ax=ax)
scv.pl.velocity_embedding_stream(adata, basis="umap", title='WT',
save=f"{type}.stream2.pdf",
color="cluster_names")
for each sample
# for each cancerType (only T)
for i in ['WT1','WT2']:
flag = [j == i for j in adata.obs.loc[:,'sample0']]
adata_sub = adata[flag,]
#flag2 = [j == "T" for j in adata_sub.obs.loc[:,'loc']]
#adata_sub = adata_sub[flag2,]
#
scv.settings.set_figure_params('scvelo', dpi=300, dpi_save=300)
fig, ax = plt.subplots()
ax.set_aspect(1)
scv.pl.velocity_embedding_grid(adata_sub, basis='umap',color='cluster_names', title=i,
arrow_size=1, arrow_length=2, arrow_color="#D2691E", alpha=0.1,
legend_loc='right margin',legend_fontsize=5,
density=0.8,
save=f"{type}_Sup.{i}.grid.pdf",# figsize=(10,9),
xlim=[-10,10],ylim=[-10,10], ax=ax)
latent time
scv.tl.latent_time(adata)
scv.pl.scatter(adata, color='latent_time',
save=f"{type}.latent_time.pdf",# figsize=(10,9),
color_map='gnuplot', size=80)
heatmap
top_genes = adata.var['fit_likelihood'].sort_values(ascending=False).index[:300]
scv.pl.heatmap(adata, var_names=top_genes, sortby='latent_time', col_color='cluster_names',
save=f"{type}.heatmap.pdf",# figsize=(10,9),
n_convolve=100)
Top-likelihood genes
# 查看top15的基因
#top_genes = adata.var['fit_likelihood'].sort_values(ascending=False).index
scv.pl.scatter(adata, basis=top_genes[:15], ncols=5,
save=f"{type}.Driver_genes.pdf",# figsize=(10,9),
frameon=False)
更多可视化,参考
- https://scvelo.readthedocs.io/
- https://scvelo.readthedocs.io/VelocityBasics/
- https://scvelo.readthedocs.io/DynamicalModeling/
- https://scvelo.readthedocs.io/DifferentialKinetics/
ref
https://www.jianshu.com/p/fb1cf5806912
https://www.jianshu.com/p/bfff8a4cf611