User-defined receptor-ligand datasets
Our system allows users to create their own lists of curated proteins and complexes. In order to do so, the format of the users’ lists must be compatible with the input files. Users can submit their lists using the Python package version of CellPhoneDB, and then send them via email, the cellphonedb.org form, or a pull request to the CellPhoneDB data repository (https://github.com/Teichlab/cellphonedb-data).
Database structure
Information is stored in an SQLite relational database (https://www.sqlite.org). SQLAlchemy (www.sqlalchemy.org) and Python 3 were used to build the database structure and the query logic. The application is designed so that analyses of potentially large count matrices can be performed in parallel. This requires an efficient database design, including optimisation of query times, indices and related strategies. All application code is open source and is available both on GitHub and on the web server.
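As a rough illustration of why indices matter for query time, the sketch below builds a tiny in-memory SQLite database with Python's standard-library sqlite3 module (not SQLAlchemy, which the application itself uses). The table names, column names and the second accession (PLACEHOLDER_B) are hypothetical and do not reflect the actual CellPhoneDB schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory SQLite database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE protein (
            id INTEGER PRIMARY KEY,
            uniprot TEXT NOT NULL UNIQUE
        );
        CREATE TABLE interaction (
            id INTEGER PRIMARY KEY,
            partner_a_id INTEGER NOT NULL REFERENCES protein(id),
            partner_b_id INTEGER NOT NULL REFERENCES protein(id)
        );
        -- Indices on the partner columns let lookups avoid a full table scan.
        CREATE INDEX idx_interaction_a ON interaction(partner_a_id);
        CREATE INDEX idx_interaction_b ON interaction(partner_b_id);
        """
    )
    conn.executemany(
        "INSERT INTO protein (id, uniprot) VALUES (?, ?)",
        [(1, "Q9Y219"), (2, "PLACEHOLDER_B")],  # second accession is made up
    )
    conn.execute("INSERT INTO interaction (partner_a_id, partner_b_id) VALUES (1, 2)")
    conn.commit()
    return conn


def partners_of(conn, uniprot):
    """Return the accessions of partner B for interactions whose partner A is `uniprot`."""
    rows = conn.execute(
        """
        SELECT pb.uniprot
        FROM interaction AS i
        JOIN protein AS pa ON pa.id = i.partner_a_id
        JOIN protein AS pb ON pb.id = i.partner_b_id
        WHERE pa.uniprot = ?
        """,
        (uniprot,),
    ).fetchall()
    return [row[0] for row in rows]
```

Calling partners_of(build_demo_db(), "Q9Y219") returns the single placeholder partner stored above.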
Statistical inference of receptor-ligand specificity (significance)
To assess cellular crosstalk between different cell types, we use our repository in a statistical framework for inferring cell–cell communication networks from single-cell transcriptome data. We predict enriched receptor–ligand interactions between two cell types based on expression of a receptor by one cell type and a ligand by another cell type, using scRNA-seq data. To identify the most relevant interactions between cell types, we look for the cell-type specific interactions between ligands and receptors. Only receptors and ligands expressed in more than a user-specified fraction of the cells in the specific cluster are retained for the analysis (the default threshold is 0.1, that is, 10% of the cells).
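The expression filter described above can be sketched in plain Python. This is a minimal illustration, assuming that "expressed" means a non-zero count and that the comparison is strict ("more than"), as the text states; the function names are hypothetical and this is not the CellPhoneDB implementation.

```python
def fraction_expressing(counts, labels, cluster):
    """Fraction of the cells labelled `cluster` with a non-zero count for one gene."""
    in_cluster = [c for c, lab in zip(counts, labels) if lab == cluster]
    if not in_cluster:
        return 0.0
    return sum(1 for c in in_cluster if c > 0) / len(in_cluster)


def passes_threshold(counts, labels, cluster, threshold=0.1):
    """True if the gene is expressed in more than `threshold` of the cluster's cells."""
    return fraction_expressing(counts, labels, cluster) > threshold
```

For example, a gene detected in 2 of 10 cells of a cluster passes the default threshold, whereas a gene detected in exactly 1 of 10 does not.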
We then perform pairwise comparisons between all cell types. First, we randomly permute the cluster labels of all cells (1,000 times by default) and determine the mean of the average receptor expression level in a cluster and the average ligand expression level in the interacting cluster. For each receptor–ligand pair in each pairwise comparison between two cell types, this generates a null distribution. By calculating the proportion of the permuted means that are as high as or higher than the actual mean, we obtain a p-value for the likelihood of cell-type specificity of a given receptor–ligand complex. We then prioritize interactions that are highly enriched between cell types based on the number of significant pairs, so that the user can manually select biologically relevant ones. For the multi-subunit heteromeric complexes, we require that all subunits of the complex are expressed (using a user-specified threshold), and therefore we use the member of the complex with the minimum average expression to perform the random shuffling.
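The permutation procedure above can be sketched as follows. This is a simplified, standard-library illustration for a single receptor–ligand pair and a single pair of clusters, not the CellPhoneDB code: the function names, the fixed random seed and the plain-list data layout are assumptions, and the rule that sets a mean to 0 when either partner is 0 is not applied here.

```python
import random


def cluster_mean(expr, labels, cluster):
    """Average expression of one gene over the cells labelled `cluster`."""
    values = [e for e, lab in zip(expr, labels) if lab == cluster]
    return sum(values) / len(values) if values else 0.0


def pair_mean(receptor, ligand, labels, receptor_cluster, ligand_cluster):
    """Mean of the receptor average in one cluster and the ligand average in the other."""
    r = cluster_mean(receptor, labels, receptor_cluster)
    l = cluster_mean(ligand, labels, ligand_cluster)
    return (r + l) / 2


def permutation_pvalue(receptor, ligand, labels, receptor_cluster, ligand_cluster,
                       n_perm=1000, seed=0):
    """Proportion of label permutations whose mean is as high as or higher than observed."""
    rng = random.Random(seed)
    observed = pair_mean(receptor, ligand, labels, receptor_cluster, ligand_cluster)
    shuffled = list(labels)
    as_high = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute the cluster labels across all cells
        permuted = pair_mean(receptor, ligand, shuffled, receptor_cluster, ligand_cluster)
        if permuted >= observed:
            as_high += 1
    return as_high / n_perm


def limiting_subunit(subunits, labels, cluster):
    """Name of the complex subunit with the lowest average expression in `cluster`."""
    return min(subunits, key=lambda name: cluster_mean(subunits[name], labels, cluster))
```

For a heteromeric complex, the expression vector of the subunit returned by limiting_subunit would stand in for the whole complex when calling permutation_pvalue.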
Cell subsampling for accelerating analyses (downsampling)
Technological developments and protocol improvements have enabled an exponential growth in the number of cells obtained from scRNA-seq experiments. Large-scale datasets can profile hundreds of thousands of cells, which presents a challenge for existing analysis methods in terms of both memory usage and runtime. To improve the speed and efficiency of our protocol and facilitate its broad accessibility, we integrated subsampling as described in Hie et al. (ref. 28). This "geometric sketching" approach aims to maintain the transcriptomic heterogeneity within a dataset with a smaller subset of cells. The subsampling step is optional, enabling users to perform the analysis either on all cells, or with other subsampling methods of their choice.
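To give an intuition for why sampling evenly across transcriptomic space preserves rare populations, the sketch below covers the cells with a fixed grid and draws cells box by box. This is a deliberately simplified stand-in, not the geometric sketching algorithm of Hie et al.: the fixed box_size parameter, the function name and the round-robin draw are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def grid_sketch(points, n_keep, box_size, seed=0):
    """Return sorted indices of a subsample drawn evenly across occupied grid boxes."""
    if n_keep >= len(points):
        return list(range(len(points)))
    rng = random.Random(seed)

    # Assign every cell (a tuple of low-dimensional coordinates) to a grid box.
    boxes = defaultdict(list)
    for idx, point in enumerate(points):
        key = tuple(int(coord // box_size) for coord in point)
        boxes[key].append(idx)
    for members in boxes.values():
        rng.shuffle(members)

    # Visit the occupied boxes in random order, taking one cell per box per round,
    # so that sparse regions contribute as much as dense ones.
    chosen = []
    while len(chosen) < n_keep:
        occupied = [key for key in boxes if boxes[key]]
        rng.shuffle(occupied)
        for key in occupied:
            chosen.append(boxes[key].pop())
            if len(chosen) == n_keep:
                break
    return sorted(chosen)
```

With 100 cells packed into one box and two isolated cells elsewhere, a sketch of size 3 keeps both isolated cells, whereas uniform random sampling would usually miss them.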
Table. Description of the output files means.csv, pvalues.csv, significant_means.csv and relevant_interactions.txt
Identifier | Definition | Output file | Example |
id_cp_interaction | Unique CellPhoneDB identifier for each interaction stored in the database. | means.csv; pvalues.csv; significant_means.csv | CPI-SS096F3E0F2 |
interacting_pair | Name of the interacting pairs separated by "|". | means.csv; pvalues.csv; significant_means.csv | JAG2|NOTCH4 |
partner A or B | Identifier for the first interacting partner (A) or the second (B). It could be: UniProt (prefix simple:) or complex (prefix complex:) | means.csv; pvalues.csv; significant_means.csv | simple:Q9Y219 |
gene A or B | Gene identifier for the first interacting partner (A) or the second (B). The identifier will depend on the input user list. | means.csv; pvalues.csv; significant_means.csv | ENSG00000184916 |
secreted | True if one of the partners is secreted. | means.csv; pvalues.csv; significant_means.csv | FALSE |
Receptor A or B | True if the first interacting partner (A) or the second (B) is annotated as a receptor in our database. | means.csv; pvalues.csv; significant_means.csv | FALSE |
annotation_strategy | Curated if the interaction was annotated by the CellPhoneDB developers. Otherwise, the name of the database where the interaction has been downloaded from. | means.csv; pvalues.csv; significant_means.csv | curated |
is_integrin | True if one of the partners is integrin. | means.csv; pvalues.csv; significant_means.csv | FALSE |
rank | Total number of significant p-values for each interaction divided by the number of cell type-cell type comparisons. | significant_means.csv | 0.25 |
means | Mean values for all the interacting partners: mean value refers to the total mean of the individual partner average expression values in the corresponding interacting pairs of cell types. If one of the mean values is 0, then the total mean is set to 0. | means.csv | 0.53 |
p.values | p-values for all the interacting partners: p.value refers to the enrichment of the interacting ligand-receptor pair in each of the interacting pairs of cell types. | pvalues.csv | 0.01 |
significant_mean | Significant mean calculation for all the interacting partners. If p.value < 0.05, the value will be the mean. Alternatively, the value is set to 0. | significant_means.csv | 0.53 |
relevant_interactions | Indicates if the interaction is relevant (1) or not (0). If a gene in the interaction is a DEG, and all the participants are expressed, the interaction will be classified as relevant. Alternatively, the value is set to 0. | relevant_interactions.txt | 1 or 0 |
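The definitions of means, significant_mean and rank in the table above can be written out as small functions. This is a sketch that follows the wording of the table only; the function names are hypothetical and the 0.05 cut-off is the one stated for significant_mean.

```python
def total_mean(mean_a, mean_b):
    """Mean of the two partners' average expression; 0 if either partner's mean is 0."""
    if mean_a == 0 or mean_b == 0:
        return 0.0
    return (mean_a + mean_b) / 2


def significant_mean(mean, p_value, alpha=0.05):
    """The mean if the p-value is below `alpha`, otherwise 0."""
    return mean if p_value < alpha else 0.0


def rank(p_values, alpha=0.05):
    """Number of significant p-values divided by the number of cell type comparisons."""
    return sum(1 for p in p_values if p < alpha) / len(p_values)
```

For an interaction tested in four cell type pairs with p-values 0.01, 0.2, 0.5 and 0.9, rank gives 1/4 = 0.25, matching the example value in the table.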
Table. Description of the output file deconvoluted.csv
Identifier | Definition | Output file | Example |
gene_name | Gene identifier for one of the subunits that are participating in the interaction defined in "means.csv" file. The identifier will depend on the input of the user list. | deconvoluted.csv | JAG2 |
uniprot | UniProt identifier for one of the subunits that are participating in the interaction defined in "means.csv" file. | deconvoluted.csv | Q9Y219 |
is_complex | True if the subunit is part of a complex. Single if it is not, complex if it is. | deconvoluted.csv | FALSE |
protein_name | Protein name for one of the subunits that are participating in the interaction defined in "means.csv" file. | deconvoluted.csv | JAG2_HUMAN |
complex_name | Complex name if the subunit is part of a complex. Empty if not. | deconvoluted.csv | a10b1 complex |
id_cp_interaction | Unique CellPhoneDB identifier for each of the interactions stored in the database. | deconvoluted.csv | CPI-SS0DB3F5A37 |
mean | Mean expression of the corresponding gene in each cluster. | deconvoluted.csv | 0.9 |
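if (!requireNamespace("devtools", quietly = TRUE))
install.packages("devtools") # needed for install_github()
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager") # needed to resolve Bioconductor dependencies
devtools::install_github('zktuong/ktplots', dependencies = TRUE)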
This function seems like it's the most popular so I moved it up! Please see below for alternative visualisation options.
Generates a dot plot after CellPhoneDB analysis via specifying the query celltypes and genes. The difference compared to the original cellphonedb plot is that this is totally customizable!
The plotting is largely determined by the format of the meta file provided to CellPhoneDB analysis.
For the split.by option to work, the annotation in the meta file must be defined in the format {split.by}_{idents}, that is, the grouping label and the cell-type label joined by an underscore.
To set up an example vector, it would be something like:
annotation <- paste0(kidneyimmune$Experiment, '_', kidneyimmune$celltype)
To run, you will need to load in the means.txt and pvalues.txt from the analysis. If you are using results from cellphonedb version 3, use relevant_interactions.txt in place of pvalues.txt and add version3 = TRUE to all the functions below.
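library(ktplots)
data(kidneyimmune) # example single-cell object bundled with ktplots
# pvals <- read.delim("pvalues.txt", check.names = FALSE)
# means <- read.delim("means.txt", check.names = FALSE)
# I've provided an example dataset
data(cpdb_output) # assumed to load the example `means` and `pvals` tables used below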
plot_cpdb(cell_type1 = 'B cell', cell_type2 = 'CD4T cell', scdata = kidneyimmune,
idents = 'celltype', # column name where the cell ids are located in the metadata
split.by = 'Experiment', # column name where the grouping column is. Optional.
means = means, pvals = pvals,
genes = c("XCR1", "CXCL10", "CCL5")) +
small_axis(fontsize = 3) + small_grid() + small_guide() + small_legend(fontsize = 2) # some helper functions included in ktplots to help with the plotting
plot_cpdb(cell_type1 = 'B cell', cell_type2 = 'CD4T cell', scdata = kidneyimmune,
idents = 'celltype', means = means, pvals = pvals, split.by = 'Experiment',
gene.family = 'chemokines') + small_guide() + small_axis() + small_legend(keysize=.5)
plot_cpdb(cell_type1 = 'B cell', cell_type2 = 'CD4T cell', scdata = kidneyimmune,
idents = 'celltype', means = means, pvals = pvals, split.by = 'Experiment',
gene.family = 'chemokines', col_option = "maroon", highlight = "blue") + small_guide() + small_axis() + small_legend(keysize=.5)
plot_cpdb(cell_type1 = 'B cell', cell_type2 = 'CD4T cell', scdata = kidneyimmune,
idents = 'celltype', means = means, pvals = pvals, split.by = 'Experiment',
gene.family = 'chemokines', col_option = viridis::cividis(50)) + small_guide() + small_axis() + small_legend(keysize=.5)
plot_cpdb(cell_type1 = 'B cell', cell_type2 = 'CD4T cell', scdata = kidneyimmune,
idents = 'celltype', means = means, pvals = pvals, split.by = 'Experiment',
gene.family = 'chemokines', noir = TRUE) + small_guide() + small_axis() + small_legend(keysize=.5)
plot_cpdb(cell_type1 = 'B cell', cell_type2 = 'CD4T cell', scdata = kidneyimmune,
idents = 'celltype', means = means, pvals = pvals, split.by = 'Experiment',
gene.family = 'chemokines', default_style = FALSE) + small_guide() + small_axis() + small_legend(keysize=.5)
data(cpdb_output2) # assumed to load means2, pvals2, decon2 and interaction_annotation used below
sce <- Seurat::as.SingleCellExperiment(kidneyimmune)
p <- plot_cpdb2(cell_type1 = 'B cell', cell_type2 = 'CD4T cell',
scdata = sce,
idents = 'celltype', # column name where the cell ids are located in the metadata
means = means2,
pvals = pvals2,
deconvoluted = decon2, # new options from here on specific to plot_cpdb2
desiredInteractions = list(
c('CD4T cell', 'B cell'),
c('B cell', 'CD4T cell')),
interaction_grouping = interaction_annotation,
edge_group_colors = c(
"Activating" = "#e15759",
"Chemotaxis" = "#59a14f",
"Inhibitory" = "#4e79a7",
"Intracellular trafficking" = "#9c755f",
"DC_development" = "#B07aa1",
"Unknown" = "#e7e7e7"),
node_group_colors = c(
"CD4T cell" = "red",
"B cell" = "blue"),
keep_significant_only = TRUE,
standard_scale = TRUE,
remove_self = TRUE)
p # print the plot
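# code example but not using the example datasets
library(SingleCellExperiment)
library(reticulate) # provides import() for calling Python modules from R
ad <- import('anndata') # requires the Python anndata package to be installed
adata <- ad$read_h5ad('rna.h5ad')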
counts <- Matrix::t(adata$X)
row.names(counts) <- row.names(adata$var)
colnames(counts) <- row.names(adata$obs)
sce <- SingleCellExperiment(list(counts = counts), colData = adata$obs, rowData = adata$var)
means <- read.delim('out/means.txt', check.names = FALSE)
pvalues <- read.delim('out/pvalues.txt', check.names = FALSE)
deconvoluted <- read.delim('out/deconvoluted.txt', check.names = FALSE)
interaction_grouping <- read.delim('interactions_groups.txt')
# > head(interaction_grouping)
# interaction role
# 1 ALOX5_ALOX5AP Activating
# 2 ANXA1_FPR1 Inhibitory
# 3 BTLA_TNFRSF14 Inhibitory
# 4 CCL5_CCR5 Chemotaxis
# 5 CD2_CD58 Activating
# 6 CD28_CD86 Activating
test <- plot_cpdb2(cell_type1 = "CD4_Tem|CD4_Tcm|CD4_Treg", # same usage style as plot_cpdb
cell_type2 = "cDC",
idents = 'fine_clustering',
split.by = 'treatment_group_1',
scdata = sce,
means = means,
pvals = pvalues,
deconvoluted = deconvoluted, # new options from here on specific to plot_cpdb2
gene_symbol_mapping = 'index', # column name in rowData holding the actual gene symbols if the row names is ENSG Ids. Might be a bit buggy
desiredInteractions = list(c('CD4_Tcm', 'cDC1'), c('CD4_Tcm', 'cDC2'), c('CD4_Tem', 'cDC1'), c('CD4_Tem', 'cDC2'), c('CD4_Treg', 'cDC1'), c('CD4_Treg', 'cDC2')), # names must match the cell type labels exactly (no stray spaces)
interaction_grouping = interaction_grouping,
edge_group_colors = c("Activating" = "#e15759", "Chemotaxis" = "#59a14f", "Inhibitory" = "#4e79a7", "Intracellular trafficking" = "#9c755f", "DC_development" = "#B07aa1"),
node_group_colors = c("CD4_Tcm" = "#86bc86", "CD4_Tem" = "#79706e", "CD4_Treg" = "#ff7f0e", "cDC1" = "#bcbd22" ,"cDC2" = "#17becf"),
keep_significant_only = TRUE,
standard_scale = TRUE,
remove_self = TRUE)
file <- system.file("extdata", "covid_cpdb.tar.gz", package = "ktplots")
# copy and unpack wherever you want this to end up
file.copy(file, ".")
system("tar -xzf covid_cpdb.tar.gz")
compare_cpdb requires 1) an input table like so:
sample_id cellphonedb_folder sce_file
1 MH9143325 covid_cpdb/MH9143325/out covid_cpdb/MH9143325/sce.rds
2 MH9143320 covid_cpdb/MH9143320/out covid_cpdb/MH9143320/sce.rds
3 MH9143274 covid_cpdb/MH9143274/out covid_cpdb/MH9143274/sce.rds
4 MH8919226 covid_cpdb/MH8919226/out covid_cpdb/MH8919226/sce.rds
5 MH8919227 covid_cpdb/MH8919227/out covid_cpdb/MH8919227/sce.rds
6 newcastle49 covid_cpdb/newcastle49/out covid_cpdb/newcastle49/sce.rds
7 MH9179822 covid_cpdb/MH9179822/out covid_cpdb/MH9179822/sce.rds
8 MH8919178 covid_cpdb/MH8919178/out covid_cpdb/MH8919178/sce.rds
9 MH8919177 covid_cpdb/MH8919177/out covid_cpdb/MH8919177/sce.rds
10 MH8919176 covid_cpdb/MH8919176/out covid_cpdb/MH8919176/sce.rds
11 MH8919179 covid_cpdb/MH8919179/out covid_cpdb/MH8919179/sce.rds
12 MH9179826 covid_cpdb/MH9179826/out covid_cpdb/MH9179826/sce.rds
where each row gives a sample ID, the location of the out folder generated by cellphonedb for that sample, and the path to the single-cell object used to generate the cellphonedb result. If cellphonedb was run with a .h5ad file, the sce_file would be the path to that .h5ad file.
and 2) a sample metadata data frame for the tests to run:
sample_id Status_on_day_collection_summary
MH9143325 MH9143325 Severe
MH9143320 MH9143320 Severe
MH9143274 MH9143274 Severe
MH8919226 MH8919226 Healthy
MH8919227 MH8919227 Healthy
newcastle49 newcastle49 Severe
MH9179822 MH9179822 Severe
MH8919178 MH8919178 Healthy
MH8919177 MH8919177 Healthy
MH8919176 MH8919176 Healthy
MH8919179 MH8919179 Healthy
MH9179826 MH9179826 Severe
Key point: if I want to compare Severe vs Healthy, I would specify the function as follows:
## set up the levels
covid_sample_metadata$Status_on_day_collection_summary <- factor(covid_sample_metadata$Status_on_day_collection_summary, levels = c('Healthy', 'Severe'))
out <- compare_cpdb(cpdb_meta = covid_cpdb_meta,
sample_metadata = covid_sample_metadata,
celltypes = c("B_cell", "CD14", "CD16", "CD4", "CD8", "DCs", "MAIT", "NK_16hi", "NK_56hi", "Plasmablast", "Platelets", "Treg", "gdT", "pDC"), # the actual celltypes you want to test
celltype_col = "initial_clustering", # the column that holds the cell type annotation in the sce object
groupby = "Status_on_day_collection_summary") # the column in the sample_metadata that holds the column where you want to do the comparison. In this example, it's Severe vs Healthy
This returns a list of dataframes (one for each contrast found), which you can use to plot the results.
plot_compare_cpdb is a simple function to achieve that but you can always just make a custom plotting function based on what you want.
plot_compare_cpdb(out) # red is significantly increased in Severe compared to Healthy.
If there are multiple contrasts and groups, you can facet the plot by specifying groups = c('group1', 'group2')
# let's mock up a new contrast like this
covid_sample_metadata$Status_on_day_collection_summary <- c(rep('Severe', 3), rep('Healthy', 2), rep('notSevere', 2), rep('Healthy', 4), 'notSevere')
out <- compare_cpdb(cpdb_meta = covid_cpdb_meta,
sample_metadata = covid_sample_metadata,
celltypes = c("B_cell", "CD14", "CD16", "CD4", "CD8", "DCs", "MAIT", "NK_16hi", "NK_56hi", "Plasmablast", "Platelets", "Treg", "gdT", "pDC"), # the actual celltypes you want to test
celltype_col = "initial_clustering", # the column that holds the cell type annotation in the sce object
groupby = "Status_on_day_collection_summary")
plot_compare_cpdb(out, alpha = .1, groups = names(out)) # there's no significant hit at 0.05 in this dummy example
The default method uses a pairwise wilcox.test. Alternatives are pairwise Welch's t.test or a linear mixed model with lmer.
To run the linear mixed effect model, it expects that the input data is paired (i.e. an individual with multiple samples corresponding to multiple timepoints):
# just as a dummy example, lets say the samples are matched where there are two samples per individual
covid_sample_metadata$individual <- rep(c("A", "B", "C", "D", "E", "F"), 2)
# actually run it
out <- compare_cpdb(cpdb_meta = covid_cpdb_meta,
sample_metadata = covid_sample_metadata,
celltypes = c("B_cell", "CD14", "CD16", "CD4", "CD8", "DCs", "MAIT", "NK_16hi",
"NK_56hi", "Plasmablast", "Platelets", "Treg", "gdT", "pDC"),
celltype_col = "initial_clustering",
groupby = "Status_on_day_collection_summary",
formula = "~ Status_on_day_collection_summary + (1|individual)", # formula passed to lmer
method = "lmer")
plot_compare_cpdb(out, contrast = 'Status_on_day_collection_summarySevere') # use the colnames(out) to pick the right column.
Specifying cluster = TRUE will move the rows and columns to make it look a bit more ordered.
plot_compare_cpdb(out, contrast = 'Status_on_day_collection_summarySevere', cluster = TRUE)
# Note, this conflicts with tidyr devel version
geneDotPlot(scdata = kidneyimmune, # object
genes = c("CD68", "CD80", "CD86", "CD74", "CD2", "CD5"), # genes to plot
idents = "celltype", # column name in meta data that holds the cell-cluster ID/assignment
split.by = 'Project', # column name in the meta data that you want to split the plotting by. If not provided, it will just plot according to idents
standard_scale = TRUE) + # whether to scale expression values from 0 to 1. See ?geneDotPlot for other options
theme(strip.text.x = element_text(angle=0, hjust = 0, size =7)) + small_guide() + small_legend()
scRNAseq <- Seurat::SCTransform(scRNAseq, verbose = FALSE) %>% Seurat::RunPCA(., verbose = FALSE) %>% Seurat::RunUMAP(., dims = 1:30, verbose = FALSE)
anchors <- Seurat::FindTransferAnchors(reference = scRNAseq, query = spatial, normalization.method = "SCT")
predictions.assay <- Seurat::TransferData(anchorset = anchors, refdata = scRNAseq$label, dims = 1:30, prediction.assay = TRUE, weight.reduction = spatial[["pca"]])
spatial[["predictions"]] <- predictions.assay
Seurat::DefaultAssay(spatial) <- "predictions"
Seurat::DefaultAssay(spatial) <- 'SCT'
pa <- Seurat::SpatialFeaturePlot(spatial, features = c('Tnfsf13b', 'Cd79a'), pt.size.factor = 1.6, ncol = 2, crop = TRUE) + viridis::scale_fill_viridis()
Seurat::DefaultAssay(spatial) <- 'predictions'
pb <- Seurat::SpatialFeaturePlot(spatial, features = 'Group1-3', pt.size.factor = 1.6, ncol = 2, crop = TRUE) + viridis::scale_fill_viridis()
p1 <- correlationSpot(spatial, genes = c('Tnfsf13b', 'Cd79a'), celltypes = 'Group1-3', pt.size.factor = 1.6, ncol = 2, crop = TRUE) + scale_fill_gradientn(colors = rev(RColorBrewer::brewer.pal(11, 'Spectral')), limits = c(-1, 1)) # the Spectral palette has at most 11 colours
p2 <- correlationSpot(spatial, genes = c('Tnfsf13b', 'Cd79a'), celltypes = 'Group1-3', pt.size.factor = 1.6, ncol = 2, crop = TRUE, average_by_cluster = TRUE) + scale_fill_gradientn(colors = rev(RColorBrewer::brewer.pal(11, 'Spectral')), limits = c(-1, 1)) + ggtitle('correlation averaged across clusters')
cowplot::plot_grid(pa, pb, p1, p2, ncol = 2)
features <- c("CD79A", "MS4A1", "CD8A", "CD8B", "LYZ", "LGALS3", "S100A8", "GNLY", "NKG7", "KLRB1", "FCGR3A", "FCER1A", "CST3")
StackedVlnPlot(kidneyimmune, features = features) + theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 8))
rainCloudPlot(data = kidneyimmune@meta.data, groupby = "celltype", parameter = "n_counts") + coord_flip()