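Hello everyone. Today we look at a very good tool for cell-cell communication analysis. It goes a step beyond earlier communication tools by constructing signaling networks de novo from single-cell data. Let's work through it and see what the software does and where it applies.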
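Many cell-cell communication tools have already been covered in this series, and each has its own strengths and weaknesses. Earlier posts are listed here for anyone interested:
10X single-cell (10X spatial transcriptomics) communication analysis with NicheNet
10X single-cell (10X spatial transcriptomics) communication analysis: multi-sample differential communication analysis with CellChat
10X single-cell (10X spatial transcriptomics) communication analysis with CellChat
10X single-cell communication analysis with scMLnet (network analysis of ligand-receptor pairs, TFs, and differentially expressed (target) genes)
10X single-cell cell communication series: Connectome
10X single-cell communication analysis with CrosstalkR (changes in both specificity and communication strength matter)
10X single-cell communication analysis with ICELLNET
10X spatial transcriptomics communication analysis, part 3
Spatial communication analysis, part 2
How to approach cell communication with 10X spatial transcriptomics
Cell communication software RNAMagnet
NATMI, a cell communication analysis tool for single-cell data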
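Now for today's topic. The paper is "CytoTalk: De novo construction of signal transduction networks using single-cell RNA-Seq data", published this year in Science Advances (impact factor around 13). We first go through the paper and then look at the example code.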
Abstract
Single-cell technology has opened the door for studying signal transduction in a complex tissue at unprecedented resolution. However, there is a lack of analytical methods for de novo construction of signal transduction pathways using single-cell omics data. (The problem is stated right at the start: de novo construction of signal transduction pathways.) The authors therefore developed a new method, CytoTalk:
- CytoTalk first constructs intracellular and intercellular gene-gene interaction networks (the intercellular part refers to ligand-receptor pairs) using an information-theoretic measure between two cell types.
- Candidate signal transduction pathways in the integrated network are identified using the prize-collecting Steiner forest algorithm. (This is the pathway identification step; we look at the algorithm in the Methods.)
We applied CytoTalk to a single-cell RNA-Seq data set on mouse visual cortex and evaluated predictions using high-throughput spatial transcriptomics data generated from the same tissue. (Note that both single-cell and spatial transcriptomics data are used here.) Compared to published methods, genes in our inferred signaling pathways have significantly higher spatial expression correlation only in cells that are spatially closer to each other, suggesting improved accuracy of CytoTalk. (A good result: the selected ligand-receptor pairs show clear spatial locality, with communication taking place between neighboring regions.) Furthermore, using single-cell RNA-Seq data with receptor gene perturbation, we found that predicted pathways are enriched for differentially expressed genes between the receptor knockout and wild type cells, further validating the accuracy of CytoTalk. (We come back to this in the Results.) In summary, CytoTalk enables de novo construction of signal transduction pathways and facilitates comparative analysis of these pathways across tissues and conditions.
Introduction (key points)
- Signal transduction is the primary mechanism for cell-cell communication
- Signaling pathways are highly dynamic and crosstalk among them is prevalent.
The key point: Due to these two features, simply examining expression levels of ligand and receptor genes cannot reliably capture the overall activities of signaling pathways and interactions among them. The paper mentions NicheNet here, which performs network analysis of ligand-receptor pairs and target genes. The limitation of such methods: However, these methods are based on known annotations of signaling pathways.
To our knowledge, currently no method exists to perform de novo prediction of the entire signal transduction pathways emanating from the ligand-receptor pairs. (The idea looks similar to the one in my earlier post, "10X single-cell communication analysis with scMLnet (network analysis of ligand-receptor pairs, TFs, and differentially expressed (target) genes)".)
Here we describe the CytoTalk algorithm for de novo construction of signaling network (union of multiple signaling pathways) between two cell types using scRNA-Seq data.
- The algorithm first constructs an integrated network consisting of intracellular and intercellular functional gene interactions.
- It then identifies the signaling network by solving a prize-collecting Steiner forest problem. (This term is introduced in the Methods.)
- We demonstrate the performance of the algorithm using high-throughput spatial transcriptomics data and scRNA-Seq data with perturbation to the receptor genes in a signaling pathway.
Results
Result 1: Wiring of signaling pathways is highly cell type-dependent
A hallmark of signal transduction pathways is their high level of cell-type specific wiring pattern. ("Hallmark" should be a familiar word by now.) Single-cell transcriptome data allows us to examine the cell type-specific activity of individual signaling pathways beyond just ligand and receptor genes. (Note: pathway activity can be scored by enrichment, but how strongly a pathway is expressed is itself regulated by ligand-receptor signaling.) To this end, we examined the canonical fibroblast growth factor receptor 2 (FGFR2) signaling pathway in two tissue types, mammary gland and skin. (Reading papers is good for your English, too.) We will not dwell on the physiology and instead focus on what the software offers. What matters is that activation of certain receptors up-regulates certain pathway genes, which in turn changes biological function.
For a public single-cell dataset that has already been annotated, the authors compute an expression specificity score, the preferential expression measure (PEM) (see the paper's Methods for how PEM is computed), for each pathway gene in each involved cell type. They find that four canonical sub-pathways downstream of the same receptor (FGFR2) show striking cell type-specific activity, depending on the cell types involved. In other words, the same receptor can activate different signaling pathways in different cell types. The PI3K/AKT pathway is most active for signaling between fibroblasts and luminal epithelial cells in the mammary gland. In contrast, the JAK-STAT pathway is most active for signaling between keratinocyte stem cells and basal cells in skin.
To evaluate the extent of cell type-specific wiring of signaling pathways, we examined all manually annotated signaling pathways in the Reactome database. For each pathway, we computed its cell type-specific activity score. We found that the majority of pathways exhibit high degree of cell type-specific activities. (This is what one would expect, so it is not much of a new finding.)
This is true even for the same cell types but located in different tissues. (This point deserves particular attention.) In summary, these results highlight the need for analytical tools for de novo construction of complete signaling pathways (instead of ligand-receptor pairs) using single-cell transcriptome data. (Agreed.)
Result 2: Overview of the CytoTalk algorithm (key points)
CytoTalk is designed for de novo construction of a signal transduction network between two cell types, which is defined as the union of multiple signal transduction pathways.
- It first constructs a weighted integrated gene network comprised of both intracellular and intercellular functional gene-gene interactions (the intercellular part being the ligand-receptor network). Intracellular functional gene interactions are computed and weighted using mutual information between two genes. Two intracellular networks are connected via crosstalk edges. Ligand-receptor pairs with higher cell-type-specific gene expression but lower correlated expression within the same cell type (thus more likely to be involved in crosstalk instead of self talk) are assigned higher crosstalk weights. (This is worth understanding properly: if a ligand and its receptor are each strongly cell-type-specific but are not co-expressed within a single cell type, the pair most likely signals between cell types rather than within one, so giving it a higher weight is reasonable.)
- Nodes in the integrated network are weighted by combining their cell-type-specific gene expression with their closeness to the ligand/receptor genes in the network. (Quite a few algorithms are involved.) We use a network propagation procedure to determine the closeness of a gene to the ligand/receptor gene.
- With the integrated network as the input, we formulate the identification of signaling network as a prize-collecting Steiner forest (PCSF) problem. (This may be unfamiliar; the paper "PRODIGY: personalized prioritization of driver genes" is a useful reference.) The rationale for using the PCSF algorithm is to find an optimal subnetwork that includes genes with high levels of cell-type-specific expression that are closely connected to highly scored ligand-receptor pairs. (This is the part to remember.) This optimal subnetwork is defined as the signaling network between the two cell types.
- The statistical significance of the candidate signaling network is computed using a null score distribution of signaling networks generated using degree-preserving randomized networks. (This is the significance test; it is worth reading closely in the Methods.)
Result 3: Performance evaluation using spatial transcriptomics data (mouse cortex data)
We identified signaling networks between the three pairs of cell types, endothelial-microglia (EndoMicro), endothelial-astrocyte (EndoAstro) and astrocyte-neuron (AstroNeuro), respectively. The predicted cell-type-specific signaling networks consist of 481, 404, and 1051 genes and involve 51, 44, and 35 ligand-receptor interactions (crosstalk edges), respectively. Compared to PCSFs identified using 1000 randomized input networks (a permutation-style test), all predicted signaling networks have significantly smaller objective function scores and larger fractions of crosstalk edges (empirical p-values < 0.001).
Several predicted ligand-receptor pairs are known to mediate signal transduction between the three cell types.
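To make the "empirical p-value from 1000 randomized networks" idea concrete, here is a minimal Python sketch. It assumes a one-sided test in which a smaller objective score is better, and it uses the common +1 correction; whether CytoTalk uses that exact correction is not stated in the text above. The function name and all numbers are hypothetical.

```python
def empirical_p_value(observed, null_scores, smaller_is_better=True):
    """One-sided empirical p-value with a +1 correction (an assumption here)."""
    if smaller_is_better:
        hits = sum(1 for s in null_scores if s <= observed)
    else:
        hits = sum(1 for s in null_scores if s >= observed)
    return (hits + 1) / (len(null_scores) + 1)


# Hypothetical objective scores from 1000 randomized networks (not paper data).
null_scores = [100.0 + 0.01 * i for i in range(1000)]

# An observed score below every null score gives the smallest possible p-value.
p_small = empirical_p_value(95.0, null_scores)
# An observed score above every null score gives p = 1.
p_large = empirical_p_value(200.0, null_scores)
```

With 1000 null networks, the smallest value this estimator can return is 1/1001, which is just under 0.001.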
Next, spatial data are brought in, so the evaluation takes the distance between cells into account.
Our rationale is that cells that are close together are more likely to signal to each other. (The same reasoning applies to 10X spatial transcriptomics.) Therefore, signaling pathway genes are expected to have higher spatial expression correlation in these cells than cells that are further apart.
First, the comparison between methods.
We first asked what fractions of the predicted ligand-receptor pairs are shared among the six methods. We reason that a more accurate method will have on average a larger fraction of overlapped predictions with all other methods. (By this criterion, the authors' tool comes out best.)
The spatial analysis then shows that neighboring cells are more likely to communicate and distant cells less so, while the other methods show this pattern less clearly.
However, pathways predicted by NicheNet and SoptSC also show significantly larger PCCs compared to random gene pairs among intermediate and distant cell pairs, suggesting that those predictions are false positive predictions.
Taken together, these results demonstrate that CytoTalk has significant improvement over published methods.
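Result 4: Performance evaluation using scRNA-Seq data without receptor gene expression (the receptor gene is knocked out)
In this setting the authors report newly predicted signaling pathways, and CytoTalk again shows the highest prediction accuracy among the compared methods.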
Discussion
We introduce a computational method, CytoTalk, for the construction of cell-type-specific signal transduction pathways using scRNA-Seq data. The inputs to CytoTalk are scRNA-Seq data and known ligand-receptor interactions. Unlike previous methods using known pathway annotations, CytoTalk constructs full pathways.
In short, the authors report that it performs well.
In summary, CytoTalk provides a much-needed means for de novo construction of complete cell-type-specific signaling pathways. Comparative analysis of signaling pathways will lead to a better understanding of cell-cell communication in healthy and diseased tissues.
Method
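1. Construction of intracellular functional gene interaction network
This is a gene co-expression network built from pairwise relationships between genes, weighted by mutual information as described in the overview. The algorithm may be unfamiliar, so it is worth looking up.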
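As a rough illustration of a mutual information edge weight between two genes, here is a small Python sketch. It discretizes each gene's expression into equal-width bins and applies the plug-in estimator; the binning scheme, the bin count, and the function names are assumptions for illustration, since the text above does not say which estimator CytoTalk uses.

```python
import math
from collections import Counter


def discretize(values, n_bins=3):
    """Equal-width binning (an assumption; other estimators are possible)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(x, y, n_bins=3):
    """Plug-in mutual information (in nats) between two expression vectors."""
    bx, by = discretize(x, n_bins), discretize(y, n_bins)
    n = len(bx)
    px, py, pxy = Counter(bx), Counter(by), Counter(zip(bx, by))
    mi = 0.0
    for (a, b), count in pxy.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi


# Hypothetical expression of two genes across six cells of one cell type.
gene_a = [0.0, 0.1, 1.0, 1.1, 2.0, 2.1]
gene_b = [2 * v for v in gene_a]           # perfectly dependent on gene_a
mi_dependent = mutual_information(gene_a, gene_b)

# Two genes whose binned values are statistically independent.
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1], n_bins=2)
```

In this toy case the dependent pair reaches the maximum possible value for three equally filled bins, while the independent pair scores zero.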
2. Crosstalk score of a ligand-receptor pair between two cell types
The paper defines a crosstalk score between gene i in cell type A and gene j in cell type B (the formula is given in the paper's Methods). Genes i and j encode a ligand and a receptor or vice versa.
3. Construction of an integrated network between two cell types
An integrated network is constructed from the two intracellular networks, connected through known ligand-receptor interactions. We collected 1,941 manually annotated ligand-receptor interactions. If the ligand gene and the receptor gene are present in the two intracellular networks, we connect them and denote the edge as a crosstalk edge.
4. Key step: De novo identification of signaling network between two cell types
We formulate the identification of a signaling network between two cell types as a prize-collecting Steiner forest (PCSF) problem. Because the forest is a disjoint set of trees, PCSF problem is a generalization of the classical prize-collecting Steiner tree (PCST) problem. The individual signaling pathways are represented as trees, the collection of which (forest) represents the entire signaling network between two cell types.
We define edge costs and node prizes in the integrated network as follows. The z-score normalized edge weights of the integrated network are first scaled to the range of [0, 1]. Edge cost is then defined as $1 - w$, where $w$ is the scaled edge weight. Node prize is defined based on both PEM value of a gene and its closeness to the ligand/receptor genes in the network in order to identify signaling networks centered around the crosstalk edges. To capture the closeness, we use a network propagation procedure to calculate a relevance coefficient for each gene in an intracellular network.

$$F_{t+1} = \alpha W' F_t + (1 - \alpha) F_0$$

where $F_t$ is the relevance coefficient vector for all genes in the intracellular network at iteration $t$. $F_0$ is the initial value of the relevance coefficient vector such that $F_0(i) = 1$ if gene $i$ is a ligand or receptor. Otherwise, $F_0(i) = 0$. $W'$ is a normalized edge weight matrix for an intracellular network, which is defined as $W' = D^{-1/2} W D^{-1/2}$. Here, $W$ is set to the original mutual information matrix and $D$ is defined as a diagonal matrix such that $D(i, i)$ is the sum of row $i$ of the matrix $W$. This network propagation procedure is equivalent to a random walk with restart on the network. $\alpha$ is a tuning parameter that controls the balance between prior information (known ligands or receptors) and network smoothing. Node prize of a gene is defined as the product of its PEM value and the relevance coefficient to capture both the cell-type-specificity and the closeness of this gene to the ligand or receptor gene in the network. To avoid extremely large node prizes for ligand or receptor genes, we used $\alpha = 0.9$ in this study.
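The network propagation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard random-walk-with-restart update, in which each iteration mixes the smoothed vector (weight α) with the initial ligand/receptor indicator vector (weight 1 − α), together with the symmetric normalization of W by the square roots of its row sums. The toy network, the PEM values, and all function names are hypothetical.

```python
import math


def normalize(W):
    """Symmetric normalization: W'[i][j] = W[i][j] / sqrt(d_i * d_j)."""
    d = [sum(row) for row in W]
    n = len(W)
    return [[W[i][j] / math.sqrt(d[i] * d[j]) if d[i] > 0 and d[j] > 0 else 0.0
             for j in range(n)] for i in range(n)]


def propagate(W, seeds, alpha=0.9, tol=1e-9, max_iter=10000):
    """Iterate f <- alpha * W' f + (1 - alpha) * f0 until convergence."""
    Wn = normalize(W)
    n = len(W)
    f0 = [1.0 if i in seeds else 0.0 for i in range(n)]
    f = f0[:]
    for _ in range(max_iter):
        nxt = [alpha * sum(Wn[i][j] * f[j] for j in range(n)) + (1 - alpha) * f0[i]
               for i in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, f)) < tol:
            return nxt
        f = nxt
    return f


# Toy intracellular network: a chain of four genes, gene 0 is the receptor.
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
relevance = propagate(W, seeds={0}, alpha=0.9)

# Node prize = PEM value * relevance coefficient (PEM values are made up).
pem = [1.2, 0.8, 1.5, 0.3]
prizes = [p * r for p, r in zip(pem, relevance)]
```

In this chain, genes further from the receptor receive smaller relevance coefficients, so a distant gene needs a high PEM value to earn a large prize.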
The PCSF algorithm identifies an optimal forest in a network that maximizes the total amount of node prizes and minimizes the total amount of edge costs in the forest. While the PCSF problem is NP-hard and often needs a high computational cost, we employ a previously established PCSF formulation and use a highly efficient prize-collecting Steiner tree (PCST) algorithm to identify the PCSF. The objective function of the PCSF problem is defined as below.

$$\min_{F} \; O(F) = C(F) + \beta \, P(F^{c}) + \omega \, k$$

where $F$ represents a forest (i.e. multiple disconnected trees) in the integrated network. $C(F)$ denotes the sum of edge costs in the forest $F$ and $P(F^{c})$ denotes the sum of node prizes of the remaining subnetwork excluding the forest $F$ from the network. We modify the integrated network by introducing an artificial node and a number of artificial edges to the original network. The artificial edges connect the artificial node to all genes in the original network. The costs of all artificial edges are the same and are defined as $\omega$, which influences the number of trees, $k$, in the resulting PCSF. $\beta$ is a parameter for balancing the edge costs and node prizes, which influences the size of the resulting PCSF. By tuning parameters $\beta$ and $\omega$, multiple PCSTs can be identified with the artificial node as the root node. For each identified PCST, a PCSF can be obtained by removing the artificial node and artificial edges from the PCST.
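To see how the two parameters trade off, the sketch below scores a candidate forest in Python. It assumes the usual PCSF-style objective: the sum of edge costs in the forest, plus β times the prizes of nodes left out of the forest, plus ω for each tree (the cost of its artificial edge). It only evaluates a given forest and does not solve the optimization; the gene names, costs, and prizes are hypothetical.

```python
def pcsf_objective(trees, edge_cost, node_prize, beta, omega):
    """Score a forest given as a list of trees, each a list of (u, v) edges."""
    def cost(u, v):
        return edge_cost[(u, v)] if (u, v) in edge_cost else edge_cost[(v, u)]

    in_forest = set()
    total_cost = 0.0
    for tree in trees:
        for u, v in tree:
            total_cost += cost(u, v)
            in_forest.update((u, v))
    # Prizes of nodes that the forest fails to collect are paid as a penalty.
    excluded_prize = sum(p for node, p in node_prize.items() if node not in in_forest)
    return total_cost + beta * excluded_prize + omega * len(trees)


# Hypothetical toy network with five genes.
edge_cost = {("A", "B"): 0.2, ("B", "C"): 0.3, ("D", "E"): 0.1}
node_prize = {"A": 1.0, "B": 0.5, "C": 0.8, "D": 0.4, "E": 0.6}

two_trees = [[("A", "B"), ("B", "C")], [("D", "E")]]
one_tree = [[("A", "B"), ("B", "C")]]

# Cheap artificial edges (small omega): keeping both trees scores lower.
low_omega = (pcsf_objective(two_trees, edge_cost, node_prize, 1.0, 0.5),
             pcsf_objective(one_tree, edge_cost, node_prize, 1.0, 0.5))
# Expensive artificial edges (large omega): the single tree scores lower.
high_omega = (pcsf_objective(two_trees, edge_cost, node_prize, 1.0, 2.0),
              pcsf_objective(one_tree, edge_cost, node_prize, 1.0, 2.0))
```

The comparison shows why ω controls the number of trees: each additional tree has to collect enough prize to pay for one more artificial edge.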
We identify the signaling network between two cell types by searching for a robust PCSF across the full parameter space of $\beta$ and $\omega$. For each identified PCSF, we compute the occurrence of each edge in all identified PCSFs to construct a background distribution of edge occurrence frequency. Next, we calculate a p-value for each PCSF by comparing the edge occurrence frequency distribution of this PCSF to the distribution of all other identified PCSFs using a one-sided Kolmogorov-Smirnov test. The PCSF with the minimum p-value is considered as the most robust signaling network predicted by CytoTalk.
To further evaluate the statistical significance of the identified PCSF, we construct null distributions for the objective function and for the fraction of crosstalk edges in a PCSF using 1000 null PCSFs identified from randomized integrated networks. To generate the randomized networks, we separately shuffle the edges of the two intracellular networks while preserving the node degree distribution, node prizes and crosstalk edges of the original integrated network.
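One common way to shuffle edges while keeping every node's degree fixed is the double-edge swap, sketched below in Python. This is an assumed technique for illustration; the text above does not say which shuffling procedure CytoTalk uses, and the sketch ignores edge weights. The function name and gene labels are hypothetical.

```python
import random
from collections import Counter


def degree_preserving_shuffle(edges, n_swaps, seed=0):
    """Rewire an undirected edge list with double-edge swaps."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if rng.random() < 0.5:
            c, d = d, c                      # try the other pairing as well
        if len({a, b, c, d}) < 4:
            continue                         # would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:
            continue                         # would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)  # swap partners, degrees unchanged
        done += 1
    return edges


def degrees(edge_list):
    return Counter(node for edge in edge_list for node in edge)


# Hypothetical intracellular network.
original = [("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g4", "g5"),
            ("g5", "g6"), ("g6", "g1"), ("g1", "g4")]
shuffled = degree_preserving_shuffle(original, n_swaps=20, seed=1)
```

Each swap replaces edges (a, b) and (c, d) with (a, d) and (c, b), so every gene keeps exactly the number of neighbors it had before.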
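The algorithm is a bit hard to follow, and it made my head hurt a little.
Let's look at the example code.
The scripts appear to be fully packaged, so they can be used directly.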
Input files
A comma-delimited “.csv” file containing scRNA-Seq data for each cell type under study. Each file contains the ln-transformed normalized scRNA-Seq data for a cell type with rows as genes (GENE SYMBOL) and columns as cells. The files should be named as: scRNAseq_Fibroblasts.csv, scRNAseq_Macrophages.csv, scRNAseq_EndothelialCells.csv, scRNAseq_CellTypeName.csv …
A “TwoCellTypes.txt” file indicating the two cell types between which the signaling network is predicted. Please make sure that the cell type names are consistent with the scRNA-Seq data file names above.
A “LigandReceptor_Human.txt” or "LigandReceptor_Mouse.txt" file listing all known ligand-receptor pairs. The first column (ligand) and the second column (receptor) are separated by a tab (\t). Currently, 1942 and 1855 ligand-receptor pairs are provided for human and mouse, respectively.
A “Species.txt” file indicating the species from which the scRNA-Seq data are generated. Currently, “Human” and “Mouse” are supported.
A “Cutoff_GeneFilter.txt” file indicating the cutoff for removing lowly-expressed genes in the processing of scRNA-Seq data. The default cutoff value is 0.1, which means that genes expressed in less than 10% of all cells of a given type are removed.
A “BetaUpperLimit.txt” file indicating the upper limit of the test values of the algorithm parameter β, which is inversely proportional to the total number of genes in a given cell-type pair after removing lowly-expressed genes in the processing of scRNA-Seq data. Based on preliminary tests, the upper limit of β value is suggested to be 100 (default) if the total number of genes in a given cell-type pair is above 10,000. However, if the total number of genes is below 5000, it is necessary to increase the upper limit of β value to 500.
Please download "CytoTalk_package_v2.0.zip". All example input files are in the /Input/ folder and should be customized and copied into the /CytoTalk/ folder before running. The /CytoTalk/ folder can only be used ONCE for a given cell-type pair. Please use a new /CytoTalk/ folder for analysis of other cell-type pairs.
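For intuition about the “Cutoff_GeneFilter.txt” setting, here is a small Python sketch of the filtering rule as described above: a gene is removed when it is expressed in less than the cutoff fraction of cells of a given type. Treating "expressed" as a value greater than zero is an assumption, and the function and gene names are hypothetical; this is not the packaged script itself.

```python
def filter_genes(expr, cutoff=0.1):
    """expr maps gene symbol -> expression values across cells of one type."""
    kept = {}
    for gene, values in expr.items():
        # "Expressed" is taken to mean a value above zero (an assumption).
        fraction = sum(1 for v in values if v > 0) / len(values)
        if fraction >= cutoff:
            kept[gene] = values
    return kept


# Hypothetical ln-transformed expression for ten cells of one cell type.
expr = {
    "GeneA": [0.0] * 9 + [1.3],     # expressed in 10% of cells: kept
    "GeneB": [0.0] * 10,            # never expressed: removed
    "GeneC": [0.0, 2.1] * 5,        # expressed in 50% of cells: kept
}
kept = filter_genes(expr, cutoff=0.1)
strict = filter_genes(expr, cutoff=0.2)
```

Raising the cutoff shrinks the gene set, which in turn affects the suggested upper limit for β described in the “BetaUpperLimit.txt” entry.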
Run CytoTalk
Copy the “/CytoTalk/” folder, with the input files added, to your working directory and execute the following script:
bash InferSignalingNetwork.sh
[Alternative way] The whole computation above may take 5.5 hours (2.3 GHz 8-Core Intel Core i9, 14 logical cores for parallel computation), of which 4 hours are used for computing pair-wise mutual information between genes in the construction of intracellular networks for the given two cell types. Considering that users may have alternative ways for constructing cell-type-specific intracellular networks, we divide the whole computation into two steps below.
bash InferIntracellularNetwork_part1.sh # around 4 hours
bash InferIntercellularNetwork_part2.sh # around 1.5 hours
The outputs of the script "part1.sh" are two comma-delimited files "IntracellularNetwork_TypeA.txt" and "IntracellularNetwork_TypeB.txt", containing the adjacency matrices of two intracellular networks for the given two cell types, respectively. These two files are the inputs of the script "part2.sh", which can generate the final predicted signaling network.
CytoTalk output
The output folder, “/CytoTalk/IllustratePCSF/”, contains a network topology file and six attribute files that are ready for import into Cytoscape for visualization and further analysis of the predicted signaling network between the given two cell types.
Network topology | Edge attribute | Node attribute |
---|---|---|
PCSF_edgeSym.sif | PCSF_edgeCellType.txt, PCSF_edgeCost.txt | PCSF_geneCellType.txt, PCSF_geneExp.txt, PCSF_genePrize.txt, PCSF_geneRealName.txt |
Give it a try. Life is good, and it is even better with you in it.