Review of Deconvolution Methods

1. (Review 2018) Computational deconvolution of transcriptomics data from mixed cell populations


Objective:
The cell type composition of a tissue is a complex confounding factor, and gene expression analyses of bulk tissues often lose the signal from lowly abundant cell types.
The field of single-cell genomics has grown exponentially during the past few years. While promising, single-cell technologies have labour-intensive protocols and require expensive, specialized resources, which currently hinders their establishment in a clinical setting.
Computational deconvolution methods, which infer the abundance of different cell types and/or cell type-specific expression profiles in heterogeneous samples without physical cell sorting, are therefore very important.

Mathematical approaches to solve the deconvolution problem
(1) minimize the sum of squares (supervised)
ordinary least squares (OLS), also called linear least squares (LLS) or simply least squares (LS)
(2) SVR (supervised)
support vector regression with a linear kernel (ν-SVR), as used by CIBERSORT and ImmuCC
(3) unsupervised dimensionality reduction problem
PCA
(4) unsupervised (jointly estimate)
jointly estimate the gene expression of pure populations (C) and the mixing percentages (P) starting only from the expression data (T), without any prior information. Examples are unsupervised non-negative matrix factorization (NMF or NNMF) and different Bayesian approaches.
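
To make the least-squares approach in (1) concrete, here is a minimal base-R sketch. It assumes the linear model T = C × P implied by the notation above and runs on simulated, noise-free data. All variable names are illustrative, and clipping negatives followed by rescaling is only one simple way to obtain valid proportions, not the procedure of any specific published tool.

```r
# Simulated example: 50 genes, 3 cell types, 4 mixed samples (all values are synthetic)
set.seed(1)
n_genes <- 50; n_types <- 3; n_samples <- 4

# C: expression of each gene in each pure cell type (genes x cell types)
C <- matrix(rexp(n_genes * n_types, rate = 0.1), n_genes, n_types)

# P_true: mixing proportions (cell types x samples), each column sums to 1
P_raw <- matrix(runif(n_types * n_samples), n_types, n_samples)
P_true <- sweep(P_raw, 2, colSums(P_raw), "/")

# T_mix: bulk expression of the mixed samples (genes x samples), noise-free here
T_mix <- C %*% P_true

# Ordinary least squares: solve min ||T_mix - C %*% P||^2 for P
P_ols <- qr.solve(C, T_mix)

# With real data the OLS solution can be negative, so clip and rescale to sum to 1
P_ols[P_ols < 0] <- 0
P_hat <- sweep(P_ols, 2, colSums(P_ols), "/")

round(P_hat, 3)
```

Because the simulated mixtures contain no noise, P_hat recovers P_true up to numerical precision. On real data the estimate also depends on how well C represents the cell types actually present.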

Deconvolution methods readily available as webtools

  • CellPred (Wang et al., 2010): Allows estimation of cell type proportions using Affymetrix microarray data as input. Available at http://webarraydb.org/webarray/index.html (CellPred tab).
  • TIMER (Li et al., 2016): A great resource containing the proportions of B cells, CD4+ and CD8+ T cells, macrophages, neutrophils and dendritic cells across 11,509 samples corresponding to 32 cancer types from The Cancer Genome Atlas (TCGA). Available at https://cistrome.shinyapps.io/timer/. Users can download the TIMER method from https://github.com/hanfeisun/TIMER to run it on their own samples.
  • DSection (Erkkilä et al., 2010): Estimation of cell type-specific expression profiles, corrected cell type proportions and differential gene expression using microarray data. Available at: http://informatics.systemsbiology.net/DSection/
  • DCQ (Altboum et al., 2014) and CoD (Frishberg et al., 2015) are two tools from the Irit Gat-Viks lab allowing the estimation of cell type quantities to identify disease-relevant cell types using microarray or RNA-seq data. Available at: http://www.dcq.tau.ac.il/ (detailed information: http://dcq.tau.ac.il/application.html) and http://www.csgi.tau.ac.il/CoD/ (detailed information: http://www.csgi.tau.ac.il/CoD/application.html)
  • ESTIMATE (Yoshihara et al., 2013): Allows quick access to relative stromal and immune cell type composition across all samples available at TCGA (microarray and RNA-seq data). Available at: http://bioinformatics.mdanderson.org/estimate/
  • CIBERSORT (Newman et al., 2015): Given microarray or RNA-seq data from heterogeneous samples and selecting pre-built or custom-made matrices with cell type-specific expression profiles, it generates proportions of up to 22 cell types. Available at: https://cibersort.stanford.edu/runcibersort.php

Reference:
Reviews related to the deconvolution problem using transcriptomics data

Shen,Q. et al. (2016) contamDE: differential expression analysis of RNA-seq data for contaminated tumor samples. Bioinforma. Oxf. Engl., 32, 705–712.
Shen-Orr,S.S. et al. (2010) Cell type-specific gene expression differences in complex tissues. Nat. Methods, 7, 287–289.
Shen-Orr,S.S. and Gaujoux,R. (2013) Computational Deconvolution: Extracting Cell Type-Specific Information from Heterogeneous Samples. Curr. Opin. Immunol., 25
Mohammadi,S. et al. (2017) A Critical Survey of Deconvolution Methods for Separating Cell Types in Complex Tissues. Proc. IEEE, 105, 340–366
Yadav,V.K. and De,S. (2015) An assessment of computational methods for estimating purity and clonality using genomic data derived from heterogeneous tumor tissue samples. Brief. Bioinform., 16, 232–241.

2. CellMix: a comprehensive toolbox for gene expression deconvolution

Overview of CellMix

CellMix is an R package released in 2013 that integrates 7 gene expression deconvolution algorithms, 8 marker gene lists and 11 public datasets. It facilitates the estimation of cell type proportions and/or cell-specific differential expression in gene expression experiments.

Install and load CellMix

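# original installation commands; biocLite() has since been replaced by BiocManager::install() in Bioconductor 3.8 and later
source('http://bioconductor.org/biocLite.R')
biocLite("CellMix", siteRepos = "http://web.cbio.uct.ac.za/~renaud/CRAN")
biocLite("GEOquery")
library("CellMix")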

Background and objectives

Gene expression deconvolution is naturally expressed as a matrix decomposition problem. It works on global gene expression data and includes both supervised and unsupervised methods; supervised deconvolution additionally requires known signatures or marker genes.
Objectives:

  • cell proportions: what is the proportion of each cell type?
  • are there differences in proportion that are associated with some disease status/covariate?
  • cell-specific expression: what is the expression profile of each cell type?
  • which genes are differentially expressed in each cell type between groups of samples?

Estimating cell proportions from known signatures

Blood samples

  • This approach uses known cell-specific expression signatures generated in independent studies (Abbas et al. 2009; Gong et al. 2011).
  • The tutorial uses the dataset GSE20300.
# load data (normally requires an internet connection to GEO) 
acr <- ExpressionMix("GSE20300", verbose = 2)
# estimate proportions using signatures from Abbas et al. (2009) 
res <- gedBlood(acr, verbose = TRUE)
# proportions are stored in the coefficient matrix 
dim(coef(res))
coef(res)[1:3, 1:4]
# cell type names 
basisnames(res)
# basis signatures (with converted IDs) 
basis(res)[1:5, 1:3]
# aggregate into CBC 
cbc <- asCBC(res)
dim(cbc)
# plot against actual CBC 
profplot(acr, cbc) 
# plot cell proportion differences between groups 
boxplotBy(res, acr$Status, main = "Cell proportions vs Transplant status")

Building/filtering basis signatures

Select genes based on their cell type specificity, and build a basis
signature matrix that provides the “maximum” deconvolution power.

  • From marker genes only
    Even when no pure sample expression profiles are available, deconvolution and estimation of cell type proportions can still be performed using sets of marker genes.
# NOTE: `mix` (mixture expression data) and `sel` (marker gene list) are assumed to come from an earlier tutorial step not shown here
# check if data is in log scale
is_logscale(mix)
# compute mean expression profiles within each cell type (linear scale, via expb)
p <- ged(expb(mix, 2), sel, "meanProfile")
# plot against known proportions (p is by default not scaled)
profplot(mix, p, scale = TRUE, main = "meanProfile - Linear scale")
# compute mean expression profiles within each cell type (log scale)
lp <- ged(mix, sel, "meanProfile")
# plot against known proportions (lp is by default not scaled)
profplot(mix, lp, scale = TRUE, main = "meanProfile - Log scale")
# compute proportions using the DSA method
pdsa <- ged(mix[sel], sel, "DSA", verbose = TRUE)
profplot(mix, pdsa, main = "DSA - Linear scale")
# rerun DSA with log = FALSE for the log-scale comparison
pdsa <- ged(mix[sel], sel, "DSA", log = FALSE)
profplot(mix, pdsa, main = "DSA - Log scale")
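
The idea behind estimating proportions from marker genes alone can be illustrated without CellMix. The base-R sketch below is not the package's implementation. It assumes an idealised case in which each marker is expressed only in its own cell type and all markers share the same expression level (100, an arbitrary choice), so that averaging the markers of each cell type and rescaling recovers the proportions exactly. All names and values are hypothetical.

```r
# Known proportions of two cell types (A, B) in three mixed samples (synthetic)
props <- rbind(A = c(0.7, 0.5, 0.2),
               B = c(0.3, 0.5, 0.8))

# Four hypothetical marker genes per cell type
markers <- list(A = paste0("gA", 1:4), B = paste0("gB", 1:4))

# Idealised pure profiles: a marker is expressed (level 100) only in its own cell type
pure <- matrix(0, nrow = 8, ncol = 2,
               dimnames = list(unname(unlist(markers)), c("A", "B")))
pure[markers$A, "A"] <- 100
pure[markers$B, "B"] <- 100

# Mixed expression (genes x samples) under the linear mixing model
expr <- pure %*% props

# Mean marker expression per cell type and sample (unscaled, cell types x samples)
mean_profile <- t(sapply(markers, function(g) colMeans(expr[g, , drop = FALSE])))

# Rescale each sample so that the estimates sum to 1
est <- sweep(mean_profile, 2, colSums(mean_profile), "/")
est
```

With real data, marker expression levels differ between cell types, so the unscaled mean profiles are only proportional to the true proportions within each cell type. This is one reason the plots above are drawn with scale = TRUE.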

Estimating differential cell-specific expression

  • From measured proportions: csSAM
  • From proportion priors: DSection

Complete deconvolution using marker genes

A priori: enforce marker expression patterns

# generate random data with 5 markers per cell type 
x <- rmix(3, 200, 20, markers = 5) 
m <- getMarkers(x)
# deconvolve using KL-divergence metric 
kl <- ged(x, m, "ssKL", log = FALSE, rng = 1234, nrun = 10)
# plot against known proportions 
profplot(x, kl) 
# check consistency of most expressing cell types in known basis signatures 
basismarkermap(basis(x), kl) 
# correlation with known signatures 
basiscor(x, kl)

A posteriori: assign signatures to cell types

# deconvolve using the deconf algorithm
dec <- ged(x, m, "deconf", rng = 1234, nrun = 10)
# plot against known proportions
profplot(x, dec)
# check consistency of most expressing cell types in known signatures
basismarkermap(basis(x), dec)
# correlation with known signatures 
basiscor(x, dec)

3. DeconRNASeq: A Statistical Framework for Deconvolution of Heterogeneous Tissue Samples Based on mRNA-Seq data

Overview of DeconRNASeq

It uses a nonnegative decomposition algorithm, solved through quadratic programming, to estimate the mixing proportions of distinct tissue types in next-generation sequencing data. It requires two R data frames as input (datasets and signatures) and models them as datasets = signatures × A:

  • datasets : the raw mRNA expression data matrix (genes by samples)
  • signatures : known signatures of specific cell types or tissues (genes by cell types)
  • A : the cell type concentration matrix (cell types by samples), which is the quantity being estimated
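
As a rough illustration of the model datasets = signatures × A, the base-R sketch below estimates one column of A for a single simulated sample. It is not the solver used inside DeconRNASeq: instead of a dedicated quadratic programming routine, it uses box-constrained optimisation (optim with L-BFGS-B) to enforce nonnegativity and then rescales the result to sum to one. The function name estimate_fractions and all data are hypothetical.

```r
# Synthetic signature matrix: 40 genes x 3 cell types
set.seed(42)
n_genes <- 40; n_types <- 3
signatures <- matrix(rexp(n_genes * n_types, rate = 0.05), n_genes, n_types,
                     dimnames = list(NULL, c("typeA", "typeB", "typeC")))

# One mixed sample with known fractions (noise-free for clarity)
a_true <- c(0.6, 0.3, 0.1)
mixture <- as.vector(signatures %*% a_true)

# Hypothetical helper: nonnegative least squares, then rescaling to sum to 1
estimate_fractions <- function(y, S) {
  objective <- function(a) sum((y - S %*% a)^2)
  gradient  <- function(a) as.vector(-2 * crossprod(S, y - S %*% a))
  fit <- optim(par = rep(1 / ncol(S), ncol(S)), fn = objective, gr = gradient,
               method = "L-BFGS-B", lower = 0)   # lower = 0 enforces a >= 0
  a <- fit$par
  setNames(a / sum(a), colnames(S))
}

a_hat <- estimate_fractions(mixture, signatures)
round(a_hat, 3)
```

For a full expression matrix the same function would be applied to each sample column in turn, giving one column of A per sample.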

Pipeline of using DeconRNASeq

## install deconRNASeq package
source("https://bioconductor.org/biocLite.R")
biocLite("DeconRNASeq")
library(DeconRNASeq)
##view documentation
browseVignettes("DeconRNASeq")

## run the example
## multi_tissue: expression profiles for 10 mixing samples from multiple tissues
data(multi_tissue)
datasets <- x.data[,2:11] 
## datasets matrix contain 28745 genes and 10 samples.
## tissue-specific signatures for different human tissues including 1570 genes for the five tissues
signatures <- x.signature.filtered.optimal[,2:6] 
proportions <- fraction
#attributes(signatures)[c(1,2)]
DeconRNASeq(datasets, signatures, proportions, checksig=FALSE,
              known.prop = TRUE, use.scale = TRUE, fig = TRUE)
## plot of the condition number vs. the number of genes from the gene signature in the deconvolution experiments by setting checksig to be true.
DeconRNASeq(datasets, signatures, proportions, checksig=TRUE, 
              known.prop = TRUE, use.scale = TRUE, fig = TRUE)

4. CIBERSORT

Overview of CIBERSORT

CIBERSORT is a deconvolution method for complex tissues, especially for human leukocyte subsets, that applies linear support vector regression (SVR) to gene expression profiles (GEPs). It outperformed six other GEP deconvolution methods with respect to noise, unknown mixture content and closely related cell types:

  • linear least-squares regression (LLSR),
  • quadratic programming (QP),
  • perturbation model for gene expression deconvolution (PERT),
  • robust linear regression (RLR),
  • microarray microdissection with analysis of differences (MMAD),
  • digital sorting algorithm (DSA)

Methods

CIBERSORT requires an input matrix of reference gene expression signatures, such as the leukocyte gene signature matrix termed LM22. It is offered as web-based software (http://cibersort.stanford.edu/).
The general pipeline is as follows:

  1. Register on the website and sign in.
  2. Prepare the data:
  • Mixture File
    The mixture file contains the gene expression profiles of the samples to be analyzed; each column is a sample and each row is a gene.
  • Signature Genes File
    It is a matrix of reference gene expression signatures, or signature matrix. Each data column is a component reference sample or cell population class, and each row is a barcode gene.
  3. Upload the data files.
  4. Run CIBERSORT.
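
A mixture file with the layout described above (genes in rows, samples in columns) can be written from R as follows. This is a sketch with made-up expression values. The tab-delimited format, the gene identifiers in the first column and the column name GeneSymbol are assumptions, so check the upload instructions on the website for the exact requirements.

```r
# Toy mixture: 3 genes x 2 samples (values are made up for illustration)
mixture <- data.frame(
  GeneSymbol = c("CD3E", "CD19", "CD14"),
  sample1 = c(12.1, 3.4, 8.8),
  sample2 = c(5.6, 9.7, 2.3)
)

# Write as tab-delimited text with a header row and no quoting or row names
mixture_path <- file.path(tempdir(), "mixture.txt")
write.table(mixture, file = mixture_path, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read the file back to confirm the layout: one row per gene, one column per sample
check <- read.delim(mixture_path, check.names = FALSE)
dim(check)
```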

Limitations

  • Like all signature gene-based methods, CIBERSORT depends on the fidelity of the reference profiles, which could deviate in cells undergoing heterotypic interactions, phenotypic plasticity or disease-induced dysregulation.
  • CIBERSORT does not provide P values for detection limits of individual cell types.
  • Despite showing a considerably lower estimation bias than other approaches, CIBERSORT systematically over- or underestimated some cell types.
  • Although CIBERSORT was not explicitly tested on RNA-seq data, the linearity assumptions made by the method are likely to hold.

5. UNDO: An open-source software tool for unsupervised deconvolution of tumor-stromal mixed expressions

Overview of UNDO

The R package UNDO (Unsupervised Deconvolution of Tumor-stromal Mixed Expressions) implements an unsupervised deconvolution method that exploits the existence of marker genes without requiring them to be known in advance. It is designed for tumor-stroma mixed expression profiles.
