https://academic.oup.com/bib/article/22/4/bbaa238/5917082?login=

Abstract

Despite impressive improvement in the next-generation sequencing technology, reliable detection of indels is still a difficult endeavour. Recognition of true indels is of prime importance in many applications, such as personalized health care, disease genomics and population genetics. Recently, advanced machine learning techniques have been successfully applied to classification problems with large-scale data. In this paper, we present SICaRiO, a gradient boosting classifier for the reliable detection of true indels, trained with the gold-standard dataset from ‘Genome in a Bottle’ (GIAB) consortium. Our filtering scheme significantly improves the performance of each variant calling pipeline used in GIAB and beyond. SICaRiO uses genomic features that can be computed from publicly available resources, i.e. it does not require sequencing pipeline-specific information (e.g. read depth). This study also sheds light on prior genomic contexts responsible for the erroneous calling of indels made by sequencing pipelines. We have compared prediction difficulty for three categories of indels over different sequencing pipelines. We have also ranked genomic features according to their predictivity in determining false positives.
key: NGS, variant calling, SNP, indel, short indel (1-50 bp), GATK, Samtools, false positive, indel filtering
data: GENOME IN A BOTTLE
feature:

model: XGBoost, an ensemble of boosted decision trees

https://academic.oup.com/bib/article/22/4/bbaa228/5917045

Abstract

Long noncoding RNAs (lncRNAs) play significant roles in various physiological and pathological processes via their interactions with biomolecules like DNA, RNA and protein. The existing in silico methods used for predicting the functions of lncRNA mainly rely on calculating the similarity of lncRNA or investigating whether an lncRNA can interact with a specific biomolecule or disease. In this work, we explored the functions of lncRNA from a different perspective: we presented a tool for predicting the interaction biomolecule type for a given lncRNA. For this purpose, we first investigated the main molecular mechanisms of the interactions of lncRNA–RNA, lncRNA–protein and lncRNA–DNA. Then, we developed an ensemble deep learning model: lncIBTP (lncRNA Interaction Biomolecule Type Prediction). This model predicted the interactions between lncRNA and different types of biomolecules. On the 5-fold cross-validation, the lncIBTP achieves average values of 0.7042 in accuracy, 0.7903 and 0.6421 in macro-average area under receiver operating characteristic curve and precision–recall curve, respectively, which illustrates the model effectiveness. Besides, based on the analysis of the collected published data and prediction results, we hypothesized that the characteristics of lncRNAs that interacted with DNA may be different from those that interacted with only RNA.
key:

lncRNA-RNA interaction

The lncRNA–RNA interaction types. (A) lncRNA acts as the host to small ncRNA. (B) lncRNA directly interacts with RNA. (C) lncRNA regulates mRNA through miRNA in the ceRNA network. (D) lncRNA regulates mRNA by competing with miRNA on binding sites.

mRNA: coding RNA. lncRNA regulates its splicing, editing, subcellular distribution and stability.
ncRNA:
snoRNA: small nucleolar RNA. Guides chemical modifications of other RNAs.
miRNA: microRNA. Functions in RNA silencing and post-transcriptional regulation of gene expression; regulates the expression and stability of lncRNA.
piRNA: Piwi-interacting RNA. Epigenetic and post-transcriptional silencing.

lncRNA-protein interaction
lncRNA acts as signals, guides, scaffolds and decoys in the relationship with protein
lncRNA-DNA interaction

The lncRNA–DNA interaction types. (A) lncRNA interacts with DNA via the DNA-binding protein. (B) R-loop. (C) DNA:RNA triplex.

CIRNN predicts the interaction between lncRNA and miRNA.
GONMF identifies the lncRNA–mRNA coexpression
HLPI-Ensemble predicts the lncRNA–protein interaction
dataset: sequence data, NPInter v4.0, lncRInter
feature: k-mer, g-gap, g-bigap
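
The sequence features listed above can be sketched as follows. This is a minimal illustration, not the lncIBTP implementation: the RNA alphabet, the function names and the exact g-gap definition (pairs of nucleotides separated by g intervening positions) are assumptions, and g-bigap is not covered.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGU"  # assumed RNA alphabet; T is mapped to U below


def kmer_frequencies(seq, k):
    """Return the normalized frequency of every possible k-mer in seq."""
    seq = seq.upper().replace("T", "U")
    n_windows = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(max(n_windows, 0)))
    total = max(n_windows, 1)
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    return {kmer: counts.get(kmer, 0) / total for kmer in kmers}


def g_gap_frequencies(seq, g):
    """Frequency of nucleotide pairs with g positions between them (assumed definition)."""
    seq = seq.upper().replace("T", "U")
    n_pairs = len(seq) - g - 1
    counts = Counter(seq[i] + seq[i + g + 1] for i in range(max(n_pairs, 0)))
    total = max(n_pairs, 1)
    pairs = ["".join(p) for p in product(ALPHABET, repeat=2)]
    return {pair: counts.get(pair, 0) / total for pair in pairs}
```

With g = 0 the g-gap feature reduces to the ordinary 2-mer frequency, which is a quick sanity check on the definition used here.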

The architecture of lncIBTP. Class 0–3 represents lncRNAs that interacted with only RNA, only protein, RNA & protein and DNA-related, respectively.

Abstract

N6-methyladenosine (m⁶A) modification can regulate a variety of biological processes. However, the implications of m⁶A modification in lung adenocarcinoma (LUAD) remain largely unknown. Here, we systematically evaluated the m⁶A modification features in more than 2400 LUAD samples by analyzing the multi-omics features of 23 m⁶A regulators. We depicted the genetic variation features of m⁶A regulators, and found mutations of FTO and YTHDF3 were linked to worse overall survival. Many m⁶A regulators were aberrantly expressed in tumors, among which FTO, IGF2BP3, YTHDF1 and RBM15 showed consistent alteration features across 11 independent cohorts. Besides, the regulator-pathway interaction network demonstrated that m⁶A modification was associated with various biological pathways, including immune-related pathways. The correlation between m⁶A regulators and tumor microenvironment was also assessed. We found that LRPPRC was negatively correlated with most tumor-infiltrating immune cells. On the other hand, we established a scoring tool named m6Sig, which was positively correlated with PD-L1 expression and could reflect both the tumor microenvironment characterization and prognosis of LUAD patients. Comparison of CNV between high and low m6Sig groups revealed differences on chromosome 7. Application of m6Sig on an anti-PD-L1 immunotherapy cohort confirmed that the high m6Sig group demonstrated therapeutic advantages and clinical benefits. Our study indicated that m⁶A modification is involved in many aspects of LUAD and contributes to tumor microenvironment formation. A better understanding of m⁶A modification will provide more insights into the molecular mechanisms of LUAD and facilitate developing more effective personalized treatment strategies. A web application was built along with this study (http://www.bioinfo-zs.com/luadexpress/).
key: explores the relationship between m6A (23 regulators) and lung adenocarcinoma at the multi-omics level: mutation, CNV, gene expression, DNA methylation, cell survival, clinical
data: m6A regulators and tumor microenvironment-related genes (TMRGs)
model: LASSO Cox regression analysis. survival ~ gene
result: m6Sig score was positively correlated with PD-L1, predicting response in anti-PD-L1 immunotherapy.

Tumor microenvironment

https://academic.oup.com/bib/article/22/4/bbaa222/5916940

Abstract

Motivation
The advancements of single-cell sequencing methods have paved the way for the characterization of cellular states at unprecedented resolution, revolutionizing the investigation on complex biological systems. Yet, single-cell sequencing experiments are hindered by several technical issues, which cause output data to be noisy, impacting the reliability of downstream analyses. Therefore, a growing number of data science methods has been proposed to recover lost or corrupted information from single-cell sequencing data. To date, however, no quantitative benchmarks have been proposed to evaluate such methods.

Results
We present a comprehensive analysis of the state-of-the-art computational approaches for denoising and imputation of single-cell transcriptomic data, comparing their performance in different experimental scenarios. In detail, we compared 19 denoising and imputation methods, on both simulated and real-world datasets, with respect to several performance metrics related to imputation of dropout events, recovery of true expression profiles, characterization of cell similarity, identification of differentially expressed genes and computation time. The effectiveness and scalability of all methods were assessed with regard to distinct sequencing protocols, sample size and different levels of biological variability and technical noise. As a result, we identify a subset of versatile approaches exhibiting solid performances on most tests and show that certain algorithmic families prove effective on specific tasks but inefficient on others. Finally, most methods appear to benefit from the introduction of appropriate assumptions on noise distribution of biological processes.
key: single-cell denoising (removing biological and technical noise) and imputation (recovering missing data) methods. 19 methods were compared on simulated data and 4 real datasets using 5 criteria: imputation of dropout events, recovery of true expression profiles, characterization of cell similarity, identification of differentially expressed genes and computation time.

https://academic.oup.com/bib/article/22/4/bbaa219/5916936

Abstract

Due to the high cost of flow and mass cytometry, there has been a recent surge in the development of computational methods for estimating the relative distributions of cell types from the gene expression profile of a bulk of cells. Here, we review the five common ‘digital cytometry’ methods: deconvolution of RNA-Seq, cell-type identification by estimating relative subsets of RNA transcripts (CIBERSORT), CIBERSORTx, single sample gene set enrichment analysis and single-sample scoring of molecular phenotypes deconvolution method. The results show that CIBERSORTx B-mode, which uses batch correction to adjust the gene expression profile of the bulk of cells (‘mixture data’) to eliminate possible cross-platform variations between the mixture data and the gene expression data of single cells (‘signature matrix’), outperforms other methods, especially when signature matrix and mixture data come from different platforms. However, in our tests, CIBERSORTx S-mode, which uses batch correction for adjusting the signature matrix instead of mixture data, did not perform better than the original CIBERSORT method, which does not use any batch correction method. This result suggests the need for further investigations into how to utilize batch correction in deconvolution methods.
key:
Cytometry (immunohistochemistry, flow cytometry and mass cytometry) is costly, so the relative distribution of cell types is estimated from the gene expression profile of bulk cells.
Common methods: deconvolution, CIBERSORT, CIBERSORTx (ComBat), single sample gene set enrichment analysis (ssGSEA DM), single-sample scoring of molecular phenotypes deconvolution method (SingScore DM)
Result: CIBERSORT, CIBERSORTx B-mode
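
The idea behind bulk deconvolution is that the mixture profile is approximately the signature matrix times the vector of cell-type fractions. The sketch below recovers the fractions by non-negative least squares with projected gradient descent. This is a swapped-in, simpler technique for illustration only: CIBERSORT itself uses ν-support vector regression, and the function name, iteration count and toy numbers are assumptions.

```python
def deconvolve(signature, mixture, n_iter=2000):
    """Estimate cell-type fractions f >= 0 such that signature @ f ~ mixture.

    signature: list of rows (genes), each a list of per-cell-type expression.
    mixture:   list of bulk expression values, one per gene.
    """
    n_genes, n_types = len(signature), len(signature[0])
    f = [1.0 / n_types] * n_types
    # 1 / ||S||_F^2 is a safe step size (no larger than 1 / Lipschitz constant).
    step = 1.0 / sum(v * v for row in signature for v in row)
    for _ in range(n_iter):
        resid = [
            sum(signature[g][k] * f[k] for k in range(n_types)) - mixture[g]
            for g in range(n_genes)
        ]
        grad = [
            sum(signature[g][k] * resid[g] for g in range(n_genes))
            for k in range(n_types)
        ]
        # Gradient step followed by projection onto the non-negative orthant.
        f = [max(0.0, f[k] - step * grad[k]) for k in range(n_types)]
    total = sum(f)
    return [v / total for v in f] if total else f


# Toy example: 3 genes, 2 cell types, mixture built from fractions 0.3 / 0.7.
fractions = deconvolve([[10, 0], [0, 10], [5, 5]], [3, 7, 5])
```

Batch correction (as in CIBERSORTx B-mode or S-mode) would be applied to the mixture or the signature matrix before a step like this one.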

https://academic.oup.com/bib/article/22/4/bbaa216/5916937

Abstract

Single-cell RNA-sequencing (scRNA-seq) data widely exist in bioinformatics. It is crucial to devise a distance metric for scRNA-seq data. Almost all existing clustering methods based on spectral clustering algorithms work in three separate steps: similarity graph construction; continuous labels learning; discretization of the learned labels by k-means clustering. However, this common practice has potential flaws that may lead to severe information loss and degradation of performance. Furthermore, the performance of a kernel method is largely determined by the selected kernel; a self-weighted multiple kernel learning model can help choose the most suitable kernel for scRNA-seq data. To this end, we propose to automatically learn similarity information from data. We present a new clustering method in the form of a multiple kernel combination that can directly discover groupings in scRNA-seq data. The main proposition is that automatically learned similarity information from scRNA-seq data is used to transform the candidate solution into a new solution that better approximates the discrete one. The proposed model can be efficiently solved by the standard support vector machine (SVM) solvers. Experiments on benchmark scRNA-Seq data validate the superior performance of the proposed model. Spectral clustering with multiple kernels is implemented in Matlab, licensed under Massachusetts Institute of Technology (MIT) and freely available from the Github website, https://github.com/Cuteu/SMSC/.
key: clustering problem
cluster: center-based (e.g. k-means), hierarchical clustering and methods that view clustering as a graph partitioning problem (e.g. spectral clustering): similarity graph construction; continuous labels learning; discretization of the learned labels by k-means clustering
data: 11 publicly available scRNA-Seq datasets
clustering criteria: Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), Accuracy (ACC), and Purity
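
Two of the criteria above, Purity and ARI, are simple enough to compute by hand. The following is a minimal sketch using their standard definitions; the function names are illustrative, and at least two cells are assumed.

```python
from collections import Counter
from math import comb


def purity(true_labels, pred_labels):
    """Fraction of cells that belong to the majority true class of their cluster."""
    clusters = {}
    for t, p in zip(true_labels, pred_labels):
        clusters.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in clusters.values()) / len(true_labels)


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand Index computed from the contingency table (needs n >= 2)."""
    n = len(true_labels)
    contingency = Counter(zip(true_labels, pred_labels))
    sum_ij = sum(comb(v, 2) for v in contingency.values())
    sum_a = sum(comb(v, 2) for v in Counter(true_labels).values())
    sum_b = sum(comb(v, 2) for v in Counter(pred_labels).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

Both scores are invariant to how cluster labels are named, so a clustering that swaps label 0 and label 1 still scores 1.0 against the truth.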

The SMSC framework for the clustering of scRNA-seq data. The SMSC framework is composed of four stages. The first stage obtains the dataset expression matrices; the cells-by-genes matrix is the raw data. The second stage calculates the kernel matrices by processing the original data with an effective fusion of multiple kernel functions. The kernel matrices, which can be obtained from any reasonable kernel functions, are used for spectral clustering in the third stage. Finally, cells are clustered by the K-means algorithm.

Benchmarking of SMSC against four clustering methods. (A) Comparison of ARI among SMSC and four clustering methods using 11 datasets. (B) Comparison of ACC among SMSC and four clustering methods using 11 datasets. (C) Comparison of NMI among SMSC and four clustering methods using 11 datasets. (D) Comparison of Purity among SMSC and four clustering methods using 11 datasets. Color boxes represent different methods, which were noted below the figure. And y-axis means clustering performances of criteria. The larger the better.

https://academic.oup.com/bib/article/22/4/bbaa229/5916939

Abstract

DNA/RNA motif mining is the foundation of gene function research. The DNA/RNA motif mining plays an extremely important role in identifying the DNA- or RNA-protein binding site, which helps to understand the mechanism of gene regulation and management. For the past few decades, researchers have been working on designing new efficient and accurate algorithms for mining motif. These algorithms can be roughly divided into two categories: the enumeration approach and the probabilistic method. In recent years, machine learning methods had made great progress, especially the algorithm represented by deep learning had achieved good performance. Existing deep learning methods in motif mining can be roughly divided into three types of models: convolutional neural network (CNN) based models, recurrent neural network (RNN) based models, and hybrid CNN–RNN based models. We introduce the application of deep learning in the field of motif mining in terms of data preprocessing, features of existing deep learning architectures and comparing the differences between the basic deep learning models. Through the analysis and comparison of existing deep learning methods, we found that the more complex models tend to perform better than simple ones when data are sufficient, and the current methods are relatively simple compared with other fields such as computer vision, language processing (NLP), computer games, etc. Therefore, it is necessary to conduct a summary in motif mining by deep learning, which can help researchers understand this field.
key: motif mining with deep learning
method: enumeration approach, probabilistic method, machine learning
Compares 9 models

https://academic.oup.com/bib/article/22/4/bbaa226/5917048

Abstract

Deep generative models can be trained to represent the joint distribution of data, such as measurements of single nucleotide polymorphisms (SNPs) from several individuals. Subsequently, synthetic observations are obtained by drawing from this distribution. This has been shown to be useful for several tasks, such as removal of noise, imputation, for better understanding underlying patterns, or even exchanging data under privacy constraints. Yet, it is still unclear how well these approaches work with limited sample size. We investigate such settings specifically for binary data, e.g. as relevant when considering SNP measurements, and evaluate three frequently employed generative modeling approaches, variational autoencoders (VAEs), deep Boltzmann machines (DBMs) and generative adversarial networks (GANs). This includes conditional approaches, such as when considering gene expression conditional on SNPs. Recovery of pair-wise odds ratios (ORs) is considered as a primary performance criterion. For simulated as well as real SNP data, we observe that DBMs generally can recover structure for up to 300 variables, with a tendency of over-estimating ORs when not carefully tuned. VAEs generally get the direction and relative strength of pairwise relations right, yet with considerable under-estimation of ORs. GANs provide stable results only with larger sample sizes and strong pair-wise relations in the data. Taken together, DBMs and VAEs (in contrast to GANs) appear to be well suited for binary omics data, even at rather small sample sizes. This opens the way for many potential applications where synthetic observations from omics data might be useful.
key: generating synthetic samples when the sample size is insufficient
Three deep generative models: VAE, DBM, GAN
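
The abstract uses recovery of pair-wise odds ratios (ORs) as the primary performance criterion. A minimal sketch of the OR for two binary variables follows; the function name is illustrative, and the optional 0.5 pseudocount (the Haldane-Anscombe correction, which avoids division by zero) is an assumption rather than something stated in the abstract.

```python
def pairwise_odds_ratio(x, y, pseudocount=0.5):
    """Odds ratio between two binary vectors from their 2x2 contingency table."""
    n11 = n10 = n01 = n00 = 0
    for a, b in zip(x, y):
        if a and b:
            n11 += 1
        elif a:
            n10 += 1
        elif b:
            n01 += 1
        else:
            n00 += 1
    numerator = (n11 + pseudocount) * (n00 + pseudocount)
    denominator = (n10 + pseudocount) * (n01 + pseudocount)
    return numerator / denominator
```

To judge a generative model in this spirit, one would compute the OR for each variable pair on the real data and on the synthetic data and compare the two; over-estimation or under-estimation then shows up as a systematic shift.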

https://academic.oup.com/bib/article/22/4/bbaa224/5923107
Abstract
Cholangiocarcinoma (CCA) is a type of cancer with limited treatment options and a poor prognosis. Although some important genes and pathways associated with CCA have been identified, the relationship between coexpression and phenotype in CCA at the systems level remains unclear. In this study, the relationships underlying the molecular and clinical characteristics of CCA were investigated by employing weighted gene coexpression network analysis (WGCNA). The gene expression profiles and clinical features of 36 patients with CCA were analyzed to identify differentially expressed genes (DEGs). Subsequently, the coexpression of DEGs was determined by using the WGCNA method to investigate the correlations between pairs of genes. Network modules that were significantly correlated with clinical traits were identified. In total, 1478 mRNAs were found to be aberrantly expressed in CCA. Seven coexpression modules that significantly correlated with clinical characteristics were identified and assigned representative colors. Among the 7 modules, the green and blue modules were significantly related to tumor differentiation. Seventy-eight hub genes that were correlated with tumor differentiation were found in the green and blue modules. Survival analysis showed that 17 hub genes were prognostic biomarkers for CCA patients. In addition, we found five new targets (ISM1, SULT1B1, KIFC1, AURKB and CCNB1) that have not been studied in the context of CCA and verified their differential expression in CCA through experiments. Our results not only promote our understanding of the relationship between the transcriptome and clinical data in CCA but will also guide the development of targeted molecular therapy for CCA.
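key: for cholangiocarcinoma, a coexpression network of differentially expressed genes from 36 patients was built (WGCNA R package); 7 modules were identified, 2 of which were significantly related to tumor differentiation, and 17 hub genes were identified as prognostic marker genes.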
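
The core of a weighted coexpression network is a soft-thresholded correlation: for an unsigned network, the adjacency between genes i and j is |cor(x_i, x_j)|^β. The sketch below shows only that step in plain Python; the value β = 6 and the toy gene names are assumptions for illustration, and the study itself used the WGCNA R package.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def soft_threshold_adjacency(expr, beta=6):
    """Unsigned adjacency |cor|^beta for every gene pair.

    expr: dict mapping gene name -> list of expression values across samples.
    """
    return {
        (g1, g2): abs(pearson(expr[g1], expr[g2])) ** beta
        for g1, g2 in combinations(expr, 2)
    }


# Hypothetical toy data: GENE_A and GENE_B are perfectly correlated.
toy = {"GENE_A": [1, 2, 3, 4], "GENE_B": [2, 4, 6, 8], "GENE_C": [1, -1, 1, -1]}
adjacency = soft_threshold_adjacency(toy)
```

Raising the correlation to a power suppresses weak correlations while keeping strong ones, which is why modules are detected on this matrix rather than on the raw correlations.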

https://academic.oup.com/bib/article/22/4/bbaa247/5924101

Abstract

In order to extract useful information from a huge amount of biological data nowadays, simple and convenient tools are urgently needed for data analysis and modeling. In this paper, an automatic data mining tool, termed as ABCModeller (Automatic Binary Classification Modeller), with a user-friendly graphical interface was developed here, which includes automated functions as data preprocessing, significant feature extraction, classification modeling, model evaluation and prediction. In order to enhance the generalization ability of the final model, a consistent voting method was built here in this tool with the utilization of three popular machine-learning algorithms, as artificial neural network, support vector machine and random forest. Besides, Fibonacci search and orthogonal experimental design methods were also employed here to automatically select significant features in the data space and optimal hyperparameters of the three algorithms to achieve the best model. The reliability of this tool has been verified through multiple benchmark data sets. In addition, with the advantage of a user-friendly graphical interface of this tool, users without any programming skills can easily obtain reliable models directly from original data, which can reduce the complexity of modeling and data mining, and contribute to the development of related research including but not limited to biology. The executable file of this tool can be downloaded from http://lishuyan.lzu.edu.cn/ABCModeller.rar.
key:
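A data mining tool covering data preprocessing, feature extraction, classification, model evaluation and prediction. A voting system combines an artificial neural network, SVM and random forest.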
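
The abstract does not spell out the rule behind its "consistent voting" step, so the sketch below assumes a plain majority vote over the three classifiers and separately flags the samples on which all models agree. The function names and the toy predictions are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one label per sample by majority."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


def unanimous_mask(predictions):
    """True where every model predicts the same label for a sample."""
    return [len(set(votes)) == 1 for votes in zip(*predictions)]


# Hypothetical binary predictions from the three model types on four samples.
ann = [1, 0, 1, 1]
svm = [1, 0, 0, 1]
rf = [0, 0, 1, 1]
combined = majority_vote([ann, svm, rf])
agreed = unanimous_mask([ann, svm, rf])
```

With three binary classifiers a majority always exists, so no tie-breaking rule is needed in this setting.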

Framework of the data mining process of ABCModeller

https://academic.oup.com/bib/article/22/4/bbaa212/5922326

Abstract

Motivation: The functional changes of the genes, RNAs and proteins will eventually be reflected in the metabolic level. Increasing number of researchers have researched mechanism, biomarkers and targeted drugs by metabolites. However, compared with our knowledge about genes, RNAs, and proteins, we still know few about diseases-related metabolites. All the few existed methods for identifying diseases-related metabolites ignore the chemical structure of metabolites, fail to recognize the association pattern between metabolites and diseases, and fail to apply to isolated diseases and metabolites. Results: In this study, we present a graph deep learning based method, named Deep-DRM, for identifying diseases-related metabolites. First, chemical structures of metabolites were used to calculate similarities of metabolites. The similarities of diseases were obtained based on their functional gene network and semantic associations. Therefore, both metabolites and diseases network could be built. Next, Graph Convolutional Network (GCN) was applied to encode the features of metabolites and diseases, respectively. Then, the dimension of these features was reduced by Principal components analysis (PCA) with retainment 99% information. Finally, Deep neural network was built for identifying true metabolite-disease pairs (MDPs) based on these features. The 10-cross validations on three testing setups showed outstanding AUC (0.952) and AUPR (0.939) of Deep-DRM compared with previous methods and similar approaches. Ten of top 15 predicted associations between diseases and metabolites got support by other studies, which suggests that Deep-DRM is an efficient method to identify MDPs. Contact: [email protected]. Availability and implementation: https://github.com/zty2009/GPDNN-for-Identify-ing-Disease-related-Metabolites.
key: studies the relationship between diseases and metabolites from a metabolomics perspective

Workflow of Deep-DRM

https://academic.oup.com/bib/article/22/4/bbaa239/5923110
Abstract
Delineating the fingerprint or feature vector of a receptor/protein will facilitate the structural and biological studies, as well as the rational design and development of drugs with high affinities and selectivity. However, protein is complicated by its different functional regions that can bind to some of its protein partner(s), substrate(s), orthosteric ligand(s) or allosteric modulator(s) where cogent methods like molecular fingerprints do not work well. We here elaborate a scoring-function-based computing protocol Molecular Complex Characterizing System to help characterize the binding feature of protein–ligand complexes. Based on the reported receptor-ligand interactions, we first quantitate the energy contribution of each individual residue which may be an alternative of MD-based energy decomposition. We then construct a vector for the energy contribution to represent the pattern of the ligand recognition at a receptor and qualitatively analyze the matching level with other receptors. Finally, the energy contribution vector is explored for extensive use in similarity and clustering. The present work provides a new approach to cluster proteins, a perspective counterpart for determining the protein characteristics in the binding, and an advanced screening technique where molecular docking is applicable.
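key: characterizing the features of protein–ligand complexes
Small-molecule features can be described with molecular fingerprints.
Protein functional regions can bind to protein partners, substrates, orthosteric ligands or allosteric modulators, which makes proteins complicated to characterize.
Based on the energy contribution of each individual residue rather than the energy of the whole protein.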

https://academic.oup.com/bib/article/22/4/bbaa242/5924410

Abstract

Human papillomavirus (HPV) integrating into human genome is the main cause of cervical carcinogenesis. HPV integration selection preference shows strong dependence on local genomic environment. Due to this theory, it is possible to predict HPV integration sites. However, a published bioinformatic tool is not available to date. Thus, we developed an attention-based deep learning model DeepHPV to predict HPV integration sites by learning environment features automatically. In total, 3608 known HPV integration sites were applied to train the model, and 584 reviewed HPV integration sites were used as the testing dataset. DeepHPV showed an area under the receiver-operating characteristic (AUROC) of 0.6336 and an area under the precision recall (AUPR) of 0.5670. Adding RepeatMasker and TCGA Pan Cancer peaks improved the model performance to 0.8464 and 0.8501 in AUROC and 0.7985 and 0.8106 in AUPR, respectively. Next, we tested these trained models on independent database VISDB and found the model adding TCGA Pan Cancer performed better (AUROC: 0.7175, AUPR: 0.6284) than the model adding RepeatMasker peaks (AUROC: 0.6102, AUPR: 0.5577). Moreover, we introduced attention mechanism in DeepHPV and enriched the transcription factor binding sites including BHLHA15, CHR, COUP-TFII, DMRTA2, E2A, HIC1, INR, NPAS, Nr5a2, RARa, SCL, Snail1, Sox10, Sox3, Sox4, Sox6, STAT6, Tbet, Tbx5, TEAD, Tgif2, ZNF189, ZNF416 near attention intensive sites. Together, DeepHPV is a robust and explainable deep learning model, providing new insights into HPV integration preference and mechanism.

Availability: DeepHPV is available as an open-source software and can be downloaded from https://github.com/JiuxingLiang/DeepHPV.git
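key: predicting HPV (human papillomavirus) integration sites with deep learning
HPV integrates into the human genome and induces genomic instability, readily producing insertional mutations and fusions; this is a cause of some cancers.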

https://academic.oup.com/bib/article/22/4/bbaa241/5925270
Abstract
Biological network-based strategies are useful in prioritizing genes associated with diseases. Several comprehensive human gene networks such as STRING, GIANT and HumanNet were developed and used in network-assisted algorithms to identify disease-associated genes. However, none of these networks are disease-specific and may not accurately reflect gene interactions for a specific disease. Aiming to improve disease gene prioritization using networks, we propose a Disease-Specific Network Enhancement Prioritization (DiSNEP) framework. DiSNEP first enhances a comprehensive gene network specifically for a disease through a diffusion process on a gene–gene similarity matrix derived from disease omics data. The enhanced disease-specific gene network thus better reflects true gene interactions for the disease and may improve prioritizing disease-associated genes subsequently. In simulations, DiSNEP that uses an enhanced disease-specific network prioritizes more true signal genes than comparison methods using a general gene network or without prioritization. Applications to prioritize cancer-associated gene expression and DNA methylation signal genes for five cancer types from The Cancer Genome Atlas (TCGA) project suggest that more prioritized candidate genes by DiSNEP are cancer-related according to the DisGeNET database than those prioritized by the comparison methods, consistently across all five cancer types considered, and for both gene expression and DNA methylation signal genes.
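key: prioritizing disease-associated genes using a disease-specific gene network
STRING, GIANT and HumanNet are not disease-specific.
Gene similarity computed from disease omics data is added.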
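
Network-assisted prioritization generally works by diffusing signal from known or candidate genes across a gene network. The sketch below uses a generic random walk with restart to illustrate that idea. It is not the DiSNEP algorithm, whose diffusion step enhances the network itself from a gene–gene similarity matrix; the restart probability and the toy graph are assumptions.

```python
def random_walk_with_restart(adj, seeds, restart=0.5, n_iter=200, tol=1e-12):
    """Diffuse scores from seed nodes over a weighted, undirected network.

    adj:   square list-of-lists adjacency matrix.
    seeds: indices of the seed genes (receive the restart mass).
    """
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    # Column-normalize so each column is a transition distribution.
    w = [
        [adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
    p0 = [0.0] * n
    for s in seeds:
        p0[s] = 1.0 / len(seeds)
    p = p0[:]
    for _ in range(n_iter):
        new = [
            (1 - restart) * sum(w[i][j] * p[j] for j in range(n)) + restart * p0[i]
            for i in range(n)
        ]
        converged = max(abs(a - b) for a, b in zip(new, p)) < tol
        p = new
        if converged:
            break
    return p


# Toy path graph 0 - 1 - 2 with gene 0 as the only seed.
scores = random_walk_with_restart([[0, 1, 0], [1, 0, 1], [0, 1, 0]], seeds=[0])
```

Genes are then ranked by their diffused score, so a gene close to the seeds in the network ranks above a distant one. A disease-specific network changes the edge weights, and therefore the ranking.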

https://academic.oup.com/bib/article/22/4/bbaa248/5929825
Abstract
Single-cell mRNA sequencing has been adopted as a powerful technique for understanding gene expression profiles at the single-cell level. However, challenges remain due to factors such as the inefficiency of mRNA molecular capture, technical noises and separate sequencing of cells in different batches. Normalization methods have been developed to ensure a relatively accurate analysis. This work presents a survey on 10 tools specifically designed for single-cell mRNA sequencing data preprocessing steps, among which 6 tools are used for dropout normalization and 4 tools are for batch effect correction. In this survey, we outline the main methodology for each of these tools, and we also compare these tools to evaluate their normalization performance on datasets which are simulated under the constraints of dropout inefficiency, batch effect or their combined effects. We found that Saver and Baynorm performed better than other methods in dropout normalization, in most cases. Beer and Batchelor performed better in the batch effect normalization, and the Saver–Beer tool combination and the Baynorm–Beer combination performed better in the mixed dropout-and-batch effect normalization. Over-normalization is a common issue occurred to these dropout normalization tools that is worth of future investigation. For the batch normalization tools, the capability of retaining heterogeneity between different groups of cells after normalization can be another direction for future improvement.
key:

Surveyed tools summary

https://academic.oup.com/bib/article/21/6/1886/5626330?searchresult=1#218009841

Abstract

In clinical cancer treatment, genomic alterations would often affect the response of patients to anticancer drugs. Studies have shown that molecular features of tumors could be biomarkers predictive of sensitivity or resistance to anticancer agents, but the identification of actionable mutations are often constrained by the incomplete understanding of cancer genomes. Recent progresses of next-generation sequencing technology greatly facilitate the extensive molecular characterization of tumors and promote precision medicine in cancers. More and more clinical studies, cancer cell lines studies, CRISPR screening studies as well as patient-derived model studies were performed to identify potential actionable mutations predictive of drug response, which provide rich resources of molecularly and pharmacologically profiled cancer samples at different levels. Such abundance of data also enables the development of various computational models and algorithms to solve the problem of drug sensitivity prediction, biomarker identification and in silico drug prioritization by the integration of multiomics data. Here, we review the recent development of methods and resources that identifies mutation-dependent effects for cancer treatment in clinical studies, functional genomics studies and computational studies and discuss the remaining gaps and future directions in this area.
key:

1. Clinical studies
2. Functional genomics studies
2.1 Cell line-based drug screening studies

A flowchart showing the discovery of therapeutic biomarkers using comprehensive drug screening on cancer cell lines and PDMs. Cancer specimens are cell cultured or transplanted into immunodeficient mice. In vitro and in vivo drug screenings are carried out among these established cell or mouse models. Next, multiomics profiles together with drug response data are used to establish drug–biomarker relationship (such as specific mutations). After validation in clinical trials, patients with matched genomic features could be treated with the drug that showed ideal response.

2.2 CRISPR screening studies
2.3 PDM studies
patient-derived tumor cell (PDC), patient-derived xenograft model (PDX) and patient-derived organoid (PDO)
3. Computational studies

Summary of the workflow of computational studies

https://academic.oup.com/bib/article/22/3/bbaa108/5854404?searchresult=1#248049308

Abstract

More than 48 kinase inhibitors (KIs) have been approved by Food and Drug Administration. However, drug-resistance (DR) eventually occurs, and secondary mutations have been found in the previously targeted primary-mutated cancer cells. Cancer and drug research communities recognize the importance of the kinase domain (KD) mutations for kinasopathies. So far, a systematic investigation of kinase mutations on DR hotspots has not been done yet. In this study, we systematically investigated four types of representative mutation hotspots (gatekeeper, G-loop, αC-helix and A-loop) associated with DR in 538 human protein kinases using large-scale cancer data sets (TCGA, ICGC, COSMIC and GDSC). Our results revealed 358 kinases harboring 3318 mutations that covered 702 drug resistance hotspot residues. Among them, 197 kinases had multiple genetic variants on each residue. We further computationally assessed and validated the epidermal growth factor receptor mutations on protein structure and drug-binding efficacy. This is the first study to provide a landscape view of DR-associated mutation hotspots in kinase’s secondary structures, and its knowledge will help the development of effective next-generation KIs for better precision medicine.
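key: Cancer is often accompanied by kinase activation, so cancer patients are treated with kinase inhibitors. However, kinase mutations can also give rise to drug resistance. This study systematically analyzes the hotspots of kinase drug resistance.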

Kinase hotspot mutation statistics and the flowchart of annotation of kinase hotspot mutations.
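The annotation step described in the abstract (mapping observed mutations onto drug-resistance hotspot residues and tallying them per hotspot type) can be sketched roughly as follows. This is only a toy illustration, not the authors' pipeline: the `HOTSPOTS` table, the `annotate` helper and the example mutation list are hypothetical, and the handful of residues used are well-known textbook positions rather than data from the paper.

```python
import re
from collections import Counter

# Hypothetical lookup table: (kinase, residue position) -> hotspot type.
# The study covers 538 kinases; only a few familiar residues are listed here.
HOTSPOTS = {
    ("EGFR", 790): "gatekeeper",
    ("ABL1", 315): "gatekeeper",
    ("EGFR", 719): "G-loop",
    ("EGFR", 858): "A-loop",
}

# Matches simple missense notation such as "T790M" and captures the position.
MISSENSE = re.compile(r"^[A-Z](\d+)[A-Z]$")


def annotate(mutations):
    """Keep only mutations that fall on a known hotspot residue."""
    annotated = []
    for gene, change in mutations:
        match = MISSENSE.match(change)
        if match is None:
            continue  # skip anything that is not a simple missense change
        hotspot = HOTSPOTS.get((gene, int(match.group(1))))
        if hotspot is not None:
            annotated.append((gene, change, hotspot))
    return annotated


# Toy input, assumed for illustration only.
mutations = [
    ("EGFR", "T790M"),
    ("EGFR", "L858R"),
    ("ABL1", "T315I"),
    ("EGFR", "A289V"),  # outside the kinase domain, so not annotated
    ("EGFR", "G719S"),
]

annotated = annotate(mutations)
per_type = Counter(hotspot for _, _, hotspot in annotated)
print(per_type)
```

A real analysis would build the lookup table from a kinase-domain alignment and read mutations from resources such as TCGA or COSMIC; the dictionary here only stands in for that step.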

https://academic.oup.com/bib/article/22/4/bbaa240/5929824

Abstract

Emerging evidence indicates that the abnormal expression of miRNAs involves in the evolution and progression of various human complex diseases. Identifying disease-related miRNAs as new biomarkers can promote the development of disease pathology and clinical medicine. However, designing biological experiments to validate disease-related miRNAs is usually time-consuming and expensive. Therefore, it is urgent to design effective computational methods for predicting potential miRNA-disease associations. Inspired by the great progress of graph neural networks in link prediction, we propose a novel graph auto-encoder model, named GAEMDA, to identify the potential miRNA-disease associations in an end-to-end manner. More specifically, the GAEMDA model applies a graph neural networks-based encoder, which contains aggregator function and multi-layer perceptron for aggregating nodes’ neighborhood information, to generate the low-dimensional embeddings of miRNA and disease nodes and realize the effective fusion of heterogeneous information. Then, the embeddings of miRNA and disease nodes are fed into a bilinear decoder to identify the potential links between miRNA and disease nodes. The experimental results indicate that GAEMDA achieves the average area under the curve of $93.56 \pm 0.44\%$ under 5-fold cross-validation. Besides, we further carried out case studies on colon neoplasms, esophageal neoplasms and kidney neoplasms. As a result, 48 of the top 50 predicted miRNAs associated with these diseases are confirmed by the database of differentially expressed miRNAs in human cancers and microRNA deregulation in human disease database, respectively. The satisfactory prediction performance suggests that GAEMDA model could serve as a reliable tool to guide the following researches on the regulatory role of miRNAs. Besides, the source codes are available at https://github.com/chimianbuhetang/GAEMDA.
key: miRNA-disease associations

Flowchart of GAEMDA model for predicting potential miRNA-disease associations.
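The bilinear decoder mentioned in the abstract can be illustrated with a minimal NumPy sketch. It assumes the usual form of a bilinear link-prediction decoder, score = sigmoid(z_m^T W z_d); the sigmoid output, the embedding size and the random embeddings below are assumptions for illustration and are not taken from the GAEMDA source code.

```python
import numpy as np


def bilinear_decoder(z_mirna, z_disease, weight):
    """Score every miRNA-disease pair as sigmoid(z_m^T W z_d).

    z_mirna:   (n_mirna, dim) node embeddings from the encoder
    z_disease: (n_disease, dim) node embeddings from the encoder
    weight:    (dim, dim) bilinear weight matrix (learned in practice)
    """
    logits = z_mirna @ weight @ z_disease.T
    return 1.0 / (1.0 + np.exp(-logits))


# Toy embeddings standing in for the encoder output (assumed sizes).
rng = np.random.default_rng(0)
n_mirna, n_disease, dim = 4, 3, 8
z_mirna = rng.normal(size=(n_mirna, dim))
z_disease = rng.normal(size=(n_disease, dim))
weight = 0.1 * rng.normal(size=(dim, dim))

scores = bilinear_decoder(z_mirna, z_disease, weight)
print(scores.shape)  # one association score per (miRNA, disease) pair
```

In training, such scores would be compared against known miRNA-disease links with a classification loss, so that the encoder and the weight matrix are learned end to end.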

https://academic.oup.com/bib/article/22/4/bbaa254/5937174
Abstract
Enhancer-promoter interactions (EPIs) play an important role in transcriptional regulation. Recently, machine learning-based methods have been widely used in the genome-scale identification of EPIs due to their promising predictive performance. In this paper, we propose a novel method, termed EPI-DLMH, for predicting EPIs with the use of DNA sequences only. EPI-DLMH consists of three major steps. First, a two-layer convolutional neural network is used to learn local features, and a bidirectional gated recurrent unit network is used to capture long-range dependencies on the sequences of promoters and enhancers. Second, an attention mechanism is used for focusing on relatively important features. Finally, a matching heuristic mechanism is introduced for the exploration of the interaction between enhancers and promoters. We use benchmark datasets in evaluating and comparing the proposed method with existing methods. Comparative results show that our model is superior to currently existing models in multiple cell lines. Specifically, we found that the matching heuristic mechanism introduced into the proposed model mainly contributes to the improvement of performance in terms of overall accuracy. Additionally, compared with existing models, our model is more efficient with regard to computational speed.
key:

Architecture of the model. It consists of four steps, including sequence embedding, feature extraction, matching heuristic and prediction.
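As a rough illustration of the matching heuristic step, the sketch below combines an enhancer vector and a promoter vector in the way matching heuristics are commonly formulated in sequence-pair models: concatenation of both vectors, their element-wise difference and their element-wise product. Whether EPI-DLMH uses exactly these four terms is an assumption here, and the function name and toy vectors are hypothetical.

```python
import numpy as np


def matching_heuristic(enhancer_vec, promoter_vec):
    """Combine two feature vectors into one interaction representation.

    Returns [e ; p ; e - p ; e * p], where e and p are the (already
    attention-pooled) enhancer and promoter feature vectors.
    """
    return np.concatenate(
        [
            enhancer_vec,
            promoter_vec,
            enhancer_vec - promoter_vec,  # element-wise difference
            enhancer_vec * promoter_vec,  # element-wise product
        ],
        axis=-1,
    )


# Toy feature vectors (assumed values, for illustration only).
enhancer = np.array([1.0, 2.0])
promoter = np.array([3.0, 4.0])
combined = matching_heuristic(enhancer, promoter)
print(combined)  # [ 1.  2.  3.  4. -2. -2.  3.  8.]
```

The combined vector would then be passed to a small fully connected classifier that outputs the probability of an enhancer-promoter interaction.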

2021-08-23 BIB Volume 22, Issue 4, July 2021
