Paper reading (53): ML Meta-analysis of Large Metagenomic Datasets: Tools and Biological Insights

Paper title: Machine Learning Meta-analysis of Large Metagenomic Datasets: Tools and Biological Insights

Google Scholar citations: 111

Pages: 26

Publication date: 2016.07

Journal: PLOS Computational Biology

Authors: Edoardo Pasolli, Duy Tin Truong, ..., Nicola Segata

Abstract:

Shotgun metagenomic analysis of the human associated microbiome provides a rich set of microbial features for prediction and biomarker discovery in the context of human diseases and health conditions. However, the use of such high-resolution microbial features presents new challenges, and validated computational tools for learning tasks are lacking. Moreover, classification rules have scarcely been validated in independent studies, posing questions about the generality and generalization of disease-predictive models across cohorts. In this paper, we comprehensively assess approaches to metagenomics-based prediction tasks and for quantitative assessment of the strength of potential microbiome-phenotype associations. We develop a computational framework for prediction tasks using quantitative microbiome profiles, including species-level relative abundances and presence of strain-specific markers. A comprehensive meta-analysis, with particular emphasis on generalization across cohorts, was performed in a collection of 2424 publicly available metagenomic samples from eight large-scale studies. Cross-validation revealed good disease-prediction capabilities, which were in general improved by feature selection and use of strain-specific markers instead of species-level taxonomic abundance. In cross-study analysis, models transferred between studies were in some cases less accurate than models tested by within-study cross-validation. Interestingly, the addition of healthy (control) samples from other studies to training sets improved disease prediction capabilities. Some microbial species (most notably Streptococcus anginosus) seem to characterize general dysbiotic states of the microbiome rather than connections with a specific disease. Our results in modelling features of the “healthy” microbiome can be considered a first step toward defining general microbial dysbiosis. 
The software framework, microbiome profiles, and metadata for thousands of samples are publicly available at http://segatalab.cibio.unitn.it/tools/metaml.

Paper outline:

1. Introduction

2. Results and Discussion

2.1 Cross-validation studies revealed good capabilities for disease prediction

2.2 Feature selection and strain-specific markers improve prediction accuracy

2.3 Detection of the disease-associated microbial features

2.4 Extension to non-disease classification problems

2.5 Metagenomic disease-predictive models show strong cross-stage generalization

2.6 Cross-study generalization is improved by including healthy samples from other cohorts

2.7 Avoiding overfitting is crucial to generalization on different cohorts

2.8 Modelling the “healthy” microbiome: Cross-disease prediction

3. Conclusion

4. Methods

4.1 The proposed tool

4.2 The adopted machine learning tools

4.3 Validation and evaluation strategies

4.4 The considered large metagenomic datasets

4.5 Extraction of species abundance and marker presence profiles from metagenomic samples

4.6 Experimental setting

4.7 Code and data availability

Excerpts from the main text:

1. Biological Problem: What biological problems have been solved in this paper?

  • metagenomics-based prediction tasks
  • quantitative assessment of microbiome-phenotype associations
  • Main biological problem of focus: predictive modelling of disease states

2. Main discoveries: What are the main discoveries in this paper?

  • Cross-validation revealed good disease-prediction capabilities, which were in general improved by feature selection and use of strain-specific markers instead of species-level taxonomic abundance.
  • In cross-study analysis, models transferred between studies were in some cases less accurate than models tested by within-study cross-validation.
  • Our results in modelling features of the “healthy” microbiome can be considered a first step toward defining general microbial dysbiosis.
  • We comprehensively assess approaches to metagenomics-based prediction tasks and for quantitative assessment of microbiome-phenotype associations.
  • We consider 2424 samples from eight studies and six different diseases to assess the independent prediction accuracy of models built on shotgun metagenomic data and to compare strategies for practical use of the microbiome as a prediction tool.
  • Main work of focus: We uniformly processed shotgun metagenomic microbiome data for 2424 samples from 8 studies of 6 disease types, and used cross-validation, cross-study validation, and cross-disease validation to evaluate the accuracy of candidate methods of predictive modelling of disease states.
  • In this study we uniformly process 2424 shotgun metagenomic samples from eight studies to assess the independent prediction accuracy of models built on metagenomic data and to compare strategies for practical use of the microbiome as a prediction tool.
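The cross-study setting mentioned above (train on one cohort, test on another, optionally adding healthy controls from further cohorts to the training set) can be sketched in a few lines. This is only an illustrative sketch, not the paper's code: the function name `cross_study_split`, the dictionary keys `id`, `study`, and `label`, and the toy samples are all assumptions made for the example.

```python
def cross_study_split(samples, train_study, test_study, add_external_controls=False):
    """Build a train/test split for cross-study validation.

    `samples` is a list of dicts with hypothetical keys "id", "study", "label".
    The model would be trained on `train_study` and tested on `test_study`.
    If `add_external_controls` is True, control samples from every *other*
    study (never the test study) are appended to the training set.
    """
    train = [s for s in samples if s["study"] == train_study]
    test = [s for s in samples if s["study"] == test_study]
    if add_external_controls:
        train += [
            s for s in samples
            if s["study"] not in (train_study, test_study)
            and s["label"] == "control"
        ]
    return train, test


# Toy metadata for three made-up cohorts A, B, C.
samples = [
    {"id": "a1", "study": "A", "label": "disease"},
    {"id": "a2", "study": "A", "label": "control"},
    {"id": "b1", "study": "B", "label": "disease"},
    {"id": "b2", "study": "B", "label": "control"},
    {"id": "c1", "study": "C", "label": "control"},
    {"id": "c2", "study": "C", "label": "disease"},
]

train, test = cross_study_split(samples, "A", "B", add_external_controls=True)
train_ids = [s["id"] for s in train]
test_ids = [s["id"] for s in test]
```

Keeping the test cohort entirely out of the training set, including its controls, is what makes the resulting score an estimate of generalization to an unseen study.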

3. ML(Machine Learning) Methods: What are the ML methods applied in this paper?

  • The developed tool incorporates four classification approaches (i.e. SVM, RF, Lasso, and ENet)
  • Methodologically, our results thus suggested the use of RFs for disease prediction from species abundances.
  • RF in combination with feature selection (RF-FS:Emb) achieved satisfactory classification results
  • Applying Lasso or ENet as pure classifiers, which implicitly incorporates the feature selection and classification steps, did not give satisfactory results, with AUC worse than RF or SVM for both species abundance and marker features
  • Better accuracies were obtained by using them for feature selection only, followed by RF or SVM classification. However, both Lasso and ENet feature selection in general gave worse performance than RF and SVM without prior feature selection. Finally, ENet worked better than Lasso.
  • Across the various analyses, RF performed best.
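To make the comparison between plain RF and RF with embedded feature selection concrete, here is a minimal scikit-learn sketch. It is one plausible reading of "RF-FS:Emb" (rank features by RF importance, keep the top ones, retrain an RF), not the paper's exact procedure. The synthetic data, the function name `compare_rf_with_embedded_fs`, and all parameter values (`n_keep`, number of trees, number of folds) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline


def compare_rf_with_embedded_fs(X, y, n_keep=20, seed=0):
    """Return mean cross-validated AUC for RF with and without feature selection."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)

    # Baseline: random forest on all features.
    rf = RandomForestClassifier(n_estimators=100, random_state=seed)

    # Embedded selection: a first RF ranks features by importance, the top
    # `n_keep` are kept, and a second RF is trained on them. Wrapping both
    # steps in a Pipeline re-runs the selection inside every training fold,
    # so the held-out fold never influences which features are chosen.
    rf_fs = Pipeline([
        ("select", SelectFromModel(
            RandomForestClassifier(n_estimators=100, random_state=seed),
            threshold=-np.inf,      # select purely by rank ...
            max_features=n_keep,    # ... keeping the n_keep most important
        )),
        ("clf", RandomForestClassifier(n_estimators=100, random_state=seed)),
    ])

    return {
        "RF": cross_val_score(rf, X, y, cv=cv, scoring="roc_auc").mean(),
        "RF-FS:Emb": cross_val_score(rf_fs, X, y, cv=cv, scoring="roc_auc").mean(),
    }


# Synthetic stand-in for a samples-by-species abundance table.
X, y = make_classification(
    n_samples=120, n_features=200, n_informative=10, random_state=0
)
scores = compare_rf_with_embedded_fs(X, y)
```

Doing the selection inside the cross-validation loop matters: choosing features on the full dataset first would leak information from the test folds and inflate the reported AUC.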

4. ML Advantages: Why are these ML methods better than the traditional methods in these biological problems?

  • The support vector machines (SVM) [38] and random forest (RF) [39] classifiers were used for this evaluation as they are state-of-the-art approaches and are appropriate for this type of data [18].

5. Biological Significance: What is the biological significance of these ML methods’ results?

  • a set of tools is provided to evaluate classification performances in different ways including i) evaluation metrics such as overall accuracy (OA), precision, recall, F1, and area under the curve (AUC); ii) receiver operating characteristic (ROC) curve plots; iii) confusion matrices; iv) plots of the most relevant features in addition to average relative abundances; and v) heatmap figures.
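The evaluation metrics listed above (OA, precision, recall, F1, AUC, and the confusion matrix) can be computed from labels and scores in plain Python. The following is a minimal sketch for the binary case, not the tool's implementation. The function names, the 0.5 decision threshold, and the toy labels and scores are assumptions for the example.

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts plus OA, precision, recall and F1 (positive class = 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "OA": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "F1": f1,
    }


def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney formulation of ROC AUC).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: three diseased (1) and three control (0) samples.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = binary_metrics(y_true, y_pred)
auc_value = auc(y_true, scores)
```

On this toy input, OA, precision, recall, and F1 all equal 2/3, while the AUC is 8/9. The threshold-based metrics depend on the chosen cutoff; AUC summarizes the ranking over all cutoffs.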

6. Prospect: What are the potential applications of these machine learning methods in biological science?

  • Future work will be devoted to exploring more advanced machine learning strategies to further improve classification performance.

 
