Paper title: A reliable method for colorectal cancer prediction based on feature selection and support vector machine
Scholar citations: 2
Pages: 12
Published: 2018.11
Journal: Medical & Biological Engineering & Computing
Authors: Dandan Zhao, Hong Liu,...,Chen Lyu
Abstract:
Colorectal cancer (CRC) is a common cancer responsible for approximately 600,000 deaths per year worldwide. Thus, it is very important to find the related factors and detect the cancer accurately. However, timely and accurate prediction of the disease is challenging. In this study, we build an integrated model based on logistic regression (LR) and support vector machine (SVM) to classify the CRC into cancer and normal samples. From various factors, human location, age, gender, BMI, and cancer tumor type, tumor grade, and DNA, of the cancer, we select the most significant factors (p < 0.05) using logistic regression as main features, and with these features, a grid-search SVM model is designed using different kernel types (Linear, radial basis function (RBF), Sigmoid, and Polynomial). The result of the logistic regression indicates that the Firmicutes (AUC 0.918), Bacteroidetes (AUC 0.856), body mass index (BMI) (AUC 0.777), and age (AUC 0.710) and their combined factors (AUC 0.942) are effective for CRC detection. And the best kernel type is RBF, which achieves an accuracy of 90.1% when k = 5, and 91.2% when k = 10. This study provides a new method for colorectal cancer prediction based on independent risky factors.
Key Words: Colorectal cancer, Logistic regression, Support vector machine, Microbiome
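The per-factor AUC values in the abstract describe how well a single variable separates CRC from normal samples. As a minimal sketch (not the authors' code), the AUC of one factor can be computed as the fraction of (CRC, normal) pairs in which the CRC sample has the larger value, with ties counted as half. The function name `auc_single_factor` and the sample ages below are hypothetical and chosen only for illustration.

```python
def auc_single_factor(values, labels):
    """AUC of one factor: probability that a random positive (label 1)
    sample has a larger value than a random negative (label 0) sample.
    Ties contribute 0.5."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical ages for four CRC (1) and four normal (0) samples.
ages = [71, 64, 58, 69, 52, 60, 47, 55]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(auc_single_factor(ages, labels))  # prints 0.9375
```

A factor whose values tend to be lower in CRC samples gives an AUC below 0.5 under this convention; in that case 1 minus the value is the figure comparable to those quoted above.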
Conclusion:
- There are still tremendous challenges in predicting the cancer.
- First, the selected feature set contains redundant features, which leads to poor performance of the classification algorithm.
- Second, the prediction accuracy is affected by the choice of predictor.
- We select the most significant factors from the various sample factors using a logistic regression model and validate the selection with ROC curves.
- The proper SVM kernel type is selected by comparing the Linear, RBF, Sigmoid, and Polynomial functions on the same dataset.
- Generally, the kernel function is chosen based on the prior knowledge of experts, but this approach often does not solve the problem well.
- We select RBF as the final kernel type and optimize the SVM parameters (C, g) using grid.py in LIBSVM. The accuracy reaches 90.1% when k = 5 and 91.2% when k = 10, which is higher than the other models (LR + RF, LR + NB, LR + KNN, LR + ANNs) and largely addresses the problem of low accuracy in cancer prediction.
- The results of this work indicate that the integrated model offers a new approach to the problems of low accuracy and the imbalance between sensitivity and specificity.
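The (C, g) search described in the conclusion can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: it uses scikit-learn's `GridSearchCV` in place of LIBSVM's grid.py, a synthetic four-feature data set in place of the paper's data, min-max scaling as a stand-in for the paper's normalization step, and a coarse, exponentially spaced grid whose ranges are chosen only for illustration. The helper name `best_rbf_svm` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for four selected factors; NOT the paper's data.
X, y = make_classification(n_samples=120, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

# Exponentially spaced candidates for C and gamma (illustrative ranges).
param_grid = {
    "svc__C": [2.0 ** e for e in range(-5, 16, 4)],
    "svc__gamma": [2.0 ** e for e in range(-15, 4, 4)],
}


def best_rbf_svm(X, y, k):
    """Grid-search an RBF SVM with k-fold cross-validation.

    Returns (best parameter dict, mean cross-validated accuracy)."""
    # Scaling lives inside the pipeline so it is refit on each training fold.
    model = make_pipeline(MinMaxScaler(), SVC(kernel="rbf"))
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    search = GridSearchCV(model, param_grid, cv=cv, scoring="accuracy")
    search.fit(X, y)
    return search.best_params_, search.best_score_


for k in (5, 10):
    params, acc = best_rbf_svm(X, y, k)
    print(f"k={k}: C={params['svc__C']}, gamma={params['svc__gamma']}, "
          f"accuracy={acc:.3f}")
```

The accuracies printed here come from synthetic data and say nothing about the 90.1% and 91.2% figures reported in the paper.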
Discussion:
- Meanwhile, this study provides a prediction model to differentiate between normal and CRC samples using our method, which combined logistic regression feature selection with SVM using an RBF kernel, yielding an accuracy of 90.1% when k = 5 and 91.2% when k = 10, which are both higher than the comparable models.
- From the above discussion, SVM achieves the best performance whether fivefold or tenfold cross-validation is used, and it keeps a balance between sensitivity and specificity.
- Based on our existing study, we will predict the cancer with consensus molecular subtypes.
- To solve the problem, we plan to explore the further relationships between CRC classification and clinical translation.
- For classifying the samples, we will choose one of the deep learning methods, CNNs, as the classifier based on TensorFlow. Finally, we will adjust the related parameters of the CNNs to choose the best classification result.
Introduction:
- Recent surveys have suggested that the gut microbiome, through its interactions with host metabolism, plays a significant role in influencing CRC risk.
- Traditional methods and existing machine learning methods achieve low prediction accuracy and cannot balance sensitivity and specificity, owing to improper predictors and redundant features.
- LR + SVM
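- The main contributions: largely the same as what is stated in the Discussion.
- Honestly, I do not think this paper is well written. Grammatical errors turn up from time to time, and much of the content feels redundant. The Introduction, Discussion, and Conclusion repeat a great deal of the same material; the paper could easily be more concise.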
Paper structure:
1. Introduction
2. Related works
3. Materials and methods
3.1 Data set
3.2 Metagenomic sequencing and quality control
3.3 Microbiome diversity of CRC patients
3.4 Risk factor analysis
3.5 Data normalization
3.6 Logistic regression model
3.7 Creation of the SVM prediction model
3.8 Classifier performance measures
3.9 Validation
4. Results
4.1 Logistic regression analysis result
4.2 ROC curve analysis of single relevant factor and combined factor
4.3 SVM model establishment
4.4 Comparison of LR + SVM with other models
5. Discussion
6. Conclusion
Excerpts from the main text:
1. Biological Problem: What biological problems have been solved in this paper?
- colorectal cancer prediction
2. Main discoveries: What are the main discoveries in this paper?
- The result of the logistic regression indicates that the Firmicutes (AUC 0.918), Bacteroidetes (AUC 0.856), body mass index (BMI) (AUC 0.777), and age (AUC 0.710) and their combined factors (AUC 0.942) are effective for CRC detection.
- And the best kernel type is RBF, which achieves an accuracy of 90.1% when k = 5, and 91.2% when k = 10.
- The main contributions:
- We select the most significant factors from the various sample factors using a logistic regression model and validate the selection with ROC curves.
- The proper SVM kernel type is selected by comparing the Linear, RBF, Sigmoid, and Polynomial functions on the same dataset.
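For reference, the four kernel types being compared can be written out directly. The sketch below uses the parameterization common to SVM libraries such as LIBSVM (`gamma`, `coef0`, `degree`); the default values in the function signatures and the example vectors are assumptions for illustration, not settings taken from the paper.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear(u, v):
    return dot(u, v)


def polynomial(u, v, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * dot(u, v) + coef0) ** degree


def rbf(u, v, gamma=1.0):
    # exp(-gamma * ||u - v||^2): equals 1 for identical points and
    # decays toward 0 as the points move apart.
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def sigmoid(u, v, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)


# Hypothetical feature vectors.
u, v = [1.0, 2.0], [3.0, 4.0]
print(linear(u, v))  # prints 11.0
print(rbf(u, u))     # prints 1.0
```

Only the RBF kernel depends on the distance between samples rather than their inner product, which is why its parameter g controls how local the decision boundary is.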
3. ML(Machine Learning) Methods: What are the ML methods applied in this paper?
- The data set used in this study comes from the passage , which can be downloaded from the NCBI website.
- logistic regression (LR) and support vector machine (SVM)
4. ML Advantages: Why are these ML methods better than the traditional methods in these biological problems?
- Traditional methods and existing machine learning methods achieve low prediction accuracy and cannot balance sensitivity and specificity, owing to improper predictors and redundant features.
- The common way to select a kernel function relies on the prior knowledge of experts.
5. Biological Significance: What is the biological significance of these ML methods’ results?
- Measures such as sensitivity (SN), specificity (SP), accuracy (ACC), and Matthews’ correlation coefficient (MCC) are used to evaluate the performance of the classifiers.
- From the above discussion, SVM achieves the best performance whether fivefold or tenfold cross-validation is used, and it keeps a balance between sensitivity and specificity.
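The four measures named above follow from the confusion-matrix counts using their standard definitions. The sketch below is illustrative only: the function name `classifier_metrics` and the counts in the example are hypothetical and are not results from the paper.

```python
import math


def classifier_metrics(tp, tn, fp, fn):
    """Compute SN, SP, ACC and MCC from confusion-matrix counts.

    tp/fn count CRC samples predicted as CRC/normal;
    tn/fp count normal samples predicted as normal/CRC."""
    sn = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity (recall)
    sp = tn / (tn + fp) if (tn + fp) else 0.0   # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)       # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}


# Hypothetical counts: 50 CRC and 50 normal samples.
m = classifier_metrics(tp=40, tn=45, fp=5, fn=10)
print({name: round(value, 3) for name, value in m.items()})
```

A large gap between SN and SP signals the imbalance the notes refer to, even when ACC looks acceptable; MCC summarizes all four counts in a single value between -1 and 1.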
6. Prospect: What are the potential applications of these machine learning methods in biological science?
- Based on our existing study, we will predict the cancer with consensus molecular subtypes.
- To solve the problem, we plan to explore the further relationships between CRC classification and clinical translation.
- For classifying the samples, we will choose one of the deep learning methods, CNNs, as the classifier based on TensorFlow. Finally, we will adjust the related parameters of the CNNs to choose the best classification result.
- The results of this work indicate that the integrated model offers a new approach to the problems of low accuracy and the imbalance between sensitivity and specificity.
7. My Questions (Optional)
- The Introduction, Discussion, and Conclusion of this paper repeat too much of the same content; they could be more concise.