Article ; Online: Multimetric feature selection for analyzing multicategory outcomes of colorectal cancer: random forest and multinomial logistic regression models.
Laboratory investigation; a journal of technical methods and pathology
2021 Volume 102, Issue 3, Page(s) 236–244
Abstract | Colorectal cancer (CRC) is one of the most common cancers worldwide and a leading cause of cancer deaths. Better classification of multicategory outcomes of CRC with clinical and omic data may help adjust treatment regimens based on an individual's risk. Here, we selected the features that were useful for classifying a four-category survival outcome of CRC using the clinical and transcriptomic data, or the clinical, transcriptomic, microsatellite instability and selected oncogenic-driver data (all data) of TCGA. We also optimized multimetric feature selection to develop the best multinomial logistic regression (MLR) and random forest (RF) models that had the highest accuracy, precision, recall and F1 score, respectively. We identified 2073 differentially expressed genes in the TCGA RNA-Seq dataset. MLR overall outperformed RF in the multimetric feature selection. In both RF and MLR models, precision, recall and F1 score increased as the feature number increased and peaked at 600-1000 features, while the models' accuracy remained stable. The best model was the MLR model with 825 features selected based on the sum of squared coefficients using all data; it attained the best accuracy of 0.855, F1 of 0.738 and precision of 0.832, which were higher than those obtained using clinical and transcriptomic data. The top-ranked features in the best-performing MLR model using clinical and transcriptomic data were different from those using all data. However, pathologic staging, HBS1L, TSPYL4, and TP53TG3B were the overlapping top-20 ranked features in the best models using clinical and transcriptomic data, or all data. Thus, we developed a multimetric feature-selection based MLR model that outperformed RF models in classifying the four-category outcome of CRC patients. Interestingly, adding microsatellite instability and oncogenic-driver data to clinical and transcriptomic data improved the models' performance. Precision and recall of tuned algorithms may change significantly as the feature number changes, but accuracy appears insensitive to these changes. |
---|---|
MeSH term(s) | Adult ; Aged ; Colorectal Neoplasms/genetics ; Colorectal Neoplasms/pathology ; Colorectal Neoplasms/therapy ; Female ; Gene Expression Profiling/methods ; Gene Expression Regulation, Neoplastic ; Humans ; Logistic Models ; Male ; Microsatellite Instability ; Middle Aged ; Oncogenes/genetics ; Outcome Assessment, Health Care/classification ; Outcome Assessment, Health Care/methods ; Outcome Assessment, Health Care/statistics & numerical data ; RNA-Seq/methods ; Reproducibility of Results |
Language | English |
Publishing date | 2021-09-18 |
Publishing country | United States |
Document type | Journal Article ; Research Support, Non-U.S. Gov't |
ZDB-ID | 80178-1 |
ISSN | 1530-0307 ; 0023-6837 |
ISSN (online) | 1530-0307 |
ISSN (print) | 0023-6837 |
DOI | 10.1038/s41374-021-00662-x |
Database | MEDical Literature Analysis and Retrieval System OnLINE |
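
The abstract mentions ranking features by the sum of squared coefficients of a multinomial logistic regression and comparing models on accuracy, precision, recall and F1. The following is a minimal sketch of those two ideas in plain Python, not the authors' implementation. The function names, the toy coefficient matrix and the toy labels are hypothetical, and macro-averaging across the four classes is an assumption: the record does not state which averaging scheme was used.

```python
from collections import Counter


def rank_by_sum_of_squared_coefficients(coef, names):
    """Rank features by the sum over classes of squared coefficients.

    coef[k][j] is the (hypothetical) MLR coefficient of feature j for class k.
    """
    scores = {
        name: sum(row[j] ** 2 for row in coef)
        for j, name in enumerate(names)
    }
    return sorted(scores, key=scores.get, reverse=True)


def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1 (assumed averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    precisions, recalls, f1s = [], [], []
    for c in labels:
        # Undefined ratios (no predictions or no true cases) are set to 0.
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy example: 2 coefficient rows (classes) x 3 features, all values made up.
toy_coef = [[0.1, -2.0, 0.5], [0.3, 1.0, -0.5]]
toy_names = ["feature_a", "feature_b", "feature_c"]
print(rank_by_sum_of_squared_coefficients(toy_coef, toy_names))

# Toy four-category outcome with imbalanced classes (labels 0..3, made up).
toy_true = [0, 0, 0, 0, 1, 1, 2, 3]
toy_pred = [0, 0, 0, 0, 1, 0, 2, 2]
print(macro_metrics(toy_true, toy_pred))
```

With imbalanced classes, as in the toy labels, accuracy is dominated by the largest class, whereas macro-averaged precision and recall weight every class equally. This is one possible reason the two kinds of metric can respond differently to changes in the feature set.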