Article ; Online: Understanding the performance and reliability of NLP tools
Frontiers in Digital Health, Vol
a comparison of four NLP tools predicting stroke phenotypes in radiology reports
2023 Volume 5
Abstract: BackgroundNatural language processing (NLP) has the potential to automate the reading of radiology reports, but there is a need to demonstrate that NLP methods are adaptable and reliable for use in real-world clinical applications.MethodsWe tested the F1 ...
Abstract | BackgroundNatural language processing (NLP) has the potential to automate the reading of radiology reports, but there is a need to demonstrate that NLP methods are adaptable and reliable for use in real-world clinical applications.MethodsWe tested the F1 score, precision, and recall to compare NLP tools on a cohort from a study on delirium using images and radiology reports from NHS Fife and a population-based cohort (Generation Scotland) that spans multiple National Health Service health boards. We compared four off-the-shelf rule-based and neural NLP tools (namely, EdIE-R, ALARM+, ESPRESSO, and Sem-EHR) and reported on their performance for three cerebrovascular phenotypes, namely, ischaemic stroke, small vessel disease (SVD), and atrophy. Clinical experts from the EdIE-R team defined phenotypes using labelling techniques developed in the development of EdIE-R, in conjunction with an expert researcher who read underlying images.ResultsEdIE-R obtained the highest F1 score in both cohorts for ischaemic stroke, ≥93%, followed by ALARM+, ≥87%. The F1 score of ESPRESSO was ≥74%, whilst that of Sem-EHR is ≥66%, although ESPRESSO had the highest precision in both cohorts, 90% and 98%. For F1 scores for SVD, EdIE-R scored ≥98% and ALARM+ ≥90%. ESPRESSO scored lowest with ≥77% and Sem-EHR ≥81%. In NHS Fife, F1 scores for atrophy by EdIE-R and ALARM+ were 99%, dropping in Generation Scotland to 96% for EdIE-R and 91% for ALARM+. Sem-EHR performed lowest for atrophy at 89% in NHS Fife and 73% in Generation Scotland. When comparing NLP tool output with brain image reads using F1 scores, ALARM+ scored 80%, outperforming EdIE-R at 66% in ischaemic stroke. For SVD, EdIE-R performed best, scoring 84%, with Sem-EHR 82%. For atrophy, EdIE-R and both ALARM+ versions were comparable at 80%.ConclusionsThe four NLP tools show varying F1 (and precision/recall) scores across all three phenotypes, although more apparent for ischaemic stroke. If NLP tools are to be used in clinical settings, this cannot be performed “out of the box.” ... |
---|---|
Keywords | natural language processing ; brain radiology ; stroke phenotype ; electronic health records ; Medicine ; R ; Public aspects of medicine ; RA1-1270 ; Electronic computers. Computer science ; QA75.5-76.95 |
Subject code | 006 |
Language | English |
Publishing date | 2023-09-01T00:00:00Z |
Publisher | Frontiers Media S.A. |
Document type | Article ; Online |
Database | BASE - Bielefeld Academic Search Engine (life sciences selection) |
Full text online
More links
Kategorien
Inter-library loan at ZB MED
Your chosen title can be delivered directly to ZB MED Cologne location if you are registered as a user at ZB MED Cologne.