LIVIVO - The Search Portal for Life Sciences

Search results

Results 1-10 of 39

  1. Article ; Online: MedCPT: Contrastive Pre-trained Transformers with large-scale PubMed search logs for zero-shot biomedical information retrieval.

    Jin, Qiao / Kim, Won / Chen, Qingyu / Comeau, Donald C / Yeganova, Lana / Wilbur, W John / Lu, Zhiyong

    Bioinformatics (Oxford, England)

    2023  Volume 39, Issue 11

    Abstract Motivation: Information retrieval (IR) is essential in biomedical knowledge acquisition and clinical decision support. While recent progress has shown that language model encoders perform better semantic retrieval, training such models requires abundant query-article annotations that are difficult to obtain in biomedicine. As a result, most biomedical IR systems only conduct lexical matching. In response, we introduce MedCPT, a first-of-its-kind Contrastively Pre-trained Transformer model for zero-shot semantic IR in biomedicine.
    Results: To train MedCPT, we collected an unprecedented scale of 255 million user click logs from PubMed. With such data, we use contrastive learning to train a pair of closely integrated retriever and re-ranker. Experimental results show that MedCPT sets new state-of-the-art performance on six biomedical IR tasks, outperforming various baselines including much larger models, such as GPT-3-sized cpt-text-XL. In addition, MedCPT also generates better biomedical article and sentence representations for semantic evaluations. As such, MedCPT can be readily applied to various real-world biomedical IR tasks.
    Availability and implementation: The MedCPT code and model are available at https://github.com/ncbi/MedCPT.
    MeSH term(s) Information Storage and Retrieval ; Language ; Natural Language Processing ; PubMed ; Semantics ; Review Literature as Topic
    Language English
    Publishing date 2023-11-06
    Publishing country England
    Document type Journal Article ; Research Support, N.I.H., Intramural
    ZDB-ID 1422668-6
    ISSN 1367-4811 ; 1367-4803
    ISSN (online) 1367-4811
    ISSN 1367-4803
    DOI 10.1093/bioinformatics/btad651
    Database MEDical Literature Analysis and Retrieval System OnLINE
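
    The abstract says the retriever is trained with contrastive learning on query-article click pairs but does not state the loss. As a rough illustration only, a common formulation (InfoNCE with in-batch negatives) can be sketched as below. The function names, the toy vectors, and the temperature parameter are assumptions for this sketch, not details taken from the MedCPT paper or code.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(query_vecs, doc_vecs, temperature=1.0):
    """Mean InfoNCE loss over a batch.

    doc_vecs[i] is treated as the clicked (positive) article for
    query_vecs[i]; every other article in the batch is a negative.
    This is a generic formulation, assumed here for illustration.
    """
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [dot(q, d) / temperature for d in doc_vecs]
        # Numerically stable log-sum-exp over all articles in the batch.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the positive article.
        losses.append(log_norm - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (made up): aligned pairs should score a lower loss.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_contrastive_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_contrastive_loss(queries, [[0.0, 1.0], [1.0, 0.0]])
```

    In this toy batch the loss is lower when each query sits closest to its own article, which is the behaviour a contrastive objective rewards.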

  2. Article ; Online: Towards a unified search: Improving PubMed retrieval with full text.

    Kim, Won / Yeganova, Lana / Comeau, Donald C / Wilbur, W John / Lu, Zhiyong

    Journal of biomedical informatics

    2022  Volume 134, Page(s) 104211

    Abstract Objective: A significant number of recent articles in PubMed have full text available in PubMed Central®, and the availability of full texts has been consistently growing. However, it is not currently possible for a user to simultaneously query the contents of both databases and receive a single integrated search result. In this study, we investigate how to score full text articles given a multitoken query and how to combine those full text article scores with scores originating from abstracts and achieve an overall improved retrieval performance.
    Materials and methods: For scoring full text articles, we propose a method to combine information coming from different sections by converting the traditionally used BM25 scores into log odds ratio scores which can be treated uniformly. We further propose a method that successfully combines scores from two heterogenous retrieval sources - full text articles and abstract only articles - by balancing the contributions of their respective scores through a probabilistic transformation. We use PubMed click data that consists of queries sampled from PubMed user logs along with a subset of retrieved and clicked documents to train the probabilistic functions and to evaluate retrieval effectiveness.
    Results and conclusions: Random ranking achieves 0.579 MAP score on our PubMed click data. BM25 ranking on PubMed abstracts improves the MAP by 10.6%. For full text documents, experiments confirm that BM25 section scores are of different value depending on the section type and are not directly comparable. Naïvely using the body text of articles along with abstract text degrades the overall quality of the search. The proposed log odds ratio scores normalize and combine the contributions of occurrences of query tokens in different sections. By including full text where available, we gain another 0.67%, or 7% relative improvement over abstract alone. We find an advantage in the more accurate estimate of the value of BM25 scores depending on the section from which they were produced. Taking the sum of top three section scores performs the best.
    MeSH term(s) Data Management ; Information Storage and Retrieval ; PubMed
    Language English
    Publishing date 2022-09-21
    Publishing country United States
    Document type Journal Article ; Research Support, N.I.H., Intramural
    ZDB-ID 2057141-0
    ISSN 1532-0480 ; 1532-0464
    ISSN (online) 1532-0480
    ISSN 1532-0464
    DOI 10.1016/j.jbi.2022.104211
    Database MEDical Literature Analysis and Retrieval System OnLINE
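
    The abstract reports that summing the top three section scores works best once section scores are on a common log-odds scale. The sketch below illustrates only that final combination step. It assumes each section's BM25 score has already been mapped to a probability by some calibration function, which the abstract does not detail; the section names and probabilities are made up.

```python
import math


def log_odds(p):
    """Log odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def article_score(section_probs, top_k=3):
    """Sum the top_k section log-odds scores for one article.

    section_probs maps a section name to an assumed, already calibrated
    probability; the calibration itself is outside this sketch.
    """
    scores = sorted((log_odds(p) for p in section_probs.values()), reverse=True)
    return sum(scores[:top_k])


# Hypothetical per-section probabilities for a single article.
example = {"title": 0.9, "abstract": 0.8, "methods": 0.5, "results": 0.6}
score = article_score(example)
```

    Here the "methods" section contributes nothing because it is ranked fourth; only the three strongest sections enter the sum.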

  3. Article ; Online: Opportunities and challenges for ChatGPT and large language models in biomedicine and health.

    Tian, Shubo / Jin, Qiao / Yeganova, Lana / Lai, Po-Ting / Zhu, Qingqing / Chen, Xiuying / Yang, Yifan / Chen, Qingyu / Kim, Won / Comeau, Donald C / Islamaj, Rezarta / Kapoor, Aadit / Gao, Xin / Lu, Zhiyong

    Briefings in bioinformatics

    2024  Volume 25, Issue 1

    Abstract ChatGPT has drawn considerable attention from both the general public and domain experts with its remarkable text generation capabilities. This has subsequently led to the emergence of diverse applications in the field of biomedicine and health. In this work, we examine the diverse applications of large language models (LLMs), such as ChatGPT, in biomedicine and health. Specifically, we explore the areas of biomedical information retrieval, question answering, medical text summarization, information extraction and medical education and investigate whether LLMs possess the transformative power to revolutionize these tasks or whether the distinct complexities of biomedical domain presents unique challenges. Following an extensive literature survey, we find that significant advances have been made in the field of text generation tasks, surpassing the previous state-of-the-art methods. For other applications, the advances have been modest. Overall, LLMs have not yet revolutionized biomedicine, but recent rapid progress indicates that such methods hold great potential to provide valuable means for accelerating discovery and improving health. We also find that the use of LLMs, like ChatGPT, in the fields of biomedicine and health entails various risks and challenges, including fabricated information in its generated responses, as well as legal and privacy concerns associated with sensitive patient data. We believe this survey can provide a comprehensive and timely overview to biomedical researchers and healthcare practitioners on the opportunities and challenges associated with using ChatGPT and other LLMs for transforming biomedicine and health.
    MeSH term(s) Humans ; Information Storage and Retrieval ; Language ; Privacy ; Research Personnel
    Language English
    Publishing date 2024-04-17
    Publishing country England
    Document type Journal Article
    ZDB-ID 2068142-2
    ISSN 1477-4054 ; 1467-5463
    ISSN (online) 1477-4054
    ISSN 1467-5463
    DOI 10.1093/bib/bbad493
    Database MEDical Literature Analysis and Retrieval System OnLINE

  4. Article ; Online: Opportunities and Challenges for ChatGPT and Large Language Models in Biomedicine and Health.

    Tian, Shubo / Jin, Qiao / Yeganova, Lana / Lai, Po-Ting / Zhu, Qingqing / Chen, Xiuying / Yang, Yifan / Chen, Qingyu / Kim, Won / Comeau, Donald C / Islamaj, Rezarta / Kapoor, Aadit / Gao, Xin / Lu, Zhiyong

    ArXiv

    2023  

    Abstract ChatGPT has drawn considerable attention from both the general public and domain experts with its remarkable text generation capabilities. This has subsequently led to the emergence of diverse applications in the field of biomedicine and health. In this work, we examine the diverse applications of large language models (LLMs), such as ChatGPT, in biomedicine and health. Specifically we explore the areas of biomedical information retrieval, question answering, medical text summarization, information extraction, and medical education, and investigate whether LLMs possess the transformative power to revolutionize these tasks or whether the distinct complexities of biomedical domain presents unique challenges. Following an extensive literature survey, we find that significant advances have been made in the field of text generation tasks, surpassing the previous state-of-the-art methods. For other applications, the advances have been modest. Overall, LLMs have not yet revolutionized biomedicine, but recent rapid progress indicates that such methods hold great potential to provide valuable means for accelerating discovery and improving health. We also find that the use of LLMs, like ChatGPT, in the fields of biomedicine and health entails various risks and challenges, including fabricated information in its generated responses, as well as legal and privacy concerns associated with sensitive patient data. We believe this survey can provide a comprehensive and timely overview to biomedical researchers and healthcare practitioners on the opportunities and challenges associated with using ChatGPT and other LLMs for transforming biomedicine and health.
    Language English
    Publishing date 2023-10-17
    Publishing country United States
    Document type Preprint
    ISSN 2331-8422
    ISSN (online) 2331-8422
    Database MEDical Literature Analysis and Retrieval System OnLINE

  5. Book ; Online: MedCPT

    Jin, Qiao / Kim, Won / Chen, Qingyu / Comeau, Donald C. / Yeganova, Lana / Wilbur, W. John / Lu, Zhiyong

    Contrastive Pre-trained Transformers with Large-scale PubMed Search Logs for Zero-shot Biomedical Information Retrieval

    2023  

    Abstract Information retrieval (IR) is essential in biomedical knowledge acquisition and clinical decision support. While recent progress has shown that language model encoders perform better semantic retrieval, training such models requires abundant query-article annotations that are difficult to obtain in biomedicine. As a result, most biomedical IR systems only conduct lexical matching. In response, we introduce MedCPT, a first-of-its-kind Contrastively Pre-trained Transformer model for zero-shot semantic IR in biomedicine. To train MedCPT, we collected an unprecedented scale of 255 million user click logs from PubMed. With such data, we use contrastive learning to train a pair of closely-integrated retriever and re-ranker. Experimental results show that MedCPT sets new state-of-the-art performance on six biomedical IR tasks, outperforming various baselines including much larger models such as GPT-3-sized cpt-text-XL. In addition, MedCPT also generates better biomedical article and sentence representations for semantic evaluations. As such, MedCPT can be readily applied to various real-world biomedical IR tasks.

    Comment: The MedCPT code and API are available at https://github.com/ncbi/MedCPT
    Keywords Computer Science - Information Retrieval ; Computer Science - Artificial Intelligence ; Computer Science - Computation and Language ; Quantitative Biology - Quantitative Methods
    Subject code 004
    Publishing date 2023-07-02
    Publishing country United States
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)

  6. Article ; Online: Better synonyms for enriching biomedical search.

    Yeganova, Lana / Kim, Sun / Chen, Qingyu / Balasanov, Grigory / Wilbur, W John / Lu, Zhiyong

    Journal of the American Medical Informatics Association : JAMIA

    2020  Volume 27, Issue 12, Page(s) 1894–1902

    Abstract Objective: In a biomedical literature search, the link between a query and a document is often not established, because they use different terms to refer to the same concept. Distributional word embeddings are frequently used for detecting related words by computing the cosine similarity between them. However, previous research has not established either the best embedding methods for detecting synonyms among related word pairs or how effective such methods may be.
    Materials and methods: In this study, we first create the BioSearchSyn set, a manually annotated set of synonyms, to assess and compare 3 widely used word-embedding methods (word2vec, fastText, and GloVe) in their ability to detect synonyms among related pairs of words. We demonstrate the shortcomings of the cosine similarity score between word embeddings for this task: the same scores have very different meanings for the different methods. To address the problem, we propose utilizing pool adjacent violators (PAV), an isotonic regression algorithm, to transform a cosine similarity into a probability of 2 words being synonyms.
    Results: Experimental results using the BioSearchSyn set as a gold standard reveal which embedding methods have the best performance in identifying synonym pairs. The BioSearchSyn set also allows converting cosine similarity scores into probabilities, which provides a uniform interpretation of the synonymy score over different methods.
    Conclusions: We introduced the BioSearchSyn corpus of 1000 term pairs, which allowed us to identify the best embedding method for detecting synonymy for biomedical search. Using the proposed method, we created PubTermVariants2.0: a large, automatically extracted set of synonym pairs that have augmented PubMed searches since the spring of 2019.
    MeSH term(s) Algorithms ; Biomedical Research ; Information Storage and Retrieval/methods ; Linguistics ; Probability ; PubMed ; Terminology as Topic
    Language English
    Publishing date 2020-10-15
    Publishing country England
    Document type Journal Article ; Research Support, N.I.H., Intramural
    ZDB-ID 1205156-1
    ISSN 1527-974X ; 1067-5027
    ISSN (online) 1527-974X
    ISSN 1067-5027
    DOI 10.1093/jamia/ocaa151
    Database MEDical Literature Analysis and Retrieval System OnLINE
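
    The abstract proposes pool adjacent violators (PAV), an isotonic regression algorithm, to turn a cosine similarity into a probability of synonymy. A minimal sketch of equal-weight PAV follows. It assumes the word pairs have been sorted by increasing cosine similarity and labeled 1 (synonym) or 0 (not a synonym); the labels shown are toy data, not values from BioSearchSyn.

```python
def pav(values):
    """Return the non-decreasing least-squares fit to values (equal weights)."""
    blocks = []  # each block is [sum of values, number of values]
    for v in values:
        blocks.append([float(v), 1])
        # Pool adjacent blocks while their means violate the ordering.
        while (
            len(blocks) > 1
            and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = []
    for s, n in blocks:
        fitted.extend([s / n] * n)
    return fitted


# Toy labels for pairs sorted by increasing cosine similarity.
labels = [0, 1, 0, 1, 1]
probabilities = pav(labels)
```

    Each fitted value is the fraction of synonyms in its pooled block, so the result reads as a probability that never decreases as cosine similarity grows.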

  7. Article: PDC - a probabilistic distributional clustering algorithm: a case study on suicide articles in PubMed.

    Islamaj, Rezarta / Yeganova, Lana / Kim, Won / Xie, Natalie / Wilbur, W John / Lu, Zhiyong

    AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science

    2020  Volume 2020, Page(s) 259–268

    Abstract The need to organize a large collection in a manner that facilitates human comprehension is crucial given the ever-increasing volumes of information. In this work, we present PDC (probabilistic distributional clustering), a novel algorithm that, given a document collection, computes disjoint term sets representing topics in the collection. The algorithm relies on probabilities of word co-occurrences to partition the set of terms appearing in the collection of documents into disjoint groups of related terms. In this work, we also present an environment to visualize the computed topics in the term space and retrieve the most related PubMed articles for each group of terms. We illustrate the algorithm by applying it to PubMed documents on the topic of suicide. Suicide is a major public health problem identified as the tenth leading cause of death in the US. In this application, our goal is to provide a global view of the mental health literature pertaining to the subject of suicide, and through this, to help create a rich environment of multifaceted data to guide health care researchers in their endeavor to better understand the breadth, depth and scope of the problem. We demonstrate the usefulness of the proposed algorithm by providing a web portal that allows mental health researchers to peruse the suicide-related literature in PubMed.
    Language English
    Publishing date 2020-05-30
    Publishing country United States
    Document type Journal Article
    ZDB-ID 2676378-3
    ISSN 2153-4063
    ISSN 2153-4063
    Database MEDical Literature Analysis and Retrieval System OnLINE

  8. Article ; Online: Topics in machine learning for biomedical literature analysis and text retrieval.

    Islamaj Doğan, Rezarta / Yeganova, Lana

    Journal of biomedical semantics

    2012  Volume 3 Suppl 3, Page(s) S1

    Language English
    Publishing date 2012-10-05
    Publishing country England
    Document type Journal Article
    ZDB-ID 2548651-2
    ISSN 2041-1480 ; 2041-1480
    ISSN (online) 2041-1480
    ISSN 2041-1480
    DOI 10.1186/2041-1480-3-S3-S1
    Database MEDical Literature Analysis and Retrieval System OnLINE

  9. Article ; Online: Evolving use of ancestry, ethnicity, and race in genetics research-A survey spanning seven decades.

    Byeon, Yen Ji Julia / Islamaj, Rezarta / Yeganova, Lana / Wilbur, W John / Lu, Zhiyong / Brody, Lawrence C / Bonham, Vence L

    American journal of human genetics

    2021  Volume 108, Issue 12, Page(s) 2215–2223

    Abstract To inform continuous and rigorous reflection about the description of human populations in genomics research, this study investigates the historical and contemporary use of the terms "ancestry," "ethnicity," "race," and other population labels in The American Journal of Human Genetics from 1949 to 2018. We characterize these terms' frequency of use and assess their odds of co-occurrence with a set of social and genetic topical terms. Throughout The Journal's 70-year history, "ancestry" and "ethnicity" have increased in use, appearing in 33% and 26% of articles in 2009-2018, while the use of "race" has decreased, occurring in 4% of articles in 2009-2018. Although its overall use has declined, the odds of "race" appearing in the presence of "ethnicity" has increased relative to the odds of occurring in its absence. Forms of population descriptors "Caucasian" and "Negro" have largely disappeared from The Journal (<1% of articles in 2009-2018). Conversely, the continental labels "African," "Asian," and "European" have increased in use and appear in 18%, 14%, and 42% of articles from 2009-2018, respectively. Decreasing uses of the terms "race," "Caucasian," and "Negro" are indicative of a transition away from the field's history of explicitly biological race science; at the same time, the increasing use of "ancestry," "ethnicity," and continental labels should serve to motivate ongoing reflection as the terminology used to describe genetic variation continues to evolve.
    MeSH term(s) Ethnicity ; Genetic Research/history ; History, 20th Century ; History, 21st Century ; Human Genetics/history ; Human Genetics/trends ; Humans ; Publishing/history ; Racial Groups
    Language English
    Publishing date 2021-12-02
    Publishing country United States
    Document type Historical Article ; Journal Article ; Research Support, N.I.H., Intramural
    ZDB-ID 219384-x
    ISSN 1537-6605 ; 0002-9297
    ISSN (online) 1537-6605
    ISSN 0002-9297
    DOI 10.1016/j.ajhg.2021.10.008
    Database MEDical Literature Analysis and Retrieval System OnLINE

  10. Article ; Online: Discovering themes in biomedical literature using a projection-based algorithm.

    Yeganova, Lana / Kim, Sun / Balasanov, Grigory / Wilbur, W John

    BMC bioinformatics

    2018  Volume 19, Issue 1, Page(s) 269

    Abstract Background: The need to organize any large document collection in a manner that facilitates human comprehension has become crucial with the increasing volume of information available. Two common approaches to provide a broad overview of the information space are document clustering and topic modeling. Clustering aims to group documents or terms into meaningful clusters. Topic modeling, on the other hand, focuses on finding coherent keywords for describing topics appearing in a set of documents. In addition, there have been efforts for clustering documents and finding keywords simultaneously.
    Results: We present an algorithm to analyze document collections that is based on a notion of a theme, defined as a dual representation based on a set of documents and key terms. In this work, a novel vector space mechanism is proposed for computing themes. Starting with a single document, the theme algorithm treats terms and documents as explicit components, and iteratively uses each representation to refine the other until the theme is detected. The method heavily relies on an optimization routine that we refer to as the projection algorithm which, under specific conditions, is guaranteed to converge to the first singular vector of a data matrix. We apply our algorithm to a collection of about sixty thousand PubMed
    Conclusions: This study presents a contribution on theoretical and algorithmic levels, as well as demonstrates the feasibility of the method for large scale applications. The evaluation of our system on benchmark datasets demonstrates that our method compares favorably with the current state-of-the-art methods in computing clusters of documents with coherent topic terms.
    MeSH term(s) Algorithms ; Cluster Analysis ; Databases, Genetic ; Humans ; Polymorphism, Single Nucleotide/genetics ; Publications
    Language English
    Publishing date 2018-07-16
    Publishing country England
    Document type Journal Article
    ZDB-ID 2041484-5
    ISSN 1471-2105 ; 1471-2105
    ISSN (online) 1471-2105
    ISSN 1471-2105
    DOI 10.1186/s12859-018-2240-0
    Database MEDical Literature Analysis and Retrieval System OnLINE
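
    The abstract states that the projection algorithm converges, under specific conditions, to the first singular vector of a data matrix, but it does not describe the routine itself. For orientation only, the sketch below uses plain power iteration, a different and standard technique with the same convergence target. It assumes the starting vector is not orthogonal to the leading singular vector and that the largest singular value is strictly larger than the next one. The function name and the toy matrix are made up.

```python
import math


def first_right_singular_vector(matrix, iterations=200):
    """Approximate the leading right singular vector of a matrix.

    Uses power iteration on A^T A; this is a stand-in technique, not the
    projection algorithm from the paper.
    """
    n_cols = len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        # u = A v
        u = [sum(a * x for a, x in zip(row, v)) for row in matrix]
        # w = A^T u
        w = [sum(row[j] * u_i for row, u_i in zip(matrix, u)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return v  # A v vanished; keep the current vector.
        v = [x / norm for x in w]
    return v


# Toy term-document matrix with singular values 3 and 1.
vec = first_right_singular_vector([[3.0, 0.0], [0.0, 1.0]])
```

    For this diagonal toy matrix the iteration settles on the first coordinate axis, the direction associated with the larger singular value.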
