LIVIVO - The Search Portal for Life Sciences

Search results

Results 1 - 10 of 246

  1. Article: GenePT: A Simple But Effective Foundation Model for Genes and Cells Built From ChatGPT.

    Chen, Yiqun / Zou, James

    bioRxiv : the preprint server for biology

    2024  

    Abstract There has been significant recent progress in leveraging large-scale gene expression data to develop foundation models for single-cell biology. Models such as Geneformer and scGPT implicitly learn gene and cellular functions from the gene expression profiles of millions of cells, which requires extensive data curation and resource-intensive training. Here we explore a much simpler alternative by leveraging ChatGPT embeddings of genes based on literature. Our proposal, GenePT, uses NCBI text descriptions of individual genes with GPT-3.5 to generate gene embeddings. From there, GenePT generates single-cell embeddings in two ways: (i) by averaging the gene embeddings, weighted by each gene's expression level; or (ii) by creating a sentence embedding for each cell, using gene names ordered by the expression level. Without the need for dataset curation and additional pretraining, GenePT is efficient and easy to use. On many downstream tasks used to evaluate recent single-cell foundation models - e.g., classifying gene properties and cell types - GenePT achieves comparable, and often better, performance than Geneformer and other models. GenePT demonstrates that large language model embedding of literature is a simple and effective path for biological foundation models.
    Language English
    Publishing date 2024-03-05
    Publishing country United States
    Document type Preprint
    DOI 10.1101/2023.10.16.562533
    Database MEDical Literature Analysis and Retrieval System OnLINE
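
The first cell-embedding route named in the abstract above (averaging gene embeddings, weighted by each gene's expression level) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `cell_embedding`, the dictionary inputs, and the choice to skip genes without an embedding are all assumptions made for the example.

```python
def cell_embedding(expression, gene_embeddings):
    """Expression-weighted average of gene embeddings for one cell.

    expression:      dict gene name -> non-negative expression level (hypothetical input format)
    gene_embeddings: dict gene name -> list of floats, all of the same length
    Genes that have no embedding or zero expression are skipped (an assumption).
    """
    # Total weight over the genes that actually contribute to the average.
    total = sum(w for g, w in expression.items() if g in gene_embeddings and w > 0)
    if total == 0:
        raise ValueError("no expressed gene has an embedding")

    dim = len(next(iter(gene_embeddings.values())))
    out = [0.0] * dim
    for gene, w in expression.items():
        vec = gene_embeddings.get(gene)
        if vec is None or w <= 0:
            continue
        # Each gene contributes in proportion to its share of total expression.
        for i, x in enumerate(vec):
            out[i] += (w / total) * x
    return out
```

In practice the gene vectors would come from a text-embedding model applied to gene descriptions; here they are plain lists so the sketch stays self-contained.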

  2. Article ; Online: Principled and interpretable alignability testing and integration of single-cell data.

    Ma, Rong / Sun, Eric D / Donoho, David / Zou, James

    Proceedings of the National Academy of Sciences of the United States of America

    2024  Volume 121, Issue 10, Page(s) e2313719121

    Abstract Single-cell data integration can provide a comprehensive molecular view of cells, and many algorithms have been developed to remove unwanted technical or biological variations and integrate heterogeneous single-cell datasets. Despite their wide usage, existing methods suffer from several fundamental limitations. In particular, we lack a rigorous statistical test for whether two high-dimensional single-cell datasets are alignable (and therefore should even be aligned). Moreover, popular methods can substantially distort the data during alignment, making the aligned data and downstream analysis difficult to interpret. To overcome these limitations, we present a spectral manifold alignment and inference (SMAI) framework, which enables principled and interpretable alignability testing and structure-preserving integration of single-cell data with the same type of features. SMAI provides a statistical test to robustly assess the alignability between datasets to avoid misleading inference and is justified by high-dimensional statistical theory. On a diverse range of real and simulated benchmark datasets, it outperforms commonly used alignment methods. Moreover, we show that SMAI improves various downstream analyses such as identification of differentially expressed genes and imputation of single-cell spatial transcriptomics, providing further biological insights. SMAI's interpretability also enables quantification and a deeper understanding of the sources of technical confounders in single-cell data.
    MeSH term(s) Algorithms ; Gene Expression Profiling ; Gene Expression ; Single-Cell Analysis
    Language English
    Publishing date 2024-02-28
    Publishing country United States
    Document type Journal Article
    ZDB-ID 209104-5
    ISSN 1091-6490 ; 0027-8424
    ISSN (online) 1091-6490
    ISSN 0027-8424
    DOI 10.1073/pnas.2313719121
    Database MEDical Literature Analysis and Retrieval System OnLINE

  3. Article ; Online: A spectral method for assessing and combining multiple data visualizations.

    Ma, Rong / Sun, Eric D / Zou, James

    Nature communications

    2023  Volume 14, Issue 1, Page(s) 780

    Abstract Dimension reduction is an indispensable part of modern data science, and many algorithms have been developed. However, different algorithms have their own strengths and weaknesses, making it important to evaluate their relative performance, and to leverage and combine their individual strengths. This paper proposes a spectral method for assessing and combining multiple visualizations of a given dataset produced by diverse algorithms. The proposed method provides a quantitative measure - the visualization eigenscore - of the relative performance of the visualizations for preserving the structure around each data point. It also generates a consensus visualization, having improved quality over individual visualizations in capturing the underlying structure. Our approach is flexible and works as a wrapper around any visualizations. We analyze multiple real-world datasets to demonstrate the effectiveness of the method. We also provide theoretical justifications based on a general statistical framework, yielding several fundamental principles along with practical guidance.
    Language English
    Publishing date 2023-02-11
    Publishing country England
    Document type Journal Article
    ZDB-ID 2553671-0
    ISSN 2041-1723
    ISSN (online) 2041-1723
    DOI 10.1038/s41467-023-36492-2
    Database MEDical Literature Analysis and Retrieval System OnLINE

  4. Article ; Online: Machine learning modeling of RNA structures: methods, challenges and future perspectives.

    Wu, Kevin E / Zou, James Y / Chang, Howard

    Briefings in bioinformatics

    2023  Volume 24, Issue 4

    Abstract The three-dimensional structure of RNA molecules plays a critical role in a wide range of cellular processes encompassing functions from riboswitches to epigenetic regulation. These RNA structures are incredibly dynamic and can indeed be described aptly as an ensemble of structures that shifts in distribution depending on different cellular conditions. Thus, the computational prediction of RNA structure poses a unique challenge, even as computational protein folding has seen great advances. In this review, we focus on a variety of machine learning-based methods that have been developed to predict RNA molecules' secondary structure, as well as more complex tertiary structures. We survey commonly used modeling strategies, and how many are inspired by or incorporate thermodynamic principles. We discuss the shortcomings that various design decisions entail and propose future directions that could build off these methods to yield more robust, accurate RNA structure predictions.
    MeSH term(s) RNA/metabolism ; Epigenesis, Genetic ; Machine Learning ; Protein Structure, Secondary ; Computational Biology/methods
    Chemical Substances RNA (63231-63-0)
    Language English
    Publishing date 2023-06-06
    Publishing country England
    Document type Review ; Journal Article ; Research Support, Non-U.S. Gov't
    ZDB-ID 2068142-2
    ISSN 1477-4054 ; 1467-5463
    ISSN (online) 1477-4054
    ISSN 1467-5463
    DOI 10.1093/bib/bbad210
    Database MEDical Literature Analysis and Retrieval System OnLINE

  5. Article ; Online: Daptomycin-associated pulmonary toxicity sans eosinophilia in a hematopoietic cell transplant recipient with profound leukopenia.

    Zou, James / Rivera Sarti, Jose Eduardo / Strasfeld, Lynne

    Transplant infectious disease : an official journal of the Transplantation Society

    2023  Volume 25, Issue 3, Page(s) e14029

    MeSH term(s) Humans ; Daptomycin/adverse effects ; Transplant Recipients ; Hematopoietic Stem Cell Transplantation/adverse effects ; Pulmonary Eosinophilia ; Leukopenia/chemically induced
    Chemical Substances Daptomycin (NWQ5N31VKK)
    Language English
    Publishing date 2023-02-14
    Publishing country Denmark
    Document type Journal Article
    ZDB-ID 1476094-0
    ISSN 1399-3062 ; 1398-2273
    ISSN (online) 1399-3062
    ISSN 1398-2273
    DOI 10.1111/tid.14029
    Database MEDical Literature Analysis and Retrieval System OnLINE

  6. Book ; Online: Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value

    Kwon, Yongchan / Zou, James

    2023  

    Abstract Data valuation is a powerful framework for providing statistical insights into which data are beneficial or detrimental to model training. Many Shapley-based data valuation methods have shown promising results in various downstream tasks; however, they are well known to be computationally challenging because they require training a large number of models. As a result, applying them to large datasets has been recognized as infeasible. To address this issue, we propose Data-OOB, a new data valuation method for a bagging model that utilizes the out-of-bag estimate. The proposed method is computationally efficient and can scale to millions of data points by reusing trained weak learners. Specifically, Data-OOB takes less than 2.25 hours on a single CPU processor when there are 10^6 samples to evaluate and the input dimension is 100. Furthermore, Data-OOB has solid theoretical interpretations in that it identifies the same important data point as the infinitesimal jackknife influence function when two different points are compared. We conduct comprehensive experiments using 12 classification datasets, each with thousands of samples. We demonstrate that the proposed method significantly outperforms existing state-of-the-art data valuation methods in identifying mislabeled data and finding a set of helpful (or harmful) data points, highlighting the potential for applying data values in real-world applications.

    Comment: 18 pages, to be published at ICML 2023
    Keywords Computer Science - Machine Learning ; Statistics - Machine Learning
    Subject code 006
    Publishing date 2023-04-16
    Publishing country United States
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)
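
The out-of-bag idea in the abstract above can be illustrated with a small sketch: each data point is scored only by the weak learners whose bootstrap sample did not contain it, so already-trained learners are reused rather than retrained. This is an assumption-laden reading of the abstract, not the paper's code; the function name `oob_values`, the `(in_bag, predict)` pair format, and the 0/1 correctness score are all hypothetical choices.

```python
def oob_values(X, y, learners):
    """Score each data point by its average out-of-bag correctness.

    X, y:     sequences of inputs and labels
    learners: list of (in_bag, predict) pairs, where in_bag is the set of
              training indices the learner saw and predict maps x -> label
              (hypothetical interface for already-trained weak learners)
    Returns one value per point; NaN if the point was in every bag.
    """
    values = []
    for i, (x, label) in enumerate(zip(X, y)):
        # Only learners that never saw point i are allowed to judge it.
        scores = [
            1.0 if predict(x) == label else 0.0
            for in_bag, predict in learners
            if i not in in_bag
        ]
        values.append(sum(scores) / len(scores) if scores else float("nan"))
    return values
```

Under this reading, a point whose label disagrees with what out-of-bag learners predict receives a low value, which is one way a mislabeled example could surface.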

  7. Book ; Online: ArtWhisperer: A Dataset for Characterizing Human-AI Interactions in Artistic Creations

    Vodrahalli, Kailas / Zou, James

    2023  

    Abstract As generative AI becomes more prevalent, it is important to study how human users interact with such models. In this work, we investigate how people use text-to-image models to generate desired target images. To study this interaction, we created ArtWhisperer, an online game where users are given a target image and are tasked with iteratively finding a prompt that creates a similar-looking image as the target. Through this game, we recorded over 50,000 human-AI interactions; each interaction corresponds to one text prompt created by a user and the corresponding generated image. The majority of these are repeated interactions where a user iterates to find the best prompt for their target image, making this a unique sequential dataset for studying human-AI collaborations. In an initial analysis of this dataset, we identify several characteristics of prompt interactions and user strategies. People submit diverse prompts and are able to discover a variety of text descriptions that generate similar images. Interestingly, prompt diversity does not decrease as users find better prompts. We further propose a new metric to quantify the steerability of AI using our dataset. We define steerability as the expected number of interactions required to adequately complete a task. We estimate this value by fitting a Markov chain for each target task and calculating the expected time to reach an adequate score in the Markov chain. We quantify and compare AI steerability across different types of target images and two different models, finding that images of cities and natural world images are more steerable than artistic and fantasy images. These findings provide insights into human-AI interaction behavior, present a concrete method of assessing AI steerability, and demonstrate the general utility of the ArtWhisperer dataset.

    Comment: 26 pages, 20 figures
    Keywords Computer Science - Artificial Intelligence ; Computer Science - Computer Vision and Pattern Recognition ; Computer Science - Human-Computer Interaction ; Computer Science - Machine Learning
    Subject code 006
    Publishing date 2023-06-13
    Publishing country United States
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)
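
The steerability measure described in the abstract above is the expected number of steps a Markov chain needs to reach an "adequate score" state. Given a transition matrix, that quantity is a standard expected hitting time, obtained by solving the linear system (I - Q) t = 1 over the non-target states. The sketch below shows only that calculation; the state layout, the matrix values, and the function name `expected_steps_to_target` are hypothetical, and fitting the chain from interaction data is not covered.

```python
def expected_steps_to_target(P, target):
    """Expected number of steps to first reach `target` from each state.

    P:      row-stochastic transition matrix as a list of lists
    target: index of the state treated as "adequate score reached"
    Solves (I - Q) t = 1 by Gauss-Jordan elimination, where Q is P
    restricted to the non-target states.
    """
    n = len(P)
    others = [s for s in range(n) if s != target]
    m = len(others)
    # Augmented matrix [I - Q | 1].
    A = [
        [(1.0 if i == j else 0.0) - P[i][j] for j in others] + [1.0]
        for i in others
    ]
    for col in range(m):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("target is not reachable from every state")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(m):
            if r != col:
                f = A[r][col] / A[col][col]
                for c in range(col, m + 1):
                    A[r][c] -= f * A[col][c]
    times = {s: A[k][m] / A[k][k] for k, s in enumerate(others)}
    times[target] = 0.0
    return times
```

For a toy three-state chain in which each non-target state either stays put or advances with probability 0.5, the function returns 4 steps from the first state and 2 from the second; a lower value would correspond to a more steerable task under the abstract's definition.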

  8. Book ; Online: TWIGMA: A dataset of AI-Generated Images with Metadata From Twitter

    Chen, Yiqun / Zou, James

    2023  

    Abstract Recent progress in generative artificial intelligence (gen-AI) has enabled the generation of photo-realistic and artistically-inspiring photos at a single click, catering to millions of users online. To explore how people use gen-AI models such as DALLE and StableDiffusion, it is critical to understand the themes, contents, and variations present in the AI-generated photos. In this work, we introduce TWIGMA (TWItter Generative-ai images with MetadatA), a comprehensive dataset encompassing over 800,000 gen-AI images collected from Jan 2021 to March 2023 on Twitter, with associated metadata (e.g., tweet text, creation date, number of likes), available at https://zenodo.org/records/8031785. Through a comparative analysis of TWIGMA with natural images and human artwork, we find that gen-AI images possess distinctive characteristics and exhibit, on average, lower variability when compared to their non-gen-AI counterparts. Additionally, we find that the similarity between a gen-AI image and natural images is inversely correlated with the number of likes. Finally, we observe a longitudinal shift in the themes of AI-generated images on Twitter, with users increasingly sharing artistically sophisticated content such as intricate human portraits, whereas their interest in simple subjects such as natural scenes and animals has decreased. Our findings underscore the significance of TWIGMA as a unique data resource for studying AI-generated images.
    Keywords Statistics - Applications ; Computer Science - Computers and Society
    Subject code 004
    Publishing date 2023-06-14
    Publishing country United States
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)

  9. Book ; Online: New Evaluation Metrics Capture Quality Degradation due to LLM Watermarking

    Singh, Karanpartap / Zou, James

    2023  

    Abstract With the increasing use of large-language models (LLMs) like ChatGPT, watermarking has emerged as a promising approach for tracing machine-generated content. However, research on LLM watermarking often relies on simple perplexity or diversity-based measures to assess the quality of watermarked text, which can mask important limitations in watermarking. Here we introduce two new easy-to-use methods for evaluating watermarking algorithms for LLMs: 1) evaluation by LLM-judger with specific guidelines; and 2) binary classification on text embeddings to distinguish between watermarked and unwatermarked text. We apply these methods to characterize the effectiveness of current watermarking techniques. Our experiments, conducted across various datasets, reveal that current watermarking methods are detectable by even simple classifiers, challenging the notion of watermarking subtlety. We also found, through the LLM judger, that watermarking impacts text quality, especially in degrading the coherence and depth of the response. Our findings underscore the trade-off between watermark robustness and text quality and highlight the importance of having more informative metrics to assess watermarking quality.
    Keywords Computer Science - Computation and Language
    Subject code 303
    Publishing date 2023-12-04
    Publishing country United States
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)

  10. Article ; Online: Ensuring that biomedical AI benefits diverse populations.

    Zou, James / Schiebinger, Londa

    EBioMedicine

    2021  Volume 67, Page(s) 103358

    Abstract Artificial Intelligence (AI) can potentially impact many aspects of human health, from basic research discovery to individual health assessment. It is critical that these advances in technology broadly benefit diverse populations from around the world. This can be challenging because AI algorithms are often developed on non-representative samples and evaluated based on narrow metrics. Here we outline key challenges to biomedical AI in outcome design, data collection and technology evaluation, and use examples from precision health to illustrate how bias and health disparity may arise in each stage. We then suggest both short term approaches-more diverse data collection and AI monitoring-and longer term structural changes in funding, publications, and education to address these challenges.
    MeSH term(s) Artificial Intelligence ; Health Policy ; Health Status Disparities ; Humans ; Medical Informatics/methods ; Medical Informatics/trends
    Language English
    Publishing date 2021-05-04
    Publishing country Netherlands
    Document type Journal Article ; Review
    ZDB-ID 2851331-9
    ISSN 2352-3964
    ISSN (online) 2352-3964
    DOI 10.1016/j.ebiom.2021.103358
    Database MEDical Literature Analysis and Retrieval System OnLINE
