LIVIVO - The Search Portal for Life Sciences

Search results

Results 1-10 of 32

  1. Article ; Online: Assessment of differentially private synthetic data for utility and fairness in end-to-end machine learning pipelines for tabular data.

    Pereira, Mayana / Kshirsagar, Meghana / Mukherjee, Sumit / Dodhia, Rahul / Lavista Ferres, Juan / de Sousa, Rafael

    PloS one

    2024  Volume 19, Issue 2, Page(s) e0297271

    Abstract Differentially private (DP) synthetic datasets are a solution for sharing data while preserving the privacy of individual data providers. Understanding the effects of utilizing DP synthetic data in end-to-end machine learning pipelines impacts areas such as health care and humanitarian action, where data is scarce and regulated by restrictive privacy laws. In this work, we investigate the extent to which synthetic data can replace real, tabular data in machine learning pipelines and identify the most effective synthetic data generation techniques for training and evaluating machine learning models. We systematically investigate the impacts of differentially private synthetic data on downstream classification tasks from the point of view of utility as well as fairness. Our analysis is comprehensive and includes representatives of the two main types of synthetic data generation algorithms: marginal-based and GAN-based. To the best of our knowledge, our work is the first that: (i) proposes a training and evaluation framework that does not assume that real data is available for testing the utility and fairness of machine learning models trained on synthetic data; (ii) presents the most extensive analysis of synthetic dataset generation algorithms in terms of utility and fairness when used for training machine learning models; and (iii) encompasses several different definitions of fairness. Our findings demonstrate that marginal-based synthetic data generators surpass GAN-based ones regarding model training utility for tabular data. Indeed, we show that models trained using data generated by marginal-based algorithms can exhibit similar utility to models trained using real data. Our analysis also reveals that the marginal-based synthetic data generated using AIM and MWEM PGM algorithms can train models that simultaneously achieve utility and fairness characteristics close to those obtained by models trained with real data.
    MeSH term(s) Algorithms ; Health Facilities ; Interior Design and Furnishings ; Knowledge ; Machine Learning
    Language English
    Publishing date 2024-02-05
    Publishing country United States
    Document type Journal Article
    ZDB-ID 2267670-3
    ISSN 1932-6203
    ISSN (online) 1932-6203
    DOI 10.1371/journal.pone.0297271
    Database MEDical Literature Analysis and Retrieval System OnLINE

  2. Article ; Online: Dynamic Grammar Pruning for Program Size Reduction in Symbolic Regression.

    Ali, Muhammad Sarmad / Kshirsagar, Meghana / Naredo, Enrique / Ryan, Conor

    SN computer science

    2023  Volume 4, Issue 4, Page(s) 402

    Abstract Grammar is a key input in grammar-based genetic programming. Grammar design not only influences performance, but also program size. However, grammar design and the choice of productions often require expert input as no automatic approach exists. This research work discusses our approach to automatically reduce a bloated grammar. By utilizing a simple Production Ranking mechanism, we identify productions which are less useful and dynamically prune those to channel evolutionary search towards better (smaller) solutions. Our objective in this work was program size reduction without compromising generalization performance. We tested our approach on 13 standard symbolic regression datasets with Grammatical Evolution. Using a grammar embodying a well-defined function set as a baseline, we compare effective genome length and test performance with our approach. Dynamic grammar pruning achieved significantly better genome lengths for all datasets, while significantly improving generalization performance on three datasets, although it worsened in five datasets. When we utilized linear scaling during the production ranking stages (the first 20 generations) the results dramatically improved. Not only were the programs smaller in all datasets, but generalization scores were also significantly better than the baseline in 6 out of 13 datasets, and comparable in the rest. When the baseline was also linearly scaled as well, the program size was still smaller with the Production Ranking approach, while generalization scores dropped in only three datasets without any significant compromise in the rest.
    Language English
    Publishing date 2023-05-17
    Publishing country Singapore
    Document type Journal Article
    ISSN 2661-8907
    ISSN (online) 2661-8907
    DOI 10.1007/s42979-023-01840-y
    Database MEDical Literature Analysis and Retrieval System OnLINE

  3. Article ; Online: BindVAE: Dirichlet variational autoencoders for de novo motif discovery from accessible chromatin.

    Kshirsagar, Meghana / Yuan, Han / Ferres, Juan Lavista / Leslie, Christina

    Genome biology

    2022  Volume 23, Issue 1, Page(s) 174

    Abstract We present a novel unsupervised deep learning approach called BindVAE, based on Dirichlet variational autoencoders, for jointly decoding multiple TF binding signals from open chromatin regions. BindVAE can disentangle an input DNA sequence into distinct latent factors that encode cell-type specific in vivo binding signals for individual TFs, composite patterns for TFs involved in cooperative binding, and genomic context surrounding the binding sites. On the task of retrieving the motifs of expressed TFs in a given cell type, BindVAE is competitive with existing motif discovery approaches.
    MeSH term(s) Binding Sites/genetics ; Chromatin ; Chromatin Immunoprecipitation ; Nucleotide Motifs ; Protein Binding/genetics ; Transcription Factors/metabolism
    Chemical Substances Chromatin ; Transcription Factors
    Language English
    Publishing date 2022-08-15
    Publishing country England
    Document type Journal Article ; Research Support, Non-U.S. Gov't ; Research Support, N.I.H., Extramural
    ZDB-ID 2040529-7
    ISSN 1474-760X
    ISSN (online) 1474-760X
    DOI 10.1186/s13059-022-02723-w
    Database MEDical Literature Analysis and Retrieval System OnLINE

  4. Article ; Online: Design of a cryptographically secure pseudo random number generator with grammatical evolution.

    Ryan, Conor / Kshirsagar, Meghana / Vaidya, Gauri / Cunningham, Andrew / Sivaraman, R

    Scientific reports

    2022  Volume 12, Issue 1, Page(s) 8602

    Abstract This work investigates the potential for using Grammatical Evolution (GE) to generate an initial seed for the construction of a pseudo-random number generator (PRNG) and cryptographically secure (CS) PRNG. We demonstrate the suitability of GE as an entropy source and show that the initial seeds exhibit an average entropy value of 7.940560934 for 8-bit entropy, which is close to the ideal value of 8. We then construct two random number generators, GE-PRNG and GE-CSPRNG, both of which employ these initial seeds. We use Monte Carlo simulations to establish the efficacy of the GE-PRNG using an experimental setup designed to estimate the value for pi, in which 100,000,000 random numbers were generated by our system. This returned the value of pi of 3.146564000, which is precise up to six decimal digits for the actual value of pi. We propose a new approach called control_flow_incrementor to generate cryptographically secure random numbers. The random numbers generated with CSPRNG meet the prescribed National Institute of Standards and Technology SP800-22 and the Diehard statistical test requirements. We also present a computational performance analysis of GE-CSPRNG demonstrating its potential to be used in industrial applications.
    MeSH term(s) Monte Carlo Method
    Language English
    Publishing date 2022-05-21
    Publishing country England
    Document type Journal Article ; Research Support, Non-U.S. Gov't
    ZDB-ID 2615211-3
    ISSN 2045-2322
    ISSN (online) 2045-2322
    DOI 10.1038/s41598-022-11613-x
    Database MEDical Literature Analysis and Retrieval System OnLINE
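
    Note: the two checks named in the abstract above (8-bit entropy of the seeds and a Monte Carlo estimate of pi) can be sketched as follows. This is a minimal illustration only, not the authors' GE-PRNG: it assumes Python's standard `random` module as a stand-in generator, and the function names `byte_entropy` and `estimate_pi` are hypothetical.

```python
import math
import random
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte; the ideal value for uniform bytes is 8."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def estimate_pi(n_samples: int, rng: random.Random) -> float:
    """Monte Carlo estimate of pi: 4 x the fraction of uniform points in the
    unit square that land inside the quarter circle of radius 1."""
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples


if __name__ == "__main__":
    rng = random.Random(12345)  # stand-in generator, NOT the GE-PRNG from the paper
    sample = bytes(rng.getrandbits(8) for _ in range(100_000))
    print("entropy (bits/byte):", byte_entropy(sample))
    print("pi estimate:", estimate_pi(1_000_000, rng))
```

    The error of such a pi estimate shrinks roughly with the square root of the sample count, so it is a coarse sanity check on uniformity rather than a test of cryptographic strength.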

  5. Article: Democratizing Protein Language Models with Parameter-Efficient Fine-Tuning.

    Sledzieski, Samuel / Kshirsagar, Meghana / Baek, Minkyung / Berger, Bonnie / Dodhia, Rahul / Ferres, Juan Lavista

    bioRxiv : the preprint server for biology

    2023  

    Abstract Proteomics has been revolutionized by large pre-trained protein language models, which learn unsupervised representations from large corpora of sequences. The parameters of these models are then fine-tuned in a supervised setting to tailor the model to a specific downstream task. However, as model size increases, the computational and memory footprint of fine-tuning becomes a barrier for many research groups. In the field of natural language processing, which has seen a similar explosion in the size of models, these challenges have been addressed by methods for parameter-efficient fine-tuning (PEFT). In this work, we newly bring parameter-efficient fine-tuning methods to proteomics. Using the parameter-efficient method LoRA, we train new models for two important proteomic tasks: predicting protein-protein interactions (PPI) and predicting the symmetry of homooligomers. We show that for homooligomer symmetry prediction, these approaches achieve performance competitive with traditional fine-tuning while requiring reduced memory and using three orders of magnitude fewer parameters. On the PPI prediction task, we surprisingly find that PEFT models actually outperform traditional fine-tuning while using two orders of magnitude fewer parameters. Here, we go even further to show that freezing the parameters of the language model and training only a classification head also outperforms fine-tuning, using five orders of magnitude fewer parameters, and that both of these models outperform state-of-the-art PPI prediction methods with substantially reduced compute. We also demonstrate that PEFT is robust to variations in training hyper-parameters, and elucidate where best practices for PEFT in proteomics differ from in natural language processing. Thus, we provide a blueprint to democratize the power of protein language model tuning to groups which have limited computational resources.
    Language English
    Publishing date 2023-11-10
    Publishing country United States
    Document type Preprint
    DOI 10.1101/2023.11.09.566187
    Database MEDical Literature Analysis and Retrieval System OnLINE
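
    Note: the parameter savings the abstract above attributes to LoRA come from replacing a full weight update with a low-rank product. A rough sketch of that arithmetic follows, assuming a single dense layer; the layer width and rank are illustrative values, not figures from the paper, and the function names are hypothetical.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters under LoRA: the update is B @ A with
    B of shape (d_out, rank) and A of shape (rank, d_in)."""
    return rank * (d_out + d_in)


if __name__ == "__main__":
    d, r = 1280, 8  # illustrative layer width and LoRA rank (assumptions)
    full = full_finetune_params(d, d)
    lora = lora_trainable_params(d, d, r)
    print(f"full: {full}, LoRA: {lora}, reduction: {full / lora:.0f}x")
```

    The per-layer reduction grows as the rank shrinks relative to the layer width; adapting only a subset of layers reduces the trainable count further.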

  6. Article ; Online: Design of a cryptographically secure pseudo random number generator with grammatical evolution

    Conor Ryan / Meghana Kshirsagar / Gauri Vaidya / Andrew Cunningham / R. Sivaraman

    Scientific Reports, Vol 12, Iss 1, Pp 1-10

    2022

    Abstract This work investigates the potential for using Grammatical Evolution (GE) to generate an initial seed for the construction of a pseudo-random number generator (PRNG) and cryptographically secure (CS) PRNG. We demonstrate the suitability of GE as an entropy source and show that the initial seeds exhibit an average entropy value of 7.940560934 for 8-bit entropy, which is close to the ideal value of 8. We then construct two random number generators, GE-PRNG and GE-CSPRNG, both of which employ these initial seeds. We use Monte Carlo simulations to establish the efficacy of the GE-PRNG using an experimental setup designed to estimate the value for pi, in which 100,000,000 random numbers were generated by our system. This returned the value of pi of 3.146564000, which is precise up to six decimal digits for the actual value of pi. We propose a new approach called control_flow_incrementor to generate cryptographically secure random numbers. The random numbers generated with CSPRNG meet the prescribed National Institute of Standards and Technology SP800-22 and the Diehard statistical test requirements. We also present a computational performance analysis of GE-CSPRNG demonstrating its potential to be used in industrial applications.
    Keywords Medicine ; R ; Science ; Q
    Language English
    Publishing date 2022-05-01
    Publisher Nature Portfolio
    Document type Article ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)

  7. Article: BindVAE: Dirichlet variational autoencoders for de novo motif discovery from accessible chromatin

    Kshirsagar, Meghana / Yuan, Han / Ferres, Juan Lavista / Leslie, Christina

    Genome biology. 2022 Dec., v. 23, no. 1

    2022  

    Abstract We present a novel unsupervised deep learning approach called BindVAE, based on Dirichlet variational autoencoders, for jointly decoding multiple TF binding signals from open chromatin regions. BindVAE can disentangle an input DNA sequence into distinct latent factors that encode cell-type specific in vivo binding signals for individual TFs, composite patterns for TFs involved in cooperative binding, and genomic context surrounding the binding sites. On the task of retrieving the motifs of expressed TFs in a given cell type, BindVAE is competitive with existing motif discovery approaches.
    Keywords chromatin ; genome ; genomics ; nucleotide sequences
    Language English
    Dates of publication 2022-12
    Size p. 174.
    Publishing place BioMed Central
    Document type Article
    ZDB-ID 2040529-7
    ISSN 1474-760X
    DOI 10.1186/s13059-022-02723-w
    Database NAL-Catalogue (AGRICOLA)

  8. Article ; Online: Predicting locations of cryptic pockets from single protein structures using the PocketMiner graph neural network.

    Meller, Artur / Ward, Michael / Borowsky, Jonathan / Kshirsagar, Meghana / Lotthammer, Jeffrey M / Oviedo, Felipe / Ferres, Juan Lavista / Bowman, Gregory R

    Nature communications

    2023  Volume 14, Issue 1, Page(s) 1177

    Abstract Cryptic pockets expand the scope of drug discovery by enabling targeting of proteins currently considered undruggable because they lack pockets in their ground state structures. However, identifying cryptic pockets is labor-intensive and slow. The ability to accurately and rapidly predict if and where cryptic pockets are likely to form from a structure would greatly accelerate the search for druggable pockets. Here, we present PocketMiner, a graph neural network trained to predict where pockets are likely to open in molecular dynamics simulations. Applying PocketMiner to single structures from a newly curated dataset of 39 experimentally confirmed cryptic pockets demonstrates that it accurately identifies cryptic pockets (ROC-AUC: 0.87) >1,000-fold faster than existing methods. We apply PocketMiner across the human proteome and show that predicted pockets open in simulations, suggesting that over half of proteins thought to lack pockets based on available structures likely contain cryptic pockets, vastly expanding the potentially druggable proteome.
    MeSH term(s) Humans ; Pregnancy ; Female ; Proteome ; Drug Discovery ; Labor, Obstetric ; Molecular Dynamics Simulation ; Neural Networks, Computer
    Chemical Substances Proteome
    Language English
    Publishing date 2023-03-01
    Publishing country England
    Document type Journal Article ; Research Support, Non-U.S. Gov't ; Research Support, N.I.H., Extramural ; Research Support, U.S. Gov't, Non-P.H.S.
    ZDB-ID 2553671-0
    ISSN 2041-1723
    ISSN (online) 2041-1723
    DOI 10.1038/s41467-023-36699-3
    Database MEDical Literature Analysis and Retrieval System OnLINE

  9. Article ; Online: An epigenetic barrier sets the timing of human neuronal maturation.

    Ciceri, Gabriele / Baggiolini, Arianna / Cho, Hyein S / Kshirsagar, Meghana / Benito-Kwiecinski, Silvia / Walsh, Ryan M / Aromolaran, Kelly A / Gonzalez-Hernandez, Alberto J / Munguba, Hermany / Koo, So Yeon / Xu, Nan / Sevilla, Kaylin J / Goldstein, Peter A / Levitz, Joshua / Leslie, Christina S / Koche, Richard P / Studer, Lorenz

    Nature

    2024  Volume 626, Issue 8000, Page(s) 881–890

    Abstract The pace of human brain development is highly protracted compared with most other species
    MeSH term(s) Adult ; Animals ; Humans ; Mice ; Epigenesis, Genetic ; Gene Expression Regulation, Developmental ; Histocompatibility Antigens/metabolism ; Histone-Lysine N-Methyltransferase/antagonists & inhibitors ; Histone-Lysine N-Methyltransferase/metabolism ; Human Embryonic Stem Cells/cytology ; Human Embryonic Stem Cells/metabolism ; Neural Stem Cells/cytology ; Neural Stem Cells/metabolism ; Neurogenesis/genetics ; Neurons/cytology ; Neurons/metabolism ; Time Factors ; Transcription, Genetic
    Chemical Substances DOT1L protein, human (EC 2.1.1.-) ; EHMT1 protein, human (EC 2.1.1.-) ; EHMT2 protein, human (EC 2.1.1.43) ; EZH2 protein, human (EC 2.1.1.43) ; Histocompatibility Antigens ; Histone-Lysine N-Methyltransferase (EC 2.1.1.43)
    Language English
    Publishing date 2024-01-31
    Publishing country England
    Document type Journal Article
    ZDB-ID 120714-3
    ISSN 0028-0836
    ISSN (online) 1476-4687
    DOI 10.1038/s41586-023-06984-8
    Database MEDical Literature Analysis and Retrieval System OnLINE

  10. Book ; Online: Assessment of Differentially Private Synthetic Data for Utility and Fairness in End-to-End Machine Learning Pipelines for Tabular Data

    Pereira, Mayana / Kshirsagar, Meghana / Mukherjee, Sumit / Dodhia, Rahul / Ferres, Juan Lavista / de Sousa, Rafael

    2023  

    Abstract Differentially private (DP) synthetic data sets are a solution for sharing data while preserving the privacy of individual data providers. Understanding the effects of utilizing DP synthetic data in end-to-end machine learning pipelines impacts areas such as health care and humanitarian action, where data is scarce and regulated by restrictive privacy laws. In this work, we investigate the extent to which synthetic data can replace real, tabular data in machine learning pipelines and identify the most effective synthetic data generation techniques for training and evaluating machine learning models. We investigate the impacts of differentially private synthetic data on downstream classification tasks from the point of view of utility as well as fairness. Our analysis is comprehensive and includes representatives of the two main types of synthetic data generation algorithms: marginal-based and GAN-based. To the best of our knowledge, our work is the first that: (i) proposes a training and evaluation framework that does not assume that real data is available for testing the utility and fairness of machine learning models trained on synthetic data; (ii) presents the most extensive analysis of synthetic data set generation algorithms in terms of utility and fairness when used for training machine learning models; and (iii) encompasses several different definitions of fairness. Our findings demonstrate that marginal-based synthetic data generators surpass GAN-based ones regarding model training utility for tabular data. Indeed, we show that models trained using data generated by marginal-based algorithms can exhibit similar utility to models trained using real data. Our analysis also reveals that the marginal-based synthetic data generator MWEM PGM can train models that simultaneously achieve utility and fairness characteristics close to those obtained by models trained with real data.

    Comment: arXiv admin note: text overlap with arXiv:2106.10241
    Keywords Computer Science - Machine Learning ; Computer Science - Cryptography and Security
    Subject code 006
    Publishing date 2023-10-29
    Publishing country United States
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)
