LIVIVO - The Search Portal for Life Sciences

Search results

Result 1 - 5 of total 5

  1. Article: STAR_outliers: a python package that separates univariate outliers from non-normal distributions.

    Gregg, John T / Moore, Jason H

    BioData mining

    2023  Volume 16, Issue 1, Page(s) 25

    Abstract There are not currently any univariate outlier detection algorithms that transform and model arbitrarily shaped distributions to remove univariate outliers. Some algorithms model skew, even fewer model kurtosis, and none of them model bimodality and monotonicity. To overcome these challenges, we have implemented an algorithm for Skew and Tail-heaviness Adjusted Removal of Outliers (STAR_outliers) that robustly removes univariate outliers from distributions with many different shape profiles, including extreme skew, extreme kurtosis, bimodality, and monotonicity. We show that STAR_outliers removes simulated outliers with greater recall and precision than several general algorithms, and it also models the outlier bounds of real data distributions with greater accuracy.
    Background: Reliably removing univariate outliers from arbitrarily shaped distributions is a difficult task. Incorrectly assuming unimodality or overestimating tail heaviness fails to remove outliers, while underestimating tail heaviness incorrectly removes regular data from the tails. Skew often produces one heavy tail and one light tail, and we show that several sophisticated outlier removal algorithms often fail to remove outliers from the light tail. Multivariate outlier detection algorithms have recently become popular, but having tested PyOD's multivariate outlier removal algorithms, we found them to be inadequate for univariate outlier removal. They usually do not allow for univariate input, and they do not fit their distributions of outliership scores with a model on which an outlier threshold can be accurately established. Thus, there is a need for a flexible outlier removal algorithm that can model arbitrarily shaped univariate distributions.
    Results: In order to effectively model arbitrarily shaped univariate distributions, we have combined several well-established algorithms into a new algorithm called STAR_outliers. STAR_outliers removes more simulated true outliers and fewer non-outliers than several other univariate algorithms. These include several normality-assuming outlier removal methods, PyOD's isolation forest (IF) outlier removal algorithm (ACM Transactions on Knowledge Discovery from Data (TKDD) 6:3, 2012) with default settings, and an IQR based algorithm by Verardi and Vermandele that removes outliers while accounting for skew and kurtosis (Verardi and Vermandele, Journal de la Société Française de Statistique 157:90-114, 2016). Since the IF algorithm's default model poorly fit the outliership scores, we also compared the isolation forest algorithm with a model that entails removing as many datapoints as STAR_outliers does in order of decreasing outliership scores. We also compared these algorithms on the publicly available 2018 National Health and Nutrition Examination Survey (NHANES) data by setting the outlier threshold to keep values falling within the main 99.3 percent of the fitted model's domain. We show that our STAR_outliers algorithm removes significantly closer to 0.7 percent of values from these features than other outlier removal methods on average.
    Conclusions: STAR_outliers is an easily implemented python package for removing outliers that outperforms multiple commonly used methods of univariate outlier removal.
    Language English
    Publishing date 2023-09-04
    Publishing country England
    Document type Journal Article
    ZDB-ID 2438773-3
    ISSN 1756-0381
    DOI 10.1186/s13040-023-00342-0
    Database MEDical Literature Analysis and Retrieval System OnLINE
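
    Note on the 99.3 percent threshold: the abstract above does not say where this figure comes from. One plausible reading, offered here only as an assumption, is that it matches the share of a normal distribution lying inside Tukey's classic fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR). The sketch below checks that coverage with the Python standard library and shows the plain, non-skew-adjusted IQR rule. It is not the STAR_outliers algorithm and does not use its API; the function names are hypothetical.

```python
from statistics import NormalDist, quantiles


def tukey_fence_coverage(k: float = 1.5) -> float:
    """Fraction of a standard normal distribution inside Tukey's fences."""
    nd = NormalDist()
    q1, q3 = nd.inv_cdf(0.25), nd.inv_cdf(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return nd.cdf(upper) - nd.cdf(lower)


def iqr_filter(values, k: float = 1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR].

    Plain Tukey rule: symmetric fences, no adjustment for skew,
    kurtosis, bimodality, or monotonicity.
    """
    q1, _, q3 = quantiles(values, n=4)  # three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]


if __name__ == "__main__":
    # Coverage under normality is roughly 0.993, i.e. about 0.7 percent flagged.
    print(round(tukey_fence_coverage(), 4))
    # The single extreme value is dropped; the rest are kept.
    print(iqr_filter([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```

    Because these fences are symmetric around the quartiles, a rule of this kind is the sort of normality-assuming baseline that the abstract reports as failing on skewed or heavy-tailed data.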

  2. Article: Improving Genetic Association Studies with a Novel Methodology that Unveils the Hidden Complexity of All-Cause Heart Failure.

    Gregg, John T / Himes, Blanca E / Asselbergs, Folkert W / Moore, Jason H

    medRxiv : the preprint server for health sciences

    2023  

    Abstract Motivation: Genome-Wide Association Studies (GWAS) commonly assume phenotypic and genetic homogeneity that is not present in complex conditions. We designed Transformative Regression Analysis of Combined Effects (TRACE), a GWAS methodology that better accounts for clinical phenotype heterogeneity and identifies gene-by-environment (GxE) interactions. We demonstrated with UK Biobank (UKB) data that TRACE increased the variance explained in All-Cause Heart Failure (AHF) via the discovery of novel single nucleotide polymorphism (SNP) and SNP-by-environment (i.e. GxE) interaction associations. First, we transformed 312 AHF-related ICD10 codes (including AHF) into continuous low-dimensional features (i.e., latent phenotypes) for a more nuanced disease representation. Then, we ran a standard GWAS on our latent phenotypes to discover main effects and identified GxE interactions with target encoding. Genes near associated SNPs subsequently underwent enrichment analysis to explore potential functional mechanisms underlying associations. Latent phenotypes were regressed against their SNP hits and the estimated latent phenotype values were used to measure the amount of AHF variance explained.
    Results: Our method identified over 100 main GWAS effects that were consistent with prior studies and hundreds of novel gene-by-smoking interactions, which collectively accounted for approximately 10% of AHF variance. This represents an improvement over traditional GWAS whose results account for a negligible proportion of AHF variance. Enrichment analyses suggested that hundreds of miRNAs mediated the SNP effect on various AHF-related biological pathways. The TRACE framework can be applied to decode the genetics of other complex diseases.
    Availability: All code is available at https://github.com/EpistasisLab/latent_phenotype_project.
    Language English
    Publishing date 2023-08-04
    Publishing country United States
    Document type Preprint
    DOI 10.1101/2023.08.02.23293567
    Database MEDical Literature Analysis and Retrieval System OnLINE

  3. Article ; Online: PMLB v1.0: an open-source dataset collection for benchmarking machine learning methods.

    Romano, Joseph D / Le, Trang T / La Cava, William / Gregg, John T / Goldberg, Daniel J / Chakraborty, Praneel / Ray, Natasha L / Himmelstein, Daniel / Fu, Weixuan / Moore, Jason H

    Bioinformatics (Oxford, England)

    2021  Volume 38, Issue 3, Page(s) 878–880

    Abstract Motivation: Novel machine learning and statistical modeling studies rely on standardized comparisons to existing methods using well-studied benchmark datasets. Few tools exist that provide rapid access to many of these datasets through a standardized, user-friendly interface that integrates well with popular data science workflows.
    Results: This release of PMLB (Penn Machine Learning Benchmarks) provides the largest collection of diverse, public benchmark datasets for evaluating new machine learning and data science methods aggregated in one location. v1.0 introduces a number of critical improvements developed following discussions with the open-source community.
    Availability and implementation: PMLB is available at https://github.com/EpistasisLab/pmlb. Python and R interfaces for PMLB can be installed through the Python Package Index and Comprehensive R Archive Network, respectively.
    MeSH term(s) Software ; Benchmarking ; Machine Learning ; Models, Statistical
    Language English
    Publishing date 2021-10-22
    Publishing country England
    Document type Journal Article ; Research Support, N.I.H., Extramural
    ZDB-ID 1422668-6
    ISSN 1367-4811 ; 1367-4803
    ISSN (online) 1367-4811
    ISSN 1367-4803
    DOI 10.1093/bioinformatics/btab727
    Database MEDical Literature Analysis and Retrieval System OnLINE

  4. Article ; Online: A Recurrent Silent Mutation Implicates fecA in Ethanol Tolerance by Escherichia coli.

    Lupino, Katherine M / Romano, Kymberleigh A / Simons, Matthew J / Gregg, John T / Panepinto, Leanna / Cruz, Ghislaine M / Grajek, Lauren / Caputo, Gregory A / Hickman, Mark J / Hecht, Gregory B

    BMC microbiology

    2018  Volume 18, Issue 1, Page(s) 36

    Abstract Background: An issue associated with efficient bioethanol production is the fact that the desired product is toxic to the biocatalyst. Among other effects, ethanol has previously been found to influence the membrane of E. coli in a dose-dependent manner and induce changes in the lipid composition of the plasma membrane. We describe here the characterization of a collection of ethanol-tolerant strains derived from the ethanologenic Escherichia coli strain FBR5.
    Results: Membrane permeability assays indicate that many of the strains in the collection have alterations in membrane permeability and/or responsiveness of the membrane to environmental changes such as temperature shifts or ethanol exposure. However, analysis of the strains by gas chromatography and mass spectrometry revealed no qualitative changes in the acyl chain composition of membrane lipids in response to ethanol or temperature. To determine whether these strains contain any mutations that might contribute to ethanol tolerance or changes in membrane permeability, we sequenced the entire genome of each strain. Unexpectedly, none of the strains displayed mutations in genes known to control membrane lipid synthesis, and a few strains carried no mutations at all. Interestingly, we found that four independently-isolated strains acquired an identical C → A (V244V) silent mutation in the ferric citrate transporter gene fecA. Further, we demonstrated that either a deletion of fecA or over-expression of fecA can confer increased ethanol survival, suggesting that any misregulation of fecA expression affects the cellular response to ethanol.
    Conclusions: The fact that no mutations were observed in several ethanol-tolerant strains suggested that epigenetic mechanisms play a role in E. coli ethanol tolerance and membrane permeability. Our data also represent the first direct phenotypic evidence that the fecA gene plays a role in ethanol tolerance. We propose that the recurring silent mutation may exert an effect on phenotype by altering RNA-mediated regulation of fecA expression.
    MeSH term(s) Bacterial Proteins/genetics ; Bacterial Proteins/metabolism ; Cell Membrane ; Cell Membrane Permeability/drug effects ; Drug Tolerance/genetics ; Escherichia coli/genetics ; Escherichia coli/metabolism ; Escherichia coli Proteins/genetics ; Escherichia coli Proteins/metabolism ; Ethanol/toxicity ; Gene Expression Regulation, Bacterial ; Genetic Loci ; Membrane Proteins/genetics ; Membrane Proteins/metabolism ; Microbial Sensitivity Tests ; Microbial Viability/drug effects ; Receptors, Cell Surface/genetics ; Receptors, Cell Surface/metabolism ; Silent Mutation ; Temperature ; Whole Genome Sequencing
    Chemical Substances Bacterial Proteins ; Escherichia coli Proteins ; FecA protein, E coli ; Membrane Proteins ; Receptors, Cell Surface ; Ethanol (3K9958V90M)
    Language English
    Publishing date 2018-04-18
    Publishing country England
    Document type Journal Article ; Research Support, Non-U.S. Gov't ; Research Support, U.S. Gov't, Non-P.H.S.
    ISSN 1471-2180
    ISSN (online) 1471-2180
    DOI 10.1186/s12866-018-1180-1
    Database MEDical Literature Analysis and Retrieval System OnLINE

  5. Book ; Online: PMLB v1.0

    Romano, Joseph D. / Le, Trang T. / La Cava, William / Gregg, John T. / Goldberg, Daniel J. / Ray, Natasha L. / Chakraborty, Praneel / Himmelstein, Daniel / Fu, Weixuan / Moore, Jason H.

    An open source dataset collection for benchmarking machine learning methods

    2020  

    Abstract Motivation: Novel machine learning and statistical modeling studies rely on standardized comparisons to existing methods using well-studied benchmark datasets. Few tools exist that provide rapid access to many of these datasets through a standardized, user-friendly interface that integrates well with popular data science workflows. Results: This release of PMLB provides the largest collection of diverse, public benchmark datasets for evaluating new machine learning and data science methods aggregated in one location. v1.0 introduces a number of critical improvements developed following discussions with the open-source community. Availability: PMLB is available at https://github.com/EpistasisLab/pmlb. Python and R interfaces for PMLB can be installed through the Python Package Index and Comprehensive R Archive Network, respectively.

    Comment: 4 pages, 1 figure. *: These authors contributed equally
    Keywords Computer Science - Machine Learning ; Computer Science - Databases ; H.2.8
    Publishing date 2020-11-30
    Publishing country United States
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)
