LIVIVO - The Search Portal for Life Sciences

Search results

Result 1 - 10 of total 14

  1. Book ; Online: Heaps' Law in GPT-Neo Large Language Model Emulated Corpora

    Lai, Uyen / Randhawa, Gurjit S. / Sheridan, Paul

    2023  

    Abstract Heaps' law is an empirical relation in text analysis that predicts vocabulary growth as a function of corpus size. While this law has been validated in diverse human-authored text corpora, its applicability to large language model generated text remains unexplored. This study addresses this gap, focusing on the emulation of corpora using the suite of GPT-Neo large language models. To conduct our investigation, we emulated corpora of PubMed abstracts using three different parameter sizes of the GPT-Neo model. Our emulation strategy involved using the initial five words of each PubMed abstract as a prompt and instructing the model to expand the content up to the original abstract's length. Our findings indicate that the generated corpora adhere to Heaps' law. Interestingly, as the GPT-Neo model size grows, its generated vocabulary increasingly adheres to Heaps' law as observed in human-authored text. To further improve the richness and authenticity of GPT-Neo outputs, future iterations could emphasize enhancing model size or refining the model architecture to curtail vocabulary repetition.

    Comment: 4 pages, 1 figure, 1 table, EVIA 2023
    Keywords Computer Science - Computation and Language
    Subject code 410
    Publishing date 2023-11-10
    Publishing country us
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)
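
Heaps' law, as described in the abstract above, is commonly written as V(n) = K * n^beta, where V is the number of distinct words seen after n tokens. The sketch below is only an illustration of that relation, not the authors' pipeline: the function names `vocabulary_growth` and `fit_heaps`, the regular-expression tokenizer, and the log-log least-squares fit are all assumptions made for the example.

```python
import math
import re


def vocabulary_growth(text):
    """Return (tokens_seen, distinct_words_seen) after each token.

    Tokenization is a simplifying assumption: lowercase runs of letters
    and apostrophes.
    """
    seen = set()
    points = []
    for n, token in enumerate(re.findall(r"[a-z']+", text.lower()), start=1):
        seen.add(token)
        points.append((n, len(seen)))
    return points


def fit_heaps(points):
    """Fit log V = log K + beta * log n by ordinary least squares.

    Needs at least two points with different n. Returns (K, beta).
    """
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(v) for _, v in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta
```

Calling `fit_heaps(vocabulary_growth(corpus_text))` on a human-authored corpus and on a generated one would give two (K, beta) pairs to compare; a real analysis would likely subsample the growth curve rather than fit every token.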

  2. Article ; Online: Environment and taxonomy shape the genomic signature of prokaryotic extremophiles.

    Arias, Pablo Millán / Butler, Joseph / Randhawa, Gurjit S / Soltysiak, Maximillian P M / Hill, Kathleen A / Kari, Lila

    Scientific reports

    2023  Volume 13, Issue 1, Page(s) 16105

    Abstract This study provides comprehensive quantitative evidence suggesting that adaptations to extreme temperatures and pH imprint a discernible environmental component in the genomic signature of microbial extremophiles. Both supervised and unsupervised machine learning algorithms were used to analyze genomic signatures, each computed as the k-mer frequency vector of a 500 kbp DNA fragment arbitrarily selected to represent a genome. Computational experiments classified/clustered genomic signatures extracted from a curated dataset of ∼700 extremophile (temperature, pH) bacteria and archaea genomes, at multiple scales of analysis, 1 ≤ k ≤ 6. The supervised learning resulted in high accuracies for taxonomic classifications at 2 ≤ k ≤ 6, and medium to medium-high accuracies for environment category classifications of the same datasets at 3 ≤ k ≤ 6. For k = 3, our findings were largely consistent with amino acid compositional biases and codon usage patterns in coding regions, previously attributed to extreme environment adaptations. The unsupervised learning of unlabelled sequences identified several exemplars of hyperthermophilic organisms with large similarities in their genomic signatures, in spite of belonging to different domains in the Tree of Life.
    MeSH term(s) Extremophiles/genetics ; Genomics/methods ; Bacteria/genetics ; Archaea/genetics ; Genome, Archaeal/genetics
    Language English
    Publishing date 2023-09-26
    Publishing country England
    Document type Journal Article ; Research Support, Non-U.S. Gov't
    ZDB-ID 2615211-3
    ISSN (online) 2045-2322
    ISSN 2045-2322
    DOI 10.1038/s41598-023-42518-y
    Database MEDical Literature Analysis and Retrieval System OnLINE
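
The abstract above describes a genomic signature as the k-mer frequency vector of a DNA fragment. A minimal sketch of such a vector follows; it is not the authors' implementation. The function name `kmer_frequency_vector`, the lexicographic ordering over the alphabet ACGT, the skipping of windows that contain other symbols (such as N), and the normalization to relative frequencies are all assumptions made for illustration.

```python
from itertools import product


def kmer_frequency_vector(sequence, k):
    """Return relative frequencies of all 4**k k-mers over A, C, G, T.

    Entries follow lexicographic k-mer order (AA, AC, AG, ...).
    Windows containing any other symbol are skipped (an assumption).
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = sequence.upper()
    total = 0
    # Slide a window of width k across the sequence, one base at a time.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]
```

For k = 3 the vector has 64 entries and for k = 6 it has 4096, so the vector length grows quickly with k. Vectors of this kind could then be passed to any standard classifier or clustering routine.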

  3. Article ; Online: Environment and taxonomy shape the genomic signature of prokaryotic extremophiles

    Pablo Millán Arias / Joseph Butler / Gurjit S. Randhawa / Maximillian P. M. Soltysiak / Kathleen A. Hill / Lila Kari

    Scientific Reports, Vol 13, Iss 1, Pp 1-17

    2023

    Abstract This study provides comprehensive quantitative evidence suggesting that adaptations to extreme temperatures and pH imprint a discernible environmental component in the genomic signature of microbial extremophiles. Both supervised and unsupervised machine learning algorithms were used to analyze genomic signatures, each computed as the k-mer frequency vector of a 500 kbp DNA fragment arbitrarily selected to represent a genome. Computational experiments classified/clustered genomic signatures extracted from a curated dataset of ∼700 extremophile (temperature, pH) bacteria and archaea genomes, at multiple scales of analysis, 1 ≤ k ≤ 6. The supervised learning resulted in high accuracies for taxonomic classifications at 2 ≤ k ≤ 6, and medium to medium-high accuracies for environment category classifications of the same datasets at 3 ≤ k ≤ 6. For k = 3, our findings were largely consistent with amino acid compositional biases and codon usage patterns in coding regions, previously attributed to extreme environment adaptations. The unsupervised learning of unlabelled sequences identified several exemplars of hyperthermophilic organisms with large similarities in their genomic signatures, in spite of belonging to different domains in the Tree of Life.
    Keywords Medicine ; R ; Science ; Q
    Subject code 006
    Language English
    Publishing date 2023-09-01T00:00:00Z
    Publisher Nature Portfolio
    Document type Article ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)

  4. Article ; Online: MLDSP-GUI: an alignment-free standalone tool with an interactive graphical user interface for DNA sequence comparison and analysis.

    Randhawa, Gurjit S / Hill, Kathleen A / Kari, Lila

    Bioinformatics (Oxford, England)

    2019  Volume 36, Issue 7, Page(s) 2258–2259

    Abstract Summary: Machine Learning with Digital Signal Processing and Graphical User Interface (MLDSP-GUI) is an open-source, alignment-free, ultrafast, computationally lightweight, and standalone software tool with an interactive GUI for comparison and analysis of DNA sequences. MLDSP-GUI is a general-purpose tool that can be used for a variety of applications such as taxonomic classification, disease classification, virus subtype classification, evolutionary analyses, among others.
    Availability and implementation: MLDSP-GUI is open-source, cross-platform compatible, and is available under the terms of the Creative Commons Attribution 4.0 International license (http://creativecommons.org/licenses/by/4.0/). The executable and dataset files are available at https://sourceforge.net/projects/mldsp-gui/.
    Supplementary information: Supplementary data are available at Bioinformatics online.
    MeSH term(s) Base Sequence ; Machine Learning ; Signal Processing, Computer-Assisted ; Software ; User-Computer Interface
    Language English
    Publishing date 2019-12-12
    Publishing country England
    Document type Journal Article ; Research Support, Non-U.S. Gov't
    ZDB-ID 1422668-6
    ISSN (online) 1367-4811
    ISSN 1367-4803
    DOI 10.1093/bioinformatics/btz918
    Database MEDical Literature Analysis and Retrieval System OnLINE

  5. Article ; Online: ML-DSP: Machine Learning with Digital Signal Processing for ultrafast, accurate, and scalable genome classification at all taxonomic levels.

    Randhawa, Gurjit S / Hill, Kathleen A / Kari, Lila

    BMC genomics

    2019  Volume 20, Issue 1, Page(s) 267

    Abstract Background: Although software tools abound for the comparison, analysis, identification, and classification of genomic sequences, taxonomic classification remains challenging due to the magnitude of the datasets and the intrinsic problems associated with classification. The need exists for an approach and software tool that addresses the limitations of existing alignment-based methods, as well as the challenges of recently proposed alignment-free methods.
    Results: We propose a novel combination of supervised Machine Learning with Digital Signal Processing, resulting in ML-DSP: an alignment-free software tool for ultrafast, accurate, and scalable genome classification at all taxonomic levels. We test ML-DSP by classifying 7396 full mitochondrial genomes at various taxonomic levels, from kingdom to genus, with an average classification accuracy of >97%. A quantitative comparison with state-of-the-art classification software tools is performed, on two small benchmark datasets and one large 4322 vertebrate mtDNA genomes dataset. Our results show that ML-DSP overwhelmingly outperforms the alignment-based software MEGA7 (alignment with MUSCLE or CLUSTALW) in terms of processing time, while having comparable classification accuracies for small datasets and superior accuracies for the large dataset. Compared with the alignment-free software FFP (Feature Frequency Profile), ML-DSP has significantly better classification accuracy, and is overall faster. We also provide preliminary experiments indicating the potential of ML-DSP to be used for other datasets, by classifying 4271 complete dengue virus genomes into subtypes with 100% accuracy, and 4,710 bacterial genomes into phyla with 95.5% accuracy. Lastly, our analysis shows that the "Purine/Pyrimidine", "Just-A" and "Real" numerical representations of DNA sequences outperform ten other such numerical representations used in the Digital Signal Processing literature for DNA classification purposes.
    Conclusions: Due to its superior classification accuracy, speed, and scalability to large datasets, ML-DSP is highly relevant in the classification of newly discovered organisms, in distinguishing genomic signatures and identifying their mechanistic determinants, and in evaluating genome integrity.
    MeSH term(s) Algorithms ; Animals ; Computer Simulation ; Dengue Virus/genetics ; Genome, Bacterial ; Genome, Mitochondrial ; Genome, Viral ; Genomics/methods ; Humans ; Machine Learning ; Signal Processing, Computer-Assisted ; Software ; Vertebrates/classification ; Vertebrates/genetics
    Language English
    Publishing date 2019-04-03
    Publishing country England
    Document type Journal Article
    ISSN 1471-2164
    ISSN (online) 1471-2164
    DOI 10.1186/s12864-019-5571-y
    Database MEDical Literature Analysis and Retrieval System OnLINE

  6. Article ; Online: SomaticSiMu: a mutational signature simulator.

    Chen, David / Randhawa, Gurjit S / Soltysiak, Maximillian P M / de Souza, Camila P E / Kari, Lila / Singh, Shiva M / Hill, Kathleen A

    Bioinformatics (Oxford, England)

    2022  Volume 38, Issue 9, Page(s) 2619–2620

    Abstract Summary: SomaticSiMu is an in silico simulator of single and double base substitutions, and single base insertions and deletions in an input genomic sequence to mimic mutational signatures. SomaticSiMu outputs simulated DNA sequences and mutational catalogues with imposed mutational signatures. The tool is the first mutational signature simulator featuring a graphical user interface, control of mutation rates and built-in visualization tools of the simulated mutations. Simulated datasets are useful as a ground truth to test the accuracy and sensitivity of DNA sequence classification tools and mutational signature extraction tools under different experimental scenarios. The reliability of SomaticSiMu was affirmed by (i) supervised machine learning classification of simulated sequences with different mutation types and burdens, and (ii) mutational signature extraction from simulated mutational catalogues.
    Availability and implementation: SomaticSiMu is written in Python 3.8.3. The open-source code, documentation and tutorials are available at https://github.com/HillLab/SomaticSiMu under the terms of the Creative Commons Attribution 4.0 International License.
    Supplementary information: Supplementary data are available at Bioinformatics online.
    MeSH term(s) Reproducibility of Results ; Software ; Mutation ; Genomics ; Genome
    Language English
    Publishing date 2022-03-08
    Publishing country England
    Document type Journal Article ; Research Support, Non-U.S. Gov't
    ZDB-ID 1422668-6
    ISSN (online) 1367-4811
    ISSN 1367-4803
    DOI 10.1093/bioinformatics/btac128
    Database MEDical Literature Analysis and Retrieval System OnLINE

  7. Article ; Online: Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study

    Randhawa, Gurjit S. / Soltysiak, Maximillian P.M. / Roz, Hadi El / de Souza, Camila P.E. / Hill, Kathleen A. / Kari, Lila

    bioRxiv

    Keywords covid19
    Language English
    Publishing date 2020-02-20
    Publisher Cold Spring Harbor Laboratory
    Document type Article ; Online
    DOI 10.1101/2020.02.03.932350
    Database COVID19

  8. Article ; Online: Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study.

    Randhawa, Gurjit S / Soltysiak, Maximillian P M / El Roz, Hadi / de Souza, Camila P E / Hill, Kathleen A / Kari, Lila

    PloS one

    2020  Volume 15, Issue 4, Page(s) e0232391

    Abstract The 2019 novel coronavirus (renamed SARS-CoV-2, and generally referred to as the COVID-19 virus) has spread to 184 countries with over 1.5 million confirmed cases. Such major viral outbreaks demand early elucidation of taxonomic classification and origin of the virus genomic sequence, for strategic planning, containment, and treatment. This paper identifies an intrinsic COVID-19 virus genomic signature and uses it together with a machine learning-based alignment-free approach for an ultra-fast, scalable, and highly accurate classification of whole COVID-19 virus genomes. The proposed method combines supervised machine learning with digital signal processing (MLDSP) for genome analyses, augmented by a decision tree approach to the machine learning component, and a Spearman's rank correlation coefficient analysis for result validation. These tools are used to analyze a large dataset of over 5000 unique viral genomic sequences, totalling 61.8 million bp, including the 29 COVID-19 virus sequences available on January 27, 2020. Our results support a hypothesis of a bat origin and classify the COVID-19 virus as Sarbecovirus, within Betacoronavirus. Our method achieves 100% accurate classification of the COVID-19 virus sequences, and discovers the most relevant relationships among over 5000 viral genomes within a few minutes, ab initio, using raw DNA sequence data alone, and without any specialized biological knowledge, training, gene or genome annotations. This suggests that, for novel viral and pathogen genome sequences, this alignment-free whole-genome machine-learning approach can provide a reliable real-time option for taxonomic classification.
    MeSH term(s) Betacoronavirus/classification ; Betacoronavirus/genetics ; COVID-19 ; Coronavirus Infections/epidemiology ; Coronavirus Infections/virology ; Genome, Viral ; Genomics ; Humans ; Machine Learning ; Pandemics ; Pneumonia, Viral/epidemiology ; Pneumonia, Viral/virology ; SARS-CoV-2
    Keywords covid19
    Language English
    Publishing date 2020-04-24
    Publishing country United States
    Document type Journal Article ; Research Support, Non-U.S. Gov't
    ZDB-ID 2267670-3
    ISSN (online) 1932-6203
    ISSN 1932-6203
    DOI 10.1371/journal.pone.0232391
    Database MEDical Literature Analysis and Retrieval System OnLINE

  9. Article ; Online: Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens

    Gurjit S Randhawa / Maximillian P M Soltysiak / Hadi El Roz / Camila P E de Souza / Kathleen A Hill / Lila Kari

    PLoS ONE, Vol 15, Iss 4, p e0232391

    COVID-19 case study

    2020

    Abstract The 2019 novel coronavirus (renamed SARS-CoV-2, and generally referred to as the COVID-19 virus) has spread to 184 countries with over 1.5 million confirmed cases. Such major viral outbreaks demand early elucidation of taxonomic classification and origin of the virus genomic sequence, for strategic planning, containment, and treatment. This paper identifies an intrinsic COVID-19 virus genomic signature and uses it together with a machine learning-based alignment-free approach for an ultra-fast, scalable, and highly accurate classification of whole COVID-19 virus genomes. The proposed method combines supervised machine learning with digital signal processing (MLDSP) for genome analyses, augmented by a decision tree approach to the machine learning component, and a Spearman's rank correlation coefficient analysis for result validation. These tools are used to analyze a large dataset of over 5000 unique viral genomic sequences, totalling 61.8 million bp, including the 29 COVID-19 virus sequences available on January 27, 2020. Our results support a hypothesis of a bat origin and classify the COVID-19 virus as Sarbecovirus, within Betacoronavirus. Our method achieves 100% accurate classification of the COVID-19 virus sequences, and discovers the most relevant relationships among over 5000 viral genomes within a few minutes, ab initio, using raw DNA sequence data alone, and without any specialized biological knowledge, training, gene or genome annotations. This suggests that, for novel viral and pathogen genome sequences, this alignment-free whole-genome machine-learning approach can provide a reliable real-time option for taxonomic classification.
    Keywords Medicine ; R ; Science ; Q ; covid19
    Subject code 006
    Language English
    Publishing date 2020-01-01T00:00:00Z
    Publisher Public Library of Science (PLoS)
    Document type Article ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)

  10. Article ; Online: Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens

    Randhawa, Gurjit S. / Soltysiak, Maximillian P. M. / El Roz, Hadi / de Souza, Camila P. E. / Hill, Kathleen A. / Kari, Lila

    PLOS ONE

    COVID-19 case study

    2020  Volume 15, Issue 4, Page(s) e0232391

    Keywords General Biochemistry, Genetics and Molecular Biology ; General Agricultural and Biological Sciences ; General Medicine ; covid19
    Language English
    Publisher Public Library of Science (PLoS)
    Publishing country us
    Document type Article ; Online
    ISSN 1932-6203
    DOI 10.1371/journal.pone.0232391
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)
