LIVIVO - Search results -

Search results

Result 1 - 10 of total 14

Book ; Online: Heaps' Law in GPT-Neo Large Language Model Emulated Corpora

Lai, Uyen / Randhawa, Gurjit S. / Sheridan, Paul

2023

Abstract	Heaps' law is an empirical relation in text analysis that predicts vocabulary growth as a function of corpus size. While this law has been validated in diverse human-authored text corpora, its applicability to large language model generated text remains unexplored. This study addresses this gap, focusing on the emulation of corpora using the suite of GPT-Neo large language models. To conduct our investigation, we emulated corpora of PubMed abstracts using three different parameter sizes of the GPT-Neo model. Our emulation strategy involved using the initial five words of each PubMed abstract as a prompt and instructing the model to expand the content up to the original abstract's length. Our findings indicate that the generated corpora adhere to Heaps' law. Interestingly, as the GPT-Neo model size grows, its generated vocabulary increasingly adheres to Heaps' law as observed in human-authored text. To further improve the richness and authenticity of GPT-Neo outputs, future iterations could emphasize enhancing model size or refining the model architecture to curtail vocabulary repetition. Comment: 4 pages, 1 figure, 1 table, EVIA 2023
Keywords	Computer Science - Computation and Language
Subject code	410
Publishing date	2023-11-10
Publishing country	us
Document type	Book ; Online
Database	BASE - Bielefeld Academic Search Engine (life sciences selection)

Full text online

Full text

Inter-library loan at ZB MED

Your chosen title can be delivered directly to ZB MED Cologne location if you are registered as a user at ZB MED Cologne.
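
Note on the record above: Heaps' law is commonly written V(n) = K · n^β, where V(n) is the number of distinct words among the first n tokens. The sketch below is one possible way to check a corpus against it, using a log-log least-squares fit. The function names, the whitespace tokenisation, and the toy sentence are illustrative assumptions and are not taken from the paper.

```python
import math

def vocabulary_growth(tokens):
    """Return [(n, V(n))]: distinct-token count after the first n tokens."""
    seen = set()
    curve = []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        curve.append((n, len(seen)))
    return curve

def fit_heaps(curve):
    """Least-squares fit of log V = log K + beta * log n; returns (K, beta)."""
    xs = [math.log(n) for n, _ in curve]
    ys = [math.log(v) for _, v in curve]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    beta = sxy / sxx
    K = math.exp(my - beta * mx)
    return K, beta

# Toy input (hypothetical); a real check would use a full corpus.
tokens = "the cat sat on the mat and the dog sat on the log".split()
curve = vocabulary_growth(tokens)
K, beta = fit_heaps(curve)
```

On real corpora the fit is usually done on a subsample of the curve, since consecutive points are strongly correlated; this sketch fits every point for brevity.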

Article ; Online: Environment and taxonomy shape the genomic signature of prokaryotic extremophiles.

Arias, Pablo Millán / Butler, Joseph / Randhawa, Gurjit S / Soltysiak, Maximillian P M / Hill, Kathleen A / Kari, Lila

Scientific reports

2023 Volume 13, Issue 1, Page(s) 16105

Abstract	This study provides comprehensive quantitative evidence suggesting that adaptations to extreme temperatures and pH imprint a discernible environmental component in the genomic signature of microbial extremophiles. Both supervised and unsupervised machine learning algorithms were used to analyze genomic signatures, each computed as the k-mer frequency vector of a 500 kbp DNA fragment arbitrarily selected to represent a genome. Computational experiments classified/clustered genomic signatures extracted from a curated dataset of ~700 extremophile (temperature, pH) bacteria and archaea genomes, at multiple scales of analysis, 1 ≤ k ≤ 6. The supervised learning resulted in high accuracies for taxonomic classifications at 2 ≤ k ≤ 6, and medium to medium-high accuracies for environment category classifications of the same datasets at 3 ≤ k ≤ 6. For k = 3, our findings were largely consistent with amino acid compositional biases and codon usage patterns in coding regions, previously attributed to extreme environment adaptations. The unsupervised learning of unlabelled sequences identified several exemplars of hyperthermophilic organisms with large similarities in their genomic signatures, in spite of belonging to different domains in the Tree of Life.
MeSH term(s)	Extremophiles/genetics ; Genomics/methods ; Bacteria/genetics ; Archaea/genetics ; Genome, Archaeal/genetics
Language	English
Publishing date	2023-09-26
Publishing country	England
Document type	Journal Article ; Research Support, Non-U.S. Gov't
ZDB-ID	2615211-3
ISSN (online)	2045-2322
DOI	10.1038/s41598-023-42518-y
Database	MEDical Literature Analysis and Retrieval System OnLINE

Order via subito

This service is chargeable due to the Delivery terms set by subito. Orders including an article and supplementary material will be classified as separate orders. In these cases, fees will be demanded for each order.
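
Note on the record above: the abstract describes a genomic signature as the k-mer frequency vector of a DNA fragment. A minimal sketch of such a vector follows, assuming overlapping k-mers over the alphabet ACGT, windows containing other symbols skipped, and counts normalised to sum to 1. These choices and the function name are assumptions for illustration; the abstract does not specify them.

```python
from itertools import product

def kmer_frequency_vector(seq, k):
    """Relative frequency of every k-mer over ACGT, in lexicographic order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    # Slide a window of width k over the sequence, one base at a time.
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows with ambiguous bases such as N
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[km] / total if total else 0.0 for km in kmers]

# Toy fragment (hypothetical); the study uses 500 kbp fragments.
vec = kmer_frequency_vector("ACGTACGT", 2)
```

The vector has 4^k entries, so k = 6 gives 4096 features per genome.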

Article ; Online: Environment and taxonomy shape the genomic signature of prokaryotic extremophiles

Pablo Millán Arias / Joseph Butler / Gurjit S. Randhawa / Maximillian P. M. Soltysiak / Kathleen A. Hill / Lila Kari

Scientific Reports, Vol 13, Iss 1, Pp 1-17

2023

Abstract	This study provides comprehensive quantitative evidence suggesting that adaptations to extreme temperatures and pH imprint a discernible environmental component in the genomic signature of microbial extremophiles. Both supervised and unsupervised machine learning algorithms were used to analyze genomic signatures, each computed as the k-mer frequency vector of a 500 kbp DNA fragment arbitrarily selected to represent a genome. Computational experiments classified/clustered genomic signatures extracted from a curated dataset of ~700 extremophile (temperature, pH) bacteria and archaea genomes, at multiple scales of analysis, 1 ≤ k ≤ 6. The supervised learning resulted in high accuracies for taxonomic classifications at 2 ≤ k ≤ 6, and medium to medium-high accuracies for environment category classifications of the same datasets at 3 ≤ k ≤ 6. For k = 3, our findings were largely consistent with amino acid compositional biases and codon usage patterns in coding regions, previously attributed to extreme environment adaptations. The unsupervised learning of unlabelled sequences identified several exemplars of hyperthermophilic organisms with large similarities in their genomic signatures, in spite of belonging to different domains in the Tree of Life.
Keywords	Medicine ; R ; Science ; Q
Subject code	006
Language	English
Publishing date	2023-09-01T00:00:00Z
Publisher	Nature Portfolio
Document type	Article ; Online
Database	BASE - Bielefeld Academic Search Engine (life sciences selection)

Full text online

Full text

Inter-library loan at ZB MED

Your chosen title can be delivered directly to ZB MED Cologne location if you are registered as a user at ZB MED Cologne.

Article ; Online: MLDSP-GUI: an alignment-free standalone tool with an interactive graphical user interface for DNA sequence comparison and analysis.

Randhawa, Gurjit S / Hill, Kathleen A / Kari, Lila

Bioinformatics (Oxford, England)

2019 Volume 36, Issue 7, Page(s) 2258–2259

Abstract	Summary: Machine Learning with Digital Signal Processing and Graphical User Interface (MLDSP-GUI) is an open-source, alignment-free, ultrafast, computationally lightweight, and standalone software tool with an interactive GUI for comparison and analysis of DNA sequences. MLDSP-GUI is a general-purpose tool that can be used for a variety of applications such as taxonomic classification, disease classification, virus subtype classification, evolutionary analyses, among others. Availability and implementation: MLDSP-GUI is open-source, cross-platform compatible, and is available under the terms of the Creative Commons Attribution 4.0 International license (http://creativecommons.org/licenses/by/4.0/). The executable and dataset files are available at https://sourceforge.net/projects/mldsp-gui/. Supplementary information: Supplementary data are available at Bioinformatics online.
MeSH term(s)	Base Sequence ; Machine Learning ; Signal Processing, Computer-Assisted ; Software ; User-Computer Interface
Language	English
Publishing date	2019-12-12
Publishing country	England
Document type	Journal Article ; Research Support, Non-U.S. Gov't
ZDB-ID	1422668-6
ISSN (online)	1367-4811
ISSN	1367-4803
DOI	10.1093/bioinformatics/btz918
Database	MEDical Literature Analysis and Retrieval System OnLINE

In stock of ZB MED Cologne/Königswinter

Zs.A 2374: Show issues

Location:
Depending on availability (see the holdings information)
up to vol. 1994: article orders via the online order form
vol. 1995 - 2021: reading room (2nd floor)
from vol. 2022: reading room (ground floor)

Order via subito

Article ; Online: ML-DSP: Machine Learning with Digital Signal Processing for ultrafast, accurate, and scalable genome classification at all taxonomic levels.

Randhawa, Gurjit S / Hill, Kathleen A / Kari, Lila

BMC genomics

2019 Volume 20, Issue 1, Page(s) 267

Abstract	Background: Although software tools abound for the comparison, analysis, identification, and classification of genomic sequences, taxonomic classification remains challenging due to the magnitude of the datasets and the intrinsic problems associated with classification. The need exists for an approach and software tool that addresses the limitations of existing alignment-based methods, as well as the challenges of recently proposed alignment-free methods. Results: We propose a novel combination of supervised Machine Learning with Digital Signal Processing, resulting in ML-DSP: an alignment-free software tool for ultrafast, accurate, and scalable genome classification at all taxonomic levels. We test ML-DSP by classifying 7396 full mitochondrial genomes at various taxonomic levels, from kingdom to genus, with an average classification accuracy of >97%. A quantitative comparison with state-of-the-art classification software tools is performed, on two small benchmark datasets and one large 4322 vertebrate mtDNA genomes dataset. Our results show that ML-DSP overwhelmingly outperforms the alignment-based software MEGA7 (alignment with MUSCLE or CLUSTALW) in terms of processing time, while having comparable classification accuracies for small datasets and superior accuracies for the large dataset. Compared with the alignment-free software FFP (Feature Frequency Profile), ML-DSP has significantly better classification accuracy, and is overall faster. We also provide preliminary experiments indicating the potential of ML-DSP to be used for other datasets, by classifying 4271 complete dengue virus genomes into subtypes with 100% accuracy, and 4,710 bacterial genomes into phyla with 95.5% accuracy. Lastly, our analysis shows that the "Purine/Pyrimidine", "Just-A" and "Real" numerical representations of DNA sequences outperform ten other such numerical representations used in the Digital Signal Processing literature for DNA classification purposes. 
Conclusions: Due to its superior classification accuracy, speed, and scalability to large datasets, ML-DSP is highly relevant in the classification of newly discovered organisms, in distinguishing genomic signatures and identifying their mechanistic determinants, and in evaluating genome integrity.
MeSH term(s)	Algorithms ; Animals ; Computer Simulation ; Dengue Virus/genetics ; Genome, Bacterial ; Genome, Mitochondrial ; Genome, Viral ; Genomics/methods ; Humans ; Machine Learning ; Signal Processing, Computer-Assisted ; Software ; Vertebrates/classification ; Vertebrates/genetics
Language	English
Publishing date	2019-04-03
Publishing country	England
Document type	Journal Article
ISSN	1471-2164
ISSN (online)	1471-2164
DOI	10.1186/s12864-019-5571-y
Database	MEDical Literature Analysis and Retrieval System OnLINE

Order via subito

Inter-library loan at ZB MED

Your chosen title can be delivered directly to ZB MED Cologne location if you are registered as a user at ZB MED Cologne.
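
Note on the record above: the abstract names numerical representations of DNA ("Purine/Pyrimidine", "Just-A", "Real") combined with digital signal processing, without giving details. The sketch below illustrates the general idea under stated assumptions: the numeric values of the Purine/Pyrimidine mapping and the use of a discrete Fourier transform magnitude spectrum as the signal-processing step are assumptions here, and the function names are hypothetical.

```python
import cmath

# Assumed mapping (illustrative): purines A, G -> -1; pyrimidines C, T -> +1.
PP_MAP = {"A": -1, "G": -1, "C": 1, "T": 1}

def purine_pyrimidine(seq):
    """Convert a DNA string to a numeric signal, dropping unknown symbols."""
    return [PP_MAP[b] for b in seq.upper() if b in PP_MAP]

def magnitude_spectrum(signal):
    """Magnitudes of the discrete Fourier transform (naive O(n^2) version)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]

# Toy sequence (hypothetical); real genomes would call for an FFT library.
signal = purine_pyrimidine("ACGT")
spectrum = magnitude_spectrum(signal)
```

Spectra computed this way for two sequences can then be compared with a distance measure and passed to a supervised classifier; which measure and classifier the tool uses is not stated in the abstract.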

Article ; Online: SomaticSiMu: a mutational signature simulator.

Chen, David / Randhawa, Gurjit S / Soltysiak, Maximillian P M / de Souza, Camila P E / Kari, Lila / Singh, Shiva M / Hill, Kathleen A

Bioinformatics (Oxford, England)

2022 Volume 38, Issue 9, Page(s) 2619–2620

Abstract	Summary: SomaticSiMu is an in silico simulator of single and double base substitutions, and single base insertions and deletions in an input genomic sequence to mimic mutational signatures. SomaticSiMu outputs simulated DNA sequences and mutational catalogues with imposed mutational signatures. The tool is the first mutational signature simulator featuring a graphical user interface, control of mutation rates and built-in visualization tools of the simulated mutations. Simulated datasets are useful as a ground truth to test the accuracy and sensitivity of DNA sequence classification tools and mutational signature extraction tools under different experimental scenarios. The reliability of SomaticSiMu was affirmed by (i) supervised machine learning classification of simulated sequences with different mutation types and burdens, and (ii) mutational signature extraction from simulated mutational catalogues. Availability and implementation: SomaticSiMu is written in Python 3.8.3. The open-source code, documentation and tutorials are available at https://github.com/HillLab/SomaticSiMu under the terms of the Creative Commons Attribution 4.0 International License. Supplementary information: Supplementary data are available at Bioinformatics online.
MeSH term(s)	Reproducibility of Results ; Software ; Mutation ; Genomics ; Genome
Language	English
Publishing date	2022-03-08
Publishing country	England
Document type	Journal Article ; Research Support, Non-U.S. Gov't
ZDB-ID	1422668-6
ISSN (online)	1367-4811
ISSN	1367-4803
DOI	10.1093/bioinformatics/btac128
Database	MEDical Literature Analysis and Retrieval System OnLINE

In stock of ZB MED Cologne/Königswinter

Zs.A 2374: Show issues

Location:
Depending on availability (see the holdings information)
up to vol. 1994: article orders via the online order form
vol. 1995 - 2021: reading room (2nd floor)
from vol. 2022: reading room (ground floor)

Order via subito

Article ; Online: Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study

Randhawa, Gurjit S. / Soltysiak, Maximillian P.M. / Roz, Hadi El / de Souza, Camila P.E. / Hill, Kathleen A. / Kari, Lila

bioRxiv

Keywords	covid19
Language	English
Publishing date	2020-02-20
Publisher	Cold Spring Harbor Laboratory
Document type	Article ; Online
DOI	10.1101/2020.02.03.932350
Database	COVID19

Full text online

Inter-library loan at ZB MED

Your chosen title can be delivered directly to ZB MED Cologne location if you are registered as a user at ZB MED Cologne.

Article ; Online: Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study.

Randhawa, Gurjit S / Soltysiak, Maximillian P M / El Roz, Hadi / de Souza, Camila P E / Hill, Kathleen A / Kari, Lila

PloS one

2020 Volume 15, Issue 4, Page(s) e0232391

Abstract	The 2019 novel coronavirus (renamed SARS-CoV-2, and generally referred to as the COVID-19 virus) has spread to 184 countries with over 1.5 million confirmed cases. Such major viral outbreaks demand early elucidation of taxonomic classification and origin of the virus genomic sequence, for strategic planning, containment, and treatment. This paper identifies an intrinsic COVID-19 virus genomic signature and uses it together with a machine learning-based alignment-free approach for an ultra-fast, scalable, and highly accurate classification of whole COVID-19 virus genomes. The proposed method combines supervised machine learning with digital signal processing (MLDSP) for genome analyses, augmented by a decision tree approach to the machine learning component, and a Spearman's rank correlation coefficient analysis for result validation. These tools are used to analyze a large dataset of over 5000 unique viral genomic sequences, totalling 61.8 million bp, including the 29 COVID-19 virus sequences available on January 27, 2020. Our results support a hypothesis of a bat origin and classify the COVID-19 virus as Sarbecovirus, within Betacoronavirus. Our method achieves 100% accurate classification of the COVID-19 virus sequences, and discovers the most relevant relationships among over 5000 viral genomes within a few minutes, ab initio, using raw DNA sequence data alone, and without any specialized biological knowledge, training, gene or genome annotations. This suggests that, for novel viral and pathogen genome sequences, this alignment-free whole-genome machine-learning approach can provide a reliable real-time option for taxonomic classification.
MeSH term(s)	Betacoronavirus/classification ; Betacoronavirus/genetics ; COVID-19 ; Coronavirus Infections/epidemiology ; Coronavirus Infections/virology ; Genome, Viral ; Genomics ; Humans ; Machine Learning ; Pandemics ; Pneumonia, Viral/epidemiology ; Pneumonia, Viral/virology ; SARS-CoV-2
Keywords	covid19
Language	English
Publishing date	2020-04-24
Publishing country	United States
Document type	Journal Article ; Research Support, Non-U.S. Gov't
ZDB-ID	2267670-3
ISSN (online)	1932-6203
DOI	10.1371/journal.pone.0232391
Database	MEDical Literature Analysis and Retrieval System OnLINE

Order via subito

Article ; Online: Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study

Gurjit S Randhawa / Maximillian P M Soltysiak / Hadi El Roz / Camila P E de Souza / Kathleen A Hill / Lila Kari

PLoS ONE, Vol 15, Iss 4, p e0232391

2020

Abstract	The 2019 novel coronavirus (renamed SARS-CoV-2, and generally referred to as the COVID-19 virus) has spread to 184 countries with over 1.5 million confirmed cases. Such major viral outbreaks demand early elucidation of taxonomic classification and origin of the virus genomic sequence, for strategic planning, containment, and treatment. This paper identifies an intrinsic COVID-19 virus genomic signature and uses it together with a machine learning-based alignment-free approach for an ultra-fast, scalable, and highly accurate classification of whole COVID-19 virus genomes. The proposed method combines supervised machine learning with digital signal processing (MLDSP) for genome analyses, augmented by a decision tree approach to the machine learning component, and a Spearman's rank correlation coefficient analysis for result validation. These tools are used to analyze a large dataset of over 5000 unique viral genomic sequences, totalling 61.8 million bp, including the 29 COVID-19 virus sequences available on January 27, 2020. Our results support a hypothesis of a bat origin and classify the COVID-19 virus as Sarbecovirus, within Betacoronavirus. Our method achieves 100% accurate classification of the COVID-19 virus sequences, and discovers the most relevant relationships among over 5000 viral genomes within a few minutes, ab initio, using raw DNA sequence data alone, and without any specialized biological knowledge, training, gene or genome annotations. This suggests that, for novel viral and pathogen genome sequences, this alignment-free whole-genome machine-learning approach can provide a reliable real-time option for taxonomic classification.
Keywords	Medicine ; R ; Science ; Q ; covid19
Subject code	006
Language	English
Publishing date	2020-01-01T00:00:00Z
Publisher	Public Library of Science (PLoS)
Document type	Article ; Online
Database	BASE - Bielefeld Academic Search Engine (life sciences selection)

Full text online

Full text

Inter-library loan at ZB MED

Your chosen title can be delivered directly to ZB MED Cologne location if you are registered as a user at ZB MED Cologne.

Article ; Online: Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study

Randhawa, Gurjit S. / Soltysiak, Maximillian P. M. / El Roz, Hadi / de Souza, Camila P. E. / Hill, Kathleen A. / Kari, Lila

PLOS ONE

2020 Volume 15, Issue 4, Page(s) e0232391

Keywords	General Biochemistry, Genetics and Molecular Biology ; General Agricultural and Biological Sciences ; General Medicine ; covid19
Language	English
Publisher	Public Library of Science (PLoS)
Publishing country	us
Document type	Article ; Online
ISSN	1932-6203
DOI	10.1371/journal.pone.0232391
Database	BASE - Bielefeld Academic Search Engine (life sciences selection)

Full text online

Full text

Order via subito

Inter-library loan at ZB MED

Your chosen title can be delivered directly to ZB MED Cologne location if you are registered as a user at ZB MED Cologne.
