LIVIVO - The Search Portal for Life Sciences

zur deutschen Oberfläche wechseln
Advanced search

Search results

Result 1 - 10 of total 10

Search options

  1. Article ; Online: Leveraging machine learning for taxonomic classification of emerging astroviruses

    Fatemeh Alipour / Connor Holmes / Yang Young Lu / Kathleen A. Hill / Lila Kari

    Frontiers in Molecular Biosciences, Vol

    2024  Volume 10

    Abstract: Astroviruses are a family of genetically diverse viruses associated with disease in humans and birds with significant health effects and economic burdens. Astrovirus taxonomic classification includes two genera, Avastrovirus and Mamastrovirus. However, ... ...

    Abstract Astroviruses are a family of genetically diverse viruses associated with disease in humans and birds with significant health effects and economic burdens. Astrovirus taxonomic classification includes two genera, Avastrovirus and Mamastrovirus. However, with next-generation sequencing, broader interspecies transmission has been observed necessitating a reexamination of the current host-based taxonomic classification approach. In this study, a novel taxonomic classification method is presented for emergent and as yet unclassified astroviruses, based on whole genome sequence k-mer composition in addition to host information. An optional component responsible for identifying recombinant sequences was added to the method’s pipeline, to counteract the impact of genetic recombination on viral classification. The proposed three-pronged classification method consists of a supervised machine learning method, an unsupervised machine learning method, and the consideration of host species. Using this three-pronged approach, we propose genus labels for 191 as yet unclassified astrovirus genomes. Genus labels are also suggested for an additional eight as yet unclassified astrovirus genomes for which incompatibility was observed with the host species, suggesting cross-species infection. Lastly, our machine learning-based approach augmented by a principal component analysis (PCA) analysis provides evidence supporting the hypothesis of the existence of human astrovirus (HAstV) subgenus of the genus Mamastrovirus, and a goose astrovirus (GoAstV) subgenus of the genus Avastrovirus. Overall, this multipronged machine learning approach provides a fast, reliable, and scalable prediction method of taxonomic labels, able to keep pace with emerging viruses and the exponential increase in the output of modern genome sequencing technologies.
    Keywords machine learning ; viral classification and clustering ; family Astroviridae ; Avastrovirus ; Mamastrovirus ; alignment-free classification ; Biology (General) ; QH301-705.5
    Subject code 006
    Language English
    Publishing date 2024-01-01T00:00:00Z
    Publisher Frontiers Media S.A.
    Document type Article ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)

    More links

    Kategorien

  2. Article ; Online: DeLUCS

    Pablo Millán Arias / Fatemeh Alipour / Kathleen A. Hill / Lila Kari

    PLoS ONE, Vol 17, Iss

    Deep learning for unsupervised clustering of DNA sequences

    2022  Volume 1

    Abstract: We present a novel Deep Learning method for the Unsupervised Clustering of DNA Sequences (DeLUCS) that does not require sequence alignment, sequence homology, or (taxonomic) identifiers. DeLUCS uses Frequency Chaos Game Representations (FCGR) of primary ... ...

    Abstract We present a novel Deep Learning method for the Unsupervised Clustering of DNA Sequences (DeLUCS) that does not require sequence alignment, sequence homology, or (taxonomic) identifiers. DeLUCS uses Frequency Chaos Game Representations (FCGR) of primary DNA sequences, and generates “mimic” sequence FCGRs to self-learn data patterns (genomic signatures) through the optimization of multiple neural networks. A majority voting scheme is then used to determine the final cluster assignment for each sequence. The clusters learned by DeLUCS match true taxonomic groups for large and diverse datasets, with accuracies ranging from 77% to 100%: 2,500 complete vertebrate mitochondrial genomes, at taxonomic levels from sub-phylum to genera; 3,200 randomly selected 400 kbp-long bacterial genome segments, into clusters corresponding to bacterial families; three viral genome and gene datasets, averaging 1,300 sequences each, into clusters corresponding to virus subtypes. DeLUCS significantly outperforms two classic clustering methods (K-means++ and Gaussian Mixture Models) for unlabelled data, by as much as 47%. DeLUCS is highly effective, it is able to cluster datasets of unlabelled primary DNA sequences totalling over 1 billion bp of data, and it bypasses common limitations to classification resulting from the lack of sequence homology, variation in sequence length, and the absence or instability of sequence annotations and taxonomic identifiers. Thus, DeLUCS offers fast and accurate DNA sequence clustering for previously intractable datasets.
    Keywords Medicine ; R ; Science ; Q
    Subject code 612 ; 006
    Language English
    Publishing date 2022-01-01T00:00:00Z
    Publisher Public Library of Science (PLoS)
    Document type Article ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)

    More links

    Kategorien

  3. Article ; Online: DeLUCS

    Pablo Millán Arias / Fatemeh Alipour / Kathleen A Hill / Lila Kari

    PLoS ONE, Vol 17, Iss 1, p e

    Deep learning for unsupervised clustering of DNA sequences.

    2022  Volume 0261531

    Abstract: We present a novel Deep Learning method for the Unsupervised Clustering of DNA Sequences (DeLUCS) that does not require sequence alignment, sequence homology, or (taxonomic) identifiers. DeLUCS uses Frequency Chaos Game Representations (FCGR) of primary ... ...

    Abstract We present a novel Deep Learning method for the Unsupervised Clustering of DNA Sequences (DeLUCS) that does not require sequence alignment, sequence homology, or (taxonomic) identifiers. DeLUCS uses Frequency Chaos Game Representations (FCGR) of primary DNA sequences, and generates "mimic" sequence FCGRs to self-learn data patterns (genomic signatures) through the optimization of multiple neural networks. A majority voting scheme is then used to determine the final cluster assignment for each sequence. The clusters learned by DeLUCS match true taxonomic groups for large and diverse datasets, with accuracies ranging from 77% to 100%: 2,500 complete vertebrate mitochondrial genomes, at taxonomic levels from sub-phylum to genera; 3,200 randomly selected 400 kbp-long bacterial genome segments, into clusters corresponding to bacterial families; three viral genome and gene datasets, averaging 1,300 sequences each, into clusters corresponding to virus subtypes. DeLUCS significantly outperforms two classic clustering methods (K-means++ and Gaussian Mixture Models) for unlabelled data, by as much as 47%. DeLUCS is highly effective, it is able to cluster datasets of unlabelled primary DNA sequences totalling over 1 billion bp of data, and it bypasses common limitations to classification resulting from the lack of sequence homology, variation in sequence length, and the absence or instability of sequence annotations and taxonomic identifiers. Thus, DeLUCS offers fast and accurate DNA sequence clustering for previously intractable datasets.
    Keywords Medicine ; R ; Science ; Q
    Subject code 612 ; 006
    Language English
    Publishing date 2022-01-01T00:00:00Z
    Publisher Public Library of Science (PLoS)
    Document type Article ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)

    More links

    Kategorien

  4. Article ; Online: Environment and taxonomy shape the genomic signature of prokaryotic extremophiles

    Pablo Millán Arias / Joseph Butler / Gurjit S. Randhawa / Maximillian P. M. Soltysiak / Kathleen A. Hill / Lila Kari

    Scientific Reports, Vol 13, Iss 1, Pp 1-

    2023  Volume 17

    Abstract: Abstract This study provides comprehensive quantitative evidence suggesting that adaptations to extreme temperatures and pH imprint a discernible environmental component in the genomic signature of microbial extremophiles. Both supervised and ... ...

    Abstract Abstract This study provides comprehensive quantitative evidence suggesting that adaptations to extreme temperatures and pH imprint a discernible environmental component in the genomic signature of microbial extremophiles. Both supervised and unsupervised machine learning algorithms were used to analyze genomic signatures, each computed as the k-mer frequency vector of a 500 kbp DNA fragment arbitrarily selected to represent a genome. Computational experiments classified/clustered genomic signatures extracted from a curated dataset of $$\sim 700$$ ∼ 700 extremophile (temperature, pH) bacteria and archaea genomes, at multiple scales of analysis, $$1\le k \le 6$$ 1 ≤ k ≤ 6 . The supervised learning resulted in high accuracies for taxonomic classifications at $$2\le k \le 6$$ 2 ≤ k ≤ 6 , and medium to medium-high accuracies for environment category classifications of the same datasets at $$3\le k \le 6$$ 3 ≤ k ≤ 6 . For $$k=3$$ k = 3 , our findings were largely consistent with amino acid compositional biases and codon usage patterns in coding regions, previously attributed to extreme environment adaptations. The unsupervised learning of unlabelled sequences identified several exemplars of hyperthermophilic organisms with large similarities in their genomic signatures, in spite of belonging to different domains in the Tree of Life.
    Keywords Medicine ; R ; Science ; Q
    Subject code 006
    Language English
    Publishing date 2023-09-01T00:00:00Z
    Publisher Nature Portfolio
    Document type Article ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)

    More links

    Kategorien

  5. Article ; Online: Efficient Algorithms for Computing the Inner Edit Distance of a Regular Language via Transducers

    Lila Kari / Stavros Konstantinidis / Steffen Kopecki / Meng Yang

    Algorithms, Vol 11, Iss 11, p

    2018  Volume 165

    Abstract: The concept of edit distance and its variants has applications in many areas such as computational linguistics, bioinformatics, and synchronization error detection in data communications. Here, we revisit the problem of computing the inner edit distance ... ...

    Abstract The concept of edit distance and its variants has applications in many areas such as computational linguistics, bioinformatics, and synchronization error detection in data communications. Here, we revisit the problem of computing the inner edit distance of a regular language given via a Nondeterministic Finite Automaton (NFA). This problem relates to the inherent maximal error-detecting capability of the language in question. We present two efficient algorithms for solving this problem, both of which execute in time <math display="inline"> <semantics> <mrow> <mi>O</mi> <mo stretchy="false">(</mo> <msup> <mi>r</mi> <mn>2</mn> </msup> <msup> <mi>n</mi> <mn>2</mn> </msup> <mi>d</mi> <mo stretchy="false">)</mo> </mrow> </semantics> </math> , where r is the cardinality of the alphabet involved, n is the number of transitions in the given NFA, and d is the computed edit distance. We have implemented one of the two algorithms and present here a set of performance tests. The correctness of the algorithms is based on the connection between word distances and error detection and the fact that nondeterministic transducers can be used to represent the errors (resp., edit operations) involved in error-detection (resp., in word distances).
    Keywords algorithms ; automata ; complexity ; edit distance ; implementation ; transducers ; regular language ; Industrial engineering. Management engineering ; T55.4-60.8 ; Electronic computers. Computer science ; QA75.5-76.95
    Language English
    Publishing date 2018-10-01T00:00:00Z
    Publisher MDPI AG
    Document type Article ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)

    More links

    Kategorien

  6. Article ; Online: An open-source k-mer based machine learning tool for fast and accurate subtyping of HIV-1 genomes.

    Stephen Solis-Reyes / Mariano Avino / Art Poon / Lila Kari

    PLoS ONE, Vol 13, Iss 11, p e

    2018  Volume 0206409

    Abstract: For many disease-causing virus species, global diversity is clustered into a taxonomy of subtypes with clinical significance. In particular, the classification of infections among the subtypes of human immunodeficiency virus type 1 (HIV-1) is a routine ... ...

    Abstract For many disease-causing virus species, global diversity is clustered into a taxonomy of subtypes with clinical significance. In particular, the classification of infections among the subtypes of human immunodeficiency virus type 1 (HIV-1) is a routine component of clinical management, and there are now many classification algorithms available for this purpose. Although several of these algorithms are similar in accuracy and speed, the majority are proprietary and require laboratories to transmit HIV-1 sequence data over the network to remote servers. This potentially exposes sensitive patient data to unauthorized access, and makes it impossible to determine how classifications are made and to maintain the data provenance of clinical bioinformatic workflows. We propose an open-source supervised and alignment-free subtyping method (Kameris) that operates on k-mer frequencies in HIV-1 sequences. We performed a detailed study of the accuracy and performance of subtype classification in comparison to four state-of-the-art programs. Based on our testing data set of manually curated real-world HIV-1 sequences (n = 2, 784), Kameris obtained an overall accuracy of 97%, which matches or exceeds all other tested software, with a processing rate of over 1,500 sequences per second. Furthermore, our fully standalone general-purpose software provides key advantages in terms of data security and privacy, transparency and reproducibility. Finally, we show that our method is readily adaptable to subtype classification of other viruses including dengue, influenza A, and hepatitis B and C virus.
    Keywords Medicine ; R ; Science ; Q
    Subject code 006
    Language English
    Publishing date 2018-01-01T00:00:00Z
    Publisher Public Library of Science (PLoS)
    Document type Article ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)

    More links

    Kategorien

  7. Article ; Online: Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens

    Gurjit S Randhawa / Maximillian P M Soltysiak / Hadi El Roz / Camila P E de Souza / Kathleen A Hill / Lila Kari

    PLoS ONE, Vol 15, Iss 4, p e

    COVID-19 case study.

    2020  Volume 0232391

    Abstract: The 2019 novel coronavirus (renamed SARS-CoV-2, and generally referred to as the COVID-19 virus) has spread to 184 countries with over 1.5 million confirmed cases. Such major viral outbreaks demand early elucidation of taxonomic classification and origin ...

    Abstract The 2019 novel coronavirus (renamed SARS-CoV-2, and generally referred to as the COVID-19 virus) has spread to 184 countries with over 1.5 million confirmed cases. Such major viral outbreaks demand early elucidation of taxonomic classification and origin of the virus genomic sequence, for strategic planning, containment, and treatment. This paper identifies an intrinsic COVID-19 virus genomic signature and uses it together with a machine learning-based alignment-free approach for an ultra-fast, scalable, and highly accurate classification of whole COVID-19 virus genomes. The proposed method combines supervised machine learning with digital signal processing (MLDSP) for genome analyses, augmented by a decision tree approach to the machine learning component, and a Spearman's rank correlation coefficient analysis for result validation. These tools are used to analyze a large dataset of over 5000 unique viral genomic sequences, totalling 61.8 million bp, including the 29 COVID-19 virus sequences available on January 27, 2020. Our results support a hypothesis of a bat origin and classify the COVID-19 virus as Sarbecovirus, within Betacoronavirus. Our method achieves 100% accurate classification of the COVID-19 virus sequences, and discovers the most relevant relationships among over 5000 viral genomes within a few minutes, ab initio, using raw DNA sequence data alone, and without any specialized biological knowledge, training, gene or genome annotations. This suggests that, for novel viral and pathogen genome sequences, this alignment-free whole-genome machine-learning approach can provide a reliable real-time option for taxonomic classification.
    Keywords Medicine ; R ; Science ; Q ; covid19
    Subject code 006
    Language English
    Publishing date 2020-01-01T00:00:00Z
    Publisher Public Library of Science (PLoS)
    Document type Article ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)

    More links

    Kategorien

  8. Article ; Online: Ciliate Gene Unscrambling with Fewer Templates

    Lila Kari / Afroza Rahman

    Electronic Proceedings in Theoretical Computer Science, Vol 31, Iss Proc. DCFS 2010, Pp 120-

    2010  Volume 129

    Abstract: One of the theoretical models proposed for the mechanism of gene unscrambling in some species of ciliates is the template-guided recombination (TGR) system by Prescott, Ehrenfeucht and Rozenberg which has been generalized by Daley and McQuillan from a ... ...

    Abstract One of the theoretical models proposed for the mechanism of gene unscrambling in some species of ciliates is the template-guided recombination (TGR) system by Prescott, Ehrenfeucht and Rozenberg which has been generalized by Daley and McQuillan from a formal language theory perspective. In this paper, we propose a refinement of this model that generates regular languages using the iterated TGR system with a finite initial language and a finite set of templates, using fewer templates and a smaller alphabet compared to that of the Daley-McQuillan model. To achieve Turing completeness using only finite components, i.e., a finite initial language and a finite set of templates, we also propose an extension of the contextual template-guided recombination system (CTGR system) by Daley and McQuillan, by adding an extra control called permitting contexts on the usage of templates.
    Keywords Mathematics ; QA1-939 ; Electronic computers. Computer science ; QA75.5-76.95
    Language English
    Publishing date 2010-08-01T00:00:00Z
    Publisher Open Publishing Association
    Document type Article ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)

    More links

    Kategorien

  9. Article ; Online: State Complexity of Catenation Combined with Star and Reversal

    Bo Cui / Yuan Gao / Lila Kari / Sheng Yu

    Electronic Proceedings in Theoretical Computer Science, Vol 31, Iss Proc. DCFS 2010, Pp 58-

    2010  Volume 67

    Abstract: This paper is a continuation of our research work on state complexity of combined operations. Motivated by applications, we study the state complexities of two particular combined operations: catenation combined with star and catenation combined with ... ...

    Abstract This paper is a continuation of our research work on state complexity of combined operations. Motivated by applications, we study the state complexities of two particular combined operations: catenation combined with star and catenation combined with reversal. We show that the state complexities of both of these combined operations are considerably less than the compositions of the state complexities of their individual participating operations.
    Keywords Mathematics ; QA1-939 ; Electronic computers. Computer science ; QA75.5-76.95
    Language English
    Publishing date 2010-08-01T00:00:00Z
    Publisher Open Publishing Association
    Document type Article ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)

    More links

    Kategorien

  10. Article ; Online: Mapping the space of genomic signatures.

    Lila Kari / Kathleen A Hill / Abu S Sayem / Rallis Karamichalis / Nathaniel Bryans / Katelyn Davis / Nikesh S Dattani

    PLoS ONE, Vol 10, Iss 5, p e

    2015  Volume 0119815

    Abstract: We propose a computational method to measure and visualize interrelationships among any number of DNA sequences allowing, for example, the examination of hundreds or thousands of complete mitochondrial genomes. An "image distance" is computed for each ... ...

    Abstract We propose a computational method to measure and visualize interrelationships among any number of DNA sequences allowing, for example, the examination of hundreds or thousands of complete mitochondrial genomes. An "image distance" is computed for each pair of graphical representations of DNA sequences, and the distances are visualized as a Molecular Distance Map: Each point on the map represents a DNA sequence, and the spatial proximity between any two points reflects the degree of structural similarity between the corresponding sequences. The graphical representation of DNA sequences utilized, Chaos Game Representation (CGR), is genome- and species-specific and can thus act as a genomic signature. Consequently, Molecular Distance Maps could inform species identification, taxonomic classifications and, to a certain extent, evolutionary history. The image distance employed, Structural Dissimilarity Index (DSSIM), implicitly compares the occurrences of oligomers of length up to k (herein k = 9) in DNA sequences. We computed DSSIM distances for more than 5 million pairs of complete mitochondrial genomes, and used Multi-Dimensional Scaling (MDS) to obtain Molecular Distance Maps that visually display the sequence relatedness in various subsets, at different taxonomic levels. This general-purpose method does not require DNA sequence alignment and can thus be used to compare similar or vastly different DNA sequences, genomic or computer-generated, of the same or different lengths. We illustrate potential uses of this approach by applying it to several taxonomic subsets: phylum Vertebrata, (super)kingdom Protista, classes Amphibia-Insecta-Mammalia, class Amphibia, and order Primates. This analysis of an extensive dataset confirms that the oligomer composition of full mtDNA sequences can be a source of taxonomic information. This method also correctly finds the mtDNA sequences most closely related to that of the anatomically modern human (the Neanderthal, the Denisovan, and the chimp), and that the sequence most different from it in this dataset belongs to a cucumber.
    Keywords Medicine ; R ; Science ; Q
    Subject code 612
    Language English
    Publishing date 2015-01-01T00:00:00Z
    Publisher Public Library of Science (PLoS)
    Document type Article ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)

    More links

    Kategorien

To top