Disambiguating a Soft Metagenomic Clustering

  • Georgia Institute of Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Clustering is a popular technique for analyzing amplicon sequencing data in metagenomics. Specifically, it is used to assign sequences (reads) to clusters, each cluster representing a species or a higher-level taxonomic unit. Because reads from multiple species often share subsequences and no similarity measure is perfect, it is difficult to assign reads to clusters correctly. Thus, metagenomic clustering methods must either resort to ambiguity or make the best available choice at each read-assignment stage; the latter can produce incorrect clusters and potentially cascading errors. In this article, we argue for first generating an ambiguous clustering and then resolving the ambiguities collectively by analyzing the ambiguous clusters. We propose a rigorous formulation of this problem and show that it is NP-hard. We then propose an efficient heuristic to solve it in practice. We validate our approach on several synthetically generated datasets and two datasets consisting of 16S rDNA sequences from the microbiome of rat guts.

Original language: English
Pages (from-to): 473-485
Number of pages: 13
Journal: Journal of Computational Biology
Volume: 32
Issue number: 5
State: Published - May 1, 2025

Keywords

  • algorithm
  • clustering
  • metagenomics
  • NP-completeness
