Benchmarking Genomic Encodings for AMR Prediction: The Superiority of K-mers and Ensemble Learning over Deep Learning

Lam-Tung Nguyen; Cuong Nguyen; Tien Dat Nguyen; Minh Trien Pham; Thi Quyen Ha; Thi-Xuan Trinh

doi:10.25073/2588-1086/vnucsce.6980

Published Mar 16, 2026

How to Cite

NGUYEN, Lam-Tung et al. Benchmarking Genomic Encodings for AMR Prediction: The Superiority of K-mers and Ensemble Learning over Deep Learning. VNU Journal of Science: Computer Science and Communication Engineering, [S.l.], v. 42, n. 1, mar. 2026. ISSN 2588-1086. Available at: <//jcsce.vnu.edu.vn/index.php/jcsce/article/view/6980>. Date accessed: 08 july 2026. doi: https://doi.org/10.25073/2588-1086/vnucsce.6980.

Issue

Vol 42 No 1 (2026)

Section

Original Articles

Abstract

The rise of antimicrobial resistance (AMR) necessitates fast and accurate computational
approaches for predicting resistance phenotypes directly from genomic data. Although Whole-Genome
Sequencing (WGS) coupled with Deep Learning (DL) models is the state-of-the-art paradigm,
systematic comparisons of genomic encoding and visualization methods remain limited,
particularly for AMR prediction in Escherichia coli. This study systematically assesses four
genomic representation strategies for predicting resistance to ciprofloxacin, gentamicin, and
ampicillin: traditional K-mer counting with ensemble tree-based classifiers, reference-based SNP
profiles with ensemble learning, One-Hot Encoding with a 1D Convolutional Neural Network
(1D-CNN), and Chaos Game Representation (CGR) with a 2D Convolutional Neural Network (2D-CNN).
The results show that the alignment-free traditional Machine Learning (ML) approach based on
K-mer frequency profiles (specifically 4-mers), coupled with gradient boosting algorithms such as
XGBoost and LightGBM, has consistently superior discriminatory power compared with both
SNP-based ML and Deep Learning architectures. This advantage was most pronounced for gentamicin
and ampicillin, where complex resistance mechanisms involving mobile genetic elements are
captured more effectively by the K-mer approach. The study also benchmarks the limitations of
Deep Learning: the One-Hot 1D-CNN model exhibited a severe calibration failure, characterized by
an extremely low Recall for ampicillin (F1-Score of only 0.1132), whereas the SNP-based ML models
maintained robust performance on the same feature set, highlighting the architectural efficiency
of gradient boosting over CNNs for tabular genomic data. Statistical analysis confirmed the
significance of these differences, with K-mer ML significantly outperforming Deep Learning across
all antibiotics (p < 0.001 for gentamicin and ampicillin). The amino acid 4-mer XGBoost model
achieved an AUC of 0.9917 (95% CI: 0.9827-0.9983) for ciprofloxacin. The study concludes that,
for current dataset sizes and complex resistance phenotypes, the dense information representation
of K-mers offers a more accurate and robust solution, and it identifies the 4-mer XGBoost and
Combined K-mer LightGBM configurations as the optimal modeling strategies.

Keywords: Machine learning, Deep learning, Bioinformatics, Computational Biology, Antimicrobials, Bacteria, Escherichia coli, Applied microbiology.
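
For readers unfamiliar with the encodings compared above, the following is a minimal Python sketch of two of them: a normalized K-mer frequency profile and the coordinate sequence of a Chaos Game Representation. It is illustrative only and is not the authors' pipeline. The function names `kmer_profile` and `cgr_points`, the nucleotide alphabet, the CGR corner assignment, and the choice to skip ambiguous bases are all assumptions made for this example; the abstract also reports amino acid 4-mers, which would require a different alphabet.

```python
from collections import Counter
from itertools import product


def kmer_profile(sequence, k=4, alphabet="ACGT"):
    """Return normalized k-mer frequencies in a fixed lexicographic order.

    Hypothetical helper: windows containing symbols outside `alphabet`
    (for example N) are ignored rather than imputed.
    """
    sequence = sequence.upper()
    # Fixed vocabulary so every genome maps to a vector of the same length.
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]


def cgr_points(sequence):
    """Return Chaos Game Representation coordinates for a DNA sequence.

    Assumed corner layout: A=(0,0), C=(0,1), G=(1,1), T=(1,0).
    Each base moves the current point halfway toward its corner.
    """
    corners = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for base in sequence.upper():
        if base not in corners:
            continue  # assumption: ambiguous bases are skipped
        cx, cy = corners[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


if __name__ == "__main__":
    profile = kmer_profile("ACGTACGTTTGA", k=4)
    print(len(profile))          # 4**4 = 256 features per genome
    print(cgr_points("ACGT")[:2])
```

A fixed-length vector such as the one returned by `kmer_profile` is the kind of tabular input that gradient boosting classifiers accept directly, whereas the CGR coordinates would typically be binned into a 2D image before being passed to a 2D-CNN.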