Improving Biomedical Multi-document Abstractive Summarization Model with Syntax Tree Pruning and Generative Pre-training Adaptation
Abstract
Biomedical multi-answer summarization (MAS) presents critical challenges for healthcare applications, where standard transformer-based models face input length limitations, factual inconsistency risks, and inadequate query-driven content selection mechanisms. We propose SAMSUM, a Syntax-aware Adaptive Transformer-based Model for MAS, which integrates three innovative components to address these fundamental limitations. Our approach combines an adaptive BART architecture with extractive preprocessing to mitigate information loss, query-conditioned formatting to ensure medical question relevance, and dynamic length prediction for optimal information density. A syntax tree pruning mechanism employs supervised gradient boosting classification to systematically eliminate redundant phrases while preserving medical content integrity and grammatical structure. Comprehensive evaluation on the MEDIQA-MAS 2021 dataset demonstrates that SAMSUM achieves state-of-the-art performance across all evaluation metrics, with a ROUGE-2 F1 score of 17.3% and a BERTScore F1 of 66.8%, substantially outperforming existing baselines and challenge participants. Data and code are available at: https://github.com/catcd/SAMSum.
Keywords: biomedical text summarization, multi-answer summarization, syntax tree pruning,
abstractive summarization, generative models, healthcare systems, clinical decision support.
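
For readers unfamiliar with the ROUGE-2 F1 metric reported in the abstract, the following is a minimal Python sketch of how a bigram-overlap F1 score can be computed. It assumes lowercase whitespace tokenization and a single reference summary. It is not the official ROUGE scorer, nor the evaluation code used for SAMSUM, and the function names `bigrams` and `rouge2_f1` are illustrative.

```python
from collections import Counter


def bigrams(text):
    """Count adjacent token pairs (assumption: lowercase whitespace tokenization)."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2_f1(candidate, reference):
    """Simplified ROUGE-2 F1: harmonic mean of bigram precision and recall."""
    cand, ref = bigrams(candidate), bigrams(reference)
    # Clipped overlap: each bigram counts at most as often as it occurs in both texts.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    generated = "the cat sat on the mat"
    gold = "the cat lay on the mat"
    print(round(rouge2_f1(generated, gold), 3))
```

Published ROUGE implementations typically add stemming, more careful tokenization, and multi-reference handling, so scores from this sketch would not be directly comparable to the figures reported above.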