Adaptive Weighting by Sinkhorn Distance for Sharing Experiences between Multi-Task Reinforcement Learning in Sparse-reward Environments
Abstract
In multi-task reinforcement learning, an agent using off-policy learning can leverage samples from other tasks to improve its learning process. When the reward signal from the environment is
sparse, the agent in each task spends most of its training time exploring the environment. Therefore,
the shared experiences between tasks can generally be considered as samples derived from an exploration policy. However, when the exploitation phase begins, the shared experience framework must
account for the divergence of policies across different learning tasks. However, when the exploitation phase starts, the sharing experience framework has to take into account the policies’ divergence
issue of different learning tasks. Our work addresses this issue by employing an adaptive weight for
shared experiences. First, a central buffer collects and shares the experiences from each individual
task. To mitigate the effects of policy divergence among multiple tasks, we propose an algorithm that
measures policy distances using the Sinkhorn distance. The computed distances are used to assign
a specific weight to each shared sample, controlling the amount of knowledge shared as the policies
begin to diverge during the exploitation phase. We conduct experiments in two goal-based multi-task learning environments to evaluate the effectiveness of our approach. The results show that our
proposed method improves average rewards by 8%-10% compared with other baselines.
Keywords: multi-task reinforcement learning, off-policy, experience sharing.
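
The core idea described above is mapping a distance between two tasks' policies to a per-sample weight for shared experience. The following is a minimal sketch of that idea, not the paper's implementation: it computes an entropy-regularized Sinkhorn distance between two discrete action distributions using plain NumPy and converts it to a weight via an exponential decay. The 0/1 ground cost, the regularization strength `eps`, and the temperature `beta` are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def sinkhorn_distance(p, q, cost, eps=0.1, n_iters=200):
    """Entropy-regularized optimal-transport (Sinkhorn) distance between
    discrete distributions p and q under a given ground-cost matrix."""
    K = np.exp(-cost / eps)            # Gibbs kernel
    u = np.ones_like(p)
    v = np.ones_like(q)
    for _ in range(n_iters):           # Sinkhorn fixed-point iterations
        u = p / (K @ v)
        v = q / (K.T @ u)
    plan = np.outer(u, v) * K          # regularized transport plan
    return float(np.sum(plan * cost))  # transport cost of the plan

def sample_weight(dist, beta=1.0):
    """Assumed mapping: the further the sample's behavior policy is from the
    learner's policy, the smaller the weight, bounded in (0, 1]."""
    return float(np.exp(-beta * dist))

# Toy usage: action probabilities of two tasks' policies at the same state.
pi_i = np.array([0.7, 0.2, 0.1])       # learner's policy
pi_j = np.array([0.1, 0.3, 0.6])       # policy of the task that produced the sample
cost = 1.0 - np.eye(3)                 # 0/1 ground cost between discrete actions
d = sinkhorn_distance(pi_i, pi_j, cost)
w = sample_weight(d)                   # weight applied to the shared transition
```

In this sketch, the weight would scale the contribution of a shared transition in the off-policy update, so samples from nearly identical policies count fully while samples from diverged policies are down-weighted.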