A Bandwidth-Efficient High-Performance RTL-Microarchitecture of 2D-Convolution for Deep Neural Networks

Nguyen Kiem Hung; Tran Quoc Long

doi:10.25073/2588-1086/vnucsce.596

Published Aug 7, 2023

How to Cite

HUNG, Nguyen Kiem; LONG, Tran Quoc. A Bandwidth-Efficient High-Performance RTL-Microarchitecture of 2D-Convolution for Deep Neural Networks. VNU Journal of Science: Computer Science and Communication Engineering, [S.l.], v. 39, n. 2, aug. 2023. ISSN 2588-1086. Available at: <//jcsce.vnu.edu.vn/index.php/jcsce/article/view/596>. Date accessed: 13 july 2025. doi: https://doi.org/10.25073/2588-1086/vnucsce.596.

Issue

Vol 39 No 2

Section

Original Articles

Abstract

The computational complexity and high memory-access bandwidth of the convolutional layers in convolutional neural networks (CNNs) call for specialized hardware architectures that accelerate CNN computations while keeping hardware costs reasonable for area-constrained embedded applications. This paper presents a register-transfer level (RTL) microarchitecture of a hardware- and bandwidth-efficient, high-performance 2D convolution unit for CNNs in deep learning. The 2D convolution unit consists of three main components: a dedicated Loader, a Circle Buffer, and a MAC (Multiplier-Accumulator) unit. It has a 2-stage pipeline structure that reduces latency, increases processing throughput, and reduces power consumption. The proposed architecture eliminates reloading of both the weights and the input image data. The unit is configurable to support 2D convolution operations with different sizes of input image matrix and kernel filter. Thanks to the efficient reuse of preloaded input data, the architecture reduces memory access time, power, and execution time while simplifying hardware implementation. The 2D convolution unit has been simulated and implemented on a Xilinx FPGA platform to evaluate its performance. Experimental results show that our design is 1.54× and 13.6× faster than the designs in [1] and [2], respectively, at lower hardware cost and without using any of the FPGA’s dedicated hardware blocks. By reusing preloaded data, our design achieves a bandwidth reduction ratio between 66.4% and 90.5%.

Keywords: 2D Convolution, RTL microarchitecture, Circle Buffer, Deep Neural Network, MAC, Loader.
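
The data-reuse idea summarized in the abstract can be illustrated with a small software model. The sketch below is not the authors' RTL design; it is a minimal Python model, under stated assumptions, that compares a direct 2D convolution (every operand fetched from memory for every output) with a version that keeps the K most recent image rows in a circular buffer so that each input pixel is fetched only once. The function names, the read-counting scheme, the stride of 1, the absence of padding, and the assumption that the kernel weights are preloaded (and so not counted) are all hypothetical choices made for illustration. As is conventional for CNNs, the kernel is applied without flipping.

```python
def conv2d_naive(image, kernel):
    """Direct 2D convolution (stride 1, no padding).

    Counts one external-memory read per image operand, i.e. no reuse.
    """
    H, W = len(image), len(image[0])
    K = len(kernel)
    reads = 0
    out = []
    for r in range(H - K + 1):
        row = []
        for c in range(W - K + 1):
            acc = 0
            for i in range(K):
                for j in range(K):
                    acc += image[r + i][c + j] * kernel[i][j]
                    reads += 1  # every operand is fetched again
            row.append(acc)
        out.append(row)
    return out, reads


def conv2d_circular_buffer(image, kernel):
    """Same result, but each image row is fetched once into a K-row circular buffer.

    Illustrative model only: the "loader" step copies one new row per
    iteration, and the multiply-accumulate loop reads from the buffer.
    """
    H, W = len(image), len(image[0])
    K = len(kernel)
    buf = [None] * K  # holds the K most recently loaded rows
    reads = 0
    out = []
    for r in range(H):
        buf[r % K] = list(image[r])  # load one new row, overwriting the oldest
        reads += W                   # W external reads for this row
        if r >= K - 1:               # buffer now holds a full K-row window
            top = r - K + 1          # image row index of the window's top row
            row = []
            for c in range(W - K + 1):
                acc = 0
                for i in range(K):
                    line = buf[(top + i) % K]
                    for j in range(K):
                        acc += line[c + j] * kernel[i][j]  # multiply-accumulate
                row.append(acc)
            out.append(row)
    return out, reads
```

For a 6×6 image and a 3×3 kernel, this toy model counts 144 image reads for the direct version and 36 for the buffered version, a 75% reduction. That figure describes only this simplified model and is not a reproduction of the 66.4% to 90.5% range reported in the abstract.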
