MUSRA#8 A Reconfigurable Multi-function DMA Controller for High-Performance Computing Systems

A reconfigurable multi-function DMAC for high-performance computing systems

Objective: Design at the RTL level, model in VHDL, simulate with ModelSim, and implement a reconfigurable multi-function direct memory access controller (DMAC) for high-performance computing systems. The DMAC can be reconfigured to perform multiple functions, such as sorting a sequence of numbers, transposing a matrix, merging several matrices into one matrix, …, all within a single hardware circuit.

Abstract: The huge bandwidth demand, along with the requirement to synchronize data structures between different processing structures in a multiprocessor system-on-chip (MPSoC), leads to the need for dedicated memory access controllers. This paper presents the design of a reconfigurable multi-function direct memory access controller (ReDMAC) for high-performance MPSoCs. The ReDMAC supports dynamic reconfiguration: its hardware fabrics can be reconfigured into various functions while the system is running. The ReDMAC supports four operating modes: direct memory access, matrix transposing, data sorting, and matrix merging. The ReDMAC has been modeled at the Register Transfer Level (RTL) in VHDL. The controller has been simulated and evaluated on its ability to be reconfigured to work with the individual functions. The controller has also been synthesized with the Synopsys Design Compiler tool to compare its hardware cost with that of independent implementations of each individual function. Simulation and synthesis results indicate that the proposed design meets the required functionality, while the area of the controller is about three times smaller than the total area of the independent function cores.

I.    Introduction

Recently, the research trend in the design of high-performance computing systems has shifted toward hybrid reconfigurable Multiprocessor Systems-on-Chip (MPSoCs) (e.g. MUSRA [1], Zynq Ultrascale [2], ADRES [3], REMUS [4], CPSoC [5], etc.). These systems normally integrate many heterogeneous processing resources such as software-programmable microprocessors (µP), hardwired IP (Intellectual Property) cores, reconfigurable hardware architectures, etc. To program such a system, a target application is first partitioned into a set of tasks and then mapped onto the heterogeneous computational and routing resources of the system. Partitioning and mapping the application so that it can be executed on several smaller processors in a parallel or pipelined fashion is more efficient than executing it on a single processor. In particular, computation-intensive kernel functions of the application are mapped onto the reconfigurable hardware so that they can achieve performance approximately equivalent to that of an ASIC while maintaining a degree of flexibility close to that of DSP processors [6]. Moreover, by dynamically reconfiguring hardware, reconfigurable computing systems allow many hardware tasks to be mapped onto the same hardware platform, thus reducing the area and power consumption of the design [7].

However, designing such high-performance computing systems also poses some challenges. One of them is the communication and synchronization of data between different processing structures. Parallel processing architectures usually require a huge data bandwidth. Therefore, sufficient system bandwidth is necessary to ensure that data is always available for all resources to run concurrently without idle states. Moreover, because the processing structures have different execution models, the data structures exchanged between them need to be transformed to ensure compatibility.

A common method for data communication between processing units is through a shared memory with the assistance of a direct memory access controller (DMAC). Here, the DMAC transfers data between the shared memory and the parallel processing arrays without the participation of the central processing unit (CPU). Hence, the DMAC is a very important component that helps to increase the data transfer rate and reduce the load on the CPU. Unfortunately, a conventional DMAC [8] in a general-purpose computer usually supports only simple operations that copy contiguous data blocks from a source storage area to a destination one. This architecture is not efficient for accessing the complex data structures used by parallel processing architectures. Because of these limitations, traditional DMAC architectures cannot provide enough throughput to keep up with new technology trends. The role of the DMAC becomes more complicated in parallel computation architectures, and improving and optimizing its functionality becomes a key issue in designing high-performance computing systems [9]. Many DMACs ([10]-[14]) have been proposed with unique features dedicated to a specific domain of applications.

In this paper, we propose and implement a reconfigurable multi-function DMA controller (ReDMAC) for the coarse-grained reconfigurable architecture named MUSRA [1]. Because MUSRA is designed to accelerate the computation of loops in multimedia processing applications, some loop-transformation techniques have to be applied while mapping a specific loop onto MUSRA. As a result, the data transferred between software modules running on microprocessors and loops executing on MUSRA also needs to undergo appropriate transformations such as tiling, fusion, splitting, skewing, sectioning, etc. [15]. Therefore, the proposed DMAC not only moves data from the system's memory to the parallel processing array, but also converts data structures into formats that are compatible with the execution model of the parallel processing array of MUSRA. The DMAC supports four modes:

  • Basic DMA mode moves a data block from one place to another;
  • Fusing DMA mode merges an M×N matrix with an M×L matrix into an M×(N+L) matrix and then moves it to another place;
  • Transposing DMA mode copies an M×N matrix from a specified place and transposes it before moving it to another place;
  • Sorting DMA mode copies a data block from one place and sorts it before moving it to another place.
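As an informal illustration of what the four modes compute, the following Python sketch models them purely at the data level. It is not the RTL design: the function names are hypothetical, matrices are represented as lists of rows, and ascending sort order is an assumption, since the text does not state the ordering or the sorting algorithm used by the hardware.

```python
from typing import List

Matrix = List[List[int]]


def basic_dma(block: List[int]) -> List[int]:
    """Basic mode: copy a data block unchanged."""
    return list(block)


def fusing_dma(a: Matrix, b: Matrix) -> Matrix:
    """Fusing mode: merge an MxN and an MxL matrix into an Mx(N+L) matrix."""
    if len(a) != len(b):
        raise ValueError("both matrices must have the same number of rows")
    return [row_a + row_b for row_a, row_b in zip(a, b)]


def transposing_dma(m: Matrix) -> Matrix:
    """Transposing mode: turn an MxN matrix into an NxM matrix."""
    return [list(column) for column in zip(*m)]


def sorting_dma(block: List[int]) -> List[int]:
    """Sorting mode: return the block sorted (ascending order is assumed)."""
    return sorted(block)


if __name__ == "__main__":
    print(fusing_dma([[1, 2], [3, 4]], [[5], [6]]))
    print(transposing_dma([[1, 2, 3], [4, 5, 6]]))
    print(sorting_dma([3, 1, 2]))
```

In the hardware, each of these transformations is applied on the fly while the data moves between memory and the accelerator, rather than as a separate pass over the data.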

The rest of this paper is organized as follows. The operation principle and architecture of the proposed DMAC are presented in Section II. In Section III, experimental results and the evaluation on flexibility, performance and implementation cost are reported and discussed. Finally, some conclusions are given in Section IV.

II.    Proposed Architecture

A.   Principle Overview

The ReDMAC is designed to act as an adapter between ARM AMBA-based processing systems and hardware accelerators. Fig. 1 shows the ReDMAC's interface and connectivity in a system-on-chip. The interface between the ReDMAC and the processing system complies with the AMBA AHB protocol specification [16]. It includes an AHB master interface for accessing the system's memory and an AHB slave interface for receiving DMA commands from the CPU. In addition, the ReDMAC has another interface for handshaking with the CPU or peripherals that request a DMA session. Structurally, the ReDMAC includes two parts: the DMAC wrapper and the DMAC core. The wrapper makes the interface of the DMAC core compatible with the AHB bus and the accelerator interface, and therefore allows the DMAC core to transfer data between memory and accelerator.

Fig. 1. ReDMAC interface and interconnection in a SoC.

B.   DMAC core

The proposed architecture of the DMAC core is shown in Fig. 2. The DMAC core consists of three main blocks: the Control Register File, the Configuration Context Generator (CCG), and the Control Unit (CU). In particular, to offer run-time reconfigurability, the CU is in turn composed of a parameterized FSM (Finite State Machine), Reconfigurable Fabrics, and a Context Register File (CRF).

Fig. 2. Functional block diagram of DMAC core.

Fig. 3. FSMD flowchart of DMAC core.

The operation of the DMAC core is described by the FSMD (Finite State Machine with Data-path) flowchart in Fig. 3. The control unit is separated from the configuration context generator in order to isolate the functional operation of the DMAC core from the configuration process. This structure avoids interference between the two sections, thus ensuring design stability. In addition, it creates a two-stage pipelined mechanism (as shown in Fig. 2) between these sections, which reduces the time overhead caused by configuration. Right after the CCG finishes the configuration process, a new DMA command can be written to the control register file.
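To make the benefit of the two-stage pipeline concrete, the following Python sketch compares total latency with and without overlapping configuration and data transfer. It is an idealized model rather than a description of the actual hardware timing: each session is given as a hypothetical pair of configuration cycles and transfer cycles, and the CCG is assumed to prepare the next context as soon as it is free.

```python
from typing import List, Tuple

# Each session is (configuration_cycles, transfer_cycles); values are hypothetical.
Session = Tuple[int, int]


def sequential_cycles(sessions: List[Session]) -> int:
    """No overlap: every session is configured, then transferred."""
    return sum(cfg + xfer for cfg, xfer in sessions)


def pipelined_cycles(sessions: List[Session]) -> int:
    """Two-stage pipeline: configuring session i+1 overlaps the transfer of session i."""
    cfg_done = 0   # time at which the configuration stage finishes its current command
    xfer_done = 0  # time at which the transfer stage finishes its current session
    for cfg, xfer in sessions:
        cfg_done += cfg
        # A transfer starts once its context is ready and the previous transfer has ended.
        xfer_done = max(cfg_done, xfer_done) + xfer
    return xfer_done


if __name__ == "__main__":
    demo = [(10, 100), (10, 100), (10, 100)]
    print(sequential_cycles(demo), pipelined_cycles(demo))
```

With three sessions of 10 configuration cycles and 100 transfer cycles each, the model gives 330 cycles without overlap and 310 with it: only the first configuration remains visible.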

1)  Control Register File

The control register file contains the registers that determine the function and control parameters of the DMAC core. These registers are written by an external CPU via the AHB slave interface and are read by the CCG to generate configuration information for the DMAC core. There are six registers:

  • CMR (Command Register) contains the DMA control commands (e.g. function, single/burst transfer mode, data width, etc.) sent by the CPU;
  • SAR (Source Address Register) stores the starting address of the source data block in memory that the DMAC core reads data from;
  • SGR (Source Gap Register) stores the gap between two rows of the source data block in memory that the DMAC core reads data from;
  • DAR (Destination Address Register) stores the starting address of the destination data block in memory that the DMAC core writes data to;
  • DGR (Destination Gap Register) stores the gap between two rows of the destination data block in memory that the DMAC core writes data to;
  • BLR (Block Length Register) stores the amount of data to be processed. It comprises two separate registers: RIR (Row Index Register) holds the number of rows of the data block, and CIR (Column Index Register) holds the number of columns of the data block.
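The role of the address and gap registers can be sketched with a small Python model. Several details here are assumptions and not taken from the design: the field widths and the CMR bit layout are left unspecified, each element is taken to occupy one address, and the "gap" is interpreted as the number of addresses skipped between the end of one row and the start of the next.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ControlRegisters:
    cmr: int  # command word (bit layout not specified in the text)
    sar: int  # source start address
    sgr: int  # source gap between rows
    dar: int  # destination start address
    dgr: int  # destination gap between rows
    rir: int  # number of rows (BLR, row part)
    cir: int  # number of columns (BLR, column part)


def row_start_addresses(base: int, gap: int, rows: int, cols: int) -> List[int]:
    """Start address of each row, assuming the gap is skipped after every row."""
    stride = cols + gap
    return [base + r * stride for r in range(rows)]


if __name__ == "__main__":
    regs = ControlRegisters(cmr=0, sar=0x1000, sgr=4, dar=0x2000, dgr=0, rir=3, cir=8)
    src = row_start_addresses(regs.sar, regs.sgr, regs.rir, regs.cir)
    dst = row_start_addresses(regs.dar, regs.dgr, regs.rir, regs.cir)
    print([hex(a) for a in src], [hex(a) for a in dst])
```

Under these assumptions, a source gap of 4 lets the controller pick a 3×8 sub-block out of a wider 2D array, while a destination gap of 0 packs the rows contiguously.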

2)  Configuration Context Generator (CCG)

The CCG takes charge of two tasks in the DMAC core. Firstly, it receives a DMA request and performs the handshaking protocol to get access to the AHB master bus. Secondly, it decodes the information contained in the register CMR and then generates configuration information and control parameters for the parameterized FSM and the reconfigurable fabrics. A set of such information is called a configuration context for the ReDMAC and is stored in the Context Register File (CRF).

Fig. 4. Timing diagram of handshaking signals.

The CCG is designed so that the handshaking process and the configuration context generation happen in parallel. Fig. 4 shows the timing diagram of the handshaking signals generated by the CCG. After detecting a transition from 0 to 1 on the signal Dreq, the CCG starts handshaking and context generation concurrently. As a result, it takes only one clock cycle to latch a configuration context into the CRF.

3)  Control unit (CU)

Fig. 5. Flowchart of the FSM.

The control unit generates the addresses for reading data from the source memory area, converts the data structure, and moves and writes the data to the target memory area. The CU includes two parts:

  • The parameterized FSM is responsible for generating the signals that control the operation of the reconfigurable fabrics. Its operation is described by the flowchart in Fig. 5.
  • The reconfigurable fabrics consist of routing blocks and basic building blocks that enable them to be physically altered into a control circuit that handles the required DMA transfer and transformation. The routing blocks consist of wires and programmable switches that establish connections between basic building blocks to build up an address generator as well as a data converter according to a specific requirement.

In addition, CU also includes a Context Register File (CRF) that contains the configuration information for reconfigurable fabrics as well as parameters for the FSM. The CRF is established by the CCG based on the content of the register CMR. The values of these registers will be kept during the operation of the DMAC core in a particular mode and only changed when the DMAC core changes its operating mode.

Fig. 6 shows one of the reconfigurable fabrics, which can be configured to build various write address generators depending on the required DMA function. The basic building blocks are shown in grey, the routing blocks in orange, and the registers of the CRF in green. The CRF can hold parameters that specify the address range as well as information bits that set the state of the switches.

Fig. 6. A reconfigurable fabric.
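The idea of one write address generator that changes behavior according to its configuration context can be sketched in Python as follows. This is only an illustrative model: the `Context` fields (`row_step`, `col_step`, `offset`) are hypothetical stand-ins for the parameters held in the CRF, the source block is assumed to be read in row-major order, and addresses are expressed as offsets from the destination base.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Context:
    row_step: int    # added to the write offset for each source row
    col_step: int    # added to the write offset for each source column
    offset: int = 0  # constant offset inside the destination block


def write_offsets(ctx: Context, rows: int, cols: int) -> List[int]:
    """Destination offsets for a rows x cols block read in row-major order."""
    return [ctx.offset + r * ctx.row_step + c * ctx.col_step
            for r in range(rows) for c in range(cols)]


def basic_context(cols: int) -> Context:
    return Context(row_step=cols, col_step=1)


def transposing_context(rows: int) -> Context:
    return Context(row_step=1, col_step=rows)


def fusing_context(total_cols: int, col_offset: int) -> Context:
    return Context(row_step=total_cols, col_step=1, offset=col_offset)


if __name__ == "__main__":
    print(write_offsets(basic_context(3), 2, 3))
    print(write_offsets(transposing_context(2), 2, 3))
    print(write_offsets(fusing_context(5, 3), 2, 2))
```

The same generator produces linear, transposed, or interleaved write patterns purely from the context values, which mirrors how a single fabric can serve several DMA functions without dedicated logic for each.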

III.   Results and Evaluation

A.   Synthesis Results

The proposed reconfigurable DMAC was modeled at the Register-Transfer Level (RTL) in VHDL and successfully synthesized into gate-level circuits by Synopsys Design Compiler with the NANGATE 45nm open cell library [17].

In addition, to evaluate the effectiveness and the area cost of the proposed ReDMAC, we also implemented five other DMAC versions (as shown in TABLE I). Here, the basic, sorting, transposing, and fusing DMACs each implement only one of the functions supported by the ReDMAC, as follows:

  • Basic DMAC only supports transferring data between memory areas;
  • Sorting DMAC supports the basic DMA function with the capability of data sorting;
  • Transposing DMAC supports the basic DMA function with the capability of matrix transposing;
  • Fusing DMAC supports the basic DMA function with the capability of matrix fusing.

The Function-Select DMAC also supports all four functions by integrating all of the above function cores into the same design, but each function is selected by switching between cores.

The synthesis results of the DMAC versions are shown in TABLE I. The maximum frequency of the ReDMAC is about 625 MHz, the lowest among the DMAC designs. This decrease in frequency is due to the delays introduced by the routing blocks. However, the ReDMAC supports all four functions with an implementation cost (cell area) of just 1407 µm², which is about three times smaller than that of the Function-Select DMAC. Note also that the ReDMAC's implementation cost is only slightly higher than that of the Sorting DMAC, the most complex single-function DMAC.

TABLE I. Synthesis Results of Different DMAC Designs.

| Design | Cell Area (µm²) | Fmax (MHz) |
|---|---|---|
| Basic DMAC | 944.3 | 1428.57 |
| Sorting DMAC | 1309.78 | 1219.51 |
| Transposing DMAC | 1070.92 | 675.68 |
| Fusing DMAC | 1118.26 | 729.93 |
| Function-Select DMAC | 4526.51 | 675.68 |
| Reconfigurable DMAC | 1407.14 | 625.1 |
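The area comparison can be checked directly from the figures in TABLE I. The short Python calculation below uses only the reported cell areas; the variable names are illustrative.

```python
# Cell areas in square micrometres, copied from TABLE I.
AREA = {
    "basic": 944.3,
    "sorting": 1309.78,
    "transposing": 1070.92,
    "fusing": 1118.26,
    "function_select": 4526.51,
    "redmac": 1407.14,
}

single_function_total = sum(
    AREA[name] for name in ("basic", "sorting", "transposing", "fusing")
)
ratio_vs_function_select = AREA["function_select"] / AREA["redmac"]
ratio_vs_single_total = single_function_total / AREA["redmac"]
overhead_vs_sorting = AREA["redmac"] / AREA["sorting"] - 1.0

if __name__ == "__main__":
    print(f"sum of single-function cores: {single_function_total:.2f}")
    print(f"Function-Select / ReDMAC:     {ratio_vs_function_select:.2f}")
    print(f"single-function sum / ReDMAC: {ratio_vs_single_total:.2f}")
    print(f"ReDMAC overhead vs Sorting:   {overhead_vs_sorting:.1%}")
```

The ratios come out at roughly 3.2 against the Function-Select DMAC and 3.2 against the sum of the four single-function cores, while the ReDMAC is about 7% larger than the Sorting DMAC alone.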

B.   Simulation Results

The proposed reconfigurable DMAC is evaluated in terms of performance, flexibility, and configuration overhead using an HDL simulator. To do that, an evaluation testbench platform, shown in Fig. 7, has been built around the RTL model of the ReDMAC.

Fig. 8 shows the simulation result of the ReDMAC in the ModelSim simulator. Each DMA session includes three phases: (1) initializing: the CPU writes a DMA command to the ReDMAC and starts a DMA session by asserting the signal dreq = ‘1’; (2) handshaking and configuring: the ReDMAC handshakes with the CPU to become the bus master and configures the DMAC core at the same time; (3) DMA processing: the DMAC transfers data between system memory and accelerator memory. The waveform in Fig. 8 illustrates this operation. At 2845 ns, the CPU writes the first DMA command into the DMAC. After detecting that the signal dreq transitions from ‘0’ to ‘1’, the ReDMAC performs the handshaking protocol to get access to the AHB master bus and confirms that the DMA session has started by asserting the signal Dack = ‘1’. At 3315 ns, right after Dack = ‘1’, the CPU can start the initializing phase of the next DMA session by writing a new DMA command to the control register file of the ReDMAC. The simulation results show that the ReDMAC design allows the initialization of the next DMA session to be hidden under the DMA processing of the current session. In addition, it takes only one clock cycle to switch to the next configuration context.

Fig. 7. Simulation testbench.

TABLE II summarizes the execution time (in cycles) of the DMAC designs as a function of the size of the input data block. Here, the execution time is defined as the latency of the DMA process. The input data block is a 2D array of R×C bytes (R = 1 when verifying the basic DMA function and the sorting DMA function). The results in the table have been derived from the FSM flowchart of the DMAC core (Fig. 5) and verified by simulation with blocks of random data of many different sizes. Note that, besides depending on the size of the input data block, the execution time of the sorting function also depends on the content of the data. Therefore, the execution time shown in the table for the sorting function is the worst-case latency. As shown in TABLE II, the ReDMAC can be reconfigured flexibly to support all functions with a slight increase in execution time. This increase results from each function being built up from the reconfigurable fabrics instead of a dedicated architecture designed for that function.

Fig. 8. Simulation Result.

TABLE II. Execution time (cycles) of the DMAC designs.

| Function | Basic DMAC | Fusing DMAC | Transposing DMAC | Sorting DMAC | ReDMAC |
|---|---|---|---|---|---|
| Basic DMA | 7×C + 3 | - | - | - | 7×C + 8 |
| Fusion | - | [(7×C + 4)×R + 5]×2 | - | - | [(7×C + 5)×R + 3]×2 |
| Transposing | - | - | (7×C + 5)×R + 3 | - | (7×C + 5)×R + 3 |
| Sorting | - | - | - | (9×C² + 25×C − 26)/2 | (9×C² + 25×C − 26)/2 |
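The cycle-count expressions in the table are easy to evaluate for a given block size. The Python helpers below simply transcribe the tabulated formulas; the function names and the `reconfigurable` flag are illustrative and not part of the design.

```python
def basic_cycles(c: int, reconfigurable: bool = False) -> int:
    """Basic DMA: 7C + 3 on the dedicated core, 7C + 8 on the ReDMAC."""
    return 7 * c + (8 if reconfigurable else 3)


def fusion_cycles(r: int, c: int, reconfigurable: bool = False) -> int:
    """Fusion: [(7C + 4)R + 5] x 2 dedicated, [(7C + 5)R + 3] x 2 on the ReDMAC."""
    per_matrix = (7 * c + 5) * r + 3 if reconfigurable else (7 * c + 4) * r + 5
    return 2 * per_matrix


def transposing_cycles(r: int, c: int) -> int:
    """Transposing: (7C + 5)R + 3 on both the dedicated core and the ReDMAC."""
    return (7 * c + 5) * r + 3


def sorting_cycles_worst(c: int) -> int:
    """Sorting, worst case: (9C^2 + 25C - 26) / 2 on both designs."""
    # The numerator is even for every integer C, so integer division is exact.
    return (9 * c * c + 25 * c - 26) // 2


if __name__ == "__main__":
    print(basic_cycles(16), basic_cycles(16, reconfigurable=True))
    print(fusion_cycles(8, 8), fusion_cycles(8, 8, reconfigurable=True))
    print(transposing_cycles(8, 8), sorting_cycles_worst(16))
```

For a 16-byte block, the basic transfer takes 115 cycles on the dedicated core and 120 on the ReDMAC. For 8×8 blocks, fusion takes 970 versus 982 cycles, and transposing takes 491 cycles on either design.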

IV.    Conclusion

This paper presents the design of a reconfigurable multi-function DMAC for high-performance computing systems. In addition to the basic DMA function, the proposed ReDMAC supports three data transformation functions that are widely used in digital signal processing and multimedia processing. The ReDMAC also supports dynamic reconfiguration: its hardware fabrics can be reconfigured into different functions while the system is running. To reduce the time overhead caused by reconfiguration, a DMA session is partitioned into phases and implemented by a two-stage pipelined architecture. The proposed architecture has been modeled at RTL in VHDL, and then simulated and synthesized in order to validate its flexibility, cost, and performance. The experimental results show that the proposed design meets the required functionality, while the area of the controller is about three times smaller than the total area of the independent function cores. The proposed ReDMAC can be applied to reconfigurable high-performance SoCs.

References

  1. Kiem Hung Nguyen and Thi Minh Phan (2017) RTL Design of a Dynamically Reconfigurable Cell Array for Multimedia Processing. In Proceedings of the 4th NAFOSTED Conference on Information and Computer Science (NICS), 24-25 November 2017, Hanoi, Vietnam.
  2. Santarini, M. “Xilinx 16nm ultrascale+ devices yield 2-5X performance/watt advantage.” XCell Journal 90 (2015): 8-15.
  3. Mei, M. Berekovic and J.Y. Mignolet: “ADRES & DRESC: Architecture and Compiler for Coarse-Grain Reconfigurable Processors”, Fine- and Coarse-Grain Reconfigurable Computing, chapter 6, pp.255-297, 2007.
  4. Kiem Hung Nguyen, Peng Cao, Xuexiang Wang, Jun Yang and Longxing Shi (2013) Hardware Software Co-design of H.264 Baseline Encoder on Coarse-Grained Dynamically Reconfigurable Computing System-on-Chip. IEICE Transactions on Information and Systems, E96-D (3). pp. 601-615. ISSN 0916-8532.
  5. Dutt, A. Jantsch, S. Sarma, “Toward Smart Embedded Systems: A Self-aware System-on-Chip (SoC) Perspective” ACM TECS, Vol. 15, No. 2, Article 22, February 2016.
  6. João M. P. Cardoso, Pedro C. Diniz: “Compilation Techniques for Reconfigurable Architectures”, Springer, 2009.
  7. Shoa and S. Shirani, “Run-Time Reconfigurable Systems for Digital Signal Processing Applications: A Survey”, Journal of VLSI Signal Processing, Vol. 39, pp.213–235, 2005, Springer Science.
  8. Datasheet of Intel 8257 Programmable DMA Controller.
  9. Tehre, Vaishali, and Ravindra Kshirsagar. “Survey on coarse grained reconfigurable architectures.” International Journal of Computer Applications 48.16 (2012): 1-7.
  10. Lattice Semiconductor Corporation. Scatter-Gather Direct Memory Access Controller IP Core Users Guide. October 2010.
  11. Altera Corporation. Scatter-Gather DMA Controller Core, Quartus II 9.1. November 2009.
  12. Channelized Direct Memory Access and Scatter Gather. February 2010.
  13. Hussain, Tassadaq, et al. “PPMC: a programmable pattern based memory controller.” International Symposium on Applied Reconfigurable Computing. Springer, Berlin, Heidelberg, 2012.
  14. Nilsson, Emelie. “DMA Controller for LEON3 SoC:s Using AMBA.” (2013).
  15. João M. P. Cardoso, Pedro C. Diniz: “Compilation Techniques for Reconfigurable Architectures”, Springer, 2009.
  16. AMBA Specification (Rev 2.0). http://www.arm.com
  17. http://www.nangate.com/

Nguyễn Kiêm Hùng

Hung K. Nguyen studied Electronic Engineering for both his bachelor's and master's degrees at Vietnam National University, Hanoi, Vietnam. He received his bachelor's degree in 2003, after which he worked as an intern at the Research Center of Electronics and Telecommunications. In 2006, he received his master's degree in electronic engineering from VNU University of Engineering and Technology (VNU-UET). Before pursuing his Ph.D., he worked for two years as a researcher at the Laboratory for Smart Integrated Systems at VNU University of Engineering and Technology. In 2008, he went to Southeast University, Nanjing, China, where he received his Ph.D. degree in Microelectronics and Solid State Electronics in 2013. After receiving his Ph.D., he returned to VNU University of Engineering and Technology to continue his research in VLSI design. He currently works as an assistant professor and senior researcher at the VNU Key Laboratory for Smart Integrated Systems. His research interests mainly include multimedia processing, reconfigurable computing, and SoC design.