Abstract: This post is part of an article series presenting the design of a Coarse-Grained Reconfigurable Architecture (CGRA) called MUSRA (Multimedia Specific Reconfigurable Architecture). MUSRA is designed to exploit the multi-level parallelism of computation-intensive loops in multimedia processing applications. To meet the high bandwidth demand of parallel processing arrays, the architecture exploits data locality, which reduces data access bandwidth and improves the efficiency of pipelined execution of kernel loops. MUSRA also supports dynamic reconfiguration: the hardware fabric can be reconfigured to perform different functions while the system is running. The architecture has been modeled at the Register Transfer Level (RTL) in VHDL. Several benchmark applications have been mapped onto MUSRA to validate its flexibility and performance across a wide range of multimedia processing applications. The proposed CGRA can be used as a reconfigurable hardware IP (Intellectual Property) core in high-performance reconfigurable Systems-on-Chip.
Top-level Architecture of MUSRA
The MUSRA is composed of a Reconfigurable Computing Array (RCA), input/output FIFOs, a Global Register File (GRF), data/context memory subsystems, and a Context Parser (Fig. 1). The data and context memory subsystems each consist of storage blocks and a DMA (Direct Memory Access) controller (DDMAC and CDMAC, respectively). The RCA is an array of 8×8 RCs (Reconfigurable Cells) that can be partially configured to implement computation-intensive tasks. The input and output FIFOs are the I/O buffers between the data memory and the RCA. Each RC can take its input data from the input FIFO and/or the GRF, and store its results to the output FIFO. Each FIFO is 512 bits wide and 8 rows deep, so it can load or store sixty-four bytes or thirty-two 16-bit words per cycle. In particular, the input FIFO can broadcast data to every RC that has been configured to receive data from it. This mechanism exploits data that is reused across several loop iterations. Two neighboring rows of RCs are interconnected by a crossbar switch, through which an RC can read the result of any RC in the row above it. The Context Parser decodes the configuration information read from the Context Memory and generates the control signals that make the RCA execute correctly and automatically.
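The FIFO arithmetic, the broadcast mechanism, and the row-to-row crossbar described above can be illustrated with a small behavioral sketch in Python. This is not the authors' VHDL model. The function names, the boolean "listen" grid, and the per-column select list are hypothetical stand-ins for the real context encoding, which the text does not specify.

```python
# Behavioral sketch only. All names and the configuration encoding are
# assumptions made for illustration, not part of the MUSRA design files.
FIFO_WIDTH_BITS = 512   # FIFO row width stated in the text
FIFO_DEPTH_ROWS = 8     # FIFO depth stated in the text
RCA_ROWS = 8            # the RCA is an 8x8 array of RCs
RCA_COLS = 8


def words_per_row(word_bits):
    """Return how many word_bits-wide words fit in one FIFO row."""
    return FIFO_WIDTH_BITS // word_bits


def broadcast(fifo_row, listens):
    """Deliver the same FIFO row to every RC flagged as a listener.

    listens is an RCA_ROWS x RCA_COLS grid of booleans (hypothetical
    configuration bits). Returns a dict mapping (row, col) to the data.
    """
    return {
        (r, c): fifo_row
        for r in range(RCA_ROWS)
        for c in range(RCA_COLS)
        if listens[r][c]
    }


def crossbar(prev_row_results, select):
    """Route results from the row above.

    The RC in column j reads prev_row_results[select[j]], so any RC can
    take the result of an arbitrary RC in the previous row.
    """
    return [prev_row_results[s] for s in select]
```

Under these assumptions, `words_per_row(8)` gives 64 and `words_per_row(16)` gives 32, matching the sixty-four bytes or thirty-two 16-bit words per cycle quoted above. A select list such as `[7, 7, 0, 1, 2, 3, 4, 5]` shows that two RCs may read the same source in the row above, which a fixed nearest-neighbor connection could not express.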