Assisted Design Optimization using High-Level Synthesis Flow | National Tsing Hua University Theses and Dissertations Repository


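Graduate student:	張瑋君 Wei-Chun Chang
Thesis title:	Assisted Design Optimization using High-Level Synthesis Flow (應用高階合成之設計效能改善輔助技術)
Advisor:	黃稚存 Huang, Chih Tsun
Committee members:	劉靖家 Liou, Jing Jia; 黃俊達 Huang, Juinn Dar
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication:	2016
Academic year of graduation:	105 (ROC calendar)
Language:	English
Number of pages:	42
Keywords (translated from Chinese):	High-level synthesis, memory, architecture exploration
Keywords (English):	High-level synthesis, Memory, Design Space Exploration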

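Current accelerator research relies mainly on register-transfer level (RTL)
flows to obtain accurate estimates of simulation time, power, and area.
Pre-RTL tools such as Aladdin can produce approximately accurate estimates
in very little time without generating RTL code. As designs grow more
complex, however, exploration is very time-consuming with either RTL or
pre-RTL tools.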
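In this thesis, we propose an assisted design flow that effectively reduces
the number of design points formed by combinations of loop unrolling and
memory partitioning. First, we use Aladdin to quickly find the loop
unrolling parameters without considering memory partitioning, and obtain
the dynamic data dependence graph (DDDG). Next, we analyze the DDDG to find
the memory partitioning parameters. The conventional data distribution
schemes are block, cyclic, and block-cyclic, but these can create a
performance bottleneck. We therefore also propose a memory partitioning
method that differs from the conventional ones: by analyzing the DDDG, it
distributes data evenly across the memory partitions. At the end of the
flow, we use a high-level synthesis tool to synthesize the C program under
exploration into RTL code and obtain accurate estimates.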
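The three conventional distribution schemes named above can be illustrated with a small model. The sketch below is not taken from the thesis: the function names, the single-port bank assumption (`ports=1`), and the two access patterns are hypothetical choices made only to show how the same partition count can behave very differently depending on which addresses an unrolled loop touches together.

```python
from collections import Counter

def block_bank(i, n, banks):
    """Block scheme: element i of an n-element array goes to one contiguous chunk."""
    chunk = -(-n // banks)  # ceiling division
    return i // chunk

def cyclic_bank(i, banks):
    """Cyclic scheme: consecutive elements rotate through the banks."""
    return i % banks

def block_cyclic_bank(i, banks, block):
    """Block-cyclic scheme: blocks of `block` elements rotate through the banks."""
    return (i // block) % banks

def access_cycles(bank_of, indices, unroll, ports=1):
    """Cycles needed when `unroll` accesses are issued together.

    Assumption (not from the source): each bank serves `ports` accesses per
    cycle, so a group costs as many cycles as its busiest bank requires.
    """
    total = 0
    for start in range(0, len(indices), unroll):
        group = indices[start:start + unroll]
        per_bank = Counter(bank_of(i) for i in group)
        busiest = max(per_bank.values())
        total += -(-busiest // ports)
    return total

# Hypothetical access patterns: 16 sequential reads, and 16 reads with stride 4.
sequential = list(range(16))
strided = list(range(0, 64, 4))

seq_block = access_cycles(lambda i: block_bank(i, 16, 4), sequential, 4)
seq_cyclic = access_cycles(lambda i: cyclic_bank(i, 4), sequential, 4)
seq_block_cyclic = access_cycles(lambda i: block_cyclic_bank(i, 4, 2), sequential, 4)
str_cyclic = access_cycles(lambda i: cyclic_bank(i, 4), strided, 4)
str_block_cyclic = access_cycles(lambda i: block_cyclic_bank(i, 4, 4), strided, 4)
```

In this toy model, cyclic partitioning serves the sequential pattern without conflicts but serializes the strided one, while a block-cyclic layout with a matching block size does the opposite. That sensitivity to the access pattern is the kind of bottleneck the paragraph above refers to.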
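Existing high-level synthesis tools such as Vivado HLS accept programs
written in C/C++/SystemC and produce different designs according to
different parameters. To be usable with these tools, however, code must be
written in a highly restricted style. We therefore provide three patches
that improve memory partitioning, loop unrolling, and input buffering.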
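Experimental data show that we can greatly reduce simulation time. Our
memory partitioning method also substantially improves performance while
using fewer partitions.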

Current research on accelerator design relies mainly on register-transfer level (RTL)
flows to obtain accurate timing, power, and area estimates. Pre-RTL tools
such as Aladdin [1] can also be used to obtain approximately accurate estimates without
generating RTL code. However, design exploration of large or complex designs remains
a time-consuming process with either RTL or pre-RTL tools.
In this thesis, we propose an assisted design flow that efficiently reduces the number of
search points in design exploration with a pre-synthesis tool, considering micro-architectural
factors such as loop unrolling and memory partitioning. First, we use Aladdin [1] to quickly
explore the unrolling factor without considering memory partitioning, and to generate dynamic
data dependence graphs (DDDGs). After an unrolling factor is chosen, the DDDG is analyzed
to explore the memory partitioning. Conventional memory partitioning schemes are
mainly block, cyclic, or block-cyclic. Memory partitioning strongly affects performance and
can become the performance bottleneck. In our flow, we propose a memory-remapping
methodology that improves the source code with a better placement of data in the memory
partitions, based on the DDDG. Finally, we use a high-level synthesis tool to generate RTL
code and obtain accelerator designs with their performance, area, and power figures.
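The abstract does not spell out the remapping algorithm, so the following is only a minimal sketch of the general idea of placing data according to observed accesses. It assumes that groups of addresses wanted in the same step have already been extracted (for example from a DDDG) and applies a simple greedy placement; the names `remap` and `step_cycles`, the single-port assumption, and the greedy heuristic itself are illustrative assumptions, not the thesis's method.

```python
from collections import defaultdict

def remap(groups, banks):
    """Greedily assign each address to a bank.

    `groups` lists the addresses wanted in the same step. Each address goes to
    the bank where it collides least with already placed addresses; ties are
    broken by the emptiest bank so the data stays evenly spread.
    """
    # weight[a][b]: number of steps in which a and b are both accessed
    weight = defaultdict(lambda: defaultdict(int))
    for g in groups:
        uniq = sorted(set(g))
        for a in uniq:
            for b in uniq:
                if a != b:
                    weight[a][b] += 1
    # Place the most conflict-prone addresses first.
    addrs = sorted({a for g in groups for a in g},
                   key=lambda a: (-sum(weight[a].values()), a))
    bank_of = {}
    load = [0] * banks
    for a in addrs:
        cost = [0] * banks
        for b, w in weight[a].items():
            if b in bank_of:
                cost[bank_of[b]] += w
        best = min(range(banks), key=lambda k: (cost[k], load[k], k))
        bank_of[a] = best
        load[best] += 1
    return bank_of

def step_cycles(groups, bank_of, ports=1):
    """Total cycles if each bank serves `ports` accesses per cycle (assumed)."""
    total = 0
    for g in groups:
        per_bank = defaultdict(int)
        for a in set(g):
            per_bank[bank_of[a]] += 1
        total += -(-max(per_bank.values()) // ports)
    return total

# Hypothetical pattern: column-wise reads of a 4x4 row-major array.
columns = [[c + 4 * r for r in range(4)] for c in range(4)]
cyclic = {a: a % 4 for a in range(16)}
remapped = remap(columns, 4)

cyclic_cycles = step_cycles(columns, cyclic)
remapped_cycles = step_cycles(columns, remapped)
```

For this column-wise pattern, plain cyclic partitioning places every element of a column in the same bank, whereas the greedy placement spreads each column over all four banks and keeps the bank loads equal. A real flow would also have to rewrite the array indexing in the source code to match the new placement.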
Existing high-level synthesis (HLS) tools, such as Vivado HLS, can generate different
architectures for an application by applying different user configurations to a
C/C++/SystemC description. However, the coding style they accept is quite restrictive.
Therefore, we provide three patch methods, which improve memory partitioning, loop
unrolling, and the input buffer of the high-level hardware description, respectively.
Experimental results show that we can dramatically reduce the simulation time. Our
memory-remapping methodology also improves the performance of the design with an
optimized number of BRAMs.
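One reason a staged flow can cut exploration time is that it evaluates far fewer design points than a full sweep. The sketch below illustrates that counting argument only: the candidate factors and both cost functions are made-up toy models (not measurements or formulas from the thesis), and `staged_search` is a hypothetical helper that picks the unrolling factor first and the partition factor second.

```python
import math

def staged_search(unroll_factors, partition_factors, cost_unroll_only, cost_full):
    """Choose the unroll factor first, then the partition factor.

    Returns (best_unroll, best_partition, number_of_evaluations).
    """
    evaluations = 0
    best_u, best_cost = None, math.inf
    for u in unroll_factors:
        evaluations += 1
        cost = cost_unroll_only(u)  # partitioning ignored in this stage
        if cost < best_cost:
            best_u, best_cost = u, cost
    best_p, best_cost = None, math.inf
    for p in partition_factors:
        evaluations += 1
        cost = cost_full(best_u, p)  # unroll factor is now fixed
        if cost < best_cost:
            best_p, best_cost = p, cost
    return best_u, best_p, evaluations

N = 64  # hypothetical loop trip count

def toy_cost_unroll_only(u):
    # Toy model: latency shrinks with unrolling, area penalty grows with it.
    return math.ceil(N / u) + 3 * u

def toy_cost_full(u, p):
    # Toy model: each unrolled iteration waits on ceil(u / p) memory rounds.
    return math.ceil(N / u) * math.ceil(u / p) + 3 * u + p

factors = [1, 2, 4, 8, 16]
u, p, staged = staged_search(factors, factors, toy_cost_unroll_only, toy_cost_full)
exhaustive = len(factors) * len(factors)
```

With five candidates per parameter, the staged search evaluates 10 points where an exhaustive sweep would evaluate 25, and the gap widens as more candidates are added. The trade-off is that a staged search may settle on a different point than the exhaustive optimum, since the first stage ignores memory effects.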

Introduction
Preliminary
1 Introduction to Aladdin
1.1 Framework
1.2 Dynamic Data Dependency Graphs
2 H.-T. Tsai’s Work
Proposed Methodology
1 Proposed Flow
1.1 Specification
1.2 Unroll Exploration
1.3 Partition Exploration
1.4 Data Remapping
1.5 New Data Remapping
1.6 Case Study - Sobel Gy
1.7 Code Modification
1.8 HLS & Select
Experiment
1 Experiment Setup
2 Experiment Result of Our Remapping Approach
3 Experiment Result of Our Proposed Flow
Conclusion and Future Work
1 Conclusion
2 Future Work

[1] Y. S. Shao, B. Reagen, G.-Y. Wei, and D. Brooks, “Aladdin: A Pre-RTL, Power-Performance
Accelerator Simulator Enabling Large Design Space Exploration of Customized
Architectures,” in ACM/IEEE International Symposium on Computer Architecture
(ISCA), pp. 97-108, Jun. 2014.
[2] J. Cong, W. Jiang, B. Liu, and Y. Zou, “Automatic Memory Partitioning and Scheduling
for Throughput and Power Optimization,” in ACM, 2011.
[3] Y. Wang, P. Zhang, X. Cheng, and J. Cong, “An Integrated and Automated Memory
Optimization Flow for FPGA Behavioral Synthesis,” in ASPDAC, 2012.
[4] P. Li, Y. Wang, P. Zhang, G. Luo, T. Wang, and J. Cong, “Memory Partitioning and
Scheduling Co-optimization in Behavioral Synthesis,” in ICCAD, 2012.
[5] Y. Wang, P. Li, P. Zhang, C. Zhang, and J. Cong, “Memory Partitioning for Multidimensional
Arrays in High-Level Synthesis,” in DAC, 2013.
[6] Y. Wang, P. Li, and J. Cong, “Theory and Algorithm for Generalized Memory Partitioning
in High-Level Synthesis,” in FPGA, 2014.
[7] P. Li, P. Zhang, L.-N. Pouchet, and J. Cong, “Resource-Aware Throughput Optimization
for High-Level Synthesis,” in FPGA, 2015.
[8] M. Li, P. Zhang, C. Zhu, H. Jia, X. Xie, J. Cong, and W. Gao, “High Efficiency VLSI
Implementation of an Edge-directed Video Up-scaler Using High Level Synthesis,” in
IEEE International Conference on Consumer Electronics (ICCE), 2015.
[9] C.-T. Huang and H.-T. Tsai, “Performance Optimization of Accelerators using C-based
High-Level Synthesis Flow.”
[10] B. Reagen, R. Adolf, Y. S. Shao, G.-Y. Wei, and D. Brooks,
“MachSuite: Benchmarks for Accelerator Design and Customized Architectures,” in
IEEE International Symposium on Workload Characterization (IISWC), 2014.
[11] B. Carrion Schafer and A. Mahapatra, “S2CBench: Synthesizable SystemC Benchmark
Suite for High-Level Synthesis,” in IEEE Embedded Systems Letters, vol. 6, no. 3,
2014.
[12] J. Cong, V. Sarkar, G. Reinman, and A. Bui, “Customizable Domain Specific Computing,”
in IEEE Design & Test of Computers, 2011.
[13] Available: http://accelerator.eecs.harvard.edu/isca14tutorial/isca2014-tutorial-cadbenchmarks.pdf

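Full-text release date: the full text is not authorized for public release (campus network)
Full-text release date: the full text is not authorized for public release (off-campus network)
Full-text release date: the full text is not authorized for public release (National Central Library: National Digital Library of Theses and Dissertations in Taiwan)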
