
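Graduate student: Wei-Chun Chang (張瑋君)
Thesis title: Assisted Design Optimization using High-Level Synthesis Flow (應用高階合成之設計效能改善輔助技術)
Advisor: Huang, Chih Tsun (黃稚存)
Oral defense committee: Liou, Jing Jia (劉靖家); Huang, Juinn Dar (黃俊達)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Computer Science
Year of publication: 2016
Academic year of graduation: 105
Language: English
Number of pages: 42
Keywords (Chinese): 高階合成記憶體架構探索 (high-level synthesis, memory, architecture exploration)
Keywords (English): High-level synthesis, Memory, Design Space Exploration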
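  • Current accelerator research relies mainly on register-transfer level (RTL) flows to
    obtain accurate estimates of simulated execution time, energy consumption, and area.
    Pre-RTL tools such as Aladdin can produce approximately accurate estimates in very
    little time without generating RTL code. However, as designs grow more complex,
    exploration is very time-consuming with either RTL or pre-RTL tools.
    In this thesis, we propose an assisted design flow that effectively reduces the number
    of design points formed by combinations of loop unrolling and memory partitioning.
    First, we use Aladdin to quickly find the loop unrolling parameters without considering
    memory partitioning, and obtain the dynamic data dependence graph (DDDG). We then
    analyze the DDDG to find the memory partitioning parameters. The conventional data
    distribution schemes are mainly block, cyclic, and block-cyclic, but these can create
    performance bottlenecks. We therefore also propose a memory partitioning method that
    differs from the conventional ones: by analyzing the DDDG, it distributes data evenly
    across the memory partitions. At the end of the flow, we use a high-level synthesis
    tool to synthesize the C program under exploration into RTL code and obtain accurate
    estimates.
    Existing high-level synthesis tools, such as Vivado HLS, accept programs written in
    C/C++/SystemC and produce different designs according to different parameters.
    However, the coding style these tools accept is quite restricted. We therefore provide
    three patches that improve memory partitioning, loop unrolling, and input buffering.
    Experimental data show that we can greatly reduce simulation time. Our memory
    partitioning method also improves performance substantially while using fewer
    partitions.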


    Current research on accelerator design relies mainly on register-transfer level
    (RTL)-based flows to obtain accurate timing, power, and area estimates. Pre-RTL tools
    such as Aladdin [1] can also produce approximately accurate estimates without
    generating RTL code. However, design space exploration of large or complex designs
    remains time-consuming with either RTL or pre-RTL tools.
    In this thesis, we propose an assisted design flow that efficiently reduces the number
    of design points to search when exploring micro-architectural factors, such as loop
    unrolling and memory partitioning, with a pre-synthesis tool. First, we use Aladdin [1]
    to quickly explore the unrolling factor without considering memory partitioning and to
    generate the dynamic data dependence graph (DDDG). After an unrolling factor is chosen,
    the DDDG is analyzed to explore the memory partitioning. Conventional memory
    partitioning schemes are mainly block, cyclic, or block-cyclic. Memory partitioning
    strongly affects performance and can become its bottleneck. In our flow, we propose a
    memory-remapping methodology that modifies the source code to obtain a better data
    placement across memory partitions based on the DDDG. Finally, we use a high-level
    synthesis tool to generate RTL code and obtain accelerator designs together with their
    performance, area, and power.
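
The three conventional schemes can be made concrete with a small sketch. It is illustrative only: the function names, the 16-element array, the 4 banks, and the two access patterns are assumptions chosen for the example, not taken from the thesis. Each function maps an array index to a memory bank, and `max_same_bank` counts how many addresses touched by one unrolled loop body fall into the same bank. When that count exceeds the number of ports on a bank, the accesses have to be serialized.

```python
def block_bank(i, n, banks):
    """Block partitioning: consecutive elements share a bank."""
    size = -(-n // banks)  # ceil(n / banks) elements per bank
    return i // size


def cyclic_bank(i, banks):
    """Cyclic partitioning: consecutive elements go to consecutive banks."""
    return i % banks


def block_cyclic_bank(i, banks, block):
    """Block-cyclic partitioning: blocks of `block` elements rotate over banks."""
    return (i // block) % banks


def max_same_bank(addresses, bank_of):
    """Largest number of the given addresses that map to a single bank."""
    counts = {}
    for a in addresses:
        b = bank_of(a)
        counts[b] = counts.get(b, 0) + 1
    return max(counts.values())


N, BANKS = 16, 4                  # assumed array size and bank count
unit_stride = [0, 1, 2, 3]        # a[i] with the loop unrolled by 4
stride_four = [0, 4, 8, 12]       # a[4 * i] with the loop unrolled by 4

schemes = {
    "block": lambda i: block_bank(i, N, BANKS),
    "cyclic": lambda i: cyclic_bank(i, BANKS),
    "block-cyclic": lambda i: block_cyclic_bank(i, BANKS, 2),
}
for name, bank_of in schemes.items():
    print(name,
          max_same_bank(unit_stride, bank_of),
          max_same_bank(stride_four, bank_of))
```

In this toy setting, cyclic partitioning spreads the unit-stride accesses over all four banks but sends every stride-4 access to one bank, while block partitioning does the opposite. No single fixed scheme suits both patterns, which is the kind of bottleneck the abstract refers to.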
    Existing high-level synthesis (HLS) tools, such as Vivado HLS, can generate different
    architectures for an application from different user configurations applied to a
    C/C++/SystemC description. However, the coding style they accept is quite restricted.
    Therefore, we provide three patch methods, which improve memory partitioning, loop
    unrolling, and the input buffer of the high-level hardware description, respectively.
    Experimental results show that we can dramatically reduce the simulation time. Our
    memory-remapping methodology also improves the performance of the design with an
    optimized number of BRAMs.
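
To give a feel for what trace-driven data placement can look like, here is a minimal sketch. It is not the thesis's remapping algorithm, which this record does not describe. It assumes the DDDG analysis has already been reduced to groups of addresses that are accessed together; the helper names `greedy_remap` and `worst_conflict` and the example groups are hypothetical.

```python
def greedy_remap(groups, banks):
    """Assign each address to a bank, group by group.

    Within a group, an unassigned address goes to the bank that is
    currently least used by that group (ties go to the lowest bank id).
    This is a simple greedy heuristic, assumed here for illustration.
    """
    bank_of = {}
    for group in groups:
        used = {}
        # Count banks already taken by addresses placed in earlier groups.
        for a in group:
            if a in bank_of:
                used[bank_of[a]] = used.get(bank_of[a], 0) + 1
        # Place the remaining addresses on the least-used banks.
        for a in group:
            if a not in bank_of:
                b = min(range(banks), key=lambda k: (used.get(k, 0), k))
                bank_of[a] = b
                used[b] = used.get(b, 0) + 1
    return bank_of


def worst_conflict(groups, bank_of):
    """Largest number of same-bank addresses found in any single group."""
    worst = 0
    for group in groups:
        counts = {}
        for a in group:
            counts[bank_of[a]] = counts.get(bank_of[a], 0) + 1
        worst = max(worst, max(counts.values()))
    return worst


# Hypothetical access groups: one unit-stride and one stride-4 loop body.
groups = [[0, 1, 2, 3], [0, 4, 8, 12]]
mapping = greedy_remap(groups, banks=4)
print(mapping, worst_conflict(groups, mapping))
```

For these two groups the greedy placement leaves at most one address per bank in each group using four banks, whereas a plain cyclic mapping (`a % 4`) puts all four stride-4 addresses in the same bank. A real flow would also have to emit the index translation into the source code, which this sketch leaves out.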

    1 Introduction
    2 Preliminary
      2.1 Introduction to Aladdin
        2.1.1 Framework
        2.1.2 Dynamic Data Dependency Graphs
      2.2 H.-T. Tsai’s Work
    3 Proposed Methodology
      3.1 Proposed Flow
        3.1.1 Specification
        3.1.2 Unroll Exploration
        3.1.3 Partition Exploration
        3.1.4 Data Remapping
        3.1.5 New Data Remapping
        3.1.6 Case Study - Sobel Gy
        3.1.7 Code Modification
        3.1.8 HLS & Select
    4 Experiment
      4.1 Experiment Setup
      4.2 Experiment Result of Our Remapping Approach
      4.3 Experiment Result of Our Proposed Flow
    5 Conclusion and Future Work
      5.1 Conclusion
      5.2 Future Work

    [1] Y. S. Shao, B. Reagen, G.-Y. Wei, and D. Brooks, “Aladdin: A Pre-RTL, Power-Performance
    Accelerator Simulator Enabling Large Design Space Exploration of Customized
    Architectures,” in ACM/IEEE International Symposium on Computer Architecture
    (ISCA), pp. 97-108, Jun. 2014.
    [2] J. Cong, W. Jiang, B. Liu, and Y. Zou, “Automatic Memory Partitioning and Scheduling
    for Throughput and Power Optimization,” in ACM, 2011.
    [3] Y. Wang, P. Zhang, X. Cheng, and J. Cong, “An Integrated and Automated Memory
    Optimization Flow for FPGA Behavioral Synthesis,” in ASPDAC, 2012.
    [4] P. Li, Y. Wang, P. Zhang, G. Luo, T. Wang, and J. Cong, “Memory Partitioning and
    Scheduling Co-optimization in Behavioral Synthesis,” in ICCAD, 2012.
    [5] Y. Wang, P. Li, P. Zhang, C. Zhang, and J. Cong, “Memory Partitioning for Multidimensional
    Arrays in High-Level Synthesis,” in DAC, 2013.
    [6] Y. Wang, P. Li, and J. Cong, “Theory and Algorithm for Generalized Memory Partitioning
    in High-Level Synthesis,” in FPGA, 2014.
    [7] P. Li, P. Zhang, L.-N. Pouchet, and J. Cong, “Resource-Aware Throughput Optimization
    for High-Level Synthesis,” in FPGA, 2015.
    [8] M. Li, P. Zhang, C. Zhu, H. Jia, X. Xie, J. Cong, and W. Gao, “High Efficiency VLSI
    Implementation of an Edge-directed Video Up-scaler Using High Level Synthesis,” in
    IEEE International Conference on Consumer Electronics (ICCE), 2015.
    [9] C.-T. Huang and H.-T. Tsai, “Performance Optimization of Accelerators using C-based
    High-Level Synthesis Flow.”
    [10] B. Reagen, R. Adolf, Y. S. Shao, G.-Y. Wei, and D. Brooks,
    “MachSuite: Benchmarks for Accelerator Design and Customized Architectures,” in
    IEEE International Symposium on Workload Characterization (IISWC), 2014.
    [11] B. Carrion Schafer and A. Mahapatra, “S2CBench: Synthesizable SystemC Benchmark
    Suite for High-Level Synthesis,” in IEEE Embedded Systems Letters, vol. 6, no. 3,
    2014.
    [12] J. Cong, V. Sarkar, G. Reinman, and A. Bui, “Customizable Domain Specific Computing,”
    in IEEE Design & Test of Computers, 2011.
    [13] Available: http://accelerator.eecs.harvard.edu/isca14tutorial/isca2014-tutorial-cadbenchmarks.pdf

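    Full-text availability: the full text is not authorized for public release on the
    campus network, on off-campus networks, or through the National Central Library
    (Taiwan theses and dissertations system).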