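Graduate student: 陳文遠
Chen, Wen-Yuan
Thesis title: Accelerating H.264 Decoder on a Multi-core Platform
在多核心平台上加速H.264解碼器
Advisor: 李政崑
Lee, Jenq-Kuen
Oral defense committee: 許雅三
馬瑞良
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication: 2011
Academic year of graduation: 99 (ROC calendar)
Language: Chinese
Pages: 37
Chinese keywords: H.264, 嵌入式系統, 單一指令處理多筆資料, 數位訊號處理器, 多核心
English keywords: H.264, Embedded System, SIMD, DSP, Multi-core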
  • The growing demand for high performance in embedded applications poses challenges for embedded systems. A natural way to tackle this problem is to use multi-core systems.
    In this work, we use an H.264 decoder as a case study to show how the problem can be tackled with embedded multi-core systems. The target platform is a domestically developed System-on-Chip (SoC) called PACDUO, which has one ARM MPU and two PACDSPs. The Android platform has also been ported to it. In earlier work, we developed a set of intrinsics for the PACDSPs that lets users write more efficient code; in other words, these intrinsics let us exploit SIMD to accelerate our programs. Even with SIMD, we still have to dispatch the functions we want to speed up to the PACDSPs. These functions should be independent with respect to their input and output data.
    The data must also be staged in external memory, because the local memory of the PACDSPs may be insufficient.
    To improve the performance of the H.264 decoder, the following should be taken into consideration: first, Remote Procedure Call (RPC) overhead rises as the PACDSPs are woken up more often; second, data movement plays an important role in performance because the PACDSPs have little local memory; last but not least, the complicated data dependencies in the H.264 decoding process hinder parallelization.
    To accelerate the H.264 decoder, we propose a method that combines thread-level, data-level, and function-level parallelism. Creating two threads, one for the decoding procedure and one for the rendering procedure, exploits thread-level parallelism. Then, in the decoding procedure, we distribute independent data to be processed on the PACDSPs to exploit data-level parallelism. Lastly, we partition the functions in the rendering procedure onto the PACDSPs to take advantage of function-level parallelism.
    In the experiments, we report the frame rate of each combination of techniques on the target platform and discuss their performance. The one-ARM version reaches 10.93 fps, while our final combination reaches 14.26 fps. Furthermore, looking only at the gain from the C compiler with intrinsic functions, the speedup is about 3.56x over the one-ARM baseline, which is compiled with arm-gcc.
    Besides, this work delivers a high-performance application written in C rather than assembly language. Previously, only assembly versions of H.264 decoder kernels existed on PACDUO. Programs written in assembly language typically perform best, while performance degrades in C. Moreover, some applications written in C perform worse on this multi-core platform, PACDUO. We use the H.264 decoder to show that an application written in C can still achieve good performance on PACDUO.
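The two-thread split between decoding and rendering described in the abstract can be illustrated with a small producer/consumer sketch. This is not the thesis's implementation: it assumes POSIX threads, and every name here (`frame_queue`, `pipeline_ctx`, `run_pipeline`, `QUEUE_CAP`) is hypothetical. A "frame" is reduced to an integer id, and the actual decoding, rendering, and PACDSP dispatch are left as placeholder comments.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define QUEUE_CAP 4  /* hypothetical bound on decoded frames awaiting rendering */

typedef struct {
    int frames[QUEUE_CAP];      /* ring buffer of frame ids */
    size_t head, tail, count;
    int done;                   /* set once the decoder has queued its last frame */
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} frame_queue;

typedef struct {
    frame_queue *q;
    int n_frames;    /* read by the decoder thread only */
    long rendered;   /* written by the renderer thread only */
    long checksum;   /* sum of rendered frame ids, written by the renderer only */
} pipeline_ctx;

/* Producer: "decodes" frames and hands them to the renderer. */
static void *decode_thread(void *arg) {
    pipeline_ctx *ctx = arg;
    frame_queue *q = ctx->q;
    for (int i = 0; i < ctx->n_frames; i++) {
        /* placeholder: decode frame i here */
        pthread_mutex_lock(&q->lock);
        while (q->count == QUEUE_CAP)
            pthread_cond_wait(&q->not_full, &q->lock);
        q->frames[q->tail] = i;
        q->tail = (q->tail + 1) % QUEUE_CAP;
        q->count++;
        pthread_cond_signal(&q->not_empty);
        pthread_mutex_unlock(&q->lock);
    }
    pthread_mutex_lock(&q->lock);
    q->done = 1;                        /* tell the renderer no more frames follow */
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
    return NULL;
}

/* Consumer: "renders" frames until the queue is drained and the decoder is done. */
static void *render_thread(void *arg) {
    pipeline_ctx *ctx = arg;
    frame_queue *q = ctx->q;
    for (;;) {
        pthread_mutex_lock(&q->lock);
        while (q->count == 0 && !q->done)
            pthread_cond_wait(&q->not_empty, &q->lock);
        if (q->count == 0) {            /* decoder finished and queue is empty */
            pthread_mutex_unlock(&q->lock);
            break;
        }
        int frame = q->frames[q->head];
        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
        pthread_cond_signal(&q->not_full);
        pthread_mutex_unlock(&q->lock);
        /* placeholder: render the frame here, outside the lock */
        ctx->rendered++;
        ctx->checksum += frame;
    }
    return NULL;
}

/* Runs both threads to completion; returns the number of frames rendered. */
long run_pipeline(int n_frames, long *checksum_out) {
    frame_queue q = {0};
    pthread_mutex_init(&q.lock, NULL);
    pthread_cond_init(&q.not_empty, NULL);
    pthread_cond_init(&q.not_full, NULL);

    pipeline_ctx ctx = { .q = &q, .n_frames = n_frames };
    pthread_t dec, ren;
    int rc = pthread_create(&dec, NULL, decode_thread, &ctx);
    assert(rc == 0);
    rc = pthread_create(&ren, NULL, render_thread, &ctx);
    assert(rc == 0);
    (void)rc;
    pthread_join(dec, NULL);
    pthread_join(ren, NULL);

    pthread_cond_destroy(&q.not_full);
    pthread_cond_destroy(&q.not_empty);
    pthread_mutex_destroy(&q.lock);
    if (checksum_out)
        *checksum_out = ctx.checksum;
    return ctx.rendered;
}
```

Calling `run_pipeline(10, &sum)` from a `main` function returns 10 and leaves `sum` at 45 (the sum of frame ids 0 through 9). The bounded queue is a design choice of this sketch: it keeps the decoder from running arbitrarily far ahead of the renderer, which matters when frame buffers are large and memory is limited.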


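    In the embedded systems domain, many applications now carry heavy computational workloads. Because a single processor in an embedded system delivers limited performance, most current approaches use multiple processors to handle these computations concurrently.
    In this thesis, we examine how an H.264 decoder performs on an embedded system with a multi-core architecture. The platform we use is PAC DUO, developed by ITRI (工研院), which has one ARM MPU and two PACDSP digital signal processors. In addition, we run the H.264 decoder on the Android platform to stay closer to current trends.
    Previously, our laboratory developed a compiler for the PACDSP. This compiler provides a set of intrinsics that lets users write quite efficient code. In other words, with these intrinsics users can easily achieve SIMD execution and accelerate the whole program. To improve H.264 performance on this platform, we take all of the following factors into account. First, remote procedure calls (RPC): each time the program calls a PACDSP from the MPU, an RPC is issued, so the more frequently the MPU and the PACDSPs communicate, the more time RPC takes and the more the whole program slows down. Second, because the PACDSP's own memory is quite limited, data transfer between the MPU and the PACDSPs is also critical. Third, data processing in the H.264 decoder itself involves many dependencies, which hinders parallelization of the whole program.
    We parallelize at the thread level, the data level, and the function level. At the thread level, we create two threads that run the decoding part and the rendering part respectively. At the data level, we distribute data evenly across the two PACDSPs to execute the rendering part. Finally, at the function level, we move some of the functions in the rendering part onto the PACDSPs to accelerate them.
    In the experiments, we measure the frame rate obtained under different combinations of acceleration techniques and discuss the results in depth. The version using only one MPU reaches 10.93 frames per second, while the final version combining our acceleration techniques reaches 14.26 frames per second. In addition, looking only at the speedup produced by the compiler's intrinsics, there is a 3.56x improvement.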
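As a quick check on the two frame rates reported above, the overall speedup of the final version over the one-MPU baseline follows directly from their ratio. The symbol S is introduced here only for illustration.

```latex
S = \frac{14.26\ \text{fps}}{10.93\ \text{fps}} \approx 1.30
```

That is roughly a 30% higher frame rate. The 3.56x figure for the intrinsics is stated separately in the abstract and cannot be derived from these two frame rates.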

    Abstract
    Contents
    List of Figures
    1 Introduction
      1.1 Introduction and Motivation
      1.2 Thesis Overview
    2 Background
      2.1 Target Multi-core Platform
      2.2 FFMpeg H.264 Decoding Process
        2.2.1 Introduction of FFMpeg
        2.2.2 The H.264 Decoder
    3 Acceleration of H.264 Decoder
      3.1 Data Dependencies within H.264 Decoding
        3.1.1 Comparison of data and functional partition
        3.1.2 Data dependencies
      3.2 Partition Scheme
        3.2.1 Motivation
        3.2.2 Implementation on the Multi-core Platform
      3.3 Accelerating H.264 Decoder with SIMD on Coprocessors
        3.3.1 Introduction to PACDSPs' intrinsic functions
        3.3.2 Exploiting SIMD by using intrinsic functions
      3.4 Data Transferring Techniques in Use of DMA
    4 Experimental Results
      4.1 Experimental Results on the Three Accelerating Techniques
        4.1.1 Performance Discussion of Accelerating Techniques on Multi-core Platform
    5 Conclusion
    References


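    Full text release date: full text not authorized for public release (on-campus network)
    Full text release date: full text not authorized for public release (off-campus network)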
