Graduate student: | 陳家明 Chen, Jia-Ming |
---|---|
Thesis title: | Multimedia Programming for Mobile Handhelds 移動式手持裝置上的多媒體程式設計 |
Advisor: | 石維寬 Shih, Wei-Kuan |
Oral defense committee: | |
Degree: | Doctor |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science |
Year of publication: | 2010 |
Academic year of graduation: | 98 |
Language: | English |
Number of pages: | 130 |
Keywords (Chinese): | 平行架構核心 、PAC數位訊號處理器 、超長指令字 、H.264/AVC 、AAC 、異質多核心 、異質雙核心 、應用處理器 、Power-aware 、DVFS 、Imprecise computation 、Deferred optional task 、平行計算 、即時排程 |
Keywords (English): | Parallel Architecture Core, PACDSP, VLIW DSP, H.264/AVC, AAC, Heterogeneous multi-core, Heterogeneous dual-core, Application processor, Power-aware, DVFS, Imprecise computation, Deferred optional task, Parallel computing, Real-time scheduling |
This thesis investigates multimedia programming on heterogeneous embedded SoCs for mobile handhelds. The proposed design-space exploration methodologies address three major constraints on mobile handhelds: computational performance, power consumption, and real-time processing. The HW/SW co-design approach draws on experience ranging from the microarchitecture level to multimedia application development, gained during the early design stage of the PAC application processor. The major contributions are as follows.
First, high computational performance is achieved by exploiting parallelism in multimedia applications, from instruction-level and data-level parallelism (ILP and DLP) in the microarchitecture to task- or thread-level parallelism (TLP) in system-level processing. To this end, a multimedia application is decomposed into computational kernels and dynamic information. The computational kernels are heavily optimized with ILP- and DLP-oriented programming methodologies to exploit microarchitectural capabilities, such as the very long instruction word (VLIW) and single instruction, multiple data (SIMD) features of the PACDSP. The exploration results from the early design stage also strengthened the V3 ISA of the PACDSP; together with its distributed, ping-pong register organization, the PACDSP has been shown by BDTI certification to outperform several popular commercial microprocessors. Performance is raised further by applying TLP at the granularity of computational kernels, using the proposed parallel programming models for system-level processing. In particular, a novel parallel execution model, especially suitable for embedded multi-core SoCs, is proposed for the H.264/AVC decoder. It relaxes the macroblock dependencies in the spatial domain and thereby approaches the maximum degree of parallelism. To the best of our knowledge, it outperforms existing research results.
Next, a formal scheduling model named ICwDOT, extended from previous research on imprecise computation, is proposed to consider quality, energy, and real-time processing jointly. The ICwDOT model retains the minimized error of the earlier work while benefiting incoming tasks in schedulability, quality, fault tolerance, or even power savings. The model comprises two problems, DOTwP and DOTwoP, according to the preemption properties of the task system. DOTwoP is proven to be NP-hard. Two fast, low-complexity algorithms, SchedDOTwP and SchedDOTwoP, are devised to solve the respective problems. SchedDOTwP is optimal, while SchedDOTwoP offers flexibility in system-level processing. These results provide a simple and flexible way to address the notorious energy and real-time issues of multimedia applications on mobile handhelds.
Finally, two essential applications on mobile handhelds, a flash memory system and an H.264/AVC video player, are studied to tie the thesis together by combining the aforementioned methodologies. First, the online service requests of the flash memory system are transformed into the DOTwoP problem, so that dynamic power saving can be solved efficiently by the SchedDOTwoP algorithm. Second, a comprehensive design exploration for a power-aware video player is proposed. Static (leakage) power is saved through coarse-grained power management, which maps the user's behavior onto the power-state transitions provided by a dual- or multi-core SoC. A fine-grained power-aware video decoding framework additionally saves dynamic power whenever the user engages in continuous video decoding. This framework has three novel features. First, the power-aware video decoding flow works well with the aforementioned macroblock-level parallel programming models. Second, the required computation, that is, the worst-case execution cycles (WCEC) in real-time-system terms, can be formulated precisely from the well-optimized computational kernels of a coding standard such as H.264/AVC, which differs entirely from the prediction methods adopted in previous research. Third, through the aforementioned partitioning methodology, dynamic information such as coded block patterns (CBPs), motion vectors (MVs), and macroblock (MB) types precisely determines the varying WCEC at runtime. The power-aware real-time scheduling kernel can therefore be transformed as easily as in the power-aware flash memory system. We believe that the methodologies in this thesis can benefit the development of multimedia applications on embedded SoCs for mobile handhelds that must balance high performance, energy efficiency, and real-time constraints.
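As background for the macroblock dependencies mentioned above, the following minimal sketch shows the conventional 2D-wave ordering that is commonly used as a baseline for macroblock-level parallel H.264 decoding. It is not the execution model proposed in the thesis, whose details the abstract does not give. The sketch assumes each macroblock depends on its left, top-left, top, and top-right neighbors; the function name `wavefront_schedule` and the 5x3 frame size are illustrative choices.

```python
from collections import defaultdict


def wavefront_schedule(mb_cols, mb_rows):
    """Group macroblocks into waves under the conventional 2D-wave rule.

    MB (x, y) waits for its left, top-left, top and top-right neighbours,
    so the earliest step at which it can start is x + 2 * y.
    """
    waves = defaultdict(list)
    for y in range(mb_rows):
        for x in range(mb_cols):
            waves[x + 2 * y].append((x, y))
    # Return the waves in execution order; MBs inside one wave are independent.
    return [waves[step] for step in sorted(waves)]


# Illustrative frame of 5 x 3 macroblocks (hypothetical size).
schedule = wavefront_schedule(mb_cols=5, mb_rows=3)
steps = len(schedule)
peak_parallelism = max(len(wave) for wave in schedule)
print(steps, peak_parallelism)  # 9 steps, at most 3 MBs decoded concurrently
```

Under this baseline the peak parallelism is limited by the top-right dependency, which is why relaxing spatial dependencies, as the thesis proposes, can raise the achievable degree of parallelism.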
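To illustrate the imprecise computation idea that ICwDOT extends, here is a minimal sketch of the basic model only: each task has a mandatory part that must finish and an optional part whose unexecuted portion counts as error. It is not the ICwDOT model, nor SchedDOTwP or SchedDOTwoP, whose definitions the abstract does not give. The sketch assumes a single processor, all tasks released together with one common deadline, and a weighted-error objective; the task names, durations, and weights are hypothetical.

```python
def allocate_optional(tasks, deadline):
    """Distribute slack to optional parts so that weighted error is minimal.

    tasks: list of (name, mandatory_time, optional_time, weight).
    Returns (optional time executed per task, total weighted error).
    With one common deadline, serving the highest weights first is optimal.
    """
    mandatory_total = sum(mandatory for _, mandatory, _, _ in tasks)
    if mandatory_total > deadline:
        raise ValueError("mandatory parts alone miss the deadline")

    slack = deadline - mandatory_total
    executed = {}
    weighted_error = 0.0
    for name, _, optional, weight in sorted(tasks, key=lambda t: t[3], reverse=True):
        run = min(optional, slack)          # give this task as much slack as is left
        slack -= run
        executed[name] = run
        weighted_error += weight * (optional - run)  # unexecuted optional work is error
    return executed, weighted_error


# Hypothetical task set: (name, mandatory, optional, weight).
tasks = [("audio", 2, 1, 3.0), ("video", 4, 3, 1.0), ("ui", 1, 2, 2.0)]
executed, error = allocate_optional(tasks, deadline=10)
print(executed, error)  # {'audio': 1, 'ui': 2, 'video': 0} 3.0
```

Trimming optional work in this way trades output quality for schedulability, and the time it frees can also be spent on running more slowly to save energy, which is the kind of joint consideration the thesis formalizes.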
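The link between dynamic information, WCEC, and DVFS described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the framework from the thesis: the per-macroblock cycle costs, the cost per coded block, the frequency levels, and the function names `frame_wcec` and `pick_frequency` are all hypothetical. The sketch sums a cycle bound per macroblock from its type and number of coded blocks, then picks the lowest frequency that still meets the frame rate.

```python
# Hypothetical worst-case cycle costs per macroblock type (not measured values).
CYCLES_PER_MB = {"skip": 200, "inter": 1500, "intra": 2500}
# Hypothetical extra cycles for each coded block signalled by the CBP.
CYCLES_PER_CODED_BLOCK = 300
# Hypothetical discrete DVFS levels in Hz.
FREQ_LEVELS_HZ = [50_000_000, 100_000_000, 200_000_000]


def frame_wcec(macroblocks):
    """Sum a worst-case cycle bound over (mb_type, coded_block_count) pairs."""
    return sum(
        CYCLES_PER_MB[mb_type] + CYCLES_PER_CODED_BLOCK * coded_blocks
        for mb_type, coded_blocks in macroblocks
    )


def pick_frequency(wcec, fps, levels=FREQ_LEVELS_HZ):
    """Return the lowest frequency that finishes wcec cycles within 1/fps s."""
    required_hz = wcec * fps
    for freq in sorted(levels):
        if freq >= required_hz:
            return freq
    raise ValueError("frame cannot meet its deadline at any available frequency")


# Hypothetical 1350-macroblock frame: 900 inter, 300 skipped, 150 intra.
frame = [("inter", 2)] * 900 + [("skip", 0)] * 300 + [("intra", 6)] * 150
wcec = frame_wcec(frame)
print(wcec, pick_frequency(wcec, fps=30))  # 2595000 100000000
```

Because the bound is computed from information already parsed from the bitstream rather than predicted from past frames, the chosen frequency never undershoots the deadline as long as the per-kernel cycle costs are true worst cases.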
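This thesis explores multimedia programming on mobile handhelds built on heterogeneous multi-core embedded SoC architectures. The proposed design methods target three key problems that must be solved when exploring multimedia designs for mobile handhelds: computational performance, power consumption, and real-time processing. The methodologies and models take a hardware/software co-design view and were realized by taking part, from the early stages, in SoC development for the PAC application processor project at ITRI. These methods and experiences span from low-level microprocessor architecture design up to the design and development of multimedia applications. The main contributions are described below.
First, high-performance computing is achieved by expressing the parallelism inherent in multimedia programs as instruction-, data-, and task-level parallelism, mapped onto the underlying microprocessor architecture and onto system-level processing. The key is to decompose a multimedia application appropriately into computational units and dynamic information. On one hand, the programming methods proposed in this thesis efficiently convert the computational units into instruction- and data-parallel form, making full use of the features and capabilities of the microprocessor architecture, such as the VLIW and SIMD instructions provided by the PACDSP. The proposed methods can also strengthen and improve the design of the instruction set architecture. For the PACDSP microprocessor co-developed in this work, the instruction set was improved to version V3 in this way and, combined with its already well-optimized distributed, ping-pong register organization, the design was shown under BDTI certification to outperform several well-known commercial microprocessor architectures. On the other hand, with task-level parallelism, the parallel design methodology proposed in this thesis links these computational units together to reach high performance at the system level. For example, for the widely used H.264/AVC decoder, the proposed parallel model minimizes the extent to which spatial data dependencies limit parallelization, enabling high-performance parallel computation. Experimental results also show that the proposed method outperforms the methods reported in other current research literature.
To deal with power consumption and real-time processing, this thesis proposes a formal theoretical scheduling model called ICwDOT. The model extends earlier research on imprecise computation and brings quality, energy, and real-time properties into scheduling together. Besides preserving the minimum-error property of the earlier imprecise computation results, ICwDOT also accounts for the needs of upcoming service tasks, so that schedulability, quality, fault tolerance, and even power saving can be considered for the system as a whole. Depending on whether tasks can be preempted, the ICwDOT model divides into two problems: DOTwP and DOTwoP. This thesis proves that DOTwoP is NP-hard and proposes a fast, low-complexity algorithm for each problem: SchedDOTwP and SchedDOTwoP. SchedDOTwP is an optimal algorithm for DOTwP, while SchedDOTwoP is simple and flexible in handling DOTwoP, which makes it well suited to the difficult power and real-time problems of multimedia applications running on mobile handhelds.
Finally, the above methodologies and models are combined and applied to two typical applications on mobile handhelds, a flash memory system and an H.264/AVC multimedia player, to demonstrate their practicality. Once the read and write requests of the flash memory system are transformed into the DOTwoP problem, the proposed SchedDOTwoP algorithm can readily solve the problem of balancing power saving against real-time processing in the flash memory system. For the H.264/AVC multimedia player, the decomposition described above is used to exploit the dynamic information in the video signal together with the proposed adaptive framework for adjusting power and real-time computation; combined with the parallelization methodologies and models from the microprocessor level to the system level, this yields a high-performance, low-power H.264/AVC multimedia player design. The methodologies and models proposed in this thesis are expected to offer a valuable reference for developing multimedia applications on upcoming mobile handhelds based on multi-core embedded SoC architectures, where high performance, low power, and real-time operation must be met together.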