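High Performance Transform Cores for the Applications of Video Compression | National Tsing Hua University Electronic Theses and Dissertations Repository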


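Graduate student:	陳元賀 Chen, Yuan-Ho
Thesis title:	應用於影像壓縮之高效能轉換核心 High Performance Transform Cores for the Applications of Video Compression
Advisor:	張慶元 Chang, Tsin-Yuan
Oral defense committee:	馮武雄, 謝明得, 陳竹一, 黃仲陵, 黃元豪
Degree:	Doctor
Department:	College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of publication:	2011
Academic year of graduation:	99 (ROC calendar)
Language:	English
Number of pages:	95
Keywords (Chinese):	discrete cosine transform, high performance, small area, high throughput rate
Keywords (English):	DCT, DA-based, space-time scheduling, multi-path DCT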

The discrete cosine transform (DCT) is widely used as the transform engine in image and video compression. Visual media have recently moved toward high-resolution specifications such as high-definition television (HDTV) and digital cinema, so a transform component with high accuracy and a high throughput rate is needed to meet future specifications. A low hardware cost is also required to reduce the manufacturing cost of the integrated circuit (IC). A high-performance video transform engine with high accuracy, small area, and a high throughput rate is therefore desirable for very-large-scale integration (VLSI) designs.

In this study, a high-throughput DCT (HT-DCT) core is proposed that draws on an odd-even decomposition, adder-based distributed arithmetic (DA) scheme and an error-compensated adder-tree (ECAT). Instead of the 12-bit DA-precision coefficient length commonly used in previous works, a 9-bit DA-precision coefficient length is chosen for the HT-DCT, which still meets the peak signal-to-noise ratio (PSNR) requirements. The proposed HT-DCT core achieves a throughput rate of 1 G-pels/s with a gate count of 22.2 K while meeting the PSNR requirements outlined in previous works.
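The idea of adder-based DA with a shortened coefficient length can be illustrated with a minimal Python sketch. It is not the dissertation's circuit: the function names, the sign-magnitude treatment of the coefficients, and the reading of "DA-precision" as the number of fractional bits are all assumptions made for illustration. The inner product of one DCT row is formed bit-plane by bit-plane over the quantized coefficients, so only additions and shifts are needed.

```python
import math

def quantize(c, frac_bits):
    """Round coefficient c to a signed integer with frac_bits fractional bits (assumed format)."""
    return round(c * (1 << frac_bits))

def da_inner_product(xs, coeffs, frac_bits):
    """Compute sum(c * x) using only adds and shifts over coefficient bit-planes.

    Hypothetical helper: coefficients are handled in sign-magnitude form,
    which is a simplification chosen for readability.
    """
    q = [quantize(c, frac_bits) for c in coeffs]
    nbits = max(abs(v).bit_length() for v in q)
    acc = 0
    for j in range(nbits):
        # Add up the inputs whose coefficient has bit j set (one adder-tree pass).
        plane = 0
        for x, qc in zip(xs, q):
            if (abs(qc) >> j) & 1:
                plane += x if qc >= 0 else -x
        acc += plane * (1 << j)  # weight the bit-plane by 2**j (a shift in hardware)
    return acc / (1 << frac_bits)

# Example input samples and one row of 8-point DCT coefficients (k = 3), both illustrative.
xs = [52, 55, 61, 66, 70, 61, 64, 73]
coeffs = [0.5 * math.cos((2 * n + 1) * 3 * math.pi / 16) for n in range(8)]
exact = sum(c * x for c, x in zip(coeffs, xs))
for bits in (9, 12):
    print(bits, abs(da_inner_product(xs, coeffs, bits) - exact))
```

Shorter coefficients mean fewer bit-planes and therefore fewer additions, at the price of a larger rounding error, which is the trade-off the paragraph above describes.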

A low-cost DCT (LC-DCT) core is also proposed. It uses a spatial and time scheduling strategy, called the space-time scheduling (STS) strategy, to support high image resolutions in real-time systems. The proposed STS comprises the choice of the DA-precision bit length, a hardware-sharing architecture that reduces the hardware cost, and a time scheduling strategy that arranges the computations of the different dimensions. The time scheduling strategy calculates the first-dimensional (1st-D) and second-dimensional (2nd-D) transformations simultaneously in a single one-dimensional (1-D) DCT core, reaching a hardware utilization of 100%. Measurement results show that the LC-DCT core has a latency of 84 clock cycles and a PSNR of 52 dB, and operates at 167 MHz with a gate count of 15.8 K.
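For readers unfamiliar with the PSNR figure quoted above, the following Python sketch shows the standard definition, PSNR = 10 log10(peak^2 / MSE). The abstract does not state the peak value or the test data behind the 52 dB result, so the default peak of 255 (8-bit samples) and the function name are assumptions.

```python
import math

def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length sample sequences.

    peak=255 assumes 8-bit samples; this is an illustrative default,
    not a value taken from the dissertation.
    """
    if len(reference) != len(test) or not reference:
        raise ValueError("sequences must be non-empty and of equal length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak * peak / mse)

# Example: every sample is off by one level.
print(psnr([10, 20, 30, 40], [11, 21, 31, 41]))
```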

Finally, a multi-path DCT (MP-DCT) core is proposed. It employs four computation paths to achieve a high throughput rate and is implemented with a single 1-D MP-DCT core and one transposed memory (TMEM) to reduce the area cost. The 1-D MP-DCT calculates the 1st-D and 2nd-D transformations simultaneously in four parallel streams, and the two-dimensional (2-D) MP-DCT is built from that single 1-D MP-DCT core and the TMEM, so the 2-D MP-DCT core achieves both a high throughput rate and a low area cost. Implementation results show that the proposed 2-D MP-DCT core achieves a throughput rate of 1 G-pels/s with a gate area of only 20 K.
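The reuse of one 1-D DCT routine together with a transpose step can be sketched in Python as a row-column decomposition. This is a sequential software model under stated assumptions: the orthonormal DCT-II scaling, the function names, and the list-based stand-in for the TMEM are illustrative, and the sketch does not model the four parallel paths or the simultaneous 1st-D and 2nd-D scheduling described above.

```python
import math

N = 8  # 8 x 8 block size

def dct_1d(v):
    """Orthonormal 8-point DCT-II of one row (direct form, for clarity rather than speed)."""
    out = []
    for k in range(N):
        scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        out.append(scale * sum(v[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                               for n in range(N)))
    return out

def transpose(block):
    """Swap rows and columns; stands in for the transposed memory in this sketch."""
    return [list(col) for col in zip(*block)]

def dct_2d(block):
    """2-D DCT obtained by applying the same 1-D routine twice with a transpose in between."""
    first = [dct_1d(row) for row in block]    # 1st-D: transform every row
    tmem = transpose(first)                   # columns become rows
    second = [dct_1d(row) for row in tmem]    # 2nd-D: same 1-D routine reused
    return transpose(second)                  # restore the original orientation

flat = [[1.0] * N for _ in range(N)]
print(dct_2d(flat)[0][0])  # DC term of a constant block
```

Because the 2-D transform is separable, a single 1-D core plus a transpose buffer is sufficient, which is why sharing the core saves area.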

To conclude, visual media are advancing rapidly, and this dissertation aims to keep pace with high-resolution specifications and to meet future needs as far as possible. Three circuits, the HT-DCT, LC-DCT, and MP-DCT cores, are therefore proposed to achieve high throughput rates and low cost in VLSI designs.

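Abstract (translated from Chinese)

The discrete cosine transform (DCT) is a computational component widely used in image and video compression. With the establishment of high-resolution video specifications, high accuracy and a high throughput rate will be required in the future. A small-area design is also highly desirable in order to reduce the cost of circuit design. A high-performance video transform circuit with high accuracy, small area, and a high throughput rate is therefore needed.
In this study, a high-throughput DCT (HT-DCT) circuit is proposed that uses adder-based distributed arithmetic (DA) and an error-compensated adder tree (ECAT). In the design, a 9-bit DA-precision coefficient length is chosen in place of the 13-bit DA-precision coefficient length used previously, which still satisfies the peak signal-to-noise ratio (PSNR) requirement. The proposed HT-DCT circuit thus achieves a throughput rate of 1 G-pels/s with a circuit area of 22 K.
In addition, a low-cost DCT (LC-DCT) circuit is proposed that adopts a space-time scheduling (STS) strategy. STS mainly comprises the choice of the DA-precision coefficient length, a hardware-sharing design, and a time scheduling strategy. The effective choice of the DA-precision coefficient length and the hardware-sharing design give the LC-DCT a small area and a low cost, while the proposed time scheduling strategy lets a single 1-D DCT circuit compute the first-dimensional and second-dimensional DCT operations simultaneously. Measurement results show that the LC-DCT circuit, with an area of 15.8 K and operating at 167 MHz, achieves high-accuracy computation with a PSNR of 52 dB.
Finally, this study integrates the features of the HT-DCT and the LC-DCT and proposes a multi-path DCT (MP-DCT) circuit. The circuit uses a single 1-D DCT circuit and one transposed memory to achieve a small-area design. The 1-D DCT uses a computation unit with four parallel computation paths and, like the 1-D DCT circuit in the LC-DCT, computes the first-dimensional and second-dimensional DCT transforms simultaneously, which yields a small-area, high-throughput design. The proposed 2-D MP-DCT circuit achieves a high throughput rate of 1 G-pels/s with a circuit area of 17.7 K.

In summary, in response to the rapid development of video media, this dissertation targets high-resolution specifications and aims to meet future needs. The proposed HT-DCT, LC-DCT, and MP-DCT circuits therefore achieve high-throughput, low-cost VLSI circuit designs.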

Introduction
1 Introduction for Video Compression
1.1 Foundation of Video Compression
1.2 Foundation of Transform Code
2 Motivation
3 Contribution of This Dissertation
4 Organization of This Dissertation
Preliminaries
1 Related Works for DCT
2 Algorithm of 8 × 8 Two-Dimensional Discrete Cosine Transform
3 Mathematical Derivation of Distributed Arithmetic
Proposed DA-based HT-DCT Core Design
1 Error-Compensated Adder-Tree Architecture
1.1 Proposed Error-Compensated Scheme
1.2 Performance Simulation for an Error-Compensated Circuit
1.3 Proposed ECAT Architecture
2 Analysis of the Coefficient Bits
3 DA-Based 1-D 8-point DCT Design
4 Proposed 8 × 8 2-D HT-DCT Architecture
5 Discussion and Comparisons
5.1 System Accuracy Test
5.2 Comparison with DA-based DCT
5.3 Comparison with Other 2-D DCT
5.4 Chip Implementation
5.5 FPGA Implementation
6 Brief Conclusion
Proposed LC-DCT Core With Space-Time Scheduling Strategy
1 Proposed 8 × 8 2-D LC-DCT Core Design
1.1 Coefficient Bits Choosing
1.2 Hardware Sharing Strategy
1.3 Time Scheduling Strategy
2 Discussion and Comparisons
2.1 System Accuracy Test
2.2 Comparison With Other DA-based 1-D DCT Architectures
2.3 Comparison With Other 2-D DCT Architectures
2.4 Chip Implementation
2.5 FPGA Implementation
3 Brief Conclusion
Proposed MP-DCT Core With Multiple Computation Streams
1 Proposed Multi-path Discrete Cosine Transform Core
2 Discussion and Comparisons
2.1 Comparison With Other 2-D DCT Architectures
2.2 Chip Implementation
2.3 FPGA Implementation
3 Brief Conclusion
Conclusions and Future Works
1 Conclusions of This Dissertation
2 Future works
Bibliography

                                


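Full-text release date: the full text is not authorized for public release (campus network)
Full-text release date: the full text is not authorized for public release (off-campus network)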
