Graduate Student: Chiang, Chen-Kuo (江振國)
Thesis Title: Video Encoding and Classification Based on Statistical Learning
Advisor: Lai, Shang-Hong (賴尚宏)
Committee Members: Wang, Jia-Shung; Chen, Chaur-Chin; Chen, Hwann-Tzong; Lai, Shang-Hong; Wang, Sheng-Jyh; Chen, Chu-Song; Liu, Tyng-Luh
Degree: Doctoral
Department: College of Electrical Engineering and Computer Science - Department of Computer Science
Year of Publication: 2011
Graduation Academic Year: 99
Language: English
Pages: 101
Keywords: Video Encoding, Video Classification, Support Vector Machine, Sparse Coding, Statistical Learning
The latest video coding standard from the Joint Video Team (JVT) significantly outperforms previous standards in coding bitrate and video quality because it adopts several new compression techniques. However, these new components also dramatically increase the computational complexity. In this thesis, we propose a general statistical learning approach to reduce the computational cost of a video encoder. The approach can easily be applied to many encoder components, such as inter-mode decision, multi-reference-frame motion estimation, and intra-mode prediction. First, representative features are selected by feature analysis on a number of training video sequences. The selected features are then used to train sub-classifiers for partial classification problems; after training, these sub-classifiers are integrated into a complete classifier. Finally, an off-line pre-classification step computes the classification result for every possible combination of the quantized features and stores the results in a lookup table. During run-time encoding, features are extracted and quantized, and the classification result is read directly from the lookup table, so the encoding time is significantly reduced. The proposed statistical-learning-based approach is applied to the three encoder components above to speed up the computation.
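The off-line pre-classification idea above can be sketched in a few lines of Python. This is a minimal, hypothetical illustration, not the thesis implementation: a nearest-centroid classifier stands in for the trained SVM sub-classifiers, and the two features, their labels, and the 8-level quantization are invented for the example. The point is only the pipeline shape: train off-line, pre-classify every quantized feature combination once, then make run-time decisions by table lookup.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 2))        # two encoder features in [0, 1)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)       # synthetic "skip vs. full search" labels

# Stand-in for the trained classifier: nearest class centroid.
centroids = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def classify(feat):
    """Assign the class whose centroid is nearest (stand-in classifier)."""
    return int(np.argmin(np.linalg.norm(centroids - feat, axis=1)))

LEVELS = 8                                      # quantize each feature into 8 bins

def quantize(feat):
    return np.clip((np.asarray(feat) * LEVELS).astype(int), 0, LEVELS - 1)

# Off-line: classify the center of every quantized bin once, store the result.
table = np.zeros((LEVELS, LEVELS), dtype=int)
for i, j in itertools.product(range(LEVELS), repeat=2):
    table[i, j] = classify(np.array([i + 0.5, j + 0.5]) / LEVELS)

# Run time: extract features, quantize, look up -- no classifier is evaluated.
qi, qj = quantize([0.9, 0.8])
decision = table[qi, qj]                        # 1 here: skip the full search
```

At run time the per-decision cost is two quantizations and one array index, independent of how expensive the original classifier was to evaluate.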
Video classification can be accomplished during encoding. This work proposes a novel component-level dictionary learning framework that exploits image group characteristics within sparse coding. Unlike previous methods, which select the dictionaries that best reconstruct the data, we present an energy minimization formulation that jointly optimizes the sparse dictionary and the component-level importance within one unified framework, yielding a discriminative representation for each image group. The importance, computed from histogram information, measures how well each feature component represents the image group property under the dictionary. The dictionaries are then updated iteratively to reduce the influence of unimportant components, refining the sparse representation for each image group.
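The alternation between sparse coding, histogram-based importance, and dictionary refinement can be sketched as follows. This is a toy numpy sketch of the idea, not the thesis algorithm: the thresholding coder, the random data, the 0.5/0.5 mixing of the importance update, and the gradient step size are all illustrative assumptions. It only shows the loop structure: code the group, count how often each atom is used (a usage histogram normalized to [0, 1]), bias future selection toward important atoms, and update the dictionary.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_atoms, n_signals, k = 8, 16, 50, 3
D = rng.normal(size=(dim, n_atoms))
D /= np.linalg.norm(D, axis=0)                  # unit-norm dictionary atoms
X = rng.normal(size=(dim, n_signals))           # one "image group" of signals
w = np.ones(n_atoms)                            # component importance weights

for _ in range(5):
    # Sparse coding by simple thresholding: keep the k atoms with the
    # largest importance-weighted correlation for each signal (the raw
    # correlation is kept as the coefficient value).
    corr = D.T @ X
    idx = np.argsort(-np.abs(corr) * w[:, None], axis=0)[:k]
    codes = np.zeros_like(corr)
    for s in range(n_signals):
        codes[idx[:, s], s] = corr[idx[:, s], s]
    # Usage histogram over the group -> importance in [0, 1]; rarely used
    # atoms lose influence on the next coding round.
    usage = (codes != 0).sum(axis=1).astype(float)
    w = 0.5 * w + 0.5 * usage / max(usage.max(), 1.0)
    # One gradient step on the reconstruction error, then renormalize.
    D += 0.01 * (X - D @ codes) @ codes.T
    D /= np.linalg.norm(D, axis=0)
```

After a few iterations the weight vector `w` concentrates on the atoms this group actually uses, which is the mechanism by which each image group ends up with its own specialized representation.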
We tested the proposed learning-based video encoding system and video categorization algorithm on publicly available video sequences. The experiments show that, over the entire encoding process, the proposed learning-based H.264 encoder is about 4 to 5 times faster than the EPZS algorithm included in the H.264 reference software, with only slight video quality degradation. In addition, the proposed component-importance learning algorithm is more accurate than previous methods in video classification experiments on a real video database.