
Author: 陳致生 (Chen, Edwardson)
Thesis Title: 基於內容的自動標記與喜好學習應用於音樂資訊檢索 (Content-based Automatic Annotation and Preference Learning for Music Information Retrieval)
Advisor: 張智星 (Jang, Jyh-Shing Roger)
Committee Members: 張智星 (Jang, Jyh-Shing Roger), 陳煥宗, 陳玲慧, 王逸如, 冀泰石
Degree: Doctoral
Department: College of Electrical Engineering and Computer Science, Department of Computer Science
Year of Publication: 2011
Graduation Academic Year: 99 (ROC calendar)
Language: Chinese
Pages: 83
Keywords: music annotation and retrieval, query by semantic description, content-based music recommendation, ordinal regression, collaborative filtering, kernel function, maximum a posteriori adaptation, universal background model
Music information retrieval has attracted growing attention over the past decade. In this thesis, we investigate two major ways people search for songs, artists, and albums: automatic annotation and preference-learning prediction. Compared with query-by-example methods, searching for music with semantic concept words is more natural. This kind of search, also known as query by semantic description, requires an accurate, automatic method to help people annotate audio files. To this end, we propose an automatic annotation system based on anti-word models. For each annotation word we build a corresponding anti-word set from the annotation words whose semantic meanings oppose it. By modeling each word together with its anti-words, our system annotates more accurately than the original system, and retrieval performance improves as well. Another way people discover music of interest is through recommendation systems. Commercial systems such as Amazon, TiVo, and Netflix adopt collaborative filtering to help users discover items of interest, but collaborative filtering suffers from the cold-start problem. Content-based methods alleviate this problem because they rely on features of the music itself rather than on users' past transactions. In the second part of this thesis, we propose a content-based artist recommendation system that predicts user preferences. We first build a universal background model (UBM) from all songs and then derive each artist's acoustic features through maximum a posteriori (MAP) adaptation. These acoustic features and users' preference scores are used to train ranking functions by ordinal regression. We propose an order preserving projection (OPP) algorithm whose performance is comparable to that of an existing ordinal regression method, PRank. The proposed OPP can further be kernelized to learn nonlinear ranking functions, and the kernel method also allows us to efficiently fuse acoustic features with symbolic features built from annotation words. Experimental results show that our system effectively predicts user preferences, and that performance improves further with nonlinear ranking functions or with the fusion of acoustic and symbolic features.


    Music information retrieval has received increasing attention over the past decade. Its goal is to find songs, artists, or albums that match users' interests. In this thesis, we focus on two major retrieval approaches: automatic annotation and preference-learning recommendation. Compared with query-by-example (QBE) techniques, searching audio files with a set of semantic concept words is a more natural way to describe music. Such an approach, called query by semantic description (QBSD), needs an accurate and automatic way to help people tag large numbers of audio files. To meet this demand, we propose an automatic annotation system that builds an anti-word set for each annotation word, based on the concept of supervised multi-class labeling (SML). More specifically, the words most strongly associated with the opposite semantic meaning of a word constitute its anti-word set. By modeling both a word and its anti-word set, our annotation system achieves higher mean per-word precision and recall than the original SML model. Moreover, by modeling the anti-words explicitly, retrieval performance is also significantly improved. Another major way people discover music is through recommendation. Commercial recommenders such as Amazon, TiVo, and Netflix adopt collaborative filtering (CF), which often suffers from the so-called cold-start problem. Content-based approaches can alleviate this problem because they rely on audio content instead of users' past transactions. In the second part of this thesis, we propose a content-based artist recommendation system that accurately predicts a user's tastes. In particular, each artist is characterized by an acoustic model adapted from a universal background model (UBM) through maximum a posteriori (MAP) adaptation. These acoustic features, together with users' preference rankings, are then used by an ordinal regression algorithm that learns a ranking rule to predict the rank of a new instance. Moreover, we propose an order preserving projection (OPP) algorithm and show that it performs comparably to an existing ordinal regression algorithm, PRank. The linear OPP can also be kernelized to capture potential nonlinear relationships between music content and users' artist rank orders. Introducing the kernel method also lets us efficiently fuse acoustic and symbolic features (i.e., annotation words) under the proposed framework. Experimental results show that the system successfully predicts users' tastes, and that performance improves further when the nonlinear version of OPP is used or when acoustic and symbolic features are fused.
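    As a minimal illustration of the anti-word idea above, the Python sketch below pairs each annotation word with a model of its anti-word set and scores a clip by how much better the word model explains it than the anti-word model does. The GMM training shown here is an assumption for illustration only (the thesis estimates its SML word models differently), and the helper names, data layout, and mixture count are hypothetical.

        # Hypothetical sketch: likelihood-ratio annotation with anti-word models.
        import numpy as np
        from sklearn.mixture import GaussianMixture

        def fit_word_models(clips, labels, vocab, anti_words, n_mix=8):
            """clips: list of (n_frames, n_dims) feature arrays (e.g., MFCC frames);
            labels: list of word sets per clip; anti_words: word -> set of opposites."""
            models = {}
            for w in vocab:
                pos = np.vstack([x for x, ws in zip(clips, labels) if w in ws])
                neg = np.vstack([x for x, ws in zip(clips, labels) if ws & anti_words[w]])
                models[w] = (GaussianMixture(n_mix).fit(pos),
                             GaussianMixture(n_mix).fit(neg))
            return models

        def annotate(clip, models, top_k=5):
            # Rank words by average log-likelihood ratio: word model vs. anti-word model.
            scores = {w: pos.score(clip) - neg.score(clip)
                      for w, (pos, neg) in models.items()}
            return sorted(scores, key=scores.get, reverse=True)[:top_k]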
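    The artist-specific acoustic features come from adapting a universal background model toward each artist's frames. Below is a hedged sketch of standard mean-only MAP adaptation; the relevance factor r and the "stacked means" supervector layout are common choices assumed here, not details confirmed by the record above.

        # Sketch of mean-only MAP adaptation of a GMM-UBM (assumed details noted above).
        import numpy as np
        from sklearn.mixture import GaussianMixture

        def train_ubm(all_frames, n_mix=64):
            # Universal background model fit on frames pooled from every song.
            return GaussianMixture(n_mix, covariance_type='diag').fit(all_frames)

        def map_adapt_means(ubm, artist_frames, r=16.0):
            gamma = ubm.predict_proba(artist_frames)          # (T, C) responsibilities
            n_c = gamma.sum(axis=0)                           # soft frame counts per mixture
            # Posterior-weighted mean of the artist's frames for each mixture
            e_c = (gamma.T @ artist_frames) / np.maximum(n_c, 1e-10)[:, None]
            alpha = (n_c / (n_c + r))[:, None]                # data-dependent adaptation weight
            adapted = alpha * e_c + (1.0 - alpha) * ubm.means_
            return adapted.ravel()                            # artist supervector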
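    For reference, the baseline that OPP is compared against, PRank, is an ordinal-regression perceptron: it keeps one weight vector plus ordered thresholds and updates both whenever a training score falls on the wrong side of a threshold. This compact sketch follows the published algorithm; the epoch count and data layout are illustrative.

        # Compact sketch of the PRank ordinal-regression perceptron.
        import numpy as np

        def prank_fit(X, y, n_ranks, epochs=10):
            """X: (n, d) features; y: integer ranks in {1, ..., n_ranks}."""
            w = np.zeros(X.shape[1])
            b = np.zeros(n_ranks - 1)                  # thresholds b_1 <= ... <= b_{k-1}
            for _ in range(epochs):
                for x, rank in zip(X, y):
                    s = w @ x
                    # z_r = +1 if the true rank lies above threshold r, else -1
                    z = np.where(rank > np.arange(1, n_ranks), 1.0, -1.0)
                    tau = np.where((s - b) * z <= 0, z, 0.0)   # violated thresholds
                    w += tau.sum() * x
                    b -= tau
            return w, b

        def prank_predict(X, w, b):
            # Predicted rank = 1 + number of thresholds the score exceeds.
            return 1 + ((X @ w)[:, None] > b).sum(axis=1)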

    Chapter 1. Introduction and Motivation  12
    Chapter 2. Automatic Annotation and Retrieval System  14
        2.1. Introduction and Related Work  14
            2.1.1. Proposed System  18
            2.1.2. Supervised Multi-class Labeling (SML) Model  19
        2.2. Anti-word Model System  21
            2.2.1. Anti-word Set Extraction  22
                1) Correlation-based Method  23
                2) Conditional Probability Method  24
                3) Latent Semantic Analysis Method  25
            2.2.2. Parameter Estimation  28
            2.2.3. Inference of Annotation and Retrieval Procedures  30
    Chapter 3. Experimental Results for Annotation and Retrieval System  33
        3.1. Model Evaluation for Annotation  34
        3.2. Model Evaluation for Retrieval  38
        3.3. Conclusions  43
    Chapter 4. Preference Learning System for Content-based Artist Recommender  45
        4.1. Introduction and Related Work  46
        4.2. The Proposed Recommender  51
            4.2.1. System Descriptions  51
            4.2.2. MAP Adaptation and Artist-Specific Features  53
        4.3. Order Preserving Projection (OPP)  57
        4.4. Kernelized Order Preserving Projection (KOPP)  59
    Chapter 5. Evaluations of Proposed Recommender  64
        5.1. Datasets and Feature Vectors  64
        5.2. Experimental Settings  66
        5.3. Testing the Performances with Acoustic and Symbolic Features Independently  67
        5.4. Testing the Performances with Both Feature Types  69
        5.5. Comparing with the Collaborative Filtering Algorithm  72
        5.6. Conclusions and Future Work  74


    Full-Text Availability: Not authorized for public release (campus network)
    Full-Text Availability: Not authorized for public release (off-campus network)
