Graduate Student: 陳玉璇 (Chen, Yu-Hsuan)
Thesis Title: 基於實例查詢演算法之聲音檢索輔助標注工具 (An Assistant Annotation Tool for Audio Retrieval based on Query by Example)
Advisor: 劉奕汶 (Liu, Yi-Wen)
Oral Defense Committee: 陳宜欣 (Chen, Yi-Shin), 白明憲 (Bai, Ming-Sian), 蘇文鈺 (Su, Wen-Yu)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science (電機資訊學院 電機工程學系)
Publication Year: 2021
Academic Year of Graduation: 109
Language: English
Pages: 47
Chinese Keywords: 聲音檢索、實例查詢、聲音指紋、資料標註
Keywords: Audio Fingerprinting, Audio Retrieval, Query-by-example, Data Annotation
Data annotation is the process of labeling speech, image, text, and other data. It is essential in artificial intelligence and machine learning, chiefly for training statistical models to understand content and produce corresponding results. However, manual labeling costs time and labor, so building an assistant annotation system that lowers these costs would be very helpful. Focusing on audio annotation, a retrieval tool that can query and locate sound segments would greatly shorten labeling time. Among query by acoustic example and query by semantic example, the Audio Fingerprinting approach proposed by Shazam and Musiwave allows users to identify a song from a short recording captured in the environment. This thesis applies audio fingerprinting methods to an assistant annotation system and, through a series of experiments across various environments and comparisons with other methods, quantitatively analyzes the system's retrieval performance and noise robustness. This thesis also designs an interactive user interface for usability testing and feedback collection. Compared with a typical manual annotation interface, the system shortens users' labeling time by 35% without loss of annotation quality, although the current retrieval accuracy averages about 80% and could still be improved.
Data annotation is the process of labeling image, video, audio, and text data. It is critical in Artificial Intelligence (AI) and Machine Learning (ML) for training a statistical model to understand its input and react appropriately. However, manual labeling incurs substantial time and labor costs, so it is worthwhile to build an assistant annotation tool that reduces them. For audio data in particular, an audio retrieval tool can locate occurrences of a query and thus help label relevant segments quickly. Among previous work on content-based audio retrieval, query-by-acoustic-example (QBAE) and query-by-semantic-example (QBSE) are two classic approaches. Within QBAE, a well-known algorithm called Audio Fingerprinting (AF), proposed by Shazam [1] and Musiwave [2], allows users to search for a desired song with a short query recorded in the environment. In this thesis, we implemented AF methods to construct an assistant annotation system and conducted a set of experiments to validate its feasibility. The proposed system is called QBEAT (Query-by-Example Annotation Tool). Through quantitative analysis under different environments and a comparison with cross-correlation (a conventional method in audio retrieval), we assess the noise robustness and retrieval performance of QBEAT. In addition, an interactive user interface was built for usability testing and for gathering feedback from participants. Compared with manual annotation interfaces, the proposed system shortens labeling time without loss of labeling quality, although there is still room to improve the accuracy of audio retrieval.
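To make the two retrieval approaches named above more concrete, the following is a minimal Python sketch of Shazam-style audio fingerprinting in the spirit of Wang [1] (pick spectrogram peaks, hash pairs of nearby peaks, vote over time offsets) alongside a plain cross-correlation baseline. It is an illustrative sketch under assumed parameters, not the thesis's QBEAT implementation; all function names and parameter values (spectral_peaks, fan_out, max_dt, and so on) are assumptions chosen for readability.

```python
# Illustrative sketch of peak-pair audio fingerprinting (after Wang [1])
# and a cross-correlation baseline; names and parameters are assumptions,
# not the QBEAT implementation described in the thesis.
import numpy as np
from scipy import signal
from scipy.ndimage import maximum_filter

def spectral_peaks(audio, sr=16000, n_fft=1024, hop=512, neighborhood=20):
    """Return (time_bin, freq_bin) coordinates of local spectrogram maxima."""
    _, _, S = signal.stft(audio, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    mag = np.abs(S)
    # A bin is a peak if it equals the maximum over its local neighborhood
    # and exceeds a loose magnitude floor (suppresses near-silence).
    local_max = maximum_filter(mag, size=neighborhood) == mag
    coords = np.argwhere(local_max & (mag > mag.mean()))  # rows: (freq, time)
    return sorted((int(t), int(f)) for f, t in coords)    # sort by time

def hash_pairs(peaks, fan_out=5, max_dt=64):
    """Pair each anchor peak with a few later peaks; hash (f1, f2, dt)."""
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1 : i + 1 + fan_out]:
            dt = t2 - t1
            if 0 < dt <= max_dt:
                hashes.append((hash((f1, f2, dt)), t1))
    return hashes  # list of (hash value, anchor time) pairs

def fingerprint_match(query_hashes, db_hashes):
    """Count hash collisions per time offset; a sharp histogram peak at one
    offset indicates the query occurs at that position in the recording."""
    index = {}
    for h, t in db_hashes:
        index.setdefault(h, []).append(t)
    votes = {}
    for h, tq in query_hashes:
        for tdb in index.get(h, []):
            votes[tdb - tq] = votes.get(tdb - tq, 0) + 1
    return max(votes.items(), key=lambda kv: kv[1]) if votes else None

def xcorr_match(query, recording):
    """Conventional baseline: sample offset maximizing cross-correlation."""
    return int(np.argmax(signal.correlate(recording, query, mode="valid")))
```

In an annotation setting, the long recording would be fingerprinted once, and each user-selected example segment would be hashed and matched against it to propose candidate label positions; the height of the winning offset's vote count serves as a rough confidence score.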
[1] A. Wang et al., "An industrial strength audio search algorithm," in ISMIR, vol. 2003, pp. 7–13, 2003.
[2] J. Haitsma and T. Kalker, "A highly robust audio fingerprinting system," in ISMIR, vol. 2002, pp. 107–115, 2002.
[3] E. Vazquez-Fernandez and D. Gonzalez-Jimenez, "Face recognition for authentication on mobile devices," Image and Vision Computing, vol. 55, pp. 31–33, 2016.
[4] S. M. Kuo, S. Mitra, and W.-S. Gan, "Active noise control system for headphone applications," IEEE Transactions on Control Systems Technology, vol. 14, no. 2, pp. 331–335, 2006.
[5] P. van Hengel and J. Anemüller, "Audio event detection for in-home care," in Int. Conf. on Acoustics (NAG/DAGA), 2009.
[6] S. Adrián-Martínez, M. Bou-Cabo, I. Felis, C. D. Llorens, J. A. Martínez-Mora, M. Saldaña, and M. Ardid, "Acoustic signal detection through the cross-correlation method in experiments with different signal to noise ratio and reverberation conditions," in International Conference on Ad-Hoc Networks and Wireless, pp. 66–79, Springer, 2014.
[7] E. Wold, T. Blum, D. Keislar, and J. Wheaten, "Content-based classification, search, and retrieval of audio," IEEE MultiMedia, vol. 3, no. 3, pp. 27–36, 1996.
[8] C. Wan, M. Liu, and L. Wang, "Content-based sound retrieval for web application," in Asia-Pacific Conference on Web Intelligence, pp. 389–393, Springer, 2001.
[9] N. Borjian, "A survey on query-by-example based music information retrieval," International Journal of Computer Applications, vol. 158, no. 8, 2017.
[10] W.-H. Tsai, H.-M. Yu, H.-M. Wang, et al., "Query-by-example technique for retrieving cover versions of popular songs with similar melodies," in ISMIR, vol. 5, pp. 183–190, 2005.
[11] M. Helén and T. Virtanen, "A similarity measure for audio query by example based on perceptual coding and compression," in Proc. 10th Int. Conf. Digital Audio Effects (DAFx), 2007.
[12] A. Wang, "The Shazam music recognition service," Communications of the ACM, vol. 49, no. 8, pp. 44–48, 2006.
[13] M. Slaney, "Semantic-audio retrieval," in 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 4, pp. IV–4108, IEEE, 2002.
[14] L. Barrington, A. Chan, D. Turnbull, and G. Lanckriet, "Audio information retrieval using semantic similarity," in 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'07), vol. 2, pp. II–725, IEEE, 2007.
[15] A. Mesaros, T. Heittola, and K. Palomäki, "Query-by-example retrieval of sound events using an integrated similarity measure of content and label," in 2013 14th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), pp. 1–4, IEEE, 2013.
[16] T. Pedersen, S. Patwardhan, J. Michelizzi, et al., "WordNet::Similarity - measuring the relatedness of concepts," in AAAI, vol. 4, pp. 25–29, 2004.
[17] S. Adavanne, P. Pertilä, and T. Virtanen, "Sound event detection using spatial features and convolutional recurrent neural network," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 771–775, IEEE, 2017.
[18] S. Adavanne, A. Politis, and T. Virtanen, "Multichannel sound event detection using 3D convolutional neural networks for learning inter-channel features," in 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–7, IEEE, 2018.
[19] A. Kumar and B. Raj, "Audio event detection using weakly labeled data," in Proceedings of the 24th ACM International Conference on Multimedia, pp. 1038–1047, 2016.
[20] R. Chu, B. Niu, S. Yao, and J. Liu, "Peak-based Philips fingerprint robust to pitch-shift for audio identification," IEEE MultiMedia, vol. 28, no. 1, pp. 74–82, 2020.
[21] R. M. Haralick, S. R. Sternberg, and X. Zhuang, "Image analysis using mathematical morphology," IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 4, pp. 532–550, 1987.
[22] J. Liang, J. Piper, and J.-Y. Tang, "Erosion and dilation of binary images by arbitrary structuring elements using interval coding," Pattern Recognition Letters, vol. 9, no. 3, pp. 201–209, 1989.
[23] D. Eastlake and P. Jones, "US Secure Hash Algorithm 1 (SHA1)," RFC 3174, 2001.
[24] Q. Huang and J. Tang, "Age-related hearing loss or presbycusis," European Archives of Oto-Rhino-Laryngology, vol. 267, no. 8, pp. 1179–1191, 2010.
[25] J. P. Ogle and D. P. Ellis, "Fingerprinting to identify repeated sound events in long-duration personal audio recordings," in 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'07), vol. 1, pp. I–233, IEEE, 2007.
[26] B. Kim and B. Pardo, "A human-in-the-loop system for sound event detection and annotation," ACM Transactions on Interactive Intelligent Systems (TiiS), vol. 8, no. 2, pp. 1–23, 2018.
[27] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 776–780, IEEE, 2017.
[28] A. Mesaros, T. Heittola, and T. Virtanen, "TUT database for acoustic scene classification and sound event detection," in 2016 24th European Signal Processing Conference (EUSIPCO), pp. 1128–1132, IEEE, 2016.