| Field | Value |
|---|---|
| Graduate student | 孔繁傑 Kung, Fan-Jie |
| Thesis title | 利用視訊與聲訊雙重處理進行說話者位置偵測 (Detection of the Location of Talkers via Video and Audio Bimodal Processing) |
| Advisor | 劉奕汶 Liu, Yi-Wen |
| Committee members | 呂忠津 Lu, Chung-Chin; 李沛群 Li, Pei-Chun; 康仕仲 Kang, Shih-Chung |
| Degree | Master |
| Department | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
| Year of publication | 2013 |
| Academic year of graduation | 101 |
| Language | Chinese |
| Number of pages | 70 |
| Keywords (Chinese) | 到達時間差 (time difference of arrival), 聲源追蹤 (sound source tracking), 人臉偵測 (face detection), 人臉辨識 (face recognition), 聲訊 (audio), 視訊 (video) |
| Keywords (English) | TDOA, source detection, face detection, face recognition, audio, video |
In recent years, a growing body of research has combined audio and video for sound source localization, which reduces the error incurred when the source direction is estimated from audio alone in noisy and reverberant environments. This thesis performs sound source localization of a talker using two microphones and the webcam of a laptop. On the audio side, the direction of the source is estimated from the definition of a hyperbola. On the video side, faces are first detected with the face detection algorithm of Viola and Jones; each person is then recognized with the method of Turk and Pentland, which uses principal component analysis (PCA) to find each person's distinctive eigenfaces.
The system therefore works in two stages. First, the size of the detected face in the image gives an estimate of the perpendicular distance from the person to the webcam. Combining this distance with the source angle estimated from the hyperbola definition yields, when the source is a person, two-dimensional planar coordinates centered at the webcam (i.e., the midpoint between the two microphones). Besides detecting the talker's position and identifying the talker, the system also uses the source-angle information to help the video stage detect rotated faces. Moreover, under the assumption that the source signals are mutually uncorrelated, the number of potential sound sources in an indoor environment can be estimated from the number of faces detected in the video and from the cross-correlation function of the audio.
Experimental measurements show that the two-dimensional coordinate error of audio-visual localization of a talker does not exceed 5 cm. Furthermore, assuming the sources are mutually uncorrelated and the laptop webcam's field of view is restricted to -25° to 25°, two microphones suffice to detect two sources speaking simultaneously.
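The two-microphone angle estimation described above can be sketched in Python under a far-field assumption, where the asymptote of the constant-TDOA hyperbola gives sin(θ) = cτ/d with θ measured from broadside; the sampling rate, microphone spacing, and test signal below are illustrative, not the thesis's setup:

```python
import numpy as np

def tdoa_angle(sig_left, sig_right, fs, mic_distance, c=343.0):
    """Far-field direction estimate from a two-microphone pair.

    The TDOA tau is taken from the peak of the cross-correlation; the
    asymptote of the constant-TDOA hyperbola then gives
        sin(theta) = c * tau / mic_distance,
    with theta measured from the perpendicular bisector of the pair
    (positive toward the side of the leading, left microphone).
    """
    xcorr = np.correlate(sig_right, sig_left, mode="full")
    lag = np.argmax(xcorr) - (len(sig_left) - 1)  # lag > 0: left mic leads
    tau = lag / fs                                # TDOA in seconds
    # Clip to the physically valid range before taking the arcsine.
    arg = np.clip(c * tau / mic_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(arg))

# Example: white noise reaching the left microphone 2 samples early.
fs, d = 16000, 0.2
rng = np.random.default_rng(0)
s = rng.standard_normal(4096)
left, right = s[2:], s[:-2]          # left copy leads right by 2 samples
theta = tdoa_angle(left, right, fs, d)
```

With c·τ/d = 343 × (2/16000) / 0.2 ≈ 0.214, the estimate comes out near +12.4°.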
Much research has recently investigated source detection by combining audio and video. Joint audio-video methods reduce the bias of source detection in noisy and reverberant environments compared with audio alone. In this thesis, we design a system for talker detection using two microphones and a web camera. For audio, we use the definition of a hyperbolic surface to estimate the direction of sound sources relative to the microphones. For video, we use the Viola-Jones algorithm to detect faces; we then apply the Turk-Pentland approach, finding eigenfaces by principal component analysis (PCA) and using them to recognize each face.
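The eigenface recognition step can be sketched as follows; the toy vectors, dimensions, and nearest-neighbour rule are illustrative stand-ins for real cropped face images, not the thesis's training data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "faces": 10 flattened 64-dimensional training vectors per person,
# 3 people.  Real eigenface systems use aligned grayscale images.
n_people, n_train, dim = 3, 10, 64
centers = rng.standard_normal((n_people, dim)) * 5.0
train = np.vstack([centers[p] + rng.standard_normal((n_train, dim))
                   for p in range(n_people)])
labels = np.repeat(np.arange(n_people), n_train)

# Eigenfaces are the leading principal components of the
# mean-centered training set (rows of Vt from the SVD).
mean_face = train.mean(axis=0)
centered = train - mean_face
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
k = 5                              # number of eigenfaces kept
eigenfaces = Vt[:k]

def recognize(image):
    """Project onto the eigenface subspace and return the label of the
    nearest training face in coefficient space."""
    coeffs = eigenfaces @ (image - mean_face)
    train_coeffs = centered @ eigenfaces.T
    dists = np.linalg.norm(train_coeffs - coeffs, axis=1)
    return labels[np.argmin(dists)]

probe = centers[2] + rng.standard_normal(dim)  # noisy image of person 2
pred = recognize(probe)
```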
The location of a talking person is determined in two steps. First, we estimate the normal distance between the talker and the imaging plane of the camera from the size of the talker's face in the image. Then, an estimate of the two-dimensional location of the talker is obtained by also considering the angle of the talker relative to the camera (or the center of the two microphones). Because video and audio information are used jointly, the system can identify the talker, and face detection can be made robust against rotations thanks to the availability of audio information. In addition, when there are multiple talkers in the room, the number of sound sources can be estimated under the assumption that the sources are uncorrelated; this can be achieved either by counting the number of faces in the video or by computing the cross-correlation function between the signals obtained by the two microphones.
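The two-step fusion can be illustrated with a pinhole-camera model; the focal length and nominal face height below are assumed values for illustration, not the thesis's calibrated face-size-to-distance mapping:

```python
import math

def locate_talker(face_height_px, focal_length_px, real_face_height_m,
                  angle_deg):
    """Fuse the video distance estimate with the audio angle estimate.

    Pinhole model (an assumption): normal distance D = f * H / h, where
    f is the focal length in pixels, H a nominal real face height, and
    h the detected face height in pixels.  With the source angle theta
    from the microphone pair, the talker sits at
        (x, y) = (D * tan(theta), D)
    in camera-centered coordinates, y along the optical axis.
    """
    D = focal_length_px * real_face_height_m / face_height_px
    theta = math.radians(angle_deg)
    return D * math.tan(theta), D

# A 120-px face under these assumed parameters puts the talker 1.2 m
# from the camera, offset about 0.21 m toward the 10-degree side.
x, y = locate_talker(face_height_px=120, focal_length_px=600,
                     real_face_height_m=0.24, angle_deg=10.0)
```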
Experiments were conducted, and the results showed that the bias in estimating the location of a single talker is less than 5 cm. Experiments with two simultaneous talkers were also conducted, demonstrating that, in principle, two microphones suffice to detect two sources as long as they are uncorrelated.
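The two-source result rests on the cross-correlation of the microphone pair exhibiting one peak per uncorrelated source. A minimal sketch, with an illustrative threshold-based peak picker (the threshold and lag window are assumptions, not the thesis's parameters):

```python
import numpy as np

def count_sources(mic1, mic2, max_lag, rel_threshold=0.5):
    """Count uncorrelated sources from one microphone pair by counting
    distinct cross-correlation peaks within the physically possible
    lag range |lag| <= max_lag (in samples)."""
    n = len(mic1)
    xcorr = np.abs(np.correlate(mic1, mic2, mode="full"))
    lags = np.arange(-(n - 1), n)
    window = xcorr[np.abs(lags) <= max_lag]
    thr = rel_threshold * window.max()
    # Count strict local maxima that also exceed the threshold.
    return sum(1 for i in range(1, len(window) - 1)
               if window[i] >= thr
               and window[i] > window[i - 1]
               and window[i] > window[i + 1])

# Two uncorrelated white sources with different inter-microphone delays
# (3 and -5 samples) produce two distinct correlation peaks.
rng = np.random.default_rng(2)
s1 = rng.standard_normal(8000)
s2 = rng.standard_normal(8000)
mic1 = s1 + s2
mic2 = np.roll(s1, 3) + np.roll(s2, -5)
num = count_sources(mic1, mic2, max_lag=8)
```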
1. Pingali, G., "Integrated audio-visual processing for object localization and tracking," in Proceedings of SPIE, vol. 3310, 1997, pp. 206-213.
2. Lee, B.-G.; Choi, J.-S.; Kim, D.; Kim, M., "Sound source localization in reverberant environment using visual information," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2010, pp. 3542-3547.
3. Schmidt, R., "Multiple emitter location and signal parameter estimation," IEEE Transactions on Antennas and Propagation, 1986, 34(3), pp. 276-280.
4. Boyd, S.; Vandenberghe, L., Convex Optimization. New York: Cambridge University Press, 2004, Chap. 1-7.
5. Cekli, S., "Position detection with spherical interpolation least squares based on time difference of arrivals using separated acoustic signals by independent component analysis," in Signal Processing and Communications Applications Conference (SIU), 2012, pp. 1-4.
6. Fischell, D.; Coker, C.H., "A speech direction finder," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 9, 1984, pp. 128-131.
7. 張智星 (Jang, J.-S.), "語音訊號處理" (Speech Signal Processing), online course. Available from: http://www.cs.nthu.edu.tw/~jang.
8. Strobel, N.; Spors, S.; Rabenstein, R., "Joint audio-video object localization and tracking," IEEE Signal Processing Magazine, 2001, 18(1), pp. 22-31.
9. 胡文正, "The Influence of Position Uncertainty of an Underwater Horizontal Line Array on Sound Source Localization" (水下水平線陣列位置不確定性對聲源定位的影響), M.S. thesis, Institute of Undersea Technology, National Sun Yat-sen University, Kaohsiung, 2006.
10. 梁翰銘, "Tracking of Multiple Sound Sources in Cartesian Coordinates Using Particle Filters and a Microphone Array" (利用粒子濾波器與麥克風陣列進行直角坐標上多聲源之追蹤), M.S. thesis, Department of Electrical Engineering, National Tsing Hua University, Hsinchu, 2012.
11. Chapra, S.C., Applied Numerical Methods with MATLAB for Engineers and Scientists. New York: McGraw-Hill Science, 2006, Chap. 10-11.
12. Manolakis, D.G.; Ingle, V.K., Applied Digital Signal Processing. Singapore: Cambridge University Press, 2012, pp. 296-297.
13. Oppenheim, A.V.; Willsky, A.S.; Nawab, S.H., Signals & Systems. Pearson, 1997, pp. 284-288.
14. Qiang, C.; Laizhong, S., "The Lagrange interpolation polynomial algorithm error analysis," in International Conference on Computer Science and Service System (CSSS), 2011, pp. 3719-3722.
15. Viola, P.; Jones, M., "Robust real-time face detection," International Journal of Computer Vision, 2004, 57(2), pp. 137-154.
16. 莊函潔, "An Identity Verification System Combining Image and Audio" (結合影像與音訊之身份認證系統), M.S. thesis, Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung, 2008.
17. 嚴逸緯, "Combining Human Feature Detection with Geometric Analysis for Face Pose Estimation and Gesture Recognition" (結合人行特徵偵測與幾何分析應用於人臉姿態估測與手勢辨識), M.S. thesis, Department of Electrical Engineering, National Cheng Kung University, Tainan, 2009.
18. Jensen, O.H., "Implementing the Viola-Jones face detection algorithm," M.S. thesis, Technical University of Denmark, Kgs. Lyngby, 2008.
19. 鄭兆翔, "An Improved Face Detection System Achieving a Larger Detection Range" (達更大偵測範圍的改良人臉偵測系統), M.S. thesis, Department of Electrical Engineering, National Tsing Hua University, Hsinchu, 2007.
20. Freund, Y.; Schapire, R.E., "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, 1997, 55(1), pp. 119-139.
21. Turk, M.A.; Pentland, A.P., "Face recognition using eigenfaces," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '91), 1991, pp. 586-591.
22. Turk, M.A.; Pentland, A.P., "Eigenfaces for recognition," Journal of Cognitive Neuroscience, 1991, 3(1), pp. 71-86.
23. Pearson, K., "On lines and planes of closest fit to systems of points in space," Philosophical Magazine, 1901, 2, pp. 559-572.
24. Geiger, B.C.; Kubin, G., "Relative information loss in the PCA," in Proceedings of the IEEE Information Theory Workshop, 2012, pp. 562-566.