
Graduate student: Su, Yu-En (蘇于恩)
Thesis title: 發揮 GPT 及其衍生功能的力量:揭示兩種獨特的應用
Harnessing the Power of GPT and Its Derivative Functions: Unveiling Two Distinct Applications
Advisor: Chang, Cheng-Shang (張正尚)
Oral defense committee: Lee, Duan-Shin (李端興); Huang, Yu-Chih (黃昱智)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Institute of Communications Engineering
Year of publication: 2024
Academic year of graduation: 112 (ROC calendar)
Language: Chinese
Number of pages: 53
Chinese keywords: GPT, AI automatic diary generation, voice AI assistant
Foreign keywords: GPT, AI Automatic Diary Generation, Voice AI Assistant
    This thesis explores innovative applications of GPT and other artificial intelligence models in two distinct systems: an AI automatic diary generation system and a voice AI assistant. First, the AI automatic diary generation system integrates multiple data types, including location information, photos, and audio recordings, to generate personalized diaries. With the DALL·E 3 model, a matching image can also be generated for each diary entry, enhancing its readability and appeal and making diary writing considerably more convenient. Second, the voice AI assistant system combines Whisper (an automatic speech recognition system), a GPT assistant, and VITS (a text-to-speech model): it accepts spoken input and responds in the voice and personality of a specific individual, enabling simulated real-time voice conversations with that person. Specifically, it can imitate President Chung-Laung Liu's distinctive speaking style and carry on natural, real-time conversations. Together, these two systems demonstrate the effective application of AI to automating tasks and improving human-computer interaction.


    This thesis explores the innovative application of the Generative Pre-trained Transformer (GPT) and other AI models in two distinct systems: an AI automatic diary generation system and a voice AI assistant. Firstly, the AI automatic diary generation system integrates various data types, such as location information, photos, and audio recordings, to produce comprehensive and personalized diaries. By employing the DALL·E 3 model, it generates corresponding images for each diary, thereby enhancing readability and attractiveness. Secondly, the voice AI assistant system combines Whisper (an Automatic Speech Recognition system), a GPT assistant, and VITS (a Text-to-Speech model) to simulate the voice and personality of specific individuals. Specifically, it can emulate the unique speaking style of President Chung-Laung Liu, enabling natural and informative conversations. These two systems demonstrate the efficacy of AI in automating tasks and improving human-computer interaction.
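
    To make the diary pipeline described above concrete, the following minimal sketch chains a GPT chat completion and a DALL·E 3 image request through the OpenAI Python SDK. It is an illustration under stated assumptions, not the thesis's implementation: the model names (gpt-4o, dall-e-3), the generate_diary helper, and the sample inputs are hypothetical, and the data preprocessing of location, photo, and audio inputs is reduced to plain strings.

```python
# Minimal sketch of an AI diary-generation pipeline in the spirit of the
# system described above. Assumes the OpenAI Python SDK (pip install openai)
# and an OPENAI_API_KEY in the environment; the inputs below are hypothetical
# placeholders, not the thesis's actual preprocessing output.
from openai import OpenAI

client = OpenAI()

def generate_diary(places: list[str], photo_captions: list[str], transcript: str) -> tuple[str, str]:
    """Combine the day's data into one prompt, write a diary entry with GPT,
    then request a matching illustration from DALL·E 3. Returns (diary, image_url)."""
    prompt = (
        "Write a first-person diary entry based on the following information.\n"
        f"Places visited: {', '.join(places)}\n"
        f"Photo descriptions: {', '.join(photo_captions)}\n"
        f"Voice-memo transcript: {transcript}\n"
    )
    chat = client.chat.completions.create(
        model="gpt-4o",  # assumed model name; the thesis may use a different GPT version
        messages=[
            {"role": "system", "content": "You are a helpful diary-writing assistant."},
            {"role": "user", "content": prompt},
        ],
    )
    diary = chat.choices[0].message.content or ""

    image = client.images.generate(
        model="dall-e-3",
        prompt=f"An illustration for this diary entry: {diary[:500]}",
        size="1024x1024",
        n=1,
    )
    return diary, image.data[0].url

if __name__ == "__main__":
    text, url = generate_diary(
        places=["NTHU campus", "a coffee shop near the lab"],
        photo_captions=["group photo after the seminar"],
        transcript="Today we rehearsed the oral defense and discussed the demo.",
    )
    print(text)
    print("Illustration:", url)
```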
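
    Likewise, a rough sketch of the three-stage voice-assistant loop (speech recognition, response generation, speech synthesis) is shown below. For brevity it uses a plain chat completion in place of the GPT Assistant described in the thesis and leaves the VITS step as a stub, since the fine-tuned voice model and the exact system message are not reproduced here; the persona prompt, model names, and function names are illustrative assumptions.

```python
# Minimal sketch of the voice-assistant loop: Whisper for speech recognition,
# a GPT chat model as the assistant, and a VITS model for speech synthesis.
# Assumes the OpenAI Python SDK; synthesize_with_vits() is a hypothetical
# placeholder for the fine-tuned VITS inference code used in the thesis.
from openai import OpenAI

client = OpenAI()

PERSONA = (
    "You are a conversational assistant that answers in the warm, story-telling "
    "style of President Chung-Laung Liu. Reply in Traditional Chinese."
)

def synthesize_with_vits(text: str, out_path: str) -> str:
    """Placeholder: call a locally hosted, fine-tuned VITS model here."""
    raise NotImplementedError("Plug in the VITS-fast-fine-tuning inference code.")

def answer_voice_query(audio_path: str) -> str:
    # 1. Speech-to-text with Whisper.
    with open(audio_path, "rb") as f:
        question = client.audio.transcriptions.create(model="whisper-1", file=f).text

    # 2. Generate the reply with a GPT chat model acting under the persona prompt.
    reply = client.chat.completions.create(
        model="gpt-4o",  # assumed model name
        messages=[
            {"role": "system", "content": PERSONA},
            {"role": "user", "content": question},
        ],
    ).choices[0].message.content or ""

    # 3. Text-to-speech in the target voice; returns the path of the audio reply.
    return synthesize_with_vits(reply, out_path="reply.wav")
```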

    Contents
    List of Figures  6
    1 Introduction  7
    2 Tool Overview  12
    2.1 GPT (Generative Pre-trained Transformer)  12
    2.2 Whisper  13
    2.3 DALL·E 3  13
    2.4 GPTs  14
    2.5 GPT Assistant  14
    2.6 Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech (VITS)  15
    3 AI Automatic Diary Generation System  16
    3.1 System Architecture  17
    3.2 Data Preprocessing  18
    3.2.1 Location Information  18
    3.2.2 Photos  21
    3.2.3 Audio Files  23
    3.3 GPT  24
    3.4 DALL·E 3  26
    3.5 GPTs  28
    3.6 System Limitation  29
    4 Voice AI Assistant System  31
    4.1 System Architecture  32
    4.1.1 GPT Assistant  33
    4.1.2 VITS  37
    4.2 Implementation Detail  37
    4.2.1 System Message  37
    4.2.2 Response Time  44
    5 Conclusions  46

