Development of Audio-Visual Speech Recognition using Deep-Learning Technique

Authors

  • Chun Kit How, Faculty of Manufacturing and Mechatronics Engineering Technology, Universiti Malaysia Pahang, 26600 Pahang, Malaysia.
  • Ismail Mohd Khairuddin, Faculty of Manufacturing and Mechatronics Engineering Technology, Universiti Malaysia Pahang, 26600 Pahang, Malaysia.
  • Mohd Azraai Mohd Razman, Universiti Malaysia Pahang
  • Anwar P. P. Abdul Majeed, Faculty of Manufacturing and Mechatronics Engineering Technology, Universiti Malaysia Pahang, 26600 Pahang, Malaysia.
  • Wan Hasbullah Mohd Isa, Faculty of Manufacturing and Mechatronics Engineering Technology, Universiti Malaysia Pahang, 26600 Pahang, Malaysia.

DOI:

https://doi.org/10.15282/mekatronika.v4i1.8625

Keywords:

Audio-Visual, Speech Recognition, Deep-Learning, Emotion, Spectrogram

Abstract

Deep learning is an artificial intelligence (AI) technique that simulates human learning behavior. Audio-visual speech recognition is important for helping the listener truly understand the emotions behind spoken words. In this thesis, two deep learning models, a Convolutional Neural Network (CNN) and a Deep Neural Network (DNN), were developed to recognize the emotion in speech from the dataset. The PyTorch framework with the torchaudio library was used. Both models were given the same training, validation, testing, and augmented datasets. Training stopped when the training loop reached ten epochs or when the validation loss did not improve for five epochs. On the training dataset, the CNN model achieved a highest accuracy of 76.50% and a lowest loss of 0.006029, while the DNN model achieved 75.42% and 0.086643, respectively. Both models were evaluated using a confusion matrix. In conclusion, the CNN model performs better than the DNN model but needs improvement, as its accuracy on the testing dataset is low and its loss is high.
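
The stopping rule described in the abstract (at most ten epochs, or five epochs without an improvement in validation loss) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' code: the function name `train_with_early_stopping` and the `train_one_epoch` and `validate` callbacks are hypothetical placeholders for the actual PyTorch training and validation steps, and treating "does not improve" as "is not strictly lower than the best loss so far" is an assumption.

```python
from typing import Callable, List, Tuple

MAX_EPOCHS = 10  # from the abstract: stop once ten epochs have run
PATIENCE = 5     # from the abstract: stop after five epochs without improvement


def train_with_early_stopping(
    train_one_epoch: Callable[[int], None],  # hypothetical: runs one training epoch
    validate: Callable[[int], float],        # hypothetical: returns the validation loss
    max_epochs: int = MAX_EPOCHS,
    patience: int = PATIENCE,
) -> Tuple[int, float, List[float]]:
    """Train until max_epochs is reached or the validation loss stalls.

    Returns (epochs_run, best_validation_loss, validation_loss_history).
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    history: List[float] = []

    for epoch in range(max_epochs):
        train_one_epoch(epoch)
        val_loss = validate(epoch)
        history.append(val_loss)

        # Assumption: "improve" means strictly lower than the best loss so far.
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss has stalled for `patience` epochs

    return len(history), best_loss, history
```

In a real run, `train_one_epoch` would wrap the model's optimization loop and `validate` would compute the loss on the validation dataset; the same function could then be applied unchanged to both the CNN and the DNN so that they share one stopping criterion.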

Published

2022-06-27

How to Cite

[1]
C. K. How, I. Mohd Khairuddin, M. A. Mohd Razman, A. P. P. Abdul Majeed, and W. H. Mohd Isa, “Development of Audio-Visual Speech Recognition using Deep-Learning Technique”, MEKATRONIKA, vol. 4, no. 1, pp. 88–95, Jun. 2022.

Section

Original Article
