ISSN 0021-3454 (print version)
ISSN 2500-0381 (online version)

DOI 10.17586/0021-3454-2020-63-11-1027-1033

UDC 004.522

COMPARATIVE STUDY OF NEURAL NETWORK ARCHITECTURES FOR INTEGRATED SPEECH RECOGNITION SYSTEM

I. S. Kipyatkova
St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS), Saint Petersburg, 199178, Russian Federation; Senior Researcher


A. A. Karpov
St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), Saint Petersburg, 199178, Russian Federation; Professor, Head of Laboratory



Abstract. The problem of improving the architecture of an end-to-end (integral) neural network model for Russian speech recognition is discussed. The model under consideration combines an encoder-decoder model with an attention mechanism and a model based on connectionist temporal classification (CTC). The application of several neural network architectures within the end-to-end model is studied: Highway Networks, residual connections, and dense connections. In addition, the use of the Gumbel-softmax function in place of the softmax activation function during decoding is investigated. The models are first pretrained by transfer learning on English, a non-target language, and then trained on a small corpus of continuous Russian speech with a duration of 60 hours. The developed models demonstrate higher speech recognition accuracy than the baseline end-to-end model. Results of experiments on recognition of continuous Russian speech are presented: the best result is 10.8% incorrectly recognized characters (character error rate) and 29.1% incorrectly recognized words (word error rate).
Keywords: speech recognition, end-to-end models, highway networks, residual connection, dense connection, Russian speech
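
For readers unfamiliar with the building blocks named in the abstract, the following is a minimal sketch of the generic textbook forms of a residual connection, a highway layer, a dense connection, and Gumbel-softmax sampling. It is not the authors' implementation: the layer sizes, the ReLU transform, the temperature `tau`, and all function names are assumptions chosen for illustration, and the code uses plain Python lists rather than any deep learning framework.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def relu_layer(W, b, x):
    """Hypothetical transform F(x) = ReLU(Wx + b), used by every block below."""
    return [max(0.0, z + bi) for z, bi in zip(matvec(W, x), b)]


def residual_block(x, W, b):
    """Residual connection: y = x + F(x). F must preserve the dimension of x."""
    return [xi + hi for xi, hi in zip(x, relu_layer(W, b, x))]


def highway_block(x, W_h, b_h, W_t, b_t):
    """Highway layer: y = T(x) * H(x) + (1 - T(x)) * x, with a sigmoid gate T."""
    h = relu_layer(W_h, b_h, x)
    t = [sigmoid(z + bi) for z, bi in zip(matvec(W_t, x), b_t)]
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]


def dense_block(x, layers):
    """Dense connections: each layer receives the concatenation of the block
    input and the outputs of all earlier layers in the block."""
    features = list(x)
    for W, b in layers:
        features = features + relu_layer(W, b, features)
    return features


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Softmax over logits perturbed by Gumbel(0, 1) noise, at temperature tau.
    Lower tau pushes the output closer to a one-hot vector."""
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    m = max(noisy)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]
```

In this sketch the three connection types differ only in how the layer input is carried forward: the residual block adds it, the highway block mixes it through a learned gate, and the dense block concatenates it. When the highway gate bias `b_t` is strongly negative the layer passes its input through almost unchanged, which is the usual motivation for initializing that bias below zero.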
