Recognition and Processing of Speech Signals Using Neural Networks
References
- K. Audhkhasi, B. Ramabhadran, G. Saon, M. Picheny, D. Nahamoo, Direct acoustics-to-word models for English conversational speech recognition, in Interspeech (2017), pp. 959–963
- A. Avila, J. Monteiro, D. O’Shaughnessy, T. Falk, Speech emotion recognition on mobile devices based on modulation spectral feature pooling and deep neural networks, in IEEE ISSPIT (2017)
- T. Bäckström, Speech Coding: With Code-Excited Linear Prediction (Springer, Berlin, 2017)
- L. Bai, P. Weber, P. Jancovic, M. Russell, Exploring how phone classification neural networks learn phonetic information by visualising and interpreting bottleneck features, in Interspeech (2018), pp. 1472–1476
- Y. Bengio, A.C. Courville, P. Vincent, Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2013)
Article Google Scholar - C. Bishop, Pattern Recognition and Machine Learning (Springer, Berlin, 2006)
MATH Google Scholar - C.-C. Chiu, et al., State-of-the-art speech recognition with sequence-to-sequence models, in ICASSP (2017)
- J.K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, Y. Bengio, Attention-based models for speech recognition, in NIPS (2015), pp. 1–16
- R. Collobert, C. Puhrsch, G. Synnaeve, Wav2Letter: an end-to-end ConvNet-based speech recognition system. arXiv:1609.03193 (2016)
- S.B. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. ASSP 28, 357–366 (1980)
- H. Erdogan, J.R. Hershey, S. Watanabe, J. Le Roux, Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks, in ICASSP (2015), pp. 708–712
- E. Fosler-Lussier, Y. He, P. Jyothi, R. Prabhavalkar, Conditional random fields in speech, audio, and language processing. Proc. IEEE 101, 1054–1075 (2013)
- P. Ghahremani, H. Hadian, H. Lv, D. Povey, S. Khudanpur, Acoustic modeling from frequency domain representations of speech, in Interspeech (2018), pp. 1596–1600
- I. Goodfellow, Y. Bengio, A. Courville, Deep Learning (MIT Press, Cambridge, 2016)
- A. Graves, A. Mohamed, G. Hinton, Speech recognition with deep recurrent neural networks, in ICASSP (2013), pp. 6645–6649
- A. Graves, S. Fernández, F. Gomez, J. Schmidhuber, Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, in International Conference on Machine Learning, Pittsburgh, PA (2006)
- G. Hinton, L. Deng, D. Yu, G.E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T.N. Sainath, B. Kingsbury, Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29, 82–97 (2012)
- X.D. Huang, A. Acero, H.-W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development (Prentice Hall, Englewood Cliffs, 2001)
- Y. Huang, A. Sethy, B. Ramabhadran, Fast neural network language model lookups at N-gram speed, in Interspeech (2017), pp. 274–278
- I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, Y. Bengio, Quantized neural networks: training neural networks with low precision weights and activations. J. Mach. Learn. Res. 18(187), 1–30 (2017)
- I.T. Jolliffe, Principal Component Analysis (Springer, Berlin, 2002)
- M. Jordan, E. Sudderth, M. Wainwright, A. Willsky, Major advances and emerging developments of graphical models. IEEE Signal Process. Mag. 27(6), 17–138 (2010)
- S. Kullback, R.A. Leibler, On information and sufficiency. Ann. Math. Stat. 22(1), 79–86 (1951)
- M. Kutner, J. Neter, C. Nachtsheim, W. Wasserman, Applied Linear Statistical Models (McGraw-Hill, New York, 2004)
- Y. LeCun, Y. Bengio, G. Hinton, Deep learning. Nature 521, 436–444 (2015)
- B. Li, et al., Acoustic modeling for Google Home, in Interspeech (2017), pp. 399–403
- W. Li, G. Cheng, F. Ge, P. Zhang, Y. Yan, Investigation on the combination of batch normalization and dropout in BLSTM-based acoustic modeling for ASR, in Interspeech (2018), pp. 2888–2892
- L. Lu, L. Kong, C. Dyer, N.A. Smith, S. Renals, Segmental recurrent neural networks for end-to-end speech recognition, in Interspeech (2016), pp. 385–389
- S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, Y. Bengio, SampleRNN: an unconditional end-to-end neural audio generation model. arXiv:1612.07837 (2016)
- S.K. Moore, IBM’s new do-it-all AI chip. IEEE Spectrum (August 2018), pp. 10–11
- K. Mustafa, I.C. Bruce, Robust formant tracking for continuous speech with speaker variability. IEEE Trans. Audio Speech Lang. Process. 14, 2 (2006)
- T. Nagamine, M.L. Seltzer, N. Mesgarani, Exploring how deep neural networks form phonemic categories, in Interspeech (2015), pp. 1912–1916
- T. Nagamine, M.L. Seltzer, N. Mesgarani, On the role of nonlinear transformations in deep neural network acoustic models, in Interspeech (2016), pp. 803–807
- T. Nagamine, N. Mesgarani, Understanding the representation and computation of multilayer perceptrons: a case study in speech recognition, in International Conference on Machine Learning, Sydney, Australia, PMLR 70 (2017)
- M. Nussbaum-Thom, J. Cui, B. Ramabhadran, V. Goel, Acoustic modeling using bidirectional gated recurrent convolutional units, in Interspeech (2016), pp. 390–394
- D. O’Shaughnessy, Speech Communications: Human and Machine (IEEE Press, New York, 2000)
- D. O’Shaughnessy, Automatic speech recognition: history, methods and challenges. Pattern Recogn. 41, 2965–2979 (2008)
- D. O’Shaughnessy, Interacting with computers by voice: automatic speech recognition and synthesis. Proc. IEEE 91, 1272–1305 (2003)
- W. Ping, K. Peng, A. Gibiansky, S.O. Arık, A. Kannan, S. Narang, Deep Voice 3: scaling text-to-speech with convolutional sequence learning, in ICLR (2018)
- R. Prabhavalkar, T.N. Sainath, B. Li, K. Rao, N. Jaitly, An analysis of “attention” in sequence-to-sequence models, in Interspeech (2017), pp. 3702–3706
- Y. Qian, M. Bi, T. Tan, K. Yu, Very deep convolutional neural networks for noise robust speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 24, 2263–2276 (2016)
- L. Rabiner, B.-H. Juang, Fundamentals of Speech Recognition (Prentice-Hall, Upper Saddle River, 1993)
- M. Ratajczak, S. Tschiatschek, F. Pernkopf, Frame and segment level recurrent neural networks for phone classification, in Interspeech (2017), pp. 1318–1322
- T.N. Sainath, B. Li, Modeling time-frequency patterns with LSTM vs. convolutional architectures for LVCSR tasks, in Interspeech (2016), pp. 813–817
- G. Saon, et al., English conversational telephone speech recognition by humans and machines, in Interspeech (2017), pp. 132–136
- J. Sotelo, S. Mehri, K. Kumar, J.F. Santos, K. Kastner, A. Courville, Y. Bengio, Char2Wav: end-to-end speech synthesis, in ICLR (2017)
- S. Sun, C.-F. Yeh, M. Ostendorf, M.-Y. Hwang, L. Xie, Training augmentation with adversarial examples for robust speech recognition, in Interspeech (2018), pp. 2404–2408
- W. Sun, F. Su, L. Wang, Improving deep neural networks with multi-layer maxout networks and a novel initialization method. Neurocomputing 278, 34–40 (2018)
- I. Sutskever, O. Vinyals, Q.V. Le, Sequence to sequence learning with neural networks, in Advances in Neural Information Processing Systems (NIPS) (2014), pp. 3104–3112
- L. ten Bosch, L. Boves, Information encoding by deep neural networks: what can we learn? in Interspeech (2018), pp. 1457–1461
- A. Tjandra, S. Sakti, S. Nakamura, Sequence-to-sequence ASR optimization via reinforcement learning, in ICASSP (2018)
- A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, K. Kavukcuoglu, WaveNet: a generative model for raw audio. arXiv:1609.03499 (2016)
- Y. Wang, et al., Tacotron: towards end-to-end speech synthesis, in Interspeech (2017), pp. 4006–4010
- W. Xiong, L. Wu, F. Alleva, J. Droppo, X. Huang, A. Stolcke, The Microsoft 2017 conversational speech recognition system, in ICASSP (2018)
- Y. Zhang, M. Pezeshki, P. Brakel, S. Zhang, C. Laurent, Y. Bengio, A. Courville, Towards end-to-end speech recognition with deep convolutional neural networks, in Interspeech (2016), pp. 410–414
- Z. Zhang, J. Geiger, J. Pohjalainen, A. Mousa, W. Jin, B. Schuller, Deep learning for environmentally robust speech recognition: an overview of recent developments. ACM Trans. Intell. Syst. Technol. 9(5), 49 (2018)