Recognition and Processing of Speech Signals Using Neural Networks
References
- K. Audhkhasi, B. Ramabhadran, G. Saon, M. Picheny, D. Nahamoo, Direct acoustics-to-word models for English conversational speech recognition, in Interspeech (2017), pp. 959–963
- A. Avila, J. Monteiro, D. O’Shaughnessy, T. Falk, Speech emotion recognition on mobile devices based on modulation spectral feature pooling and deep neural networks, in IEEE ISSPIT (2017)
- T. Bäckström, Speech Coding: With Code-Excited Linear Prediction (Springer, Berlin, 2017)
- L. Bai, P. Weber, P. Jancovic, M. Russell, Exploring how phone classification neural networks learn phonetic information by visualising and interpreting bottleneck features, in Interspeech (2018), pp. 1472–1476
- Y. Bengio, A.C. Courville, P. Vincent, Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2013)
Article Google Scholar - C. Bishop, Pattern Recognition and Machine Learning (Springer, Berlin, 2006)
MATH Google Scholar - C.-C. Chiu, et al., State-of-the-art speech recognition with sequence-to-sequence models, in ICASSP (2017)
- J.K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, Y. Bengio, Attention-based models for speech recognition, in NIPS (2015), pp. 1–16
- R. Collobert, C. Puhrsch, G. Synnaeve, Wav2Letter: an end-to-end ConvNet-based speech recognition system. arXiv:1609.03193 (2016)
- S.B. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. ASSP 28, 357–366 (1980)
- H. Erdogan, J.R. Hershey, S. Watanabe, J. Le Roux, Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks, in ICASSP (2015), pp. 708–712
- E. Fosler-Lussier, Y. He, P. Jyothi, R. Prabhavalkar, Conditional random fields in speech, audio, and language processing. Proc. IEEE 101, 1054–1075 (2013)
- P. Ghahremani, H. Hadian, H. Lv, D. Povey, S. Khudanpur, Acoustic modeling from frequency domain representations of speech, in Interspeech (2018), pp. 1596–1600
- I. Goodfellow, Y. Bengio, A. Courville, Deep Learning (MIT Press, Cambridge, 2016)
- A. Graves, A. Mohamed, G. Hinton, Speech recognition with deep recurrent neural networks, in ICASSP (2013), pp. 6645–6649
- A. Graves, S. Fernández, F. Gomez, J. Schmidhuber, Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, in International Conference on Machine Learning, Pittsburgh, PA (2006)
- G. Hinton, L. Deng, D. Yu, G.E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T.N. Sainath, B. Kingsbury, Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29, 82–97 (2012)
- X.D. Huang, A. Acero, H.-W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development (Prentice Hall, Englewood Cliffs, 2001)
- Y. Huang, A. Sethy, B. Ramabhadran, Fast neural network language model lookups at N-gram speed, in Interspeech (2017), pp. 274–278
- I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, Y. Bengio, Quantized neural networks: training neural networks with low precision weights and activations. J. Mach. Learn. Res. 18(187), 1–30 (2017)
- I.T. Jolliffe, Principal Component Analysis (Springer, Berlin, 2002)
- M. Jordan, E. Sudderth, M. Wainwright, A. Willsky, Major advances and emerging developments of graphical models. IEEE Signal Process. Mag. 27(6), 17–138 (2010)
- S. Kullback, R.A. Leibler, On information and sufficiency. Ann. Math. Stat. 22(1), 79–86 (1951)
- M. Kutner, J. Neter, C. Nachtsheim, W. Wasserman, Applied Linear Statistical Models (McGraw-Hill, New York, 2004)
- Y. LeCun, Y. Bengio, G. Hinton, Deep learning. Nature 521, 436–444 (2015)
- B. Li, et al., Acoustic modeling for Google Home, in Interspeech (2017), pp. 399–403
- W. Li, G. Cheng, F. Ge, P. Zhang, Y. Yan, Investigation on the combination of batch normalization and dropout in BLSTM-based acoustic modeling for ASR, in Interspeech (2018), pp. 2888–2892
- L. Lu, L. Kong, C. Dyer, N.A. Smith, S. Renals, Segmental recurrent neural networks for end-to-end speech recognition, in Interspeech (2016), pp. 385–389
- S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, Y. Bengio, SampleRNN: an unconditional end-to-end neural audio generation model. arXiv:1612.07837 (2016)
- S.K. Moore, IBM’s new do-it-all AI chip. IEEE Spectrum (August 2018), pp. 10–11
- K. Mustafa, I.C. Bruce, Robust formant tracking for continuous speech with speaker variability. IEEE Trans. Audio Speech Lang. Process. 14, 2 (2006)
- T. Nagamine, M.L. Seltzer, N. Mesgarani, Exploring how deep neural networks form phonemic categories, in Interspeech (2015), pp. 1912–1916
- T. Nagamine, M.L. Seltzer, N. Mesgarani, On the role of nonlinear transformations in deep neural network acoustic models, in Interspeech (2016), pp. 803–807
- T. Nagamine, N. Mesgarani, Understanding the representation and computation of multilayer perceptrons: a case study in speech recognition, in International Conference on Machine Learning, Sydney, Australia, PMLR 70 (2017)
- M. Nussbaum-Thom, J. Cui, B. Ramabhadran, V. Goel, Acoustic modeling using bidirectional gated recurrent convolutional units, in Interspeech (2016), pp. 390–394
- D. O’Shaughnessy, Speech Communications: Human and Machine (IEEE Press, New York, 2000)
- D. O’Shaughnessy, Automatic speech recognition: history, methods and challenges. Pattern Recogn. 41, 2965–2979 (2008)
- D. O’Shaughnessy, Interacting with computers by voice: automatic speech recognition and synthesis. Proc. IEEE 91, 1272–1305 (2003)
- W. Ping, K. Peng, A. Gibiansky, S.O. Arık, A. Kannan, S. Narang, Deep Voice 3: scaling text-to-speech with convolutional sequence learning, in ICLR (2018)
- R. Prabhavalkar, T.N. Sainath, B. Li, K. Rao, N. Jaitly, An analysis of “attention” in sequence-to-sequence models, in Interspeech (2017), pp. 3702–3706
- Y. Qian, M. Bi, T. Tan, K. Yu, Very deep convolutional neural networks for noise robust speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 24, 2263–2276 (2016)
- L. Rabiner, B.-H. Juang, Fundamentals of Speech Recognition (Prentice-Hall, Upper Saddle River, 1993)
- M. Ratajczak, S. Tschiatschek, F. Pernkopf, Frame and segment level recurrent neural networks for phone classification, in Interspeech (2017), pp. 1318–1322
- T.N. Sainath, B. Li, Modeling time-frequency patterns with LSTM vs. convolutional architectures for LVCSR tasks, in Interspeech (2016), pp. 813–817
- G. Saon, et al., English conversational telephone speech recognition by humans and machines, in Interspeech (2017), pp. 132–136
- J. Sotelo, S. Mehri, K. Kumar, J.F. Santos, K. Kastner, A. Courville, Y. Bengio, Char2Wav: end-to-end speech synthesis, in ICLR (2017)
- S. Sun, C.-F. Yeh, M. Ostendorf, M.-Y. Hwang, L. Xie, Training augmentation with adversarial examples for robust speech recognition, in Interspeech (2018), pp. 2404–2408
- W. Sun, F. Su, L. Wang, Improving deep neural networks with multi-layer maxout networks and a novel initialization method. Neurocomputing 278, 34–40 (2018)
- I. Sutskever, O. Vinyals, Q.V. Le, Sequence to sequence learning with neural networks, in Advances in Neural Information Processing Systems (NIPS) (2014), pp. 3104–3112
- L. ten Bosch, L. Boves, Information encoding by deep neural networks: what can we learn? in Interspeech (2018), pp. 1457–1461
- A. Tjandra, S. Sakti, S. Nakamura, Sequence-to-sequence ASR optimization via reinforcement learning, in ICASSP (2018)
- A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, K. Kavukcuoglu, WaveNet: a generative model for raw audio. arXiv:1609.03499 (2016)
- Y. Wang, et al., Tacotron: towards end-to-end speech synthesis, in Interspeech (2017), pp. 4006–4010
- W. Xiong, L. Wu, F. Alleva, J. Droppo, X. Huang, A. Stolcke, The Microsoft 2017 conversational speech recognition system, in ICASSP (2018)
- Y. Zhang, M. Pezeshki, P. Brakel, S. Zhang, C. Laurent, Y. Bengio, A. Courville, Towards end-to-end speech recognition with deep convolutional neural networks, in Interspeech (2016), pp. 410–414
- Z. Zhang, J. Geiger, J. Pohjalainen, A. Mousa, W. Jin, B. Schuller, Deep learning for environmentally robust speech recognition: an overview of recent developments. ACM Trans. Intell. Syst. Technol. 9(5), 49 (2018)