Article

Ferrer, L.; Lei, Y.; McLaren, M.; Scheffer, N. "Study of senone-based deep neural network approaches for spoken language recognition" (2016) IEEE/ACM Transactions on Audio Speech and Language Processing. 24(1):105-116

Abstract:

This paper compares different approaches for using deep neural networks (DNNs) trained to predict senone posteriors for the task of spoken language recognition (SLR). These approaches have recently been found to outperform various baseline systems on different datasets, but they have not yet been compared to each other or to a common baseline. Two of these approaches use the DNNs to generate feature vectors which are then processed in different ways to predict the score of each language given a test sample. The features are extracted either from a bottleneck layer in the DNN or from the output layer. In the third approach, the standard i-vector extraction procedure is modified to use the senones as classes and the DNN to predict the zeroth order statistics. We compare these three approaches and conclude that the approach based on bottleneck features followed by i-vector modeling outperforms the other two approaches. We also show that score-level fusion of some of these approaches leads to gains over using a single approach for short-duration test samples. Finally, we demonstrate that fusing systems that use DNNs trained with several languages leads to improvements in performance over the best single system, and we propose an adaptation procedure for DNNs trained with languages with less available data. Overall, we show improvements between 40% and 70% relative to a state-of-the-art Gaussian mixture model (GMM) i-vector system on test durations from 3 seconds to 120 seconds on two significantly different tasks: the NIST 2009 language recognition evaluation task and the DARPA RATS language identification task. © 2015 IEEE.
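
As a rough illustration of the third approach, the sketch below accumulates per-senone statistics from frame-level DNN posteriors, using the standard Baum-Welch form (zeroth order: sum of posteriors per class; first order: posterior-weighted sum of features). The abstract does not give the paper's exact formulation, so the function name `senone_statistics`, the list-based data layout, and the toy values are assumptions made for this example only.

```python
from typing import List, Tuple


def senone_statistics(
    posteriors: List[List[float]],
    features: List[List[float]],
) -> Tuple[List[float], List[List[float]]]:
    """Accumulate zeroth- and first-order statistics per senone.

    posteriors[t][k]: assumed DNN posterior of senone k at frame t.
    features[t][d]:   acoustic feature d at frame t.
    """
    if len(posteriors) != len(features):
        raise ValueError("posteriors and features must cover the same frames")
    if not posteriors:
        raise ValueError("at least one frame is required")

    num_senones = len(posteriors[0])
    dim = len(features[0])
    zeroth = [0.0] * num_senones
    first = [[0.0] * dim for _ in range(num_senones)]

    for gamma_t, x_t in zip(posteriors, features):
        for k, gamma in enumerate(gamma_t):
            # Zeroth order: soft count of frames assigned to senone k.
            zeroth[k] += gamma
            # First order: posterior-weighted sum of feature vectors.
            for d, x in enumerate(x_t):
                first[k][d] += gamma * x
    return zeroth, first


# Toy input: 2 frames, 2 senones, 2-dimensional features (hypothetical values).
zeroth, first = senone_statistics(
    [[0.75, 0.25], [1.0, 0.0]],
    [[2.0, 4.0], [1.0, 0.0]],
)
print(zeroth)  # [1.75, 0.25]
print(first)   # [[2.5, 3.0], [0.5, 1.0]]
```

In a GMM-based i-vector system the same statistics would be computed with Gaussian component posteriors; here the senone posteriors simply take their place as the class assignments.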

Record:

Document: Article
Title: Study of senone-based deep neural network approaches for spoken language recognition
Author: Ferrer, L.; Lei, Y.; McLaren, M.; Scheffer, N.
Affiliation: Computer Science Department, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, C1428EGA Autonomous City of Buenos Aires, Buenos Aires, Argentina
CONICET, C1425FQB Autonomous City of Buenos Aires, Buenos Aires, Argentina
Speech Technology and Research Laboratory, SRI International, Menlo Park, CA 94025, United States
Facebook, Inc., Menlo Park, CA 94025, United States
Keywords: Deep neural networks (DNNs); Senones; Spoken language recognition (SLR); Forecasting; Gaussian distribution; Speech recognition; Vectors; Bottleneck features; Gaussian Mixture Model; Language identification; Language recognition; Score-level fusion; Computational linguistics
Year: 2016
Volume: 24
Issue: 1
Start page: 105
End page: 116
DOI: http://dx.doi.org/10.1109/TASLP.2015.2496226
Journal title: IEEE/ACM Transactions on Audio Speech and Language Processing
Abbreviated journal title: IEEE ACM Trans. Audio Speech Lang. Process.
ISSN: 2329-9290
Record: https://bibliotecadigital.exactas.uba.ar/collection/paper/document/paper_23299290_v24_n1_p105_Ferrer

Citations:

---------- APA ----------
Ferrer, L., Lei, Y., McLaren, M., & Scheffer, N. (2016). Study of senone-based deep neural network approaches for spoken language recognition. IEEE/ACM Transactions on Audio Speech and Language Processing, 24(1), 105-116.
http://dx.doi.org/10.1109/TASLP.2015.2496226
---------- CHICAGO ----------
Ferrer, L., Lei, Y., McLaren, M., Scheffer, N. "Study of senone-based deep neural network approaches for spoken language recognition." IEEE/ACM Transactions on Audio Speech and Language Processing 24, no. 1 (2016): 105-116.
http://dx.doi.org/10.1109/TASLP.2015.2496226
---------- MLA ----------
Ferrer, L., Lei, Y., McLaren, M., Scheffer, N. "Study of senone-based deep neural network approaches for spoken language recognition." IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 24, no. 1, 2016, pp. 105-116.
http://dx.doi.org/10.1109/TASLP.2015.2496226
---------- VANCOUVER ----------
Ferrer, L., Lei, Y., McLaren, M., Scheffer, N. Study of senone-based deep neural network approaches for spoken language recognition. IEEE ACM Trans. Audio Speech Lang. Process. 2016;24(1):105-116.
http://dx.doi.org/10.1109/TASLP.2015.2496226