Study of senone-based deep neural network approaches for spoken language recognition

Ferrer, L.; Lei, Y.; McLaren, M.; Scheffer, N.

doi:10.1109/TASLP.2015.2496226

Ferrer, L.; Lei, Y.; McLaren, M.; Scheffer, N. "Study of senone-based deep neural network approaches for spoken language recognition" (2016) IEEE/ACM Transactions on Audio Speech and Language Processing. 24(1):105-116

https://bibliotecadigital.exactas.uba.ar/collection/paper/document/paper_23299290_v24_n1_p105_Ferrer

Abstract:

This paper compares different approaches for using deep neural networks (DNNs) trained to predict senone posteriors for the task of spoken language recognition (SLR). These approaches have recently been found to outperform various baseline systems on different datasets, but they have not yet been compared to each other or to a common baseline. Two of these approaches use the DNNs to generate feature vectors, which are then processed in different ways to predict the score of each language given a test sample. The features are extracted either from a bottleneck layer in the DNN or from the output layer. In the third approach, the standard i-vector extraction procedure is modified to use the senones as classes and the DNN to predict the zeroth-order statistics. We compare these three approaches and conclude that the approach based on bottleneck features followed by i-vector modeling outperforms the other two approaches. We also show that score-level fusion of some of these approaches leads to gains over using a single approach for short-duration test samples. Finally, we demonstrate that fusing systems that use DNNs trained with several languages leads to improvements in performance over the best single system, and we propose an adaptation procedure for DNNs trained with languages with less available data. Overall, we show improvements between 40% and 70% relative to a state-of-the-art Gaussian mixture model (GMM) i-vector system on test durations from 3 seconds to 120 seconds on two significantly different tasks: the NIST 2009 language recognition evaluation task and the DARPA RATS language identification task. © 2015 IEEE.
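The third approach in the abstract replaces GMM component posteriors with DNN senone posteriors when accumulating statistics for i-vector extraction. The following is a minimal sketch of that accumulation step only, not the authors' implementation. It assumes the per-frame senone posteriors are already available as a matrix; here they are simulated with a softmax over random values. The function name `sufficient_stats`, the array sizes, and the random data are hypothetical and chosen only for illustration.

```python
import numpy as np


def sufficient_stats(posteriors, features):
    """Accumulate zeroth- and first-order statistics over one utterance.

    posteriors: (T, K) array, row t holds the K senone posteriors of frame t
                (assumed to come from a DNN; simulated below).
    features:   (T, D) array of acoustic feature vectors, one per frame.
    Returns (zeroth, first) with shapes (K,) and (K, D).
    """
    posteriors = np.asarray(posteriors, dtype=float)
    features = np.asarray(features, dtype=float)
    if posteriors.shape[0] != features.shape[0]:
        raise ValueError("posteriors and features must have the same number of frames")
    # Zeroth order: soft count of frames assigned to each senone.
    zeroth = posteriors.sum(axis=0)
    # First order: posterior-weighted sum of feature vectors per senone.
    first = posteriors.T @ features
    return zeroth, first


# Hypothetical sizes: 300 frames, 8 senones, 5-dimensional features.
rng = np.random.default_rng(0)
T, K, D = 300, 8, 5

# Stand-in for DNN outputs: a row-wise softmax over random values.
logits = rng.normal(size=(T, K))
posteriors = np.exp(logits - logits.max(axis=1, keepdims=True))
posteriors /= posteriors.sum(axis=1, keepdims=True)

features = rng.normal(size=(T, D))
zeroth, first = sufficient_stats(posteriors, features)
```

Because each row of posteriors sums to one, the zeroth-order statistics sum to the number of frames, which gives a quick sanity check on the input. In a full system these statistics would then be passed to an i-vector extractor, which is outside the scope of this sketch.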

Record:

Document:	Article
Title:	Study of senone-based deep neural network approaches for spoken language recognition
Author:	Ferrer, L.; Lei, Y.; McLaren, M.; Scheffer, N.
Affiliation:	Computer Science Department, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, C1428EGA Autonomous City of Buenos Aires, Argentina; CONICET, C1425FQB Autonomous City of Buenos Aires, Argentina; Speech Technology and Research Laboratory, SRI International, Menlo Park, CA 94025, United States; Facebook, Inc., Menlo Park, CA 94025, United States
Keywords:	Deep neural networks (DNNs); Senones; Spoken language recognition (SLR); Forecasting; Gaussian distribution; Speech recognition; Vectors; Bottleneck features; Deep neural networks; Gaussian Mixture Model; Language identification; Language recognition; Score-level fusion; Spoken language recognition; Computational linguistics
Year:	2016
Volume:	24
Issue:	1
First page:	105
Last page:	116
DOI:	http://dx.doi.org/10.1109/TASLP.2015.2496226
Journal title:	IEEE/ACM Transactions on Audio Speech and Language Processing
Abbreviated journal title:	IEEE ACM Trans. Audio Speech Lang. Process.
ISSN:	23299290
Record:	https://bibliotecadigital.exactas.uba.ar/collection/paper/document/paper_23299290_v24_n1_p105_Ferrer

Citations:

---------- APA ----------

Ferrer, L., Lei, Y., McLaren, M., & Scheffer, N. (2016). Study of senone-based deep neural network approaches for spoken language recognition. IEEE/ACM Transactions on Audio Speech and Language Processing, 24(1), 105-116.
http://dx.doi.org/10.1109/TASLP.2015.2496226

---------- CHICAGO ----------

Ferrer, L., Lei, Y., McLaren, M., Scheffer, N. "Study of senone-based deep neural network approaches for spoken language recognition." IEEE/ACM Transactions on Audio Speech and Language Processing 24, no. 1 (2016): 105-116.
http://dx.doi.org/10.1109/TASLP.2015.2496226

---------- MLA ----------

Ferrer, L., Lei, Y., McLaren, M., Scheffer, N. "Study of senone-based deep neural network approaches for spoken language recognition." IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 24, no. 1, 2016, pp. 105-116.
http://dx.doi.org/10.1109/TASLP.2015.2496226

---------- VANCOUVER ----------

Ferrer, L., Lei, Y., McLaren, M., Scheffer, N. Study of senone-based deep neural network approaches for spoken language recognition. IEEE ACM Trans. Audio Speech Lang. Process. 2016;24(1):105-116.
http://dx.doi.org/10.1109/TASLP.2015.2496226