Article

Abstract:

This paper introduces Emilia, a speech corpus created to build a female voice in the Spanish spoken in Buenos Aires for the Aromo text-to-speech system. Aromo is a unit-selection text-to-speech system that uses diphones as its synthesis units. The key requirement and design criterion for Emilia was to synthesize any Spanish text into high-quality speech with a minimum corpus size. The text corpus was designed to guarantee phonetic and prosodic coverage using a three-stage strategy. In the first stage, 741 sentences were designed to contain all of the syllables of Spanish as spoken in Argentina, with and without stress, and in all positions within the word. In the second stage, 852 sentences were added to balance the distribution of the diphones. In the third and final stage, which followed a perceptual evaluation of the quality of the synthesized speech, 625 sentences were added to achieve the specified unit coverage and to introduce sentences with more complex syntactic and prosodic structures. Issues from all three corpus-building stages are reported. The paper also presents the results of the perceptual quality evaluations of the synthesized voice. Emilia has a duration of three hours and 15 minutes; the quality of speech synthesized from it with the Aromo system is similar to that obtained with commercial systems, with a real-time ratio of less than one. © 2019, Springer Nature B.V.
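
The abstract does not say which algorithm was used to pick sentences for diphone coverage. As a rough illustration of the general idea only, the sketch below uses a greedy set-cover heuristic, a common choice in the corpus-design literature, and is not the authors' procedure. The function names, sentence identifiers, and phone symbols are hypothetical.

```python
def diphones(phones: list[str]) -> set[tuple[str, str]]:
    """Return the set of adjacent phone pairs (diphones) in a phone sequence."""
    return set(zip(phones, phones[1:]))


def greedy_select(candidates: dict[str, list[str]],
                  targets: set[tuple[str, str]]) -> list[str]:
    """Pick sentences greedily until no candidate adds a new target diphone.

    candidates maps a (hypothetical) sentence id to its phone sequence;
    targets is the set of diphones the corpus should cover.
    """
    remaining = set(targets)
    chosen: list[str] = []
    pool = {sid: diphones(ph) for sid, ph in candidates.items()}
    while remaining and pool:
        # Sentence covering the most still-missing diphones; ties go to the
        # alphabetically first id because max() keeps the first maximum.
        best = max(sorted(pool), key=lambda sid: len(pool[sid] & remaining))
        gain = pool[best] & remaining
        if not gain:
            break  # the remaining targets cannot be reached from this pool
        chosen.append(best)
        remaining -= gain
        del pool[best]
    return chosen
```

Called with a dictionary of phonetically transcribed candidate sentences and a target diphone inventory, `greedy_select` returns sentence ids in the order they were chosen. A real design would also weight diphones by frequency and track stress and word position, as the staged strategy in the abstract suggests.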

Record:

Document: Article
Title: Emilia: a speech corpus for Argentine Spanish text to speech synthesis
Authors: Torres, H.M.; Gurlekian, J.A.; Evin, D.A.; Cossio Mercado, C.G.
Affiliations: Laboratorio de Investigaciones Sensoriales, INIGEM, CONICET-UBA, Av. Córdoba 2351, 9 Piso Sala 2. C.A.B.A. (1120), Buenos Aires, Argentina
Center for Research and Transfer in Acoustics (CINTRA), UTN-FRC UA CONICET, Maestro M. López esq. Cruz Roja Argentina, Ciudad Universitaria, Córdoba Capital, X5016ZAA, Argentina
Departamento de Computación, FCEN, UBA, University City, Buenos Aires, C1428EGA, Argentina
Keywords: Argentine Spanish; Phonetic corpus; Phonetic transcription; Speech corpus design; Text-to-speech
Year: 2019
DOI: http://dx.doi.org/10.1007/s10579-019-09447-7
Journal title: Language Resources and Evaluation
Abbreviated journal title: Lang. Resour. Eval.
ISSN: 1574-020X
Record: https://bibliotecadigital.exactas.uba.ar/collection/paper/document/paper_1574020X_v_n_p_Torres

Citations:

---------- APA ----------
Torres, H.M., Gurlekian, J.A., Evin, D.A. & Cossio Mercado, C.G. (2019). Emilia: a speech corpus for Argentine Spanish text to speech synthesis. Language Resources and Evaluation.
http://dx.doi.org/10.1007/s10579-019-09447-7
---------- CHICAGO ----------
Torres, H.M., Gurlekian, J.A., Evin, D.A., Cossio Mercado, C.G. "Emilia: a speech corpus for Argentine Spanish text to speech synthesis". Language Resources and Evaluation (2019).
http://dx.doi.org/10.1007/s10579-019-09447-7
---------- MLA ----------
Torres, H.M., Gurlekian, J.A., Evin, D.A., Cossio Mercado, C.G. "Emilia: a speech corpus for Argentine Spanish text to speech synthesis". Language Resources and Evaluation, 2019.
http://dx.doi.org/10.1007/s10579-019-09447-7
---------- VANCOUVER ----------
Torres, H.M., Gurlekian, J.A., Evin, D.A., Cossio Mercado, C.G. Emilia: a speech corpus for Argentine Spanish text to speech synthesis. Lang. Resour. Eval. 2019.
http://dx.doi.org/10.1007/s10579-019-09447-7