2017

Guangpu Huang, Thiago Fraga-Silva, Lori Lamel, Jean-Luc Gauvain, Antoine Laurent, Rasa Lileikyte, Abdel Massouadi

ICASSP 2017, The 42nd IEEE International Conference on Acoustics, Speech and Signal Processing

This paper reports on investigations of using two techniques for language model text data augmentation for low-resourced automatic speech recognition and keyword search. Low-resourced languages are characterized by limited training materials, which typically results in high out-of-vocabulary (OOV) rates and poor language model estimates. One technique makes use of recurrent neural networks (RNNs) using word or subword units. Word-based RNNs keep the same system vocabulary, so they cannot reduce the OOV rate, whereas subword units can reduce the OOV rate but generate many false combinations. A complementary technique is based on automatic machine translation, which requires parallel texts and is able to add words to the vocabulary. These methods were assessed on 10 languages in the context of the Babel program and the NIST OpenKWS evaluation. Although improvements vary across languages, both methods generally yielded small gains in terms of reduced word error rate and improved keyword search performance.
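
The OOV reduction from subword units comes from segmenting unseen words into smaller pieces that are in the system vocabulary. As a generic illustration only (the paper does not commit to this particular scheme, and the toy word counts and merge budget below are invented), here is a minimal byte-pair-encoding-style subword learner in Python:

```python
from collections import Counter

def learn_bpe(word_counts, n_merges):
    """Tiny BPE sketch: word_counts maps word -> frequency."""
    vocab = {tuple(w) + ("</w>",): f for w, f in word_counts.items()}
    merges = []
    for _ in range(n_merges):
        # Count all adjacent unit pairs, weighted by word frequency.
        pairs = Counter()
        for seq, f in vocab.items():
            for a, b in zip(seq, seq[1:]):
                pairs[a, b] += f
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs.
        new_vocab = {}
        for seq, f in vocab.items():
            out, i = [], 0
            while i < len(seq):
                if i < len(seq) - 1 and (seq[i], seq[i + 1]) == best:
                    out.append(seq[i] + seq[i + 1]); i += 2
                else:
                    out.append(seq[i]); i += 1
            new_vocab[tuple(out)] = f
        vocab = new_vocab
    return merges

# Invented toy corpus statistics.
print(learn_bpe(Counter({"lowest": 5, "newest": 6, "widest": 3}), 10))
```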

.bib [Huang17] | .pdf
Rasa Lileikyte, Thiago Fraga-Silva, Lori Lamel, Jean-Luc Gauvain, Antoine Laurent, Guangpu Huang

ICASSP 2017, The 42nd IEEE International Conference on Acoustics, Speech and Signal Processing

In this paper we aim to enhance keyword search for conversational telephone speech under low-resourced conditions. Two techniques to improve the detection of out-of-vocabulary keywords are assessed in this study: augmenting the lexicon and language model with extra text resources, and using subword units for keyword search. Two data augmentation approaches are explored to extend the limited amount of transcribed conversational speech: using conversational-like Web data and texts generated by recurrent neural networks. Contrastive comparisons of subword-based systems are performed to evaluate the benefits of multiple subword decodings versus a single decoding. Keyword search results are reported for all the techniques, but only some improve performance. Results are reported for the Mongolian and Igbo languages using data from the 2016 Babel program.
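
To search for an out-of-vocabulary keyword in subword decoding output, the keyword itself must first be mapped to subword units. A minimal sketch under invented assumptions (greedy longest-match segmentation, a toy unit inventory), not the paper's exact procedure:

```python
def segment(keyword, inventory):
    """Greedy longest-match segmentation of an OOV keyword into subword
    units so it can be searched in subword decoding output."""
    units, i = [], 0
    while i < len(keyword):
        for j in range(len(keyword), i, -1):
            if keyword[i:j] in inventory:
                units.append(keyword[i:j]); i = j; break
        else:
            return None  # cannot be segmented with this inventory
    return units

# Invented toy inventory and keyword.
print(segment("batbayar", {"bat", "ba", "yar", "ta", "r"}))
```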

.bib [Lileikyte17] | .pdf

2016

G. Gelly, J.L. Gauvain, L. Lamel, A. Laurent, V.B. Le, A. Messaoudi

Odyssey 2016, The Speaker and Language Recognition Workshop

This paper describes our development work to design a language recognition system that can discriminate closely related languages and dialects of the same language. The work was a joint effort by LIMSI and Vocapia Research in preparation for the NIST 2015 Language Recognition Evaluation (LRE). The language recognition system results from a fusion of four core classifiers: a phonotactic component using DNN acoustic models, two purely acoustic components using an RNN model and an i-vector model, and a lexical component. Each component generates language posterior probabilities optimized to maximize the LID NCE, making their combination simple and robust. The motivation for using multiple components representing different types of speech knowledge is that some dialect distinctions may not be manifest at the acoustic level. We report experiments on the NIST LRE15 data and provide an analysis of the results and some post-evaluation contrasts. The 2015 LRE task focused on the identification of 20 languages clustered in 6 groups (Arabic, Chinese, English, French, Slavic and Iberian) of similar languages. Results are reported using the NIST Cavg metric, which served as the primary metric for the OpenLRE15 evaluation. Results are also reported for the EER and the LER.
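
As a rough sketch of the fusion step, assuming each classifier outputs a posterior matrix over languages and that the fusion weights have already been trained (here they are illustrative constants; in the paper the calibration maximizes the LID NCE):

```python
import numpy as np

def fuse_posteriors(components, weights):
    """Log-linear fusion of language posteriors. `components` is a list of
    [n_utts, n_langs] posterior matrices, one per classifier; `weights`
    are illustrative constants standing in for trained fusion weights."""
    log_p = sum(w * np.log(p + 1e-10) for w, p in zip(weights, components))
    p = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)

# Toy example: two classifiers, three utterances, two languages.
a = np.array([[0.8, 0.2], [0.4, 0.6], [0.5, 0.5]])
b = np.array([[0.9, 0.1], [0.3, 0.7], [0.6, 0.4]])
print(fuse_posteriors([a, b], [0.6, 0.4]))
```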

.bib [Gelly16] | .pdf
Antoine Laurent, Thiago Fraga-Silva, Lori Lamel, Jean-Luc Gauvain

ICASSP 2016, The 41st IEEE International Conference on Acoustics, Speech and Signal Processing

In this paper we investigate various techniques for building effective speech-to-text (STT) and keyword search (KWS) systems for low-resource conversational speech. Subword decoding and graphemic mappings were assessed in order to detect out-of-vocabulary keywords. To deal with the limited amount of transcribed data, semi-supervised training and data selection methods were investigated. Robust acoustic features produced via data augmentation were evaluated for acoustic modeling. For language modeling, automatically retrieved conversational-like Web data was used, as well as neural network based models. We report STT improvements with all the techniques, but interestingly only some improve KWS performance. Results are reported for the Swahili language in the context of the 2015 OpenKWS Evaluation.

.bib [Laurent16] | .pdf
A. Gorin, R. Lileikyte, G. Huang, L. Lamel, J.L. Gauvain, A. Laurent

Interspeech 2016, Annual Conference of the International Speech Communication Association

This research extends our earlier work on using machine translation (MT) and word-based recurrent neural networks to augment language model training data for keyword search in conversational Cantonese speech. MT-based data augmentation is applied to two language pairs: English-Lithuanian and English-Amharic. Using filtered N-best MT hypotheses for language modeling is found to perform better than just using the 1-best translation. Target language texts collected from the Web and filtered to select conversational-like data are used in several ways. In addition to using Web data for training the language model of the speech recognizer, we further investigate using this data to improve the language model and phrase table of the MT system to get better translations of the English data. Finally, generating text data with a character-based recurrent neural network is investigated. This approach allows new word forms to be produced, providing a way to reduce the out-of-vocabulary rate and thereby improve keyword spotting performance. We study how these different methods of language model data augmentation impact speech-to-text and keyword spotting performance for the Lithuanian and Amharic languages. The best results are obtained by combining all of the explored methods.
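
A minimal sketch of the character-based RNN generation idea, in PyTorch; the architecture, corpus path, window length and sampling temperature are all invented placeholders, not the paper's configuration. Because sampling operates at the character level, the generated text can contain word forms absent from the original training data:

```python
import torch
import torch.nn as nn

corpus = open("train_text.txt", encoding="utf-8").read()  # hypothetical path
chars = sorted(set(corpus))
stoi = {c: i for i, c in enumerate(chars)}

class CharLM(nn.Module):
    def __init__(self, vocab, emb=64, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)
    def forward(self, x, state=None):
        h, state = self.lstm(self.embed(x), state)
        return self.out(h), state

model = CharLM(len(chars))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
data = torch.tensor([stoi[c] for c in corpus])

# Teacher-forced training on random fixed-length windows.
for step in range(1000):
    i = torch.randint(0, len(data) - 129, (1,)).item()
    x, y = data[i:i + 128].unsqueeze(0), data[i + 1:i + 129].unsqueeze(0)
    logits, _ = model(x)
    loss = nn.functional.cross_entropy(logits.squeeze(0), y.squeeze(0))
    opt.zero_grad(); loss.backward(); opt.step()

# Sample new text one character at a time (temperature is an assumption).
idx, state, out = torch.tensor([[stoi[corpus[0]]]]), None, []
for _ in range(500):
    logits, state = model(idx, state)
    probs = torch.softmax(logits[0, -1] / 0.8, dim=-1)
    idx = torch.multinomial(probs, 1).unsqueeze(0)
    out.append(chars[idx.item()])
print("".join(out))
```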

.bib [Gorin16] | .pdf

2015

Thiago Fraga-Silva, Jean-Luc Gauvain, Lori Lamel, Antoine Laurent, Viet-Bac Le, Abdel Messaoudi

Interspeech 2015, Annual Conference of the International Speech Communication Association

This paper presents first results in using active learning (AL) for training data selection in the context of the IARPA-Babel program. Given an initial training data set, we aim to automatically select additional data (from an untranscribed pool data set) for manual transcription. Initial and selected data are then used to build acoustic and language models for speech recognition. The goal of the AL task is to outperform a baseline system built using a predefined data selection with the same amount of data, the Very Limited Language Pack (VLLP) condition. AL methods based on different selection criteria have been explored. Compared to the VLLP baseline, improvements are obtained in terms of Word Error Rate and Actual Term Weighted Values for the Lithuanian language. A description of methods and an analysis of the results are given. The AL selection also outperforms the VLLP baseline for other IARPA-Babel languages, and will be further tested in the upcoming NIST OpenKWS 2015 evaluation.
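
Schematically, one active-learning round proceeds as below; this is a sketch with placeholder functions (train_models, decode, select, transcribe) standing in for the actual systems and selection criteria studied in the paper:

```python
def active_learning_round(train, pool, budget,
                          train_models, decode, select, transcribe):
    """One AL round (all arguments are placeholders, not a real API):
    train  -- list of transcribed utterances
    pool   -- set of untranscribed utterance ids
    budget -- amount of data to select this round"""
    am, lm = train_models(train)        # acoustic + language models
    hyps = decode(am, lm, pool)         # automatic transcripts of the pool
    chosen = select(hyps, budget)       # e.g. an entropy-based criterion
    train = train + transcribe(chosen)  # manual transcription of selection
    pool = pool - set(chosen)
    return train, pool
```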

.bib [Fraga15] | .pdf
Thiago Fraga-Silva, Antoine Laurent, Jean-Luc Gauvain, Lori Lamel, Viet-Bac Le, Abdel Messaoudi

ASRU 2015, 2015 IEEE Automatic Speech Recognition and Understanding Workshop

This paper extends recent research on training data selection for speech transcription and keyword spotting system development. The techniques were explored in the context of the IARPA-Babel Active Learning (AL) task for 6 languages. Different selection criteria were explored with the goal of improving over a system built using a predefined 3-hour training data set. Four variants of the entropy-based criterion were explored: words, triphones, phones, as well as the HMM-state criterion previously introduced in [Fraga15] (Interspeech 2015, above). The influence of the number of HMM-states was assessed, as well as whether automatic or manual reference transcripts were used. The combination of selection criteria was investigated, and a novel multi-stage selection method proposed. These methods were also assessed using larger data sets than were permitted in the Babel AL task. Results are reported for the 6 languages. The multi-stage selection was also applied to the surprise language (Swahili) in the NIST OpenKWS 2015 evaluation.
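
As an illustration of an entropy-based criterion, the sketch below greedily picks utterances that maximize the entropy of the selected set's unit distribution, where units could be words, phones, triphones or HMM-states; the greedy strategy and data layout are assumptions, not the paper's exact algorithm:

```python
import math
from collections import Counter

def unit_entropy(counts):
    """Shannon entropy of a unit-frequency Counter."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def greedy_select(pool, budget_seconds):
    """pool: list of (utt_id, duration_s, units) where `units` come from
    an automatic transcript. Greedily add the utterance that most
    increases the entropy of the selected set's unit distribution until
    the duration budget is exhausted."""
    selected, counts, used = [], Counter(), 0.0
    while pool and used < budget_seconds:
        best = max(pool, key=lambda u: unit_entropy(counts + Counter(u[2])))
        pool.remove(best)
        selected.append(best[0])
        counts += Counter(best[2])
        used += best[1]
    return selected

# Invented toy pool of three utterances.
pool = [("u1", 4.0, ["a", "b", "a"]), ("u2", 3.0, ["c", "d"]),
        ("u3", 5.0, ["a", "a", "a"])]
print(greedy_select(pool, 7.0))
```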

.bib [Fraga15b] | .pdf

2014

Laurent, A., Lamel, L.

SLTU 2014, Spoken Language Technologies for Under-resourced languages

This paper investigates the development of a speech-to-text transcription system for the Korean language in the context of the DGA RAPID Rapmat project. Korean is an alpha-syllabary language spoken by about 78 million people worldwide. As only a small amount of manually transcribed audio data were available, the acoustic models were trained on audio data downloaded from several Korean websites in an unsupervised manner, and the language models were trained on web texts. The reported word and character error rates are estimates, as the development corpus used in these experiments was also constructed from the untranscribed audio data, the web texts and automatic transcriptions. Several variants for unsupervised acoustic model training were compared to assess the influence of the vocabulary size (200k vs 2M), the type of language model (words vs characters), the acoustic unit (phonemes vs half-syllables), as well as incremental batch vs iterative decoding of the untranscribed audio corpus.

.bib [Laurent14] | .pdf
Bredin, H., Laurent, A., Sarkar, A., Le, V.-B., Barras, Claude, Rosset, Sophie

Odyssey 2014, The Speaker and Language Recognition Workshop

We address the problem of named speaker identification in TV broadcast, which consists of answering the question "who speaks when?" with the real identity of speakers, using person names automatically obtained from speech transcripts. While existing approaches rely on a first speaker diarization step followed by a local name propagation step to speaker clusters, we propose a unified framework called the person instance graph, where both steps are jointly modeled as a global optimization problem, then solved using integer linear programming. Moreover, when available, acoustic speaker models can be added seamlessly to the graph structure for joint named and acoustic speaker identification -- leading to a 10% absolute error decrease (from 45% down to 35%) over a state-of-the-art i-vector speaker identification system on the REPERE TV broadcast corpus.
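
A toy sketch of the integer-linear-programming step using the PuLP library: binary variables assign names to speaker clusters, the objective sums hypothetical affinity scores, and a constraint limits each cluster to one name. The scores and the single constraint family are invented simplifications of the full person instance graph:

```python
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary

# Hypothetical affinity scores between speaker clusters and names found
# in the automatic transcript (all values made up for illustration).
clusters = ["c1", "c2"]
names = ["Alice_Dupont", "Bob_Martin"]
score = {("c1", "Alice_Dupont"): 0.9, ("c1", "Bob_Martin"): 0.2,
         ("c2", "Alice_Dupont"): 0.1, ("c2", "Bob_Martin"): 0.7}

prob = LpProblem("named_speaker_identification", LpMaximize)
x = {(c, n): LpVariable(f"x_{c}_{n}", cat=LpBinary)
     for c in clusters for n in names}
# Objective: total affinity of the chosen cluster-name assignments.
prob += lpSum(score[c, n] * x[c, n] for c in clusters for n in names)
# Each cluster receives at most one name.
for c in clusters:
    prob += lpSum(x[c, n] for n in names) <= 1
prob.solve()
for (c, n), var in x.items():
    if var.value() == 1:
        print(c, "->", n)
```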

.bib [Laurent14b] | .pdf
Laurent, A., Camelin, N., Raymond, C.

Interspeech 2014, 15th Annual Conference of the International Speech Communication Association

In this article, we tackle the problem of speaker role detection in broadcast news shows. In the literature, many proposed solutions are based on combining various features drawn from acoustic, lexical and semantic information with a machine learning algorithm. Many previous studies use boosting over decision stumps to combine these features efficiently. In this work, we propose a modification of this state-of-the-art machine learning algorithm, replacing the weak learner (decision stumps) with small decision trees, which we call bonsai trees. Experiments show that using bonsai trees as weak learners for the boosting algorithm substantially improves both the system error rate and the learning time.
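
The idea can be reproduced with off-the-shelf tools: a minimal sketch using scikit-learn's AdaBoost with depth-1 stumps versus small trees as weak learners, on synthetic data (the depth of 3 and all other settings are assumptions, not the paper's configuration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the acoustic/lexical role features.
X, y = make_classification(n_samples=2000, n_features=40, random_state=0)

# Classic boosting over decision stumps (depth-1 trees)...
stumps = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                            n_estimators=300, random_state=0)
# ...versus boosting over small "bonsai" trees. Requires scikit-learn
# >= 1.2 for the `estimator` argument (older versions: `base_estimator`).
bonsai = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=3),
                            n_estimators=300, random_state=0)

for name, clf in [("stumps", stumps), ("bonsai", bonsai)]:
    print(name, cross_val_score(clf, X, y, cv=3).mean())
```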

.bib [Laurent14e] | .pdf
Laurent, A., Hartmann, W., Lamel, L.

ISCSLP@Interspeech 2014, 15th Annual Conference of the International Speech Communication Association

This paper investigates unsupervised training strategies for the Korean language in the context of the DGA RAPID Rapmat project. As with previous studies, we begin with only a small amount of manually transcribed data to build preliminary acoustic models. Using the initial models, a larger set of untranscribed audio data is decoded to produce approximate transcripts. We compare both GMM and DNN acoustic models for both the unsupervised transcription and the final recognition system. While the DNN acoustic models produce a lower word error rate on the test set, training on the transcripts from the GMM system provides the best overall performance. We also achieve better performance by expanding the original phone set. Finally, we examine the efficacy of automatically building a test set by comparing system performance both before and after manually correcting the test set.

.bib [Laurent14d] | .pdf
Laurent, A., Meignier, S., Deléglise, P.

In Computer Speech And Language

Accurate phonetic transcription of proper nouns can be an important resource for commercial applications that embed speech technologies, such as audio indexing and vocal phone directory lookup. However, an accurate phonetic transcription is more difficult to obtain for proper nouns than for regular words. Indeed, the phonetic transcription of a proper noun depends on both the origin of the speaker pronouncing it and the origin of the proper noun itself. This work proposes a method that extracts phonetic transcriptions of proper nouns from actual utterances of those proper nouns, thus yielding transcriptions based on practical use instead of mere pronunciation rules. The proposed method consists of a process that first extracts phonetic transcriptions, and then iteratively filters them. In order to initialize the process, an alignment dictionary is used to detect word boundaries. A rule-based grapheme-to-phoneme generator (LIA_PHON), a knowledge-based approach (JSM), and a statistical machine translation based system were evaluated for this alignment. As a result, compared to our reference dictionary (BDLEX supplemented by LIA_PHON for missing words) on the ESTER 1 French broadcast news corpus, we were able to significantly decrease the Word Error Rate (WER) on segments of speech containing proper nouns, without negatively affecting the WER on the rest of the corpus.

.bib [Laurent14c] | .pdf
Bouaziz, M., Laurent, A., Estève, Y.

JEP 2014, Journées d'Etudes sur la Parole

Some Automatic Speech Recognition (ASR) systems achieve error rates on the order of 10%. However, particularly in the context of automatic indexing of multimedia documents on the Web, ASR systems face the problem of out-of-vocabulary words. Indeed, named entities make up a large portion of these words and are especially important for indexing tasks. In this work, we implement a hybrid decoding solution using syllables as subword units. This method is integrated into the LIUM'08 ASR system developed by the Laboratoire d'Informatique de l'Université du Maine. With only a slight degradation of overall system performance, about 31% of out-of-vocabulary person names are correctly recognized.

.bib [Laurent14j5] | .pdf
Bonneau-Maynard, H., Segal, N., Bilinski, E., Gauvain, J.-L., Gong, L., Lamel, L., Laurent, A., Yvon, F., Despres, J., Josse, Y., Le, V.-B.

JEP 2014, Journées d'Etudes sur la Parole

The RAPMAT project aims to develop speech translation systems, addressing the two components of the complete pipeline: automatic speech recognition (ASR) and machine translation (MT). In the standard setting, the statistical models used by the two systems are estimated independently, from data of different natures (manual transcriptions of speech data for ASR, and bilingual corpora derived from textual data for MT). We propose a semi-supervised approach for adapting the translation models to speech translation, in which the MT models are trained by incorporating automatically translated manual and automatic speech transcriptions. The approach is evaluated on the French-to-English translation direction. A demonstration prototype for smartphones, including speech translation for the French/English and French/Chinese language pairs, was developed to enable data collection.

.bib [Laurent14j4] | .pdf
Laurent, A., Lamel, L.

JEP 2014, Journées d'Etudes sur la Parole

This paper describes the development of an automatic speech recognition system for Korean. Korean is an alpha-syllabic language spoken by about 78 million people worldwide. The system was developed using very little manually annotated data. The acoustic models were adapted in an unsupervised manner using data from several Korean news websites. The development corpus contains approximate transcriptions of the audio documents: it is a corpus transcribed automatically and aligned with data from the same websites. We compare different approaches in this work, namely language models using different units for unsupervised training and for decoding (characters, and words with vocabularies of different sizes), the use of phonemes versus "half-syllable" units, and two different unsupervised training approaches.

.bib [Laurent14j3] | .pdf
Laurent, A., Guinaudeau, C., Roy, A.

JEP 2014, Journées d'Etudes sur la Parole

This article describes the methods put in place to enable the analysis of a corpus of audiovisual documents broadcast over the last 80 years: the MATRICE corpus. We propose an exploration of the data that highlights the different topics and events covered in the corpus. This exploration is first carried out on documentary records produced manually by the archivists of the Institut National de l'Audiovisuel. We then show, through a qualitative study and an automatic clustering technique, that automatic transcriptions also make it possible to analyze the corpus, bringing out themes consistent with the underlying data.

.bib [Laurent14j2] | .pdf
Laurent, A., Camelin, N., Raymond, C.

JEP 2014, Journées d'Etudes sur la Parole

In this work, we address the problem of speaker role detection in broadcast news shows. In the literature, the proposed solutions combine various features derived from the acoustics, the transcription and/or its analysis using machine learning methods. Many studies identify boosting over simple decision rules as one of the most effective algorithms for combining these descriptors. We propose a modification of this state-of-the-art algorithm, replacing the simple decision rules with small decision trees that we call bonsai trees. Comparative experiments on the EPAC corpus show that this modification greatly improves system performance while substantially reducing training time.

.bib [Laurent14j1] | .pdf
Guinaudeau, C., Laurent, A., Bredin, H.

MediaEval 2014 Social Event Detection Task. Working Notes Proceedings

This paper provides an overview of the Social Event Detection (SED) system developed at LIMSI for the 2014 campaign. Our approach is based on hierarchical agglomerative clustering that uses textual metadata, user-based knowledge and geographical information. These different sources of knowledge, used either separately or in cascade, achieve good results for the full clustering subtask, with a normalized mutual information of 0.95 and F1 scores greater than 0.82 for our best run.
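
A minimal sketch of hierarchical agglomerative clustering plus NMI-based evaluation with SciPy and scikit-learn; the feature matrix, linkage method and cut threshold are invented placeholders for the real metadata-derived similarities:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
# Stand-in features (textual/user/geo similarity would be used in practice).
features = rng.random((20, 3))

Z = linkage(features, method="average")          # agglomerative clustering
labels = fcluster(Z, t=0.7, criterion="distance")  # cut the dendrogram

# Evaluation against (here, fake) ground-truth event labels via NMI.
truth = rng.integers(0, 4, size=20)
print(normalized_mutual_info_score(truth, labels))
```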

.bib [guinaudeau14] | .pdf

2012

El-Khoury, E., Laurent, A., Meignier, S., Petitrenaud, S.

ICASSP 2012, IEEE International Conference on Acoustics, Speech, and Signal Processing

.bib [Laurent12] | .pdf
Dufour, R., Laurent, A., Estève, Y.

JEP 2012, Journées d'Etudes sur la Parole

.bib [Laurent12j1] | .pdf

2011

Laurent, A., Meignier, S., Merlin, T., Deléglise, P.

ICASSP 2011, IEEE International Conference on Acoustics, Speech, and Signal Processing

.bib [Laurent11] | .pdf

2010

Estève, Y., Deléglise, P., Meignier, S., Petitrenaud, S., Schwenk, H., Barrault, L., Bougares, F., Dufour, R., Jousse, V., Laurent, A., Rousseau, A.

Workshop CMU SPU

.bib [Esteve10] | .pdf
Laurent, A., Meignier, S., Deléglise, P.

JEP 2010, Journées d'Etudes sur la Parole

.bib [Laurent10j1] | .pdf
Laurent, A., Meignier, S., Merlin, T., Deléglise, P.

Interspeech 2010, 11th Annual Conference of the International Speech Communication Association

.bib [Laurent10-b] | .pdf

2009

Laurent, A., Deléglise, P., Meignier, S.

Interspeech 2009, 10th Annual Conference of the International Speech Communication Association

.bib [Laurent09b] | .pdf
Laurent, A., Merlin, T., Meignier, S., Estève, Y., Deléglise, P.

ICASSP 2009, IEEE International Conference on Acoustics, Speech, and Signal Processing

.bib [Laurent09] | .pdf

2008

Laurent, A., Meignier, S., Estève, Y., Deléglise, P.

JEP 2008, Journées d'Etudes sur la Parole

.bib [Laurent08b] | .pdf
Laurent, A., Merlin, T., Meignier, S., Estève, Y., Deléglise, P.

LREC 2008, Language Resources and Evaluation Conference

.bib [Laurent08] | .pdf
