2023

Nauman Dawalatabad, Sameer Khurana, Antoine Laurent, James Glass

ICASSP 2023

Pseudo-label (PL) filtering forms a crucial part of Self-Training (ST) methods for unsupervised domain adaptation. Dropout-based Uncertainty-driven Self-Training (DUST) proceeds by first training a teacher model on source domain labeled data. Then, the teacher model is used to provide PLs for the unlabeled target domain data. Finally, we train a student on augmented labeled and pseudo-labeled data. The process is iterative, where the student becomes the teacher for the next DUST iteration. A crucial step that precedes the student model training in each DUST iteration is filtering out noisy PLs that could lead the student model astray. In DUST, we proposed a simple, effective, and theoretically sound PL filtering strategy based on the teacher model's uncertainty about its predictions on unlabeled speech utterances. We estimate the model's uncertainty by computing disagreement amongst multiple samples drawn from the teacher model during inference by injecting noise via dropout. In this work, we show that DUST's PL filtering, as initially used, may fail under severe source and target domain mismatch. We suggest several approaches to eliminate or alleviate this issue. Further, we bring insights from the research in neural network model calibration to DUST and show that a well-calibrated model correlates strongly with a positive outcome of the DUST PL filtering step.
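
For illustration, here is a minimal sketch of the dropout-based disagreement filter described above; the `teacher`, `decode_fn` and the threshold `tau` are hypothetical placeholders rather than the paper's exact implementation, and the model's stochasticity is assumed to come only from dropout.

```python
import torch
from torch import nn


def dropout_disagreement_filter(teacher: nn.Module, utterances, decode_fn,
                                num_samples: int = 3, tau: float = 0.3):
    """Keep pseudo-labels whose dropout-perturbed hypotheses agree with the
    reference hypothesis (a sketch of DUST-style filtering, not the exact recipe).
    decode_fn(model, x) is assumed to return a token sequence for utterance x."""
    kept = []
    for x in utterances:
        teacher.eval()                      # reference hypothesis: dropout off
        with torch.no_grad():
            reference = decode_fn(teacher, x)

        teacher.train()                     # dropout on, to sample noisy hypotheses
        disagreements = []
        with torch.no_grad():
            for _ in range(num_samples):
                hypothesis = decode_fn(teacher, x)
                # normalised edit distance between sampled and reference hypotheses
                disagreements.append(edit_distance(hypothesis, reference) / max(len(reference), 1))

        # keep the utterance only if every sampled hypothesis stays close to the reference
        if max(disagreements) <= tau:
            kept.append((x, reference))
    return kept


def edit_distance(a, b):
    """Plain Levenshtein distance over token sequences."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)] for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[-1][-1]
```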

.bib [Dawalatabad23] | .pdf
Antoine Laurent, Souhir Gahbiche, Ha Nguyen, Haroun Elleuch, Fethi Bougares, Antoine Thiol, Hugo Riguidel, Salima Mdhaffar, Gaëlle Laperrière, Lucas Maison, Sameer Khurana, Yannick Estève

IWSLT@ACL 2023

This paper describes the ON-TRAC consortium speech translation systems developed for the IWSLT 2023 evaluation campaign. Overall, we participated in three speech translation tracks featured in the low-resource and dialect speech translation shared tasks, namely: i) spoken Tamasheq to written French, ii) spoken Pashto to written French, and iii) spoken Tunisian to written English. All our primary submissions are based on an end-to-end speech-to-text neural architecture using a pretrained SAMU-XLSR model as a speech encoder and an mBART model as a decoder. The SAMU-XLSR model is built from XLS-R 128 in order to generate language-agnostic sentence-level embeddings, guided by the LaBSE model trained on a multilingual text dataset. This architecture allows us to improve the input speech representations and achieve significant improvements compared to conventional end-to-end speech translation systems.

.bib [Laurent23] | .pdf
Victoria Mingote, Pablo Gimeno, Luis Vicente, Sameer Khurana, Antoine Laurent, Jarod Duret

IEEE Signal Processing Letters

This letter proposes a direct text to speech translation system using discrete acoustic units. This framework employs text in different source languages as input to generate speech in the target language without the need for text transcriptions in this language. Motivated by the success of acoustic units in previous works for direct speech to speech translation systems, we use the same pipeline to extract the acoustic units using a speech encoder combined with a clustering algorithm. Once units are obtained, an encoder-decoder architecture is trained to predict them. Then a vocoder generates speech from units. Our approach for direct text to speech translation was tested on the new CVSS corpus with two different text mBART models employed as initialisation. The systems presented report competitive performance for most of the language pairs evaluated. Besides, results show a remarkable improvement when initialising our proposed architecture with a model pre-trained with more languages.
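
As a rough illustration of the unit-extraction stage described above, the sketch below clusters frame-level encoder features with k-means and deduplicates consecutive frame labels; the feature dimensions, number of units and random features are assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import KMeans


def learn_acoustic_units(frame_features: np.ndarray, num_units: int = 100) -> KMeans:
    """Cluster frame-level speech-encoder features into discrete acoustic units.
    frame_features: array of shape (num_frames, feature_dim), pooled over a corpus."""
    kmeans = KMeans(n_clusters=num_units, n_init=10, random_state=0)
    kmeans.fit(frame_features)
    return kmeans


def encode_to_units(kmeans: KMeans, utterance_features: np.ndarray) -> list[int]:
    """Map each frame of one utterance to its nearest codeword index and collapse
    consecutive repeats, yielding a short discrete unit sequence."""
    frame_units = kmeans.predict(utterance_features)
    units = [int(frame_units[0])]
    for u in frame_units[1:]:
        if int(u) != units[-1]:
            units.append(int(u))
    return units


# toy usage with random features standing in for speech-encoder outputs
rng = np.random.default_rng(0)
corpus_feats = rng.normal(size=(5000, 768)).astype(np.float32)
units_model = learn_acoustic_units(corpus_feats, num_units=50)
print(encode_to_units(units_model, corpus_feats[:200])[:10])
```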

.bib [Mingote23] | .pdf

2022

Sameer Khurana, Antoine Laurent, James Glass

ICASSP 2022

We propose a simple and effective cross-lingual transfer learning method to adapt monolingual wav2vec-2.0 models for Automatic Speech Recognition (ASR) in resource-scarce languages. We show that a monolingual wav2vec-2.0 is a good few-shot ASR learner in several languages. We improve its performance further via several iterations of Dropout Uncertainty-Driven Self-Training (DUST) by using a moderate-sized unlabeled speech dataset in the target language. A key finding of this work is that the adapted monolingual wav2vec-2.0 achieves similar performance as the topline multilingual XLSR model, which is trained on fifty-three languages, on the target language ASR task.

.bib [khurana22] | .pdf
Martin Lebourdais, Marie Tahon, Antoine Laurent, Anthony Larcher, Sylvain Meignier

JEP 2022

.bib [lebourdais22b] |
Valentin Pelloin, Nathalie Camelin, Antoine Laurent, Renato De Mori, Sylvain Meignier

JEP 2022

.bib [pelloin22] |
Marcely Zanon Boito, John Ortega, Hugo Riguidel, Antoine Laurent, Loïc Barrault, Fethi Bougares, Firas Chaabani, Ha Nguyen, Florentin Barbier, Souhir Gahbiche, Yannick Estève

IWSLT 2022

.bib [mzboito22] |
Nicolas Hervé, Valentin Pelloin, Benoit Favre, Franck Dary, Antoine Laurent, Sylvain Meignier, Laurent Besacier

ACL 2022

This paper aims at improving spoken language modeling (LM) using a very large amount of automatically transcribed speech. We leverage the INA (French National Audiovisual Institute) collection and obtain 19GB of text after applying ASR on 350,000 hours of diverse TV shows. From this, spoken language models are trained either by fine-tuning an existing LM (FlauBERT) or by training an LM from scratch. The new models (FlauBERT-Oral) are shared with the community and are evaluated not only in terms of word prediction accuracy but also on two downstream tasks: classification of TV shows and syntactic parsing of speech. Experimental results show that FlauBERT-Oral is better than its initial FlauBERT version, demonstrating that, despite its inherent noisy nature, ASR-generated text can be useful to improve spoken language modeling.

.bib [herve22] |
Rémi Uro, David Doukhan, Albert Rilliard, Laetitia Larcher, Anissa-Claire Adgharouamane, Marie Tahon, Antoine Laurent

LREC 2022

This paper presents a semi-automatic approach to create a diachronic corpus of voices balanced for speaker age, gender, and recording period, according to 32 categories (2 genders, 4 age ranges and 4 recording periods). Corpora were selected at the French National Audiovisual Institute (INA) to obtain at least 30 speakers per category (a total of 960 speakers; only 874 have been found so far). For each speaker, speech excerpts were extracted from audiovisual documents using an automatic pipeline consisting of speech detection, background music and overlapped speech removal, and speaker diarization, used to present clean speaker segments to human annotators identifying target speakers. This pipeline proved highly effective, cutting down manual processing by a factor of ten. An evaluation of the quality of the automatic processing and of the final output is provided. It shows that the automatic processing is comparable to up-to-date processes, and that the output provides high-quality speech for most of the selected excerpts. This method shows promise for creating large corpora of known target speakers.

.bib [Uro22] |
Sameer Khurana, Antoine Laurent, James Glass

IEEE Journal of Selected Topics in Signal Processing

We propose SAMU-XLSR: a Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation learning framework. Unlike previous works on speech representation learning, which learn multilingual contextual speech embeddings at the resolution of an acoustic frame (10-20 ms), this work focuses on learning multimodal (speech-text) multilingual speech embeddings at the resolution of a sentence (5-10 s), such that the embedding vector space is semantically aligned across different languages. We combine the state-of-the-art multilingual acoustic frame-level speech representation learning model XLSR with the Language Agnostic BERT Sentence Embedding (LaBSE) model to create an utterance-level multimodal multilingual speech encoder, SAMU-XLSR. Although we train SAMU-XLSR with only multilingual transcribed speech data, cross-lingual speech-text and speech-speech associations emerge in its learned representation space. To substantiate our claims, we use the SAMU-XLSR speech encoder in combination with a pre-trained LaBSE text sentence encoder for cross-lingual speech-to-text translation retrieval, and SAMU-XLSR alone for cross-lingual speech-to-speech translation retrieval. We highlight these applications by performing several cross-lingual text and speech translation retrieval tasks across several datasets.
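
A toy PyTorch sketch of the utterance-level alignment idea described above follows; the placeholder encoder, dimensions and mean pooling stand in for XLSR and LaBSE and are illustrative assumptions only.

```python
import torch
from torch import nn
import torch.nn.functional as F


class UtteranceSpeechEncoder(nn.Module):
    """Placeholder for a frame-level speech encoder (e.g. XLSR) followed by
    mean pooling and a projection to the text-embedding dimension."""
    def __init__(self, frame_dim: int = 1024, text_dim: int = 768):
        super().__init__()
        self.frame_encoder = nn.Linear(80, frame_dim)    # stand-in for the real encoder
        self.projection = nn.Linear(frame_dim, text_dim)

    def forward(self, log_mel: torch.Tensor) -> torch.Tensor:
        frames = self.frame_encoder(log_mel)             # (batch, time, frame_dim)
        pooled = frames.mean(dim=1)                      # utterance-level vector
        return F.normalize(self.projection(pooled), dim=-1)


def alignment_loss(speech_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Pull each pooled speech embedding towards the (frozen) text sentence
    embedding of its transcription by maximising cosine similarity."""
    return (1.0 - F.cosine_similarity(speech_emb, F.normalize(text_emb, dim=-1))).mean()


# toy batch: 4 utterances of 200 frames with 80 mel bins, frozen 768-d text targets
speech = torch.randn(4, 200, 80)
labse_targets = torch.randn(4, 768)   # in practice: LaBSE embeddings of the transcripts
model = UtteranceSpeechEncoder()
loss = alignment_loss(model(speech), labse_targets)
loss.backward()
print(float(loss))
```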

.bib [samuxlsr22] |
Martin Lebourdais, Marie Tahon, Antoine Laurent, Sylvain Meignier, Anthony Larcher

LREC 2022

Our main goal is to study the interactions between speakers according to their gender and role in broadcast media. In this paper, we propose an extensive study of gender and overlap annotations in various speech corpora mainly dedicated to diarisation or transcription tasks. We point out the issue of the heterogeneity of the annotation guidelines for both overlapping speech and gender categories. On top of that, we analyse how the speech content (casual speech, meetings, debate, interviews, etc.) impacts the distribution of overlapping speech segments. On a small dataset of 93 recordings from the French channel LCP, we intend to characterise the interactions between speakers according to their gender. Finally, we propose a method which aims to highlight active speech areas in terms of interactions between speakers. Such a visualisation tool could improve the efficiency of qualitative studies conducted by researchers in human sciences.

.bib [lebourdais22] |
Martin Lebourdais, Marie Tahon, Antoine Laurent, Sylvain Meignier

Interspeech 2022

This article focuses on overlapped speech and gender detection in order to study interactions between women and men in French audiovisual media (Gender Equality Monitoring project). In this application context, we need to automatically segment the speech signal according to speaker gender, and to identify when at least two speakers speak at the same time. We propose to use the WavLM model, which has the advantage of being pre-trained on a huge amount of speech data, to build overlapped speech detection (OSD) and gender detection (GD) systems. In this study, we use two different corpora: the DIHARD III corpus, which is well adapted to the OSD task but lacks gender information, and the ALLIES corpus, which fits the project's application context. Our best OSD system is a Temporal Convolutional Network (TCN) with WavLM pre-trained features as input, which reaches a new state-of-the-art F1-score on DIHARD. A neural GD system is trained with WavLM inputs on a gender-balanced subset of the French broadcast news ALLIES data, and obtains an accuracy of 94.9%. This work opens new perspectives for human science researchers regarding the differences of representation between women and men in French media.

.bib [lebourdaisIS22] |
Valentin Pelloin, Franck Dary, Nicolas Hervé, Benoit Favre, Nathalie Camelin, Antoine Laurent, Laurent Besacier

Interspeech 2022

We aim at improving spoken language modeling (LM) using a very large amount of automatically transcribed speech. We leverage the INA (French National Audiovisual Institute) collection and obtain 19GB of text after applying ASR on 350,000 hours of diverse TV shows. From this, spoken language models are trained either by fine-tuning an existing LM (FlauBERT) or by training an LM from scratch. New models (FlauBERT-Oral) are shared with the community and evaluated on 3 downstream tasks: spoken language understanding, classification of TV shows and speech syntactic parsing. Results show that FlauBERT-Oral can be beneficial compared to its initial FlauBERT version, demonstrating that, despite its inherent noisy nature, ASR-generated text can be used to build spoken language models.

.bib [pelloinIS22] |

2021

Valentin Pelloin, Nathalie Camelin, Antoine Laurent, Renato de Mori, Antoine Caubrière, Yannick Estève, Sylvain Meignier

ICASSP 2021

In this paper, we propose a novel end-to-end sequence-to-sequence spoken language understanding model using an attention mechanism. It reliably selects contextual acoustic features in order to hypothesize semantic contents. An initial architecture capable of extracting all pronounced words and concepts from acoustic spans is designed and tested. With a shallow fusion language model, this system reaches a 13.6 concept error rate (CER) and an 18.5 concept value error rate (CVER) on the French MEDIA corpus, achieving an absolute 2.8-point reduction compared to the state of the art. Then, an original model is proposed for hypothesizing concepts and their values. This transduction reaches a 15.4 CER and a 21.6 CVER without any new type of context.

.bib [pelloin21] |
Jean Carrive, Abdelkrim Beloued, Pascale Goetschel, Serge Heiden, Antoine Laurent, Pasquale Lisena, Franck Mazuet, Sylvain Meignier, Bénédicte Pincemin, Géraldine Poels, Raphaël Troncy

Digital Humanities Quarterly

The ANTRACT project is a cross-disciplinary apparatus dedicated to the analysis of the French newsreel company Les Actualités Françaises (1945-1969) and its film productions. Founded during the liberation of France, this state-owned company filmed more than 20,000 news reports shown in French cinemas and throughout the world over its 24 years of activity. The project brings together research organizations with a dual historical and technological perspective. ANTRACT’s goal is to study the production process, the film content, the way historical events are represented and the audience reception of Les Actualités Françaises newsreels using innovative AI-based data processing tools developed by partners specialized in image, audio, and text analysis. This article focuses on the data processing apparatus and tools of the project. Automatic content analysis is used to select data, to segment video units and typescript images, and to align them with their archival description. Automatic speech recognition provides a textual representation and natural language processing can extract named entities from the voice-over recording; automatic visual analysis is applied to detect and recognize faces of well-known characters in videos. These multifaceted data can then be queried and explored with the TXM text-mining platform. The results of these automatic analysis processes are feeding the Okapi platform, a client-server software that integrates documentation, information retrieval, and hypermedia capabilities within a single environment based on the Semantic Web standards. The complete corpus of Les Actualités Françaises, enriched with data and metadata, will be made available to the scientific community by the end of the project.

.bib [carrive21] | .pdf
Hervé Bredin, Antoine Laurent

Interspeech 2021

Speaker segmentation consists in partitioning a conversation between one or more speakers into speaker turns. Usually addressed as the late combination of three sub-tasks (voice activity detection, speaker change detection, and overlapped speech detection), we propose to train an end-to-end segmentation model that does it directly. Inspired by the original end-to-end neural speaker diarization approach (EEND), the task is modeled as a multi-label classification problem using permutation-invariant training. The main difference is that our model operates on short audio chunks (5 seconds) but at a much higher temporal resolution (every 16ms). Experiments on multiple speaker diarization datasets conclude that our model can be used with great success on both voice activity detection and overlapped speech detection. Our proposed model can also be used as a post-processing step, to detect and correctly assign overlapped speech regions. Relative diarization error rate improvement over the best considered baseline (VBx) reaches 18% on AMI, 17% on DIHARD 3, and 16% on VoxConverse.
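
The multi-label, permutation-invariant formulation mentioned above can be illustrated with a small loss function; the speaker count, frame rate and random activations below are assumptions for the sketch, not the authors' code.

```python
from itertools import permutations

import torch
import torch.nn.functional as F


def permutation_invariant_bce(predictions: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """predictions, targets: (frames, speakers) with per-frame speaker activity in [0, 1].
    The loss is the binary cross-entropy under the best speaker permutation."""
    num_speakers = targets.shape[1]
    losses = []
    for perm in permutations(range(num_speakers)):
        permuted = targets[:, list(perm)]
        losses.append(F.binary_cross_entropy(predictions, permuted))
    return torch.stack(losses).min()


# toy 5-second chunk at roughly 16 ms resolution (~312 frames) with 3 speakers
frames, speakers = 312, 3
pred = torch.sigmoid(torch.randn(frames, speakers))
ref = (torch.rand(frames, speakers) > 0.8).float()   # sparse activity, overlaps allowed
print(float(permutation_invariant_bce(pred, ref)))
```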

.bib [bredin21] | .pdf

2020

Silvio Montrésor, Marie Tahon, Antoine Laurent, Pascal Picart

SPIE Photonics Europe International Symposium

.bib [montresor20] |
Salima Mdhaffar, Yannick Estève, Antoine Laurent, Nicolas Hernandez, Richard Dufour, Delphine Charlet, Geraldine Damnati, Solen Quiniou, Nathalie Camelin

LREC 2020

.bib [mdhaffar2020] |
Antoine Caubrière, Sahar Ghannay, Natalia Tomashenko, Renato De Mori, Antoine Laurent, Emmanuel Morin, Yannick Estève

ICASSP 2020

.bib [caubriere2020b] | .pdf
Antoine Caubrière, Sophie Rosset, Yannick Estève, Antoine Laurent, Emmanuel Morin

LREC 2020

.bib [caubriere2020] |
Antoine Caubrière, Sophie Rosset, Yannick Estève, Antoine Laurent, Emmanuel Morin

JEP 2020

.bib [caubriere2020jep] |
Antoine Caubrière, Yannick Estève, Antoine Laurent, Emmanuel Morin

Interspeech 2020

.bib [caubriere2020int] |
Hans Dolfing, Jérome Bellegarda, Jan Chorowski, Ricard Marxer, Antoine Laurent

ICFHR 2020

.bib [dolfing20] |
Silvio Montrésor, Marie Tahon, Antoine Laurent, Pascal Picart

APL Photonics AIP Publishing LLC

.bib [montresor20b] |
Adrian Łańcucki, Jan Chorowski, Guillaume Sanchez, Ricard Marxer, Nanxin Chen, Hans J G A Dolfing, Sameer Khurana, Tanel Alumäe, Antoine Laurent

IJCNN 2020

In this paper we demonstrate methods for reliable and efficient training of discrete representations using Vector-Quantized Variational Auto-Encoder models (VQ-VAEs). Discrete latent variable models have been shown to learn nontrivial representations of speech, applicable to unsupervised voice conversion and reaching state-of-the-art performance on unit discovery tasks. For unsupervised representation learning, they became viable alternatives to continuous latent variable models such as the Variational Auto-Encoder (VAE). However, training deep discrete variable models is challenging, due to the inherent non-differentiability of the discretization operation. In this paper we focus on VQ-VAE, a state-of-the-art discrete bottleneck model shown to perform on par with its continuous counterparts. It quantizes encoder outputs with on-line k-means clustering. We show that the codebook learning can suffer from poor initialization and non-stationarity of clustered encoder outputs. We demonstrate that these can be successfully overcome by increasing the learning rate for the codebook and periodic data-dependent codeword re-initialization. As a result, we achieve more robust training across different tasks, and significantly increase the usage of latent codewords even for large codebooks. This has a practical benefit, for instance, in unsupervised representation learning, where large codebooks may lead to disentanglement of latent representations.
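
A small sketch of the data-dependent codeword re-initialisation idea follows; the usage threshold, codebook size and random tensors are illustrative assumptions rather than the paper's recipe.

```python
import torch


def reinit_dead_codewords(codebook: torch.Tensor, encoder_outputs: torch.Tensor,
                          usage_counts: torch.Tensor, min_usage: int = 1) -> torch.Tensor:
    """Replace rarely used codewords with encoder outputs drawn from the current batch.
    codebook: (codes, dim); encoder_outputs: (vectors, dim); usage_counts: (codes,)."""
    dead = (usage_counts < min_usage).nonzero(as_tuple=True)[0]
    if dead.numel() == 0:
        return codebook
    # sample one encoder output per dead codeword (with replacement if the batch is small)
    idx = torch.randint(0, encoder_outputs.shape[0], (dead.numel(),))
    codebook = codebook.clone()
    codebook[dead] = encoder_outputs[idx]
    return codebook


# toy example: a 512-codeword codebook where roughly half the codes were never selected
codes, dim = 512, 64
codebook = torch.randn(codes, dim)
batch_outputs = torch.randn(2048, dim)
usage = torch.randint(0, 2, (codes,))
codebook = reinit_dead_codewords(codebook, batch_outputs, usage)
```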

.bib [lancucki20] | .pdf
Sameer Khurana, Antoine Laurent, James Glass

Arxiv

More than half of the 7,000 languages in the world are in imminent danger of going extinct. Traditional methods of documenting language proceed by collecting audio data followed by manual annotation by trained linguists at different levels of granularity. This time-consuming and painstaking process could benefit from machine learning. Many endangered languages do not have any orthographic form but usually have speakers that are bilingual and trained in a high-resource language. It is relatively easy to obtain textual translations corresponding to speech. In this work, we provide a multimodal machine learning framework for speech representation learning by exploiting the correlations between the two modalities, namely speech and its corresponding text translation. Here, we construct a convolutional neural network audio encoder capable of extracting linguistic representations from speech. The audio encoder is trained to perform a speech-translation retrieval task in a contrastive learning framework. By evaluating the learned representations on a phone recognition task, we demonstrate that linguistic representations emerge in the audio encoder's internal representations as a by-product of learning to perform the retrieval task.
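
A compact sketch of a contrastive speech-translation retrieval objective in this spirit is given below (a symmetric InfoNCE loss over a batch of paired embeddings); the temperature, dimensions and random inputs are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def speech_text_contrastive_loss(speech_emb: torch.Tensor, text_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """speech_emb, text_emb: (batch, dim), row i of each side comes from the same utterance.
    Matched pairs are pulled together; all other pairs in the batch act as negatives."""
    speech_emb = F.normalize(speech_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = speech_emb @ text_emb.t() / temperature       # (batch, batch) similarity matrix
    targets = torch.arange(speech_emb.shape[0])
    # symmetric loss: retrieve the translation from speech and the speech from its translation
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


# toy batch of 8 paired embeddings
loss = speech_text_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(float(loss))
```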

.bib [khurana20b] | .pdf
Sameer Khurana, Antoine Laurent, Wei-Ning Hsu, Jan Chorowski, Adrian Lancucki, Ricard Marxer, James Glass

Interspeech 2020

Probabilistic Latent Variable Models (LVMs) provide an alternative to self-supervised learning approaches for linguistic representation learning from speech. LVMs admit an intuitive probabilistic interpretation where the latent structure shapes the information extracted from the signal. Even though LVMs have recently seen a renewed interest due to the introduction of Variational Autoencoders (VAEs), their use for speech representation learning remains largely unexplored. In this work, we propose Convolutional Deep Markov Model (ConvDMM), a Gaussian state-space model with non-linear emission and transition functions modelled by deep neural networks. This unsupervised model is trained using black box variational inference. A deep convolutional neural network is used as an inference network for structured variational approximation. When trained on a large scale speech dataset (LibriSpeech), ConvDMM produces features that significantly outperform multiple self-supervised feature extracting methods on linear phone classification and recognition on the Wall Street Journal dataset. Furthermore, we found that ConvDMM complements self-supervised methods like Wav2Vec and PASE, improving on the results achieved with any of the methods alone. Lastly, we find that ConvDMM features enable learning better phone recognizers than any other features in an extreme low-resource regime with few labeled training examples.

.bib [khurana20] | .pdf

2019

Salima Mdhaffar, Yannick Estève, Nicolas Hernandez, Antoine Laurent, Solen Quiniou

26e conférence sur le Traitement Automatique des Langues Naturelles (TALN 2019)

.bib [Mdhaffar19b] |
Antoine Caubrière, Natalia Tomashenko, Yannick Estève, Antoine Laurent, Emmanuel Morin

26e conférence sur le Traitement Automatique des Langues Naturelles (TALN 2019)

.bib [caubriere2019curriculum] |
Antoine Caubrière, Natalia Tomashenko, Antoine Laurent, Emmanuel Morin, Nathalie Camelin, Yannick Estève

Interspeech 2019, Annual Conference of the International Speech Communication Association

.bib [Caubriere2019] |
Salima Mdhaffar, Yannick Estève, Nicolas Hernandez, Antoine Laurent, Solen Quiniou

Interspeech 2019, Annual Conference of the International Speech Communication Association

.bib [Mdhaffar2019] | .pdf
Natalia Tomashenko, Antoine Caubrière, Yannick Estève, Antoine Laurent, Emmanuel Morin

International Conference on Statistical Language and Speech Processing

.bib [tomashenko2019recent] |
Jan Chorowski, Nanxin Chen, Ricard Marxer, Hans J G A Dolfing, Adrian Łańcucki, Guillaume Sanchez, Tanel Alumäe, Antoine Laurent

NeurIPS 2019 workshop - Perception as generative reasoning - Structure, Causality, Probability

.bib [chorowski19] |
Pierre Gagnepain, Thomas Vallée, Serge Heiden, Matthieu Decorde, Jean-Luc Gauvain, Antoine Laurent, Carine Klein-Peschanski, Fausto Viader, Denis Peschanski, Francis Eustache

Nature Human Behaviour

.bib [gagnepain19] | .pdf

2018

S. Mdhaffar, A. Laurent, Y. Estève

25e conférence sur le Traitement Automatique des Langues Naturelles (TALN 2018)

.bib [Mdhaffar18b] |
S. Mdhaffar, A. Laurent, Y. Estève

XXXIIe Journees d'Etudes sur la Parole (JEP 2018)

.bib [Mdhaffar2018] |
S. Ghannay, A. Caubrière, Y. Estève, N. Camelin, E. Simonnet, A. Laurent

IEEE Spoken Language Technology Workshop

.bib [Ghannay2018] |

2017

Guangpu Huang, Thiago Fraga-Silva, Lori Lamel, Jean-Luc Gauvain, Antoine Laurent, Rasa Lileikyte, Abdel Massouadi

ICASSP 2017, The 42nd IEEE International Conference on Acoustics, Speech and Signal Processing

This paper reports on investigations of using two techniques for language model text data augmentation for low-resourced automatic speech recognition and keyword search. Low-resourced languages are characterized by limited training materials, which typically results in high out-of-vocabulary (OOV) rates and poor language model estimates. One technique makes use of recurrent neural networks (RNNs) using word or subword units. Word-based RNNs keep the same system vocabulary, so they cannot reduce the OOV rate, whereas subword units can reduce the OOV rate but generate many false combinations. A complementary technique is based on automatic machine translation, which requires parallel texts and is able to add words to the vocabulary. These methods were assessed on 10 languages in the context of the Babel program and the NIST OpenKWS evaluation. Although improvements vary across languages with both methods, small gains were generally observed in terms of word error rate reduction and improved keyword search performance.

.bib [Huang17] | .pdf
Rasa Lileikyte, Thiago Fraga-Silva, Lori Lamel, Jean-Luc Gauvain, Antoine Laurent, Guangpu Huang

ICASSP 2017, The 42nd IEEE International Conference on Acoustics, Speech and Signal Processing

In this paper we aim to enhance keyword search for conversational telephone speech under low-resourced conditions. Two techniques to improve the detection of out-of-vocabulary keywords are assessed in this study: using extra text resources to augment the lexicon and language model, and via subword units for keyword search. Two approaches for data augmentation are explored to extend the limited amount of transcribed conversational speech: using conversational-like Web data and texts generated by recurrent neural networks. Contrastive comparisons of subword-based systems are performed to evaluate the benefits of multiple subword decodings and single decoding. Keyword search results are reported for all the techniques, but only some improve performance. Results are reported for the Mongolian and Igbo languages using data from the 2016 Babel program.

.bib [Lileikyte17] | .pdf

2016

G. Gelly, J.L. Gauvain, L. Lamel, A. Laurent, V.B. Le, A. Messaoudi

Odyssey 2016

This paper describes our development work to design a language recognition system that can discriminate closely related languages and dialects of the same language. The work was a joint effort by LIMSI and Vocapia Research in preparation for the NIST 2015 Language Recognition Evaluation (LRE). The language recognition system results from a fusion of four core classifiers: a phonotactic component using DNN acoustic models, two purely acoustic components using an RNN model and an i-vector model, and a lexical component. Each component generates language posterior probabilities optimized to maximize the LID NCE, making their combination simple and robust. The motivation for using multiple components representing different speech knowledge is that some dialect distinctions may not be manifest at the acoustic level. We report experiments on the NIST LRE15 data and provide an analysis of the results and some post-evaluation contrasts. The 2015 LRE task focused on the identification of 20 languages clustered in 6 groups (Arabic, Chinese, English, French, Slavic and Iberic) of similar languages. Results are reported using the NIST Cavg metric, which served as the primary metric for the OpenLRE15 evaluation. Results are also reported for the EER and the LER.

.bib [Gelly16] | .pdf
Antoine Laurent, Thiago Fraga-Silva, Lori Lamel, Jean-Luc Gauvain

ICASSP 2016, The 41st IEEE International Conference on Acoustics, Speech and Signal Processing

In this paper we investigate various techniques in order to build effective speech to text (STT) and keyword search (KWS) systems for low resource conversational speech. Subword decoding and graphemic mappings were assessed in order to detect out-of-vocabulary keywords. To deal with the limited amount of transcribed data, semi-supervised training and data selection methods were investigated. Robust acoustic features produced via data augmentation were evaluated for acoustic modeling. For language modeling, automatically retrieved conversational-like Webdata was used, as well as neural network based models. We report STT improvements with all the techniques, but interestingly only some improve KWS performance. Results are reported for the Swahili language in the context of the 2015 OpenKWS Evaluation.

.bib [Laurent16] | .pdf
A. Gorin, R. Lileikyte, G. Huang, L. Lamel, J.L. Gauvain, A. Laurent

Interspeech 2016, Annual Conference of the International Speech Communication Association

This research extends our earlier work on using machine translation (MT) and word-based recurrent neural networks to augment language model training data for keyword search in conversational Cantonese speech. MT-based data augmentation is applied to two language pairs: English-Lithuanian and English-Amharic. Using filtered N-best MT hypotheses for language modeling is found to perform better than just using the 1-best translation. Target language texts collected from the Web and filtered to select conversational-like data are used in several manners. In addition to using Web data for training the language model of the speech recognizer, we further investigate using this data to improve the language model and phrase table of the MT system to get better translations of the English data. Finally, generating text data with a character-based recurrent neural network is investigated. This approach allows new word forms to be produced, providing a way to reduce the out-of-vocabulary rate and thereby improve keyword spotting performance. We study how these different methods of language model data augmentation impact speech-to-text and keyword spotting performance for the Lithuanian and Amharic languages. The best results are obtained by combining all of the explored methods.
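
As an illustration of the character-based generation step described above, the sketch below samples text from a toy, untrained character-level GRU; a real setup would first train the model on target-language text and use the proper alphabet.

```python
import torch
from torch import nn

# toy character inventory; a real setup would use the target-language alphabet
alphabet = list("abcdefghijklmnopqrstuvwxyz '")


class CharRNN(nn.Module):
    def __init__(self, vocab_size: int, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, ids, state=None):
        output, state = self.rnn(self.embed(ids), state)
        return self.out(output), state


def sample_text(model: CharRNN, length: int = 80, temperature: float = 1.0) -> str:
    """Sample character sequences; novel word forms generated this way can reduce the OOV rate."""
    ids = torch.zeros(1, 1, dtype=torch.long)
    state, chars = None, []
    with torch.no_grad():
        for _ in range(length):
            logits, state = model(ids, state)
            probs = torch.softmax(logits[0, -1] / temperature, dim=-1)
            next_id = torch.multinomial(probs, 1)
            chars.append(alphabet[int(next_id)])
            ids = next_id.view(1, 1)
    return "".join(chars)


print(sample_text(CharRNN(len(alphabet))))
```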

.bib [Gorin16] | .pdf

2015

Thiago Fraga-Silva, Jean-Luc Gauvain, Lori Lamel, Antoine Laurent, Viet-Bac Le, Abdel Messaoudi

Interspeech 2015, Annual Conference of the International Speech Communication Association

This paper presents first results in using active learning (AL) for training data selection in the context of the IARPA-Babel program. Given an initial training data set, we aim to automatically select additional data (from an untranscribed pool data set) for manual transcription. Initial and selected data are then used to build acoustic and language models for speech recognition. The goal of the AL task is to outperform a baseline system built using a pre-defined data selection with the same amount of data, the Very Limited Language Pack (VLLP) condition. AL methods based on different selection criteria have been explored. Compared to the VLLP baseline, improvements are obtained in terms of Word Error Rate and Actual Term Weighted Values for the Lithuanian language. A description of methods and an analysis of the results are given. The AL selection also outperforms the VLLP baseline for other IARPA-Babel languages, and will be further tested in the upcoming NIST OpenKWS 2015 evaluation.

.bib [Fraga15] | .pdf
Thiago Fraga-Silva, Antoine Laurent, Jean-Luc Gauvain, Lori Lamel, Viet-Bac Le, Abdel Messaoudi

ASRU 2015, 2015 IEEE Automatic Speech Recognition and Understanding Workshop

This paper extends recent research on training data selection for speech transcription and keyword spotting system development. The techniques were explored in the context of the IARPA-Babel Active Learning (AL) task for 6 languages. Different selection criteria were explored with the goal of improving over a system built using a predefined 3 hour training data set. Four variants of the entropy-based criterion were explored: words, triphones, phones, as well as the use of HMM-states introduced previously (see the IS 2015 entry above). The influence of the number of HMM-states was assessed, as well as whether automatic or manual reference transcripts were used. The combination of selection criteria was investigated, and a novel multi-stage selection method proposed. These methods were also assessed using larger data sets than were permitted in the Babel AL task. Results are reported for the 6 languages. The multi-stage selection was also applied to the surprise language (Swahili) in the NIST OpenKWS 2015 evaluation.
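
One way to picture an entropy-based criterion like those explored above is a greedy selection that maximises the entropy gain of the unit distribution; the toy pool and unit type below are assumptions, and the actual Babel selection pipeline is more involved.

```python
import math
from collections import Counter


def unit_entropy(counts: Counter) -> float:
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)


def greedy_entropy_selection(pool: dict[str, list[str]], budget: int) -> list[str]:
    """pool maps utterance ids to (automatically transcribed) unit sequences, e.g. words
    or phones. Greedily pick the utterances that most increase the entropy of the unit
    distribution of the selected set."""
    selected, counts = [], Counter()
    for _ in range(budget):
        best_id, best_gain = None, float("-inf")
        for utt_id, units in pool.items():
            if utt_id in selected:
                continue
            gain = unit_entropy(counts + Counter(units)) - (unit_entropy(counts) if counts else 0.0)
            if gain > best_gain:
                best_id, best_gain = utt_id, gain
        selected.append(best_id)
        counts += Counter(pool[best_id])
    return selected


pool = {"utt1": ["a", "b", "a"], "utt2": ["a", "a", "a"], "utt3": ["c", "d", "e"]}
print(greedy_entropy_selection(pool, budget=2))
```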

.bib [Fraga15b] | .pdf

2014

Laurent, A., Lamel, L.

SLTU 2014, Spoken Language Technologies for Under-resourced languages

This paper investigates the development of a speech-to-text transcription system for the Korean language in the context of the DGA RAPID Rapmat project. Korean is an alpha-syllabary language spoken by about 78 million people worldwide. As only a small amount of manually transcribed audio data were available, the acoustic models were trained on audio data downloaded from several Korean websites in an unsupervised manner, and the language models were trained on web texts. The reported word and character error rates are estimates, as the development corpus used in these experiments was also constructed from the untranscribed audio data, the web texts and automatic transcriptions. Several variants for unsupervised acoustic model training were compared to assess the influence of the vocabulary size (200k vs 2M), the type of language model (words vs characters), the acoustic unit (phonemes vs half-syllables), as well as incremental batch vs iterative decoding of the untranscribed audio corpus.

.bib [Laurent14] | .pdf
Bredin, H., Laurent, A., Sarkar, A., Le, V.-B., Barras, Claude, Rosset, Sophie

Odyssey 2014, The Speaker and Language Recognition Workshop

We address the problem of named speaker identification in TV broadcast which consists in answering the question "who speaks when?" with the real identity of speakers, using person names automatically obtained from speech transcripts. While existing approaches rely on a first speaker diarization step followed by a local name propagation step to speaker clusters, we propose a unified framework called person instance graph where both steps are jointly modeled as a global optimization problem, then solved using integer linear programming. Moreover, when available, acoustic speaker models can be added seamlessly to the graph structure for joint named and acoustic speaker identification -- leading to a 10% error decrease (from 45% down to 35%) over a state-of-the-art i-vector speaker identification system on the REPERE TV broadcast corpus.

.bib [Laurent14b] | .pdf
Laurent, A., Camelin, N., Raymond, C.

Interspeech 2014, 15th Annual Conference of the International Speech Communication Association

In this article, we tackle the problem of speaker role detection from broadcast news shows. In the literature, many proposed solutions are based on the combination of various features coming from acoustic, lexical and semantic information with a machine learning algorithm. Many previous studies mention the use of boosting over decision stumps to combine efficiently these features. In this work, we propose a modification of this state-of-the-art machine learning algorithm changing the weak learner (decision stumps) by small decision trees, denoted bonsai trees. Experiments show that using bonsai trees as weak learners for the boosting algorithm largely improves both system error rate and learning time.
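
The change from stumps to bonsai trees can be sketched with scikit-learn's AdaBoost implementation, as below; the synthetic features are placeholders for the acoustic, lexical and semantic descriptors, and recent scikit-learn versions use the `estimator` argument (older ones `base_estimator`).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for acoustic/lexical/semantic role-detection features
X, y = make_classification(n_samples=1000, n_features=40, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

stump_booster = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                                   n_estimators=200, random_state=0)
bonsai_booster = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=3),
                                    n_estimators=200, random_state=0)

for name, clf in [("stumps", stump_booster), ("bonsai trees", bonsai_booster)]:
    score = cross_val_score(clf, X, y, cv=3).mean()
    print(f"boosting over {name}: accuracy {score:.3f}")
```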

.bib [Laurent14e] | .pdf
Laurent, A., Hartmann, W., Lamel, L.

ISCSLP@Interspeech 2014, 15th Annual Conference of the International Speech Communication Association

This paper investigates unsupervised training strategies for the Korean language in the context of the DGA RAPID Rapmat project. As with previous studies, we begin with only a small amount of manually transcribed data to build preliminary acoustic models. Using the initial models, a larger set of untranscribed audio data is decoded to produce approximate transcripts. We compare both GMM and DNN acoustic models for both the unsupervised transcription and the final recognition system. While the DNN acoustic models produce a lower word error rate on the test set, training on the transcripts from the GMM system provides the best overall performance. We also achieve better performance by expanding the original phone set. Finally, we examine the efficacy of automatically building a test set by comparing system performance both before and after manually correcting the test set.

.bib [Laurent14d] | .pdf
Laurent, A., Meignier, S., Deléglise, P.

In Computer Speech And Language

Accurate phonetic transcription of proper nouns can be an important resource for commercial applications that embed speech technologies, such as audio indexing and vocal phone directory lookup. However, an accurate phonetic transcription is more difficult to obtain for proper nouns than for regular words. Indeed, phonetic transcription of a proper noun depends on both the origin of the speaker pronouncing it and the origin of the proper noun itself.This work proposes a method that allows the extraction of phonetic transcriptions of proper nouns using actual utterances of those proper nouns, thus yielding transcriptions based on practical use instead of mere pronunciation rules.The proposed method consists in a process that first extracts phonetic transcriptions, and then iteratively filters them. In order to initialize the process, an alignment dictionary is used to detect word boundaries. A rule-based grapheme-to-phoneme generator (LIA_PHON), a knowledge-based approach (JSM), and a Statistical Machine Translation based system were evaluated for this alignment. As a result, compared to our reference dictionary (BDLEX supplemented by LIA_PHON for missing words) on the ESTER 1 French broadcast news corpus, we were able to significantly decrease the Word Error Rate (WER) on segments of speech with proper nouns, without negatively affecting the WER on the rest of the corpus.

.bib [Laurent14c] | .pdf
Bouaziz, M., Laurent, A., Estève, Y.

JEP 2014, Journées d'Etudes sur la Parole

Some automatic speech recognition (ASR) systems reach error rates on the order of 10%. However, particularly for the automatic indexing of multimedia documents on the web, ASR systems face the problem of out-of-vocabulary words. Named entities make up a large share of these words and are especially important for indexing tasks. In this work, we implement a hybrid decoding solution that uses syllables as sub-lexical units. This method is integrated into the LIUM'08 ASR system developed by the Laboratoire d'Informatique de l'Université du Maine. With a slight degradation of the overall system performance, about 31% of out-of-vocabulary person names are correctly recognized.

.bib [Laurent14j5] | .pdf
Bonneau-Maynard, H., Segal, N., Bilinski, E., Gauvain, J.-L., Gong, L., Lamel, L., Laurent, A., Yvon, F., Despres, J., Josse, Y., Le, V.-B.

JEP 2014, Journées d'Etudes sur la Parole

The RAPMAT project aims to develop speech translation systems by addressing the two processing stages of the complete pipeline: automatic speech recognition (ASR) and machine translation (MT). In the classical setting, the statistical models used by the two systems are estimated independently, from data of different natures (manual transcriptions of speech data for ASR, and bilingual corpora built from textual data for MT). We propose a semi-supervised approach for adapting translation models to speech translation, in which the MT models are trained by incorporating manual and automatic speech transcriptions that have been translated automatically. The approach is evaluated on the French-to-English translation direction. A demonstration prototype for smartphones, including speech translation for the French/English and French/Chinese language pairs, was developed to enable data collection.

.bib [Laurent14j4] | .pdf
Laurent, A., Lamel, L.

JEP 2014, Journées d'Etudes sur la Parole

This paper describes the development of an automatic speech recognition system for Korean. Korean is an alpha-syllabic language spoken by about 78 million people worldwide. The system was developed using very little manually annotated data. The acoustic models were adapted in an unsupervised manner using data from several Korean news websites. The development corpus contains approximate transcriptions of the audio documents: it was transcribed automatically and aligned with data from the same websites. We compare several approaches in this work, namely language models using different units for unsupervised training and for decoding (characters, and words with vocabularies of different sizes), the use of phonemes versus half-syllable units, and two different unsupervised training approaches.

.bib [Laurent14j3] | .pdf
Laurent, A., Guinaudeau, C., Roy, A.

JEP 2014, Journées d'Etudes sur la Parole

This article describes the methods put in place to enable the analysis of a corpus of audiovisual documents broadcast over the last 80 years: the MATRICE corpus. We propose an exploration of the data that highlights the different topics and events covered in the corpus. This exploration is first carried out on documentary records produced manually by the documentalists of the Institut National de l'Audiovisuel. We then show, through a qualitative study and an automatic clustering technique, that automatic transcriptions also allow an analysis of the corpus that brings out topics consistent with the processed data.

.bib [Laurent14j2] | .pdf
Laurent, A., Camelin, N., Raymond, C.

JEP 2014, Journées d'Etudes sur la Parole

In this work, we address the problem of speaker role detection in broadcast news shows. In the literature, the proposed solutions combine various features derived from acoustics, from the transcription and/or from its analysis, using machine learning methods. Many studies single out boosting over simple decision rules as one of the most effective algorithms for combining these descriptors. We propose a modification of this state-of-the-art algorithm in which these simple decision rules are replaced by small decision trees that we call bonsai trees. Comparative experiments on the EPAC corpus show that this modification largely improves the system performance while substantially reducing the training time.

.bib [Laurent14j1] | .pdf
Guinaudeau, C., Laurent, A., Bredin, H.

MediaEval 2014 Social Event Detection Task. Working Notes Proceedings

This paper provides an overview of the Social Event Detection (SED) system developed at LIMSI for the 2014 campaign. Our approach is based on a hierarchical agglomerative clustering that uses textual metadata, user-based knowledge and geographical information. These different sources of knowledge, either used separately or in cascade, reach good results for the full clustering subtask with a normalized mutual information equal to 0.95 and F1 scores greater than 0.82 for our best run.
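
For illustration, here is a toy hierarchical agglomerative clustering over a combined distance in the spirit of the system described above; the descriptors, weights and threshold are invented for the example, and recent scikit-learn uses `metric="precomputed"` (older versions call it `affinity`).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
num_photos = 20

# toy per-item descriptors standing in for textual, user-based and geographic metadata
text_emb = rng.normal(size=(num_photos, 16))
geo = rng.uniform(size=(num_photos, 2))
user = rng.integers(0, 5, size=num_photos)

# combine the three sources into one pairwise distance matrix
text_dist = np.linalg.norm(text_emb[:, None] - text_emb[None, :], axis=-1)
geo_dist = np.linalg.norm(geo[:, None] - geo[None, :], axis=-1)
user_dist = (user[:, None] != user[None, :]).astype(float)
distance = 0.5 * text_dist / text_dist.max() + 0.3 * geo_dist / geo_dist.max() + 0.2 * user_dist

clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=0.6,
                                     metric="precomputed", linkage="average")
labels = clustering.fit_predict(distance)
print(labels)
```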

.bib [guinaudeau14] | .pdf

2012

El-Khoury, E., Laurent, A., Meignier, S., Petitrenaud, S.

ICASSP 2012, IEEE International Conference on Acoustics, Speech, and Signal Processing

.bib [Laurent12] | .pdf
Dufour, R., Laurent, A., Estève, Y.

JEP 2012, Journées d'Etudes sur la Parole

.bib [Laurent12j1] | .pdf

2011

Laurent, A., Meignier, S., Merlin, T., Deléglise, P.

ICASSP 2011, IEEE International Conference on Acoustics, Speech, and Signal Processing

.bib [Laurent11] | .pdf

2010

Estève, Y., Deléglise, P., Meignier, S., Petitrenaud, S., Schwenk, H., Barrault, L., Bougares, F., Dufour, R., Jousse, V., Laurent, A., Rousseau, A.

Workshop CMU SPU

.bib [Estève10] |
Laurent, A., Meignier, S., Deléglise, P.

JEP 2010, Journées d'Etudes sur la Parole

.bib [Laurent10j1] | .pdf
Laurent, A., Meignier, S., Merlin, T., Deléglise, P.

Interspeech 2010, 11th Annual Conference of the International Speech Communication Association

.bib [Laurent10-b] | .pdf

2009

Laurent, A., Deléglise, P., Meignier, S.

Interspeech 2009, 10th Annual Conference of the International Speech Communication Association

.bib [Laurent09b] | .pdf
Laurent, A., Merlin, T., Meignier, S., Estève, Y., Deléglise, P.

ICASSP 2009, IEEE International Conference on Acoustics, Speech, and Signal Processing

.bib [Laurent09] | .pdf

2008

Laurent, A., Meignier, S., Estève, Y., Deléglise, P.

JEP 2008, Journées d'Etudes sur la Parole

.bib [Laurent08b] | .pdf
Laurent, A., Merlin, T., Meignier, S., Estève, Y., Deléglise, P.

LREC 2008, Language Resources and Evaluation Conference

.bib [Laurent08] | .pdf

2023

Nauman Dawalatabad, Sameer Khurana, Antoine Laurent, James Glass

ICASSP 2023

Pseudo-label (PL) filtering forms a crucial part of Self-Training (ST) methods for unsupervised domain adaptation. Dropout-based Uncertainty-driven Self-Training (DUST) proceeds by first training a teacher model on source domain labeled data. Then, the teacher model is used to provide PLs for the unlabeled target domain data. Finally, we train a student on augmented labeled and pseudo-labeled data. The process is iterative, where the student becomes the teacher for the next DUST iteration. A crucial step that precedes the student model training in each DUST iteration is filtering out noisy PLs that could lead the student model astray. In DUST, we proposed a simple, effective, and theoretically sound PL filtering strategy based on the teacher model's uncertainty about its predictions on unlabeled speech utterances. We estimate the model's uncertainty by computing disagreement amongst multiple samples drawn from the teacher model during inference by injecting noise via dropout. In this work, we show that DUST's PL filtering, as initially used, may fail under severe source and target domain mismatch. We suggest several approaches to eliminate or alleviate this issue. Further, we bring insights from the research in neural network model calibration to DUST and show that a well-calibrated model correlates strongly with a positive outcome of the DUST PL filtering step.

.bib [Dawalatabad23] | .pdf
Antoine Laurent, Souhir Gahbiche, Ha Nguyen, Haroun Elleuch, Fethi Bougares, Antoine Thiol, Hugo Riguidel, Salima Mdhaffar, Gaëlle Laperrière, Lucas Maison, Sameer Khurana, Yannick Estève

IWSLT@ACL 2023

This paper describes the ON-TRAC consortium speech translation systems developed for IWSLT 2023 evaluation campaign. Overall, we participated in three speech translation tracks featured in the low-resource and dialect speech translation shared tasks, namely; i) spoken Tamasheq to written French, ii) spoken Pashto to written French, and iii) spoken Tunisian to written English. All our primary submissions are based on the end-to-end speech-to-text neural architecture using a pretrained SAMU-XLSR model as a speech encoder and a mbart model as a decoder. The SAMU-XLSR model is built from the XLS-R~128 in order to generate language agnostic sentence-level embeddings. This building is driven by the LaBSE model trained on multilingual text dataset. This architecture allows us to improve the input speech representations and achieve significant improvements compared to conventional end-to-end speech translation systems.

.bib [Laurent23] | .pdf

2022

Sameer Khurana, Antoine Laurent, James Glass

ICASSP 2022

We propose a simple and effective cross-lingual transfer learning method to adapt monolingual wav2vec-2.0 models for Automatic Speech Recognition (ASR) in resource-scarce languages. We show that a monolingual wav2vec-2.0 is a good few-shot ASR learner in several languages. We improve its performance further via several iterations of Dropout Uncertainty-Driven Self-Training (DUST) by using a moderate-sized unlabeled speech dataset in the target language. A key finding of this work is that the adapted monolingual wav2vec-2.0 achieves similar performance as the topline multilingual XLSR model, which is trained on fifty-three languages, on the target language ASR task.

.bib [khurana22] | .pdf
Martin Lebourdais, Marie Tahon, Antoine Laurent, Anthony Larcher, Sylvain Meignier

JEP 2022

.bib [lebourdais22b] |
Valentin Pelloin, Nathalie Camelin, Antoine Laurent, Renato De Mori, Sylvain Meignier

JEP 2022

.bib [pelloin22] |
Marcely Zanon Boito, John Ortega, Hugo Riguidel, Antoine Laurent, Loïc Barrault, Fethi Bougares, Firas Chaabani, Ha Nguyen, Florentin Barbier, Souhir Gahbiche, Yannick Estève

IWSLT 2022

.bib [mzboito22] |
Nicolas Hervé, Valentin Pelloin, Benoit Favre, Franck Dary, Antoine Laurent, Sylvain Meignier, Laurent Besacier

ACL 2022

This papers aims at improving spoken language modeling (LM) using very large amount of automatically transcribed speech. We leverage the INA (French National Audiovisual Institute) collection and obtain 19GB of text after applying ASR on 350,000 hours of diverse TV shows. From this, spoken language models are trained either by fine-tuning an existing LM (FlauBERT) or through training a LM from scratch. The new models (FlauBERT-Oral) are shared with the community and are evaluated not only in terms of word prediction accuracy but also for two downstream tasks: classification of TV shows and syntactic parsing of speech. Experimental results show that FlauBERT-Oral is better than its initial FlauBERT version demonstrating that, despite its inherent noisy nature, ASR-Generated text can be useful to improve spoken language modeling.

.bib [herve22] |
Rémi Uro, David Doukhan, Albert Rilliard, Laetitia Larcher, Anissa-Claire Adgharouamane, Marie Tahon, Antoine Laurent

LREC 2022

This paper presents a semi-automatic approach to create a diachronic corpus of voices balanced for speaker’s age, gender, and recording period, according to 32 categories (2 genders, 4 age ranges and 4 recording periods). Corpora were selected at French National Institute of Audiovisual (INA) to obtain at least 30 speakers per category (a total of 960 speakers; only 874 have be found yet). For each speaker, speech excerpts were extracted from audiovisual documents using an automatic pipeline consisting of speech detection, background music and overlapped speech removal and speaker diarization, used to present clean speaker segments to human annotators identifying target speakers. This pipeline proved highly effective, cutting down manual processing by a factor of ten. Evaluation of the quality of the automatic processing and of the final output is provided. It shows the automatic processing compare to up-to-date process, and that the output provides high quality speech for most of the selected excerpts. This method shows promise for creating large corpora of known target speakers.

.bib [Uro22] |
Martin Lebourdais, Marie Tahon, Antoine Laurent, Sylvain Meignier, Anthony Larcher

LREC 2022

Our main goal is to study the interactions between speakers according to their gender and role in broadcast media. In this paper, we propose an extensive study of gender and overlap annotations in various speech corpora mainly dedicated to diarisation or transcription tasks. We point out the issue of the heterogeneity of the annotation guidelines for both overlapping speech and gender categories. On top of that, we analyse how the speech content (casual speech, meetings, debates, interviews, etc.) impacts the distribution of overlapping speech segments. On a small dataset of 93 recordings from the French LCP channel, we intend to characterise the interactions between speakers according to their gender. Finally, we propose a method which aims to highlight active speech areas in terms of interactions between speakers. Such a visualisation tool could improve the efficiency of qualitative studies conducted by researchers in human sciences.

.bib [lebourdais22] |
Martin Lebourdais, Marie Tahon, Antoine Laurent, Sylvain Meignier.

Interspeech 2022

This article focuses on overlapped speech and gender detection in order to study interactions between women and men in French audiovisual media (Gender Equality Monitoring project). In this application context, we need to automatically segment the speech signal according to speakers' gender, and to identify when at least two speakers speak at the same time. We propose to use the WavLM model, which has the advantage of being pre-trained on a huge amount of speech data, to build overlapped speech detection (OSD) and gender detection (GD) systems. In this study, we use two different corpora. The DIHARD III corpus is well suited to the OSD task but lacks gender information. The ALLIES corpus fits the project application context. Our best OSD system is a Temporal Convolutional Network (TCN) with WavLM pre-trained features as input, which reaches a new state-of-the-art F1-score performance on DIHARD. A neural GD system is trained with WavLM inputs on a gender-balanced subset of the French broadcast news ALLIES data, and obtains an accuracy of 94.9%. This work opens new perspectives for human science researchers regarding the differences of representation between women and men in French media.

.bib [lebourdaisIS22] |
Valentin Pelloin, Franck Dary, Nicolas Hervé, Benoit Favre, Nathalie Camelin, Antoine Laurent, Laurent Besacier

Interspeech 2022

We aim at improving spoken language modeling (LM) using a very large amount of automatically transcribed speech. We leverage the INA (French National Audiovisual Institute) collection and obtain 19GB of text after applying ASR on 350,000 hours of diverse TV shows. From this, spoken language models are trained either by fine-tuning an existing LM (FlauBERT) or by training an LM from scratch. New models (FlauBERT-Oral) are shared with the community and evaluated on three downstream tasks: spoken language understanding, classification of TV shows and speech syntactic parsing. Results show that FlauBERT-Oral can be beneficial compared to its initial FlauBERT version, demonstrating that, despite its inherent noisy nature, ASR-generated text can be used to build spoken language models.

.bib [pelloinIS22] |

2021

Valentin Pelloin, Nathalie Camelin, Antoine Laurent, Renato de Mori, Antoine Caubrière, Yannick Estève, Sylvain Meignier

ICASSP 2021

In this paper, we propose a novel end-to-end sequence-to-sequence spoken language understanding model using an attention mechanism. It reliably selects contextual acoustic features in order to hypothesize semantic contents. An initial architecture capable of extracting all pronounced words and concepts from acoustic spans is designed and tested. With a shallow fusion language model, this system reaches a 13.6 concept error rate (CER) and an 18.5 concept value error rate (CVER) on the French MEDIA corpus, achieving an absolute 2.8 points reduction compared to the state-of-the-art. Then, an original model is proposed for hypothesizing concepts and their values. This transduction reaches a 15.4 CER and a 21.6 CVER without any new type of context.

.bib [pelloin21] |
Hervé Bredin, Antoine Laurent

Interspeech 2021

Speaker segmentation consists in partitioning a conversation between one or more speakers into speaker turns. Usually addressed as the late combination of three sub-tasks (voice activity detection, speaker change detection, and overlapped speech detection), we propose to train an end-to-end segmentation model that does it directly. Inspired by the original end-to-end neural speaker diarization approach (EEND), the task is modeled as a multi-label classification problem using permutation-invariant training. The main difference is that our model operates on short audio chunks (5 seconds) but at a much higher temporal resolution (every 16ms). Experiments on multiple speaker diarization datasets conclude that our model can be used with great success on both voice activity detection and overlapped speech detection. Our proposed model can also be used as a post-processing step, to detect and correctly assign overlapped speech regions. Relative diarization error rate improvement over the best considered baseline (VBx) reaches 18% on AMI, 17% on DIHARD 3, and 16% on VoxConverse.
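
For illustration, a minimal permutation-invariant training loss for frame-level, multi-label speaker activities might look like the following (a sketch under assumed shapes, not the pyannote implementation; pred holds per-speaker activation probabilities):

from itertools import permutations
import torch
import torch.nn.functional as F

def pit_bce_loss(pred, target):
    """pred: (frames, speakers) probabilities in [0, 1]; target: (frames, speakers) float 0/1 labels."""
    n_spk = target.shape[1]
    losses = []
    for perm in permutations(range(n_spk)):                          # try every speaker ordering
        losses.append(F.binary_cross_entropy(pred[:, list(perm)], target))
    return torch.stack(losses).min()                                 # the best ordering defines the loss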

.bib [bredin21] | .pdf

2020

Silvio Montrésor, Marie Tahon, Antoine Laurent, Pascal Picart

SPIE Photonics Europe International Symposium

.bib [montresor20] |
Salima Mdhaffar, Yannick Estève, Antoine Laurent, Nicolas Hernandez, Richard Dufour, Delphine Charlet, Geraldine Damnati, Solen Quiniou, Nathalie Camelin

LREC 2020

.bib [mdhaffar2020] |
Antoine Caubrière, Sahar Ghannay, Natalia Tomashenko, Renato De Mori, Antoine Laurent, Emmanuel Morin, Yannick Estève

ICASSP 2020

.bib [caubriere2020b] | .pdf
Antoine Caubrière, Sophie Rosset, Yannick Estève, Antoine Laurent, Emmanuel Morin

LREC 2020

.bib [caubriere2020] |
Antoine Caubrière, Sophie Rosset, Yannick Estève, Antoine Laurent, Emmanuel Morin

JEP 2020

.bib [caubriere2020jep] |
Antoine Caubrière, Yannick Estève, Antoine Laurent, Emmanuel Morin

Interspeech 2020

.bib [caubriere2020int] |
Hans Dolfing, Jérome Bellegarda, Jan Chorowski, Ricard Marxer, Antoine Laurent

ICFHR 2020

.bib [dolfing20] |
Adrian Łancucki, Jan Chorowski, Guillaume Sanchez, Ricard Marxer, Nanxin Chen, Hans J G A Dolfing, Sameer Khurana, Tanel Alumäe, Antoine Laurent

IJCNN 2020

In this paper we demonstrate methods for reliable and efficient training of discrete representations using Vector-Quantized Variational Auto-Encoder models (VQ-VAEs). Discrete latent variable models have been shown to learn nontrivial representations of speech, applicable to unsupervised voice conversion and reaching state-of-the-art performance on unit discovery tasks. For unsupervised representation learning, they became viable alternatives to continuous latent variable models such as the Variational Auto-Encoder (VAE). However, training deep discrete variable models is challenging, due to the inherent non-differentiability of the discretization operation. In this paper we focus on VQ-VAE, a state-of-the-art discrete bottleneck model shown to perform on par with its continuous counterparts. It quantizes encoder outputs with on-line k-means clustering. We show that the codebook learning can suffer from poor initialization and non-stationarity of clustered encoder outputs. We demonstrate that these can be successfully overcome by increasing the learning rate for the codebook and periodic data-dependent codeword re-initialization. As a result, we achieve more robust training across different tasks, and significantly increase the usage of latent codewords even for large codebooks. This has practical benefit, for instance, in unsupervised representation learning, where large codebooks may lead to disentanglement of latent representations.
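
A hedged sketch of the periodic data-dependent codeword re-initialization mentioned above (the shapes and the usage-count bookkeeping are assumptions, not the paper's exact recipe):

import torch

def reinit_dead_codewords(codebook, encoder_outputs, usage_counts, min_usage=1):
    """codebook: (K, D) parameter, encoder_outputs: (N, D) batch of encodings,
    usage_counts: (K,) number of assignments to each codeword since the last reset."""
    dead = (usage_counts < min_usage).nonzero(as_tuple=True)[0]      # rarely used codewords
    if len(dead) > 0:
        idx = torch.randint(0, encoder_outputs.shape[0], (len(dead),))
        codebook.data[dead] = encoder_outputs[idx]                   # re-seed from current data
    usage_counts.zero_()                                             # start a new counting window
    return codebook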

.bib [lancucki20] | .pdf
Sameer Khurana, Antoine Laurent, James Glass

Arxiv

More than half of the 7,000 languages in the world are in imminent danger of going extinct. Traditional methods of documenting language proceed by collecting audio data followed by manual annotation by trained linguists at different levels of granularity. This time-consuming and painstaking process could benefit from machine learning. Many endangered languages do not have any orthographic form but usually have speakers that are bilingual and trained in a high-resource language. It is relatively easy to obtain textual translations corresponding to speech. In this work, we provide a multimodal machine learning framework for speech representation learning by exploiting the correlations between the two modalities, namely speech and its corresponding text translation. Here, we construct a convolutional neural network audio encoder capable of extracting linguistic representations from speech. The audio encoder is trained to perform a speech-translation retrieval task in a contrastive learning framework. By evaluating the learned representations on a phone recognition task, we demonstrate that linguistic representations emerge in the audio encoder's internal representations as a by-product of learning to perform the retrieval task.
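
A minimal sketch of a contrastive speech-translation retrieval objective of this kind, assuming pre-computed, L2-normalised audio and translation embeddings for matched pairs (an illustration, not the paper's exact loss):

import torch
import torch.nn.functional as F

def retrieval_loss(audio_emb, text_emb, temperature=0.07):
    """audio_emb, text_emb: (batch, dim) tensors; row i of each side is a matched pair."""
    logits = audio_emb @ text_emb.t() / temperature                      # pairwise similarities
    labels = torch.arange(audio_emb.shape[0], device=audio_emb.device)   # positives on the diagonal
    # Symmetric cross-entropy: retrieve the translation given the audio, and vice versa.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))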

.bib [khurana20b] | .pdf
Sameer Khurana, Antoine Laurent, Wei-Ning Hsu, Jan Chorowski, Adrian Lancucki, Ricard Marxer, James Glass

Interspeech 2020

Probabilistic Latent Variable Models (LVMs) provide an alternative to self-supervised learning approaches for linguistic representation learning from speech. LVMs admit an intuitive probabilistic interpretation where the latent structure shapes the information extracted from the signal. Even though LVMs have recently seen a renewed interest due to the introduction of Variational Autoencoders (VAEs), their use for speech representation learning remains largely unexplored. In this work, we propose Convolutional Deep Markov Model (ConvDMM), a Gaussian state-space model with non-linear emission and transition functions modelled by deep neural networks. This unsupervised model is trained using black box variational inference. A deep convolutional neural network is used as an inference network for structured variational approximation. When trained on a large scale speech dataset (LibriSpeech), ConvDMM produces features that significantly outperform multiple self-supervised feature extracting methods on linear phone classification and recognition on the Wall Street Journal dataset. Furthermore, we found that ConvDMM complements self-supervised methods like Wav2Vec and PASE, improving on the results achieved with any of the methods alone. Lastly, we find that ConvDMM features enable learning better phone recognizers than any other features in an extreme low-resource regime with few labeled training examples.

.bib [khurana20] | .pdf

2019

Salima Mdhaffar, Yannick Estève, Nicolas Hernandez, Antoine Laurent, Solen Quiniou

26e conférence sur le Traitement Automatique des Langues Naturelles (TALN 2019)

.bib [Mdhaffar19b] |
Caubrière, Antoine, Tomashenko, Natalia, Estève, Yannick, Laurent, Antoine, Morin, Emmanuel

26e conférence sur le Traitement Automatique des Langues Naturelles (TALN 2019)

.bib [caubriere2019curriculum] |
Antoine Caubrière, Natalia Tomashenko, Antoine Laurent, Emmanuel Morin, Nathalie Camelin, Yannick Estève

Interspeech 2019, Annual Conference of the International Speech Communication Association

.bib [Caubriere2019] |
Salima Mdhaffar, Yannick Estève, Nicolas Hernandez, Antoine Laurent, Solen Quiniou

Interspeech 2019, Annual Conference of the International Speech Communication Association

.bib [Mdhaffar2019] | .pdf
Tomashenko, Natalia, Caubrière, Antoine, Estève, Yannick, Laurent, Antoine, Morin, Emmanuel

International Conference on Statistical Language and Speech Processing

.bib [tomashenko2019recent] |
Jan Chorowski, Nanxin Chen, Ricard Marxer, Hans J G A Dolfing, Adrian Łańcucki, Guillaume Sanchez, Tanel Alumäe, Antoine Laurent

NeurIPS 2019 workshop - Perception as generative reasoning - Structure, Causality, Probability

.bib [chorowski19] |

2018

S. Mdhaffar, A. Laurent, Y. Estève

25e conférence sur le Traitement Automatique des Langues Naturelles (TALN 2018)

.bib [Mdhaffar18b] |
S. Mdhaffar, A. Laurent, Y. Estève

XXXIIe Journees d'Etudes sur la Parole (JEP 2018)

.bib [Mdhaffar2018] |
S. Ghannay, A. Caubrière, Y. Estève, N. Camelin, E. Simonnet, A. Laurent

IEEE Spoken Language Technology Workshop

.bib [Ghannay2018] |

2017

Guangpu Huang, Thiago Fraga-Silva, Lori Lamel, Jean-Luc Gauvain, Antoine Laurent, Rasa Lileikyte, Abdel Messaoudi

ICASSP 2017, The 42nd IEEE International Conference on Acoustics, Speech and Signal Processing

This paper reports on investigations of using two techniques for language model text data augmentation for low-resourced automatic speech recognition and keyword search. Low-resourced languages are characterized by limited training materials, which typically results in high out-of-vocabulary (OOV) rates and poor language model estimates. One technique makes use of recurrent neural networks (RNNs) using word or subword units. Word-based RNNs keep the same system vocabulary, so they cannot reduce the OOV rate, whereas subword units can reduce the OOV rate but generate many false combinations. A complementary technique is based on automatic machine translation, which requires parallel texts and is able to add words to the vocabulary. These methods were assessed on 10 languages in the context of the Babel program and NIST OpenKWS evaluation. Although improvements vary across languages with both methods, small gains were generally observed in terms of word error rate reduction and improved keyword search performance.

.bib [Huang17] | .pdf
Rasa Lileikyte, Thiago Fraga-Silva, Lori Lamel, Jean-Luc Gauvain, Antoine Laurent, Guangpu Huang

ICASSP 2017, The 42nd IEEE International Conference on Acoustics, Speech and Signal Processing

In this paper we aim to enhance keyword search for conversational telephone speech under low-resourced conditions. Two techniques to improve the detection of out-of-vocabulary keywords are assessed in this study: using extra text resources to augment the lexicon and language model, and via subword units for keyword search. Two approaches for data augmentation are explored to extend the limited amount of transcribed conversational speech: using conversational-like Web data and texts generated by recurrent neural networks. Contrastive comparisons of subword-based systems are performed to evaluate the benefits of multiple subword decodings and single decoding. Keyword search results are reported for all the techniques, but only some improve performance. Results are reported for the Mongolian and Igbo languages using data from the 2016 Babel program.

.bib [Lileikyte17] | .pdf

2016

G. Gelly, J.L. Gauvain, L. Lamel, A. Laurent, V.B. Le, A. Messaoudi

Odyssey 2016

This paper describes our development work to design a language recognition system that can discriminate closely related languages and dialects of the same language. The work was a joint effort by LIMSI and Vocapia Research in preparation for the NIST 2015 Language Recognition Evaluation (LRE). The language recognition system results from a fusion of four core classifiers: a phonotactic component using DNN acoustic models, two purely acoustic components using an RNN model and an i-vector model, and a lexical component. Each component generates language posterior probabilities optimized to maximize the LID NCE, making their combination simple and robust. The motivation for using multiple components representing different speech knowledge is that some dialect distinctions may not be manifest at the acoustic level. We report experiments on the NIST LRE15 data and provide an analysis of the results and some post-evaluation contrasts. The 2015 LRE task focused on the identification of 20 languages clustered in 6 groups (Arabic, Chinese, English, French, Slavic and Iberic) of similar languages. Results are reported using the NIST Cavg metric which served as the primary metric for the OpenLRE15 evaluation. Results are also reported for the EER and the LER.

.bib [Gelly16] | .pdf
Antoine Laurent, Thiago Fraga-Silva, Lori Lamel, Jean-Luc Gauvain

ICASSP 2016, The 41st IEEE International Conference on Acoustics, Speech and Signal Processing

In this paper we investigate various techniques in order to build effective speech to text (STT) and keyword search (KWS) systems for low resource conversational speech. Subword decoding and graphemic mappings were assessed in order to detect out-of-vocabulary keywords. To deal with the limited amount of transcribed data, semi-supervised training and data selection methods were investigated. Robust acoustic features produced via data augmentation were evaluated for acoustic modeling. For language modeling, automatically retrieved conversational-like Webdata was used, as well as neural network based models. We report STT improvements with all the techniques, but interestingly only some improve KWS performance. Results are reported for the Swahili language in the context of the 2015 OpenKWS Evaluation.

.bib [Laurent16] | .pdf
A. Gorin, R. Lileikyte, G. Huang, L. Lamel, J.L. Gauvain, A. Laurent

Interspeech 2016, Annual Conference of the International Speech Communication Association

This research extends our earlier work on using machine translation (MT) and word-based recurrent neural networks to augment language model training data for keyword search in conversational Cantonese speech. MT-based data augmentation is applied to two language pairs: English-Lithuanian and English-Amharic. Using filtered N-best MT hypotheses for language modeling is found to perform better than just using the 1-best translation. Target language texts collected from the Web and filtered to select conversational-like data are used in several manners. In addition to using Web data for training the language model of the speech recognizer, we further investigate using this data to improve the language model and phrase table of the MT system to get better translations of the English data. Finally, generating text data with a character-based recurrent neural network is investigated. This approach allows new word forms to be produced, providing a way to reduce the out-of-vocabulary rate and thereby improve keyword spotting performance. We study how these different methods of language model data augmentation impact speech-to-text and keyword spotting performance for the Lithuanian and Amharic languages. The best results are obtained by combining all of the explored methods.
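
Purely as an illustration of the character-based RNN generation step, under the assumption of an already trained character-level LSTM exposing a (logits, hidden) interface (all names here are hypothetical):

import torch

@torch.no_grad()
def sample_text(model, char2idx, idx2char, seed="the ", length=200, temperature=0.8):
    """Sample augmentation text one character at a time from a trained char-level LSTM."""
    model.eval()
    hidden = None
    inp = torch.tensor([[char2idx[c] for c in seed]])                # seed character ids, shape (1, len(seed))
    out = list(seed)
    for _ in range(length):
        logits, hidden = model(inp, hidden)                          # assumed (logits, hidden) interface
        probs = torch.softmax(logits[0, -1] / temperature, dim=-1)   # distribution over the next character
        next_idx = torch.multinomial(probs, 1).item()
        out.append(idx2char[next_idx])
        inp = torch.tensor([[next_idx]])
    return "".join(out)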

.bib [Gorin16] | .pdf

2015

Thiago Fraga-Silva, Jean-Luc Gauvain, Lori Lamel, Antoine Laurent, Viet-Bac Le, Abdel Messaoudi

Interspeech 2015, Annual Conference of the International Speech Communication Association

This paper presents first results in using active learning (AL) for training data selection in the context of the IARPA-Babel program. Given an initial training data set, we aim to automatically select additional data (from an untranscribed pool data set) for manual transcription. Initial and selected data are then used to build acoustic and language models for speech recognition. The goal of the AL task is to outperform a baseline system built using a pre-defined data selection with the same amount of data, the Very Limited Language Pack (VLLP) condition. AL methods based on different selection criteria have been explored. Compared to the VLLP baseline, improvements are obtained in terms of Word Error Rate and Actual Term Weighted Values for the Lithuanian language. A description of methods and an analysis of the results are given. The AL selection also outperforms the VLLP baseline for other IARPA-Babel languages, and will be further tested in the upcoming NIST OpenKWS 2015 evaluation.

.bib [Fraga15] | .pdf
Thiago Fraga-Silva, Antoine Laurent, Jean-Luc Gauvain, Lori Lamel, Viet-Bac Le, Abdel Messaoudi

ASRU 2015, 2015 IEEE Automatic Speech Recognition and Understanding Workshop

This paper extends recent research on training data selection for speech transcription and keyword spotting system development. The techniques were explored in the context of the IARPA-Babel Active Learning (AL) task for 6 languages. Different selection criteria were explored with the goal of improving over a system built using a predefined 3-hour training data set. Four variants of the entropy-based criterion were explored: words, triphones, phones, as well as the use of HMM-states introduced in the Interspeech 2015 paper above. The influence of the number of HMM-states was assessed as well as whether automatic or manual reference transcripts were used. The combination of selection criteria was investigated, and a novel multi-stage selection method proposed. These methods were also assessed using larger data sets than were permitted in the Babel AL task. Results are reported for the 6 languages. The multi-stage selection was also applied to the surprise language (Swahili) in the NIST OpenKWS 2015 evaluation.
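
A hedged sketch of one way an entropy-based selection criterion can be realised: greedily pick the utterance whose units (words, phones, HMM-states, ...) most increase the entropy of the selected pool. This illustrates the general idea only, not the paper's exact criterion.

import math
from collections import Counter

def entropy(counts):
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values()) if total else 0.0

def greedy_select(utterances, n_select):
    """utterances: list of unit sequences (e.g. hypothesised words per utterance)."""
    pool, selected = Counter(), []
    remaining = list(utterances)
    for _ in range(min(n_select, len(remaining))):
        best = max(remaining, key=lambda u: entropy(pool + Counter(u)))   # largest entropy gain
        selected.append(best)
        pool += Counter(best)
        remaining.remove(best)
    return selected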

.bib [Fraga15b] | .pdf

2014

Laurent, A., Lamel, L.

SLTU 2014, Spoken Language Technologies for Under-resourced languages

This paper investigates the development of a speech-to-text transcription system for the Korean language in the context of the DGA RAPID Rapmat project. Korean is an alpha-syllabary language spoken by about 78 million people worldwide. As only a small amount of manually transcribed audio data were available, the acoustic models were trained on audio data downloaded from several Korean websites in an unsupervised manner, and the language models were trained on web texts. The reported word and character error rates are estimates, as the development corpus used in these experiments was also constructed from the untranscribed audio data, the web texts and automatic transcriptions. Several variants for unsupervised acoustic model training were compared to assess the influence of the vocabulary size (200k vs 2M), the type of language model (words vs characters), the acoustic unit (phonemes vs half-syllables), as well as incremental batch vs iterative decoding of the untranscribed audio corpus.

.bib [Laurent14] | .pdf
Bredin, H., Laurent, A., Sarkar, A., Le, V.-B., Barras, C., Rosset, S.

Odyssey 2014, The Speaker and Language Recognition Workshop

We address the problem of named speaker identification in TV broadcast which consists in answering the question "who speaks when?" with the real identity of speakers, using person names automatically obtained from speech transcripts. While existing approaches rely on a first speaker diarization step followed by a local name propagation step to speaker clusters, we propose a unified framework called person instance graph where both steps are jointly modeled as a global optimization problem, then solved using integer linear programming. Moreover, when available, acoustic speaker models can be added seamlessly to the graph structure for joint named and acoustic speaker identification, leading to a 10% error decrease (from 45% down to 35%) over a state-of-the-art i-vector speaker identification system on the REPERE TV broadcast corpus.

.bib [Laurent14b] | .pdf
Laurent, A., Camelin, N., Raymond, C.

Interspeech 2014, 15th Annual Conference of the International Speech Communication Association

In this article, we tackle the problem of speaker role detection from broadcast news shows. In the literature, many proposed solutions are based on the combination of various features coming from acoustic, lexical and semantic information with a machine learning algorithm. Many previous studies mention the use of boosting over decision stumps to efficiently combine these features. In this work, we propose a modification of this state-of-the-art machine learning algorithm, replacing the weak learner (decision stumps) with small decision trees, denoted bonsai trees. Experiments show that using bonsai trees as weak learners for the boosting algorithm largely improves both system error rate and learning time.
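
Illustrative only: the idea above can be approximated with off-the-shelf tooling, e.g. scikit-learn's AdaBoost where the depth-1 stump is swapped for a slightly deeper tree; the feature matrix and labels below are placeholders, not the actual role-detection setup.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Classical boosting over decision stumps (depth-1 trees).
stump_booster = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1), n_estimators=200)
# Boosting over small "bonsai" trees: the only change is a deeper weak learner.
bonsai_booster = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=3), n_estimators=200)
# With scikit-learn < 1.2, the keyword is base_estimator instead of estimator.
# bonsai_booster.fit(X_train, y_train)   # X_train: acoustic/lexical features, y_train: speaker roles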

.bib [Laurent14e] | .pdf
Laurent, A., Hartmann, W., Lamel, L.

ISCSLP@Interspeech 2014, 15th Annual Conference of the International Speech Communication Association

This paper investigates unsupervised training strategies for the Korean language in the context of the DGA RAPID Rapmat project. As with previous studies, we begin with only a small amount of manually transcribed data to build preliminary acoustic models. Using the initial models, a larger set of untranscribed audio data is decoded to produce approximate transcripts. We compare both GMM and DNN acoustic models for both the unsupervised transcription and the final recognition system. While the DNN acoustic models produce a lower word error rate on the test set, training on the transcripts from the GMM system provides the best overall performance. We also achieve better performance by expanding the original phone set. Finally, we examine the efficacy of automatically building a test set by comparing system performance both before and after manually correcting the test set.

.bib [Laurent14d] | .pdf
Bouaziz, M., Laurent, A., Estève, Y.

JEP 2014, Journées d'Etudes sur la Parole

Some Automatic Speech Recognition (ASR) systems reach error rates on the order of 10%. However, particularly in the context of automatic indexing of multimedia documents on the web, ASR systems face the problem of out-of-vocabulary words. Indeed, named entities account for a large share of these words and are especially important for indexing tasks. In this work, we implement a hybrid decoding solution using syllables as sub-lexical units. This method is integrated into the LIUM'08 ASR system developed by the Laboratoire d'Informatique de l'Université du Maine. With a slight degradation of the overall system performance, about 31% of out-of-vocabulary person names are correctly recognized.

.bib [Laurent14j5] | .pdf
Bonneau-Maynard, H., Segal, N., Bilinski, E., Gauvain, J.-L., Gong, L., Lamel, L., Laurent, A., Yvon, F., Despres, J., Josse, Y., Le, V.-B.

JEP 2014, Journées d'Etudes sur la Parole

The RAPMAT project aims to develop speech translation systems by addressing the two processing stages that make up the complete pipeline: automatic speech recognition (ASR) and machine translation (MT). In the classical setting, the statistical models used by the two systems are estimated independently, from data of different natures (manual transcriptions of speech data for ASR and bilingual corpora derived from textual data for MT). We propose a semi-supervised approach for adapting translation models to speech translation, in which the MT models are trained by incorporating automatically translated manual and automatic speech transcriptions. The approach is evaluated on the French-to-English translation direction. A demonstration prototype on smartphones, including speech translation for the French/English and French/Chinese language pairs, was developed to enable data collection.

.bib [Laurent14j4] | .pdf
Laurent, A., Lamel, L.

JEP 2014, Journées d'Etudes sur la Parole

This paper describes the development of an automatic speech recognition system for Korean. Korean is an alpha-syllabary language spoken by about 78 million people worldwide. The system was developed using very little manually annotated data. The acoustic models were adapted in an unsupervised manner using data from several Korean news websites. The development corpus contains approximate transcriptions of the audio documents: it is a corpus that was transcribed automatically and aligned with data from the same websites. We compare several approaches in this work, namely language models using different units for unsupervised training and for decoding (characters and words with vocabularies of different sizes), the use of phonemes and "half-syllable" units, and two different unsupervised training approaches.

.bib [Laurent14j3] | .pdf
Laurent, A., Guinaudeau, C., Roy, A.

JEP 2014, Journées d'Etudes sur la Parole

This article describes the methods put in place to enable the analysis of a corpus of audiovisual documents broadcast over the last 80 years: the MATRICE corpus. We propose a data exploration that highlights the different themes and events covered in the corpus. This exploration is first carried out on documentary records produced manually by the archivists of the Institut National de l'Audiovisuel. We then show, through a qualitative study and an automatic clustering technique, that automatic transcriptions also make it possible to analyse the corpus and bring out themes consistent with the processed data.

.bib [Laurent14j2] | .pdf
Laurent, A., Camelin, N., Raymond, C.

JEP 2014, Journées d'Etudes sur la Parole

In this work, we address the problem of speaker role detection in broadcast news shows. In the literature, the proposed solutions combine various features from the acoustics, the transcription and/or its analysis using machine learning methods. Many studies highlight boosting over simple decision rules as one of the most effective algorithms for combining these features. We propose here a modification of this state-of-the-art algorithm that replaces these simple decision rules with small decision trees that we call bonsai trees. Comparative experiments on the EPAC corpus show that this modification greatly improves system performance while substantially reducing training time.

.bib [Laurent14j1] | .pdf
Guinaudeau, C., Laurent, A., Bredin, H.

MediaEval 2014 Social Event Detection Task. Working Notes Proceedings

This paper provides an overview of the Social Event Detection (SED) system developed at LIMSI for the 2014 campaign. Our approach is based on a hierarchical agglomerative clustering that uses textual metadata, user-based knowledge and geographical information. These different sources of knowledge, either used separately or in cascade, reach good results for the full clustering subtask with a normalized mutual information equal to 0.95 and F1 scores greater than 0.82 for our best run.
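
As a rough, hedged sketch of this kind of hierarchical agglomerative clustering over fused distances (the distance terms, weights and threshold are placeholders, not the system's actual features):

from scipy.cluster.hierarchy import linkage, fcluster

def cluster_events(text_dist, user_dist, geo_dist, weights=(0.5, 0.3, 0.2), threshold=1.0):
    """Each *_dist is a condensed pairwise distance vector of length n*(n-1)/2 (numpy array)."""
    combined = weights[0] * text_dist + weights[1] * user_dist + weights[2] * geo_dist
    tree = linkage(combined, method="average")                    # agglomerative clustering on the fused distance
    return fcluster(tree, t=threshold, criterion="distance")      # flat event clusters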

.bib [guinaudeau14] | .pdf

2012

El-Khoury, E., Laurent, A., Meignier, S., Petitrenaud, S.

ICASSP 2012, IEEE International Conference on Acoustics, Speech, and Signal Processing

.bib [Laurent12] | .pdf
Dufour, R., Laurent, A., Estève, Y.

JEP 2012, Journées d'Etudes sur la Parole

.bib [Laurent12j1] | .pdf

2011

Laurent, A., Meignier, S., Merlin, T., Deléglise, P.

ICASSP 2011, IEEE International Conference on Acoustics, Speech, and Signal Processing

.bib [Laurent11] | .pdf

2010

Estève, Y., Deléglise, P., Meignier, S., Petitrenaud, S., Schwenk, H., Barrault, L., Bougares, F., Dufour, R., Jousse, V., Laurent, A., Rousseau, A.

Workshop CMU SPU

.bib [Estève10] |
Laurent, A., Meignier, S., Deléglise, P.

JEP 2010, Journées d'Etudes sur la Parole

.bib [Laurent10j1] | .pdf
Laurent, A., Meignier, S., Merlin, T., Deléglise, P.

Interspeech 2010, 11th Annual Conference of the International Speech Communication Association

.bib [Laurent10-b] | .pdf

2009

Laurent, A., Deléglise, P., Meignier, S.

Interspeech 2009, 10th Annual Conference of the International Speech Communication Association

.bib [Laurent09b] | .pdf
Laurent, A., Merlin, T., Meignier, S., Estève, Y., Deléglise, P.

ICASSP 2009, IEEE International Conference on Acoustics, Speech, and Signal Processing

.bib [Laurent09] | .pdf

2008

Laurent, A., Meignier, S., Estève, Y., Deléglise, P.

JEP 2008, Journées d'Etudes sur la Parole

.bib [Laurent08b] | .pdf
Laurent, A., Merlin, T., Meignier, S., Estève, Y., Deléglise, P.

LREC 2008, Language Resources and Evaluation Conference

.bib [Laurent08] | .pdf