End-to-End Spoken Language Understanding: Performance analyses of a voice command task in a low resource setting

Thierry Desot; François Portet; Michel Vacher

doi:10.1016/j.csl.2022.101369

Journal article, Computer Speech and Language, Year: 2022

Thierry Desot

Role: Author

Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole

François Portet

Role: Author
PersonId: 1069
IdHAL: francois-portet
ORCID: 0000-0003-2542-0661
IdRef: 098179160

Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole

Michel Vacher

Role: Author

Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole

Abstract

Spoken Language Understanding (SLU) is a core task in most human-machine interaction systems. With the emergence of smart homes, smartphones and smart speakers, SLU has become a key technology for the industry. In a classical SLU approach, an Automatic Speech Recognition (ASR) module transcribes the speech signal into a textual representation from which a Natural Language Understanding (NLU) module extracts semantic information. Recently, End-to-End SLU (E2E SLU) based on Deep Neural Networks has gained momentum because it benefits from the joint optimization of the ASR and NLU parts, thereby limiting the cascading-error effect of the pipeline architecture. However, little is known about the actual linguistic properties used by E2E models to predict concepts and intents from speech input. In this paper, we present a study identifying the signal features and other linguistic properties used by an E2E model to perform the SLU task. The study is carried out in the application domain of a smart home that has to handle non-English (here French) voice commands. The results show that good E2E SLU performance does not always require perfect ASR capability. Furthermore, the results show the superior capabilities of the E2E model in handling background noise and syntactic variation compared to the pipeline model. Finally, a finer-grained analysis suggests that the E2E model uses the pitch information of the input signal to identify voice command concepts. The results and methodology outlined in this paper provide a springboard for further analyses of E2E models in speech processing.

Domains

Artificial Intelligence [cs.AI]; Computation and Language [cs.CL]; Machine Learning [cs.LG]

Main file

S0885230822000134.pdf (1.2 MB)

Origin: Files produced by the author(s)

Elsevier CCSD agreement

https://hal.science/hal-03727285

Submitted on: Monday, 22 July 2024, 11:23:12

Last modified on: Monday, 9 December 2024, 03:26:37

Dates and versions

hal-03727285, version 1 (22-07-2024)

License

Attribution - NonCommercial

Identifiers

HAL Id: hal-03727285, version 1
arXiv: 2207.08179
DOI: 10.1016/j.csl.2022.101369
PII: S0885-2308(22)00013-4

Cite

Thierry Desot, François Portet, Michel Vacher. End-to-End Spoken Language Understanding: Performance analyses of a voice command task in a low resource setting. Computer Speech and Language, 2022, 75, pp.101369. ⟨10.1016/j.csl.2022.101369⟩. ⟨hal-03727285⟩

Collections

UGA CNRS LIG LIG_TDCGE_GETALP MIAI ANR LIG_SIDCH LIVINGLAB_DOMUS
