Language technologies

This transversal action brings together complementary approaches that already exist within the laboratory in the field of language technologies. The current challenges (foundation models, corpus collection, computational linguistics, and human–machine interaction) are addressed by different departments and teams, each according to its own focus.

DMID Department

Natural Language Processing (NLP) is a core research focus of the DECIDE team within the DMID department, whose activities span the entire spectrum “from data to decision.” Within a data analysis pipeline, the role of NLP is to ensure a precise and fine-grained understanding by machines of natural language data, whether in written or spoken form. Techniques such as knowledge extraction make it possible to convert natural language into formal, precise, and unambiguous representations that can be directly exploited by inference or prediction algorithms. Within DECIDE, priority is given to explainable and controllable NLP methods, particularly in domains where precision is critical, such as the processing of health data. The analysis of linguistic corpora in mental health and the processing of personal health records are examples of projects carried out by team members.

Beyond data analysis, the team has a strong interest in computational linguistics and multilingualism, that is, in NLP systems capable of supporting a diversity of languages that may exhibit significant variation in grammar, lexicon, writing systems, and other features. The applications of this research range from the digital humanities, through support for minority and low-resource languages, to the study of structural biases within AI systems (in the broad sense) for or against certain languages. In this regard, research on Celtic languages conducted by members of COMMEDIA creates opportunities for collaboration between teams.

Finally, the biennial conference “Grapholinguistics in the 21st Century,” organized by members of the team, has been, since 2018, one of the very few scientific events worldwide to focus on the written modality of language from both a linguistic and an interdisciplinary perspective (computational, historical, sociolinguistic, artistic, etc.).

Contact: Gábor Bella

INTERACTION Department

One of the research themes represented within the INTERACTION department is embodied AI. This topic is addressed through user-centered approaches, both for corpus collection and for the adaptation or specialization of existing AI models. These developments draw on research in the humanities and social sciences, and may in turn contribute back to these fields. Most of these projects have a strong multidisciplinary dimension.

This theme includes the specialization of models and the development of algorithms designed to facilitate human–agent interaction (whether with a virtual agent or a robot) through dialogue, as in the DISCOBOT project jointly led by the RAMBO and COMMEDIA teams.

The COMMEDIA team is also involved in several projects on the automatic processing of text and speech to extract semantic and pragmatic information. For instance, one research project analyzes the errors produced by text simplification models through the development of a taxonomy, the creation of annotated datasets, and the definition of metrics for evaluating text simplicity. Another project deals with the retrieval of communicative intention in multimedia documents via narrative understanding. The team also contributes to the JOKER project, coordinated by the HCTI laboratory, which focuses on identifying humor and assisting with its translation.

In addition, COMMEDIA is engaged in an interdisciplinary collaboration with Bangor University, the CNRS MoDyCo laboratory, and the CNRS IKER laboratory to develop evaluation frameworks for AI applied to Brittonic languages, particularly for speech processing in the context of these minority languages.

Contact: Anne-Gwenn Bosser

T2I3 Department

Within the T2I3 department, the BRAIN team focuses its research on Large Language Models (LLMs), structuring its contributions around four main axes: methodology, application, multi-modality, and reasoning.

Methodologically, the team works on improving the performance of auto-regressive LLMs, particularly through proposals aimed at enhancing text generation diversity and optimizing input-output alignment. In addition, it explores robust architectural alternatives, such as discrete diffusion approaches. These research efforts are accompanied by an application-oriented component specifically targeting figurative language processing.

The team also investigates the multi-modal dimension by establishing connections between text and heterogeneous data such as vision, audio, music, as well as brain and biological signals. Finally, the team explores the integration of reasoning mechanisms to facilitate automated mining within large multi-modal datasets.

Contact: Vincent Gripon