Continuous Sign Language recognition for the design of a gestural server

Job Type: Full-time
Deadline: 31 May 2021

1. Context

This PhD topic is proposed within the framework of the Serveur Gestuel project, funded by Bpifrance (public-private partnership). The objective of the project is to provide Deaf people practicing sign language with the equivalent of a voice server for hearing people. The selected candidate will carry out his/her thesis work in a multidisciplinary environment combining academic research teams from LISN (ex-LIMSI) in Orsay and GIPSA-lab in Grenoble as well as two industrial partners from the consortium, with whom the PhD candidate will interact on a regular basis. As a result:

  • he/she will benefit from strong skills in vision, natural language processing and machine learning;
  • he/she will benefit from solid technical support to develop prototypes and carry out user tests in real conditions;
  • he/she will have the opportunity to deepen his/her knowledge of the social, economic and technical environment of the Deaf community;
  • he/she will contribute to the heart of the project: the development of a system able to automatically decode sign language into written text.

Sign languages (SL) are natural languages used by Deaf communities. Unlike vocal languages, which are audio-phonatory, SL are visuo-gestural. They are also multimodal, in the sense that information is conveyed by different articulators (hands, arms, chest, shoulders, head, facial features, gaze) and their movements. Utterances in SL include several types of gestural units: lexical signs, which are conventional signs whose form and meaning remain stable regardless of the context and which can be listed in a dictionary; complex structures built on the fly for illustrative purposes, which therefore cannot be listed in a dictionary; and a large number of movements with a linguistic role carried by articulators other than the hands. In addition, the discourse is structured in space, which is used to contextualize a sign, to place objects or concepts, and to create visual relationships between these entities. Thus, an SL utterance cannot be reduced to a simple sequence of signs, each with an equivalent in the vocal language. It should also be noted that the form of a sign varies according to the transitions that precede and follow it (co-articulation) and to linguistic constraints (spatialization). Automatic sign language recognition is attracting growing interest in the computer vision and machine learning communities. However, most studies focus on non-spatial lexical signs or ignore co-articulation. Addressing these complex aspects of sign language is one of the main goals of this PhD project.

2. Objectives

The objective of the thesis is to investigate deep learning methods for natural sign language recognition that better take co-articulation, spatialization and illustrative structures into account. This work will initially be based on the data, tools and knowledge developed at LIMSI [1,2,4,5]. The research work will focus on three axes: 1) building realistic sign language datasets (in collaboration with the industrial partners); 2) image processing, feature extraction and representation learning; 3) a deep learning pipeline for decoding sign language.

2.1 Data
Contrary to action recognition, sign language recognition does not benefit from large datasets. First of all, there is no universal sign language, but rather different languages (French, American, German, Chinese, etc.). For each of them, corpora are built using various methodologies. The annotation of these videos is not always provided and, when available, the annotation rules vary according to the linguistic model used. Most public datasets propose either a set of isolated lexical signs in a given language, or very simple but unrealistic utterances (a subject, an action and an object), or domain-specific utterances (for example, weather forecasts). In this thesis, we will rely on so-called 'natural' sign language corpora, in the sense that few constraints are imposed on the content and form of the utterances. Contrary to most SL datasets, these corpora contain both lexical signs and illustrative structures. In addition, the thesis will help expand the corpora by leveraging public data and collecting data through the consortium, and will assess the impact of using synthetic data.

2.2 Video pre-processing : spatio-temporal representations of the signer
The raw videos will be transformed into compact spatio-temporal data representing the signer and his/her articulators. These data will serve as training data for deep learning models. From a signer representation made up of key points associated with the pose of the signer's body and face, produced by deep learning (OpenPose library [6]), together with a set of dynamic characteristics computed on these representations, we were able to develop a first system capable of recognizing a few of the most common signs [7]. Hand analysis remains an issue because of the large number of degrees of freedom and the motion blur caused by the high speed of the hands. The most recent tools identify handshapes based on the appearance of the hands [6], but they notably lack information on hand orientation. The thesis will be an opportunity to exploit the latest advances in finger detection in a video stream. A minimal sketch of such a pre-processing step is given below.
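
Since the posting contains no code, the following is a hypothetical Python sketch of this pre-processing step. It assumes OpenPose's standard per-frame JSON output (files named `*_keypoints.json`, with flat `[x, y, confidence]` lists per articulator) and uses simple frame-to-frame displacements as dynamic characteristics; the function names and the choice of features are illustrative only, not the project's actual pipeline.

```python
import json
from pathlib import Path

import numpy as np

def load_openpose_sequence(frame_dir: str) -> np.ndarray:
    """Stack per-frame OpenPose JSON files into a (T, 137, 3) array of
    (x, y, confidence) keypoints (body, face, both hands) for one signer."""
    frames = []
    for path in sorted(Path(frame_dir).glob("*_keypoints.json")):
        person = json.loads(path.read_text())["people"][0]  # assume one signer
        kp = np.concatenate([
            np.asarray(person["pose_keypoints_2d"]),        # 25 body points
            np.asarray(person["face_keypoints_2d"]),        # 70 face points
            np.asarray(person["hand_left_keypoints_2d"]),   # 21 points
            np.asarray(person["hand_right_keypoints_2d"]),  # 21 points
        ]).reshape(-1, 3)
        frames.append(kp)
    return np.stack(frames)

def dynamic_features(seq: np.ndarray) -> np.ndarray:
    """Append first-order differences (velocities) of the 2D positions as
    simple dynamic characteristics, yielding one feature vector per frame."""
    xy = seq[:, :, :2]                          # drop confidence scores
    vel = np.diff(xy, axis=0, prepend=xy[:1])   # zero velocity on frame 0
    return np.concatenate([xy, vel], axis=-1).reshape(len(seq), -1)
```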

2.3 Sign language recognition
The core of the thesis will consist in designing machine learning pipelines for addressing the following objectives:

  • Automatic sign spotting: this approach will allow querying a sign language database directly with a short SL video, without any textual input; this detection will also automatically annotate videos to enrich the training databases (a spotting sketch is given after this list)
  • Automatic conversion of sign language sequences into written text: a special focus will be put on sequence-to-sequence models [8] (see the model sketch below). The problem of adapting a pre-trained model to a new signer or a new recording set-up (e.g. video device, recording environment) will also be addressed.
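
As an illustration of the first objective, here is a minimal sketch of sliding-window spotting over per-frame embeddings. It assumes the query clip and the target video have already been encoded by some learned encoder (not shown); the window length, threshold and function names are illustrative assumptions, and a real system would likely use a learned detector rather than plain cosine similarity.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def spot_sign(query_emb: np.ndarray, video_embs: np.ndarray,
              window: int, threshold: float = 0.8):
    """Slide a window over the per-frame embeddings of a long video and
    return the start frames whose mean-pooled embedding matches the query."""
    hits = []
    for start in range(len(video_embs) - window + 1):
        segment = video_embs[start:start + window].mean(axis=0)
        score = cosine(query_emb, segment)
        if score >= threshold:
            hits.append((start, score))
    return hits
```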

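For the second objective, the following PyTorch sketch shows the general shape of a sequence-to-sequence model mapping per-frame signer features to written-text tokens, in the spirit of the models surveyed in [8]. All names, dimensions and hyper-parameters are illustrative assumptions, and positional encodings are omitted for brevity; this is not the project's actual architecture.

```python
import torch
import torch.nn as nn

class SignToText(nn.Module):
    """Minimal Transformer sequence-to-sequence sketch: encode a sequence
    of per-frame keypoint features, decode a sequence of text tokens."""

    def __init__(self, feat_dim: int, vocab_size: int, d_model: int = 256):
        super().__init__()
        self.in_proj = nn.Linear(feat_dim, d_model)  # frame features -> model dim
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=4, num_decoder_layers=4,
            batch_first=True,
        )
        self.out_proj = nn.Linear(d_model, vocab_size)

    def forward(self, frames: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, feat_dim) signer features; tokens: (B, L) target ids,
        # shifted right for teacher forcing during training.
        tgt_mask = self.transformer.generate_square_subsequent_mask(
            tokens.size(1)).to(tokens.device)
        dec = self.transformer(self.in_proj(frames), self.tok_emb(tokens),
                               tgt_mask=tgt_mask)
        return self.out_proj(dec)  # (B, L, vocab_size) logits
```

Such a model would typically be trained with cross-entropy against aligned written transcripts (for example, the subtitles of [5]), and adapting it to a new signer or recording set-up could then be framed as fine-tuning or domain adaptation.
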
3. References

[1] V. Belissen, A. Braffort, M. Gouiffès. Experimenting the Automatic Recognition of Non-Conventionalized Units in Sign Language. Algorithms 2020, 13, p. 310.

[2] H. Chaaban, M. Gouiffès, A. Braffort. Towards an Automatic Annotation of French Sign Language Videos: Detection of Lexical Signs. CAIP 2019. doi:10.1007/978-3-030-29891-3_35.

[3] S. Matthes, T. Hanke, A. Regen, J. Storz, S. Worseck, E. Efthimiou, N. Dimou, A. Braffort, J. Glauert, E. Safar. Dicta-Sign – Building a Multilingual Sign Language Corpus. 5th Workshop on the Representation and Processing of Sign Languages: Interactions between Corpus and Lexicon, Istanbul, Turkey, ELRA, 2012. www.ortolang.fr, https://hdl.handle.net/11403/dicta-sign-lsf-v2/v1.

[4] V. Belissen, A. Braffort, M. Gouiffès. Dicta-Sign-LSF-v2: Remake of a Continuous French Sign Language Dialogue Corpus and a First Baseline for Automatic Sign Language Processing. LREC 2020.

[5] H. Bull, A. Braffort, M. Gouiffès. MEDIAPI-SKEL – A 2D-Skeleton Video Database of French Sign Language With Aligned French Subtitles. LREC 2020.

[6] Z. Cao, T. Simon, S.-E. Wei, Y. Sheikh. Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. CVPR 2017, Honolulu, Hawaii.

[7] O. Koller, H. Ney, R. Bowden. Deep Hand: How to Train a CNN on 1 Million Hand Images When Your Data Is Continuous and Weakly Labelled. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[8] O. Koller. Quantitative Survey of the State of the Art in Sign Language Recognition. arXiv:2008.09918, https://arxiv.org/abs/2008.09918, 2020.

Funding category: Other public funding

Research contract

PhD title: Doctorat en Informatique

PhD Country: France

