Semantic modeling within transformers for action detection in untrimmed videos
Author: Seongro Yoon
Supervisor: François Bremond
Type: Thesis project
Discipline(s): Control, signal and image processing
Date: Doctoral enrollment on 01/10/2024
Institution(s): Université Côte d'Azur
Doctoral school(s): École doctorale Sciences et technologies de l'information et de la communication
Research partner(s): Laboratory: Spatio-Temporal Activity Recognition Systems
Abstract
This PhD work focuses on enhancing Emotion Recognition and Detection algorithms that rely on RGB video cameras at test time, while incorporating multimodal data during training. The objective is to develop and evaluate a model across multiple datasets with varying modalities to identify specific emotions, such as stress, anxiety, and joy. The approach leverages advanced Deep Learning techniques to combine multimodal inputs and explores strategies such as multi-task learning, Knowledge Elicitation using the Student-Teacher paradigm, contrastive learning, and co-training of Transformer models. Several levels of ground-truth supervision, including weak supervision, will be employed to train the model.

The typical pipeline may integrate CNNs for 3D pose estimation, eye-gaze tracking, and facial expression analysis, depending on the target emotions. Short-term temporal features can be processed using RNNs or 3D CNNs, while longer-term reasoning may be handled by TCNs, Transformers, or even ontology-based approaches. The first step is to extract meaningful mid-level features, which will then be refined through more advanced long-term reasoning. A key challenge will be developing a method that effectively integrates knowledge acquisition with long-term reasoning in a weakly supervised setting.

The ultimate goal is to minimize the need for supervision and create a robust, generalizable algorithm capable of detecting emotions and facial expressions in individuals within unconstrained environments, using a single video camera and minimal sensor data. To validate this work, the proposed approaches will be tested on video data from applications in collaboration with Nice Hospital, including the monitoring of patients with behavioral disorders such as autism, dementia, and depression.
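As a rough illustration of the Transformer-based long-term reasoning stage described above, the sketch below shows single-head scaled dot-product self-attention over a sequence of mid-level feature vectors — the core operation a Transformer applies to relate distant time steps. It is a minimal NumPy sketch; all names, shapes, and dimensions are illustrative, not taken from the project:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X : (T, D) sequence of mid-level features (one vector per time step).
    Wq, Wk, Wv : (D, Dh) learned projection matrices.
    Returns (T, Dh): each output mixes information from all time steps,
    which is what enables reasoning over long temporal ranges.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (T, T) pairwise affinities
    A = softmax(scores)                      # attention weights, rows sum to 1
    return A @ V

# Toy usage: 5 time steps of 8-dim mid-level features, one 4-dim head.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 4)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)  # shape (5, 4)
```

In a real system the projections would be learned end-to-end and stacked into multi-head layers; this sketch only conveys how attention lets every time step draw on the whole sequence.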