Sign Language Production Using Neural Machine Translation and Generative Adversarial Networks

Sign Language Production using Neural Machine Translation and Generative Adversarial Networks

British Machine Vision Conference (BMVC), 2018

We present a novel approach to automatic Sign Language Production using state-of-the-art Neural Machine Translation (NMT) and Image Generation techniques. Our system is capable of producing sign videos from spoken language sentences. Contrary to current approaches that are dependent on heavily annotated data, our approach requires minimal gloss- and skeletal-level annotations for training. We achieve this by breaking down the task into dedicated sub-processes. We first translate spoken language sentences into sign gloss sequences using an encoder-decoder network. We then find a data-driven mapping between glosses and skeletal sequences. We use the resulting pose information to condition a generative model that produces sign language video sequences. We evaluate our approach on the recently released PHOENIX14T Sign Language Translation dataset. We set a baseline for text-to-gloss translation, reporting a BLEU-4 score of 16.34/15.26 on dev/test sets. We further demonstrate the video generation capabilities of our approach by sharing qualitative results of generated sign sequences given their skeletal correspondence.
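
As a rough illustration of the text-to-gloss encoder-decoder step described in this abstract (not the authors' implementation; vocabulary sizes, hidden sizes, and the GRU choice are assumptions), a minimal PyTorch sketch might look like this:

```python
# Minimal sketch of a text-to-gloss encoder-decoder; all sizes are illustrative.
import torch
import torch.nn as nn

class Text2Gloss(nn.Module):
    def __init__(self, text_vocab, gloss_vocab, emb=256, hid=512):
        super().__init__()
        self.src_emb = nn.Embedding(text_vocab, emb)
        self.trg_emb = nn.Embedding(gloss_vocab, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, gloss_vocab)

    def forward(self, text_ids, gloss_ids):
        # Encode the spoken-language sentence into a context vector.
        _, context = self.encoder(self.src_emb(text_ids))
        # Decode gloss tokens conditioned on that context (teacher forcing).
        dec_out, _ = self.decoder(self.trg_emb(gloss_ids), context)
        return self.out(dec_out)  # logits over the gloss vocabulary

model = Text2Gloss(text_vocab=3000, gloss_vocab=1200)
logits = model(torch.randint(0, 3000, (4, 12)), torch.randint(0, 1200, (4, 9)))
print(logits.shape)  # (batch, gloss_len, gloss_vocab)
```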

Text2Sign: Towards Sign Language Production Using Neural Machine Translation and Generative Adversarial Networks

International Journal of Computer Vision

We present a novel approach to automatic Sign Language Production using recent developments in Neural Machine Translation (NMT), Generative Adversarial Networks, and motion generation. Our system is capable of producing sign videos from spoken language sentences. Contrary to current approaches that are dependent on heavily annotated data, our approach requires minimal gloss- and skeletal-level annotations for training. We achieve this by breaking down the task into dedicated sub-processes. We first translate spoken language sentences into sign pose sequences by combining an NMT network with a Motion Graph. The resulting pose information is then used to condition a generative model that produces photo-realistic sign language video sequences. This is the first approach to continuous sign video generation that does not use a classical graphical avatar. We evaluate the translation abilities of our approach on the PHOENIX14T Sign Language Translation dataset. We set a baseline for text-to...
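
To give a feel for the pose-conditioned generation stage this abstract describes, here is a hedged sketch (assumptions, not the paper's implementation) of a generator that maps a rendered skeletal pose image to an RGB frame; channel counts and the encoder-decoder depth are arbitrary:

```python
# Illustrative pose-conditioned frame generator; architecture details are assumed.
import torch
import torch.nn as nn

class Pose2Frame(nn.Module):
    def __init__(self):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.up = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, pose_img):
        return self.up(self.down(pose_img))  # generated RGB frame in [-1, 1]

pose = torch.randn(1, 3, 256, 256)  # rendered skeleton image
frame = Pose2Frame()(pose)
print(frame.shape)  # (1, 3, 256, 256)
```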

SignSynth: Data-Driven Sign Language Video Generation

Computer Vision – ECCV 2020 Workshops, 2020

We present SignSynth, a fully automatic and holistic approach to generating sign language video. Traditionally, Sign Language Production (SLP) relies on animating 3D avatars using expensively annotated data, but so far this approach has not been able to simultaneously provide a realistic and scalable solution. We introduce a gloss2pose network architecture that is capable of generating human pose sequences conditioned on glosses. Combined with a generative adversarial pose2video network, we are able to produce natural-looking, high-definition sign language video. For sign pose sequence generation, we outperform the SotA by a factor of 18, with a Mean Square Error of 1.0673 in pixels. For video generation we report superior results on three broadcast quality assessment metrics. To evaluate our full gloss-to-video pipeline we introduce two novel error metrics, to assess the perceptual quality and sign representativeness of generated videos. We present promising results, significantly outperforming the SotA in both metrics. Finally we evaluate our approach qualitatively by analysing example sequences.
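
The "Mean Square Error in pixels" reported above is, in essence, an average squared distance between predicted and ground-truth 2D keypoints. The exact evaluation protocol is not spelled out here, so the following is only a sketch of that kind of computation:

```python
# Sketch of a per-keypoint MSE in pixel coordinates (protocol details assumed).
import numpy as np

def pose_mse(pred, gt):
    """pred, gt: arrays of shape (frames, keypoints, 2) in pixel coordinates."""
    return float(np.mean((pred - gt) ** 2))

pred = np.random.rand(100, 50, 2) * 256
gt = pred + np.random.randn(100, 50, 2)  # small synthetic perturbation
print(pose_mse(pred, gt))                # ~1.0 for unit-variance noise
```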

Everybody Sign Now: Translating Spoken Language to Photo Realistic Sign Language Video

2020

To be truly understandable and accepted by Deaf communities, an automatic Sign Language Production (SLP) system must generate a photo-realistic signer. Prior approaches based on graphical avatars have proven unpopular, whereas recent neural SLP works that produce skeleton pose sequences have been shown to be not understandable to Deaf viewers. In this paper, we propose SignGAN, the first SLP model to produce photo-realistic continuous sign language videos directly from spoken language. We employ a transformer architecture with a Mixture Density Network (MDN) formulation to handle the translation from spoken language to skeletal pose. A pose-conditioned human synthesis model is then introduced to generate a photo-realistic sign language video from the skeletal pose sequence. This allows the photo-realistic production of sign videos directly translated from written text. We further propose a novel keypoint-based loss function, which significantly improves the quality of synthesized ha...
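
A Mixture Density Network head, as mentioned in this abstract, predicts a Gaussian mixture over poses instead of a single regression target. The sketch below shows the general idea on top of per-step transformer features; the number of mixture components and the pose dimensionality are assumptions, not the paper's settings:

```python
# Hedged sketch of an MDN output head for pose prediction; sizes are illustrative.
import torch
import torch.nn as nn

class MDNHead(nn.Module):
    def __init__(self, feat_dim=512, pose_dim=150, n_mix=5):
        super().__init__()
        self.n_mix, self.pose_dim = n_mix, pose_dim
        self.pi = nn.Linear(feat_dim, n_mix)              # mixture weights
        self.mu = nn.Linear(feat_dim, n_mix * pose_dim)   # component means
        self.log_sigma = nn.Linear(feat_dim, n_mix * pose_dim)

    def forward(self, h):
        B, T, _ = h.shape
        pi = torch.softmax(self.pi(h), dim=-1)
        mu = self.mu(h).view(B, T, self.n_mix, self.pose_dim)
        sigma = self.log_sigma(h).exp().view(B, T, self.n_mix, self.pose_dim)
        return pi, mu, sigma

def mdn_nll(pi, mu, sigma, target):
    # Negative log-likelihood of the target pose under the Gaussian mixture.
    t = target.unsqueeze(2)                               # (B, T, 1, D)
    comp = torch.distributions.Normal(mu, sigma).log_prob(t).sum(-1)
    return -torch.logsumexp(torch.log(pi + 1e-8) + comp, dim=-1).mean()

h = torch.randn(2, 10, 512)                               # transformer features
pi, mu, sigma = MDNHead()(h)
print(mdn_nll(pi, mu, sigma, torch.randn(2, 10, 150)))
```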

Neural Sign Language Translation

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018

Sign Language Recognition (SLR) has been an active research field for the last two decades. However, most research to date has considered SLR as a naive gesture recognition problem. SLR seeks to recognize a sequence of continuous signs but neglects the underlying rich grammatical and linguistic structures of sign language that differ from spoken language. In contrast, we introduce the Sign Language Translation (SLT) problem. Here, the objective is to generate spoken language translations from sign language videos, taking into account the different word orders and grammar. We formalize SLT in the framework of Neural Machine Translation (NMT) for both end-to-end and pretrained settings (using expert knowledge). This allows us to jointly learn the spatial representations, the underlying language model, and the mapping between sign and spoken language. To evaluate the performance of Neural SLT, we collected the first publicly available Continuous SLT dataset, RWTH-PHOENIX-Weather 2014T. It provides spoken language translations and gloss-level annotations for German Sign Language videos of weather broadcasts. Our dataset contains over 0.95M frames with >67K signs from a sign vocabulary of >1K and >99K words from a German vocabulary of >2.8K. We report quantitative and qualitative results for various SLT setups to underpin future research in this newly established field. The upper bound for translation performance is calculated at 19.26 BLEU-4, while our end-to-end frame-level and gloss-level tokenization networks were able to achieve 9.58 and 18.13 respectively.
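
The BLEU-4 figures cited here are corpus-level n-gram precision scores. As a minimal sketch of how such a score is typically computed (using NLTK purely as an example; the paper's exact scoring script may differ, and the German tokens below are made up):

```python
# Toy BLEU-4 computation; data and tooling are illustrative, not the paper's.
from nltk.translate.bleu_score import corpus_bleu

references = [[["regen", "und", "schnee", "am", "samstag"]]]  # one reference per sample
hypotheses = [["regen", "und", "schnee", "am", "sonntag"]]
score = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))
print(f"BLEU-4: {100 * score:.2f}")
```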

A data augmentation approach for sign-language-to-text translation in-the-wild

Language Data and Knowledge Conference, 2021

In this paper, we describe the current main approaches to sign language translation, which use deep neural networks with videos as input and text as output. We highlight that, from our point of view, their main weakness is the lack of generalization in daily life contexts. Our goal is to build a state-of-the-art system for the automatic interpretation of sign language in unpredictable video framing conditions. Our main contribution is the shift from image features to landmark positions in order to diminish the size of the input data and facilitate the combination of data augmentation techniques for landmarks. We describe the set of hypotheses to build such a system and the list of experiments that will lead us to their verification.
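
The landmark-level augmentation argued for above amounts to applying geometric transforms directly to keypoint coordinates rather than to image pixels. A sketch under our own assumptions (not the authors' exact pipeline) could look like this:

```python
# Sketch of landmark augmentation: rotate, scale, and jitter 2D keypoints.
import numpy as np

def augment_landmarks(kp, max_rot_deg=10.0, max_scale=0.1, jitter=0.01):
    """kp: (frames, keypoints, 2) array of normalized landmark coordinates."""
    theta = np.deg2rad(np.random.uniform(-max_rot_deg, max_rot_deg))
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    scale = 1.0 + np.random.uniform(-max_scale, max_scale)
    center = kp.mean(axis=(0, 1), keepdims=True)
    out = (kp - center) @ rot.T * scale + center          # rotate and scale about the centroid
    return out + np.random.normal(0.0, jitter, kp.shape)  # small coordinate noise

kp = np.random.rand(30, 54, 2)      # e.g. 30 frames, 54 body + hand landmarks
print(augment_landmarks(kp).shape)  # (30, 54, 2)
```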

Can Everybody Sign Now? Exploring Sign Language Video Generation from 2D Poses

2020

Sign Language is the primary means of communication of the Deaf community but is barely known by the rest of the population. This situation creates difficulties in conversations between sign and non-sign language speakers, which are normally addressed with textual transcriptions of the spoken language, or by sign language speakers developing lip-reading and oral communication skills. The communication barrier between sign and non-sign language speakers may be reduced in the coming years thanks to the recent advances in neural machine translation and computer vision. Recent works [5,6,9] are making steps towards sign language translation by automatically generating detailed human pose skeletons from spoken language. Skeletons are represented by 2D/3D coordinates of human joints, also known as keypoints; given a set of estimated keypoints, one can visualize them as a wired skeleton connecting the modeled joints (see the middle row of Figure 1). Although such visualizations are theoretically usefu...
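
The "wired skeleton" visualization mentioned above simply connects estimated 2D keypoints with line segments. A minimal sketch (the keypoint layout and edge list are illustrative, not the paper's format):

```python
# Draw a toy 2D skeleton from keypoints; layout and edges are made up.
import numpy as np
import matplotlib.pyplot as plt

keypoints = np.array([[0.5, 0.9], [0.5, 0.7], [0.35, 0.7],
                      [0.65, 0.7], [0.3, 0.5], [0.7, 0.5]])  # head, neck, shoulders, hands
edges = [(0, 1), (1, 2), (1, 3), (2, 4), (3, 5)]

for a, b in edges:
    plt.plot([keypoints[a, 0], keypoints[b, 0]],
             [keypoints[a, 1], keypoints[b, 1]], "b-")
plt.scatter(keypoints[:, 0], keypoints[:, 1], c="r")
plt.gca().invert_yaxis()        # image coordinates: y grows downwards
plt.savefig("skeleton.png")
```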

Neural Sign Language Translation Based on Human Keypoint Estimation

Applied Sciences

We propose a sign language translation system based on human keypoint estimation. It is well-known that many problems in the field of computer vision require a massive dataset to train deep neural network models. The situation is even worse when it comes to the sign language translation problem, as it is far more difficult to collect high-quality training data. In this paper, we introduce the KETI (Korea Electronics Technology Institute) sign language dataset, which consists of 14,672 videos of high resolution and quality. Considering the fact that each country has a different and unique sign language, the KETI sign language dataset can be the starting point for further research on Korean sign language translation. Using the KETI sign language dataset, we develop a neural network model for translating sign videos into natural language sentences by utilizing the human keypoints extracted from the face, hands, and body parts. The obtained human keypoint vector is normalized by the ...
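
The abstract is cut off before it spells out the normalization scheme, so the following is only one plausible sketch of a keypoint-vector normalization: centre the 2D keypoints on a reference joint and scale by a body-size proxy.

```python
# Hypothetical keypoint normalization; joint indices and the scheme are assumed.
import numpy as np

def normalize_keypoints(kp, neck_idx=1, l_shoulder=2, r_shoulder=5):
    """kp: (keypoints, 2) array of pixel coordinates for one frame."""
    centred = kp - kp[neck_idx]
    shoulder_width = np.linalg.norm(kp[l_shoulder] - kp[r_shoulder]) + 1e-8
    return centred / shoulder_width

frame_kp = np.random.rand(137, 2) * 1080    # e.g. body + face + hand keypoints
print(normalize_keypoints(frame_kp).shape)  # (137, 2)
```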

Development of an End-to-End Deep Learning Framework for Sign Language Recognition, Translation, and Video Generation

IEEE Access

Recent developments in deep learning techniques have reached new heights across various domains and applications. The recognition, translation, and video generation of Sign Language (SL) still face huge challenges from a development perspective. Although numerous advancements have been made in earlier approaches, model performance still falls short in recognition accuracy and visual quality. In this paper, we introduce novel approaches for developing a complete framework that handles SL recognition, translation, and production tasks in real-time cases. To achieve higher recognition accuracy, we use the MediaPipe library and a hybrid Convolutional Neural Network + Bi-directional Long Short-Term Memory (CNN + Bi-LSTM) model for pose detail extraction and text generation. The production of sign gesture videos for given spoken sentences, in turn, is implemented using a hybrid Neural Machine Translation (NMT) + MediaPipe + Dynamic Generative Adversarial Network (GAN) model. The proposed model addresses various complexities present in existing approaches and achieves above 95% classification accuracy. In addition, model performance is tested in various phases of development, and the evaluation metrics show noticeable improvements for our model. The model has been evaluated on different multilingual benchmark sign corpora and produces strong results in terms of recognition accuracy and visual quality. The proposed model secured a 38.06 average Bilingual Evaluation Understudy (BLEU) score, remarkable human evaluation scores, a 3.46 average Fréchet Inception Distance to videos (FID2vid) score, a 0.921 average Structural Similarity Index Measure (SSIM) value, an 8.4 average Inception Score, a 29.73 average Peak Signal-to-Noise Ratio (PSNR) score, a 14.06 average Fréchet Inception Distance (FID) score, and an average 0.715 Temporal Consistency Metric (TCM) score, which together evidence the effectiveness of the proposed work.
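
To make the recognition branch concrete, here is a hedged sketch of a hybrid CNN + Bi-LSTM recognizer of the kind described above: a small per-frame CNN feeds a bidirectional LSTM, followed by a sign classifier. Layer sizes and the number of classes are illustrative, not the paper's configuration.

```python
# Illustrative CNN + Bi-LSTM sign recognizer; all hyperparameters are assumed.
import torch
import torch.nn as nn

class CNNBiLSTM(nn.Module):
    def __init__(self, n_classes=100, hid=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.lstm = nn.LSTM(64, hid, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hid, n_classes)

    def forward(self, video):                 # video: (batch, frames, 3, H, W)
        B, T = video.shape[:2]
        feats = self.cnn(video.flatten(0, 1)).view(B, T, 64)  # per-frame features
        seq, _ = self.lstm(feats)
        return self.fc(seq[:, -1])            # classify from the last time step

clips = torch.randn(2, 16, 3, 112, 112)
print(CNNBiLSTM()(clips).shape)               # (2, 100)
```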

Adversarial Training for Multi-Channel Sign Language Production

arXiv, 2020

Sign Languages are rich multi-channel languages, requiring articulation of both manual (hands) and non-manual (face and body) features in a precise, intricate manner. Sign Language Production (SLP), the automatic translation from spoken to sign languages, must embody this full sign morphology to be truly understandable by the Deaf community. Previous work has mainly focused on manual feature production, with an under-articulated output caused by regression to the mean. In this paper, we propose an Adversarial Multi-Channel approach to SLP. We frame sign production as a minimax game between a transformer-based Generator and a conditional Discriminator. Our adversarial discriminator evaluates the realism of sign production conditioned on the source text, pushing the generator towards a realistic and articulate output. Additionally, we fully encapsulate sign articulators with the inclusion of non-manual features, producing facial features and mouthing patterns. We evaluate on the chall...
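
The minimax game described above can be sketched as a conditional discriminator scoring produced pose sequences against an encoding of the source text, with standard adversarial losses. The shapes and the simple MLP discriminator below are assumptions for illustration only:

```python
# Sketch of a text-conditioned adversarial objective for pose production.
import torch
import torch.nn as nn

class CondDiscriminator(nn.Module):
    def __init__(self, pose_dim=150, text_dim=512, hid=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim + text_dim, hid), nn.LeakyReLU(0.2),
            nn.Linear(hid, 1),
        )

    def forward(self, pose_seq, text_enc):
        # Score each frame given a pooled encoding of the source sentence.
        text = text_enc.mean(dim=1, keepdim=True).expand(-1, pose_seq.size(1), -1)
        return self.net(torch.cat([pose_seq, text], dim=-1))  # real/fake logits

bce = nn.BCEWithLogitsLoss()
real, fake = torch.randn(2, 40, 150), torch.randn(2, 40, 150)
text_enc = torch.randn(2, 12, 512)
disc = CondDiscriminator()
d_loss = bce(disc(real, text_enc), torch.ones(2, 40, 1)) + \
         bce(disc(fake, text_enc), torch.zeros(2, 40, 1))
g_loss = bce(disc(fake, text_enc), torch.ones(2, 40, 1))   # generator tries to fool D
print(d_loss.item(), g_loss.item())
```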