Translatotron 2 Speech-to-Speech Translation Architecture

Last Updated : 23 Jul, 2025

A speech-to-speech translation system translates input audio from one language into another. Such systems are abbreviated as S2ST (Speech-to-Speech Translation) or, more generally, S2S (Speech-to-Speech) systems. Their primary objective is to enable communication among people who speak different languages.

Seq2Seq Model

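Traditionally, S2ST was built as a cascade of the following three types of systems:

  1. Automatic speech recognition (ASR), which transcribes the source speech into text
  2. Machine translation (MT), which translates that text into the target language
  3. Text-to-speech synthesis (TTS), which generates target speech from the translated text

The main drawback of such a system was that errors made in one stage propagate to the next and compound along the pipeline.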

In 2019, researchers at Google introduced Translatotron in the paper "Direct speech-to-speech translation with a sequence-to-sequence model". It was the first end-to-end sequence-to-sequence model for S2ST.

This was followed in 2022 by a modified architecture, presented in "Translatotron 2: High-quality direct speech-to-speech translation with voice preservation". In this article, we will look at the Translatotron 2 architecture in detail.

Translatotron 2 Architecture

You can refer to the image for the Translatotron 2 architecture for better understanding.

Translatotron 2 Architecture

The model takes the spectrogram of source audio as input and tries to predict two things:

  1. The target audio spectrogram
  2. The target-language phoneme sequence, which is also used as an input when predicting the target audio spectrogram

Phonemes are the basic units of sound in a language, the smallest units that distinguish one word from another. In many of the world's languages, syllables are built from various combinations of consonant (C) and vowel (V) phonemes.

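The model is trained on this dual objective. Once the target audio spectrogram is known, the sound wave can be reconstructed from it, for example with an inverse Fourier transform combined with phase estimation, or with a neural vocoder as described in the Vocoder section.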
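The dual objective can be pictured as the sum of a spectrogram reconstruction term and a phoneme prediction term. The snippet below is only a minimal sketch under stated assumptions: the function names, the use of mean squared error, and the `phoneme_weight` parameter are illustrative choices made for this article, not the loss configuration of the actual model.

```python
import math

def spectrogram_l2_loss(predicted, target):
    """Mean squared error between two spectrograms given as lists of frames."""
    total, count = 0.0, 0
    for pred_frame, tgt_frame in zip(predicted, target):
        for p, t in zip(pred_frame, tgt_frame):
            total += (p - t) ** 2
            count += 1
    return total / count

def phoneme_cross_entropy(probabilities, target_ids):
    """Average negative log-likelihood of the reference phoneme ids."""
    nll = [-math.log(step[idx]) for step, idx in zip(probabilities, target_ids)]
    return sum(nll) / len(nll)

def dual_objective_loss(pred_spec, tgt_spec, phoneme_probs, phoneme_ids,
                        phoneme_weight=1.0):
    # phoneme_weight is a hypothetical knob that balances the two terms.
    return (spectrogram_l2_loss(pred_spec, tgt_spec)
            + phoneme_weight * phoneme_cross_entropy(phoneme_probs, phoneme_ids))

# Toy example: 2 frames with 2 bins each, and 2 phoneme steps over 2 classes.
pred_spec = [[0.0, 1.0], [1.0, 0.0]]
tgt_spec = [[0.0, 1.0], [1.0, 1.0]]
phoneme_probs = [[0.5, 0.5], [0.25, 0.75]]
phoneme_ids = [0, 1]
loss = dual_objective_loss(pred_spec, tgt_spec, phoneme_probs, phoneme_ids)
```

In the toy example only one of the four spectrogram bins is wrong, so the reconstruction term is 0.25, and the phoneme term adds the average negative log-probability of the two reference phonemes.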

There are four main components: the encoder, the attention module, the decoder, and the speech synthesizer. Let's discuss each of them in detail.

Encoder

The encoder uses a Conformer architecture. The Conformer combines the Transformer with a convolutional neural network (CNN), hence the name. It was devised to combine the advantages of both: self-attention captures global context, while convolution captures local context.
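To make the combination of global and local context concrete, here is a rough sketch of how one Conformer block composes its modules: a half-step feed-forward module, self-attention, convolution, and a second half-step feed-forward module, each wrapped in a residual connection and followed by a final layer normalization. The `toy_*` modules are hypothetical stand-ins with no learned weights, and details such as per-module normalization and dropout are omitted.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual(frames, module, scale=1.0):
    """Return frames + scale * module(frames), element by element."""
    update = module(frames)
    return [[x + scale * u for x, u in zip(f, d)] for f, d in zip(frames, update)]

def conformer_block(frames, ffn_in, self_attention, convolution, ffn_out):
    frames = residual(frames, ffn_in, scale=0.5)   # half-step feed-forward
    frames = residual(frames, self_attention)      # global context
    frames = residual(frames, convolution)         # local context
    frames = residual(frames, ffn_out, scale=0.5)  # half-step feed-forward
    return [layer_norm(f) for f in frames]

def toy_ffn(frames):
    """Stand-in feed-forward module: a ReLU applied to every value."""
    return [[max(0.0, v) for v in f] for f in frames]

def toy_global_mix(frames):
    """Stand-in for self-attention: every frame sees the mean of all frames."""
    dim = len(frames[0])
    mean = [sum(f[i] for f in frames) / len(frames) for i in range(dim)]
    return [mean[:] for _ in frames]

def toy_local_mix(frames):
    """Stand-in for convolution: each frame sees only its direct neighbours."""
    dim = len(frames[0])
    out = []
    for t in range(len(frames)):
        window = frames[max(0, t - 1): t + 2]
        out.append([sum(f[i] for f in window) / len(window) for i in range(dim)])
    return out

frames = [[1.0, -2.0], [0.5, 3.0], [-1.0, 4.0]]  # 3 time steps, 2 features
encoded = conformer_block(frames, toy_ffn, toy_global_mix, toy_local_mix, toy_ffn)
```

The output keeps the same number of time steps and features as the input, which is what allows several such blocks to be stacked.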

Conformer Encoder Model Architecture

Feed Forward Module

Convolution Module

Multi-head Self-attention Module

Attention Module

The attention module serves as the bridge that connects all elements of the Translatotron 2 architecture: the encoder, the decoder, and the synthesizer. It plays a dual role by modeling both the linguistic and the acoustic alignment between the source and target speech. It is a multi-head attention mechanism whose queries come from the linguistic decoder. Its primary function is to capture the alignment between a sequence of source spectrogram frames and a shorter sequence of target phonemes.

Furthermore, the attention module provides valuable acoustic information from the source speech to the synthesizer, presenting it in a summarized form at the per-phoneme level. This summarized acoustic information not only proves to be generally adequate for the speech generation process but also simplifies the task of predicting phoneme durations since it aligns with the same granularity.

The number of attention heads is 8. The hidden dimension of 512 is divided among the heads, giving 64 dimensions per head. The output dimension of the attention module is therefore the same as that of the input: (batch, sub-sampled time, encoder_dim).
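The head arithmetic above can be checked with a small sketch of multi-head attention for a single decoder query over the encoder frames. This is a simplified illustration: the learned query, key, value, and output projections of a real attention layer are left out, and the function names are chosen for this article only.

```python
import math

NUM_HEADS = 8
HIDDEN_DIM = 512
HEAD_DIM = HIDDEN_DIM // NUM_HEADS  # 64 dimensions per head

def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def split_heads(vector, num_heads):
    """Cut one vector into num_heads equal slices."""
    head_dim = len(vector) // num_heads
    return [vector[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]

def multi_head_attention(query, keys, values, num_heads):
    """Attend from one query vector over a sequence of encoder frames.

    Simplification: no learned projections are applied.
    """
    q_heads = split_heads(query, num_heads)
    k_heads = [split_heads(k, num_heads) for k in keys]
    v_heads = [split_heads(v, num_heads) for v in values]
    context = []
    for h in range(num_heads):
        head_dim = len(q_heads[h])
        scale = math.sqrt(head_dim)
        # Scaled dot-product score of the query against every frame.
        scores = [sum(q * k for q, k in zip(q_heads[h], k_heads[t][h])) / scale
                  for t in range(len(keys))]
        weights = softmax(scores)
        # Weighted sum of the value slices for this head.
        context.extend(
            sum(w * v_heads[t][h][i] for t, w in enumerate(weights))
            for i in range(head_dim)
        )
    return context  # concatenated heads: same length as the query

query = [0.1] * HIDDEN_DIM
encoder_frames = [[float(t)] * HIDDEN_DIM for t in range(3)]
context = multi_head_attention(query, encoder_frames, encoder_frames, NUM_HEADS)
```

Because the eight 64-dimensional head outputs are concatenated, `context` again has 512 values, matching encoder_dim.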

Decoder

The autoregressive decoder is responsible for producing the linguistic information in the translated speech. It takes the output of the attention module and predicts a phoneme sequence corresponding to the translated speech. It uses an LSTM stack whose dimension is the same as encoder_dim, with between 4 and 6 layers. The output of the LSTM stack is passed through a projection layer that converts it to the phoneme embedding dimension, which is typically 256.
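"Autoregressive" means that each predicted phoneme is fed back in to predict the next one. The loop below is a hedged sketch of that idea using greedy selection. The `step_fn` interface is hypothetical and stands in for the attention, LSTM stack, and projection layer; a real system may use a different search strategy such as beam search.

```python
def greedy_decode(step_fn, initial_state, bos_id, eos_id, max_steps=50):
    """Generate phoneme ids one at a time until end-of-sequence.

    step_fn(prev_id, state) -> (scores, new_state) is a hypothetical
    interface standing in for the real decoder step.
    """
    phonemes, state, prev_id = [], initial_state, bos_id
    for _ in range(max_steps):
        scores, state = step_fn(prev_id, state)
        # Pick the highest-scoring phoneme and feed it back in next step.
        prev_id = max(range(len(scores)), key=scores.__getitem__)
        if prev_id == eos_id:
            break
        phonemes.append(prev_id)
    return phonemes

def toy_step(prev_id, state):
    """Toy decoder step that emits ids 2, 3, 4 and then end-of-sequence (1)."""
    script = [2, 3, 4, 1]
    target = script[min(state, len(script) - 1)]
    scores = [1.0 if i == target else 0.0 for i in range(5)]
    return scores, state + 1

decoded = greedy_decode(toy_step, initial_state=0, bos_id=0, eos_id=1)  # [2, 3, 4]
```

The `max_steps` limit guards against a decoder that never emits the end-of-sequence id.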

Speech Synthesizer

The synthesizer assumes the role of acoustically generating the translated speech. It accepts two inputs: the intermediate output from the decoder (prior to the final projection and softmax for phoneme prediction) and the contextual output derived from the attention mechanism. These inputs are concatenated, and the synthesizer utilizes this combined information to produce a Mel-Spectrogram that corresponds to the translated speech.
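The concatenation step can be sketched in a few lines. The function name and the toy dimensions are assumptions made for illustration: for each phoneme step, the decoder's hidden state and the attention context for that step are joined into one longer vector.

```python
def build_synthesizer_input(decoder_states, attention_contexts):
    """Concatenate, per phoneme step, the decoder state and attention context."""
    if len(decoder_states) != len(attention_contexts):
        raise ValueError("expected one attention context per decoder step")
    return [list(h) + list(c) for h, c in zip(decoder_states, attention_contexts)]

# Two phoneme steps: 2-dimensional decoder states, 1-dimensional contexts.
decoder_states = [[1.0, 2.0], [3.0, 4.0]]
attention_contexts = [[0.5], [0.25]]
synth_input = build_synthesizer_input(decoder_states, attention_contexts)
# synth_input == [[1.0, 2.0, 0.5], [3.0, 4.0, 0.25]]
```

Both inputs have one entry per phoneme, so the synthesizer receives linguistic and acoustic information at the same granularity.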

Speech Synthesizer of Translatotron 2

The speech synthesizer in Translatotron 2 is adopted from NAT (Non-Attentive Tacotron). NAT first predicts a duration and a range of influence for each token in the input sequence. With these two values, it applies Gaussian upsampling to expand the per-token input to the length of the output. An LSTM stack then generates the target spectrogram, and a final residual convolutional block further refines it.
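Gaussian upsampling can be sketched as follows. Each token gets a Gaussian centered in the middle of its predicted time span, with the range of influence as its standard deviation, and every output frame is a weighted average of all token vectors under those Gaussians. This is a minimal illustration: in the real model the durations and ranges are predicted by the network, and placing frame `t` at position `t + 0.5` is a choice made for this sketch.

```python
import math

def gaussian_upsample(token_vectors, durations, ranges):
    """Expand per-token vectors into per-frame vectors.

    durations: length of each token in frames.
    ranges: standard deviation of each token's Gaussian (range of influence).
    """
    num_frames = int(round(sum(durations)))
    # Centre of each token: end of the previous tokens plus half its duration.
    centers, elapsed = [], 0.0
    for d in durations:
        centers.append(elapsed + d / 2.0)
        elapsed += d
    dim = len(token_vectors[0])
    frames = []
    for t in range(num_frames):
        position = t + 0.5  # assumption: frame t sits at the middle of its slot
        # Gaussian density per token; the constant sqrt(2*pi) cancels out.
        densities = [math.exp(-0.5 * ((position - c) / s) ** 2) / s
                     for c, s in zip(centers, ranges)]
        total = sum(densities)
        weights = [d / total for d in densities]
        frames.append([sum(w * vec[i] for w, vec in zip(weights, token_vectors))
                       for i in range(dim)])
    return frames

# Two tokens lasting 2 and 3 frames produce 5 output frames.
tokens = [[1.0, 0.0], [0.0, 1.0]]
frames = gaussian_upsample(tokens, durations=[2.0, 3.0], ranges=[0.3, 0.3])
```

With a small range such as 0.3, each frame is dominated by the token whose span it falls in. A larger range blends neighbouring tokens more smoothly across the boundary.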

Vocoder

The spectrogram is subsequently fed into a vocoder, a name formed from "voice" and "encoder". In this setting, the vocoder synthesizes the time-domain speech waveform from the information contained in the spectrogram. A well-known example is WaveNet, a generative model implemented as a deep neural network for generating time-domain waveforms. WaveNet excels at producing audio signals that closely resemble human speech.

Voice Preservation

In Translatotron 1, Google used a separate speaker encoder to generate embeddings of the speaker's voice, which were fed to the speech synthesizer. This helped preserve the source speaker's voice in the translated speech. However, it had a major drawback: it could be misused to generate fake voices by manipulating the speaker encoder embedding.

To mitigate this risk, Translatotron 2 relies on a speaker encoder only at training time. The model learns voice preservation by being trained on parallel utterances that have the same speaker's voice on both sides. Since such a dataset is very difficult to obtain, a TTS (text-to-speech) model with a speaker encoder was used to generate the training examples.

Conclusion

When Google introduced Translatotron 1 for end-to-end S2ST, it performed well but could not match the performance of cascade S2ST. Translatotron 2 was able to match it. According to Google, the primary improvement comes from the high-level architecture, that is, the way the attention module connects the encoder, decoder, and speech synthesizer. The architectural choices for the individual components also helped improve performance, but one can always experiment with those.

In June 2023, Google released Translatotron 3: Speech-to-Speech Translation with Monolingual Data. The core architecture of Translatotron 3 was the same as that of Translatotron. The major highlight of this paper, however, was the ability to train speech-to-speech translation without supervision. This means that even without a corpus of translated data between two languages, Translatotron 3 can be trained to learn the mapping between them, as long as an individual dataset is available for each language. We will explore Translatotron 3 in a future article.