Meta Launches Multi-Modal AI Model for Speech and Text Translation

Meta has introduced SeamlessM4T, which it describes as the first all-in-one multilingual, multimodal AI translation and transcription model.

Depending on the task, the model can translate between speech and text across up to 100 languages. It performs speech recognition for nearly 100 languages, handles speech-to-speech translation from around 100 input languages into 36 output languages (including English), and supports text-to-speech translation for the same number of input languages and 35 output languages, including English.
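To make the task menu concrete, here is a minimal sketch of how these tasks can be invoked through seamless_communication, the open-source package Meta released alongside the model. The model and vocoder card names, task strings, and three-letter language codes below follow the public repository's examples, but they are assumptions that may differ across versions.

```python
import torch
from seamless_communication.models.inference import Translator

# Load the multitask SeamlessM4T model plus a vocoder for speech output
# (card names taken from the public repo; treat them as assumptions).
translator = Translator(
    "seamlessM4T_large",
    vocoder_name_or_card="vocoder_36langs",
    device=torch.device("cuda:0"),
    dtype=torch.float16,
)

# Speech-to-speech translation: English audio in, Spanish speech out.
text, wav, sample_rate = translator.predict(
    "input_english.wav", "s2st", tgt_lang="spa"
)

# Speech-to-text translation: the same audio in, translated text out.
translated_text, _, _ = translator.predict(
    "input_english.wav", "s2tt", tgt_lang="spa"
)

# Speech recognition: transcribe the audio in its own language.
transcript, _, _ = translator.predict(
    "input_english.wav", "asr", tgt_lang="eng"
)

# Text-to-speech translation: English text in, French speech out
# (text input requires the source language to be stated explicitly).
_, wav, sample_rate = translator.predict(
    "Hello, world.", "t2st", tgt_lang="fra", src_lang="eng"
)
```

Note that a single Translator object serves every task; only the task string and language codes change between calls.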

Meta says SeamlessM4T is available under a research license so that researchers and developers can use it as a foundation for their work. The company notes that existing speech-to-speech and speech-to-text systems cover only a small fraction of the world's languages, a gap SeamlessM4T aims to bridge.

Rather than chaining separate recognition, translation, and synthesis models, as many existing systems do, SeamlessM4T uses a single-system approach that Meta says reduces errors and delays, increases the efficiency and quality of the translation process, and improves how people who speak different languages communicate.
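As a hedged illustration of that contrast, the stub functions below are hypothetical stand-ins for the separate recognition, translation, and synthesis subsystems a cascaded pipeline would chain together; none of them are real APIs.

```python
# Hypothetical cascaded pipeline: each stub stands in for a separate model.
def transcribe(audio_path: str, src_lang: str) -> str:
    return "hello world"  # placeholder: speech recognition output

def translate_text(text: str, tgt_lang: str) -> str:
    return "hola mundo"  # placeholder: text-to-text translation output

def synthesize(text: str, tgt_lang: str) -> bytes:
    return b"\x00"  # placeholder: synthesized waveform

# Three chained calls: each hop adds latency, and a recognition error in
# the first step is carried into translation and synthesis unchecked.
transcript = transcribe("input_english.wav", src_lang="eng")
translated = translate_text(transcript, tgt_lang="spa")
speech = synthesize(translated, tgt_lang="spa")
```

With SeamlessM4T, the same speech-to-speech task is a single model call (the "s2st" call in the earlier sketch), which is where the claimed reductions in error accumulation and delay come from.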

Last year, Meta released No Language Left Behind (NLLB), a text-to-text machine translation model that supports 200 languages and has been integrated into Wikipedia to translate content into various languages.

SeamlessM4T draws on findings from these earlier projects to deliver a multilingual and multimodal translation experience from a single model trained on a wide range of spoken data sources.