Traditional machine translation (MT) models require large quantities of translated text. A separate model is then built for each direction that people want to translate, from one language to another; this is known as a bilingual model. The approach does not scale to many languages, since building and maintaining a model for each possible language pair means thousands of models and excessive computational cost.
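To make the scaling problem concrete, here is a small arithmetic sketch. The language counts are illustrative and not figures from this article: covering every translation direction among n languages with bilingual models takes n × (n − 1) models, whereas a multilingual approach aims for one.

```python
def bilingual_models_needed(num_languages: int) -> int:
    """Count directed language pairs (A->B and B->A are separate models)."""
    return num_languages * (num_languages - 1)


# Illustrative language counts only; not taken from the article.
for n in (10, 50, 100):
    print(f"{n} languages -> {bilingual_models_needed(n)} bilingual models "
          f"vs. 1 multilingual model")
```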
That’s why researchers are looking at a new approach called “multilingual models” as the way forward. These are models that build a representation of text that is common to all languages. Every language has words for common objects and properties, such as "dog" and "fluffy". The words for these two concepts often occur next to each other, no matter the language. This allows us to learn language-independent meaning representations that enable transfer of knowledge from one language to another.
This year, for the first time, a single multilingual model outperformed the best specially trained traditional bilingual models and won WMT, a prestigious MT competition, where these models are put through a series of tests, including human evaluation, to judge their accuracy.
Until now, multilingual models have not handled high-resource languages as well as their bilingual counterparts. Low-resource languages present an added challenge: it’s easy for the model to fit the small amount of data it is trained on yet fail to generalize to other sentences. The WMT win shows that multilingual models have tremendous promise as we work toward our vision of universal translation.
We sat down with Philipp Koehn, a Meta AI research scientist, author of Statistical Machine Translation and Neural Machine Translation, to talk about the latest advances in MT, the newest challenges for the field, and the potential to make sci-fi translators into a reality.
Can you tell us what the award-winning multilingual model your team pioneered means for automatic translations?
Philipp Koehn: Today there’s a significant imbalance in the coverage of MT technology: Language pairs with vast volumes of training data, such as French-English, can be automatically translated close to human quality, but there are still hundreds of low-resource languages for which no MT systems exist at all. Translation has the power to provide access to information that would otherwise be out of reach. It’s important that translation technology be inclusive of everyone around the world, regardless of data scarcity.
We’ve come quite a long way since the very beginning of MT, where researchers painstakingly built rules for each and every translation task.
Now, not only is the single multilingual model more efficient to develop via new scaling and data optimization work, but it also brings better-quality translations than bilingual models, across both high- and low-resource languages. This work holds promise in bringing high-quality translations to more languages, which was not possible before.
How quickly do you think we can bring these translation improvements to the billions of people using Facebook and Meta’s other platforms, especially people who speak low-resource languages?
PK: Meta’s latest WMT multilingual model translates many very different language pairs using a single model, and this is a major milestone. Having a single model rather than training specialized models for each language direction makes creating and deploying new models a lot more feasible, particularly when scaling to more and more languages. But productionizing it at the rate of 20 billion translations daily on Facebook, Instagram, and our other platforms is a research direction in its own right. The Meta AI team has a separate arm focused on deploying these large multilingual models. For example, we’ve productionized a previous version of a multilingual model that’s currently helping to proactively detect hate speech, even in languages for which there’s little training data, which is important for keeping people safe on our platform around the world.
While the latest WMT multilingual model is still too big to be deployed in real-time settings, the learnings from building these models will improve the production MT system in the near future.
If multilingual is the path toward universal translation, what challenges do we still need to overcome? How far away are we?
PK: Multilingual models pose serious computational challenges due to their large scale and the vast amounts of training data needed to train them. Hence research into more efficient training methods has been essential.
But there are a host of additional challenges. Modeling challenges range from the balancing of the different types of data (including synthesized data from translating the text back to its source language) to the open questions around how the neural architecture should accommodate language-specific parameters.
The architecture of multilingual models is not yet settled. Early efforts introduced language-specific components. At the other end of the spectrum, you could have a traditional model and feed it a series of translated texts in several languages, tagged with a language token to specify the output language. Most researchers believe that some form of language-specific parameters needs to augment a general model. But it’s not yet clear whether these should be hard-coded by language or whether the model should be tasked with learning how specialized parameters can best be used.
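The language-token approach can be illustrated with a minimal sketch. The token format `<2xx>` and the helper name `tag_for_target` are assumptions chosen for illustration; the article does not specify how the WMT model encodes the output language.

```python
def tag_for_target(source_sentence: str, target_lang: str) -> str:
    """Prepend a target-language token so a single model can be told
    which language to produce. The "<2xx>" format is hypothetical."""
    return f"<2{target_lang}> {source_sentence}"


# The same shared model would see all of these inputs; only the tag differs.
examples = [
    ("How are you?", "fr"),
    ("How are you?", "de"),
    ("Wie geht es dir?", "en"),
]
for sentence, lang in examples:
    print(tag_for_target(sentence, lang))
```

With this kind of tagging, no architectural change is needed: the token is just another input symbol, and the model learns during training to associate it with the requested output language.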
There is always the question of whether bigger is better. A language pair with lots of data will likely benefit from a bigger model, but low-resource language pairs risk overfitting. We were able to overcome this with the WMT model. But as we add more languages, these two concerns need to be accommodated at the same time.
What are the most promising solutions to address these challenges? How is the MT community working together to achieve them?
PK: At Meta, the teams are engaged in a concerted effort to cover a much larger number of languages in a multilingual model and utilize it for many applications. This involves all aspects of the problem: modeling, training, data, and productionizing.
In terms of modeling and architecture challenges, we have seen the most success with models that selectively use subsets of parameters, based on the input. One such model uses an ensemble of multiple alternative layers of a large model and allows the model to select a subset of them. Given the large amount of training data, it is not surprising that bigger models yield better results, but careful selection of hyperparameters is important to achieve that outcome.
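One common way to realize input-dependent selection among alternative layers is sparse top-k gating. The sketch below is a toy illustration under that assumption, not the mechanism of the WMT model itself; the names `route` and `gate_weights`, the linear gate, and the toy layers are all hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(x: Vector, layers: Sequence[Layer],
          gate_weights: Sequence[Vector], k: int = 2) -> Vector:
    """Score every alternative layer from the input, run only the k
    highest-scoring ones, and mix their outputs by softmax weight."""
    # Hypothetical linear gate: one score per layer, computed from x.
    scores = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in gate_weights]
    top = sorted(range(len(layers)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in top])
    output = [0.0] * len(x)
    for weight, i in zip(weights, top):
        y = layers[i](x)  # only the selected layers are evaluated
        output = [o + weight * v for o, v in zip(output, y)]
    return output


# Three toy "alternative layers" standing in for large sub-networks.
layers = [
    lambda v: [2.0 * a for a in v],
    lambda v: [a + 1.0 for a in v],
    lambda v: [-a for a in v],
]
gate_weights = [[1.0, 0.0], [0.0, 0.5], [-1.0, -1.0]]
print(route([1.0, 2.0], layers, gate_weights, k=2))
```

Because only k of the layers run for any given input, total parameter count can grow without a proportional increase in per-sentence compute, which is one reason this family of designs is attractive at large scale.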
For many of the possible translation directions, the only parallel data available was originally translated through a pivot language. Think of the many translations of the Bible from which, say, an Estonian-Nepali parallel corpus can be extracted, but each Bible version was translated from a third language (be it Greek, Latin, or English). Since we don’t want the training to be dominated by such data, we combine the high-quality training data (often paired with English) with parallel data only for some language pairs: translations between representative languages of each language family, grouped by linguistic and data-driven analysis.
It’s also important to consider the varying quality, relevance, and provenance of the training data. Staging the training data in a curriculum (e.g., reducing the data size toward the best subsets) typically gives better results.
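A curriculum of this kind might be sketched as follows. This is a minimal illustration that assumes each sentence pair already carries a quality score; the scores, the `keep_fractions` schedule, and the function name `curriculum_stages` are hypothetical and not details of the WMT training setup.

```python
from typing import List, Sequence, Tuple

# (quality score, source sentence, target sentence); scores are hypothetical.
ScoredPair = Tuple[float, str, str]


def curriculum_stages(data: List[ScoredPair],
                      keep_fractions: Sequence[float] = (1.0, 0.5, 0.25)
                      ) -> List[List[ScoredPair]]:
    """Return one training set per stage, each keeping a smaller,
    higher-quality slice of the data than the stage before it."""
    ranked = sorted(data, key=lambda pair: pair[0], reverse=True)
    stages = []
    for fraction in keep_fractions:
        keep = max(1, int(len(ranked) * fraction))
        stages.append(ranked[:keep])
    return stages


toy_data = [
    (0.9, "Good morning.", "Bonjour."),
    (0.4, "click here", "cliquez"),
    (0.7, "Thank you very much.", "Merci beaucoup."),
    (0.2, "lorem ipsum", "lorem ipsum"),
]
for stage_number, stage in enumerate(curriculum_stages(toy_data), start=1):
    print(f"stage {stage_number}: {len(stage)} pairs")
```

Training would then proceed stage by stage, so the final updates come from the cleanest subset.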
These techniques are promising, but progress toward solving open challenges has always been cumulative. Last June, we open-sourced FLORES-101, a first-of-its-kind, many-to-many evaluation data set covering 101 languages from all over the world. This gave researchers a tool to rapidly test and improve upon multilingual translation models, like M2M-100. Additional progress will happen over time through open science, as researchers across the industry build on top of this work, as well as research from other labs and companies. That’s why we’ve published the WMT model and released its code to the wider AI research community, just as we’ve done in the past by sharing research and tools, organizing shared tasks, and funding academia to collectively push research forward.
How might these multilingual advancements help the AI field overall?
PK: The move to large multilingual models mirrors a broader trend in AI. Many advanced natural language models are not built as specialized systems anymore, but rather on top of massive language models.
One may view this as a push toward general intelligence: AI systems that are capable of addressing many different problems and cross-applying knowledge between them. In the same spirit, multilingual translation models solve the general translation problem, not the specific problem of a particular language pair.
Multilingual is a step in that direction. It leads to more flexible systems that can serve more tasks. It is more efficient because it frees up capacity, which allows us to roll out new features instantly to people around the world. Finally, it’s closer to human thinking. As humans, we don’t have specialized models for each task; we have one brain that does many different things. Multilingual models, just like pretrained models, are bringing us closer to that.
As one of the pioneers of modern MT, what do you think the future of translation looks like over the next 10 years? Will we achieve the goal of universal translation?
PK: That is hard to predict. Ten years ago, I would not have predicted the hard turn from statistical to neural methods. What is safe to say, though, is that we will see continued improvements in translation quality and in the number of languages covered by translation technology, leading to broader applications. Many people on the Facebook platform already expect that, with a single click, they can translate posts in languages they do not understand. Sometimes they do not even have to click, and the translation is automatically displayed. This kind of seamless integration is an example of how translation technology will be employed: invisible to the users, who just use their favorite language while everything just works. There is some exciting research on speech translation at Meta, which promises to bring this kind of seamless integration into the spoken realm.