Speechless? Here’s how AI learns to finish your sentences
We’ve all been there: Your manager pings you for the status of an important project. You write a message, reread it, edit, and tweak it once more. Even after you hit send, you reread it to make sure it conveys just the right meaning. For Nanshu Wang, a Software Engineer at Facebook, this was an everyday occurrence. “Scratch that—every hour,” she says. “English is my second language, and it’s my weakness. I always take extra time to make sure that I’m communicating effectively online.”
Algorithms have been helping people type faster for years. When you search for “best restaurants,” for instance, you might see useful suggestions like “near me” or “in San Jose.” Typing the letter “H” in a text message will prompt quick one-word suggestions like “Hi,” “Hey,” and “How.” But these algorithms are typically based just on how often key terms are searched or words are typed. Such pattern matching based on popular phrases doesn’t work when you’re chatting online. To complete full sentences, AI systems have to construct phrases that are not just logical and grammatically correct but also relevant to the broader context of the conversation, while filtering out offensive or biased language. “Near me,” for instance, may not make sense if you want to tell a friend: “I’m going to the best restaurant.”
And all this has to happen faster than a keystroke.
Recently, AI systems have matured enough to do just that: complete entire sentences that are specific to your conversation — as you’re typing.
Nanshu is leading this effort at Facebook. Predictive text suggestions are part of Workplace, Facebook’s work collaboration tool, helping people communicate more easily with their colleagues. Nanshu’s team is working on bringing this capability to other Facebook products. In the future, predictive text will evolve beyond the keyboard. This technology can potentially plug into virtual keyboards in immersive AR/VR experiences, where text suggestions would appear with a simple finger movement.
More broadly, predictive text makes communicating with colleagues, family, and friends easier and more convenient (even a few keystrokes saved add up over time). “It’s been really helpful for me so far in my day to day, personally, as a nonnative English speaker, and I’m really proud to bring this capability to help more people around the world,” Nanshu says. And as COVID-19 shifts millions of people toward remote work, Nanshu and her team hope that predictive text features will make communicating easier for everyone.
Finding a way with words
For machines to predict what word comes next, they first have to understand hundreds of thousands of words in trillions of combinations. Essentially, language models have to learn the structure of the language itself.
“One of the biggest accelerators of AI over the last decade is the rise of ‘generative’ language models, which learn to generate accurate sequences of data based on an incredibly complex distribution of data sets,” says Nanshu. “In our case, that’s an enormous volume of words.”
Since the model needs to factor in the sequence of words typed, the team used a generative model called a recurrent neural network (RNN) to make it happen. It’s the building block that lets them make correlations between phrases like “Happy anniversary” and “Congrats!” Most important, the representations of previous words in the conversation (technically called hidden states) allow AI systems to understand the context of a conversation. This isn’t perfect — no existing AI system has reached human-level language understanding — but it’s a lot more sophisticated than AI systems five years ago.
Special types of RNNs called long short-term memory (LSTM) are good at analyzing a long string of previous words so they can help make accurate predictions, even if there’s a big gap between the previous context and the current prediction. If you’re replying to a post about an intern who accepted a full-time position on your team, the language model will automatically suggest “Welcome to the team!” — even if multiple people introduced tangential topics, like “Let the good times roll!”
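The RNN and LSTM mechanics described above can be sketched in a few lines. This is a toy LSTM cell with made-up dimensions and untrained random weights — it illustrates how the gates let a cell state carry context across many intervening words, not the production model the article describes.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 6, 4

def _weights():
    # each gate mixes the current input with the previous hidden state
    return rng.normal(scale=0.1, size=(hidden_size, input_size + hidden_size))

W_f, W_i, W_o, W_c = _weights(), _weights(), _weights(), _weights()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev):
    """One LSTM step: gates decide what old context to keep and what to add."""
    z = np.concatenate([x, h_prev])
    f = sigmoid(W_f @ z)                  # forget gate: how much of c to keep
    i = sigmoid(W_i @ z)                  # input gate: how much new info to write
    o = sigmoid(W_o @ z)                  # output gate: what to expose as h
    c = f * c_prev + i * np.tanh(W_c @ z) # long-term cell state survives the step
    h = o * np.tanh(c)
    return h, c

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for _ in range(50):  # many intervening "words"; c still carries earlier context
    h, c = lstm_step(rng.normal(size=input_size), h, c)
```

Because the cell state `c` is updated multiplicatively by the forget gate rather than overwritten, information from early in the sequence can persist across a long gap — the property the article credits for suggesting “Welcome to the team!” despite tangents.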
The language model learns writing patterns (e.g., grammar rules) from billions of examples of comments on public Facebook posts. When you’re typing, the model looks at the text you’ve typed so far and analyzes a few hundred of the most recent words in the conversation to create a prediction. If the model calculates a strong confidence score in its prediction, you’ll see the suggestion waiting next to your cursor.
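The confidence gate described above can be sketched as follows. The candidate completions, their probabilities, and the threshold value are all illustrative — the article only says a suggestion appears when the model’s confidence score is strong.

```python
def pick_suggestion(candidates, threshold=0.8):
    """Return the highest-probability completion, or None if confidence is too low.

    candidates: dict mapping a candidate completion -> model probability.
    The threshold of 0.8 is a hypothetical value for illustration.
    """
    best, prob = max(candidates.items(), key=lambda kv: kv[1])
    return best if prob >= threshold else None

# A confident prediction surfaces next to the cursor...
shown = pick_suggestion({"to the team!": 0.92, "back!": 0.05})
# ...while an uncertain one is suppressed rather than shown.
hidden = pick_suggestion({"maybe": 0.40, "perhaps": 0.35})
```

The design choice matters for the product: showing nothing is better than showing a distracting or wrong completion, so the system only interrupts the typist when it is confident.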
“We can use the same AI model to help you type in real time when you search for a VR game or connect to a friend,” Nanshu says.
Helping you chat but not curse
When generative models learn from real-world public comments, they naturally learn all types of responses — including patterns of inappropriate behavior. To combat these inherent risks, the team worked with linguists to carefully scrutinize the system’s full vocabulary. The language model passes predictions through safety filters that prevent harmful suggestions. “By upkeeping a blocklist of offensive words and sensitive topics, we train our systems to minimize harmful or offensive words,” says Nanshu. The system’s not perfect, as people can always find new ways to express themselves, whether good or bad. But the team stays vigilant in keeping harmful content away from our suggestions. “We also work to remove information that could create inadvertent biases, like gender pronouns,” she adds.
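A post-hoc safety filter of the kind described above might look like the sketch below: drop any candidate suggestion containing a blocklisted term or a gendered pronoun. The word lists here are illustrative placeholders, not Facebook’s actual blocklist.

```python
# Hypothetical word lists -- stand-ins for the curated blocklist the team maintains.
BLOCKLIST = {"offensiveword"}
GENDERED_PRONOUNS = {"he", "she", "him", "her", "his", "hers"}

def is_safe(suggestion):
    """True if the suggestion contains no blocklisted or gendered terms."""
    words = {w.strip(".,!?").lower() for w in suggestion.split()}
    return not (words & BLOCKLIST or words & GENDERED_PRONOUNS)

def filter_suggestions(suggestions):
    """Keep only suggestions that pass the safety filter."""
    return [s for s in suggestions if is_safe(s)]
```

A real system would be far more sophisticated (handling obfuscated spellings, multi-word phrases, and sensitive topics), which is why the article notes the team must stay vigilant as people find new ways to express themselves.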
But good suggestions aren’t enough. If there were even a hint of delay in showing suggestions as you type, the tool wouldn’t be useful. “That means we have to show you our best suggestion in less than hundreds of milliseconds per keystroke,” Nanshu says. Hitting that bar requires a constant balancing act between generating quality predictions and rapidly processing and sending information from the back-end server to the community, she explains. That’s hard to do, given the size, complexity, and sophistication of language models.
The English vocabulary is so large, for instance, that the model can’t search through every possible sequence of words to find the optimal one. “Instead, we used a decoding algorithm, called beam search, that calculates the partial optimal sequence and limits the search space to just the most promising candidates,” Nanshu says.
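Beam search as Nanshu describes it can be sketched over a toy bigram model. The transition probabilities below are made up purely for illustration; the point is that the frontier is pruned to the top-k partial sequences at every step instead of exploring the full exponential space.

```python
import math

# Hypothetical bigram model: word -> {next word: probability}.
BIGRAMS = {
    "<s>": {"we": 0.6, "the": 0.4},
    "we": {"are": 0.7, "will": 0.3},
    "the": {"reports": 0.5, "status": 0.5},
    "are": {"done": 0.6, "late": 0.4},
    "will": {"finish": 1.0},
    "reports": {"arrived": 1.0},
    "status": {"changed": 1.0},
}

def beam_search(start="<s>", beam_width=2, length=3):
    """Keep only the beam_width best partial sequences at each step."""
    beams = [([start], 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(length):
        candidates = []
        for seq, score in beams:
            for word, p in BIGRAMS.get(seq[-1], {}).items():
                candidates.append((seq + [word], score + math.log(p)))
        # prune: only the most promising partial sequences survive
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return [" ".join(seq[1:]) for seq, _ in beams]
```

With a beam width of 2, the search tracks at most two hypotheses per step rather than every combination, which is what keeps decoding fast enough to run per keystroke.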
Nanshu and her team also use a technique called incremental decoding, which caches the representations of previous words as they evolve. By feeding exactly one new word at a time through beam search, the system significantly increases prediction speed. To make the system even more efficient, the team precalculates and caches the context features as soon as the person starts typing in the composer. So if the question is “What is the status of the TPS reports?” and you’ve started typing “We are,” the system sifts through tens of thousands of logical possibilities and assigns each a confidence score based on the correlations it learned during training. In this case, high-confidence completions include “in the process of” and “taking a look at.”
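The caching idea behind incremental decoding can be sketched independently of any real model. Here the cached “representation” is a trivial running hash — a hypothetical stand-in for the RNN hidden state — but the structure is the same: each new keystroke folds in exactly one word instead of re-encoding the whole conversation.

```python
class IncrementalContext:
    """Caches a running context representation so each new word is O(1) to add."""

    def __init__(self):
        self.state = 0   # stand-in for a cached hidden state
        self.steps = 0   # counts how many words were actually processed

    def feed(self, word):
        """Fold exactly one new word into the cached state."""
        self.state = (self.state * 31 + hash(word)) & 0xFFFFFFFF
        self.steps += 1
        return self.state

ctx = IncrementalContext()
# Precompute the conversation context as soon as typing begins...
for w in "what is the status of the TPS reports".split():
    ctx.feed(w)
# ...then each keystroke costs one step, not a re-run over the full history.
ctx.feed("we")
ctx.feed("are")
```

Without the cache, typing the tenth word would mean reprocessing all ten; with it, the cost per keystroke stays constant no matter how long the conversation gets.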
To improve processing speed, the large models transfer their knowledge to smaller networks. This step, called knowledge distillation, accelerated the model 14x in production. And when it’s showtime, specialized processors called graphics processing units (GPUs) make it possible to run through billions of words as you’re typing.
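The core of knowledge distillation is a training loss that pushes the small “student” model to match the large “teacher” model’s softened output distribution. The logits and temperature below are arbitrary illustrative values, not real model outputs.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax with a temperature; higher temperature softens the distribution."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's and student's softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -float(np.sum(p_teacher * np.log(p_student + 1e-12)))

teacher = [4.0, 1.0, 0.5]
aligned = distillation_loss([4.0, 1.0, 0.5], teacher)     # student agrees
mismatched = distillation_loss([0.5, 1.0, 4.0], teacher)  # student disagrees
```

Minimizing this loss lets a compact network inherit the teacher’s behavior at a fraction of the inference cost, which is how a 14x speedup can come without retraining a small model from scratch.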
The next evolution
Looking ahead, predictive text will be even more important for future innovations, like interacting with colleagues, friends, and family through VR or AR glasses.
As a next step, Nanshu’s team has to work through a host of hard computing challenges that come with AR/VR technology that’s often on-device. “The devices are an order of magnitude more restrictive when it comes to memory, processing power, and battery life,” Nanshu says.
And the systems are getting better all the time. Of course, even the most advanced AI language models today can’t perfectly predict every sentence. Nanshu explains that AI systems are far from understanding the abstract aspects of human intelligence — like sarcasm or reasoning — which are broader, wide-open challenges for the entire AI community. “The AI field is working toward a long-term goal of achieving human-level language understanding, which will make many applications, like autocomplete, a lot smarter and more useful,” Nanshu says. “Until then, we’re focused on iteratively improving the quality of our language models every day. And I hope it makes typing a little easier for everyone.”