And we’ve achieved an industry first by making it possible to include information about the position and relative size of elements in a photo. So instead of describing the contents of a photo as “May be an image of 5 people,” we can specify that there are two people in the center of the photo and three others scattered toward the fringes, implying that the two in the center are the focus. Or, instead of simply describing a lovely landscape with “May be a house and a mountain,” we can highlight that the mountain is the primary object in the scene based on how large it appears in comparison with the house at its base.
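To make the idea concrete, here is a minimal sketch of how position and relative size could be derived from object detections to produce a richer description. The Detection type, the centrality threshold, and the phrasing below are illustrative assumptions, not Facebook’s actual pipeline.

```python
# Hypothetical sketch: deriving position and relative size from detections.
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    # Normalized bounding box: (x_min, y_min, x_max, y_max), each in [0, 1]
    box: tuple

def area(det: Detection) -> float:
    x0, y0, x1, y1 = det.box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)

def is_central(det: Detection, margin: float = 0.25) -> bool:
    """Treat a detection as central if its box center falls in the middle region."""
    x0, y0, x1, y1 = det.box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    return margin < cx < 1 - margin and margin < cy < 1 - margin

def describe(detections: list) -> str:
    if not detections:
        return "May be an image."
    central = [d for d in detections if is_central(d)]
    peripheral = [d for d in detections if not is_central(d)]
    largest = max(detections, key=area)  # biggest box suggests the primary object
    parts = []
    if central:
        parts.append(f"{len(central)} {central[0].label}(s) in the center")
    if peripheral:
        parts.append(f"{len(peripheral)} more toward the edges")
    parts.append(f"largest object: {largest.label}")
    return "May be an image of " + "; ".join(parts) + "."

# Example: two people centered, three scattered toward the edges
dets = [
    Detection("person", (0.40, 0.35, 0.50, 0.70)),
    Detection("person", (0.52, 0.33, 0.62, 0.72)),
    Detection("person", (0.02, 0.60, 0.10, 0.90)),
    Detection("person", (0.90, 0.55, 0.98, 0.85)),
    Detection("person", (0.05, 0.05, 0.12, 0.30)),
]
print(describe(dets))
```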
Taken together, these advancements help users who are blind or visually impaired better understand what’s in photos posted by their family and friends — and in their own photos — by providing more (and more detailed) information.
Where we started
The concept of alt text dates back to the early days of the internet, when it gave people on slow dial-up connections a text alternative to downloading bandwidth-intensive images. Of course, alt text also helped people who are blind or visually impaired navigate the internet, since it can be used by screen reader software to generate spoken image descriptions. Unfortunately, faster internet speeds made alt text less of a priority for many users. And since these descriptions had to be added manually by whoever uploaded an image, many photos began to feature no alt text at all, leaving no recourse for the people who had relied on it.
Nearly five years ago, we leveraged Facebook’s computer vision expertise to help solve this problem. The first version of AAT was built on human-labeled data: we trained a deep convolutional neural network on millions of examples in a supervised fashion. Our completed AAT model could recognize 100 common concepts, like “tree,” “mountain,” and “outdoors.” And since people who use Facebook often share photos of friends and family, our AAT descriptions used facial recognition models that identified people (as long as those people gave explicit opt-in consent). For people who are blind or visually impaired (BVI), this was a giant step forward.
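For illustration, a supervised setup of this kind might look roughly like the following PyTorch sketch. The ResNet-50 backbone, the multi-label loss, and the training step are assumptions made for the sake of example, not the actual AAT architecture.

```python
# Hedged sketch of supervised training on human-labeled photos.
# The concept list and dataset here are placeholders.
import torch
import torch.nn as nn
from torchvision import models

NUM_CONCEPTS = 100  # e.g., "tree", "mountain", "outdoors", ...

# A standard convolutional backbone with a 100-way classification head.
model = models.resnet50(num_classes=NUM_CONCEPTS)

criterion = nn.BCEWithLogitsLoss()  # multi-label: a photo can contain several concepts
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

def train_step(images, labels):
    """images: (N, 3, H, W) float tensor; labels: (N, NUM_CONCEPTS) multi-hot tensor."""
    optimizer.zero_grad()
    logits = model(images)
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```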
Seeing more of the world
But we knew there was more that AAT could do, and the next logical step was to expand the number of recognizable objects and refine how we described them.
To achieve this, we moved away from fully supervised learning with human-labeled data. While this method delivers precision, the time and effort involved in labeling data are extremely high — and that’s why our original AAT model reliably recognized only 100 objects. Recognizing that this approach would not scale, we needed a new path forward.
For our latest iteration of AAT, we leveraged a model trained on weakly supervised data in the form of billions of public Instagram images and their hashtags. To make our models work better for everyone, we fine-tuned them, sampling data from images across all geographies and using translations of hashtags in many languages. We also evaluated our concepts along gender, skin tone, and age axes. The resulting models are both more accurate and more culturally and demographically inclusive. For instance, they can identify weddings around the world based in part on traditional apparel, instead of labeling only photos featuring white wedding dresses.
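As a rough illustration of weakly supervised training with hashtags as noisy labels, here is a minimal sketch. The hashtag vocabulary, the mapping of translated hashtags to canonical tags, the backbone choice, and the hyperparameters are all placeholder assumptions; the production setup, hashtag cleanup, and geographic sampling are far more involved.

```python
# Hedged sketch: weakly supervised pretraining using hashtags as noisy labels.
import torch
import torch.nn as nn
from torchvision import models

# Hypothetical vocabulary; translated variants map to one canonical tag index.
HASHTAG_VOCAB = {"#wedding": 0, "#boda": 0, "#mariage": 0, "#mountain": 1, "#dog": 2}
NUM_TAGS = len(set(HASHTAG_VOCAB.values()))

def hashtags_to_target(hashtags):
    """Build a multi-hot target vector from a post's hashtags (unknown tags ignored)."""
    target = torch.zeros(NUM_TAGS)
    for tag in hashtags:
        idx = HASHTAG_VOCAB.get(tag.lower())
        if idx is not None:
            target[idx] = 1.0
    return target

model = models.resnext101_32x8d(num_classes=NUM_TAGS)  # large backbone for pretraining
criterion = nn.BCEWithLogitsLoss()  # hashtags provide noisy, multi-label supervision
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)

def pretrain_step(images, hashtag_lists):
    """images: (N, 3, H, W) tensor; hashtag_lists: list of N lists of hashtag strings."""
    targets = torch.stack([hashtags_to_target(tags) for tags in hashtag_lists])
    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```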