Researchers at the Massachusetts Institute of Technology (MIT) have developed a system that can learn to identify objects within an image, based on a spoken description of the image.
Given an image and an audio caption, the system highlights, in real time, the regions of the image being described.
The system learns words directly from recorded speech clips and objects directly from raw images, then associates the two with one another.
The researchers trained the model on a total of 400,000 image-caption pairs, and held out 1,000 random pairs for testing.
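The held-out evaluation described above can be illustrated with a minimal sketch. This is not the researchers' code: the function name `hold_out_split`, the fixed seed, and the placeholder file names are all hypothetical, and the sketch assumes the 1,000 test pairs are drawn from the same pool of 400,000, which the summary does not state.

```python
import random


def hold_out_split(pairs, n_test, seed=0):
    """Split pairs into (train, test), picking n_test pairs uniformly at random."""
    if not 0 <= n_test <= len(pairs):
        raise ValueError("n_test must be between 0 and len(pairs)")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    indices = list(range(len(pairs)))
    rng.shuffle(indices)
    test_idx = set(indices[:n_test])
    # Training keeps the original order; test follows the shuffled order.
    train = [p for i, p in enumerate(pairs) if i not in test_idx]
    test = [pairs[i] for i in indices[:n_test]]
    return train, test


# Hypothetical placeholder identifiers standing in for image/audio-caption pairs.
pairs = [(f"image_{i}.jpg", f"caption_{i}.wav") for i in range(400_000)]
train, test = hold_out_split(pairs, n_test=1_000)
print(len(train), len(test))
```

Keeping the test pairs out of training means the evaluation measures how well the model matches captions to images it has never seen, rather than how well it memorized its training data.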
Said researcher David Harwath, “We wanted to do speech recognition in a way that’s more natural, leveraging additional signals and information that humans have the benefit of using, but that machine learning algorithms don’t typically have access to. We got the idea of training a model in a manner similar to walking a child through the world and narrating what you’re seeing.”
From MIT News
Abstracts Copyright © 2018 Information Inc., Bethesda, Maryland, USA