If the AIs of the future are, as many tech companies seem to hope, going to look through our eyes in the form of AR glasses and other wearables, they'll need to learn how to make sense of the human perspective. We're used to it, of course, but there's remarkably little first-person video footage of everyday tasks out there — which is why Facebook collected a few thousand hours for a new publicly available data set.
The challenge Facebook is attempting to get a grip on is that even the most impressive object and scene recognition models today have been trained almost exclusively on third-person perspectives. A model can recognize a person cooking, but only if it sees that person standing in a kitchen, not if the view is from the person's own eyes. It can recognize a bike, but not from the perspective of the rider. It's a perspective shift we take for granted, because it's a natural part of our experience, but one that computers find quite difficult.
The solution to machine learning problems is generally either more or better data, and in this case it can't hurt to have both. So Facebook contacted research partners around the world to collect first-person video of common activities like cooking, grocery shopping, tying shoelaces or just hanging out.
The 13 partner universities collected thousands of hours of video from more than 700 participants in nine countries, and it should be said at the outset that the participants were volunteers who controlled the level of their own involvement and how they were identified. Those thousands of hours were whittled down to 3,000 by a research team that watched, edited and hand-annotated the video, adding footage of their own, shot in staged environments, for scenarios they couldn't capture in the wild. It's all described in this research paper.