As an artificial intelligence graduate student at the University of Washington, Jeffrey P. Bigham wanted to build systems to help people accomplish their goals, but he did not want to necessarily design new machine learning algorithms. He started working in the lab of computer science and engineering professor Richard Ladner, whose research focuses on accessibility technology for people with disabilities, especially for deaf, hard-of-hearing, deaf-blind, and blind people. One day Bigham watched a blind graduate student struggle to book an airline ticket via the Web. "That was powerful," recalls Bigham. "I saw this guy, who I really respected because of his intelligence and ability to get things done, having such difficulty. I feel that's something we shouldn't allow to happen and something we should be able to fix."
Since then, Bigham, now an assistant professor in the department of computer science at the University of Rochester, and one of this year's winners of a U.S. National Science Foundation (NSF) Career Award, has been doing just that by incorporating artificial intelligence techniques to improve user interfaces. When Bigham learned the blind community was stymied because Web screen readers, a highly specialized kind of software, were too expensive to install on most library and hotel business center computers, he had the topic for his dissertation. He subsequently developed WebAnywhere, a free Web-based screen reader that enables blind people to use any public computer with a sound card, without having to install screen-reader software. WebAnywhere was also an early example of research on cloud-based assistive technology.
Bigham next developed VizWiz, an iPhone application that enables blind people to recruit remote, sighted volunteers to help them with visual problems in near real time. With VizWiz, a blind user uses a smartphone to shoot a photo, say, of a can of soup he or she wishes to identify, speaks a question into the phone, and receives spoken answers from the sighted volunteers. Combined with services like Amazon's Mechanical Turk and social networks such as Facebook and Twitter "connecting people at all times on their mobile devices, the human cloud is ready and waiting," says Bigham. "We just need to figure out how to harness it to do useful work." So far, 3,500 people have downloaded the app, 50,000 questions have been asked, and the average answer time is under a minute.
With $500,000 of NSF award money, Bigham is taking such human-backed access technology a step further and applying it to video. The challenge now is how to develop an interactive system that provides high-quality feedback in real time while compensating for unreliable individuals and constant turnover in the crowd. Bigham says his project will advance a new model in which a diverse and dynamic group collectively acts as a single operator. Suppose, for example, a blind person shoots video as he or she walks along a street and then asks the crowd about the surroundings. Individuals in the crowd would input answers, and the system would calculate similarities between the different suggestions, then pick and forward the best one. But how does one determine when inputs are the same and whether they can be compared at all? "What's tricky," says Bigham, "is defining what it means to be similar." The inputs are different in the case of keyboard commands versus natural language strings, he explains. "Ideally, everyone would send their response in a particular millisecond, but people's inputs come in staggered, so we have to figure out how to associate inputs."
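One simple way to associate staggered inputs, offered here only as an illustrative sketch and not as the method Bigham's group uses, is to bucket answers into fixed time windows and treat answers as "the same" after a crude normalization. The function names, the half-second default, and the normalization rule are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def normalize(text):
    """Crude stand-in for similarity: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def consensus_per_window(inputs, window=0.5):
    """Group staggered inputs into time windows and pick each window's top answer.

    inputs: iterable of (timestamp_seconds, worker_id, answer) tuples.
    Returns a dict mapping window index to the most common normalized answer.
    Ties go to the answer that arrived first within the window.
    """
    buckets = defaultdict(list)
    for timestamp, _worker, answer in inputs:
        buckets[int(timestamp // window)].append(normalize(answer))
    return {
        index: Counter(answers).most_common(1)[0][0]
        for index, answers in sorted(buckets.items())
    }


sample = [
    (0.05, "a", "Turn left"),
    (0.31, "b", "turn  LEFT"),
    (0.42, "c", "turn right"),
    (0.70, "a", "Stop"),
    (0.95, "c", "stop"),
]
result = consensus_per_window(sample)
```

Exact matching after normalization would miss paraphrases such as "go left" versus "turn left," which is the harder similarity problem the article describes.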
Seeking a solution, Bigham's research group has compared voting with other ways of merging the input streams of many crowd workers. If the system simply forwards the most popular submission, it is too slow because it can issue a command only once per voting window; with a half-second window, for example, it can issue at most one command every half second. To approach real time, the group used votes not to determine the next input, but to elect a temporary leader. So if a blind person wanted instructions about turning left or right to reach a Starbucks, the crowd member who has responded most frequently with the crowd consensus becomes the leader because he or she is representative of the crowd's best decision. The leader's input gets forwarded immediately, so the group seems to act like an individual and responds faster. There is a tradeoff, of course. The leader could do something wrong that cannot be corrected, but that is unlikely because the leader's track record predicts future behavior. And if the leader does something the crowd disagrees with, he or she is dethroned and a new leader is chosen.
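The leader-election idea can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not the research group's implementation: the class name, the per-round dictionary of votes, and the rule that a new leader is chosen only when none exists or the current one disagrees with the consensus are all hypothetical choices made for the example.

```python
from collections import Counter, defaultdict


class LeaderElector:
    """Tracks how often each worker matches the crowd and elects a leader."""

    def __init__(self):
        self.agreement = defaultdict(int)  # worker -> rounds matching consensus
        self.leader = None

    def record_round(self, votes):
        """Update track records from one round of votes: {worker: command}.

        The leader is replaced only if there is none yet or the current
        leader voted against this round's consensus.
        """
        if not votes:
            return self.leader
        consensus = Counter(votes.values()).most_common(1)[0][0]
        agreeing = [w for w, command in votes.items() if command == consensus]
        for worker in agreeing:
            self.agreement[worker] += 1
        leader_disagreed = (
            self.leader in votes and votes[self.leader] != consensus
        )
        if self.leader is None or leader_disagreed:
            # Pick the agreeing worker with the best track record so far.
            self.leader = max(agreeing, key=lambda w: self.agreement[w])
        return self.leader

    def forward(self, worker, command):
        """Pass a command through immediately only if it is from the leader."""
        return command if worker == self.leader else None


elector = LeaderElector()
first = elector.record_round({"ann": "left", "bob": "left", "cy": "right"})
passed = elector.forward("ann", "left")
blocked = elector.forward("bob", "left")
second = elector.record_round({"ann": "right", "bob": "left", "cy": "left"})
```

In this run, the first leader is dethroned in the second round after voting against the majority, and the agreeing worker with the stronger track record takes over. Between rounds, only the leader's commands are forwarded, which is what lets the crowd respond without waiting for a vote.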
"That's a nice mid-point between ensuring reliability through the wisdom of crowd while also wanting control to be in real-time," says Bigham. The hardest aspect of this solution, however, is that "we don't know what it means to have a group work in real-time on the types of tasks usually done by an individual," he says. Furthermore, people are great at gaming systems, so ultimately Bigham would like machines with artificial intelligence to take over for the crowd.
One of the main problems with VizWiz, however, is that it is difficult for blind users to take the correct picture, so although a crowd worker responds within 60 seconds, it is often to tell the user that what they are asking about is not included in their photo. Five or six rounds of instructions to get the right photo may be required before the question is answered. The same difficulty could occur with moving images, but Bigham says correcting the shot will happen faster because the video is streamed, so feedback can be given instantly.
Bigham says these principles apply to other disabilities as well; hearing-impaired people, for example, could benefit from crowd transcriptions of audio.
When Bigham speaks to businesspeople, they often say blind users do not visit their Web sites. Take a hypothetical car dealership. Bigham counters that although relatively few Web users are blind, the businesspeople's response fails to recognize the broader context in which most of us live. "Blind people may well be providers of a family and are involved in the decision to buy a car," he says. To those who protest that Web pages are all-visual, he responds, "That's only because the browser rendered them that way. There's underlying code that determines what gets rendered visually and you can imagine interpreting it another way." Finally, to those concerned about cost or delaying a product's release, he says making a site more accessible can require work, but it is the same amount of work needed to make Web pages standards-compliant.
Karen A. Frenkel is a science and technology writer based in New York City.