Google recently added the one trillionth Web address to its list of indexed Web pages, and yet that represents only a small portion of the entire Web. Beyond the one trillion pages is an even larger Web of data, including financial information, shopping catalogs, flight schedules, medical research, and many other kinds of data that are largely hidden from search engines. This so-called Deep Web represents a major challenge to large search engines and prevents them from providing meaningful responses to many queries.
Search engines' crawlers, which collect information by following hyperlinks, work well for pages that are on the surface of the Web, but fail to penetrate databases. To collect meaningful data from the Deep Web, search engines must be able to analyze users' search terms and determine how to direct those queries toward databases, but the wide variety of database structures and possible search terms makes this a daunting challenge. "This is the most interesting data integration problem imaginable," says Google's Alon Halevy.
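The crawling model described above can be illustrated with a minimal sketch. Everything here is hypothetical: the in-memory site graph stands in for real HTTP fetches, and the point is only that a link-following crawler never reaches a page that is generated by submitting a query form.

```python
# Minimal sketch of surface-Web crawling (hypothetical, illustrative only):
# a crawler discovers pages by following hyperlinks, so content that sits
# behind a search form (a database) is never reached.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

def crawl(site, start):
    """Breadth-first crawl over hyperlinks only."""
    seen, frontier = set(), [start]
    while frontier:
        url = frontier.pop(0)
        if url in seen or url not in site:
            continue
        seen.add(url)
        parser = LinkExtractor()
        parser.feed(site[url])
        frontier.extend(parser.links)
    return seen

# Hypothetical site: the flight-schedule results page exists only as the
# output of a form submission, so no static hyperlink points to it.
site = {
    "/": '<a href="/about">About</a> <form action="/search"></form>',
    "/about": '<a href="/">Home</a>',
    "/search?dest=JFK": "Flight results...",  # Deep Web: unreachable by links
}
indexed = crawl(site, "/")
# indexed contains "/" and "/about"; the query-generated page stays hidden
```

The crawler indexes the two link-connected pages but never sees the form-backed results page, which is exactly the gap the article describes.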
Google's Deep Web strategy involves using programs to analyze the contents of every database the search engine encounters: it examines the results each database returns and builds a predictive model of that database's contents. Meanwhile, University of Utah professor Juliana Freire's DeepPeep project is attempting to index every publicly available Web database. Freire has developed an automated process for querying databases that she says retrieves more than 90 percent of a database's contents. Experts say that Deep Web search technology could become a more efficient and less expensive alternative to the Semantic Web for interconnecting Web data. "The huge thing is the ability to connect disparate data sources," says computer scientist Mike Bergman.
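The automated-querying idea can be sketched in miniature. The actual methods used by Google and DeepPeep are not detailed in the article; the toy database, the `search` interface, and the probing strategy below are all hypothetical, showing only the general shape of the technique: issue many probe queries against a database's search interface and pool the distinct records that come back.

```python
# Hedged sketch of database probing (hypothetical names and data):
# send a set of probe queries to a search interface and pool the results
# to estimate how much of the hidden database has been retrieved.
import string

def probe(query_fn, probes):
    """Run each probe term through the search interface and
    collect the union of all records returned."""
    retrieved = set()
    for term in probes:
        retrieved.update(query_fn(term))
    return retrieved

# Toy "hidden" database: a prefix-search interface over product names.
records = {"anchor", "antenna", "binder", "cable", "camera", "diode"}

def search(prefix):
    return {r for r in records if r.startswith(prefix)}

# Probing with every single letter recovers this whole toy catalog;
# real systems must choose probe terms adaptively from earlier results.
covered = probe(search, string.ascii_lowercase)
coverage = len(covered) / len(records)
```

In practice the hard part, as the article notes, is that database structures and plausible search terms vary enormously, so a fixed probe set like the alphabet above would not generalize.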
From The New York Times