Search engines look for clues about how important a document or piece of information is for a given set of keywords. Often this means relying on which other pages link to it; this is the idea behind Google's famous PageRank algorithm.
Researchers have now developed subtler ways of measuring the influence and importance of documents and pages on the Web and in archives, by using the text stored in those documents. This approach doesn't rely on people adding pointers such as links and citations, and it could lead to better real-time search engines as well as recommendation systems that automatically gather information on a certain topic.
Software being developed at Princeton University takes an archive of documents and measures changes in language use between documents over time. The sample being analyzed could be a collection of scientific papers or a set of posts from certain blogs. The software analyzes the text in documents and then identifies the most significant words and phrases in particular categories—ones that appear often across many different documents. It then teases out the early appearances of those bits of language to pinpoint the documents that most likely contained ideas that influenced those in other documents. The algorithms can continue to run as items are added to a collection of documents over time.
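The general idea of crediting early uses of widely adopted language can be illustrated with a toy sketch. This is not the Princeton software's actual method, which the article does not detail; it is a simple counting heuristic under stated assumptions. The function name `influence_scores`, the `min_doc_fraction` threshold, the small stopword list, and the sample documents are all hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict

# Assumed stopword list; a real system would need a far better filter
# so that ordinary function words are not treated as "significant".
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text and return its alphabetic words, minus stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def influence_scores(docs, min_doc_fraction=0.5):
    """Score documents by how often their vocabulary is reused later.

    docs: list of (doc_id, timestamp, text) tuples.
    A term counts as significant if it appears in at least
    min_doc_fraction of all documents (an assumed cutoff).
    The earliest document using a significant term is credited
    with one point per later document that also uses it.
    """
    ordered = sorted(docs, key=lambda d: d[1])

    # Map each term to the ids of documents containing it, oldest first.
    term_docs = defaultdict(list)
    for doc_id, _, text in ordered:
        for term in set(tokenize(text)):
            term_docs[term].append(doc_id)

    threshold = min_doc_fraction * len(ordered)
    scores = {doc_id: 0 for doc_id, _, _ in ordered}
    for ids in term_docs.values():
        if len(ids) >= threshold:
            scores[ids[0]] += len(ids) - 1
    return scores


# Hypothetical sample archive: (id, year, text).
example_docs = [
    ("a", 2001, "topic models for text"),
    ("b", 2003, "topic models applied to blogs"),
    ("c", 2005, "blogs and topic models"),
    ("d", 2004, "graph search"),
]

if __name__ == "__main__":
    for doc_id, score in influence_scores(example_docs).items():
        print(doc_id, score)
```

Because the scores are computed from term counts alone, rerunning the function on a larger list is enough to account for newly added documents. A heuristic this crude rewards whichever document happens to be oldest, so it only gestures at the approach described above.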
From Technology Review