Several items have come across my desk (?) in the last few days which together point to the growing importance of techniques for programmatically promoting structured data and metadata from unstructured documents, including web pages.
A while ago, The Economist had a piece about Autonomy, the search software company.
Yet the most important reason for Autonomy’s success is that it is riding a long-term trend in computing. Corporate computing was once all about “structured data” such as payroll records or sales figures. Now, however, computers are also able to crunch “unstructured” data, such as documents, e-mails and photographs. And the quantity of such data has exploded: more than 80% of a typical company’s information is now unstructured, according to some estimates. Firms that can extract meaning from this digital pile will have a big advantage. [Software in the recession | Out on its own | The Economist]
Here is how the article describes what Autonomy does: “Using complex algorithms, they are able to extract ideas from all kinds of data, be it text, audio or video – even if these ideas are expressed in different terms.”
Several of my colleagues have looked at the OpenCalais web service from Thomson Reuters. Here is the blurb from their website:
The Calais Web Service automatically creates rich semantic metadata for the content you submit – in well under a second. Using natural language processing, machine learning and other methods, Calais analyzes your document and finds the entities within it. But, Calais goes well beyond classic entity identification and returns the facts and events hidden within your text as well. [How Does Calais Work? | OpenCalais]
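The workflow described above, submit a document and get entities and facts back, is easy to picture in code. Here is a minimal sketch of calling such a service from Python; the endpoint URL, authentication header, and response shape are illustrative assumptions rather than the actual Calais interface, so consult the OpenCalais documentation for the real details.

```python
import requests  # third-party HTTP library: pip install requests

# Hypothetical endpoint and credentials -- placeholders only, not the
# real OpenCalais interface; see the OpenCalais docs for actual values.
API_URL = "https://api.example.com/entity-extraction"
API_KEY = "YOUR_API_KEY"

def extract_entities(text):
    """Submit raw text to an entity-extraction web service and return
    the entities it finds, as a list of (type, name) pairs."""
    response = requests.post(
        API_URL,
        headers={
            "x-api-key": API_KEY,       # assumed auth header name
            "Content-Type": "text/plain",
        },
        data=text.encode("utf-8"),
    )
    response.raise_for_status()
    result = response.json()
    # Assumed response shape: {"entities": [{"type": ..., "name": ...}]}
    return [(e["type"], e["name"]) for e in result.get("entities", [])]

if __name__ == "__main__":
    sample = "Thomson Reuters acquired ClearForest, the company behind Calais."
    for entity_type, name in extract_entities(sample):
        print(entity_type, "->", name)
```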
I got a note from Krista Thomas of Thomson Reuters about Media Cloud, an initiative of the Berkman Center at Harvard which uses Calais. This is an adventurous project which aims to provide tools to track trends in the media …
For each story from a given news source, the system automatically assigns relevant terms to that article. These terms, and the stories they describe, are then explored in relation to the rest of the interconnected network of media sources. Sources may cluster together around specific topics, or diverge.
In its most ambitious incarnation, Media Cloud might ultimately identify new memes as they emerge in the media ecosystem. By combining massive data collection with novel clustering techniques, we may be able to identify thousands of instances of new ideas emerging from one corner of the media ecosystem and spreading to other parts – or failing to spread. [Media Cloud » About]
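The clustering idea in that passage can be made a little more concrete. The following sketch, which is not Media Cloud’s actual pipeline, shows one standard approach: represent each story as a TF-IDF vector over its terms and group stories with similar term profiles using k-means.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy corpus standing in for stories harvested from different sources.
stories = [
    "Central bank raises interest rates amid inflation fears",
    "Inflation and interest rates dominate the economic outlook",
    "New vaccine trial shows promising results in early tests",
    "Researchers report encouraging vaccine trial data",
]

# Represent each story by TF-IDF weights over its terms.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(stories)

# Group stories with similar term profiles; k=2 suits the toy data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

for label, story in sorted(zip(labels, stories)):
    print(label, story)
```

At Media Cloud’s scale the corpus would be many thousands of stories rather than four, and the interesting analysis is how cluster membership shifts across sources and over time.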
Finally, a colleague circulated a note about Microsoft Research’s EntityCube initiative …
The need for collecting and understanding Web information about a real-world entity (such as a person or a product) currently is fulfilled manually through search engines. But the information about a single entity might appear in thousands of Web pages. Even if a search engine could find all the relevant Web pages about an entity, the user would need to sift through all the pages to get a complete view of the entity. EntityCube is an entity search and summarization system that efficiently generates summaries of Web entities from billions of crawled Web pages. The summarized information is used to build an object-level search engine about people, locations, and organizations and explore their relationships. [EntityCube – Microsoft Research]
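The details of EntityCube’s summarization are not spelled out in the blurb, but the basic aggregation step, collecting what many pages say about a single entity, can be suggested with a toy sketch. Here each “page” is reduced to the set of entities detected in it, and we count which other names co-occur with a target entity as a crude proxy for exploring its relationships; the data and function are illustrative only.

```python
from collections import Counter

# Toy "crawled pages", each reduced to the entities detected in it.
# In a real system these sets would come from an entity recognizer
# run over billions of crawled pages.
pages = [
    {"Marie Curie", "Pierre Curie", "Sorbonne"},
    {"Marie Curie", "Nobel Prize", "Sorbonne"},
    {"Marie Curie", "Pierre Curie", "Nobel Prize"},
    {"Albert Einstein", "Nobel Prize"},
]

def summarize_entity(target, pages):
    """Count how often other entities co-occur with the target entity
    across pages -- a crude proxy for an entity-relationship summary."""
    co_occurrences = Counter()
    for entities in pages:
        if target in entities:
            co_occurrences.update(entities - {target})
    return co_occurrences.most_common()

for related, count in summarize_entity("Marie Curie", pages):
    print(related, count)
```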
These are all very different, but they are symptomatic of a major interest in extracting meaningful content from masses of unstructured data through various programmatic techniques.
This is something of interest to all organizations — including universities. In this context, I was struck by this comment in the Taiga 2009 provocative statements:
4. … knowledge management will be identified as a critical need on campus and will be defined much more broadly than libraries have defined it. The front door for all information inquiries will be at the university level. Libraries will have a small information service role. [Taiga. Provocative Statements (after the meeting), February 20, 2009 – PDF]
Now, this is intended to be provocative ;-), but it is worth thinking about what sorts of techniques will be useful to manage campus information and where the library should be thinking about developing capacities. And indeed, libraries themselves are also managing more digital materials – eprints, web pages, digitized materials, … – where the current metadata creation model does not scale, and where programmatic promotion of metadata, entity identification, and so on become more important.