[I spoke at the LITA Top Technology Trends session in Dallas. I had a trend in reserve – big data – but did not use it. Here is something along the lines of what I might have said …]
Big Data is a big trend but, as with other expressions for newly forming areas, it may evoke different things for different people.
A few years ago, academic libraries might have thought of scientific or biomedical data when they heard the expression ‘big data’. In particular, the publication of The Fourth Paradigm: Data-Intensive Scientific Discovery helped crystallise awareness of developments in scientific practice.
More recently, however, big data has become a much more general term across various domains. Indeed, it is now common to read about big data in the general business press, and one comes across it in government, medicine and education. For example, a recent article in Inside Higher Ed talks about ‘big data’ and ‘predictive analytics’ in relation to course data and student retention. There are two interesting aspects here: one, the data itself; and two, the management environment …
The rise of webscale services which handle large numbers of users, transactions and data has made the management of big data a more visible issue. At the same time, as more material is digital, as more business processes are automated, and as more activities shed usage data, organizations are having to cope with a greater volume and variety of relatively unstructured data. Analytics, the extraction of intelligence from usage data, has become a major activity. Here is a helpful characterization by Edd Dumbill on O’Reilly Radar.
As a catch-all term, “big data” can be pretty nebulous, in the same way that the term “cloud” covers diverse technologies. Input data to big data systems could be chatter from social networks, web server logs, traffic flow sensors, satellite imagery, broadcast audio streams, banking transactions, MP3s of rock music, the content of web pages, scans of government documents, GPS trails, telemetry from automobiles, financial market data, the list goes on. Are these all really the same thing? [What is big data?]
In a brief discussion of big data as a possible trend on Facebook, Leslie Johnston provided an interesting perspective on these issues from the Library of Congress.
Our collections are not just discovered by people and looked at, they are discovered by processes and analyzed using increasingly sophisticated tools in the hands of individual researchers, using just laptops. And we not only have TB/PB of digital collections, we will have billions of items, so fully manual processing/cataloging is rapidly becoming a thing of the past.
Leslie expanded on some of the actual data …
- 5 million newspaper pages, images with OCR, available via API, used in NSF digging into data project for data mining, combined with other collections used in new visualizations, and in an image analysis project.
- 5 billion files of all types in a single institutional web archive – researchers do not search for and view individual archived sites, they analyze sites over time, and characterize entire corpuses, such as campaign web sites over 10 years.
- Extreme example: over 50 billion tweets: many research requests received to do linguistic analysis, graph analysis, track geographic spread of news stories, etc.
- Collection of 100s of thousands of electronic journal articles, which require article-level discovery: they don’t all come with metadata and no one can afford to create it manually.
The remark about manual creation of metadata is one example where current processing methods do not scale. Leslie also notes:
And we cannot do manual individual file QA for mass digitization or catalog web archives or tweets without automated extraction. And when we start talking about video and audio, it all requires automated extraction or processing. I know of one request that we process a video to produce an audio-only track so that a transcript could then be automatically generated. LC has 20 PB of video and audio. Can you imagine what it would take to provide that level of service? Researchers started asking a few years ago to get files so they could do it themselves.
The Library of Congress may be a special case, but other organizations are facing similar issues. We are familiar with discussions about research data curation in university settings. Referring to the university challenge, Leslie then points to another interesting example.
I hear this from research libraries, but also from archives, especially state archives that are mandated to take in all state records, physical and electronic. Email archives are already Big Data for a lot of state archives.
Indeed, national or state institutions with responsibility for public records are reconfiguring organizations and systems to manage large volumes of e-records. My colleague Jackie Dooley pointed me at the recent Presidential Mandate on Managing Government Records, which has implications for agencies and NARA.
In this context, it is not surprising that we are seeing a growing interest in data mining across domains (Leslie mentions the ‘Digging into Data’ challenge). The term ‘data scientist’ is cropping up in job ads and position titles. A couple of years ago, Hal Varian’s comments on the importance of data and the skills required to analyse it were widely noticed.
The ability to take data – to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it – that’s going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data. So the complementary scarce factor is the ability to understand that data and extract value from it. [Hal Varian on how the web challenges managers – reg required]
It is clear from this discussion that existing systems are not well suited to managing and analysing these types of data, and this introduces the second topic, the management environment. Indeed, for Dumbill, this is the defining characteristic of big data:
Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.
And alternative ways have been emerging, assisted by the webscale companies which had to face these challenges early on. Google introduced MapReduce, described by Edd Dumbill as follows:
The important innovation of MapReduce is the ability to take a query over a dataset, divide it, and run it in parallel over multiple nodes. Distributing the computation solves the issue of data too large to fit onto a single machine. Combine this technique with commodity Linux servers and you have a cost-effective alternative to massive computing arrays. [What is Apache Hadoop]
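To make the pattern concrete, here is a minimal sketch in plain Python – not Hadoop itself, and the function names are mine – showing a word count split into a ‘map’ step run in parallel over chunks of text and a ‘reduce’ step that merges the partial results.

```python
# A minimal illustration of the map/reduce pattern in plain Python (not Hadoop):
# the "map" step counts words in each chunk in parallel worker processes,
# and the "reduce" step merges the partial counts into a single total.
from collections import Counter
from multiprocessing import Pool

def map_chunk(lines):
    """Map step: count words in one chunk of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def reduce_counts(partials):
    """Reduce step: merge the partial counts from every chunk."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    corpus = [
        ["big data is a big trend"],
        ["hadoop processes big data in parallel"],
    ]
    with Pool() as pool:
        partial_counts = pool.map(map_chunk, corpus)  # map step, run in parallel
    print(reduce_counts(partial_counts).most_common(3))
```

Hadoop applies the same division of labour at much larger scale, distributing the map and reduce steps across a cluster of commodity machines.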
MapReduce is a central part of Hadoop, whose development was supported by Yahoo, and whose further development is now supported within the Apache Software Foundation.
Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure. By large, we mean from 10-100 gigabytes and above. How is this different from what went before?
Existing enterprise data warehouses and relational databases excel at processing structured data and can store massive amounts of data, though at a cost: This requirement for structure restricts the kinds of data that can be processed, and it imposes an inertia that makes data warehouses unsuited for agile exploration of massive heterogenous data. The amount of effort required to warehouse data often means that valuable data sources in organizations are never mined. This is where Hadoop can make a big difference. [What is Apache Hadoop]
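Part of what makes Hadoop suited to relatively unstructured material is that, via Hadoop Streaming, a job can be expressed as a pair of small scripts that simply read standard input and write standard output, with no schema designed in advance. A rough sketch, again in Python and with illustrative file names, might look like this:

```python
# mapper.py -- reads raw, unstructured text (e.g. OCR'd newspaper pages or log
# lines) from stdin and emits "word<TAB>1" pairs; no prior data modelling needed.
import sys

for line in sys.stdin:
    for word in line.strip().lower().split():
        print(f"{word}\t1")
```

```python
# reducer.py -- Hadoop Streaming delivers the mapper output sorted by key, so
# counts for the same word arrive together and can be summed in a single pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")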
The availability of the Hadoop family of technologies (again, nicely described by Dumbill) and cheap commodity hardware has made processing of large amounts of data more accessible. Cloud options are also emerging, from Amazon, Microsoft and others. Uptake has been rapid.
So, while Hadoop and related technologies have emerged in the context of the Big Data requirements of webscale companies, they are becoming more widely deployed. Their scalability, coupled with lower cost, has made them an attractive option across a range of data processing tasks. They may be used with ‘big data’ and with not-so-big data.
In this way, my big data trend may more realistically be two trends. We are indeed having to process a greater volume and variety of data. The description of data management at the Library of Congress provides some nice examples. Several technologies, notably the Hadoop framework, have emerged as a result of such challenges. However, these are now also finding broader adoption as they reduce costs and provide greater flexibility.
Coda: In OCLC Research we have been using MapReduce for several years and more recently have been using Hadoop. We have also been working with colleagues elsewhere in OCLC as we look at where and how Hadoop might provide benefits.