I have posted a couple of times recently about intentional data: data that records choices and behaviors. I mentioned holdings data, ILL records, circulation records, and database usage records. One could extend this list to any data which records an interaction or choice. We are used to looking at transaction logs of various sorts, and new forms of data are emerging, for example in the form of questions asked in virtual reference. What types of intelligence could be mined by comparing the subject profile of virtual reference questions with the subject profile of collections? Would it expose gaps in the collection, for example?
In that context I was interested to read a post on the Gordian Knot pointing to some work by David Pattern at the University of Huddersfield which shows a ‘people who borrowed this also borrowed …’ feature. And it does look like a good enhancement. (It does not seem to be available on the ‘publicly visible’ catalogue.)
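To illustrate the idea, here is a minimal sketch of how such a feature might be computed. This is not Pattern's actual implementation; it assumes circulation data has been reduced to simple (borrower, item) pairs, and all names and data are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical input: circulation records reduced to (borrower_id, item_id) pairs.
loans = [
    ("b1", "itemA"), ("b1", "itemB"),
    ("b2", "itemA"), ("b2", "itemB"), ("b2", "itemC"),
    ("b3", "itemA"), ("b3", "itemC"),
]

# Collect the set of items each borrower has taken out.
items_by_borrower = defaultdict(set)
for borrower, item in loans:
    items_by_borrower[borrower].add(item)

# For each item, count how often every other item appears in the same
# borrower's history.
co_borrowed = defaultdict(Counter)
for items in items_by_borrower.values():
    for item in items:
        for other in items:
            if other != item:
                co_borrowed[item][other] += 1

def also_borrowed(item_id, n=5):
    """'People who borrowed this also borrowed ...': the n most co-borrowed items."""
    return co_borrowed[item_id].most_common(n)

print(also_borrowed("itemA"))  # e.g. [('itemB', 2), ('itemC', 2)]
```

In practice one would want to weight or filter such counts (required course texts, for example, are co-borrowed by everybody in a class), but the basic co-occurrence computation is as simple as this.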
Circulation is interesting in this context. We run into a long tail sort of a thing. Amazon is the primary exemplar of this type of ‘recommender’ service. Amazon aggregates supply (it has a very big database of potential hits in the context of any query, increasing the chances that a person will find something of interest), and it aggregates demand (it is a major gravitational hub on the network, so it assembles lots of eyeballs, increasing the chances that any one book will be found by an interested person). The result of this – the aggregation of supply and the aggregation of demand – is that use is driven down the long tail. More materials are aggregated, and more of them find an audience.
Now, we know that, typically, the smaller part of a library collection circulates (maybe less than 20% in a research library). We also know that, typically, interlibrary lending traffic is very, very much smaller than circulation.
What does this suggest? Well, the former suggests that we have an excess of supply over demand in any library, and we have indeed built ‘just in case’ collections. However, aggregating demand should make those collections more used, and this appears to be the case in services like OhioLINK, for example, which have aggregated demand for institutional collections at the state-wide level, increasing the chances that an item will be found by an interested reader. The latter suggests that we have not aggregated supply across libraries in a systemwide way very efficiently, as library users do not very often go beyond their local collection. There are various reasons for this, including library policy on what is made available, but in general one might say that the transaction costs of discovering, locating, requesting, and having resources delivered are high enough to inhibit use. Again, this suggests that we have not aggregated supply as effectively as we might in systemwide situations (this was the focus of another post).
Coming back to recommendations based on circulation, two things occur to me:
- One might imagine a complement to a circulation-based recommender service: one which recommends other books in the collection which have not circulated, or have not circulated as much. In other words, a service which ties circulating books to non-circulating ones (a sketch follows after this list). And we know about various ‘books like this’ measures: by subject, by author, by series. In fact, catalogs were originally designed to make these types of connections. However, there is other data which shares the ‘intentional’ element which makes circulation interesting, and which represents aggregate choices: things that have appeared on the same reading list, that have been recommended by the same faculty member, and, importantly, things that cite or are cited by the selected item. Now, in some of these cases the resulting benefits may not be worth the effort of collecting and manipulating the data; we do not know. In others, citation for example, there clearly are benefits.
- For many of these examples, it may be difficult for a library to generate the data and build services on top of it without better support – in their systems or in services available to them. Furthermore, in many cases the results may be improved by aggregating data across libraries, or across other service environments. The Gordian Knot suggests there may be scope, for example, for services based on aggregated circulation data. (This is not to ignore the real policy questions surrounding the sharing of circulation data. Of course, there are also technical issues of exporting and exchanging data in common ways.) Amazon has introduced very useful services based on citation and also associates books based on shared distinctive word patterns. One could imagine those connections being leveraged in a catalog, and Amazon is well placed to do this based on the volume of data it has. In fact, one of the benefits of the mass digitization projects currently under way would be to allow more of that type of connection to be made. Clearly, services based on holdings data depend on aggregations. In WorldCat-based services, OCLC ranks results by volume of holdings, the most widely held first. And there has been interest from time to time from libraries and others in having access to holdings counts to allow them to rank results in their own environments by this measure, on the assumption that the more widely held an item is, the more likely it is to meet a need (a sketch of such ranking also follows below). We do not offer a service like this at the moment, but you can imagine one. We are also experimenting with generating audience levels based on the pattern of holdings (something that lots of high schools hold is likely to be different to something that only a few research libraries hold). And we are seeing growing interest in the sharing of database usage data, based on the pooling of COUNTER-compliant data. One reason that aggregation is potentially beneficial is that it addresses the demand-side issue discussed above: by aggregating data one may make connections that do not get made in the data generated by a smaller group of users.
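To make the first idea above concrete, here is a minimal sketch of a complement recommender: given a well-used book, it suggests subject-related books that have circulated little or not at all. The record structure, subject headings, and circulation threshold are all invented for illustration; a real implementation would draw on the catalog and the circulation system.

```python
# Hypothetical catalog records: item id -> (set of subject headings, circulation count).
catalog = {
    "itemA": ({"Economics", "Networks"}, 42),
    "itemB": ({"Economics", "History"}, 0),
    "itemC": ({"Networks", "Statistics"}, 1),
    "itemD": ({"Poetry"}, 0),
}

def quiet_neighbors(item_id, max_circ=1):
    """Suggest little-circulated items sharing at least one subject heading
    with the given (presumably well-used) item, ranked by heading overlap."""
    subjects, _ = catalog[item_id]
    candidates = []
    for other, (other_subjects, circ) in catalog.items():
        if other == item_id or circ > max_circ:
            continue
        overlap = len(subjects & other_subjects)
        if overlap:
            candidates.append((overlap, other))
    # Stable sort: highest overlap first; ties keep catalog order.
    return [item for overlap, item in sorted(candidates, key=lambda c: c[0], reverse=True)]

print(quiet_neighbors("itemA"))  # ['itemB', 'itemC']
```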
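And a small sketch of ranking by holdings counts, as described above. The holdings lookup here is an assumption made for illustration; as noted, no such service is currently offered, so this only shows the principle.

```python
# Hypothetical holdings counts aggregated across libraries (a union-catalog view).
holdings = {"itemA": 1200, "itemB": 35, "itemC": 310}

def rank_by_holdings(result_ids):
    """Order a result set most-widely-held first, on the assumption that wide
    holding is a rough proxy for likely usefulness."""
    return sorted(result_ids, key=lambda i: holdings.get(i, 0), reverse=True)

print(rank_by_holdings(["itemB", "itemC", "itemA"]))  # ['itemA', 'itemC', 'itemB']
```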
Extending the ways in which users can discover materials puts additional emphasis on the need to improve our systemwide apparatus for delivering those materials.
Making data work harder is an integral part of the Web 2.0 discussions, and we certainly have a lot of data to do things with!