Research ∕ Learning

Repository interoperability

Lorcan Dempsey 4 min read

JISC has just made available a report by Nicky Ferguson and colleagues about the consistency of metadata and policies across repositories.

In the UK, a large number of Institutional Repositories have been set up very recently. Often, it seems, they lack sufficient clarity of policy and purpose. In interviews with depositors and after conducting a case study of an Institutional Repository, we find different perceptions of the role of the repository, some seeing it mainly as an administrative tool for collecting and collating research at the institution and others believing it is a tool for sharing research and creating open access to the results of that research. If such perceptions are combined with weakly defined policies and/or unclear implementation procedures, then it would be unsurprising to find inconsistencies both within and between repositories. In fact, our respondents tell us that such inconsistency is widespread and are pessimistic that this will change, except where sufficient resources, shared objectives and strong relationships are in place. [Feasibility study into approaches to improve the consistency with which repositories share material]

The report makes detailed recommendations to JISC and to repository managers. (Note: I was among those interviewed by the authors.) I don’t propose to summarise their findings here, but to note several things that occurred to me while reading through it:
Interoperability and SEO. We applaud interoperability. Even, or so it sometimes seems, as an end in itself, independent of any motivating services or applications which will benefit. Yet we have a somewhat limited interpretation of interoperability. We tend to think of it as something that happens within our domain, between like things: moving metadata or content between repository applications, say, or searching across multiple databases. However, it is increasingly important to think about interoperability between our services and other people’s services, notably the general web environment of search engines, RSS aggregators, and so on. The report touches on this. Sitemaps and RSS feeds matter more and more for disclosing repository content into user environments. So does SEO (search engine optimization): given the importance of search engines as entry points, repository and data managers need to think about how their services and data work in a world where being crawled and ranked well counts for so much. I sometimes come across a view that SEO is to be frowned on, that it involves deception or nasty commercial practices. Maybe if we were to invent a new phrase, SEI or Search Engine Interoperability, people would be happier with it 😉
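Disclosure of this kind can be quite mechanical. As a small illustration (not from the report), here is a sketch of generating a sitemap for a repository’s records so that crawlers can find them; the record URLs and dates are invented:

```python
# Minimal sketch: exposing a repository's records to search engine
# crawlers via a sitemap. The URLs and dates below are hypothetical.
from xml.etree import ElementTree as ET

def build_sitemap(records):
    """records: iterable of (url, lastmod) pairs -> sitemap XML string."""
    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=ns)
    for url, lastmod in records:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")

sitemap = build_sitemap([
    ("https://repository.example.ac.uk/record/1234", "2008-02-01"),
])
print(sitemap)
```

A file like this, referenced from robots.txt, is one of the simplest ways a repository can interoperate with the general web environment rather than only with other repositories.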
Names. As search and citation in electronic environments become more important it is interesting to see the focus on consistent names, or author identifiers, emerge more strongly. Authors, institutions, publishers and others have an interest here, as inconsistent naming reduces impact. The authors note various initiatives, including some of our work. National libraries have so far not broadened the scope of their authority work beyond the ‘cataloged’ collection, something that I discuss here. We are now in a phase where multiple organizations are looking at academic author identification and profiling.
Variable metadata. Metadata created in different regimes, with different rules or guidelines, with different editors, with variable access to controlled vocabularies and authority data, will be inconsistent. There should be no surprise here. And working effectively with this metadata will have a cost (what I have called a ‘stitching cost’). The authors discuss this and point to other confirming work, including a JCDL paper by Lagoze et al. There, the authors note their initial expectation that WorldCat and shared cataloging was a potential model for a service based on metadata aggregation. Not surprisingly, they were disappointed:

In fact, reality fell far short of our expectations. We discovered that the WorldCat paradigm, which works so well in the shared professional culture of the library, is less effective in a context of widely varied commitment and expertise. Few collections were willing or able to allocate sufficient human resources to provide quality metadata. [Metadata aggregation and “automated digital libraries”: A retrospective on the NSDL experience PDF]

However, what is ‘quality’ in this context? What is distinctive about the shared cataloging environment? There are mature editors developed over many years, which encapsulate many discussions and agreements about cataloging policy and practice. There is a community of practice institutionalized in various professional venues. There are detailed rules and rule interpretations. There are shared authority and terminology files. There is extensive sharing of metadata between systems, which encourages uniformity. And there are criteria for addition to WorldCat or other union catalogues. There is a whole apparatus here devoted to ensuring consistency of metadata, and it is enforced through a shared infrastructure and shared professional practice, as well as through codes and rules. This is a very different model from one in which data is simply aggregated from multiple repositories, even if they are nominally using the same standards.
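The ‘stitching cost’ of aggregation is easy to see in miniature. A small sketch, with invented records: three repositories all expose a date field, but in three different conventions, and an aggregator must normalise them before the records can be usefully searched or sorted together:

```python
# Sketch of the 'stitching' an aggregator must do: three repositories all
# expose nominally the same fields, but dates and creator names follow
# different conventions. All records here are invented for illustration.
from datetime import datetime

RAW_RECORDS = [
    {"creator": "Smith, Jane", "date": "2007-11-05"},
    {"creator": "Jane Smith",  "date": "05/11/2007"},
    {"creator": "SMITH, J.",   "date": "November 2007"},
]

# Date conventions we happen to know about; anything else stays raw.
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%B %Y"]

def normalise_date(value):
    """Try each known format; keep the raw value if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m")
        except ValueError:
            continue
    return value  # unparseable: the residue that cannot be stitched away

normalised = [normalise_date(r["date"]) for r in RAW_RECORDS]
print(normalised)
```

Dates are the easy case. Names are harder: ‘Smith, Jane’, ‘Jane Smith’ and ‘SMITH, J.’ may or may not be the same person, and no local string-munging can settle that, which is exactly where shared authority files and author identifiers come in.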
This is one reason why I think that shared network level metadata creation tools are an interesting area for exploration, particularly for organizations with the national remit of JISC. They may not benefit from the same sharing of creation effort as happens in the shared cataloging environment – because the resources they are describing may be unique – but there are other aspects of the shared environment which would be beneficial.
If people use the same editing environment, with access to a set of vocabularies, and participate in a shared community of interpretation, then the chances for consistency may go up. See Calames for what I understand to be an example of this.
Of course, this note touches on cases where metadata is being created manually. Where and when such creation is justified or affordable is another question, and this is something that the report also dwells on.