SWORDs and Citations

The Researcher Dashboard has been expanded to interface with ePrints, the Lincoln repository. From it, a researcher can deposit their datasets directly into the repository, complete with a DOI.

In previous posts, I spoke about how the CKAN and ePrints APIs can interface. We have finally implemented both APIs for use with the Researcher Dashboard and created a useable workflow for depositing datasets from CKAN to ePrints via the dashboard.

The workflow goes as follows:

  1. Hit ‘Publish’
  2. Get latest metadata from CKAN
  3. Prompt user to complete form
  4. Generate DOI
  5. Send metadata to DataCite
  6. Mint DOI
  7. Post SWORD2 to ePrints
  8. Get ePrints ID from response
  9. Add ePrint to SQL database as minimal data
  10. Update dataset in database with ePrint link
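
The ten steps above could be sketched roughly as follows. This only illustrates the ordering and is not the Dashboard's actual code: every function name, the `DepositServices` container and the DOI pattern (prefix plus dataset ID) are assumptions made for the example. The calls to CKAN, DataCite, ePrints and the database are passed in as plain callables so the sketch runs without any network access.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class DepositServices:
    """Hypothetical wrappers around CKAN, DataCite, ePrints and the local database."""
    fetch_ckan_metadata: Callable[[str], Dict[str, str]]
    complete_form: Callable[[Dict[str, str]], Dict[str, str]]
    send_datacite_metadata: Callable[[str, Dict[str, str]], None]
    mint_doi: Callable[[str], None]
    sword_deposit: Callable[[Dict[str, str]], str]  # returns the ePrints ID
    save_eprint: Callable[[str, str], None]
    link_dataset: Callable[[str, str], None]


def generate_doi(prefix: str, dataset_id: str) -> str:
    """Build a DOI string locally; this prefix/suffix scheme is an assumption."""
    return f"{prefix}/{dataset_id}"


def publish_dataset(dataset_id: str, services: DepositServices, doi_prefix: str) -> str:
    """Run steps 2 to 10 in order and return the ePrints ID."""
    metadata = services.fetch_ckan_metadata(dataset_id)   # step 2
    metadata = services.complete_form(metadata)           # step 3
    doi = generate_doi(doi_prefix, dataset_id)            # step 4
    metadata = {**metadata, "doi": doi}                   # the DOI travels with the metadata
    services.send_datacite_metadata(doi, metadata)        # step 5
    services.mint_doi(doi)                                # step 6
    eprint_id = services.sword_deposit(metadata)          # steps 7 and 8
    services.save_eprint(eprint_id, metadata["title"])    # step 9
    services.link_dataset(dataset_id, eprint_id)          # step 10
    return eprint_id
```

Keeping the external services behind callables like this is one way to test the ordering (generate the DOI, register it, then deposit) without touching the real systems.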

When a researcher views their project, they are presented with a list of datasets lifted from the project environment in CKAN. If they want to deposit one into ePrints, they can select the deposit button and are then prompted to complete the dataset metadata. A dataset can be placed in a user’s inbox in ePrints with merely a title, but ePrints requires a specific minimal set of metadata before the dataset can actually be deposited.

The DOI gives the dataset a unique identifier. It is minted by sending the metadata to DataCite along with the generated DOI, so a DOI has to be generated before it can be minted. The DOI is another field that is passed to ePrints in the metadata.

Including the ePrints metadata gives the Researcher Dashboard an all-in-one approach. Otherwise, users would have to go into ePrints and fill in the data there, an annoyance easily avoided by having all the necessary steps taken care of on one site. This completes the toolset, so projects now have a central hub of activity. Data is brought into Orbital via the AMS (Awards Management System) for funded projects and via CKAN for datasets, and is exported to ePrints for deposit in the Lincoln repository.

The original plan for this workflow was published by Paul Stainthorp. The workflow as it currently stands is the one described in this post. It is, however, still in the finishing stages, and we are polishing it to make sure the process is solid.

Open Data Protocols

For those of you playing on the technical side of research data, did you know that Open Data Protocols has done a load of work on standardised ways of storing data and metadata? The standards at Open Data Protocols (and its sister specification on Open Catalogs) are the ones that our future work on long-term preservation of data packages will be informed by (and are involved quite a bit in future work on CKAN). If you’re bundling data up at the end of a project, why not take a look at them?

CKAN for RDM workshop

On the 18th February, we ran a workshop in London which focused on the use of CKAN for research data management. The Orbital project made the decision to use CKAN last summer and was soon followed by Bristol’s data.bris project, which is using CKAN for its discovery catalogue. Simon Price from Bristol gave a very interesting presentation of their work with CKAN, which you can read about on their project blog.

The #CKAN4RDM workshop was fully booked with 40 delegates attending, many more than we originally anticipated. It was facilitated by Simon Hodson, the Programme Manager of JISC’s Managing Research Data programme. Following presentations from Lincoln and Bristol on our respective uses of CKAN (ours was a live demo of ‘Orbital Bridge’), we spent the latter part of the morning undertaking a requirements gathering exercise, where tables of around 8-10 people acted as different users, providing ‘stories’ (requirements) for a research data management system. The exercise was introduced in the following few slides.

This was a useful exercise regardless of the software used. After collating all 70+ stories over lunch, we returned to our user groups, and each table worked with a CKAN expert from the Open Knowledge Foundation to discuss the existing constraints for each requirement and to start developing a gap analysis that identifies the work to be done. The output of this work can be viewed on Google Docs.

Types of users
The 'researcher' user group

There was quite a positive buzz about the day and general feedback suggested that delegates got a lot out of the event. You can read write-ups from the DCC, LSE and the Datapool project at Southampton.

One of the original purposes of the workshop was research for a conference paper that I (Joss) am giving at the IASSIST conference in Cologne, in May. The abstract I submitted to the conference was as follows:

This paper offers a full and critical evaluation of the open source CKAN software <http://ckan.org> for use as a Research Data Management (RDM) tool within a university environment. It presents a case study of CKAN’s implementation and use at the University of Lincoln, UK, and highlights its strengths and current weaknesses as an institutional Research Data Management tool. The author draws on his prior experience of implementing a mixed media Digital Asset Management system (DAM), Institutional Repository (IR) and institutional Web Content Management System (CMS), to offer an outline proposal for how CKAN can be used effectively for data analysis, storage and publishing in academia. This will be of interest to researchers, data librarians, and developers, who are responsible for the implementation of institutional RDM infrastructure. This paper is presented as part of the dissemination activities of the JISC-funded Orbital project <http://orbital.dev.lincoln.ac.uk>.

As well as using last week’s outputs of the CKAN4RDM workshop, I’ll also be working closely with OKF staff to ensure that the evaluation is as thorough, accurate and up-to-date as possible by the time of the conference. It will focus on version 2.0 of CKAN, which is due for release soon.

I’d also like to appeal to other JISC MRD projects to send me any existing requirements documents you have produced during the course of your project. I will use the anonymised data to enrich the requirements we gathered last week. If you have such documents, please email me.

Finally, we have set up a CKAN4RDM mailing list, which anyone is welcome to join to discuss the use of CKAN within academia. One thing is clear to me: the academic community cannot expect OKF and existing CKAN developers to meet all of our requirements for research data management. We need to contribute developer time and other resources and effort to the overall CKAN open source project, just as other public sector organisations are doing.

 

The Importance of Useful Data

During the development of Orbital (specifically the Researcher Dashboard) we’ve been trying (with mixed success) to make it integrate smoothly with various other University systems. Fortunately, a design decision made by some of the LNCD team a couple of years ago means that we’ve got our own institutional data store (codename Nucleus) with which we can almost exclusively interact to get hold of everything we need. Where we’ve been integrating with new systems such as the University’s Awards Management System, we’ve taken the approach of hooking the data into Nucleus first, so that it’s available not only to the Researcher Dashboard but also to any other system which needs it.

Nucleus has quite a powerful framework for managing, structuring and presenting data in a rigorously managed format. It validates things at various points during data entry to make sure that it’s not gibberish, and then at the point of rendering it’s put through another set of functions which ensure it’s presented consistently and in as useful a manner as possible. As a result (using Nucleus, our PHP library, the CWD and our OAuth 2 authorisation server) we can go from a standing start to a fully featured, integrated application in a couple of days. A big part of the reason we can do this is that we make extensive use of dogfooding to ensure that our data is useful.

It saddens me, therefore, that during integration with some other applications both inside and outside the University we are forced to tackle data, often purported to be “machine readable” or “ready for reuse”, which has clearly not been looked at by anybody who actually wants to reuse it. As an example, one source of data provides a date range which is stored internally (as far as I can gather) as two distinct values: there is a “start date” and there is an “end date”. These are provided through the UI as structured inputs (a date picker), which ensures they’re entered (and presumably then stored) in an expected format which can be manipulated as necessary. The API then chooses to express this date range not as a distinct “start date” and “end date”, but instead as a single “dates” value.

You may think that this isn’t such a big problem – after all, how difficult can it be to parse 04/02/2013 - 07/03/2014? In that example it’s actually pretty easy once you’ve decided if you’re using UK or US style dates. The ISO date format can solve this though, giving us 2013-02-04 - 2014-03-07. Sadly, this isn’t what we get. In fact, here are the four (yes, four) distinct ways that “dates” can be represented:

  • 2013-02-04 to 2013-02-04 becomes 4 Feb 2013
  • 2013-02-04 to 2014-03-07 becomes 4 Feb 2013 - 7 Mar 2014
  • 2013-02-04 to 2013-03-07 becomes 4 Feb - 7 Mar 2013
  • 2013-02-04 to 2013-02-07 becomes 4 Feb 2013 - 7 Feb 2013

So, the rule becomes: if the dates are the same you show a single date; if they are different you show two dates, unless they fall in the same year, in which case the year appears only on the final date, unless they also fall in the same month, in which case you are back to showing two full dates. And then you format all the dates with a locale-specific short form of the month name.
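
To make the rule concrete, here is a minimal sketch that reproduces the four outputs listed above. It is a reconstruction from those examples, not the upstream system's code. The function name `format_dates` is hypothetical, and English month abbreviations are assumed in place of whatever locale the real system uses.

```python
from datetime import date

# Assumption: English short month names stand in for the locale-specific ones.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def _full(day: date) -> str:
    """Render a date as day, short month name and year, e.g. '4 Feb 2013'."""
    return f"{day.day} {MONTHS[day.month - 1]} {day.year}"


def format_dates(start: date, end: date) -> str:
    """Reproduce the observed 'dates' string for a start/end pair."""
    if start == end:
        # Same day: a single date.
        return _full(start)
    if start.year == end.year and start.month != end.month:
        # Same year, different month: the year appears only on the final date.
        return f"{start.day} {MONTHS[start.month - 1]} - {_full(end)}"
    # Different years, or same year and same month: two full dates.
    return f"{_full(start)} - {_full(end)}"
```

For example, `format_dates(date(2013, 2, 4), date(2013, 3, 7))` returns `"4 Feb - 7 Mar 2013"`, matching the third case in the list.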

Parsing this is understandably more difficult than it should be. Please, think about how your data will actually be used when building outputs.
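
For a sense of what the consumer ends up writing, the following sketch turns a “dates” string back into a start and end date. It handles only the four shapes shown above and assumes English month abbreviations; the names `parse_dates` and `_parse_day` are made up for this example, and a real feed may well contain shapes that are not covered here.

```python
import re
from datetime import date
from typing import Optional, Tuple

# Assumption: English short month names; the real feed is locale-specific.
MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

# Day, short month name and an optional four-digit year.
_DAY = re.compile(r"^(\d{1,2}) ([A-Za-z]{3})(?: (\d{4}))?$")


def _parse_day(text: str, default_year: Optional[int] = None) -> date:
    """Parse '4 Feb 2013', or '4 Feb' when a default year is supplied."""
    match = _DAY.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised date: {text!r}")
    day, month_name, year = match.groups()
    if month_name not in MONTHS:
        raise ValueError(f"unrecognised month: {month_name!r}")
    if year is None and default_year is None:
        raise ValueError(f"no year available for: {text!r}")
    return date(int(year) if year else default_year, MONTHS[month_name], int(day))


def parse_dates(value: str) -> Tuple[date, date]:
    """Return (start, end) for any of the four observed 'dates' shapes."""
    parts = value.split(" - ")
    if len(parts) == 1:
        # A single date means start and end are the same day.
        single = _parse_day(parts[0])
        return single, single
    if len(parts) != 2:
        raise ValueError(f"unrecognised date range: {value!r}")
    # Parse the end first, because the start may borrow its year.
    end = _parse_day(parts[1])
    start = _parse_day(parts[0], default_year=end.year)
    return start, end
```

Had the API exposed separate ISO-formatted start and end values, all of this would reduce to two calls to `date.fromisoformat`.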

CKAN trending

Last summer, we adopted CKAN as our data store/repository/catalogue. At that time, I noted that much had happened in the CKAN project in the few months since the start of the Orbital project in November 2011 that made CKAN a more attractive proposition for managing research data.

Recently, someone on the CKAN mailing list pointed to the graph below, which shows that the interest in CKAN has exploded. In November 2011, interest in CKAN was at just a quarter of its current peak, which is double that of September 2012, when we made the switch to CKAN. Following the European Commission and the UK government, the recent decision by the US government to adopt CKAN for the next version of data.gov will only drive interest in and the development of CKAN even further.

It is an exciting time to be observing, and to be part of, this explosion of interest. However, it is worth remembering that the interest in CKAN and data management is still very small compared to interest in other, more generic, content management systems. Publishing structured open data remains a niche interest compared to other open practices on the web, such as blogging. Here’s the graph comparing CKAN to WordPress.

Perhaps a fairer comparison would be that of CKAN with open access repository software, such as ePrints and DSpace.

Of course, the cumulative interest in DSpace and ePrints over the years is greater than that in CKAN, but right now there is clearly more interest in CKAN and publishing open data than there is in open access repository software. The open access movement has matured, while the open data movement is growing rapidly. It will be interesting to follow these trends to measure (in part) the maturity of the open data movement, too.