Development roadmap

The first release of Orbital (v0.1) was on May 16th 2012. Orbital is a research and development project that has initial funding until March 2013. Given that timeframe, what follows is a high-level roadmap of where we expect development to go over the next few months.

We plan to release a new, working version of Orbital around the beginning of each month, up to December 2012, when we hope to hit v1.0. We plan to have a further point release in February, based on user feedback and then give ourselves a month to tidy things up and wind down this phase of the project.

You can read our implementation plan, check out our responses to user feedback and watch our project tracker for more details. The code can be downloaded here and here. Be aware that the roadmap may change at any time, based on user requirements, illness and natural disaster.

v0.1 (May)

See announcement. Authentication, user profiles, projects, upload of flat file data, a license picker, private and public archives, publish to the web, stable identifiers, download of flat files, analytics, and a way to give feedback.

v0.2 (June)

Documentation (for both developers and users), activity data, working/dynamic datasets + APIs, app access (OAuth), bug fixes, responses to feedback.

v0.3 (July)

User workspaces, version control, data staging (workflow), data snapshots, tools for basic analysis of datasets.

v0.4 (August)

EPrints integration (SWORD2), integration with Lincoln’s Award Management System (Worktribe), APIs for discovery, full role/permissions management, documentation.

v0.5 (September)

Plugins for third-party tools (e.g. Matlab)

v0.6/0.7/1.0 (Oct/Nov/Dec)

Network drive for desktop workspaces, full documentation for RDM, DataCite, ORCID, DMPOnline integration, improve design/branding, federated access.

Happy Christmas!

A Minimum Viable Product: Orbital v0.1

This is a post about our first release of Orbital.

About a month ago, Dr. Tom Duckett, Reader in The Department of Computing and Informatics approached the Orbital project because he urgently wanted to publish around 20GB of data for Long-term mobile robot operations. That afternoon, we gave Tom and Feras Dayoub, his Research Assistant, space on one of our servers and they uploaded a bunch of HTML pages and the zipped up data. We minted a proxy URL for them and advised them on an appropriate data license to choose.   We also set up Google Analytics, so they could see what interest in his data there was.

Job done. For the time being.

What Tom really wanted was to be able to email a link to his data to a robotics mailing list and tell an international community of likeminded researchers and manufacturers that the data was available to use. He says that long-term datasets for mobile robots are quite rare in his community, so there was a good chance people would be interested in them. He also wanted to be able to demonstrate his work when writing an EU bid. There will be a follow up blog post about what impact this has had on Tom’s research.

That afternoon got us thinking: What is the minimal set of functions that a researcher like Tom requires of a Research Data Management tool?

Tom wanted access (sign in) to a server (hosting) where he could upload his data (storage) and describe it so that other people could understand and download it (publish) under an appropriate license. The URL pointing to the data should be persistent, even if the data itself is migrated from one system to another. The impact (analytics) of the data should also be measurable.

Tom’s chance intervention in our project made us focus on Orbital v0.1 as the ‘minimum viable product‘ for researchers who need to publish open data. We thought his requirements were a great opportunity to release something early and start getting direct user feedback on our product. We decided to set a release date for Orbital v0.1 a month ahead and aim to deliver everything that Tom asked of us in this first release.

A Minimum Viable Product has just those features that allow the product to be deployed, and no more.

Today, we released Orbital v0.1 and it does everything described above. It’s an alpha release, but we’ve been testing it like crazy, we also had Feras test it and we’ve been pushing code through Jenkins since the beginning of the project so we know it passes our QA checks and we think it’s stable enough for use. From this point forward, Orbital and the URIs it mints will persist, too.

From today, a researcher at the University of Lincoln can sign in to Orbital, create and describe a project, upload their data to the project, choose a license for the data and add a Google Analytics code to measure project analytics (we’re also tracking each button click to better understand how people use Orbital). The data is published at a id.lincoln.ac.uk URI, which will persist indefinitely. At this stage, until we’ve got an approved business case for scaling it up and out to all academics, we’ll be limiting uploads on a case-by-case basis. You can view and request what other features we develop for Orbital on UserVoice, or in more detail on our project tracker. We’ve also written a basic development roadmap.

For developers, here are the basic technical details. You might also want to trawl through our implementation plan and the collected blog posts at the bottom of the plan.

Orbital is written in PHP using the CodeIgniter development framework.  It’s split into two main pieces of functionality. Orbital Core (database and APIs) is currently hosted on a Linux box on Rackspace’s cloud. Orbital Manager (the User Interface) is likewise hosted on Rackspace. A user signs in to Orbital Manager via OAuth 2.0 using their university credentials. Orbital Manager is using Twitter’s Bootstrap framework. The project metadata is stored in a MySQL database. Files are uploaded to Rackspace’s cloud files storage using Andrew Valums’s AJAX Uploader. APIs are exposed using Phil Sturgeon’s CodeIgniter REST server.

Orbital is licensed under the GNU Affero GPL 3 license and you can download, fork it and create pull requests on Github:

Orbital Core

Orbital Manager

New contributors to Orbital will be ritually applauded each weekday morning 🙂 Thanks.

Shared, versioned network drives

I’m at the DevCSI #mrdHackday with Nick, Harry and about 30 other people interested in hacking around research data. One of the user requirements identified among some MRD projects is the need for personal and shared networked workspaces i.e. a desktop drive for dumping, organising and sharing research data.

In our recent survey of researchers at Lincoln, we learned that many academics (myself included!) are using Dropbox as a way to share project files and research data among partners. It has the advantage over the FTP ‘H Drive’ that Lincoln staff are given in that Dropbox offers more storage and folders/files can be shared among people both inside and outside the university. The first couple of GB of storage is free and the pricing is clear when you need more space.

Just as researchers surveyed said they were using Dropbox, they also acknowledged in the survey that this isn’t an ideal situation. It’s being held by a third-party service, it’s runs unreliably on our university desktops, there’s a 30 day version history, but there’s no information about what changes were made and no way to compare versions. Part of the Orbital implementation plan is to provide an alternative to Dropbox and other similar network drives to Lincoln researchers. One that (probably) runs over HTTP, does version control properly, can be accessed through a web interface if necessary, and can be shared securely. The DataFlow project at Oxford has gone down the route of using WebDav for remote file storage and sharing and it’s an area we should investigate, too. There is a WebDav extension that provides versioning, too.

Of all the comments by researchers who responded to our survey, the clearest message which united them was for more, secure, backed up, and flexible storage. Within Orbital, we’ve been thinking about how Git (or a similar versioned source code repository tool) could be used to provide this functionality. Git is a proven and popular repository tool for managing text files, developed for the Linux kernel project and the basis for the popular Github ‘social network’ for developers. Jez Cope from Bath mentioned that there was an open source desktop tool called SparkleShare that provides a folder on your PC, just like Dropbox, Google Drive and Ubuntu One do, and uses Git as its backend. Jez and I have been playing with SparkleShare the last couple of days, having installed the Mac client on our laptops and it shows some promise but also needs some further consideration and effort to meet our immediate requirements for RDM. Jez has written a companion post about this, too.

SparkleShare for RDM

Pros

Multi platform GUI client

Easy to install

Relatively mature, actively maintained open source project

Version control built into backend (Git)

Notifications of changes to folder contents

Cons

Git isn’t built for handling large, binary files

Version control not built into desktop client (shows a high-level history of changes, but no roll-back functionality)

Sharing folders not built into desktop client

Next steps?

If Git isn’t the right choice of backend, SparkleShare can use something else. Whatever the underlying versioned repository technology, SparkleShare currently lacks detailed versioning information and roll-back functionality, which is in the backend repository. Presumably it could be surfaced and further functionality built around it. Likewise, a more convenient way to share repository folders with other people could be added to the client. Currently, you need to share the repository with them outside of the client.

Windows Explorer integration

Most researchers are using Windows as their OS, so it’s worth looking at the integration with Windows Explorer that other tools use. The DATUM project selected Bazaar over Git because they found the integration (TortoiseBZR) with Explorer to be better. I have found the standard Git tools for Windows Explorer to be pretty good, too. Neither provide the transparent functionality of SparkleShare or Dropbox.

Handling big files

The git architecture simply sucks for big objects. It was discussed somewhat durign the early stages, but a lot of it really is pretty fundamental. The fact that all the operations work on a full object, and the delta’s are (on purpose) just a very specific and limited kind of size compression is just very ingrained… Personally, I think the answer is “git is good for lots of small files”. It’s very much what git was designed for, and the fact that it doesn’t work for everything is a trade-off for the things it _does_ work well for.

So says Linus Torvalds, the creator of Git (and the Linux kernel). Git and other source code repository software were not designed to handle big files. However, there are other Git-based and alternative projects that are addressing this. git-annex is a mature well-documented and maintained project that

allows managing files with git, without checking the file contents into git. While that may seem paradoxical, it is useful when dealing with files larger than git can currently easily handle, whether due to limitations in memory, time, or disk space. Even without file content tracking, being able to manage files with git, move files around and delete files with versioned directory trees, and use branches and distributed clones, are all very handy reasons to use git. And annexed files can co-exist in the same git repository with regularly versioned files, which is convenient for maintaining documents, Makefiles, etc that are associated with annexed files but that benefit from full revision control.

git-annex includes a use case on its home page that speaks to the RDM domain:

use case: The Archivist

Bob has many drives to archive his data, most of them kept offline, in a safe place.

With git-annex, Bob has a single directory tree that includes all his files, even if their content is being stored offline. He can reorganize his files using that tree, committing new versions to git, without worry about accidentally deleting anything.

When Bob needs access to some files, git-annex can tell him which drive(s) they’re on, and easily make them available. Indeed, every drive knows what is on every other drive. more about location tracking

Bob thinks long-term, and so he appreciates that git-annex uses a simple repository format. He knows his files will be accessible in the future even if the world has forgotten about git-annex and git. more about future-proofing

Run in a cron job, git-annex adds new files to archival drives at night. It also helps Bob keep track of intentional, and unintentional copies of files, and logs information he can use to decide when it’s time to duplicate the content of old drives. more about backup copies

The git-annex website has a useful page that discusses what it is not and it points to Sharebox as a FUSE filesystem built on top of git-annex. The project doesn’t look as mature as SparkleShare, but it’s good to see work being done on this, as the use case for Sharebox is very close to what I think several RDM projects are looking for. The git-annex website also points to other projects that are worth considering:

git-annex is more than just a workaround for git limitations that might eventually be fixed by efforts like git-bigfiles.

git-bigfiles does not tackle the same use cases that SparkleShare and Sharebox are focused on, but could perhaps provide the backend to such tools.

git-media has the advantage of using git smudge filters rather than git-annex’s pile of symlinks, and it may be a tighter fit for certain situations. It lacks git-annex’s support for widely distributed storage, using only a single backend data store. It also does not support partial checkouts of file contents, like git-annex does.

git-media is also a command-line tool and therefore provides only part of the solution to a ‘Dropbox alternative’ for big files. It doesn’t look like there’s been very much activity on the project in the last couple of years.

Boar implements its own version control system, rather than simply embracing and extending git. And while boar supports distributed clones of a repository, it does not support keeping different files in different clones of the same repository, which git-annex does, and is an important feature for large-scale archiving.

Boar does not use git, but is an alternative “version control and backup for photos, videos and other binary files.” It is not a distributed version control system either, but “does however work well with repositories on mapped network file systems, such as Windows shares and NFS.” The rationale for Boar is worth reading as it addresses many of the problems found in the RDM domain. It’s a well-maintained and well documented project, which, like git-annex, was clearly written to tackle genuine archival problems.

Boar aims to be the perfect way to make sure your most important digital information, like pictures, movies and documents, are stored safely.

  • Boar makes it possible for you to restore any or all of your files from any point in time.
  • Boar makes it easy to maintain verified backups of your data, including file history.
  • Boar imposes no limits on file or repository sizes.
  • Using boar is an effective way to prevent data loss due to human or machine error.

If you are familiar with vcs software such as Subversion, you might think of boar as “version control for large binary files”.

This sounds like an ideal tool for expert users willing to use the command line for managing large research datasets and binary files and it would be worth looking at how much work it would take to write GUI client or Windows Explorer integration as an alternative to Dropbox.

In summary, there are robust command line tools suitable for managing workspaces for research data over a network, but more work is required to build effective, simple graphical clients that can be used by any researcher.

We’re listening.

We love hearing feedback, so as part of our v0.1 release of Orbital we decided to make giving feedback super easy. We’ve done this using the tools provided by UserVoice, giving us a single box which lets you complete a simple sentence:

I need Orbital to…

That’s it. No questionnaires to draw up complicated statistical analysis of how people feel about a list of stuff we plucked out of thin air, just a single box for you to share your requirements with us.

UserVoice is designed to be really useful when it comes to gathering user feedback, so we’re taking advantage of that usefulness and adding a “feedback” button to every single page. When inspiration strikes, or you discover that Orbital is missing something, or you realise that life would be easier if it did something differently, you can just click the button and let us know.

You can also see a list of everything that people have said already, to see if your thought has already been shared. If you think a thought is particularly important you can vote on it, to add your own voice to what you reckon we should do next. We’ll also let you know how we’re getting on with looking at or implementing your ideas, telling you when we’re looking at feasibility, planning it and working (or not) on it.

So there you have it. No focus groups, workshops or questionnaires to carefully filter and manage your thoughts and suggestions. Just a simple box and a hotline to the development team.