I’m currently doing some linked data consultancy for the Open Planets Foundation (OPF). I’m helping to investigate the options for using linked data for a registry of file format information, part of the toolbox needed for long-term preservation of digital material by ‘memory institutions’ like archives and libraries. This initiative was kicked off and sponsored by the National Archives of the Netherlands, one of the founding members of the OPF, a not-for-profit that was set up to further the work of the EU Framework 7 research project PLANETS.

What’s a file format registry? In order to make sure you can still access all your files in 10, 50 or 100 years you first need to know what kind of file formats you’ve got and the tools available to work with them. For institutions with a responsibility to look after our records of government and cultural history this is a high priority. So step 1 is to systematically track file formats and information about them.

Registries of this kind already exist, notably the UK National Archives’ PRONOM system, the GDFR system from Harvard University Library and the PLANETS Core Registry.

So why do we need another one? The field of digital preservation research is still only a decade or two old and many lessons are still being learned: the first generation of registries have done a great job in many respects but have also highlighted new requirements.

An important issue arises from the large amount of ongoing research effort required to keep on top of the wide range of file formats in use, with new types of digital material and new software appearing all the time. This is not a job that any one institution can afford to do by itself, so sharing of information is essential, between archives, libraries, universities, software vendors and individual experts. Also, the information you need is a mixture of facts and policy choices. The specification for PDF 1.4 may not be open for argument, but choices on how to manage PDF files over the long term and what tools to use may vary from one organization to another.

The problem is in many ways one of distributed web publishing, with the need for unambiguous shared identifiers, so everyone knows when they are talking about the same thing. The information to be stored about file formats is complex and a precisely defined shared vocabulary for format descriptions is essential for effective information sharing. So it’s a very natural fit for linked data.
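
To make that a little more concrete, here is a minimal sketch in plain Python of what shared identifiers buy you. It is only an illustration: every URI below is a made-up example.org placeholder rather than an identifier from any real registry, and the `describe` helper is hypothetical. The idea is that a registry publishes facts about a format, two archives publish their own policy choices, and because all three use the same identifier for PDF 1.4 their statements can simply be pooled.

```python
# Hypothetical identifiers: none of these URIs come from a real registry.
PDF_1_4 = "http://example.org/format/pdf-1.4"
HAS_MIME_TYPE = "http://example.org/vocab/mimeType"
PREFERRED_TOOL = "http://example.org/vocab/preferredTool"

# Each publisher holds its own set of (subject, predicate, object) statements.
registry_facts = {
    (PDF_1_4, HAS_MIME_TYPE, "application/pdf"),
}
archive_a_policy = {
    (PDF_1_4, PREFERRED_TOOL, "http://example.org/tool/viewer-x"),
}
archive_b_policy = {
    (PDF_1_4, PREFERRED_TOOL, "http://example.org/tool/converter-y"),
}


def describe(subject, *sources):
    """Collect every statement about one subject across all sources."""
    merged = set().union(*sources)
    return sorted(t for t in merged if t[0] == subject)


statements = describe(PDF_1_4, registry_facts, archive_a_policy, archive_b_policy)
for subject, predicate, obj in statements:
    print(subject, predicate, obj)
```

The fact about the format and the two differing policy choices sit side by side without conflict, because each is just a separate statement about the same identifier. A real implementation would more likely use RDF and an agreed vocabulary than Python tuples.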
