The hard graft of Linked Data ETL
Allowing a range of users to repeatably and reliably produce high quality Linked Data requires a new approach to Extract, Transform, Load (ETL), and we’re working on a solution called Grafter.
At Swirrl, our aims are simple: we’re helping regional and national governments publish the highest quality open data imaginable, whilst helping consumers access and use this data in rich and meaningful ways. This means visualisations, browsing, metadata, statistical rigour, data modelling, data integration and lots of design; and it’s why our customers choose to buy PublishMyData.
This is great, but unlocking the data from the “ununderstood dark matter of IT” and the dozens of other formats in use within the bowels of government is currently a highly skilled job.
Extract, Transform and Load
For us this job involves converting the data from its source form (which is almost always tabular in nature) to a linked data graph. Once this is done the data then needs to be loaded into our publishing platform. This whole life-cycle is known as “Extract, Transform and Load”, or ETL for short.
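To make that concrete, here is a minimal sketch in Python of the core conversion step: turning rows of a table into RDF triples serialised as N-Triples. The column names and URIs are invented for illustration and don't reflect any real dataset or Grafter's own API.

```python
import csv
import io

# A tiny tabular dataset of the sort we typically receive (hypothetical).
TABLE = """area,year,population
E08000025,2013,1092330
E08000026,2013,316960
"""

def row_to_triples(row):
    # Mint a subject URI from the row's key column, then emit one
    # N-Triples statement per remaining column.
    subject = "<http://example.org/id/area/%s>" % row["area"]
    for column in ("year", "population"):
        predicate = "<http://example.org/def/%s>" % column
        yield '%s %s "%s" .' % (subject, predicate, row[column])

triples = [t for row in csv.DictReader(io.StringIO(TABLE))
             for t in row_to_triples(row)]
print("\n".join(triples))
```

A real transformation also has to clean values, mint URIs consistently across datasets and type its literals, which is where most of the effort described below goes.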
Recently, thanks to our continuing growth in employees, customers and data volume, we've come to identify ETL as a barrier: one that limits both our own and our customers' ability to produce high quality, high volume data in a repeatable, efficient manner.
ETL is not a new problem; it dates back to the code breaking work of Turing at Bletchley Park in the 1940s, though solutions were largely popularised by the likes of IBM with the birth of mainframe, batch processing and database computing in the 1950s and 60s.
The Bletchley Park ETL Pipeline
Image: Wikipedia (CC BY-SA 3.0)
Unfortunately when it comes to Linked Data, the ETL tools available tend to be immature, unusable, flawed in a critical dimension, or too inflexible to be useful for the work we encounter on a day to day basis.
For lack of a better, more flexible option, we've met all of our ETL needs until now by writing bespoke Ruby scripts. Though this approach is flexible and we manage a certain amount of code re-use, it can be problematic and costly.
The scripts typically end up doing data cleaning and data conversion: they are often quite large and take time to develop, and they can be awkward to maintain and document. So if a script needs to be re-run with new data, another time consuming step is required to check exactly how it works and what inputs it needs.
But even deeper problems can occur. Is the script robust to structural changes in the source data? Does it identify errors and report them properly? Does the script need to be modified to accommodate any new assumptions in the data? And, when it’s finally done, how can we be sure it actually did the right thing?
Typically once we’ve run the script, we’ll have a file of linked data in an appropriate RDF serialisation (usually n-triples). This file will then need to be loaded and indexed by the triple store; an operation made awkward by having to load and transfer large files over the network.
Answering all of these questions and validating the result is time consuming - it’s clear that a better way is needed. We realised it was time to start addressing some of these thorny issues for ourselves and our customers and so are busy developing a suite of software we’re calling Grafter. We’re grateful for support from two EU FP7 projects: DaPaas and OpenCube.
The kinds of transformation we encounter broadly fall into two camps: simple or complex. With an appropriate tool, simple transformations could be built by the talented Excel users we meet working within local authorities and government. These represent perhaps 60-80% of the datasets we encounter.
The remaining datasets will sometimes require a significantly more complicated transformation; one that requires a Turing complete programming language to process.
There’s a tendency for many existing graphical ETL tools to become Turing complete, i.e. to introduce loops and conditional branches. But we feel this is a big mistake because the complexities of user interface workflows around Turing-complete systems are so great that they become almost useless.
Graphical Interfaces to Turing complete systems are unwieldy
Discriminating between user types
At one extreme of the spectrum are the software developers, like ourselves, who need to fall back on a Turing complete language to tackle the bigger, thornier transformation problems.
At the other end of that spectrum are the talented Excel workers who produce detailed and rigorous stats for Local Authorities and government. These users need tools that let them graphically perform the bulk of the data conversions necessary to publish their data online - without having to be programmers and deal with the many problems that come with software development.
It’s also worth mentioning that even experienced developers will prefer a graphical tool for simple cases, as GUIs can make some aspects of the problem significantly easier, especially by introducing rapid feedback and interfaces for data exploration.
Finally there is another class of users, who are less willing to get into low level data transformation, but are responsible for putting the data online, curating it and loading it into the system. These users should be able to use the transformations built by the other users, by simply providing the source data through a file upload.
We are hoping to serve all three classes of user with a suite of tools, built on a common foundation, that target the different roles within the data publication industry.
Clear & Coherent Transformation Model
The mental model for designing transformations should be clear, flexible and encourage people to build transformations that are unsurprising.
The set of operations for expressing transformations should help users fall into the pit of success, encouraging users to express their transformations in a way that is both natural and internally consistent.
Many of the existing data cleaning and transformation tools we’ve seen aren’t efficient enough at processing the data for robust, commercial grade ETL. For example Open Refine, though a great tool for data cleaning, isn’t suited to ETL because its processing model requires files to be eagerly loaded into RAM rather than streamed or lazily evaluated. This makes it unsuitable for large datasets, which can easily consume all available memory.
Also, it’s common for tools to make multiple unnecessary passes over the data - for example, performing an additional pass for each operation in the pipeline. We believe most data can be processed in a single pass, and that a good ETL tool needs to be efficient in terms of memory and minimizing the work it does.
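One way to sketch that single-pass, streaming model is with generators: each pipeline step lazily pulls rows from the previous one, so every row flows through all the operations in one pass and only a handful of rows are ever in memory at once. The operations below are purely illustrative, not Grafter's.

```python
# Each step consumes and yields rows lazily, so composing them does not
# force the whole file into RAM or trigger a pass per operation.
def read_rows(lines):
    for line in lines:
        yield line.strip().split(",")

def drop_blank(rows):
    # Filter out rows whose cells are all empty.
    for row in rows:
        if any(cell for cell in row):
            yield row

def upcase_first_column(rows):
    for row in rows:
        yield [row[0].upper()] + row[1:]

raw = ["leeds,2013\n", ",\n", "york,2013\n"]
pipeline = upcase_first_column(drop_blank(read_rows(raw)))
print(list(pipeline))  # each row visited once, in a single pass
```

In a real tool `raw` would be a lazily-read file handle rather than an in-memory list, and the final consumer would stream triples onwards rather than printing.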
Removing unnecessary intermediate steps
Our old way of doing things (with Ruby scripts) would frequently export the data into an RDF serialisation (such as n-triples) which would then need to be imported separately to the database via HTTP.
Sometimes you might want to serialise the output data locally to disk, but often we’d like to wire transformations directly into our triple store, because that minimises the number of manual steps required and means data can be processed as a stream from source to database.
We believe that ETL has to be a process you can trust: partial failures are common in distributed systems, particularly when dealing with large volumes of data. So detecting failures, recovering from them and failing into well-known, recoverable error states is extremely important for large scale ETL pipelines.
Robustness requires facilities for developers and pipeline builders to easily incorporate error detection into their transformations. It makes sense to support this kind of validation and assertions on the source data, at arbitrary intermediate steps and on the final output.
A big part of being robust is reporting good errors at different levels of severity, and handling them appropriately. Sometimes you might just want to warn the user of an error; other times you might want to fail the whole import. Sometimes if pipelines are to be reused, error handling may need to be overridden from outside of the pipeline itself.
Likewise, being robust to structural changes in the source data is critically important. Sometimes a transformation might expect a spreadsheet to grow in particular dimensions, but not others. For example, one transformation might need to be tolerant of new columns being added to the end of the sheet, whilst another should throw an error if that ever happens.
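As an illustration of the kind of validation hooks described above, the sketch below shows a structural check whose severity can be chosen by whoever runs the pipeline: warn and continue, or fail the whole import. All names here are invented for illustration and are not Grafter's actual API.

```python
import warnings

# The columns this (hypothetical) transformation was built to expect.
EXPECTED_HEADER = ["area", "year", "population"]

def check_header(header, severity="error"):
    # Validate the sheet's structure before any rows are processed.
    if header != EXPECTED_HEADER:
        message = "unexpected columns: %r" % (header,)
        if severity == "warn":
            warnings.warn(message)     # report, but let the import continue
        else:
            raise ValueError(message)  # fail the whole import
    return header

check_header(["area", "year", "population"])              # passes silently
check_header(["area", "year", "extra"], severity="warn")  # warns only
```

Because the severity is a parameter rather than hard-coded, the same check can be reused across pipelines with different tolerances, which is exactly the kind of override-from-outside behaviour described above.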
Layered architecture with a component interface
We believe that having a suite of interoperable ETL tools that operate and build on each other in a cleanly architected manner is the right way to solve the ETL problem. There is never a one size fits all solution to ETL, so having a suite of layered APIs, DSLs, file formats, services and tools that let users change levels of abstraction is important to ensure both flexibility and reuse.
It’s also important to be able to specify transformations abstractly, without too much concern over the underlying file format. For example, many formats are inherently tabular in nature but have different serialisations, so you need an abstraction that lets you convert arbitrary file formats into a common unit of operation for your pipeline.
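A sketch of that abstraction: readers for different serialisations each yield the same plain row representation, so downstream pipeline steps never need to know which format the data arrived in. The formats and field names are illustrative.

```python
import csv
import io
import json

def rows_from_csv(text):
    # CSV arrives as lines of delimited cells.
    yield from csv.reader(io.StringIO(text))

def rows_from_json(text):
    # JSON arrives as a list of records; flatten each into a row of cells.
    for record in json.loads(text):
        yield [record["area"], record["year"]]

csv_rows = list(rows_from_csv("leeds,2013\nyork,2013\n"))
json_rows = list(rows_from_json('[{"area": "leeds", "year": 2013}]'))
# Both readers produce plain lists of cells the rest of the
# pipeline can consume uniformly, regardless of source format.
```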
Transformations themselves should be loadable into a generic import service that will inspect each transformation for its required inputs and generate a web form allowing other users to import and process the raw data.
Once a spreadsheet’s structure has been decided (and a transformation built for sheets with that structure), import services become essential to gaining reuse out of transformations and lowering barriers to repeatable publication.
A lot of our users face similar data transformation challenges. It’s important that the transformation pipelines can be easily shared between users, so that rather than starting from scratch, a user can tweak a transformation that someone else has already built for a similar purpose.
Commercial Grade ETL
We see a need for a different kind of ETL tool and we’re currently working on delivering on this vision. We’re starting small and have a good understanding of the problems we’re tackling, and for whom we are solving them.
We already have the fundamentals of this approach up and running and we are using it to process hundreds of datasets far more efficiently than we have before.
The next step is to start wrapping Grafter (our solution) in a user interface that will make high performance repeatable linked data creation easier and quicker for experts and more accessible to non-experts.
We’ve found a widespread need for better ETL approaches and tooling. Over the next few months we’ll be sharing more details about our approach, and consulting with numerous data communities about Grafter and its design.
If you’re interested in our work, feel free to contact us, and stay tuned to the Swirrl blog for more posts about Grafter, data processing and ETL.
We'd love it if you shared this article.