Manchester buses Linked Data diary, Day 1

On 9 April, we (Ric and Bill from Swirrl) are going along to the Lovely Data Transport Hack Day in Manchester. A few weeks ago we said that we’d put together a Linked Data version of the Transport for Greater Manchester bus timetable data, which has recently been released as open data.

That seemed like a good idea at the time! But now it’s only 6 days till the event, so we’d better get on with it. We’re planning to hack together an app using this data at the event and we’re hoping others might use it too.

Since this will involve some concentrated Linked Data creation effort, we thought we would write a series of blog posts documenting the steps along the way, hoping that it might be a useful illustration of what’s involved. As well as showing off what will hopefully be the lovely shiny result, we’ll also be open about the tricks, hacks, mistakes, backtracks and shortcuts that will no doubt be required along the way!

Getting the data

The data is available to download from the DataGM website, a new collaborative initiative from a group of public sector organisations in the Greater Manchester area. The bus timetable data is in a package called ATCO-CIF (named after the standard timetable interchange format used by the UK transport industry). This data was created on 28 March 2011, so it’s up to date (I think they release a new package weekly).

Downloading and unzipping this package produces around 500 text files, each one containing the timetable for a particular bus route. The format is fixed-width text, where the position of each character on the line is significant. At first glance it’s a bit difficult to understand, but it’s compact and reasonably easy to parse, as long as you have the specification! This supporting document is useful too.
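To give a flavour of what parsing a format like this involves, here is a minimal Python sketch. The record type, field names and column positions in STOP_LAYOUT are made up for illustration and are not taken from the ATCO-CIF specification, so the real offsets would have to come from the spec.

```python
# Hypothetical layout: (field name, start column, end column), end exclusive.
# These offsets are illustrative only; the real ones are in the specification.
STOP_LAYOUT = [
    ("record_type", 0, 2),
    ("stop_code", 2, 14),
    ("time", 14, 18),
]


def parse_fixed_width(line, layout):
    """Slice one fixed-width line into a dict of whitespace-stripped fields."""
    return {name: line[start:end].strip() for name, start, end in layout}


# A made-up line that follows the hypothetical layout above.
sample = "XX1800NE07711 0715"
record = parse_fixed_width(sample, STOP_LAYOUT)
```

Keeping the layout as data rather than hard-coding the slices means each record type can get its own table once the real column positions are known.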

Understanding the structure

Each file represents a timetable and contains a list of related ‘services’, generally following the same route: there are ‘inward’ and ‘outward’ services, and sometimes different services for weekdays, Saturdays, Sundays, bank holidays etc.

Each service has a series of journeys. A journey represents a bus following the service route at a specified set of times, so if the bus runs twice an hour from 7am to 7pm, then you end up with 24 journeys.

Each service also has a list of the stops that make up the route. Each stop has an identifier of up to 12 characters (such as ‘1800NE07711’). The files themselves don’t contain detailed information on where each bus stop is, just a brief description.
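The structure described above could be modelled roughly like this in Python. This is only a sketch of how we read the files: the class and field names (Timetable, Service, Journey, StopEvent and so on) are our own invention for illustration, not names from the CIF format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StopEvent:
    """A bus calling at one stop at one time (names here are hypothetical)."""
    stop_code: str  # e.g. "1800NE07711"
    time: str       # "HHMM"


@dataclass
class Journey:
    """One bus running along the service route at a specific set of times."""
    stop_events: List[StopEvent] = field(default_factory=list)


@dataclass
class Service:
    """One direction/day variant of a route, with its stops and journeys."""
    direction: str
    stops: List[str] = field(default_factory=list)
    journeys: List[Journey] = field(default_factory=list)


@dataclass
class Timetable:
    """One CIF file: a list of related services."""
    source_file: str
    services: List[Service] = field(default_factory=list)


# A bus twice an hour from 7am to 7pm gives 12 hours x 2 = 24 journeys.
service = Service(direction="outward", stops=["1800NE07711"])
for hour in range(7, 19):
    for minute in (0, 30):
        event = StopEvent("1800NE07711", f"{hour:02d}{minute:02d}")
        service.journeys.append(Journey([event]))

timetable = Timetable(source_file="example.cif", services=[service])
```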

I was already aware of the NAPTAN dataset and knew that it was available as linked data at the data.gov.uk site.

The NAPTAN linked data identifiers for bus stops follow this kind of pattern: http://transport.data.gov.uk/id/stop-point/1800NE07711, where the last bit of the URI is the same stop code used in the CIF files. It’s almost like someone planned it :-). So we can link to data.gov.uk to get location and other information about the bus stops.
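Turning a stop code from a CIF file into its NAPTAN URI is then just string concatenation. A minimal Python sketch, assuming the code may arrive padded with spaces from the fixed-width file (the function name stop_uri is ours):

```python
NAPTAN_STOP_BASE = "http://transport.data.gov.uk/id/stop-point/"


def stop_uri(stop_code):
    """Build the NAPTAN Linked Data URI for a stop code read from a CIF file."""
    code = stop_code.strip()  # drop any fixed-width padding
    if not code:
        raise ValueError("empty stop code")
    return NAPTAN_STOP_BASE + code


uri = stop_uri("1800NE07711 ")
```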

Sanity check

Before we get too far down the road, it’s worth just getting a rough idea of how much data we’re going to end up with. Doing a ‘wc’ on the files shows that there are about 4.5 million lines, most of which will be ‘stop events’, i.e. part of a journey where a bus visits a given stop at a given time.

If we assume roughly 10 triples per line of CIF data, that will be 45 million triples. One line of an n-triples file is usually around 100 bytes, so that’s about 4.5GB in total (but n-triples compresses well due to all the repetition, so the data won’t be too hard to handle and move around). 45 million triples is a lot but it won’t melt the internet, so let’s press on!
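For anyone who wants to play with the numbers, the same back-of-the-envelope sum in Python. All three inputs are the rough guesses from above, not measurements:

```python
# Rough inputs: line count from 'wc', the other two are guesses.
cif_lines = 4_500_000
triples_per_line = 10
bytes_per_triple = 100  # typical length of one n-triples line

total_triples = cif_lines * triples_per_line
total_bytes = total_triples * bytes_per_triple
total_gb = total_bytes / 1e9

print(f"{total_triples:,} triples, about {total_gb:.1f}GB uncompressed")
```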

I’ve got the skeleton of a parser written for the CIF files. Tomorrow, we’ll look at how to represent the information as RDF.

Stay tuned for the next instalment tomorrow! (You can subscribe to the feed here).
