Highly connected graphs: Opening BBC data

Mike Butcher wants the BBC to open up – to make its data available and to provide lots of APIs. Or, as he puts it:

Dear BBC,

What we want is your data, a lot more APIs, developer tools and your traffic.
We’ve paid for it already in the license fee.
Now get on with it.

Yours Sincerely,
The UK’s Startups

I do largely agree with Mike’s central premise – the BBC does need to make its data more accessible, and it does need to provide more APIs. And as Matt has already noted, there are people at the BBC working to open things up. Now I don’t want to get into the debate about what the BBC does well vs what it doesn’t – but I did want to highlight some of the work the team I work in is doing, and to give some perspective on why Mike’s objective isn’t as simple as it might appear.

I work in the “FM&T for A&Mi” bit of the BBC (as James has rechristened it) – in other words the ‘new media’ team embedded within the radio and music department. We’re currently working on a couple of projects (programmes and a revamped music site) that I hope might give some of the UK Startups some of what Mike is after. And in due course we’ll be adding more data that will make more startups happy (hopefully).

So what are we doing? It’s probably easiest to start by looking at the current programmes beta – the objective is to ensure that every programme the BBC broadcasts has a permanent, findable web presence. The site provides data for the eight BBC TV channels, ten national radio stations and the six stations covering Scotland, Northern Ireland and Wales.

To enable the sharing of this data in a structured way, we are using the linked data approach to connect and expose resources – that is, using web technologies (URLs, HTTP etc.) to identify and link to a representation of something, where that something can be a person, a programme or an album release. These resources also have representations which can be machine-processable (through the use of RDF, Microformats, RDFa etc.), and those representations can contain links to other web resources, allowing you to jump from one dataset to another.
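
To make that a little more concrete, here’s a minimal sketch (in Python, using rdflib – not anything we run in production) of what consuming one of these machine-processable representations might look like. The URL pattern and the programme ID are illustrative assumptions, not a documented interface:

```python
# A minimal sketch, assuming a /programmes/{pid}.rdf style URL exists for the
# programme in question; the PID below is a placeholder, not a real programme.
from rdflib import Graph, URIRef

PROGRAMME_RDF = "http://www.bbc.co.uk/programmes/b0000000.rdf"  # hypothetical

g = Graph()
g.parse(PROGRAMME_RDF, format="xml")  # RDF/XML representation of the resource

# Any object that is itself a URI is a link to another web resource --
# possibly in a completely different dataset -- which is the "follow your
# nose" part of linked data.
for subject, predicate, obj in g:
    if isinstance(obj, URIRef):
        print(predicate, "->", obj)
```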

OK, so that’s the theory – what are we actually doing?

Currently the pages are marked up with microformats: hCalendar on the schedule views and hCard for cast and crew on episode pages. That’s OK, but hardly the APIs that Mike is after. What’s coming very soon will hopefully be a bit closer to the mark, though. Our plan is to make all resources available in a variety of formats: XML, Atom, RSS 2, JSON, YAML, RDF etc. (We’ll announce these on Backstage as they become available.)
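
For what’s there today, a scraper can already pick up the microformats. Here’s a rough sketch (again Python, and again not our code) that pulls the hCard cast and crew out of an episode page; the class names come from the hCard spec, the URL is a placeholder, and the exact BBC markup may well differ:

```python
# Rough sketch: scrape hCard cast-and-crew markup from an episode page.
# The URL is a placeholder and the markup structure is assumed, not guaranteed.
import urllib.request

from bs4 import BeautifulSoup

URL = "http://www.bbc.co.uk/programmes/b0000000"  # hypothetical episode page

with urllib.request.urlopen(URL) as response:
    soup = BeautifulSoup(response.read(), "html.parser")

# hCard marks each person up with class="vcard"; "fn" is the formatted name
# and "role" is the part they play or the job they do.
for card in soup.select(".vcard"):
    name = card.select_one(".fn")
    role = card.select_one(".role")
    if name is not None:
        print(name.get_text(strip=True),
              "-",
              role.get_text(strip=True) if role else "")
```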

And to help folk get direct access to the actual data backing BBC Programmes, we designed a Semantic Web ontology covering programmes data: the Programmes Ontology. It provides web identifiers for concepts such as brand, series and episode, and is released under a Creative Commons licence so anyone can use it.
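
To give a flavour of how the ontology gets used, here’s another small rdflib sketch that picks out resources typed with Programmes Ontology terms. The namespace URI and term names reflect the ontology as published, but treat the details (including the dc:title assumption and the placeholder PID) as illustrative rather than definitive:

```python
# Sketch: list the episodes in a programme's RDF using Programmes Ontology
# terms. The PID is a placeholder and dc:title is an assumption about the data.
from rdflib import Graph, Namespace
from rdflib.namespace import DC, RDF

PO = Namespace("http://purl.org/ontology/po/")  # Programmes Ontology namespace

g = Graph()
g.parse("http://www.bbc.co.uk/programmes/b0000000.rdf", format="xml")  # hypothetical

# Everything typed as po:Episode, with its title where one is given.
for episode in g.subjects(RDF.type, PO.Episode):
    for title in g.objects(episode, DC.title):
        print(episode, title)
```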

But there are limits to such a web interface. To provide a more expressive API we are also investigating D2R Server, a Java application that maps relational databases to RDF, to make the data accessible through SPARQL. SPARQL allows you to carry out more complex queries than would be possible with the simple RDF representations of the resources – think of SPARQL as SQL for the semantic web. It also allows you to semantically connect to external data sources such as DBpedia, pulling in information that isn’t in our dataset, such as the date and place of birth of cast members.
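
For example, the sort of query this enables – run here against the public DBpedia endpoint rather than any BBC D2R instance, with DBpedia’s property names and a placeholder person, so purely a sketch – looks something like this:

```python
# Sketch of a SPARQL query for a cast member's date and place of birth, run
# against DBpedia. The person's name and the dbo: property names are
# illustrative; this isn't a BBC endpoint.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    SELECT ?birthDate ?birthPlace WHERE {
        ?person rdfs:label "Example Actor"@en ;
                dbo:birthDate  ?birthDate ;
                dbo:birthPlace ?birthPlace .
    }
    LIMIT 5
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["birthDate"]["value"], row["birthPlace"]["value"])
```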

So what about music data? We’re not as far ahead with this work as we are with programmes but you still shouldn’t have to wait too long.

As I’ve written about before, we are using MusicBrainz to provide GUIDs for artists and releases, and to give us core metadata about those music resources. Using MusicBrainz IDs means that we can relate all the BBC resources about an artist together, and that others who also use these GUIDs (e.g. Metaweb) can use them to find our resources.
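
Here’s a sketch of what that shared key buys a developer, using the third-party musicbrainzngs library; the GUID and the BBC URL pattern below are placeholders and assumptions for illustration, not a published interface:

```python
# Sketch: the same MusicBrainz GUID keys the artist's core metadata on
# MusicBrainz and, in principle, the BBC's resources about that artist.
import musicbrainzngs

musicbrainzngs.set_useragent("bbc-data-example", "0.1")

MBID = "00000000-0000-0000-0000-000000000000"  # placeholder artist GUID

# Core metadata about the artist comes from MusicBrainz...
artist = musicbrainzngs.get_artist_by_id(MBID)["artist"]
print(artist["name"])

# ...and the same GUID can, hypothetically, locate the BBC's resources.
print("http://www.bbc.co.uk/music/artists/" + MBID)  # assumed URL pattern
```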

In terms of making the data accessible, it’s a similar story to programmes. We’re currently marking up relevant pages with microformats (hReview and hCard), but the plan is to publish BBC music resources in a variety of representations (XML, Atom, RSS 2, JSON, APML etc.).

What resources are we talking about? In addition to the core data from MusicBrainz, we’re also talking about album reviews (released under a Creative Commons licence) and data from our music programmes, for example:

  • views aggregating the artists played on each station into both charts (most played that day, that week and since the prototype started running) and artist clouds – there’s a toy sketch of the chart aggregation after this list;
  • programme views, which are similar to the station aggregation views but for each programme, with links through to each episode;
  • programme episodes with track listings which link through to artist pages;
  • artist pages with biographies pulled in from Wikipedia (MusicBrainz includes links to Wikipedia) and links back to the programmes that have featured that artist;
  • and much more…
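
As mentioned in the first bullet, here’s a toy sketch of the kind of aggregation behind those charts. The play log, station names and artists are invented for illustration, and this isn’t how the prototype is actually built:

```python
# Toy sketch: count plays per artist for one station to build a "most played"
# chart. The data and station names are made up for illustration.
from collections import Counter

plays = [
    ("radio1", "Artist A"),
    ("radio1", "Artist B"),
    ("radio1", "Artist A"),
    ("6music", "Artist C"),
]

radio1_chart = Counter(artist for station, artist in plays if station == "radio1")

for artist, count in radio1_chart.most_common():
    print(artist, count)
```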

As you can see, we’re not only making the data available as discrete resources, we’re also linking them together – and making that data available in both human-readable and machine-readable views. This is a big job: it involves a lot of data and a lot of systems (both web and production systems), and it all needs to work under high load.

And what about the future? As Michael recently presented at Semantic Camp, our plans are to join up programmes, music, events, users and topics. All available on the web, for ever, for people to explore and open for machines to process. If you would like to find out more, Nick and I will be discussing this further at XTech next month in Dublin.

So I hope that, in our own way, we are ‘getting on with it’ – sorry for the delay.

Photo: Please open door slowly, by splorp. Used under licence.