Highly connected graphs: Opening BBC data

Mike Butcher wants the BBC to open up: to make its data available and to provide lots of APIs. Or, as he puts it:

Dear BBC,

What we want is your data, a lot more APIs, developer tools and your traffic.
We’ve paid for it already in the license fee.
Now get on with it.

Yours Sincerely,
The UK’s Startups

I do largely agree with Mike’s central premise – the BBC does need to make its data more accessible, it does need to provide more APIs. And as Matt has already noted there are people at the BBC working to open things up. Now I don’t want to get into the debate about what the BBC does well vs what it doesn’t – but I did want to highlight some of the work that the team I work in is doing and to give some perspective on why Mike’s objective isn’t as simple as it might appear.

I work in the “FM&T for A&Mi” bit of the BBC (as James has rechristened it) – in other words the ‘new media’ team embedded within the radio and music department. We’re currently working on a couple of projects (programmes and a revamped music site) that I hope might give some of the UK Startups some of what Mike is after. And in due course we’ll be adding more data that will make more startups happy (hopefully).

So what are we doing? It’s probably easiest to start by looking at the current programmes beta – the objective is to ensure that every programme the BBC broadcasts has a permanent, findable web presence. The site provides data for the eight BBC TV channels, ten national radio stations and the six stations covering Scotland, Northern Ireland and Wales.

To enable the sharing of this data in a structured way, we are using the linked data approach to connect and expose resources i.e. using web technologies (URLs and HTTP etc.) to identify and link to a representation of something, and that something can be a person, a programme or an album release. These resources also have representations which can be machine-processable (through the use of RDF, Microformats, RDFa, etc.) and they can contain links to other web resources, allowing you to jump from one dataset to another.

OK so that’s the theory – what are we doing?

Currently the pages are marked up with microformats: hCalendar on the schedule views and hCard for cast and crew on episode pages. That’s OK – but hardly the APIs that Mike is after. But what’s coming very soon will hopefully be a bit closer to the mark. Our plan is to make all resources available in a variety of formats: XML, Atom, RSS 2, JSON, YAML, RDF etc. (We’ll announce these on Backstage as they become available.)
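To make the idea of multiple representations concrete, here’s a minimal sketch of how format-specific versions of a resource might be addressed. The “.format” suffix convention and the example identifier are illustrative assumptions on my part, not a documented BBC URL scheme:

```python
# Sketch: one resource, many representations, addressed by suffix.
# BASE and the example identifier are assumptions for illustration.

BASE = "http://www.bbc.co.uk/programmes"

SUPPORTED = {"xml", "json", "yaml", "rdf"}

def representation_url(pid, fmt):
    """Build a URL for one representation of a programme resource."""
    if fmt not in SUPPORTED:
        raise ValueError("unsupported format: " + fmt)
    return "%s/%s.%s" % (BASE, pid, fmt)

print(representation_url("b006qykl", "json"))
# → http://www.bbc.co.uk/programmes/b006qykl.json
```

The point is that every representation hangs off the same persistent identifier, so human-readable and machine-readable views stay linked.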

And to help folk get direct access to the actual data backing BBC Programmes, we designed a Semantic Web ontology covering programmes data, The Programmes Ontology. This ontology provides web identifiers for concepts such as brand, series or episode and is released under a Creative Commons licence so anyone can use it.

But there are limitations to such a web interface. To provide a more expressive API we are also investigating D2R Server, a Java application that maps relational databases to RDF, making the data accessible through SPARQL. SPARQL allows you to carry out more complex queries than would be possible with simple RDF representations of the resources – think of SPARQL as SQL for the semantic web. It allows you to semantically connect to external data sources such as DBpedia to provide extra information that is not present in our dataset, such as the date and place of birth of cast members.
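To give a flavour of the kind of query this opens up, here’s a sketch. The po: prefix is the Programmes Ontology namespace, but the exact property names used here are assumptions for illustration rather than the deployed schema:

```python
# Sketch: a "list every episode of a brand" SPARQL query.
# The po:episode / dc:title properties are illustrative assumptions.

QUERY_TEMPLATE = """\
PREFIX po: <http://purl.org/ontology/po/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?episode ?title WHERE {
  <%(brand)s> po:episode ?episode .
  ?episode dc:title ?title .
}
"""

def episodes_query(brand_uri):
    """Fill a brand URI into the episode-listing query."""
    return QUERY_TEMPLATE % {"brand": brand_uri}

print(episodes_query("http://example.org/programmes/some-brand"))
```

Because the query works over URIs rather than local database keys, the same pattern extends naturally across datasets – swap the brand URI for a DBpedia resource and you are joining against someone else’s data.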

So what about music data? We’re not as far ahead with this work as we are with programmes but you still shouldn’t have to wait too long.

As I’ve written about before, we are using MusicBrainz to provide GUIDs for artists and releases, and to give us core metadata about those music resources. The use of MusicBrainz IDs means that we can relate all the BBC resources about an artist together; and others who also use these GUIDs (e.g. Metaweb) can use them to find our resources.
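A quick sketch of what sharing those GUIDs buys you: relating BBC and external resources becomes a matter of assembling URLs around the same identifier. The BBC path below is my assumption about how the revamped music site might be laid out, and the MBID is just an example:

```python
# Sketch: one MusicBrainz GUID keys both BBC and external resources.
# The /music/artists path is an assumption, not a published scheme.

def artist_links(mbid):
    """Map one MusicBrainz artist GUID to related resources."""
    return {
        "bbc": "http://www.bbc.co.uk/music/artists/" + mbid,
        "musicbrainz": "http://musicbrainz.org/artist/" + mbid,
    }

links = artist_links("cc197bad-dc9c-440d-a5b5-d52ba2e14234")  # example MBID
print(links["musicbrainz"])
```

Anyone else publishing against the same GUIDs can do the reverse lookup just as cheaply, which is the whole point.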

In terms of making the data accessible – it’s a similar story to programmes. We’re currently marking up relevant pages with microformats (hReview and hCard) but the plan is to publish BBC music resources in a variety of different representations (XML, Atom, RSS 2, JSON, APML etc.).

What resources are we talking about? In addition to the core data from MusicBrainz we’re also talking about album reviews (released under a Creative Commons License) and data from our music programmes, for example:

  • views aggregating the artists played on each station into both charts (most played that day, week and since the prototype started running) and artist clouds;
  • programme views which are similar to the station aggregation views but for each programme and links through to each episode;
  • programme episodes with track listings which link through to artist pages;
  • artist pages with biographies pulled in from Wikipedia (MusicBrainz includes links to Wikipedia) and links back to the programmes that have featured that artist;
  • and much more…
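The station charts in the first of those bullets boil down to a simple aggregation. Here’s a minimal sketch, assuming play events arrive as (station, artist) pairs – the real system, of course, draws on production playout data:

```python
# Sketch: "most played" chart for a station from a stream of play events.
from collections import Counter

plays = [
    ("radio1", "Radiohead"),
    ("radio1", "Elbow"),
    ("radio1", "Radiohead"),
    ("6music", "Elbow"),
]

def most_played(events, station, n=3):
    """Top-n artists for one station, most played first."""
    counts = Counter(artist for st, artist in events if st == station)
    return counts.most_common(n)

print(most_played(plays, "radio1"))  # → [('Radiohead', 2), ('Elbow', 1)]
```

The artist clouds are the same aggregation with the counts mapped to font sizes instead of rank positions.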

As you can see, we’re not only making the data available as discrete resources, we are also linking them together – and making that data available in both human-readable and machine-readable views. This is a big job – it involves a lot of data, a lot of systems (both web and production systems) and it all needs to work under a high load.

And what about the future? As Michael recently presented at the Semantic Camp our plans are to join programmes, music, events, users and topics. All available on the web, for ever, for people to explore and open for machines to process. If you would like to find out more then Nick and I will be discussing this further at XTech next month in Dublin.

So I hope, in our way we are ‘getting on with it’ – sorry for the delay.

Photo: Please open door slowly, by splorp. Used under licence.

Foo Camping

I’ve just published a short piece on my recent trip to San Francisco and the O’Reilly Foo Camp over at the BBC Radio Lab’s blog.

It was my first trip to San Francisco and I loved the city (you can see my photos on Flickr). But I was also struck by how meme-friendly the place is. I guess that’s not that surprising – it’s a relatively small city with a high density of tech companies in and around the Bay Area, but nonetheless it does appear to be a good place for tech memes to arise and flourish. One reason why that corner of the world produces so much innovative technology?

Anyway below is my blog post as published on the Radio Lab’s blog.


“I’ve recently returned from a very enjoyable and educational trip to California where I was honored to be invited to attend the Social Graph Foo Camp. I have to say that while I found the whole thing very exciting, I was also, at times, left realising just how far behind some of these conversations I have fallen. It really is amazing how rapidly the issues and technology within this space are developing – and that’s in the context of a fast-moving industry.

It was, however, clear that the really big issues are social not technological: user expectations, data ownership and portability. Although a key piece of the technology puzzle in all this is the establishment of XFN and FOAF, which are going to play an increasingly important role in gluing different social networks together. And with the launch of Google’s Social Graph API (released under a Creative Commons license by the way) data portability is going to really explode; but with it expect more “Scoblegate” like incidents.

But the prize for getting this right is great, as illustrated by this clip of Joseph Smarr of Plaxo presenting on friends-list portability and who owns the data in social networks.

For my part, what I took away from this and other discussions is that although on the surface moving data from one social network to another is no different from copying a business card into Outlook, people’s expectations make it different. People don’t (yet) expect the data they enter in one site to suddenly appear in another. But they do expect to be able to easily find their friends within a new network. Google’s Social Graph API will make it easier – but there will be a price, as Tim O’Reilly points out:

“Google’s Social Graph API… will definitively end “security by obscurity” regarding people and their relationships, as well as opening up the social graph to “rel=me” spammers. The counter-argument is that all this data is available anyway, and that by making it more visible, we raise people’s awareness and ultimately their behavior.”

Tied to all of this, of course, is the rise of OpenID, the open and decentralized identity system, and OAuth, an open protocol to allow secure API authentication between applications. Both of which appear to be central to most people’s plans for the coming year.

So what were the other highlights? For me, I’m really excited by Tom Coates and Rabble’s latest Yahoo! project, Fire Eagle, which allows you to share your location with friends, other websites or services.

You can think of Fire Eagle as a location brokerage service. Via open APIs, other people can write applications that update Fire Eagle with your location, so that other applications can then use it. So, for example, someone might write an application that runs on your mobile and triangulates your position based on the location of the transmitters before sending the data to Fire Eagle. You could then run an application on your phone that lets you know if your friends are nearby, what restaurants are in your area or where the nearest train or tube station is.

Obviously, what Fire Eagle also provides is plenty of security, so you can control which people and which applications have access to your location data. I can’t wait to see what people end up doing with Fire Eagle and I’m hoping that we can come up with some interesting applications too.

Finally, XMPP, which I have to say caught me a bit by surprise. If you’ve not come across it before, XMPP is a messaging and presence protocol developed by the Jabber community and now used by Google Talk, Jaiku and Apple’s iChat amongst others (with a lot more clients on the way if last weekend was anything to go by).

XMPP is a much more efficient protocol than HTTP for two-way messaging because your application doesn’t need to poll the server periodically – instead the server sends a signal via XMPP when new information is published. And there’s no need to limit that communication to people – XMPP can also be used for what is essentially machine-to-machine instant messaging, giving you real-time communication between machines.

So, based on last weekend’s Foo Camp, it looks like XMPP, OpenID and OAuth are all going to be huge in 2008. Google’s Social Graph API and related technologies (FOAF and XFN) will result in some headaches while people’s understanding and expectations settle down, but it will be worth it as we move towards a world of data portability.”

Microformat injection

Ben Smith and I are attending Social Graph Foo Camp this weekend – this is his post from the BBC development blog, which we’re setting free here on my blog.

While this isn’t a topic that anyone is talking about here, I was struck by a throwaway comment by Brad Fitzpatrick about the possibility of Microformat Injection. Everyone knows about XSS, and only the worst developers leave themselves exposed by allowing JavaScript through form submissions. However, allowing a subset of HTML through in requests, to then be published in, say, profile pages, is quite standard.

I’m not sure if it would ever be particularly dangerous, but Microformat Injection could be used to insert rel="me" tags into pages, as <a> tags are quite regularly allowed through. Now, I’m not very knowledgeable about microformats (RDFa seems shinier) so I’ll leave it up to you to think up some interesting Microformat Injection exploits. Please comment if you think of any!
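By way of illustration, here’s a minimal sketch of one defence, assuming user-supplied HTML is filtered server-side in Python: drop the rel attribute from any <a> tag that’s allowed through, so rel="me" (and other XFN values) can’t be injected. A real sanitiser should whitelist attributes rather than just strip this one:

```python
# Sketch: strip rel attributes from user-supplied HTML before publishing.
from html.parser import HTMLParser

class RelStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Keep every attribute except rel, then re-emit the tag.
        kept = [(k, v) for k, v in attrs if k.lower() != "rel"]
        attr_str = "".join(' %s="%s"' % (k, v) for k, v in kept)
        self.out.append("<%s%s>" % (tag, attr_str))

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

def strip_rel(html):
    p = RelStripper()
    p.feed(html)
    return "".join(p.out)

print(strip_rel('<a href="http://example.org/" rel="me">me</a>'))
# → <a href="http://example.org/">me</a>
```

The same pattern extends to any microformat class or rel value you don’t want arriving via a form submission.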