Category URL

Linking bbc.co.uk to the Linked Data cloud

I’ve been doing a few talks recently – most recently at the somewhat confused OKCon (Open Knowledge) Conference. The audience was extremely diverse and so I tried to not only talk about what we’ve done but also introduce the concept of Linked Data and explain what it is.

Linked Data is a grassroots project to use web technologies to expose data on the web. It is for many people  synonymous with the semantic web – and while this isn’t quite true. It does, as far as I’m concerned, represent a very large subset of the semantic web project. Interestingly, it can also be thought of as the ‘the web done right’, the web as it was originally designed to be.

But what is it?

Well it can be described with 4 simple rules.

1. Use URIs to identify things not only documents

The web was designed to be a web of things with documents making assertions about those real-world things. Just as a passport or driving license, in the real world, can be thought of as providing an identifier for a person making an assertion about who they are, so URIs can be thought of as providing identifiers for people, concepts or things on the web.

Minting URIs for things rather than pages helps make the web more human literate because it means we are identifying those things that people care about.

2. Use HTTP URIs – they are globally unique and anyone can dereference them

The beauty of the web is its ubiquitous nature – it is decentralised and able to function on any platform. This is because of TimBL’s key invention the HTTP URI.

URI’s are globally unique, open to all and decentralised. Don’t go using DOI or any other identifier – on the web all you need is an HTTP URI.

3. Provide useful information [in RDF] when someone looks up a URI

And obviously you need to provide some information at that URI. When people dereference it you need to give them some data – ideally as RDF as well as HTML. Providing the data as RDF means that machines can process that information for people to use. Making it more useful.

4. Include links to other URIs to let people discover related information

And of course you also need to provide links to other resources so people can continue their journey, and that means contextual links to other resources elsewhere on the web, not just your site.

And that’s it.

Pretty simple really and other than the RDF bit, I would argue that these principles should be followed for any website – they just make sense.

But why?

Before the Web people still networked their computers – but to access those computers you needed to know about the network, the routing and the computers themselves.

For those in their late 30s you’ll probably remember the film War Games – because this was written before the Web had been invented David and Jennifer the two ‘hackers’ had to find and connect directly to each computer; they had to know about the computer’s location.

Phoning up another computer

War Games, 1983

The joy of the web is that it adds a level of abstraction – freeing you from the networking, routing and server location – it lets you focus on the document.

Following the principles of Linked Data allows us to add a further level of abstraction – freeing us from the document and letting us focus on the things, people and stuff that matters to people. It helps us design a system that is more human literate, and more useful.

This is possible because we are identifying real world stuff and the relationships between them.

Free information from data silos

Of course there are other ways of achieving this – lots of sites now provide APIs which is good just not great. Each of those APIs tend to be proprietary and specific to the site. As a result there’s an overhead every time someone wants to add that data source.

These APIs give you access to the silo – but the silo still remains. Using RDF and Linked Data means there is a generic method to access data on the web.

What are we doing at the BBC?

First up it’s worth pointing out the obvious: the BBC is a big place and so it would be wrong to assume that everything we’re doing online is following these principles. But there’s quite a lot of stuff going on that does.

We do have – BBC’s programme support, music discovery and, soon, natural history content all adopting these principles. In other words persistent HTTP URIs that can be dereferenced to HTML, RDF, JSON and mobile views for programmes, artists, species and habitats.

We want HTTP URIs for every concept, not HTML webpage – an individual page is made up of multiple resource, multiple concepts. So for example an artist page transcludes the resource ‘/:artist/news’ and ‘/:artist/reviews’ – but those resources also have their own URIs. If they didn’t they wouldn’t be on the web.

Also because there’s only one web we only have one URI for a resource but a number of different representation for that resource. So the URI for the proggramme ‘Nature’s Great Events’ is:

bbc.co.uk/programmes/b00ht655#programme

Through content negotiation we will able to server an HTML, RDF, or mobile document to represent that programme.

We then need to link all of this stuff up within the BBC. So that, for example, you can go from a tracklist on an episode page of Jo Whiley on the Radio 1 site to the U2 artist page and then from there to all episodes of Chris Evans which have played U2. Or from an episode of Nature’s Great Events to the page about Brown Bears to all BBC TV programmes about Brown Bears.

But obviously the BBC is only one corner of the web. So we also need to link with the rest of the web.

Because we’re now thinking on a webscale we’ve started to think about the web as a CMS.

Where URIs already exist to represent that concept we are using it rather than minting our own. The new music site transcludes and links back to Wikipedia to provide biographical information about an artist. Rather than minting our own URI for artist biographic info we use Wikipedia’s.

Likewise when we want to add music metadata to the music site we add MusicBrainz.

What does the history of the web tell us about its future?

Following my invitation to speak at the WWW@20 celebrations [my bit starts about 133 minutes into the video] – this is my attempt to squash the most interesting bits into a somewhat coherent 15 minute presentation.

20 years ago Tim Berners-Lee was working, as a computer scientist, at CERN. What he noticed was that, much like the rest of the world, sharing information between research groups was incredibly difficult. Everyone had their own document management solution, running on their own flavour of hardware over different protocols.  His solution to the problem was a lightweight method of linking up existing (and new) stuff over IP – a hypertext solution – which he dubbed the World Wide Web – and documented in a memo “Information Management: A Proposal“.

Then for a year or so nothing happened. Nothing happened for a number of reasons, including the fact that IP, and the ARPANET before that, was popular in America but less so in Europe. Indeed senior managers at CERN had recently sent out a memo to all department heads reminding them that IP wasn’t a supported protocol – people were being told not to use it!

Also because CERN was full of engineers everyone thought they could build their own solution, do better than what was already there – no one wanted to play together. And of course because CERN was there to do particle physics not information management.

Then TimBL got his hands on a NeXT Cube – officially he was evaluating the machine not building a web server – but, with the support of his manager, that’s what he did — build the first web server and client. There then ensued a period of negotiation to get the idea out freely, for everyone to use, which happened in 1993. This coincided, more or less, with the University of Minnesota’s decision to charge a license fee for Gopher. Then the web took off especially in the US where IP was already popular.

The first webserver

The first webserver

The beauty of TimBL’s proposal was it’s simplicity – it was designed to work on any platform and importantly with the existing technology. The team knew that to make it work it had to be as easy as possible. He only wanted people to do one thing, that one thing was to give their resources identifiers – links – URIs; so information could be linked and discovered.

This is then is the key invention – the URL.

To make this work URLs were designed to work with existing protocols, in particular it needed to work with FTP and Gopher. That’s why there’s a colon in the URL — so that URLs can be given for stuff that’s already available via other protocols. As an aside, TimBL’s said his biggest mistake was the inclusion of // in the URL — the idea was that one slash meant the resource is on the local machine and two somewhere else on the web, but because everyone used http://foo.bar it means the second / is redundant. I love that this is TimBL’s biggest mistake.

He also implemented a quick tactical solution to get things up and running and demonstrate what he was talking about — HTML. HTML was originally just one of a number of supported doctypes – it wasn’t intended to be the doctype but HTML took off because it was easy. Apparently the plan was to implement a mark-up language that worked a bit like the NeXT application builder. But they didn’t get round to it before Mosaic came along with the first browser (TimBL’s first client was a browser-editor) and then it was all too late. And we’ve been left with something so ugly I doubt even it’s parents love it.

The curious thing, however, is that if you read the original memo — despite its simplicity — it’s clear that we’re still implementing it, we’re still working on the the original spec. Its just that we’ve tended to forget what it said or decided to get sidetracked for a while with some other stuff. So forget about Web 2.0.

For example, the original Web was read-write. Not only that but it used style sheets and a WYSIWYG editing interface — no tags, no mark-up. They didn’t think anyone would want to edit the raw mark-up.

The first web site was read and write

The first web site was read and write

You can also see that the URL’s hidden, you get to it via a property dialog.

This is because the whole point of the web is that it provides a level of abstraction, allowing you to forget about the infrastructure, the servers and the routing. You only needed to worry about the document. For those who remember the film War Games — you will remember that they had to ‘phone up individual computers — they needed this networking information to access the computer, they needed to know its location before they could use it. The beauty of the Web and the URL is that the location shouldn’t matter to the end user.

URIs are there to provide persistent identifiers across the web — they’re not a function of ownership, branding, look and feel, platform or anything else for that matter.

The original team described CERN’s IT ecosystem as a zoo because there were so many different flavours of hardware, different operating systems and protocols in use. The purpose of the web was to be ubiquitous, to work on any machine, open to everyone. It was designed to work no matter what machine or operating system you’re running. This is, of course, achieved by having one identifier, one HTTP URI and defererence that to the appropriate document based on the capacities of that machine.

We should be adopting the same approach today when it comes to delivery to mobile, IPTV, connected devices etc. — we should have one URI for a resource and allow the client to request the document it needs. As Tim intended. The technology is there to do this — we just don’t using it very often.

The original memo also talked about linking people, documents, things and concepts, and data. But we are only now getting around to building it. Through technologies such as OpenID and FOAF we can give people identifiers on the web and describe their social graph, the relationships between those people. And through RDF we can publish information so that machines can process it, describing the nature of and the relationship between the different nodes of data.

Information Management: A Proposal

Information Management: A Proposal by Tim Berners-Lee

The original memo described, and the original server supported, link typing so that you could describe not only real word things but also the nature of the relationship between those things. Like RDF and HTML 5 now does, 20 years later. This focus on data is all a good idea because it lets you treat the web like a giant database. Making computers human literate by linking up bits of data so that the tools, devices and apps connected to the web can do more of the work for you, making it easier to find the things that interest you.

The semantic web project – and TimBL’s original memo – is all about helping people access data in a standard fashion so that we can add another level of abstraction – letting people focus on the things that matter to them. This is what, I believe, we should be striving for for the web’s future because I agree with Dan Brickley, to understand the future of the web you first need to understand it’s origins.

Don’t think about HTML documents – think about the things and concepts that matter to people and give each it’s own identifier, it’s own URI and then put in place the technology to dereference that URI to the document appropriate to the device. Whether that be a desktop PC, a mobile device, an IPTV or third party app.

Making computers human literate WWW@20

Last Friday saw the 20th anniversary of the Web — well if not the web as such then TimBL’s proposal for an information management system. To celebrate the occasision CERN hosted a celebration which I was honoured to be invited to speak at, by the big man no less! I’ll write up some more about the event itself, but in the meantime here are my slides.

I’ve also posted some photos of the event up on Flickr.

Building coherence at bbc.co.uk

Michael and I have written an article for the latest addition [pdf] of Talis’s magazine Nodalities, reproduced below. If you are interested in the process behind this then I can’t recommend enough Michael’s awesome post ”How we make website“ over on the BBC’s Radio Lab blog.

—-

Telling (non-linear) stories

For the past 86 years the BBC has plied its trade as a storytelling organisation. In the world of linear broadcasting we’ve even gotten very good at it. Guiding the audience through complex news story lines, explaining the natural world and, interleaved narrative arcs and the plotlines of  drama  has become our forte. But storytelling in a linear world is different from storytelling in the non-linear, hypertext world of the web.

Joining www.bbc.co.uk to the rest of the web (of course as @gkob points out those should be dbpedia URIs)

With the exception of BBC News Online (news.bbc.co.uk) the online world has often been seen as a supporting adjunct to the linear broadcast world. Over the years we’ve commissioned and built sites to provide online support for programmes; but we’ve too often taken our linear storytelling expertise and attempted to replicate the same techniques on the web – with mixed success. Unlike linear broadcast storylines the web doesn’t provide people with a predicted and controlled linear journey. Instead we dip in and out of any given website — following different journeys — to find the information we want at that time.

Many of our programme support sites have been commissioned and developed in isolation. So you see an Archers site and an Eastenders site and a Top Gear site which are internally coherent but which fail to link up other than via editorially determined cross promotions. Want to see who presents Top Gear? No problem, we can do that. Want to see what else those people present? Sorry, can’t do that. By developing self-contained microsites the BBC has produced some good stuff but it has also been unable to reach its full potential because it hasn’t managed to join up all of its resources. By failing to link up the content (on both a data and a user experience level) the stuff we publish can never becomes greater than the sum of its parts. Without these links we can’t make bbc.co.uk a coherent experience. As a user, it’s very difficult to find everything the BBC has published about any given subject, nor can you easily navigate across BBC domains following a particular semantic thread. For example, you can’t yet navigate from a page about a musician to a page with all the programmes that have played that artist.

So how do you tell stories on a web scale? We could stick with the easy option and try to control ‘user journeys’ across the site. Provide links to where we think the user should go next. But that’s little better than those flip a dice, go to page 30 dungeons and dragons books we all had as kids. We had to recognise that non-linear storytelling puts the narrative arc into the hands of the user. What to read, what to click, where to go next is really up to you. So storylines split and merge, meta-narratives emerge and fracture; ‘user journeys’ slip out of (editorial) control.

All of this comes from the power of the link – back to basics. But we can only provide precisely targeted links at the user experience level if those links exist at a data level. And that’s the difficult part. The organic growth of our sites has been mirrored in the organic growth of our content and data management systems. We currently have a range of systems across the business for managing different bits of content throughout the production chain. And like our public facing sites none of these speak the same language or share the same identifiers. A typical episode of Top Gear might have 6 separate identifiers on it’s way from scriptwriter to airwaves to archive. Once you’ve solved this problem you hit the problem of multiple identifiers for James May and once you’ve got one canonical James May you’re back to the problem of multiple identifiers for all the other programmes he’s presented…

Solving these problems makes for a more linked, more coherent bbc.co.uk. But an internally coherent bbc.co.uk isn’t enough. bbc.co.uk needs to be weaved into the rest of the web, not merely on the web. It needs to be linked in to all those other Top Gear / James May pages out there… Luckily the tips, tricks and techniques pioneered by the Linked Data community give us some clues here.

Add into this mix the fact that there’s some data the BBC can never hope to provide. So we know when an artist is played on radio or TV. But we can’t hope to know when they were born, or where they were born, or which bands they’ve been in, or who they’re married to etc. If we want to tell stories around music all this is important data. And we can only get it by tapping into the collective knowledge of the web.

BBC in the web of data

I’d like to claim that when we set out to develop /programmes we had the warm embrace of the semantic web in mind. But that would be a lie. We were however building on very similar philosophical foundations.

In the work leading up to bbc.co.uk/programmes we were all too aware of the importance of persistent web identifiers, permanent URIs and the importance of links as a way to build meaning. To achieve all this we broke with BBC tradition by designing from the domain model up rather than the interface down. The domain model provided us with a set of objects (brands, series, episodes, versions, ondemands, broadcasts etc) and their sometimes tangled interrelationships.

We were also convinced that the value in programme websites lay not in the implicit metadata of the domain model but rather in the way this domain model overlapped and intersected with other domains. As ever the links are more important than the nodes because that’s where the context lives: programmes:segment <features> music:track, programmes:segment <features> food:recipe etc. In this way we could weave new ‘user journeys’ into and out of /programmes, into and out of bbc.co.uk. From archive episodes no longer available online, to a recipe page, to a chef, to another recipe and back to a recent episode. Using well targeted content specific links we could not only escape the dead end content silos that characterised bbc.co.uk but point users back to programmes that would hopefully inform, educate and of course entertain.

Finally we believed in the merits of opening our data and building on top of other people’s open data. When we looked to rebuild bbc.co.uk/music we looked at a number of commercial providers of music metadata. They all did a similar job to MusicBrainz (musicbrainz.org) – similar models, similar data quality etc. But choosing to go with a commercial provider would have precluded our ability to provide any kind of machine friendly (API if you must) views. The decision to publish JSON or vanilla XML or RDF would have been a decision to give the 3rd party business model away. So we went with the open alternative – an open, public domain provider, one that is more in keeping with  our public service remit and one that represents better value for money for the license fee payer – which has to be a lesson to someone.

Without ever explicitly talking RDF we’d built a site that complied with Tim Berners-Lee’s four principles for Linked Data:

  1. Use URIs as names for things. – CHECK
  2. Use HTTP URIs so that people can look up those names. – CHECK
  3. When someone looks up a URI, provide useful information. – Well, if we’re only talking HTML, RSS, ATOM, JSON etc. CHECK
  4. Include links to other URIs. so that they can discover more things. – Again if we’re talking HTML only CHECK

By keeping everything in its right place we’d also built a sane, maintainable, scalable, accessible site that search engines love and could be easily evolved to add new features and functionality. So to anyone considering how best to build websites we’d recommend you throw out the Photoshop and embrace Domain Driven Design and the Linked Data approach every time. Even if you never intend to publish RDF it just works.

Around this time we met by chance with some people from the Linking Open Data community and the two worlds collided. Obviously TBL wasn’t talking only HTML in the last 2 principles but aside from that the parallels were striking. We set about converting our programmes domain model into an RDF ontology which we’ve since published under a Creative Commons License (www.bbc.co.uk/ontologies/programmes/). Which took one person about a week. The trick here isn’t the RDF mapping – it’s having a well thought through and well expressed domain model. And if you’re serious about building web sites that’s something you need anyway. Using this ontology we began to add RDF views to /programmes (e.g. www.bbc.co.uk/programmes/b00f91wz.rdf). Again the work needed was minimal.

So for those considering the Linked Data approach we’d say that 95% of the work is work you should be doing just to build for the (non-semantic) web. Get the fundamentals right and the leap to the Semantic Web is really more of a hop.

Why bother with RDF?

For all the pages we’ve published we’ve only had a limited success at making this information available for others to use, to hack with and to build new services with. While we’ve not done a very good job of making bbc.co.uk a coherent experience for people the situation is worse for machines.

It is our belief that rather than publishing proprietary APIs it is better to use the ubiquitous technologies of URIs and HTTP. This approach supports the generative nature of the Web, making it easy for third parties to build with BBC metadata without learning BBC specific APIs and at the same time providing the BBC and its users with immediate benefits.

Services like Flickr, Twitter and the like have in many, many ways followed the same principles we adopted for programmes and music — or if they didn’t then the end results look pretty similar — they are wonderful services. However, if as a third party developer you want to deal with the semantics, accessing the data via the Giant Global Graph to find everything about a certain person, place or topic and you wanted to include data from Flickr then you will need to deal with the specifics of Flickr. I suspect that it wouldn’t be that difficult for Flickr to add RDF representations – if they did then Flickr content would be part of a common way of doing things. We want BBC data to be part of a common way of doing things.

Our hope in making BBC data available as RDF is that we will make it as generative as possible – helping others to do interesting things with our data. The BBC has a public service remit, a remit that means it should look beyond its internal business needs to help create public value around useful technologies and around its content for others to benefit from. The longer term aim of this work is to not only expose BBC data but to ensure that it is contextually linked to the wider web. We have started along this path by linking to Wikipedia (DBpedia in the RDF view) and MusicBrainz from the artist pages but this could be extended for programmes and events.

Follow

Get every new post delivered to your Inbox.

Join 812 other followers