Linking to the Linked Data cloud

I’ve been doing a few talks of late – most recently at the somewhat confused OKCon (Open Knowledge) Conference. The audience was extremely diverse, so I tried not only to talk about what we’ve done but also to introduce the concept of Linked Data and explain what it is.

Linked Data is a grassroots project to use web technologies to expose data on the web. For many people it is synonymous with the semantic web – and while this isn’t quite true, it does, as far as I’m concerned, represent a very large subset of the semantic web project. Interestingly, it can also be thought of as ‘the web done right’ – the web as it was originally designed to be.

But what is it?

Well, it can be described with four simple rules.

1. Use URIs to identify things, not only documents

The web was designed to be a web of things, with documents making assertions about those real-world things. Just as a passport or driving licence, in the real world, provides an identifier for a person and asserts who they are, so URIs provide identifiers for people, concepts and things on the web.

Minting URIs for things rather than pages helps make the web more human literate because it means we are identifying those things that people care about.

2. Use HTTP URIs – they are globally unique and anyone can dereference them

The beauty of the web is its ubiquitous nature – it is decentralised and able to function on any platform. This is because of TimBL’s key invention: the HTTP URI.

URIs are globally unique, open to all and decentralised. Don’t go using DOIs or any other identifier scheme – on the web all you need is an HTTP URI.

3. Provide useful information [in RDF] when someone looks up a URI

And obviously you need to provide some information at that URI. When people dereference it you need to give them some data – ideally as RDF as well as HTML. Providing the data as RDF means that machines can process that information on people’s behalf, making it more useful.

4. Include links to other URIs to let people discover related information

And of course you also need to provide links to other resources so people can continue their journey, and that means contextual links to other resources elsewhere on the web, not just your site.

And that’s it.

Pretty simple really and, other than the RDF bit, I would argue that these principles should be followed for any website – they just make sense.
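The client side of these rules can be sketched in a few lines of Python. This is only an illustration: the programme URI below is a made-up identifier, and a real client would go on to parse the RDF it gets back.

```python
# A minimal sketch of rules 2 and 3: dereference an HTTP URI,
# asking the server for RDF rather than HTML.
from urllib.request import Request

def linked_data_request(uri: str) -> Request:
    """Build a request for the RDF representation of a thing's URI."""
    return Request(uri, headers={"Accept": "application/rdf+xml"})

# An illustrative (not real) programme URI; passing this request to
# urllib.request.urlopen() would fetch the data itself.
req = linked_data_request("http://www.bbc.co.uk/programmes/b0000001")
print(req.get_header("Accept"))  # application/rdf+xml
```

Rule 4 then applies to the response: the RDF you get back contains further URIs, which you can dereference in exactly the same way.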

But why?

Before the Web people still networked their computers – but to access those computers you needed to know about the network, the routing and the computers themselves.

For those in their late 30s, you’ll probably remember the film War Games. Because it was made before the Web had been invented, David and Jennifer, the two ‘hackers’, had to find and connect directly to each computer; they had to know about the computer’s location.

Phoning up another computer
War Games, 1983

The joy of the web is that it adds a level of abstraction – freeing you from the networking, routing and server location – it lets you focus on the document.

Following the principles of Linked Data allows us to add a further level of abstraction – freeing us from the document and letting us focus on the things, people and stuff that matters to people. It helps us design a system that is more human literate, and more useful.

This is possible because we are identifying real-world things and the relationships between them.

Free information from data silos

Of course there are other ways of achieving this – lots of sites now provide APIs, which is good, just not great. Each of those APIs tends to be proprietary and specific to its site, so there’s an overhead every time someone wants to add a new data source.

These APIs give you access to the silo – but the silo still remains. Using RDF and Linked Data means there is a generic method to access data on the web.

What are we doing at the BBC?

First up it’s worth pointing out the obvious: the BBC is a big place and so it would be wrong to assume that everything we’re doing online is following these principles. But there’s quite a lot of stuff going on that does.

We now have the BBC’s programme support, music discovery and, soon, natural history content all adopting these principles – in other words, persistent HTTP URIs that can be dereferenced to HTML, RDF, JSON and mobile views for programmes, artists, species and habitats.

We want HTTP URIs for every concept, not just every HTML webpage – an individual page is made up of multiple resources, multiple concepts. So for example an artist page transcludes the resources ‘/:artist/news’ and ‘/:artist/reviews’ – but those resources also have their own URIs. If they didn’t, they wouldn’t be on the web.

Also, because there’s only one web, we have only one URI for a resource but a number of different representations of that resource. So the URI for the programme ‘Nature’s Great Events’ is:

Through content negotiation we are able to serve an HTML, RDF or mobile document to represent that programme.
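Server-side, that negotiation can be as simple as inspecting the Accept header and picking a representation. A deliberately simplified sketch – real servers also weigh the q-values clients send:

```python
def negotiate(accept: str) -> str:
    """Map a client's Accept header to one representation of the resource."""
    known = {
        "application/rdf+xml": "rdf",
        "application/json": "json",
        "text/html": "html",
    }
    for media_type in accept.split(","):
        media_type = media_type.split(";")[0].strip()  # ignore q-values
        if media_type in known:
            return known[media_type]
    return "html"  # default representation for anything else

print(negotiate("application/rdf+xml"))                    # rdf
print(negotiate("text/html,application/xhtml+xml;q=0.9"))  # html
```

One URI, many representations: the document served varies with the request, but the identifier for the programme never does.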

We then need to link all of this stuff up within the BBC. So that, for example, you can go from a tracklist on an episode page of Jo Whiley on the Radio 1 site to the U2 artist page and then from there to all episodes of Chris Evans which have played U2. Or from an episode of Nature’s Great Events to the page about Brown Bears to all BBC TV programmes about Brown Bears.
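Those journeys are just link traversals. Treating each statement as a subject–predicate–object triple, finding every episode that played an artist becomes a simple lookup. The URIs and the ‘plays’ predicate here are illustrative, not real BBC identifiers:

```python
# A toy in-memory triple store: (subject, predicate, object).
triples = [
    ("/programmes/jo_whiley_ep1", "plays", "/music/artists/u2"),
    ("/programmes/chris_evans_ep4", "plays", "/music/artists/u2"),
    ("/programmes/chris_evans_ep7", "plays", "/music/artists/elbow"),
]

def episodes_playing(artist: str) -> list[str]:
    """Follow the 'plays' links backwards from an artist to episodes."""
    return [s for s, p, o in triples if p == "plays" and o == artist]

print(episodes_playing("/music/artists/u2"))
# → ['/programmes/jo_whiley_ep1', '/programmes/chris_evans_ep4']
```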

But obviously the BBC is only one corner of the web. So we also need to link with the rest of the web.

Because we’re now thinking at web scale, we’ve started to think about the web as a CMS.

Where a URI already exists to represent a concept we use it rather than minting our own. The new music site transcludes and links back to Wikipedia to provide biographical information about an artist – rather than minting our own URI for artist biographies we use Wikipedia’s.

Likewise, when we want to add music metadata to the music site we use MusicBrainz.

Making computers human literate – WWW@20

Last Friday saw the 20th anniversary of the Web – well, if not the web as such, then TimBL’s proposal for an information management system. To celebrate the occasion CERN hosted a celebration, which I was honoured to be invited to speak at – by the big man no less! I’ll write up some more about the event itself, but in the meantime here are my slides.

I’ve also posted some photos of the event up on Flickr.

2008 Year-End Wrap-Up

It’s become the tradition at this time of year for the cool kids to round-up the year with the most popular blog postings of the year; so I thought I would do the same.

My most popular photo on Flickr. Some rights reserved.

Here then are the most popular posts from the last 12 months (most popular first):

Web design 2.0 – it’s all about the resource and its URL — thanks to Simon Willison this is my most popular post of all time and of 2008.

QR codes for BBC programmes and some other stuff — a lunchtime of hacking from the wonderful Duncan Robertson gave us QR Codes for every BBC programme.

When agile projects become mini waterfalls — I have no idea why this is so popular, but there you go.

Interesting BBC data to hack with — the release of XML views of Radio AOD data, unsurprisingly, proved popular.

The all new BBC music site where programmes meet music and the semantic web — the first hint at what the BBC will be able to do by caring about its URLs, Linked Data and Domain Driven Design. If you put everything in the right place you can join it all up and create a coherent user experience. 

Osmotic communication – keeping the whole company in touch — I still think this is a good idea.

Find and Play BBC Programmes — announcing the embedded media player on programme pages — meaning all BBC programme support sites now include the latest TV and Radio media.

iPhoto photos not appearing in Front Row — how to fix iPhoto’s album.xml file when you migrate from Google’s Picasa to iPhoto. The fact this is still proving popular implies Apple still haven’t fixed the bug. 

Highly connected graphs: Opening BBC data — in response to Mike Butcher’s post on TechCrunch requesting the BBC open up their data and provide APIs I thought it worth pointing out there’s already some good stuff going on.

Ladies and gentlemen I give you BBC Programmes — the launch of a page for every programme the BBC broadcasts.

UGC its rude, its wrong and it misses the point — it’s still rude, and those who think of amateur publishers in these terms will continue to miss opportunities.

So there you have it. It’s been a good year and, as I’ve discussed previously, I’m very proud of what we’ve achieved – as reflected in many of these posts and in the fact that the Guardian also covered the work, which had the added bonus that my parents finally have some idea of what I do for a living.

Permanent web IDs or making good web 2.0 citizens

These are the slides for a presentation I gave a little while ago in Broadcasting House at a gathering of radio types – both BBC and commercial radio – as part of James Cridland’s mission to “agree on technology, compete on content”.

The presentation is based on the thinking outlined in my previous post: web design 2.0 it’s all about the resource and its URL.

Media companies should embrace the generative nature of the web

Generativity – the ability to remix different pieces of the web or deploy new code without gatekeepers, so that anyone can repurpose, remix or reuse the original content or service for a different purpose – is going to be at the heart of successful media companies.

Depth of field (Per Foreby)

As Jonathan Zittrain points out in The Future of the Internet (and how to stop it) the web’s success is largely because it is a generative platform.

The Internet is also a generative system to its very core, as is each and every layer built upon it. This means that anyone can build upon the work of those who went before them – which is why the Internet architecture, to this day, is still delivering decentralised innovation.

This is true at a technological level, for example, XMPP, OAuth and OpenID are all technologies that have been invented because the technology layers upon which they are built are open, adaptable and easy for others to reuse and master. It is also true at the content level – Wikipedia is only possible because it is built as a true web citizen, likewise blogging platforms and services such as MusicBrainz – these services allow anyone to create or modify content without the need for strict rules and controls.

But what has this got to do with the success or otherwise of any media company or any content publisher? After all just because the underlying technology stack is generative doesn’t mean that what you build must be generative. There are, after all, plenty of successful walled gardens and tethered appliances out there. The answer, in part, depends on what you believe the future of the Web will look like.

Tim Berners-Lee presents a pretty compelling view in his article on The Giant Global Graph. In it he explains how we have moved from a network of computers, through the Internet, to a web of documents – and are now seeing a migration to a ‘web of concepts’.

[The Internet] made life simpler and more powerful. It made it simpler because of not having to navigate phone lines from one computer to the next, you could write programs as though the net were just one big cloud, where messages went in at your computer and came out at the destination one. The realization was, “It isn’t the cables, it is the computers which are interesting”. The Net was designed to allow the computers to be seen without having to see the cables. […]

The WWW increases the power we have as users again. The realization was “It isn’t the computers, but the documents which are interesting”. Now you could browse around a sea of documents without having to worry about which computer they were stored on. Simpler, more powerful. Obvious, really. […]

Now, people are making another mental move. There is realization now, “It’s not the documents, it is the things they are about which are important”. Obvious, really.

If you believe this – that there is a move from a web of documents to a web of concepts – then you can start to see why media companies will need to start publishing data the right way: publishing it so that they, and others, can help people find the things they are interested in. How does this happen? For starters we need a mechanism by which we can identify things, and the relationships between them, at a level above that of the document. And that’s just what the semantic web technologies are for – they give different organisations a common way of describing the relationships between things. For example, the Programmes Ontology allows any media company to describe the nature of a programme; the Music Ontology any artist, release or label.
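As a sketch of what such a shared description looks like, here is a fragment of Turtle built with the Programmes Ontology’s namespace; the episode slug and title are placeholders, not real BBC data:

```python
def programme_turtle(slug: str, title: str) -> str:
    """Emit a minimal RDF (Turtle) description of an episode."""
    return f"""@prefix po: <http://purl.org/ontology/po/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .

<http://www.bbc.co.uk/programmes/{slug}#programme>
    a po:Episode ;
    dc:title "{title}" .
"""

print(programme_turtle("b0000001", "An Example Episode"))
```

Because any media company can use the same ontology, an aggregator that understands po:Episode can join programmes from all of them without reading per-site API documentation.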

This implies a couple of different, but related, things. Firstly, it highlights the importance of links. Links are an expression of a person’s interests: I choose what to link to from this blog – which words, which subjects to link from and where to – and my choice of links provides you with a view of how I see the subject beyond what I write here. The links give you insight into who I trust and what I read. And of course they allow others to aggregate my content around those subjects.

It also implies that we need a common way of doing things – a way that allows others to build with, and on top of, the original publisher’s content. This isn’t about giving up your rights over your content; rather it is about letting it be connected to content from peer sites, about joining contextually relevant information from other sites and other applications. As Tim Berners-Lee points out, this is similar to the transition we had to make in going from interconnected computers to the Web.

People running Internet systems had to let their computer be used for forwarding other people’s packets, and connecting new applications they had no control over. People making web sites sometimes tried to legally prevent others from linking into the site, as they wanted complete control of the user experience, and they would not link out as they did not want people to escape. Until after a few months they realized how the web works. And the re-use kicked in. And the payoff started blowing people’s minds.

Because the Internet is a generative system it means it has a different philosophy from most other data discovery systems and APIs (including some that are built with Internet technologies), as Ed Summers explains:

…which all differ in their implementation details and require you to digest their API documentation before you can do anything useful. Contrast this with the Web of Data which uses the ubiquitous technologies of URIs and HTTP plus the secret sauce of the RDF triple.

They also often require the owner of the service or API to give permission for third parties to use those services, often mediated via API keys. This is bad: had the Web, or the Internet before it, adopted a similar approach rather than the generative one it did, we would not have seen the level of innovation we have, and as a result we would not have had the financial, social and political benefits we have derived from it.

Of course there are plenty of examples of people working with the web of documents – everything from 800lb gorillas like Google through to sites like After Our Time and Speechification – both of which provide users with a new and distinctive service while also helping to drive traffic to, and raise brand awareness of, the BBC. Just think what would be possible if transcripts, permanent audio and research notes were also made available, not only as HTML but also as RDF, joining content inside and outside the BBC to create a system which, in Zittrain’s words, has “a system’s capacity to produce unanticipated change through unfiltered contributions from broad and varied audiences.”