I’ve really been neglecting this blog – apologies, but my attention has been elsewhere. Anyway, while I get round to actually writing something, here’s a presentation I gave at the Online Information Conference recently.
The presentation is largely based upon the article Michael and I wrote for Nodalities this time last year.
As a child I loved Lego. I could let my imagination run riot, design and build cars, space stations, castles and airplanes.
My brother didn’t like Lego, preferring instead to play with Action Men and toy cars. These sorts of toys did nothing for me, and from the perspective of an adult I can understand why. I couldn’t modify them; I couldn’t create anything new. Perhaps I didn’t have a good enough imagination, because I needed to make my ideas real. I wanted to build things; I still do.
Then the most exciting thing happened. My dad bought a BBC micro.
Obviously computers such as the BBC Micro were in many, many ways different from today’s Macs and, if you must, PCs. They were several orders of magnitude less powerful than today’s computers but, importantly, they were designed to be programmed by the user – you were encouraged to do so. It was expected that that’s what you would do. So from a certain perspective they were more powerful.
BBC Micros didn’t come preloaded with word processors, spreadsheets and graphics editors, and they certainly weren’t WIMPs.
They also came with two thick manuals: one telling you how to set the computer up, the other how to program it.
This was all very exciting, I suddenly had something with which I could build incredibly complex things. I could, in theory at least, build something that was more complex than the planes, spaceships and cars which I modelled with Lego a few years before.
Like so many children of my age I cut my computing teeth on the BBC Micro. I learnt to program computers, and played a lot of games!
Unfortunately all was not well. You see, I wasn’t very good at programming my BBC Micro. I could never actually build the things I had pictured in my mind’s eye; I just wasn’t talented enough.
You see, Lego hit a sweet spot which those early computers on the one hand, and Action Man on the other, missed.
What Lego provided was reusable bits.
When Christmas or my birthdays came around I would start off by building everything suggested by the sets I was given. But I would then dismantle the models and reuse those bricks to build something new, whatever was in my head. By reusing bricks from lots of different sets I could build different models. The more sets I got given, the more things I could build.
Action men simply didn’t offer any of those opportunities, I couldn’t create anything new.
Early computers were certainly capable of providing a creative platform, but they lacked the reusable bricks; it was more like being given an infinite supply of clay. And clay is harder to reuse than bricks.
Today, with the online world we are in a similar place but with digital bits and bytes rather than moulded plastic bits and bricks.
The Web allows people to create their own stories – it allows people to follow their nose to create threads through the information about the things that interest them, commenting on and discussing it along the way. But the Web also allows developers to reuse previously published information within new, different contexts to tell new stories.
But only if we build it right.
Most Lego bricks are designed to allow you to stick one brick to another. But not all bricks can be stuck to all others. Some can only be put at the top – these are the tiles and pointy bricks to build your spires, turrets and roofs. These bricks are important, but they can only be used at the end because you can’t build on top of them.
The same is true of the Web – we need to start by building the reusable bits, then the walls and only then the towers and spires and twiddly bits.
But this can be difficult – the shiny towers are seductive and the draw to start with them can be strong, only to find that you then need to knock them down and start again when you want to reuse the bits inside.
We often don’t give ourselves the best opportunity to womble with what we’ve got – to reuse what others make, to reuse what we make ourselves. Or to let others outside our organisations build with our stuff. If you want to take these opportunities then publish your data the webby way.
On the web I reckon there’s only metadata and URIs – or perhaps there’s no metadata and only data. Either way, the metadata versus data/content distinction isn’t helpful.
Linked Data allows you to bind HTTP URIs to an object and to information about that object. This is powerful because it lets you talk about real world things — things like people, places and events — the things that people think about. Despite this I have numerous conversations, and have done for years, about what ‘metadata’ to use to describe a document. Typically what this really means is: “what keywords to use so that some technomagical solution can use that ‘metadata’ to personalise or recommend content”.
Beyond the obvious — keywords on their own are never going to achieve the sorts of solutions non-technical people imagine — it also forces an unhelpful schism. It makes people think of their content and their metadata as separate things, as if metadata is somehow outwith the content they are creating. The trouble is that one person’s data is another person’s metadata. Is the title of a story metadata or content? Is a news story content, or metadata about a real world event? The answer depends on your perspective.
It seems to me that a more useful way to think about things is to have URIs to identify things, and then have information/documents/data/metadata/whatever that makes assertions about those things. Sometimes those bits of information will be simple data points — for an album release they might include information about who performed or wrote the piece (linking, with appropriate predicates, to URIs identifying the person who did perform or write it) — while other bits might be more verbose: reviews of the album, or the lyrics. And then again some might be media things (recordings of the album etc.).
And of course because we’re talking about a graph of data, those documents making assertions about a thing can in turn also have metadata/data/documents which make assertions about them, for example, who wrote it, comments about it etc.
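To make the shape of this concrete, here’s a minimal sketch in Python. All the URIs and predicate names are invented for the example: there are assertions about an album, an assertion a review makes about that album, and assertions about the review in turn.

```python
# Everything is a set of (subject, predicate, object) assertions about URIs.
# All URIs and predicate names here are made up purely for illustration.
triples = {
    # assertions about the album itself
    ("http://example.org/album/1", "performedBy", "http://example.org/artist/u2"),
    ("http://example.org/album/1", "title", "An Example Album"),
    # a review document makes an assertion about the album...
    ("http://example.org/review/9", "reviews", "http://example.org/album/1"),
    # ...and other assertions can be made about the review in turn
    ("http://example.org/review/9", "writtenBy", "http://example.org/person/alice"),
}

def about(thing):
    """Return every assertion whose subject is the given URI."""
    return {(p, o) for s, p, o in triples if s == thing}

print(about("http://example.org/review/9"))
```

The point is that there is no privileged ‘metadata’ layer here: the review is data about the album, and the ‘writtenBy’ assertion is data about the review. It’s assertions all the way down.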
Imagine what might happen if a news website took this approach. You would mint a URI for the event (or reuse one that already existed) and then write news stories about it, each with their own URL, each making assertions about that event. It would create a news service that was truly native to the Web, rather than a facsimile of the printed press. Imagine then what it would be like if we could link up all the news stories on the web which also made assertions about that event. As a user of such a site, or set of sites, I could find everything about a given thing (a person, event or place).
Concepts and events are still social and technological artefacts, but they are designed to help interconnect descriptions of butterflies, documents (and data) about butterflies, and people with interest or expertise relating to butterflies.
In other words what matters is a way of identifying things, a way of interconnecting them and a way of describing them — subdividing those ways of describing them into ‘data’ and ‘metadata’ is unhelpful, or at the very least adds nothing useful.
It is, however, useful to separate our concept of something from our conception of it. As Steven Pinker puts it:
…if you look up William Shakespeare in a dictionary it says “English playwright, lived in the 17th century, wrote Romeo and Juliet and Hamlet, etc.” Is that what the name William Shakespeare means, and is that what the concept William Shakespeare is? That sounds plausible, but it turns out not to be true. If we were to learn that William Shakespeare didn’t write any of the plays attributed to him — let’s say that we learned he didn’t even live in Stratford, that there was a clerical error and he really lived in Warwick. He would still be William Shakespeare, and we wouldn’t posthumously dub the real author of Shakespeare’s plays William Shakespeare. We would just say we were mistaken about what we believed about William Shakespeare.
So what is the concept of William Shakespeare, the meaning of the word William Shakespeare? Basically, when Mr. and Mrs. Shakespeare christened their son William, and the name stuck, and then everyone who knew him, and then who knew someone else, who knew someone else, and passed it down to us — that unbroken chain of transmission of the name from the moment of first dubbing is what gives William Shakespeare its meaning. There’s a sense in which to have a concept necessarily means to be connected to the world through this chain of transmission of a name going back to the moment of first dubbing.
So while I don’t think it’s helpful to separate data from metadata, it is helpful to separate concept from conception.
Digital Revolution, a new BBC TV programme, was launched last Friday. Due to be broadcast next year, the programme will look back over the first 20 years of the web and consider what the future might hold. The show will consider how the web has changed society and the implications for things like security, privacy and the economy.
Unlike — well, probably every other TV programme I’ve ever come across — each programme will be influenced and debated on the web during its production. Some of the rushes and interviews will be made available on the web (under permissive terms) so that anyone can contribute to the debate, helping to shape the final programme.
Anyway… the presentations were very cool, and while I tweeted the best bits on the day I thought I would write up a short post summing it all up. You know, contributing to the debate and all that.
The thing that struck me most was the discussion and points made around the way in which the web has provided a platform for creativity, and the risks to its future because of governments’ failure to understand it (OK, the failure to understand it is my interpretation, not the view expressed by the speakers).
To misquote TimBL: the web should be like paper. Government should be able to prosecute if you misuse it, but it shouldn’t limit what you are able to do with it. When you buy paper you aren’t limited in what can be written or drawn on it, and, like paper, the Internet shouldn’t be set up in such a way as to constrain its use.
The reason this is important is that it helps to preserve the web’s generative nature. TimBL points out that people are creative; they simply need a platform for that creativity. If that platform is to be the Web then it needs to support everyone – anyone should be able to express that creativity – and that means it needs to be open.
As an aside, there was a discussion as to whether or not access to the Internet is a ‘human right’ — I’m not sure whether it is or not, but it’s worth considering whether, if everyone had access to the Web, it could be used to solve problems in the developing world. For example, by allowing communities to share information on how to dig wells and maintain irrigation systems, information on health care, and educational material generally. It is very easy for us in the West to think of the Web as synonymous with the content and services currently provided on it, and to ask whether they would be useful in developing countries. But the question really should be: if anyone, anywhere in the world were able to create and share information, what would they do with it? My hope would be that the services offered would reflect local needs — whether that be social networking in US colleges or water purification in East Africa.
Of course, being open and free for all to use doesn’t mean that everything on the web will be wonderful, or indeed legal; no more than paper ensures wonderful prose because it is open. Or as TimBL puts it:
Just because you can read everything out there doesn’t mean you should. If you found a piece of paper blowing in the wind you wouldn’t expect it to be edifying.
But what does open mean?
Personally I think that an open web is one that seeks to preserve its generative nature. But the discussion last Friday also focused on the implications for privacy and snooping.
Governments the world over, including, to our shame, the current UK Government, are seeking to limit the openness of the web; that is, rather than addressing the specific activities that happen on the web, they are seeking to limit the very platform itself. ISPs around the world, at the behest of governments, are being asked to track and record what you do on the web – everything you do on the web. Elsewhere, content is being filtered, traffic shaped and sites blocked.
The sorts of information being collected can include your search terms (pinned to your IP address) and the sites you visit. Now for sure this might sometimes include a bunch of URIs that point to illegal and nefarious activity, but it might also include (indeed it’s more likely to include) URIs relating to a medical condition, or legal advice, or a hundred and one other perfectly legal but equally personal bits of information.
Should a government, its agencies or an ISP be able to capture, store and analyse this data? Personally I think not. And should you think that I’m just scaremongering, have a read of Bill’s post “The digital age of rights” about the French government’s HADOPI legislation.
On the day, Bill Thompson (who, by the way, was on blinding form) summed up the reason why when he described his hopes for the web thus:
I hoped that the web would help us know our neighbours better, so that we didn’t go and kill them. That hasn’t happened but it does now mean it’s much harder to get away with it – the world will now know if you do kill them.
Governments know this, which is why some now try to lock down access to the Internet when there is civil unrest in their country. And it is also why the rest of the web tries to help people break through.
Few Western governments would condone the activities of such totalitarian states. But it is interesting to consider whether Western governments would support North Korea or Iran setting up the kinds of databases currently being debated in Europe and the States. Now, they might point out that the comparison isn’t a fair one, since they are nice, democratic governments, not nasty oppressive ones. But isn’t that painfully myopic? How do they know who will be in power in the future? How do they know how future governments might seek to use the information being gathered now?
Snooping aside, there is another reason why the web should remain open, and it is the reason why it’s important to fight for One Web.
Susan Greenfield quite rightly pointed out that ‘knowledge is to be found by creating context, links between facts; it’s the context that counts’. Although she was making the point in an attempt to take a swipe at the Web — suggesting that it is no more than a collection of facts devoid of context — it seems to me that the web is in fact the ultimate context machine. (One sometimes wonders whether she has ever actually used any of the services she complains about; indeed I wonder if she uses the web at all.)
The web is, as the name suggests, a set of interconnected links. Those URIs and the links between them, as TimBL reminded us, are made by people and followed by people, and as such you can legitimately think of the Web as humanity connected.
URIs are incredibly powerful, particularly when they are used to identify things in addition to documents. When they are used to identify things (dereferencing to the appropriate data or document format) they can lead to entirely new ways to access information. An example highlighted by TimBL is the impact they might have on TV channels and schedules.
He suggested that the concept of a TV channel was limited and that it would be replaced with complete random access. When anyone, anywhere in the world, can follow a URI to a persistent resource (note he didn’t say click on a link) then the TV channel as a means of discovery and recommendation will be replaced with a trust network. “My friends have watched this, most of them like it…” sort of thing.
Of course to get there we need to change the way we think about the web and the way in which we publish things. And here TimBL pointed to the history of the web, suggesting that the next digital revolution will operate in a similar fashion.
The web originally happened not because senior management thought it was a good idea – it happened because people who ‘got it’ thought it was cool, that it was the right thing, and they were lucky enough to have managers that didn’t get in the way. Indeed this is exactly what happened when TimBL wrote the first web server and client, and then when the early web pioneers started publishing web pages. They didn’t do it because they were told to; they didn’t do it because there was any immediate benefit. They did it because they thought that doing it would enable cool things to happen. The last couple of years suggest that we are on the cusp of a similar revolution, as people start to publish linked data – which will in turn result in a new digital revolution.
It’s starting to feel like the world has suddenly woken up to the whole Linked Data thing — and that’s clearly a very, very good thing. Not only are Google (and Yahoo!) now using RDFa, but a whole bunch of other things are going on, all rather exciting; below is a round-up of some of the best. But if you don’t know what I’m talking about you might like to start off with TimBL’s talk at TED.
The BBC has announced a couple of SPARQL endpoints, hosted by Talis and OpenLink.
Both platforms allow you to search and query the BBC data in a number of different ways, including SPARQL — the standard query language for semantic web data. If you’re not familiar with SPARQL, the Talis folk have published a tutorial that uses some NASA data.
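If you haven’t met SPARQL before, its core operation — matching graph patterns against a set of triples, with variables marked by a leading ‘?’ — can be sketched in a few lines of Python. This is an illustration of the idea only (the data and identifiers are invented), not how the endpoints above are implemented:

```python
def match(triples, patterns):
    """Naive basic-graph-pattern matcher. Terms starting with '?' are
    variables; returns one {variable: value} binding per solution."""
    def unify(pattern, triple, binding):
        bound = dict(binding)
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if bound.get(term, value) != value:
                    return None          # variable already bound to a different value
                bound[term] = value
            elif term != value:
                return None              # constant term doesn't match
        return bound

    solutions = [{}]
    for pattern in patterns:
        solutions = [b for binding in solutions for triple in triples
                     if (b := unify(pattern, triple, binding)) is not None]
    return solutions

# Invented example data: which episodes played which artists.
triples = {
    ("/episode/1", "played", "/artist/u2"),
    ("/episode/2", "played", "/artist/other"),
    ("/episode/1", "broadcastOn", "Radio 1"),
}

# Roughly: SELECT ?ep ?station WHERE { ?ep played /artist/u2 . ?ep broadcastOn ?station }
print(match(triples, [("?ep", "played", "/artist/u2"),
                      ("?ep", "broadcastOn", "?station")]))
# → [{'?ep': '/episode/1', '?station': 'Radio 1'}]
```

A real SPARQL engine does far more (filters, optional patterns, federation), but the follow-the-joins-across-a-graph shape is the same.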
A social semantic BBC?
Nice presentation from Simon and Ben on how social discovery of content could work: “show me the radio programmes my friends have listened to, show me the stuff my friends like that I’ve not seen” — all built on people’s existing social graph. People meet content via activity.
PricewaterhouseCoopers’ spring technology forecast focuses on Linked Data [pwc.com]
“Linked Data is all about supply and demand. On the demand side, you gain access to the comprehensive data you need to make decisions. On the supply side, you share more of your internal data with partners, suppliers, and—yes—even the public in ways they can take the best advantage of. The Linked Data approach is about confronting your data silos and turning your information management efforts in a different direction for the sake of scalability. It is a component of the information mediation layer enterprises must create to bridge the gap between strategy and operations… The term “Semantic Web” says more about how the technology works than what it is. The goal is a data Web, a Web where not only documents but also individual data elements are linked.”
I’ve been doing a few talks recently – most recently at the somewhat confused OKCon (Open Knowledge) Conference. The audience was extremely diverse, and so I tried not only to talk about what we’ve done but also to introduce the concept of Linked Data and explain what it is.
Linked Data is a grassroots project to use web technologies to expose data on the web. For many people it is synonymous with the semantic web, and while this isn’t quite true, it does, as far as I’m concerned, represent a very large subset of the semantic web project. Interestingly, it can also be thought of as ‘the web done right’ – the web as it was originally designed to be.
But what is it?
Well, it can be described with four simple rules.
1. Use URIs to identify things not only documents
The web was designed to be a web of things, with documents making assertions about those real-world things. Just as a passport or driving licence, in the real world, can be thought of as providing an identifier for a person, making an assertion about who they are, so URIs can be thought of as providing identifiers for people, concepts or things on the web.
Minting URIs for things rather than pages helps make the web more human literate because it means we are identifying those things that people care about.
2. Use HTTP URIs – they are globally unique and anyone can dereference them
The beauty of the web is its ubiquitous nature – it is decentralised and able to function on any platform. This is because of TimBL’s key invention: the HTTP URI.
HTTP URIs are globally unique, open to all and decentralised. Don’t go using DOIs or any other identifier – on the web all you need is an HTTP URI.
3. Provide useful information [in RDF] when someone looks up a URI
And obviously you need to provide some information at that URI. When people dereference it you need to give them some data – ideally as RDF as well as HTML. Providing the data as RDF means that machines can process that information for people to use, making it more useful.
4. Include links to other URIs to let people discover related information
And of course you also need to provide links to other resources so people can continue their journey, and that means contextual links to other resources elsewhere on the web, not just your site.
And that’s it.
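The four rules can be caricatured in a few lines of Python (all the URIs here are invented for the example): things get HTTP URIs (rules 1 and 2), looking one up returns useful information (rule 3), and that information links onward to other URIs (rule 4).

```python
# A toy 'web' of descriptions keyed by HTTP URI. All URIs are invented.
descriptions = {
    "http://example.org/artist/u2": {
        "type": "artist",
        "name": "U2",
        "links": ["http://example.org/album/1"],      # rule 4: onward links
    },
    "http://example.org/album/1": {
        "type": "album",
        "title": "An Example Album",
        "links": ["http://example.org/artist/u2"],
    },
}

def dereference(uri):
    """Rule 3: looking up a URI yields useful information about the thing."""
    return descriptions[uri]

# Follow-your-nose discovery: start somewhere and walk the links.
artist = dereference("http://example.org/artist/u2")
album = dereference(artist["links"][0])
print(album["title"])  # → An Example Album
```

In the real thing the descriptions would be RDF served over HTTP rather than a Python dict, but the follow-your-nose behaviour is exactly what the links buy you.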
Pretty simple really, and other than the RDF bit I would argue that these principles should be followed for any website – they just make sense.
Before the Web people still networked their computers – but to access those computers you needed to know about the network, the routing and the computers themselves.
For those in their late 30s, you’ll probably remember the film War Games. Because it was written before the Web had been invented, David and Jennifer, the two ‘hackers’, had to find and connect directly to each computer; they had to know about the computer’s location.
The joy of the web is that it adds a level of abstraction – freeing you from the networking, routing and server location – it lets you focus on the document.
Following the principles of Linked Data allows us to add a further level of abstraction – freeing us from the document and letting us focus on the things, people and stuff that matters to people. It helps us design a system that is more human literate, and more useful.
This is possible because we are identifying real world stuff and the relationships between them.
Free information from data silos
Of course there are other ways of achieving this – lots of sites now provide APIs, which is good, just not great. Each of those APIs tends to be proprietary and specific to the site. As a result there’s an overhead every time someone wants to add that data source.
These APIs give you access to the silo – but the silo still remains. Using RDF and Linked Data means there is a generic method to access data on the web.
What are we doing at the BBC?
First up it’s worth pointing out the obvious: the BBC is a big place and so it would be wrong to assume that everything we’re doing online is following these principles. But there’s quite a lot of stuff going on that does.
We have the BBC’s programme support, music discovery and, soon, natural history content all adopting these principles. In other words, persistent HTTP URIs that can be dereferenced to HTML, RDF, JSON and mobile views for programmes, artists, species and habitats.
We want HTTP URIs for every concept, not every HTML webpage – an individual page is made up of multiple resources, multiple concepts. So for example an artist page transcludes the resources ‘/:artist/news’ and ‘/:artist/reviews’ – but those resources also have their own URIs. If they didn’t, they wouldn’t be on the web.
Also, because there’s only one web, we have only one URI for a resource but a number of different representations of that resource. So the URI for the programme ‘Nature’s Great Events’ is:
Through content negotiation we are able to serve an HTML, RDF or mobile document to represent that programme.
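A much-simplified sketch of that negotiation in Python – real HTTP content negotiation also weighs q-values and wildcards, which this deliberately ignores, and the media types are just example choices:

```python
def negotiate(accept_header,
              available=("text/html", "application/rdf+xml", "application/json")):
    """Return the first media type in the client's Accept header that we can
    serve; fall back to HTML. q-values and wildcards are ignored for brevity."""
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip()
        if media_type in available:
            return media_type
    return "text/html"

print(negotiate("application/rdf+xml"))           # → application/rdf+xml
print(negotiate("text/plain, application/json"))  # → application/json
```

One URI, many representations: the client says what it can handle, and the server picks the best document it has for that one resource.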
We then need to link all of this stuff up within the BBC. So that, for example, you can go from a tracklist on an episode page of Jo Whiley on the Radio 1 site to the U2 artist page and then from there to all episodes of Chris Evans which have played U2. Or from an episode of Nature’s Great Events to the page about Brown Bears to all BBC TV programmes about Brown Bears.
But obviously the BBC is only one corner of the web. So we also need to link with the rest of the web.
Because we’re now thinking at web scale, we’ve started to think about the web as a CMS.
Where a URI already exists to represent a concept we use it rather than minting our own. The new music site transcludes, and links back to, Wikipedia to provide biographical information about an artist. Rather than minting our own URI for artist biographical info we use Wikipedia’s.
Likewise, when we want to add music metadata to the music site we use MusicBrainz.
Following my invitation to speak at the WWW@20 celebrations [my bit starts about 133 minutes into the video] – this is my attempt to squash the most interesting bits into a somewhat coherent 15 minute presentation.
20 years ago Tim Berners-Lee was working as a computer scientist at CERN. He noticed that, much as in the rest of the world, sharing information between research groups was incredibly difficult. Everyone had their own document management solution, running on their own flavour of hardware, over different protocols. His solution to the problem was a lightweight method of linking up existing (and new) stuff over IP – a hypertext solution – which he dubbed the World Wide Web and documented in the memo “Information Management: A Proposal“.
Then for a year or so nothing happened. This was for a number of reasons, including the fact that IP (and the ARPANET before it) was popular in America but less so in Europe. Indeed, senior managers at CERN had recently sent a memo to all department heads reminding them that IP wasn’t a supported protocol – people were being told not to use it!
Also, because CERN was full of engineers, everyone thought they could build their own solution, do better than what was already there – no one wanted to play together. And of course CERN was there to do particle physics, not information management.
Then TimBL got his hands on a NeXT Cube – officially he was evaluating the machine, not building a web server – but, with the support of his manager, that’s what he did: built the first web server and client. There then ensued a period of negotiation to get the idea released freely, for everyone to use, which happened in 1993. This coincided, more or less, with the University of Minnesota’s decision to charge a licence fee for Gopher. Then the web took off, especially in the US where IP was already popular.
The beauty of TimBL’s proposal was its simplicity – it was designed to work on any platform and, importantly, with the existing technology. The team knew that to make it work it had to be as easy as possible. He only wanted people to do one thing: give their resources identifiers – links – URIs – so information could be linked and discovered.
This, then, is the key invention – the URL.
To make this work, URLs were designed to work with existing protocols, in particular FTP and Gopher. That’s why there’s a colon in the URL — so that URLs can be given for stuff that’s already available via other protocols. As an aside, TimBL has said his biggest mistake was the inclusion of // in the URL — the idea was that one slash meant the resource was on the local machine and two meant somewhere else on the web, but because everyone used http://foo.bar the second / is redundant. I love that this is TimBL’s biggest mistake.
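You can still see that scheme-then-colon design in any URL parser today. Python’s standard library, for instance, splits the scheme off first, whatever protocol it names:

```python
from urllib.parse import urlsplit

# The colon separates the scheme from the rest of the identifier, so the same
# syntax can name resources served over HTTP, FTP, Gopher or anything else.
for url in ("http://example.org/page",
            "ftp://example.org/file",
            "gopher://example.org/doc"):
    parts = urlsplit(url)
    print(parts.scheme, parts.netloc, parts.path)
```

The identifier syntax outlived most of the protocols it was designed to embrace, which is rather the point.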
He also implemented a quick tactical solution to get things up and running and demonstrate what he was talking about — HTML. HTML was originally just one of a number of supported doctypes – it wasn’t intended to be the doctype – but HTML took off because it was easy. Apparently the plan was to implement a mark-up language that worked a bit like the NeXT application builder, but they didn’t get round to it before Mosaic came along (TimBL’s first client was a browser-editor) and then it was all too late. And we’ve been left with something so ugly I doubt even its parents love it.
The curious thing, however, is that if you read the original memo — despite its simplicity — it’s clear that we’re still implementing it, still working on the original spec. It’s just that we’ve tended to forget what it said, or decided to get sidetracked for a while with some other stuff. So forget about Web 2.0.
For example, the original Web was read-write. Not only that, but it used style sheets and a WYSIWYG editing interface — no tags, no mark-up. They didn’t think anyone would want to edit the raw mark-up.
You can also see that the URL is hidden; you get to it via a property dialog.
This is because the whole point of the web is that it provides a level of abstraction, allowing you to forget about the infrastructure, the servers and the routing. You only needed to worry about the document. Those who remember the film War Games will remember that the characters had to phone up individual computers — they needed that networking information to access a computer, they needed to know its location before they could use it. The beauty of the Web and the URL is that the location shouldn’t matter to the end user.
URIs are there to provide persistent identifiers across the web — they’re not a function of ownership, branding, look and feel, platform or anything else for that matter.
The original team described CERN’s IT ecosystem as a zoo because there were so many different flavours of hardware, operating systems and protocols in use. The purpose of the web was to be ubiquitous: to work on any machine, open to everyone. It was designed to work no matter what machine or operating system you’re running. This is, of course, achieved by having one identifier — one HTTP URI — and dereferencing it to the appropriate document based on the capabilities of that machine.
We should be adopting the same approach today when it comes to delivery to mobile, IPTV, connected devices etc. — we should have one URI for a resource and allow the client to request the document it needs, as Tim intended. The technology is there to do this — we just don’t use it very often.
The original memo also talked about linking people, documents, things and concepts, and data. But we are only now getting around to building it. Through technologies such as OpenID and FOAF we can give people identifiers on the web and describe their social graph, the relationships between those people. And through RDF we can publish information so that machines can process it, describing the nature of and the relationship between the different nodes of data.
The original memo described, and the original server supported, link typing, so that you could describe not only real world things but also the nature of the relationships between those things — as RDF, and now HTML5, do, 20 years later. This focus on data is a good idea because it lets you treat the web like a giant database, making computers human literate by linking up bits of data so that the tools, devices and apps connected to the web can do more of the work for you, making it easier to find the things that interest you.
The semantic web project – and TimBL’s original memo – is all about helping people access data in a standard fashion so that we can add another level of abstraction, letting people focus on the things that matter to them. This is what I believe we should be striving for as the web’s future, because I agree with Dan Brickley: to understand the future of the web you first need to understand its origins.
Don’t think about HTML documents – think about the things and concepts that matter to people and give each its own identifier, its own URI, and then put in place the technology to dereference that URI to the document appropriate to the device, whether that be a desktop PC, a mobile device, an IPTV or a third party app.