Category Technology

Rich Snippets

As everyone knows last night Google announced that they are now supporting RDFa and microformats to add ‘Rich Snippets’ to their search results page.

Rich Snippets give users convenient summary information about their search results at a glance. We are currently supporting data about reviews and people. When searching for a product or service, users can easily see reviews and ratings, and when searching for a person, they’ll get help distinguishing between people with the same name…

To display Rich Snippets, Google looks for markup formats (microformats and RDFa) that you can easily add to your own web pages.

That’s good, right? Google gets a higher click-through rate because, as their user testing shows, the more useful and relevant information people see on a results page, the more likely they are to click through; sites that support these technologies make their content more discoverable; and everyone else gets to what they need more easily. Brilliant. And to make life even better, because Google have adopted RDFa and microformats:

…you not only make your structured data available for Google’s search results, but also for any service or tool that supports the same standard. As structured data becomes more widespread on the web, we expect to find many new applications for it, and we’re excited about the possibilities.

Those Google guys, they really don’t do evil. Well actually no, not so much. Actually Google are being a little bit evil here.

Doctor Evil


Here’s the problem. When Google went and implemented RDFa support they adopted the syntax but decided not to adopt the vocabularies – they went and reinvented their own. And as Ian points out it’s the vocabularies that matter. What Google decided to do is support only those properties and classes defined at data-vocabulary.org rather than supporting existing ontologies such as FOAF, vCard and vocab.org/review.

Now in some ways this doesn’t matter too much, after all it’s easy enough to do this sort of thing:

property="foaf:name google:name"
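To make that concrete, here is a minimal Python sketch that writes one value out under both vocabularies. It assumes the name is a plain text value, so it uses RDFa’s property attribute; the helper name rdfa_span and the sample name are made up for illustration, and the prefix declarations that the page would also need are left out.

```python
from html import escape


def rdfa_span(text, properties):
    """Return a <span> whose property attribute lists every given CURIE."""
    attr = " ".join(properties)
    return '<span property="{}">{}</span>'.format(escape(attr), escape(text))


# One value, described with both the FOAF term and the post's google: shorthand.
print(rdfa_span("Jane Example", ["foaf:name", "google:name"]))
```

The point is only that listing a second term costs almost nothing, so publishing in an existing vocabulary alongside Google’s is cheap.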

And Google do need to make Rich Snippets work on their search results; they need to control which vocabularies to support so that webmasters know what to do and so they can render the data appropriately. But by starting off with a somewhat broken vocabulary they are providing a pretty big incentive for webmasters to implement a broken version of RDFa. And they will implement the broken version because Google Juice is so important to the success of their sites.

Google have taken an open standard and inserted a slug of proprietary NIH into it and that’s a shame, they could have done so much better. Indeed they could have supported RDFa as well as they support microformats.

Perhaps we shouldn’t be surprised. Google are a commercial operation – by adopting RDFa they get a healthy dose of “Google and the Semantic Web” press coverage while at the same time making their search results that bit better. And let’s be honest, the semweb community hasn’t done a great job at getting those vocabularies out and into the mainstream, so Google’s decision won’t hurt its bottom line. Just don’t be fooled: this isn’t Google supporting RDFa, it’s Google adding Rich Snippets.

Daytum I love you but please join the web

I’ve been lucky enough to be a beta tester for daytum.com, a service for collecting and communicating personal data, and I love it. As you might expect from Ryan Case and Nicholas Felton it’s a lovely piece of interaction and graphic design. You can record and visualise all sorts of qualitative and quantitative data – personally I’m recording information about what I eat, drink, how much I sleep and communicate (emails, blog posts, talks, tweets etc.) but others record the music they listen to, how far they run, gigs they’ve been to, books they’ve read. All sorts of things.

OK I probably drink too much coffee


And now you too can record and visualise whatever you want because this weekend the service came out of beta. Now here’s the thing: as much as I love the service I wish it were more, well, born of the web. You see, I have a few problems with daytum.

My main problem is that I can’t point to the stuff I’m recording. That graphic at the top of this post doesn’t have a URL so I can’t link to it or the underlying data; and because I can’t point to it, what can be done with it is limited. If I can’t link to it, I can’t embed it elsewhere, and I can’t link it to other data sources and mash it up. And that’s a problem because the only possible URI for this sort of information about me is locked away in the daytum interface. Why isn’t there a nice RESTful URL for each ‘display’? Something like:

daytum.com/:user/:statement

Once everything has a URL then I want each of those resources to be made available in a variety of different representations – as JSON, RDF and Atom for starters – that way the data can be used, not just visualised.
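As a rough illustration of what I mean, here is a small Python sketch that maps a path of that shape to a user, a statement and a representation. Everything in it is an assumption on my part: daytum offers nothing like this, the file-extension convention is just one way of asking for a representation, and the names ROUTE, MEDIA_TYPES and resolve are made up.

```python
import re

# Hypothetical route mirroring the /:user/:statement pattern above,
# with an optional extension that picks a representation.
ROUTE = re.compile(r"^/(?P<user>[^/.]+)/(?P<statement>[^/.]+)(?:\.(?P<ext>[a-z]+))?$")

# Assumed extension-to-media-type mapping; HTML is the default.
MEDIA_TYPES = {
    "html": "text/html",
    "json": "application/json",
    "rdf": "application/rdf+xml",
    "atom": "application/atom+xml",
}


def resolve(path):
    """Map a request path to (user, statement, media type), or None if no match."""
    match = ROUTE.match(path)
    if match is None:
        return None
    media_type = MEDIA_TYPES.get(match.group("ext") or "html")
    if media_type is None:
        return None
    return match.group("user"), match.group("statement"), media_type


print(resolve("/alice/coffee"))
print(resolve("/alice/coffee.json"))
```

The same display keeps one stable address and the extension (or, better, the request headers) only chooses how it is serialised.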

And finally I want to be able to use URIs to describe what I’m measuring, not just strings. I want to be able to point to stuff out there on the web and say “at this time I consumed another one of those”. I’m not suggesting that everything should have to be described like this, but if there’s a URI to represent something I want to be able to point to it so everyone knows what I’m talking about.

In other words I want daytum.com to follow the Linked Data principles rather than offer an Ajax-only interface.

If you have a look at Felton’s own annual reports you will see that they group and aggregate all sorts of information, but to achieve something similar (conceptually if not visually) you will need a lot more from daytum than is currently being offered.

Felton Annual Report 2008


The other big gap is the lack of an API to update information. Keeping daytum.com up to date is actually quite hard work, and collecting the sort of data Nicholas Felton uses to put together his annual reports would be onerous to say the least. But it needn’t be.

If daytum.com provided an API that allowed me to post information from other services that would be a great start, but actually it’s not always necessary, nor even that desirable. The Web already knows quite a lot about us: for example, Fire Eagle and Dopplr know where I am and where I’ve been; delicious knows what I think is interesting on the web, and how I describe those things; Twitter and this blog know what I’m doing and thinking about; for others Last.fm knows what music they are listening to. Daytum doesn’t need to replicate all of that data, indeed it shouldn’t; it could simply request that data when needed to visualise it. (It shouldn’t store it because that makes it harder to manage access to it.)

The one thing I don’t want, however, is yet another social networking site, I don’t want social features to be part of daytum. I don’t want them because I don’t need them – there are already loads of places integrated into my social graph, whether that be Twitter, Flickr, Facebook or this blog. I really don’t want to have to import and then maintain another social graph. I do however want to be able to squirt the data I’m collecting or aggregating here at daytum into my existing social graph; much as Fire Eagle adds location brokerage to existing services so I want a service that adds personal data to existing social networking sites.

Linking bbc.co.uk to the Linked Data cloud

I’ve been doing a few talks recently – most recently at the somewhat confused OKCon (Open Knowledge) Conference. The audience was extremely diverse and so I tried to not only talk about what we’ve done but also introduce the concept of Linked Data and explain what it is.

Linked Data is a grassroots project to use web technologies to expose data on the web. It is for many people synonymous with the semantic web – and while this isn’t quite true it does, as far as I’m concerned, represent a very large subset of the semantic web project. Interestingly, it can also be thought of as ‘the web done right’, the web as it was originally designed to be.

But what is it?

Well it can be described with 4 simple rules.

1. Use URIs to identify things not only documents

The web was designed to be a web of things with documents making assertions about those real-world things. Just as a passport or driving license, in the real world, can be thought of as providing an identifier for a person making an assertion about who they are, so URIs can be thought of as providing identifiers for people, concepts or things on the web.

Minting URIs for things rather than pages helps make the web more human literate because it means we are identifying those things that people care about.

2. Use HTTP URIs – they are globally unique and anyone can dereference them

The beauty of the web is its ubiquitous nature – it is decentralised and able to function on any platform. This is because of TimBL’s key invention, the HTTP URI.

URIs are globally unique, open to all and decentralised. Don’t go using DOI or any other identifier – on the web all you need is an HTTP URI.

3. Provide useful information [in RDF] when someone looks up a URI

And obviously you need to provide some information at that URI. When people dereference it you need to give them some data – ideally as RDF as well as HTML. Providing the data as RDF means that machines can process that information for people to use, making it more useful.

4. Include links to other URIs to let people discover related information

And of course you also need to provide links to other resources so people can continue their journey, and that means contextual links to other resources elsewhere on the web, not just your site.

And that’s it.

Pretty simple really and other than the RDF bit, I would argue that these principles should be followed for any website – they just make sense.
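The four rules can be sketched in a few lines of Python. This is a toy under stated assumptions: the dictionary WEB stands in for the real web, every URI in it is hypothetical, and a real client would make HTTP requests and parse RDF rather than read a dict.

```python
from collections import deque

# Rules 1 and 2: things (not just pages) are named with HTTP URIs.
# All of these URIs are hypothetical examples.
WEB = {
    "http://example.org/artists/u2": {
        "label": "U2",
        "links": [
            "http://example.org/programmes/jo-whiley",
            "http://example.com/bio/u2",
        ],
    },
    "http://example.org/programmes/jo-whiley": {
        "label": "Jo Whiley",
        "links": ["http://example.org/artists/u2"],
    },
    "http://example.com/bio/u2": {"label": "U2 biography", "links": []},
}


def dereference(uri):
    """Rule 3: looking up a URI returns useful data about the thing it names."""
    return WEB.get(uri)


def discover(start):
    """Rule 4: follow links from one URI to find related things, breadth first."""
    seen, queue, labels = {start}, deque([start]), []
    while queue:
        description = dereference(queue.popleft())
        if description is None:
            continue
        labels.append(description["label"])
        for link in description["links"]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return labels


print(discover("http://example.org/artists/u2"))
```

Note that one of the links points at a different site: the journey doesn’t have to stop at the edge of your own domain.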

But why?

Before the Web people still networked their computers – but to access those computers you needed to know about the network, the routing and the computers themselves.

Those in their late 30s will probably remember the film War Games. Because it was written before the Web had been invented, David and Jennifer, the two ‘hackers’, had to find and connect directly to each computer; they had to know about the computer’s location.

Phoning up another computer

War Games, 1983

The joy of the web is that it adds a level of abstraction – freeing you from the networking, routing and server location – it lets you focus on the document.

Following the principles of Linked Data allows us to add a further level of abstraction – freeing us from the document and letting us focus on the things, people and stuff that matters to people. It helps us design a system that is more human literate, and more useful.

This is possible because we are identifying real world stuff and the relationships between them.

Free information from data silos

Of course there are other ways of achieving this – lots of sites now provide APIs, which is good, just not great. Each of those APIs tends to be proprietary and specific to the site. As a result there’s an overhead every time someone wants to add that data source.

These APIs give you access to the silo – but the silo still remains. Using RDF and Linked Data means there is a generic method to access data on the web.

What are we doing at the BBC?

First up it’s worth pointing out the obvious: the BBC is a big place and so it would be wrong to assume that everything we’re doing online is following these principles. But there’s quite a lot of stuff going on that does.

We do have the BBC’s programme support, music discovery and, soon, natural history content all adopting these principles. In other words, persistent HTTP URIs that can be dereferenced to HTML, RDF, JSON and mobile views for programmes, artists, species and habitats.

We want HTTP URIs for every concept, not just every HTML webpage – an individual page is made up of multiple resources, multiple concepts. So for example an artist page transcludes the resources ‘/:artist/news’ and ‘/:artist/reviews’ – but those resources also have their own URIs. If they didn’t they wouldn’t be on the web.

Also because there’s only one web we only have one URI for a resource but a number of different representations of that resource. So the URI for the programme ‘Nature’s Great Events’ is:

bbc.co.uk/programmes/b00ht655#programme

Through content negotiation we will be able to serve an HTML, RDF, or mobile document to represent that programme.
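For anyone who hasn’t met content negotiation, here is a simplified Python sketch of the server side of it: the client lists the media types it can handle in an Accept header and the server picks the best one it can provide. This is not how the BBC’s servers are implemented. The function names and the AVAILABLE list are my own, and the sketch deliberately ignores partial wildcards such as text/*.

```python
def parse_accept(header):
    """Parse an Accept header into (media type, q) pairs, best first."""
    choices = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0]
        if not media_type:
            continue
        quality = 1.0  # a missing q parameter means full preference
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        choices.append((media_type, quality))
    # sorted() is stable, so ties keep the order the client sent them in.
    return sorted(choices, key=lambda choice: choice[1], reverse=True)


def negotiate(accept_header, available):
    """Return the first available media type the client accepts, else None."""
    for media_type, quality in parse_accept(accept_header):
        if quality <= 0:
            continue
        if media_type == "*/*":
            return available[0]
        if media_type in available:
            return media_type
    return None


# Assumed set of representations for one resource, in server preference order.
AVAILABLE = ["text/html", "application/rdf+xml", "application/json"]
print(negotiate("application/rdf+xml, text/html;q=0.8", AVAILABLE))
```

A browser asking for text/html and a crawler asking for application/rdf+xml both use the same URI; only the document they get back differs.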

We then need to link all of this stuff up within the BBC. So that, for example, you can go from a tracklist on an episode page of Jo Whiley on the Radio 1 site to the U2 artist page and then from there to all episodes of Chris Evans which have played U2. Or from an episode of Nature’s Great Events to the page about Brown Bears to all BBC TV programmes about Brown Bears.

But obviously the BBC is only one corner of the web. So we also need to link with the rest of the web.

Because we’re now thinking on a webscale we’ve started to think about the web as a CMS.

Where a URI already exists to represent a concept we use it rather than minting our own. The new music site transcludes and links back to Wikipedia to provide biographical information about an artist. Rather than minting our own URI for artist biographical info we use Wikipedia’s.

Likewise when we want to add music metadata to the music site we add MusicBrainz.

What does the history of the web tell us about its future?

Following my invitation to speak at the WWW@20 celebrations [my bit starts about 133 minutes into the video] – this is my attempt to squash the most interesting bits into a somewhat coherent 15 minute presentation.

20 years ago Tim Berners-Lee was working, as a computer scientist, at CERN. What he noticed was that, much like the rest of the world, sharing information between research groups was incredibly difficult. Everyone had their own document management solution, running on their own flavour of hardware over different protocols. His solution to the problem was a lightweight method of linking up existing (and new) stuff over IP – a hypertext solution – which he dubbed the World Wide Web and documented in a memo, “Information Management: A Proposal”.

Then for a year or so nothing happened. Nothing happened for a number of reasons, including the fact that IP, and the ARPANET before that, was popular in America but less so in Europe. Indeed senior managers at CERN had recently sent out a memo to all department heads reminding them that IP wasn’t a supported protocol – people were being told not to use it!

Also because CERN was full of engineers everyone thought they could build their own solution, do better than what was already there – no one wanted to play together. And of course because CERN was there to do particle physics not information management.

Then TimBL got his hands on a NeXT Cube – officially he was evaluating the machine, not building a web server – but, with the support of his manager, that’s what he did: he built the first web server and client. There then ensued a period of negotiation to get the idea out freely, for everyone to use, which happened in 1993. This coincided, more or less, with the University of Minnesota’s decision to charge a license fee for Gopher. Then the web took off, especially in the US where IP was already popular.

The first webserver


The beauty of TimBL’s proposal was its simplicity – it was designed to work on any platform and, importantly, with the existing technology. The team knew that to make it work it had to be as easy as possible. He only wanted people to do one thing: to give their resources identifiers – links – URIs; so information could be linked and discovered.

This, then, is the key invention – the URL.

To make this work URLs were designed to work with existing protocols; in particular they needed to work with FTP and Gopher. That’s why there’s a colon in the URL – so that URLs can be given for stuff that’s already available via other protocols. As an aside, TimBL has said his biggest mistake was the inclusion of // in the URL – the idea was that one slash meant the resource is on the local machine and two somewhere else on the web, but because everyone used http://foo.bar the second / is redundant. I love that this is TimBL’s biggest mistake.
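You can see that design in any URL parser today. The short Python sketch below uses the standard library’s urllib.parse.urlsplit to pull apart three addresses; the host and paths are invented, and describe is just a helper name of mine.

```python
from urllib.parse import urlsplit


def describe(url):
    """Split a URL into the protocol before the colon, the host, and the path."""
    parts = urlsplit(url)
    return parts.scheme, parts.netloc, parts.path


# The same syntax addresses resources reached over three different protocols.
for url in [
    "http://example.org/notes/web.html",
    "ftp://example.org/pub/readme.txt",
    "gopher://example.org/1/menu",
]:
    print(describe(url))
```

Everything before the colon names the protocol, so one kind of identifier can point at things served in entirely different ways.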

He also implemented a quick tactical solution to get things up and running and demonstrate what he was talking about – HTML. HTML was originally just one of a number of supported doctypes – it wasn’t intended to be the doctype but HTML took off because it was easy. Apparently the plan was to implement a mark-up language that worked a bit like the NeXT application builder. But they didn’t get round to it before Mosaic came along with the first browser (TimBL’s first client was a browser-editor) and then it was all too late. And we’ve been left with something so ugly I doubt even its parents love it.

The curious thing, however, is that if you read the original memo – despite its simplicity – it’s clear that we’re still implementing it, we’re still working on the original spec. It’s just that we’ve tended to forget what it said or decided to get sidetracked for a while with some other stuff. So forget about Web 2.0.

For example, the original Web was read-write. Not only that but it used style sheets and a WYSIWYG editing interface — no tags, no mark-up. They didn’t think anyone would want to edit the raw mark-up.

The first web site was read and write


You can also see that the URL’s hidden, you get to it via a property dialog.

This is because the whole point of the web is that it provides a level of abstraction, allowing you to forget about the infrastructure, the servers and the routing. You only needed to worry about the document. For those who remember the film War Games — you will remember that they had to ‘phone up individual computers — they needed this networking information to access the computer, they needed to know its location before they could use it. The beauty of the Web and the URL is that the location shouldn’t matter to the end user.

URIs are there to provide persistent identifiers across the web — they’re not a function of ownership, branding, look and feel, platform or anything else for that matter.

The original team described CERN’s IT ecosystem as a zoo because there were so many different flavours of hardware, different operating systems and protocols in use. The purpose of the web was to be ubiquitous, to work on any machine, open to everyone. It was designed to work no matter what machine or operating system you’re running. This is, of course, achieved by having one identifier, one HTTP URI, and dereferencing that to the appropriate document based on the capabilities of that machine.

We should be adopting the same approach today when it comes to delivery to mobile, IPTV, connected devices etc. – we should have one URI for a resource and allow the client to request the document it needs. As Tim intended. The technology is there to do this – we just don’t use it very often.

The original memo also talked about linking people, documents, things and concepts, and data. But we are only now getting around to building it. Through technologies such as OpenID and FOAF we can give people identifiers on the web and describe their social graph, the relationships between those people. And through RDF we can publish information so that machines can process it, describing the nature of and the relationship between the different nodes of data.


Information Management: A Proposal by Tim Berners-Lee

The original memo described, and the original server supported, link typing so that you could describe not only real world things but also the nature of the relationship between those things, as RDF and HTML 5 now do, 20 years later. This focus on data is a good idea because it lets you treat the web like a giant database, making computers human literate by linking up bits of data so that the tools, devices and apps connected to the web can do more of the work for you, making it easier to find the things that interest you.
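A typed link is easiest to see as a subject, a relationship and an object, which is the shape RDF gives it. The Python sketch below is a toy of my own, not anything from the memo or the first server: the people and document URIs are hypothetical, while the foaf:knows, foaf:name and dcterms:creator identifiers are the real terms from those vocabularies.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
DCTERMS = "http://purl.org/dc/terms/"

# Hypothetical things, each identified by its own URI.
ALICE = "http://example.org/people/alice"
BOB = "http://example.org/people/bob"
MEMO = "http://example.org/docs/memo"

# Each link carries its type: (subject, relationship, object).
TRIPLES = [
    (ALICE, FOAF + "knows", BOB),
    (ALICE, FOAF + "name", "Alice"),
    (MEMO, DCTERMS + "creator", ALICE),
]


def match(subject=None, predicate=None, obj=None):
    """Return triples matching the given pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in TRIPLES
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Because the links are typed, "who does Alice know?" is a simple lookup.
print(match(subject=ALICE, predicate=FOAF + "knows"))
```

With untyped links all you could say is that the three things are connected; with typed ones a machine can tell an author from an acquaintance.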

The semantic web project – and TimBL’s original memo – is all about helping people access data in a standard fashion so that we can add another level of abstraction – letting people focus on the things that matter to them. This is what, I believe, we should be striving for in the web’s future because I agree with Dan Brickley: to understand the future of the web you first need to understand its origins.

Don’t think about HTML documents – think about the things and concepts that matter to people and give each its own identifier, its own URI, and then put in place the technology to dereference that URI to the document appropriate to the device, whether that be a desktop PC, a mobile device, an IPTV or a third party app.

Interesting stuff from around the web 2009-03-20

Ben Segal, Tim Berners-Lee and Robert Cailliau with TimBL's original proposal and first webserver at the WWW@20 celebrations, CERN

Semantic web news

Linked Data? Web of Data? Semantic Web? WTF? [Tom Heath]
“Think about HTML documents; when people started weaving these together with hyperlinks we got a Web of documents. Now think about data. When people started weaving individual bits of data together with RDF triples (that expressed the relationship between these bits of data) we saw the emergence of a Web of data. Linked Data is no more complex than this – connecting related data across the Web using URIs, HTTP and RDF.”

The Programmes Ontology [BBC]
Yves has updated the programmes ontology to handle “temporal annotations” tracklistings and segments and outlets etc.

Twitter news

The Twitter Global Mind [Rocketboom]
Don’t understand what all the fuss about Twitter is? Watch this. Yes it’s about social networking and communication but it’s also about realtime search.

Twitter to begin charging brands for commercial use [Brand Republic News]
Co-founder Biz Stone told Marketing: ‘We are noticing more companies using Twitter and individuals following them. We can identify ways to make this experience even more valuable and charge for commercial accounts.’ He would not be drawn on the level of charges.

Some interesting visualisations

Depressing Project of the Day: Stock Market, Set to Music with Microsoft Songsmith [Create Digital Music]
Thanks to Yves. The failing economy set to music.

Periodic Table of Typefaces on the Behance Network [behance.net]
“The Periodic Table of Typefaces is obviously in the style of all the thousands of over-sized Periodic Table of Elements posters hanging in schools and homes around the world. This particular table lists 100 of the most popular, influential and notorious typefaces today. As with traditional periodic tables, this table presents the subject matter grouped categorically. The Table of Typefaces groups by families and classes of typefaces: san-serif, serif, script, blackletter, glyphic, display, grotesque, realist, didone, garalde, geometric, humanist, slab-serif and mixed.”

The open web

What is the Open Platform? [guardian.co.uk]
“The Open Platform is the suite of services that make it possible for guardian.co.uk to build applications with the Guardian…” very nice, I hope others follow. I also wish the Beeb recognized its open projects (recognized internally that is).

RadioAunty feature update – twitter, scheduling and much more [whomwah]
RadioAunty is a Mac app that allows you to listen to live and catchup BBC Radio. It’s a lovely app and is built on an open BBC platform :)

Monty Python DVD sales soar thanks to YouTube clips [guardian.co.uk]
“Within days of the launch of the official Monty Python YouTube channel, sales of the DVD box set had gone up by 16,000% on Amazon”

Designing for your least able user [BBC Radio Labs]
Michael’s mighty post on SEO, accessibility and the joy of links. Read it.
