Apis and APIs – a wildlife ontology

By a mile the highlight of the last week or so was the 2nd Linked Data meet-up. Silver and Georgi did a great job of organising the day, and I came away with a real sense that not only are we on the cusp of seeing a lot of data on the web, but that the UK is at the centre of this particular revolution. All very exciting.

For my part I presented the work we’ve been doing on Wildlife Finder – how we’re starting to publish and consume data on the web. Ed Summers has a great write-up of what we’re doing, and I’ve also published my slides here.

I also joined Paul Miller, Jeni Tennison, Ian Davis and Timo Hannay on a panel session discussing Linked Data in the enterprise.

In terms of Wildlife Finder there are a few things that I wanted to highlight:

  1. If you’re interested in the RDF and how we’re modelling the data, we’ve documented the wildlife ontology here. In addition to the ontology itself we’ve also included some background on why we modelled the information in the way we have.
  2. If you want to get your hands on the RDF/XML then either add .rdf to the end of most of our URLs (more on this later) or configure your client to request RDF/XML – we’ve implemented content negotiation so you’ll just get the data. There’s a small sketch of both approaches after this list.
  3. But… we’ve not implemented everything just yet. Specifically, the adaptations aren’t published as RDF – this is because we’re making a few changes to the structure of this information and I didn’t want to publish the data and then change it. Nor have we published information on species’ conservation status – that’s simply because we’ve not finished it yet (sorry).
  4. It’s not all RDF – we are also marking up our taxa pages with the species microformat, which gives more structure to the common and scientific names.
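By way of illustration, here’s a minimal sketch in Python (using the requests library; the URL is a placeholder rather than a real Wildlife Finder address) showing both routes to the RDF/XML:

import requests

# Placeholder URL - substitute the address of a real Wildlife Finder page.
url = "http://www.bbc.co.uk/nature/species/Example_Species"

# Option 1: content negotiation - ask for RDF/XML via the Accept header.
response = requests.get(url, headers={"Accept": "application/rdf+xml"})

# Option 2: append .rdf to the URL instead.
# response = requests.get(url + ".rdf")

print(response.headers.get("Content-Type"))
print(response.text[:300])  # the first few hundred characters of the RDF/XML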

Anyway I hope you find this useful.

Rich Snippets

As everyone knows, last night Google announced that they are now supporting RDFa and microformats to add ‘Rich Snippets’ to their search results pages.

Rich Snippets give users convenient summary information about their search results at a glance. We are currently supporting data about reviews and people. When searching for a product or service, users can easily see reviews and ratings, and when searching for a person, they’ll get help distinguishing between people with the same name…

To display Rich Snippets, Google looks for markup formats (microformats and RDFa) that you can easily add to your own web pages.

That’s good, right? Google gets a higher click-through rate because, as their user testing shows, the more useful and relevant information people see on a results page, the more likely they are to click through; sites that support these technologies make their content more discoverable; and everyone else gets to what they need more easily. Brilliant. And to make life even better, because Google have adopted RDFa and microformats…

…you not only make your structured data available for Google’s search results, but also for any service or tool that supports the same standard. As structured data becomes more widespread on the web, we expect to find many new applications for it, and we’re excited about the possibilities.

Those Google guys, they really don’t do evil. Well actually no, not so much. Actually Google are being a little bit evil here.

Doctor Evil

Here’s the problem. When Google went and implemented RDFa support they adopted the syntax but decided not to adopt the vocabularies – they went and reinvented their own. And as Ian points out, it’s the vocabularies that matter. What Google decided to do is to only support those properties and classes defined at data-vocabulary.org, rather than supporting existing ontologies such as FOAF, vCard and vocab.org/review.

Now in some ways this doesn’t matter too much – after all, it’s easy enough to do this sort of thing:

rel="foaf:name google:name"

And Google do need to make Rich Snippets work on their search results: they need to control which vocabularies to support so that webmasters know what to do and so Google can render the data appropriately. But by starting off with a somewhat broken vocabulary they are providing a pretty big incentive for webmasters to implement a broken version of RDFa. And they will implement the broken version, because Google Juice is so important to the success of their sites.

Google have taken an open standard and inserted a slug of proprietary NIH into it, and that’s a shame; they could have done so much better. Indeed, they could have supported RDFa as well as they support microformats.

Perhaps we shouldn’t be surprised – Google are a commercial operation: by adopting RDFa they get a healthy dose of “Google and the Semantic Web” press coverage while at the same time making their search results that bit better. And let’s be honest, the semweb community hasn’t done a great job of getting those vocabularies out and into the mainstream, so Google’s decision won’t hurt its bottom line. Just don’t be fooled: this isn’t Google supporting RDFa, it’s Google adding Rich Snippets.

Highly connected graphs: Opening BBC data

Mike Butcher wants the BBC to open up – to make its data available, to provide lots of APIs. Well, as he puts it:

Dear BBC,

What we want is your data, a lot more APIs, developer tools and your traffic.
We’ve paid for it already in the license fee.
Now get on with it.

Yours Sincerely,
The UK’s Startups

I do largely agree with Mike’s central premise – the BBC does need to make its data more accessible, and it does need to provide more APIs. And as Matt has already noted, there are people at the BBC working to open things up. Now I don’t want to get into the debate about what the BBC does well vs what it doesn’t – but I did want to highlight some of the work the team I work in is doing, and to give some perspective on why Mike’s objective isn’t as simple as it might appear.

I work in the “FM&T for A&Mi” bit of the BBC (as James has rechristened it) – in other words the ‘new media’ team embedded within the radio and music department. We’re currently working on a couple of projects (programmes and a revamped music site) that I hope might give some of the UK Startups some of what Mike is after. And in due course we’ll be adding more data that will make more startups happy (hopefully).

So what are we doing? It’s probably easiest to start by looking at the current programmes beta – the objective is to ensure that every programme the BBC broadcasts has a permanent, findable web presence. The site provides data for the eight BBC TV channels, ten national radio stations and the six stations covering Scotland, Northern Ireland and Wales.

To enable the sharing of this data in a structured way, we are using the linked data approach to connect and expose resources, i.e. using web technologies (URLs, HTTP etc.) to identify and link to a representation of something – and that something can be a person, a programme or an album release. These resources also have representations which can be machine-processable (through the use of RDF, microformats, RDFa, etc.), and they can contain links to other web resources, allowing you to jump from one dataset to another.

OK so that’s the theory – what are we doing?

Currently the pages are marked up with microformats: hCalendar on the schedule views and hCard for cast and crew on episode pages. That’s OK – but hardly the APIs that Mike is after. But what’s coming very soon will hopefully be a bit closer to the mark. Our plan is to make all resources available in a variety of formats: XML, Atom, RSS 2, JSON, YAML, RDF etc. (We’ll announce these on Backstage as they become available.)
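As a rough sketch of what fetching one of those representations might look like – the programme identifier and the .json suffix below are assumptions about how these representations could be addressed, not a published interface – you’d be able to do something like this:

import requests

# Illustrative only: assumes a programme identifier (PID) and that a JSON
# representation will be available by appending .json to the programme URL.
pid = "b0000000"  # placeholder PID
url = f"http://www.bbc.co.uk/programmes/{pid}.json"

programme = requests.get(url).json()

# The field names are also assumptions; the point is that the same resource
# is available to machines as structured data, not just as an HTML page.
print(programme.get("title"))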

And to help folk get direct access to the actual data backing BBC Programmes, we designed a Semantic Web ontology covering programmes data: the Programmes Ontology. This ontology provides web identifiers for concepts such as brand, series or episode, and is released under a Creative Commons licence so anyone can use it.

But there are limitations to such a web interface. To provide a more expressive API we are also investigating D2R Server, a Java application for mapping relational databases to RDF, to make the data accessible through SPARQL. SPARQL allows you to carry out more complex queries than would be possible with simple RDF representations of the resources – think of SPARQL as SQL for the semantic web. It also allows you to semantically connect to external data sources such as DBpedia to provide extra information that is not present in our dataset, such as the date and place of birth of cast members.
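To give a flavour of what that looks like, here’s a small sketch using the SPARQLWrapper Python library against the public DBpedia endpoint (not our own, as-yet-unreleased, endpoint); the resource and property names are purely illustrative:

from SPARQLWrapper import SPARQLWrapper, JSON

# Ask DBpedia for a person's date and place of birth. The resource and the
# dbo: properties are illustrative; the point is the shape of a SPARQL query.
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT ?birthDate ?birthPlace WHERE {
        dbr:David_Attenborough dbo:birthDate  ?birthDate ;
                               dbo:birthPlace ?birthPlace .
    }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for row in results["results"]["bindings"]:
    print(row["birthDate"]["value"], row["birthPlace"]["value"])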

So what about music data? We’re not as far ahead with this work as we are with programmes but you still shouldn’t have to wait too long.

As I’ve written about before, we are using MusicBrainz to provide GUIDs for artists and releases, and to give us core metadata about those music resources. The use of MusicBrainz IDs means that we can relate all the BBC resources about an artist together, and others who also use these GUIDs (e.g. Metaweb) can use them to find our resources.
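As a trivial sketch of why that matters (the GUID below is a placeholder, and the URL patterns are assumptions about how MusicBrainz-keyed resources might be addressed), the same identifier lets you hop between datasets:

# Placeholder MusicBrainz artist GUID, not a real identifier.
mbid = "00000000-0000-0000-0000-000000000000"

# Because both sites key their artist resources on the same GUID, the one
# identifier (hypothetically) resolves to the same artist in both datasets.
musicbrainz_url = f"http://musicbrainz.org/artist/{mbid}"
bbc_music_url = f"http://www.bbc.co.uk/music/artists/{mbid}"

print(musicbrainz_url)
print(bbc_music_url)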

In terms of making the data accessible, it’s a similar story to programmes. We’re currently marking up relevant pages with microformats (hReview and hCard), but the plan is to publish BBC music resources in a variety of different representations (XML, Atom, RSS 2, JSON, APML etc.).

What resources are we talking about? In addition to the core data from MusicBrainz we’re also talking about album reviews (released under a Creative Commons License) and data from our music programmes, for example:

  • views aggregating the artists played on each station into both charts (most played that day, that week and since the prototype started running) and artist clouds;
  • programme views, which are similar to the station aggregation views but for each programme, with links through to each episode;
  • programme episodes with track listings which link through to artist pages;
  • artist pages with biographies pulled in from Wikipedia (MusicBrainz includes links to Wikipedia) and links back to the programmes that have featured that artist;
  • and much more…

As you can see, we’re not only making the data available as discrete resources, we are also linking them together – and making that data available in both human-readable and machine-readable views. This is a big job – it involves a lot of data, a lot of systems (both web and production systems) and it all needs to work under high load.

And what about the future? As Michael recently presented at Semantic Camp, our plans are to join programmes, music, events, users and topics. All available on the web, for ever, for people to explore, and open for machines to process. If you would like to find out more then Nick and I will be discussing this further at XTech next month in Dublin.

So I hope that, in our own way, we are ‘getting on with it’ – sorry for the delay.

Photo: Please open door slowly, by splorp. Used under licence.

Foo Camping

I’ve just published a short piece on my recent trip to San Francisco and the O’Reilly Foo Camp over at the BBC Radio Lab’s blog.

It was my first trip to San Francisco and I loved the city (you can see my photos on Flickr). But I was also struck by how meme-friendly the place is. I guess that’s not that surprising – it’s a relatively small city with a high density of tech companies in and around the Bay Area – but nonetheless it does appear to be a good place for tech memes to arise and flourish. Perhaps that’s one reason why that corner of the world produces so much innovative technology?

Anyway below is my blog post as published on the Radio Lab’s blog.

SGFoo08

“I’ve recently returned from a very enjoyable and educational trip to California, where I was honoured to be invited to attend the Social Graph Foo Camp. I have to say that while I found the whole thing very exciting, I was also, at times, left realising just how far behind some of the conversations I have become – it really is amazing how rapidly the issues and technology within this space are developing, and that’s in the context of a fast-moving industry.

It was, however, clear that the really big issues are social, not technological: user expectations, data ownership and portability. A key piece of the technology puzzle in all this, though, is the establishment of XFN and FOAF, which are going to play an increasingly important role in gluing different social networks together. And with the launch of Google’s Social Graph API (released under a Creative Commons license, by the way) data portability is going to really explode; but with it expect more “Scoblegate”-like incidents.

But the prize for getting this right is great, as illustrated by this clip of Joseph Smarr of Plaxo presenting on friends-list portability and who owns the data in social networks.

For my part, what I took away from this and other discussions is that although, on the surface, moving data between one social network and another is no different from copying a business card into Outlook, people’s expectations make it different. People don’t (yet) expect the data they enter on one site to suddenly appear on another. But they do expect to be able to easily find their friends within a new network. Google’s Social Graph API will make that easier – but there will be a price, as Tim O’Reilly points out:

“Google’s Social Graph API… will definitively end “security by obscurity” regarding people and their relationships, as well as opening up the social graph to “rel=me” spammers. The counter-argument is that all this data is available anyway, and that by making it more visible, we raise people’s awareness and ultimately their behavior.”

Tied to all of this, of course, is the rise of OpenID, the open and decentralized identity system, and OAuth, an open protocol to allow secure API authentication between applications. Both appear to be central to most people’s plans for the coming year.

So what were the other highlights? For me, I’m really excited by Tom Coates and Rabble’s latest Yahoo! project, Fire Eagle, which allows you to share your location with friends, other websites or services.

You can think of Fire Eagle as a location brokerage service. Via open APIs, other people can write applications that update Fire Eagle with your location, so that further applications can then use it. So for example, someone might write an application that runs on your mobile and triangulates your position based on the location of the transmitters before sending the data to Fire Eagle. You could then run an application on your phone that lets you know if your friends are nearby, what restaurants are in your area or where the nearest train or tube station is.

Obviously what Fire Eagle also provides is lots of security so you can control who and what applications have access to your location data. I can’t wait to see what people end up doing with Fire Eagle and I’m hoping that we can come up with some interesting applications too.

Finally, XMPP, which I have to say caught me a bit by surprise. If you’ve not come across it before, XMPP is a messaging and presence protocol developed by Jabber and now used by Google Talk, Jaiku and Apple’s iChat, amongst others (with a lot more clients on the way if last weekend was anything to go by).

XMPP is a much more efficient protocol than HTTP for two-way messaging because your application doesn’t need to check in with the servers periodically – instead, the server sends a signal via XMPP when new information is published. And there’s no need to limit that communication to person-to-person – XMPP can also be used for what is essentially machine-to-machine instant messaging, which gives you real-time communication between machines.

So based on last weekend’s Foo Camp it looks like XMPP, OpenID and OAuth are all going to be huge in 2008. Google’s Social Graph API and related technologies (FOAF and XFN) will result in some headaches while people’s understanding and expectations settle down, but it will be worth it as we move towards a world of data portability.”

Microformat injection

Ben Smith and I are attending Social Graph Foocamp this weekend – this is his post on the BBC development blog – which we’re setting free here on my blog.

While this isn’t a topic that anyone is talking about here, I was struck by a throwaway comment by Brad Fitzpatrick about the possibility of Microformat Injection. Everyone knows about XSS, and only the worst developers leave themselves exposed by allowing JavaScript through form submissions. However, allowing a subset of HTML through in requests, to then be published in, say, profile pages, is quite standard.

I’m not sure if it would ever be particularly dangerous, but Microformat Injection could be used to insert rel="me" links into pages, as <a> tags are quite regularly allowed through. Now I’m not very knowledgeable about microformats (RDFa seems shinier) so I’ll leave it up to you to think up some interesting Microformat Injection exploits – there’s a rough sketch of the idea below. Please comment if you think of any!
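To make the idea a little more concrete, here’s a small Python sketch using the bleach sanitisation library (my choice of library, purely for illustration): an allow-list that permits the rel attribute on <a> tags lets an injected rel="me" identity claim straight through, while dropping rel strips it.

import bleach

# Hypothetical user-submitted profile content containing a rel="me" claim.
submitted = '<a href="http://attacker.example/" rel="me">my homepage</a>'

# A sanitiser that allows the rel attribute lets the identity claim through...
loose = bleach.clean(submitted, tags={"a"}, attributes={"a": ["href", "rel"]})
print(loose)   # rel="me" survives, so the published page now asserts the link

# ...whereas dropping rel from the allow-list strips the injected claim.
strict = bleach.clean(submitted, tags={"a"}, attributes={"a": ["href"]})
print(strict)  # the rel attribute is removed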