Scientific publishing on the Web

As usual these are my thoughts, observations and musings, not those of my employer.

Scientific publishing has in many ways remained largely unchanged since 1665. Scientific discoveries are still published in journal articles where the article is a review, a piece of metadata if you will, of the scientists’ research.

[Image: cover of the first issue of Nature, 4 November 1869.]

This is of course not all bad. For example, I think it is fair to say that this approach has played a part in creating the modern world. The scientific project has helped us understand the universe, helped eradicate diseases, helped decrease child mortality and helped free us from the drudgery of mere survival. The process of publishing peer reviewed articles is the primary means of disseminating this human knowledge and as such has been, and remains, central to the scientific project.

Nor, if I am being honest, is it entirely fair to claim that things haven’t changed in all those years – clearly they have. Recently new technologies, notably the Web, have made it easier to publish and disseminate those articles, which in turn has led to changes in the associated business models of publishers, e.g. Open Access publications.

However, it seems to me that scientific publishers and the scientific community at large have yet to fully utilize the strengths of the Web.

Content is distributed over HTTP, but what is distributed is still, in essence, a print journal over the Web. Little has changed since 1665 – the primary objects, the things an STM publisher publishes, remain the article, issue and journal.

The power of the Web is its ability to share information via URIs and more specifically its ability to globally distribute a wide range of documents and media types (from text to video to raw data and software (as source code or as binaries)). The second and possibly more powerful aspect of the Web is its ability to allow people to recombine information, to make assertions and statements about things in the world and information on the Web. These assertions can create new knowledge and aid discoverability of information.

This is not to say that there shouldn’t be research articles and journals – both provide value. Journals, for example, provide a useful point of aggregation and quality assurance for the author and reader. The article is an immutable summary of the researchers’ work at a given date and, of course, the paper remains the primary means of communication between scientists. However, the Web provides mechanisms to greatly enhance the article, to make it more discoverable and to place it in a wider context.

In addition to the published article, STM publishers already publish supporting material in the form of ‘supplementary information’; unfortunately this is often little more than a PDF document. However, it is also not clear (to me at least) if the article is the right location for some of this material – it appears to me that a more useful approach is that of the ‘Research Object’ [pdf], semantically rich aggregations of resources, as proposed by the Force11 community.

It seems to me that the notion of a Research Object as the primary published object is a powerful one. One that might make research more useful.

What is a Research Object?

Well, what I mean by a Research Object is a URI (and, if one must, a DOI) that identifies a distinct piece of scientific work: an Open Access ‘container’ that would allow an author to group together all the aspects of their research in a single location. The resources within it might include:

  • The published article or articles if a piece of research resulted in a number of articles (whether they be OA or not);
  • The raw data behind the paper(s) or individual figures within the paper(s) (published in a non-proprietary format e.g. csv not Excel);
  • The protocols used (so an experiment can be easily replicated);
  • Supporting or supplementary video;
  • URLs to News and Views or other commentary from the Publisher or elsewhere;
  • URLs to news stories;
  • URLs to university reading lists;
  • URLs to profile pages of the authors and researchers involved in the work;
  • URLs to the organizations involved in the work (e.g. funding bodies, host university or research lab etc.);
  • Links to other research (both historical i.e. bibliographic information but also research that has occurred since publication).

Furthermore, the relationships between the different entities within a Research Object should be explicit. It is not enough to treat a Research Object as a bag of stuff; there should be stated, explicit relationships between the resources held within it. For example, the relationship between the research and the funding organization should be defined via a vocabulary (e.g. funded_by); likewise any raw data should be identified as such and, where appropriate, linked to the relevant figures within a paper.
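To make "explicit relationships" a little more concrete, here is a minimal sketch in Python, purely for illustration. The `ResearchObject` class, all of the URIs and every predicate other than funded_by (`has_article`, `has_raw_data`, `underlies_figure`) are my assumptions and not part of any published vocabulary.

```python
from dataclasses import dataclass, field


@dataclass
class ResearchObject:
    """A container identified by a URI whose contents are linked by named relationships."""
    uri: str
    # Each statement is a (subject, predicate, object) triple of URIs.
    statements: set = field(default_factory=set)

    def relate(self, subject, predicate, obj):
        """Record an explicit, typed relationship between two resources."""
        self.statements.add((subject, predicate, obj))

    def objects(self, subject, predicate):
        """Return, sorted, everything the subject is linked to by the predicate."""
        return sorted(o for s, p, o in self.statements if s == subject and p == predicate)


# Hypothetical URIs, for illustration only.
RO = "http://example.org/research/42"
ARTICLE = "http://example.org/research/42/article"
DATA = "http://example.org/research/42/data.csv"
FIGURE = "http://example.org/research/42/article#figure-1"
FUNDER = "http://example.org/org/some-funding-body"

ro = ResearchObject(RO)
ro.relate(RO, "funded_by", FUNDER)
ro.relate(RO, "has_article", ARTICLE)
ro.relate(RO, "has_raw_data", DATA)
ro.relate(DATA, "underlies_figure", FIGURE)

print(ro.objects(RO, "funded_by"))
print(ro.objects(DATA, "underlies_figure"))
```

The point is only that each link carries a name, so a machine can tell the funder from the raw data from the figure, rather than receiving an undifferentiated bag of URLs.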

Something like this:

[Diagram: domain model of a Research Object, showing its major components.]

It is important to note that while the Research Object is open access the resources it contains may or may not be. For example, the raw data might be open whereas the article might not. People would therefore be able to reference the Research Object, point to it on the Web, discuss it and make assertions about it.

In the FRBR world a Research Object would be a Work i.e. a “distinct intellectual creation”.

Making research more discoverable

The current publishing paradigm places serious limitations on the discoverability of research articles (or research objects).

Scientists work with others to research a domain of knowledge; in some respects therefore research articles are metadata about the universe (or at least the experiment). They are assertions, made by a group of people, about a particular thing based on their research and the data gathered. It would therefore be helpful if scientists could discover prior research along these lines of enquiry.

Implicit in the above description of a Research Object is the need to publish URIs about: people, organisations (universities, research labs, funding bodies etc.) and areas of research.

These URIs and the links between them would provide a rich network of science – a graph that describes and maps out the interrelationships between people, organisations and their areas of interest, each annotated with research objects. Such a graph would also allow for pages such as:

  • All published research by an author;
  • All published research by a research lab;
  • The researchers that have worked together in a lab;
  • The researchers who have collaborated on a published paper;
  • The areas of research by lab, funding body or individual;
  • Etc.

Such a graph would help readers to both ‘follow their nose’ to discover research and provide meaningful landing pages for search.
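As a rough sketch of how such pages might fall out of the graph, the Python below stores a handful of links and answers three of the questions listed above by following them. The identifiers, the predicate names (`author`, `member_of`) and the helper functions are all made up for the example; a real system would more likely use an RDF store and a query language.

```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) links; every name here is made up.
LINKS = [
    ("research/1", "author", "people/alice"),
    ("research/1", "author", "people/bob"),
    ("research/2", "author", "people/alice"),
    ("research/2", "author", "people/carol"),
    ("people/alice", "member_of", "labs/genomics"),
    ("people/bob", "member_of", "labs/genomics"),
    ("people/carol", "member_of", "labs/proteomics"),
]


def build_index(links):
    """Index links by predicate in both directions so they can be followed either way."""
    forward = defaultdict(set)   # (subject, predicate) -> objects
    backward = defaultdict(set)  # (object, predicate) -> subjects
    for s, p, o in links:
        forward[(s, p)].add(o)
        backward[(o, p)].add(s)
    return forward, backward


def research_by_author(backward, person):
    """All published research by an author."""
    return sorted(backward[(person, "author")])


def research_by_lab(backward, lab):
    """All published research by the members of a lab."""
    members = backward[(lab, "member_of")]
    return sorted({r for m in members for r in backward[(m, "author")]})


def collaborators(forward, backward, person):
    """Everyone who has shared a piece of research with this person."""
    papers = backward[(person, "author")]
    return sorted({a for r in papers for a in forward[(r, "author")]} - {person})


forward, backward = build_index(LINKS)
print(research_by_author(backward, "people/alice"))
print(research_by_lab(backward, "labs/proteomics"))
print(collaborators(forward, backward, "people/alice"))
```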

Digital curation

One of the significant benefits a journal brings to its readership is the role of curation. The editors of the journal select and publish the best research for their readers. On the Web there is no reason this role couldn’t be extended beyond the editor to the users and readers of a site.

Different readers will have different motivations for doing so but providing a mechanism for those users to aggregate and annotate research objects provides a new and potentially powerful mechanism by which scientific discoveries could be surfaced.

For example, a lecturer might curate a collection of papers for an undergraduate class on genomics, combining research objects with their own comments, video and links to other content across the web. This collection could then be shared and used more widely with other lecturers. Alternatively a research lab might curate a collection of papers relevant to their area of research but choose to keep it private.

Providing a rich web of semantically linked resources in this way would allow for the development of a number of different metrics (in addition to Impact Factor). These metrics would not need to be limited to scientific impact; they could be extended to cover:

  • Educational indices – a measure of the citations in university reading lists;
  • Social impact – a measure of citations in the mainstream media;
  • Scientific impact of individual papers;
  • Impact of individual scientists or research labs;
  • Etc.

Such metrics could be used directly, e.g. as research indexes, or indirectly, e.g. to help readers find the best or most relevant content.
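A toy illustration of the idea, in Python: once citations are linked data, splitting them by the kind of thing doing the citing gives you one raw count per metric. The records, the source types and the metric labels are all assumptions for the sake of the example, and plain counts like these are of course far cruder than anything like Impact Factor.

```python
from collections import Counter

# Hypothetical citation records: (type of citing source, cited research object).
CITATIONS = [
    ("reading_list", "research/1"),
    ("reading_list", "research/1"),
    ("news_story", "research/1"),
    ("journal_article", "research/1"),
    ("news_story", "research/2"),
    ("journal_article", "research/2"),
    ("journal_article", "research/2"),
]

# Which kind of citing source feeds which metric (the labels are assumptions).
METRIC_FOR_SOURCE = {
    "reading_list": "educational",
    "news_story": "social",
    "journal_article": "scientific",
}


def metrics(citations):
    """Count citations per research object, split by the kind of source citing it."""
    counts = {}
    for source_type, research in citations:
        metric = METRIC_FOR_SOURCE[source_type]
        counts.setdefault(research, Counter())[metric] += 1
    return counts


def rank(citations, metric):
    """Order research objects by one metric, highest first, ties broken by name."""
    counts = metrics(citations)
    return sorted(counts, key=lambda r: (-counts[r][metric], r))


print(metrics(CITATIONS)["research/1"])
print(rank(CITATIONS, "scientific"))
```

The same `rank` function covers both uses mentioned above: published as-is it is a (very simple) index, and used behind the scenes it is a way of ordering search results.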

Finally it is worth remembering that in all cases this information should be available for both humans and machines to consume and process. In other words this information should be available in structured, machine readable formats.

Science ontology — take three

Paul, Michael and Silver have done a bit more work refining the nascent science ontology — unfortunately I was caught up doing something a lot less interesting so this version is all their work and not mine, and it is all the better for it.

The big change in this version is the removal of much of the publication-specific stuff, since this is handled elsewhere; otherwise it should look like a fairly obvious evolution from the previous versions.

Version 3 of the science domain model

And here’s an N3 serialisation of the model. There’s still lots to do: we need to check what happens when multiple ranges are given for a property, write proper definitions, add namespaces, look for existing ontologies to reuse, etc.

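# Science Ontology - First version!
# Still to do:
#   Declare namespaces (the so: namespace below is only a placeholder)
#   Define ontology (name, author etc)
#   Finish definitions
#   Look for existing ontologies for reuse etc.
#   Publish!

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
# Placeholder only: the real so: namespace has not been declared yet.
@prefix so: <http://example.org/science-ontology#> .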

# Classes

so:Observation a owl:Class;
	rdfs:label "Observation";
	rdfs:comment "Definition goes here" .

so:Hypothesis a owl:Class;
	rdfs:label "Hypothesis";
	rdfs:comment "Definition goes here" .

so:Experiment a owl:Class;
	rdfs:label "Experiment";
	rdfs:comment "Definition goes here" .

so:Equipment a owl:Class;
	rdfs:label "Equipment";
	rdfs:comment "Definition goes here" .

so:Method a owl:Class;
	rdfs:label "Method";
	rdfs:comment "Definition goes here" .

so:Collaboration a owl:Class;
	rdfs:label "Collaboration";
	rdfs:comment "Definition goes here" .

so:ExperimentalObservation a owl:Class;
	rdfs:label "Experimental Observation";
	rdfs:comment "Definition goes here";
	rdfs:subClassOf so:Observation .

so:Data a owl:Class;
	rdfs:label "Data";
	rdfs:comment "Definition goes here" .

so:Analysis a owl:Class;
	rdfs:label "Analysis";
	rdfs:comment "Definition goes here" .

so:Publication a owl:Class;
	rdfs:label "Publication";
	rdfs:comment "Definition goes here" .

so:Theory a owl:Class;
	rdfs:label "Theory";
	rdfs:comment "Definition goes here" .

so:Prediction a owl:Class;
	rdfs:label "Prediction";
	rdfs:comment "Definition goes here" .

so:Agent a owl:Class;
	rdfs:label "Agent";
	rdfs:comment "Definition goes here";
	rdfs:subClassOf foaf:Agent .

# Properties

so:inspiredBy a owl:ObjectProperty;
	rdfs:label "inspiredBy";
	rdfs:comment "definition goes here - but what happens with multiple ranges? hypotheses can be inspired by Observations, Theories and Predictions...";
	rdfs:domain so:Hypothesis;
	rdfs:range so:Observation;
	rdfs:range so:Theory;
	rdfs:range so:Prediction .

so:makes a owl:ObjectProperty;
	rdfs:label "makes";
	rdfs:comment "definition goes here";
	rdfs:domain so:Theory;
	rdfs:range so:Prediction .

so:tests a owl:ObjectProperty;
	rdfs:label "tests";
	rdfs:comment "definition goes here";
	rdfs:domain so:Experiment;
	rdfs:range so:Hypothesis .

so:equipment a owl:ObjectProperty;
	rdfs:label "equipment";
	rdfs:comment "Relates a piece of equipment to an experiment it is used in.";
	rdfs:domain so:Experiment;
	rdfs:range so:Equipment .

so:method a owl:ObjectProperty;
	rdfs:label "method";
	rdfs:comment "Relates a method to an experiment it was used in.";
	rdfs:domain so:Experiment;
	rdfs:range so:Method .

so:experimentalObservation a owl:ObjectProperty;
	rdfs:label "experimental observation";
	rdfs:comment "Relates an observation made as a result of an experiment to the experiment it was made in.";
	rdfs:domain so:Experiment;
	rdfs:range so:ExperimentalObservation .

so:captures a owl:ObjectProperty;
	rdfs:label "captures";
	rdfs:comment "Relates data to an experimental observation it was captured in.";
	rdfs:domain so:ExperimentalObservation;
	rdfs:range so:Data .

so:analyses a owl:ObjectProperty;
	rdfs:label "analyses";
	rdfs:comment "Definition goes here";
	rdfs:domain so:Analysis;
	rdfs:range so:Data .

so:published a owl:ObjectProperty;
	rdfs:label "published";
	rdfs:comment "Relates an Analysis to a Publication it was published in.";
	rdfs:domain so:Analysis;
	rdfs:range so:Publication .

# Analysis to Theory

so:establishes a owl:ObjectProperty;
	rdfs:label "establishes";
	rdfs:comment "Definition goes here";
	rdfs:domain so:Analysis;
	rdfs:range so:Theory .

so:validates a owl:ObjectProperty;
	rdfs:label "validates";
	rdfs:comment "Definition goes here.";
	rdfs:domain so:Analysis;
	rdfs:range so:Theory .

so:modifies a owl:ObjectProperty;
	rdfs:label "modifies";
	rdfs:comment "Definition goes here.";
	rdfs:domain so:Analysis;
	rdfs:range so:Theory .

so:contradicts a owl:ObjectProperty;
	rdfs:label "contradicts";
	rdfs:comment "Definition goes here.";
	rdfs:domain so:Analysis;
	rdfs:range so:Theory .

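# Analysis to Hypothesis
# Note: so:modifies is declared a second time below. Both declarations describe
# the same property, so its two ranges (Theory and Hypothesis) are combined.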

so:supports a owl:ObjectProperty;
	rdfs:label "supports";
	rdfs:comment "Definition goes here.";
	rdfs:domain so:Analysis;
	rdfs:range so:Hypothesis .

so:modifies a owl:ObjectProperty;
	rdfs:label "modifies";
	rdfs:comment "Definition goes here.";
	rdfs:domain so:Analysis;
	rdfs:range so:Hypothesis .

so:disproves a owl:ObjectProperty;
	rdfs:label "disproves";
	rdfs:comment "Definition goes here.";
	rdfs:domain so:Analysis;
	rdfs:range so:Hypothesis .

# Agent properties

so:proposes a owl:ObjectProperty;
	rdfs:label "proposes";
	rdfs:comment "Definition goes here.";
	rdfs:domain so:Agent;
	rdfs:range so:Hypothesis .

so:collaborates a owl:ObjectProperty;
	rdfs:label "collaborates";
	rdfs:comment "Definition goes here.";
	rdfs:domain so:Agent;
	rdfs:range so:Collaboration .

so:funds a owl:ObjectProperty;
	rdfs:label "funds";
	rdfs:comment "Definition goes here.";
	rdfs:domain so:Agent;
	rdfs:range so:Experiment .

so:performs a owl:ObjectProperty;
	rdfs:label "performs";
	rdfs:comment "Definition goes here.";
	rdfs:domain so:Agent;
	rdfs:range so:Experiment .

so:observes a owl:ObjectProperty;
	rdfs:label "observes";
	rdfs:comment "Definition goes here.";
	rdfs:domain so:Agent;
	rdfs:range so:Observation .

so:forms a owl:ObjectProperty;
	rdfs:label "forms";
	rdfs:comment "Definition goes here.";
	rdfs:domain so:Agent;
	rdfs:range so:Analysis .

so:creates a owl:ObjectProperty;
	rdfs:label "creates";
	rdfs:comment "Definition goes here.";
	rdfs:domain so:Agent;
	rdfs:range so:Publication .

so:creditedWith a owl:ObjectProperty;
	rdfs:label "credited with";
	rdfs:comment "Definition goes here.";
	rdfs:domain so:Agent;
	rdfs:range so:Theory .

so:participates a owl:ObjectProperty;
	rdfs:label "participates";
	rdfs:comment "Definition goes here.";
	rdfs:domain so:Agent;
	rdfs:range so:Agent .

so:collaboratesOn a owl:ObjectProperty;
	rdfs:label "collaborates on";
	rdfs:comment "Definition goes here.";
	rdfs:domain so:Collaboration;
	rdfs:range so:Experiment;
	rdfs:range so:Hypothesis .
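On the open question of multiple ranges: under RDFS semantics, several rdfs:range statements on one property are read together, not as alternatives. So anything used as the value of so:inspiredBy would be inferred to be an Observation and a Theory and a Prediction all at once, which is probably not what we want (a single union class, for example one built with owl:unionOf, is the usual way round this). The Python sketch below applies just that one inference rule by hand to show the effect. It is a toy and not a reasoner, and the ex: instance names are made up.

```python
# The three range statements from so:inspiredBy above, as plain tuples.
SCHEMA = [
    ("so:inspiredBy", "rdfs:range", "so:Observation"),
    ("so:inspiredBy", "rdfs:range", "so:Theory"),
    ("so:inspiredBy", "rdfs:range", "so:Prediction"),
]

# One hypothetical instance statement; the ex: names are made up.
DATA = [("ex:hypothesis1", "so:inspiredBy", "ex:thing1")]


def infer_types_from_range(schema, data):
    """Apply the RDFS range rule: if P has range C and (s P o) holds, then o is a C."""
    ranges = {}
    for prop, predicate, cls in schema:
        if predicate == "rdfs:range":
            ranges.setdefault(prop, set()).add(cls)
    inferred = set()
    for _subject, prop, obj in data:
        for cls in ranges.get(prop, ()):
            inferred.add((obj, "rdf:type", cls))
    return inferred


for triple in sorted(infer_types_from_range(SCHEMA, DATA)):
    print(triple)
```

Running it yields three rdf:type statements for the single value ex:thing1, one per declared range, which is the behaviour the comment on so:inspiredBy was worried about.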

Apis and APIS a wildlife ontology

By a mile the highlight of last week or so was the 2nd Linked Data meet-up. Silver and Georgi did a great job of organising the day and I came away with a real sense that not only are we on the cusp of seeing a lot of data on the web but also that the UK is at the centre of this particular revolution. All very exciting.

For my part I presented the work we’ve been doing on Wildlife Finder – how we’re starting to publish and consume data on the web. Ed Summers has a great write-up of what we’re doing; I’ve also published my slides.

I also joined Paul Miller, Jeni Tennison, Ian Davis and Timo Hannay on a panel session discussing Linked Data in the enterprise.

In terms of Wildlife Finder there are a few things that I wanted to highlight:

  1. If you’re interested in the RDF and how we’re modelling the data we’ve documented the wildlife ontology here. In addition to the ontology itself we’ve also included some background on why we modelled the information in the way we have.
  2. If you want to get your hands on the RDF/XML then either add .rdf to the end of most of our URLs (more on this later) or configure your client to request RDF/XML – we’ve implemented content negotiation so you’ll just get the data.
  3. But… we’ve not implemented everything just yet. Specifically the adaptations aren’t published as RDF – this is because we’re making a few changes to the structure of this information and I didn’t want to publish the data and then change it. Nor have we published information on the species’ conservation status; that’s simply because we’ve not finished yet (sorry).
  4. It’s not all RDF – we are also marking-up our taxa pages with the species microformat which gives more structure to the common and scientific names.

Anyway I hope you find this useful.

Online information conference

I’ve really been neglecting this blog recently – apologies, but my attention has been elsewhere. Anyway, while I get round to actually writing something, here’s a presentation I gave at the Online Information Conference recently.

The presentation is largely based upon the article Michael and I wrote for Nodalities this time last year.

Lego, Wombles and Linked Data

As a child I loved Lego. I could let my imagination run riot, design and build cars, space stations, castles and airplanes.

Blue lego brick

My brother didn’t like Lego, instead preferring to play with Action Men and toy cars. These sorts of toys did nothing for me, and from the perspective of an adult I can understand why. I couldn’t modify them, I couldn’t create anything new. Perhaps I didn’t have a good enough imagination because I needed to make my ideas real. I wanted to build things, I still do.

Then the most exciting thing happened. My dad bought a BBC Micro.

Obviously computers such as the BBC Micro were in many, many ways different from today’s Macs and, if you must, PCs. They were several orders of magnitude less powerful than today’s computers but, importantly, they were designed to be programmed by the user; you were encouraged to do so. It was expected that that’s what you would do. So from a certain perspective they were more powerful.

BBC Micros didn’t come preloaded with word processors, spreadsheets and graphics editors and they certainly weren’t WIMPs.

What they did come with was BBC BASIC and Assembly Language.

They also came with two thick manuals. One telling you how to set the computer up; the other how to programme it.

This was all very exciting, I suddenly had something with which I could build incredibly complex things. I could, in theory at least, build something that was more complex than the planes, spaceships and cars which I modelled with Lego a few years before.

Like so many children of my age I cut my computing teeth on the BBC Micro. Learnt to programme computers, and played a lot of games!

Unfortunately all was not well. You see I wasn’t very good at programming my BBC Micro. I could never actually build the things I had pictured in my mind’s eye; I just wasn’t talented enough.

You see Lego hit a sweet spot which those early computers on the one hand and Action Man on the other missed.

What Lego provided was reusable bits.

When Christmas or my birthdays came around I would start off by building everything suggested by the sets I was given. But I would then dismantle the models and reuse those bricks to build something new, whatever was in my head. By reusing bricks from lots of different sets I could build different models. The more sets I got given, the more things I could build.

Action Men simply didn’t offer any of those opportunities; I couldn’t create anything new.

Early computers were certainly very capable of providing a creative platform; but they lacked the reusable bricks, it was more like being given an infinite supply of clay. And clay is harder to reuse than bricks.

Today, with the online world we are in a similar place but with digital bits and bytes rather than moulded plastic bits and bricks.

The Web allows people to create their own stories – it allows people to follow their nose to create threads through the information about the things that interest them, commenting on and discussing it on the way. But the Web also allows developers to reuse previously published information within new, different contexts to tell new stories.

But only if we build it right.

Most Lego bricks are designed to allow you to stick one brick to another. But not all bricks can be stuck to all others. Some can only be put at the top – these are the tiles and pointy bricks to build your spires, turrets and roofs. These bricks are important, but they can only be used at the end because you can’t build on top of them.

The same is true of the Web – we need to start by building the reusable bits, then the walls and only then the towers and spires and twiddly bits.

But this can be difficult – the shiny towers are seductive and the draw to start with them can be strong; only to find out that you then need to knock it all down and start again when you want to reuse the bits inside.

We often don’t give ourselves the best opportunity to womble with what we’ve got – to reuse what others make, to reuse what we make ourselves. Or to let others outside our organisations build with our stuff. If you want to take these opportunities then publish your data the webby way.