Some thoughts on working out who to trust online

The deplorable attempts to use social media (and much of the mainstream media’s response) to find the bombers of the Boston marathon and then the tweets coming out of the Social Media Summit in New York got me thinking again about how we might get a better understanding of who and what to trust online.

When it comes to online trust I think there are two related questions we should be asking ourselves as technologists:

  1. can we help people better evaluate the accuracy, trustworthiness or validity of a given news story, tweet, blogpost or other publication?;
  2. and can we use social media to better filter those publications to find the most trustworthy sources or articles?

This second point is also relevant in scientific publishing (a thing I’m trying to help out with these days) where there is keen interest in ‘altmetrics’ as a mechanism to help readers discover and filter research articles.

In academic publishing the need for altmetrics has been driven in part by the rise in the number of articles published which in turn is being fuelled by the uptake of Open Access publishing. However, I would like to think that we could apply similar lessons to mainstream media output.

MEDLINE literature growth chart

Historically a publisher’s brand has, at least in theory, helped its readers to judge the value and trustworthiness of an article. If I see an article published in Nature, the New York Times or broadcast by the BBC the chances are I’m more likely to trust it than an article published in say the Daily Mail.

Academic publishing has even gone so far as to codify this in a journal’s Impact Factor (IF), an idea that Larry Page later used as the basis for his PageRank algorithm.

The premiss behind the Impact Factor is that you can identify the best journals and therefore the best content by measuring the frequency with which the average article in that journal has been cited in a particular year or period.
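As a rough illustration of that premiss, here is a minimal sketch of the commonly used two-year form of the calculation: citations received in a given year to items the journal published in the previous two years, divided by the number of citable items it published in those two years. The function name and the numbers are made up for illustration.

```python
def impact_factor(citations_this_year: int, citable_items: int) -> float:
    """Two-year Impact Factor style ratio.

    citations_this_year: citations received this year to items the
        journal published in the previous two years.
    citable_items: number of citable items the journal published in
        those same two years.
    """
    if citable_items <= 0:
        raise ValueError("citable_items must be positive")
    return citations_this_year / citable_items


# Illustrative numbers only: 1200 citations to 400 citable items.
example = impact_factor(citations_this_year=1200, citable_items=400)
print(example)  # 3.0
```

Note that this is a single average for the whole journal; it says nothing about how any individual article has been cited.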

Simplistically then, a journal can improve its Impact Factor by ensuring it only publishes the best research. ‘Good Journals’ can then act as trusted guides to their readership – pre-filtering the world’s research output to bring their readers only the best.

Obviously this can go wrong. Good research is published outside high Impact Factor journals; journals can publish poor research; and mainstream media is so rife with examples of published piffle that the likes of Ben Goldacre can make a career out of exposing it.

As is often noted the web has enabled all of us to be publishers. It scarcely needs saying that it is now trivially easy for anyone to broadcast their thoughts or post a video or photograph to the Web.

This means that social media is now able to ‘break’ a story before the mainstream media. However, it also presents a problem: how do you know if it’s true? Without brands (or IF) to help guide you how do you judge if a photo, tweet or blogpost should be trusted?

There are plenty of services out there that aggregate tweets, comments, likes, +1s etc. to help you find the most talked about story. Indeed most social media services themselves let you find ‘what’s hot’ or most talked about. All these services seem, however, to assume that there is wisdom in crowds – that the more talked about something is, the more trustworthy it is. But as Oliver Reichenstein pointed out:

“There is one thing crowds have a flair for, and it is not wisdom, it’s rage.”

Relying on point data (most tweeted, most commented etc.) to help filter content or evaluate its trustworthiness, whether that content is social media or mainstream media, seems to me to be foolish.

It seems to me that a better solution would be to build a ‘trust graph’ which in turn could be used to assign a score to each person for a given topic based on their network of friends and followers. It could work something like this…

If a person is followed by a significant number of people who have published peer reviewed papers on a given topic, or if they have published in that field, then we should trust what that person says about that topic more than the average person.

Equally if a person has posted a large number of photos, tweets etc. over a long period of time from a given city and they are followed by other people from that city (as defined by someone who has a number of posts, over a period of time from that city) then we might conclude that their photographs are going to be from that city if they say they are.

Or if a person is retweeted by someone that for other reasons you trust (e.g. because you know them) then that might give you more confidence their comments and posts are truthful and accurate.

PageRank is Google's link analysis algorithm, that assigns a numerical weighting to each element of a hyperlinked set of documents, with the purpose of "measuring" its relative importance within the set.

Whatever the specifics, the point I’m trying to make is that, rather than relying on a single number or count, we should try to build a directed graph where each person can be assigned a trust or knowledge score based on the strength of their network in that subject area. This is somewhat analogous to Google’s PageRank algorithm.

Before Google, search engines effectively counted the frequency of a given word on a Webpage to assign it a relevancy score – much as we do today when we count the number of comments, tweets etc. to help filter content.

What Larry Page realised was that by assigning a score based on the number and weight of inbound links for a given keyword he and Sergey Brin were able to design and build a much better search engine – one that relies not just on what the publisher tells us, nor simply on the number of links but on the quality of those links. A link from a trusted source is worth more than a link from an average webpage.

Building a trust graph along similar lines – where we evaluate not just the frequency of (re)tweets, comments, likes and blogposts but also consider who those people are, who’s in their network and what their network of followers think of them – could help us filter and evaluate content whether it be social or mainstream media and minimise the damage of those who don’t tweet responsibly.
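To make the idea a little more concrete, here is a minimal sketch of one way such a topic-specific trust score might be computed. It is a PageRank-style power iteration over a follower graph, seeded with people for whom we have independent evidence of expertise (for example, peer reviewed papers on the topic). It is not Google’s algorithm and not a worked-out design; the function name, the damping value and the account names are all assumptions made for illustration.

```python
def trust_scores(follows, seeds, damping=0.85, iterations=50):
    """PageRank-style trust propagation for a single topic.

    follows: dict mapping a person to the set of people they follow
        (following is treated as an endorsement).
    seeds: people with independent evidence of expertise on the topic.
    Returns a dict of person -> score; scores sum to 1.
    """
    if not seeds:
        raise ValueError("at least one seed is required")
    people = set(follows) | set(seeds)
    for targets in follows.values():
        people |= set(targets)

    # Trust originates only from the seeds.
    base = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in people}
    score = dict(base)

    for _ in range(iterations):
        nxt = {p: (1.0 - damping) * base[p] for p in people}
        for person in people:
            targets = follows.get(person, ())
            if targets:
                # Split this person's trust among the people they follow.
                share = damping * score[person] / len(targets)
                for target in targets:
                    nxt[target] += share
            else:
                # Follows nobody: hand their trust back to the seeds.
                for p in people:
                    nxt[p] += damping * score[person] * base[p]
        score = nxt
    return score


# Hypothetical accounts: two published researchers follow "blogger";
# "loud_account" is followed only by someone nobody vouches for.
follows = {
    "researcher_a": {"blogger"},
    "researcher_b": {"blogger"},
    "stranger": {"loud_account"},
}
scores = trust_scores(follows, seeds={"researcher_a", "researcher_b"})
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 3))
```

In this toy graph the blogger inherits trust from the researchers who follow them, while the loud account scores nothing, however often it is retweeted by people outside the trusted network.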

Scientific publishing on the Web

As usual these are my thoughts, observations and musings not those of my employer.

Scientific publishing has in many ways remained largely unchanged since 1665. Scientific discoveries are still published in journal articles where the article is a review, a piece of metadata if you will, of the scientists’ research.

Nature 1869
Cover of the first issue of Nature, 4 November 1869.

This is of course not all bad. For example, I think it is fair to say that this approach has played a part in creating the modern world. The scientific project has helped us understand the universe, helped eradicate diseases, helped decrease child mortality and helped free us from the drudgery of mere survival. The process of publishing peer reviewed articles is the primary means of disseminating this human knowledge and as such has been, and remains, central to the scientific project.

And, if I am being honest, nor is it entirely fair to claim that things haven’t changed in all those years – clearly they have. Recently new technologies, notably the Web, have made it easier to publish and disseminate those articles, which in turn has led to changes in the associated business models of publishers, e.g. Open Access publications.

However, it seems to me that scientific publishers and the scientific community at large have yet to fully utilize the strengths of the Web.

Content is distributed over HTTP but what is distributed is still, in essence, a print journal over the Web. Little has changed since 1665 – the primary objects, the things an STM publisher publishes, remain the article, issue and journal.

The power of the Web is its ability to share information via URIs and more specifically its ability to globally distribute a wide range of documents and media types (from text to video to raw data and software (as source code or as binaries)). The second and possibly more powerful aspect of the Web is its ability to allow people to recombine information, to make assertions and statements about things in the world and information on the Web. These assertions can create new knowledge and aid discoverability of information.

This is not to say that there shouldn’t be research articles and journals – both provide value. For example, journals provide a useful point of aggregation and quality assurance to the author and reader. The article is an immutable summary of the researchers’ work at a given date and, of course, the paper remains the primary means of communication between scientists. However, the Web provides mechanisms to greatly enhance the article, to make it more discoverable and to place it in a wider context.

In addition to the published article, STM publishers already publish supporting information in the form of ‘supplementary information’; unfortunately this is often little more than a PDF document. However, it is also not clear (to me at least) if the article is the right location for some of this material – it appears to me that a more useful approach is that of the ‘Research Object’ [pdf], semantically rich aggregations of resources, as proposed by the Force11 community.

It seems to me that the notion of a Research Object as the primary published object is a powerful one. One that might make research more useful.

What is a Research Object?

Well, what I mean by a Research Object is a URI (and if one must a DOI) that identifies a distinct piece of scientific work: an Open Access ‘container’ that would allow an author to group together all the aspects of their research in a single location. The resources within it might include:

  • The published article or articles if a piece of research resulted in a number of articles (whether they be OA or not);
  • The raw data behind the paper(s) or individual figures within the paper(s) (published in a non-proprietary format e.g. csv not Excel);
  • The protocols used (so an experiment can be easily replicated);
  • Supporting or supplementary video;
  • URLs to News and Views or other commentary from the Publisher or elsewhere;
  • URLs to news stories;
  • URLs to university reading lists;
  • URLs to profile pages of the authors and researchers involved in the work;
  • URLs to the organizations involved in the work (e.g. funding bodies, host university or research lab etc.);
  • Links to other research (both historical i.e. bibliographic information but also research that has occurred since publication).

Furthermore, the relationships between the different entities within a Research Object should be explicit. It is not enough to treat a Research Object as a bag of stuff; there should be stated and explicit relationships between the resources held within a Research Object. For example, the relationship between the research and the funding organization should be defined via a vocabulary (e.g. funded_by); likewise any raw data should be identified as such and where appropriate linked to the relevant figures within a paper.
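A minimal sketch of what ‘explicit relationships’ could look like in practice is to hold each statement as a subject, predicate, object triple rather than as an untyped list of attachments. Only funded_by comes from the example above; the class names, the other predicates and the example.org URIs are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Relation:
    """One explicit statement: subject --predicate--> obj."""
    subject: str
    predicate: str
    obj: str


@dataclass
class ResearchObject:
    """A URI plus typed relationships between the resources it groups."""
    uri: str
    relations: List[Relation] = field(default_factory=list)

    def relate(self, subject: str, predicate: str, obj: str) -> None:
        self.relations.append(Relation(subject, predicate, obj))

    def objects(self, subject: str, predicate: str) -> List[str]:
        """Everything that `subject` points to via `predicate`."""
        return [r.obj for r in self.relations
                if r.subject == subject and r.predicate == predicate]


# All URIs below are placeholders.
ro = ResearchObject("https://example.org/research-objects/1")
article = "https://example.org/articles/1"
raw_data = "https://example.org/data/1.csv"

ro.relate(ro.uri, "funded_by", "https://example.org/organisations/funder")
ro.relate(ro.uri, "has_article", article)
ro.relate(ro.uri, "has_raw_data", raw_data)
ro.relate(raw_data, "underlies_figure", article + "#figure-2")

print(ro.objects(ro.uri, "funded_by"))
print(ro.objects(raw_data, "underlies_figure"))
```

Because every link carries a named predicate, a consumer can ask ‘who funded this?’ or ‘which data sits behind figure 2?’ without guessing from file names.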

Something like this:

Domain model of a Research Object
The major components of a Research Object.

It is important to note that while the Research Object is open access the resources it contains may or may not be. For example, the raw data might be open whereas the article might not. People would therefore be able to reference the Research Object, point to it on the Web, discuss it and make assertions about it.

In the FRBR world a Research Object would be a Work i.e. a “distinct intellectual creation”.

Making research more discoverable

The current publishing paradigm places serious limitations on the discoverability of research articles (or research objects).

Scientists work with others to research a domain of knowledge; in some respects therefore research articles are metadata about the universe (or at least the experiment). They are assertions, made by a group of people, about a particular thing based on their research and the data gathered. It would therefore be helpful if scientists could discover prior research along these lines of enquiry.

Implicit in the above description of a Research Object is the need to publish URIs about: people, organisations (universities, research labs, funding bodies etc.) and areas of research.

These URIs and the links between them would provide a rich network of science – a graph that describes and maps out the interrelationships between people, organisations and their areas of interest, each annotated with research objects. Such a graph would also allow for pages such as:

  • All published research by an author;
  • All published research by a research lab;
  • The researchers that have worked together in a lab;
  • The researchers who have collaborated on a published paper;
  • The areas of research by lab, funding body or individual;
  • Etc.

Such a graph would help readers to both ‘follow their nose’ to discover research and provide meaningful landing pages for search.
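As a sketch of how the pages listed above could fall out of such a graph, the snippet below stores a handful of triples and answers three of those questions with simple lookups. The identifiers, the predicate names (authored_by, member_of) and the tiny data set are invented for illustration; a real system would more likely use an RDF store and a published vocabulary.

```python
# (subject, predicate, object) statements; all identifiers are made up.
TRIPLES = [
    ("ex:paper1", "authored_by", "ex:alice"),
    ("ex:paper1", "authored_by", "ex:bob"),
    ("ex:paper2", "authored_by", "ex:alice"),
    ("ex:alice", "member_of", "ex:lab1"),
    ("ex:bob", "member_of", "ex:lab1"),
]


def subjects(predicate, obj):
    """Everything that points at `obj` via `predicate`."""
    return sorted(s for s, p, o in TRIPLES if p == predicate and o == obj)


def objects(subject, predicate):
    """Everything `subject` points at via `predicate`."""
    return sorted(o for s, p, o in TRIPLES if s == subject and p == predicate)


def research_by_author(author):
    return subjects("authored_by", author)


def research_by_lab(lab):
    members = subjects("member_of", lab)
    return sorted({paper for m in members for paper in research_by_author(m)})


def collaborators_on(paper):
    return objects(paper, "authored_by")


print(research_by_author("ex:alice"))  # ['ex:paper1', 'ex:paper2']
print(research_by_lab("ex:lab1"))      # ['ex:paper1', 'ex:paper2']
print(collaborators_on("ex:paper1"))   # ['ex:alice', 'ex:bob']
```

Each ‘page’ is just a different walk over the same statements, which is why publishing the URIs and links matters more than designing any one page.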

Digital curation

One of the significant benefits a journal brings to its readership is the role of curation. The editors of the journal select and publish the best research for their readers. On the Web there is no reason this role couldn’t be extended beyond the editor to the users and readers of a site.

Different readers will have different motivations for doing so but providing a mechanism for those users to aggregate and annotate research objects provides a new and potentially powerful mechanism by which scientific discoveries could be surfaced.

For example, a lecturer might curate a collection of papers for an undergraduate class on genomics, combining research objects with their own comments, video and links to other content across the web. This collection could then be shared and used more widely with other lecturers. Alternatively a research lab might curate a collection of papers relevant to their area of research but choose to keep it private.

Providing a rich web of semantically linked resources in this way would allow for the development of a number of different metrics (in addition to Impact Factor). These metrics would not need to be limited to scientific impact; they could be extended to cover:

  • Educational indices – a measure of the citations in university reading lists;
  • Social impact – a measure of citations in the mainstream media;
  • Scientific impact of individual papers;
  • Impact of individual scientists or research labs;
  • Etc.

Such metrics could be used directly, e.g. in research indexes, or indirectly, e.g. to help readers find the best or most relevant content.
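A minimal sketch of how several such metrics might be derived from the same set of links: if each citation records what kind of source made it, the different indices are simply counts over different kinds. The kind labels and identifiers are assumptions for illustration, and raw counts like these are only a starting point, not a finished metric.

```python
from collections import Counter


def impact_by_kind(citations, target):
    """Count citations of `target`, grouped by the kind of citing source.

    citations: iterable of (source_kind, cited_uri) pairs.
    """
    return Counter(kind for kind, cited in citations if cited == target)


# Made-up links pointing at research objects.
citations = [
    ("reading_list", "ex:ro1"),   # educational index
    ("news_story", "ex:ro1"),     # social impact
    ("reading_list", "ex:ro1"),
    ("research_article", "ex:ro1"),  # scientific impact
    ("news_story", "ex:ro2"),
]

impact = impact_by_kind(citations, "ex:ro1")
print(impact["reading_list"])      # 2
print(impact["news_story"])        # 1
print(impact["research_article"])  # 1
```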

Finally it is worth remembering that in all cases this information should be available for both humans and machines to consume and process. In other words this information should be available in structured, machine readable formats.

Our development manifesto

Manifestos are quite popular in the tech community – obviously there’s the agile manifesto, I’ve written before about the kaizen manifesto, and then there’s the Manifesto for Software Craftsmanship. They all try to put forward a way of working, a way of raising professionalism and a way of improving the quality of what you do and build.

Anyway, when we started work on the BBC’s Nature site we set out our development manifesto. I thought you might be interested in it:

  1. Persistence – only mint a new URI if one doesn’t already exist: once minted, never delete it
  2. Linked open data — data and documents describe the real world; things in the real world are identified via HTTP URIs; links describe how those things are related to each other.
  3. The website is the API
  4. RESTful — the Web is stateless, work with this architecture, not against it.
  5. One Web – one canonical URI for each resource (thing), dereferenced to the appropriate representation (HTML, JSON, RDF, etc.).
  6. Fix the data don’t hack the code
  7. Books have pages, the web has links
  8. Do it right or don’t do it at all — don’t hack in quick fixes or ‘tactical solutions’ they are bad for users and bad for the code.
  9. Release early, release often – small, incremental changes are easy to test and prove.

It’s worth noting that we didn’t always live up to these standards — but at least when we broke our rules we did so knowingly and had a chance of fixing them at a later date.
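To illustrate points 1 and 5 of the list above, here is a minimal sketch of a registry that only mints a URI when one doesn’t already exist, plus a deliberately simplified chooser that maps an HTTP Accept header to a representation of that one canonical URI. This is not the code we ran; the class, the function, the example.org base and the supported media types are assumptions, and the Accept parsing ignores q-values and wildcards.

```python
class UriRegistry:
    """Mint a URI once per thing; never replace or delete it."""

    def __init__(self, base):
        self._base = base.rstrip("/")
        self._uris = {}

    def mint(self, kind, key):
        # setdefault returns the existing URI if one was already minted.
        return self._uris.setdefault((kind, key), f"{self._base}/{kind}/{key}")


# Media types this hypothetical site can serve for every resource.
REPRESENTATIONS = ("text/html", "application/json", "application/rdf+xml")


def choose_representation(accept_header, default="text/html"):
    """Pick the first listed media type we support.

    Simplification: q-values and wildcards are ignored; anything
    unrecognised falls back to `default`.
    """
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip().lower()
        if media_type in REPRESENTATIONS:
            return media_type
    return default


registry = UriRegistry("https://example.org")
first = registry.mint("species", "red-fox")
second = registry.mint("species", "red-fox")
print(first == second)  # True
print(choose_representation("application/json"))  # application/json
print(choose_representation("image/png"))         # text/html
```

The point of the design is that the URI identifies the thing, and the format is negotiated per request, so links never need to change when a new representation is added.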