Some thoughts on working out who to trust online

The deplorable attempts to use social media (and much of the mainstream media’s response) to find the bombers of the Boston Marathon, and then the tweets coming out of the Social Media Summit in New York, got me thinking again about how we might get a better understanding of who and what to trust online.

When it comes to online trust I think there are two related questions we should be asking ourselves as technologists:

  1. Can we help people better evaluate the accuracy, trustworthiness or validity of a given news story, tweet, blogpost or other publication?
  2. Can we use social media to better filter those publications and so find the most trustworthy sources or articles?

This second point is also relevant in scientific publishing (a thing I’m trying to help out with these days), where there is keen interest in ‘altmetrics’ as a mechanism to help readers discover and filter research articles.

In academic publishing the need for altmetrics has been driven in part by the rise in the number of articles published which in turn is being fuelled by the uptake of Open Access publishing. However, I would like to think that we could apply similar lessons to mainstream media output.

[Figure: MEDLINE literature growth chart]

Historically a publisher’s brand has, at least in theory, helped its readers to judge the value and trustworthiness of an article. If I see an article published in Nature, the New York Times or broadcast by the BBC, the chances are I’m more likely to trust it than an article published in, say, the Daily Mail.

Academic publishing has even gone so far as to codify this in a journal’s Impact Factor (IF), an idea that Larry Page later used as the basis for his PageRank algorithm.

The premise behind the Impact Factor is that you can identify the best journals, and therefore the best content, by measuring the frequency with which the average article in that journal has been cited in a particular year or period.

Simplistically then, a journal can improve its Impact Factor by ensuring it only publishes the best research. ‘Good journals’ can then act as trusted guides for their readership – pre-filtering the world’s research output to bring their readers only the best.

Obviously this can go wrong. Good research is published outside of high Impact Factor journals, journals can publish poor research, and the mainstream media is so rife with examples of published piffle that the likes of Ben Goldacre can make a career out of exposing it.

As is often noted, the web has enabled all of us to be publishers. It scarcely needs saying that it is now trivially easy for anyone to broadcast their thoughts or post a video or photograph to the web.

This means that social media is now able to ‘break’ a story before the mainstream media. But it also presents a problem: how do you know if it’s true? Without brands (or Impact Factors) to help guide you, how do you judge whether a photo, tweet or blogpost should be trusted?

There are plenty of services out there that aggregate tweets, comments, likes, +1s etc. to help you find the most talked-about story. Indeed, most social media services themselves let you find ‘what’s hot’ or most talked about. All these services, however, seem to assume that there is wisdom in crowds – that the more talked about something is, the more trustworthy it is. But as Oliver Reichenstein pointed out:

“There is one thing crowds have a flair for, and it is not wisdom, it’s rage.”

Relying on point data (most tweeted, most commented etc.) to help filter content or evaluate its trustworthiness, whether it be social or mainstream media, strikes me as foolish.

It seems to me that a better solution would be to build a ‘trust graph’ which in turn could be used to assign a score to each person for a given topic based on their network of friends and followers. It could work something like this…

If a person is followed by a significant number of people who have published peer-reviewed papers on a given topic, or if they have published in that field themselves, then we should trust what that person says about that topic more than we would the average person.

Equally, if a person has posted a large number of photos, tweets etc. from a given city over a long period of time, and they are followed by other people from that city (people who themselves have a history of posting from that city), then we might reasonably conclude that their photographs are from that city when they say they are.

Or if a person is retweeted by someone you trust for other reasons (because you know them, say), then that might give you more confidence that their comments and posts are truthful and accurate.
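To make the first of those heuristics a little more concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the topic_trust function, the follower and publication data (which in reality would have to come from a social API and a bibliographic database) and the weighting, which is entirely arbitrary.

```python
# A minimal sketch of the first heuristic above: score a user's trustworthiness
# on a topic by how many of their followers have published peer-reviewed work
# in that field. All names and data here are hypothetical; in practice the
# follower and publication data would come from a social API plus a
# bibliographic database.

def topic_trust(user, topic, followers, publications, baseline=0.1):
    """Return a rough 0..1 trust score for `user` on `topic`.

    followers:    dict mapping a user id to the set of ids following them
    publications: dict mapping a user id to the set of topics they have
                  published peer-reviewed papers on
    """
    audience = followers.get(user, set())
    if not audience:
        return baseline

    # Fraction of the user's followers who are published authors on the topic.
    experts = {f for f in audience if topic in publications.get(f, set())}
    score = len(experts) / len(audience)

    # Having published on the topic yourself lifts the score further.
    if topic in publications.get(user, set()):
        score = max(score, 0.9)

    return max(score, baseline)


# Hypothetical example: alice has published on genomics and two of her three
# followers have too, so she scores well on that topic.
followers = {"alice": {"bob", "carol", "dave"}}
publications = {"alice": {"genomics"}, "bob": {"genomics"}, "carol": {"genomics"}}
print(topic_trust("alice", "genomics", followers, publications))  # -> 0.9
```

A real version would obviously need to weight who those expert followers are, which is exactly where the graph idea below comes in.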

PageRank is Google's link analysis algorithm: it assigns a numerical weighting to each element of a hyperlinked set of documents, with the purpose of "measuring" its relative importance within the set.

Whatever the specifics, the point I’m trying to make is that rather than relying on a single number or count, we should try to build a directed graph in which each person can be assigned a trust or knowledge score based on the strength of their network in that subject area. This is somewhat analogous to Google’s PageRank algorithm.

Before Google, search engines effectively counted the frequency of a given word on a Webpage to assign it a relevancy score – much as we do today when we count the number of comments, tweets etc. to help filter content.

What Larry Page realised was that by assigning each page a score based on the number and weight of its inbound links, he and Sergey Brin were able to design and build a much better search engine – one that relies not just on what the publisher tells us, nor simply on the number of links, but on the quality of those links. A link from a trusted source is worth more than a link from an average webpage.
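For the curious, here is a toy version of that idea: a few lines of Python doing the classic power iteration, where each node’s score is fed by the scores of the nodes linking to it. It is nothing like Google’s production system and the example graph is invented, but it shows how a link from a highly scored node ends up counting for more than a link from an obscure one – and the same propagation could just as well run over a follower or retweet graph to produce the trust scores described above.

```python
# Toy PageRank via power iteration. Each node's score is redistributed along
# its outbound links every round, so being linked to by well-scored nodes is
# worth more than being linked to by obscure ones. The graph is invented
# purely for illustration.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each node to the list of nodes it links to."""
    nodes = set(links) | {n for targets in links.values() for n in targets}
    rank = {n: 1.0 / len(nodes) for n in nodes}

    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for source, targets in links.items():
            if not targets:
                continue
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank

    return rank


# "nature" is linked to by every other node, so it ends up with the top score.
links = {
    "blog_a": ["nature", "blog_b"],
    "blog_b": ["nature"],
    "tabloid": ["nature"],
    "nature": ["blog_a"],
}
for node, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(node, round(score, 3))
```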

Building a trust graph along similar lines – evaluating not just the frequency of (re)tweets, comments, likes and blogposts but also who those people are, who is in their network and what their network of followers thinks of them – could help us filter and evaluate content, whether it be social or mainstream media, and minimise the damage done by those who don’t tweet responsibly.

Interesting semantic web stuff

It’s starting to feel like the world has suddenly woken up to the whole Linked Data thing — and that’s clearly a very, very good thing. Not only are Google (and Yahoo!) now using RDFa, but a whole bunch of other exciting things are going on; below is a round-up of some of the best. If you don’t know what I’m talking about, you might like to start off with TimBL’s talk at TED.

"Semantic Web Rubik's Cube" by dullhunk. Some rights reserved.
"Semantic Web Rubik's Cube" by dullhunk. Some rights reserved.

TimBL is working with the UK Cabinet Office (as an advisor) to make our information more open and accessible on the web [cabinetoffice.gov.uk]
The blog states that he will be:

  • overseeing the creation of a single online point of access and working with departments to make this part of their routine operations
  • helping to select and implement common standards for the release of public data
  • developing Crown Copyright and ‘Crown Commons’ licenses and extending these to the wider public sector
  • driving the use of the internet to improve consultation processes
  • working with the Government to engage with the leading experts internationally working on public data and standards

The Guardian has an article on the appointment.

Closer to home, there have been a few interesting developments:

Media Meets Semantic Web – How the BBC Uses DBpedia and Linked Data to Make Connections [pdf]
Our paper at this year’s European Semantic Web Conference (ESWC2009), looking at how the BBC has adopted semantic web technologies, including DBpedia, to help provide a better, more coherent user experience. It won best paper in the in-use track – congratulations to Silver and Georgie.

The BBC has announced a couple of SPARQL endpoints, hosted by Talis and OpenLink
Both platforms allow you to search and query the BBC data in a number of different ways, including SPARQL — the standard query language for semantic web data. If you’re not familiar with SPARQL, the Talis folk have published a tutorial that uses some NASA data.
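If you’d rather query an endpoint from code, something like the sketch below works. It uses the third-party SPARQLWrapper library for Python (just a convenient choice, not something any of the services above mandate), and rather than hard-coding BBC endpoint URLs that may well move, it points at the public DBpedia endpoint as a stand-in; swap in whichever endpoint you’re actually using and adjust the query to that dataset’s vocabulary.

```python
# A minimal SPARQL query from Python using the third-party SPARQLWrapper
# library (pip install SPARQLWrapper). The DBpedia endpoint and the query
# below are stand-ins for illustration; point it at whichever SPARQL
# endpoint you actually want to query.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
endpoint.setReturnFormat(JSON)
endpoint.setQuery("""
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    PREFIX dbr:  <http://dbpedia.org/resource/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT ?show ?label WHERE {
        ?show a dbo:TelevisionShow ;
              dbo:network dbr:BBC_One ;
              rdfs:label ?label .
        FILTER (lang(?label) = "en")
    }
    LIMIT 10
""")

results = endpoint.query().convert()
for row in results["results"]["bindings"]:
    print(row["label"]["value"], "->", row["show"]["value"])
```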

A social semantic BBC?
Nice presentation from Simon and Ben on how social discovery of content could work… “show me the radio programmes my friends have listened to, show me the stuff my friends like that I’ve not seen” – all built on people’s existing social graph. People meet content via activity.

PricewaterhouseCoopers’ spring technology forecast focuses on Linked Data [pwc.com]
“Linked Data is all about supply and demand. On the demand side, you gain access to the comprehensive data you need to make decisions. On the supply side, you share more of your internal data with partners, suppliers, and—yes—even the public in ways they can take the best advantage of. The Linked Data approach is about confronting your data silos and turning your information management efforts in a different direction for the sake of scalability. It is a component of the information mediation layer enterprises must create to bridge the gap between strategy and operations… The term “Semantic Web” says more about how the technology works than what it is. The goal is a data Web, a Web where not only documents but also individual data elements are linked.”

Including an interview with me!

You should also check out…

sameas.org, a service to help link up equivalent URIs
It helps you to find co-references between different data sets. Interestingly, it’s also licensed under CC0, which means all copyright and related or neighboring rights are waived.
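To give a flavour of what ‘co-references’ means in practice: different datasets mint different URIs for the same real-world thing, and owl:sameAs links (the kind of thing sameas.org aggregates) let you bundle them together. Here’s a tiny, largely hypothetical sketch using the rdflib Python library – the example.org URIs are invented stand-ins, not real identifiers.

```python
# A toy illustration of co-reference bundling with owl:sameAs, using the
# third-party rdflib library (pip install rdflib). The example.org URIs are
# hypothetical stand-ins for identifiers minted by different datasets.
from rdflib import Graph
from rdflib.namespace import OWL

data = """
@prefix owl: <http://www.w3.org/2002/07/owl#> .

<http://dbpedia.org/resource/Radiohead>
    owl:sameAs <http://example.org/music-db/artist/radiohead> ,
               <http://example.org/reviews-site/band/Radiohead> .
"""

g = Graph()
g.parse(data=data, format="turtle")

# Bundle each subject with everything it is declared sameAs.
bundles = {}
for subject, _, obj in g.triples((None, OWL.sameAs, None)):
    bundles.setdefault(subject, {subject}).add(obj)

for uri, equivalents in bundles.items():
    print(uri)
    for e in sorted(equivalents):
        print("   =", e)
```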