In the last post I explored two approaches to making computers do smart things, in particular relating to search engines. The knowledge representation approach (affiliated with traditional AI and the semantic web) involves creating ontologies, defining objects and relations, and getting software to make logical inferences over them. What I called the statistical approach (also known as machine learning) involves using data, often generated by human activity, to detect patterns and make a probabilistic assessment of the right answer. In the case of search, what we click on in response to queries and inbound hyperlinks are used to rank search results.
This brings us to the recent paper by some engineers at Google, on what they call knowledge-based trust (KBT). The problem faced by the statistical approach is that it is based on what millions of ordinary, fallible humans do on the web. That includes clicking on and linking to pages with sensational but unsubstantiated headlines, or dubious medical information. This means our biases get picked up by the system alongside our better judgement. If you train a computer with flawed data, it’s going to return flawed results; garbage in, garbage out. What the paper proposes is a new way to suppress (or at least, downgrade) such content, based on how many of the statements a page makes check out as facts.
But how can a search engine determine the factual content of a web page, if all it measures are clicks and links? It can’t. This is where the knowledge representation approach comes back to the rescue. By comparing statements extracted from web pages with a pre-existing body of knowledge, the researchers hope that a search engine could assess the trustworthiness of a page.
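To make the idea concrete, here is a minimal sketch in Python of scoring a page by checking its statements against a knowledge base. It is an illustration under loud assumptions, not the method in the paper: the toy `KNOWLEDGE_BASE`, the `trust_score` function and the example statements are all hypothetical, statements are assumed to arrive already extracted as (subject, predicate, object) triples, and the score is a plain fraction rather than the probabilistic model the researchers describe.

```python
from typing import Dict, Iterable, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical toy knowledge base: (subject, predicate) -> accepted object.
KNOWLEDGE_BASE: Dict[Tuple[str, str], str] = {
    ("Paris", "capital_of"): "France",
    ("Mount Everest", "located_in"): "Himalayas",
}


def trust_score(triples: Iterable[Triple]) -> Optional[float]:
    """Fraction of checkable statements that agree with the knowledge base.

    Returns None when no statement on the page can be checked at all.
    """
    agree = 0
    checkable = 0
    for subject, predicate, obj in triples:
        accepted = KNOWLEDGE_BASE.get((subject, predicate))
        if accepted is None:
            continue  # nothing to compare against: skipped in this sketch
        checkable += 1
        if accepted == obj:
            agree += 1
    return agree / checkable if checkable else None


# Statements assumed to have been extracted from one hypothetical page.
page = [
    ("Paris", "capital_of", "France"),           # agrees with the knowledge base
    ("Mount Everest", "located_in", "Alps"),     # contradicts it
    ("Atlantis", "located_in", "the Atlantic"),  # not covered, so unverifiable
]

print(trust_score(page))  # 0.5
```

Note the design choice about statements the knowledge base does not cover: this sketch simply skips them. Counting them against the page instead would penalise any content the knowledge base happens to be silent on.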
Google have been working on both the knowledge representation and statistical approaches for a long time. This proposal is one example of how the two approaches could be usefully integrated. Those little information boxes that crop up for certain Google searches are another. Try searching ‘Tiger vs Shark’ and the first thing you’ll see above the normal search results is a tabular comparison of their respective properties – useful for those ‘who would win in a fight between x and y’ questions. These factoids are driven by a pre-existing body of structured data.
But hold on, where does this pre-existing body of knowledge come from, and why should we trust it, especially if it’s used to re-order search results? It comes from the ‘Knowledge Vault’, Google’s repository of machine-readable information about the world, covering geography, biology, history – you name it, they probably have it. It’s based on a collaboratively generated database called Freebase, created (or, perhaps more accurately, ‘curated’) from 2007 onwards by Metaweb, which Google acquired in 2010. Freebase is now due to shut down and be replaced by Wikidata, another source of structured data, extracted from Wikipedia.
So while our collective clicks and links may be a bad measure of truthiness, perhaps our collaborative encyclopedia entries can serve as a different standard for truth-assessment. Of course, if this standard is flawed, then the knowledge-based-trust score is going to be equally flawed (garbage in, garbage out). If you think Wikipedia (and hence Wikidata) is dodgy, then you won’t be very impressed by KBT-enhanced search results. If, on the other hand, you think it’s good enough, then it could lead to a welcome improvement. But we can’t escape some of the foundational epistemic questions whichever approach we adopt. In attempting to correct one source of bias, we introduce another. Whether the net effect is positive, or the biases cancel each other out, I don’t know. But what I do know is that it isn’t just a question for software engineers to answer.
The main content of the paper itself is highly technical and, dare I say, boring for those of us outside of this branch of computer science. Its main contribution is a solution to the problem of distinguishing noise in the knowledge extraction process from falsehood in the source, something which has so far held back the practical application of such techniques to search ranking. But the discussion that the paper has prompted poses some very important social and political questions.
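One intuition behind separating extraction noise from falsehood in the source can be sketched in a few lines of Python. This is my own simplified illustration, not the paper’s model: the function `statement_support`, the extractor names and the example triples are hypothetical. The assumption is that several independent extractors read the same page, so a statement reported by only one of them is more plausibly an extraction error than something the page actually asserts.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def statement_support(
    extractions: Iterable[Tuple[str, Triple]], n_extractors: int
) -> Dict[Triple, float]:
    """Map each triple to the share of extractors that reported it."""
    seen: Dict[Triple, Set[str]] = defaultdict(set)
    for extractor_id, triple in extractions:
        seen[triple].add(extractor_id)
    return {triple: len(ids) / n_extractors for triple, ids in seen.items()}


# Hypothetical output of three extractors run over the same page.
extractions = [
    ("ex1", ("Paris", "capital_of", "France")),
    ("ex2", ("Paris", "capital_of", "France")),
    ("ex3", ("Paris", "capital_of", "France")),
    ("ex2", ("Paris", "capital_of", "Texas")),  # reported by one extractor only
]

support = statement_support(extractions, n_extractors=3)
for triple, share in support.items():
    print(triple, round(share, 2))
```

A low share suggests the page should not be blamed for that statement; a false statement that every extractor agrees on is more likely to be the page’s own error.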
Risks of the Knowledge-Based Trust approach
The most immediate concern has come from the search engine optimisation community. Will SEO experts now be advising websites to ‘push up the fact quotient’ of their content? Will marketers have even more reason to infiltrate Wikipedia in an effort to push their ‘facts’ into Wikidata? What about the many contexts in which we assert untrue claims for contextually acceptable and obvious reasons (e.g. fiction, parody, or hyperbole)? Will such content have a harder time getting hits?
And what about all the claims that are ‘unverifiable’ and have no ‘truth value’, as the logical positivists (see previous post) would have said? While KBT would only be one factor in the search rankings, it would still punish content containing many of these kinds of claims. Do we want an information environment that’s skewed towards statements that can be verified and against those that are unverifiable?
The epistemological status of what the researchers call ‘facts’ is also intriguing. The researchers seem to acknowledge that the knowledge base might not be completely accurate, when they include sentences like “facts extracted by automatic methods such as KV may be wrong”. This does seem to be standard terminology in this branch of computer science, but for philosophers, linguists, logicians, sociologists and others, the loose use of the ‘f’ word will ring alarm bells. Even putting aside these academic perspectives, our everyday use of ‘fact’ usually implies truth. It would be far less confusing to simply call them statements, which can be either true or false.
Finally, while I don’t think it presents a serious danger right now, and indeed it could improve search engines in some ways, moving in this direction has risks for public debate, education and free speech. One danger is that sources containing claims that are worth exploring, but have insufficient evidence, will be systematically suppressed. If there’s no way for a class of maybe-true claims to get into the Knowledge Vault or Wikidata or whatever knowledge base is used, then you have to work extra hard to get people to even consider them. Whatever process is used to revise and expand the knowledge base will inevitably become highly contested, raising conflicts that may often prove irreconcilable.
It will be even harder if your claim directly contradicts the ‘facts’ found in the search engine’s knowledge base. If your claim is true, then society loses out. And even if your claim is false, as John Stuart Mill recognised, society may still benefit from having received opinion challenged:
“Even if the received opinion be not only true, but the whole truth; unless it is suffered to be, and actually is, vigorously and earnestly contested, it will, by most of those who receive it, be held in the manner of a prejudice, with little comprehension or feeling of its rational grounds.” – On Liberty (1859)
Search engines that rank claims by some single standard of truthiness are just one more way that free speech can be gradually, messily eroded. Of course, the situation we have now – the tyranny of the linked and clicked – may be erosive in different, better or worse ways. Either way, the broader problem is that search engines – especially those with a significant majority of the market – can have profound effects on the dissemination of information and misinformation in society. We need to understand these effects and find ways to deal with their political and social consequences.