The Who, What and Why of Personal Data (new paper in JOAL)

Some of my PhD research has been published in the current issue of the Journal of Open Access to Law, from Cornell University.

Abstract:
“Data protection laws require organisations to be transparent about how they use personal data. This article explores the potential of machine-readable privacy notices to address this transparency challenge. We analyse a large source of open data comprised of semi-structured privacy notifications from hundreds of thousands of organisations in the UK, to investigate the reasons for data collection, the types of personal data collected and from whom, and the types of recipients who have access to the data. We analyse three specific sectors in detail; health, finance, and data brokerage. Finally, we draw recommendations for possible future applications of open data to privacy policies and transparency notices.”

It’s available under open access at the JOAL website (or download the PDF).

Searching for Truthiness, Part 2: Knowledge-Based Trust

In the last post I explored two approaches to making computers do smart things, in particular relating to search engines. The knowledge representation approach (affiliated with traditional AI and the semantic web) involves creating ontologies, defining objects and relations, and getting software to make logical inferences over them. What I called the statistical approach (also known as machine learning) involves using data, often generated by human activity, to detect patterns and make a probabilistic assessment of the right answer. In the case of search, our clicks in response to queries, together with inbound hyperlinks, are used to rank the results.

This brings us to the recent paper by some engineers at Google, on what they call knowledge-based trust (KBT). The problem faced by the statistical approach is that it is based on what millions of ordinary, fallible humans do on the web. That includes clicking on and linking to pages with sensational but unsubstantiated headlines, or dubious medical information. This means our biases get picked up by the system alongside our better judgement. If you train a computer with flawed data, it’s going to return flawed results; garbage in, garbage out. What the paper proposes is a new way to suppress (or at least, downgrade) such content based on the accuracy of the factual claims it contains.

But how can a search engine determine the factual content of a web page, if all it measures are clicks and links? It can’t. This is where the knowledge representation approach comes back to the rescue. By comparing statements extracted from web pages with a pre-existing body of knowledge, the researchers hope that a search engine could assess the trustworthiness of a page.
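To make the basic idea concrete, here’s a minimal sketch of knowledge-based scoring – my own toy illustration, not Google’s code, with made-up triples and page names. A page’s score is simply the fraction of its extracted claims that agree with a trusted knowledge base (the real KBT model is far more sophisticated, not least because it also models extraction error, discussed below).

```python
# Toy sketch of knowledge-based trust scoring (illustrative only, not Google's code).
# A page's score is the fraction of its extracted (subject, predicate, object)
# claims that agree with a trusted knowledge base.

KNOWLEDGE_BASE = {
    ("barack_obama", "born_in", "honolulu"),
    ("london", "capital_of", "united_kingdom"),
    ("water", "chemical_formula", "h2o"),
}

def kbt_score(extracted_triples):
    """Return the proportion of extracted triples confirmed by the knowledge base."""
    if not extracted_triples:
        return None  # no factual claims found, so this signal tells us nothing
    confirmed = sum(1 for triple in extracted_triples if triple in KNOWLEDGE_BASE)
    return confirmed / len(extracted_triples)

# Hypothetical triples extracted from two pages:
reliable_page = [("barack_obama", "born_in", "honolulu"),
                 ("water", "chemical_formula", "h2o")]
dubious_page = [("barack_obama", "born_in", "kenya"),
                ("water", "chemical_formula", "h2o")]

print(kbt_score(reliable_page))  # 1.0
print(kbt_score(dubious_page))   # 0.5
```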

Google have been working on both the knowledge representation and statistical approaches for a long time. This proposal is one example of how the two approaches could be usefully integrated. Those little information boxes that crop up for certain Google searches are another. Try searching ‘Tiger vs Shark’ and the first thing you’ll see above the normal search results is a tabular comparison of their respective properties – useful for those ‘who would win in a fight between x and y’ questions. These factoids are driven by a pre-existing body of structured data.

But hold on, where does this pre-existing body of knowledge come from, and why should we trust it, especially if it’s used to re-order search results? It comes from the ‘Knowledge Vault’, Google’s repository of machine-readable information about the world, covering geography, biology, history – you name it, they probably have it. It’s based on a collaboratively generated database called Freebase, created (or, perhaps more accurately, ‘curated’) since 2007 by Metaweb, and acquired by Google in 2010. Freebase is now due to shut down and be replaced by Wikidata, another collaboratively edited source of structured data, closely tied to Wikipedia.

So while our collective clicks and links may be a bad measure of truthiness, perhaps our collaborative encyclopedia entries can serve as a different standard for truth-assessment. Of course, if this standard is flawed, then the knowledge-based-trust score is going to be equally flawed (garbage in, garbage out). If you think Wikipedia (and hence Wikidata) is dodgy, then you won’t be very impressed by KBT-enhanced search results. If, on the other hand, you think it’s good enough, then it could lead to a welcome improvement. But we can’t escape some of the foundational epistemic questions whichever approach we adopt. In attempting to correct one source of bias, we introduce another. Whether the net effect is positive, or the biases cancel each other out, I don’t know. But what I do know is that it isn’t just a question for software engineers to answer.

The main content of the paper itself is highly technical and, dare I say, boring for those of us outside of this branch of computer science. Its main contribution is a solution to the problem of distinguishing noise in the knowledge extraction process from falsehood in the source, something which has so far held back the practical application of such techniques to search ranking. But the discussion that the paper has prompted poses some very important social and political questions.
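To see why that distinction matters, here’s a heavily simplified toy simulation of my own – nothing like the paper’s actual multi-layer probabilistic model, and with made-up numbers throughout. If the extractor itself garbles some of what it reads, a naive score unfairly blames the source; knowing (or jointly estimating) the extractor’s error rate lets you separate the two.

```python
# Toy simulation: extraction noise vs. falsehood in the source (illustrative only).
# Simplifying assumption: a garbled extraction never accidentally matches the
# knowledge base, so a match requires the source to be right AND the extraction clean.
import random

random.seed(0)

SOURCE_ACCURACY = 0.9    # the source states true facts 90% of the time
EXTRACTION_ERROR = 0.2   # the extractor garbles 20% of the statements it reads
N_STATEMENTS = 100_000

matches = 0
for _ in range(N_STATEMENTS):
    source_correct = random.random() < SOURCE_ACCURACY
    extraction_clean = random.random() > EXTRACTION_ERROR
    if source_correct and extraction_clean:
        matches += 1

naive_estimate = matches / N_STATEMENTS
corrected_estimate = naive_estimate / (1 - EXTRACTION_ERROR)

print(f"naive source accuracy:     {naive_estimate:.3f}")      # ~0.72 (unfairly low)
print(f"corrected source accuracy: {corrected_estimate:.3f}")  # ~0.90 (close to the truth)
```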

Risks of the Knowledge-Based Trust approach

The most immediate concern has come from the search engine optimisation community. Will SEO experts now recommend that websites ‘push up the fact quotient’ of their content? Will marketers have even more reason to infiltrate Wikipedia in an effort to push their ‘facts’ into Wikidata? What about all the many contexts in which we assert untrue claims for contextually acceptable and obvious reasons (e.g. fiction, parody, or hyperbole)? Will they have a harder time getting hits?

And what about all the claims that are ‘unverifiable’ and have no ‘truth value’, as the logical positivists (see previous post) would have said? While KBT would only be one factor in the search rankings, it would still punish content containing many of these kinds of claims. Do we want an information environment that’s skewed towards statements that can be verified and against those that are unverifiable?

The epistemological status of what the researchers call ‘facts’ is also intriguing. The researchers seem to acknowledge that the knowledge base might not be completely accurate when they include sentences like “facts extracted by automatic methods such as KV may be wrong”. This does seem to be standard terminology in this branch of computer science, but for philosophers, linguists, logicians, sociologists and others, the loose use of the ‘f’ word will ring alarm bells. Even putting aside these academic perspectives, our everyday use of ‘fact’ usually implies truth. It would be far less confusing to simply call them statements, which can be either true or false.

Finally, while I don’t think it presents a serious danger right now, and indeed it could improve search engines in some ways, moving in this direction has risks for public debate, education and free speech. One danger is that sources containing claims that are worth exploring, but have insufficient evidence, will be systematically suppressed. If there’s no way for a class of maybe-true claims to get into the Knowledge Vault or Wikidata or whatever knowledge base is used, then you have to work extra hard to get people to even consider them. Whatever process is used to revise and expand the knowledge base will inevitably become highly contested, raising conflicts that may often prove irreconcilable.

It will be even harder if your claim directly contradicts the ‘facts’ found in the search engine’s knowledge base. If your claim is true, then society loses out when it gets buried. And even if your claim is false, as John Stuart Mill recognised, society may still benefit from having received opinion challenged:

“Even if the received opinion be not only true, but the whole truth; unless it is suffered to be, and actually is, vigorously and earnestly contested, it will, by most of those who receive it, be held in the manner of a prejudice, with little comprehension or feeling of its rational grounds.” – On Liberty (1859)

Search engines that rank claims by some single standard of truthiness are just one more way that free speech can be gradually, messily eroded. Of course, the situation we have now – the tyranny of the linked and clicked – may be erosive in different, better or worse ways. Either way, the broader problem is that search engines – especially those with a significant majority of the market – can have profound effects on the dissemination of information and misinformation in society. We need to understand these effects and find ways to deal with their political and social consequences.

Searching for Truthiness, Part 1: Logical Positivism vs. Statistics

Wittgenstein (second from right), whose early work inspired logical positivism

Recent coverage of a research paper by some Google engineers has ruffled some feathers in the world of SEO. The paper demonstrates a method for what they call a ‘knowledge-based trust’ approach to ranking search results. Instead of using ‘exogenous’ signals like the number of inbound hyperlinks to a web resource (as in the traditional Google PageRank algorithm), the KBT approach factors in ‘endogenous’ signals, namely, the ‘correctness of factual information’ found on the resource.

To understand what this change means, I think it’s worth briefly considering two approaches to knowledge: one is based on statistical measures and exemplified by modern search engines; the other has its roots in a key movement in 20th century philosophy.

One of the fundamental suppositions of analytic philosophy is that there is an objective, rigorous method for pursuing answers to complex questions. The idea is that our ethical, political or metaphysical beliefs aren’t just matters of subjective opinion, but can be interrogated, revised and improved using objective analytical methods that transcend mere rhetoric.

A group of philosophers in the 1920s took this idea to an extreme in a movement called logical positivism. They believed that every sentence in any human language could in principle be classified as either verifiable or unverifiable. ‘Analytic’ statements, like those in mathematics, can be verified through logic. ‘Synthetic’ statements, like ‘water is H2O’, can be verified through scientific experiment. Every other kind of statement, according to the logical positivists, was an expression of feeling, an exhortation to action or just plain nonsense, and unless you already agree with it there’s no objective way you could be convinced.

The allure of verificationism was that it offered a systematic way to assess any deductive argument. Take every statement, determine an appropriate method of verification for the statement, discarding any which are unverifiable. Sort the statements into premises and conclusions, and determine the truth value of each premise by reference to trusted knowledge sources. Finally, assess whether the conclusions validly follow from the premises using the methods of formal logic. To use a tired syllogism as an example, take the premises ‘All men are mortal’, ‘Socrates is a man’, and the conclusion ‘Socrates is mortal’. The premises can be verified as true through reference to biology and the historical record. Each statement can then be rendered in predicate logic so that the entire argument can be shown to be sound.
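Rendered in predicate logic – a standard textbook formalisation, included here just as an illustration – the syllogism looks like this:

```latex
% The Socrates syllogism in first-order predicate logic
\begin{align*}
  \text{Premise 1:}\quad & \forall x\,\big(\mathrm{Man}(x) \rightarrow \mathrm{Mortal}(x)\big)\\
  \text{Premise 2:}\quad & \mathrm{Man}(\mathit{socrates})\\
  \text{Conclusion:}\quad & \mathrm{Mortal}(\mathit{socrates})
  \quad \text{(by universal instantiation and modus ponens)}
\end{align*}
```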

While I doubt that the entirety of intellectual debate and enquiry can be reduced in this way without losing some essential meaning (not to mention rhetorical force), it certainly provides a useful model for certain aspects of reasoning. For better or worse, this model has been used time and time again in attempts to build artificial intelligence. Armed with predicate logic, ontologies to classify things, and lots of fact-checked machine-readable statements, computers can do all sorts of clever things.

Search engines could not only find pages based on keywords but also do little bits of reasoning, giving us new information that isn’t explicitly written anywhere but can be inferred from a stock of pre-existing information. This is a perfect job for computers because they are great at following well-defined rules incredibly fast over massive amounts of data. This is the purpose of projects like Freebase and Wikidata – to take the knowledge we’ve built up in natural language and translate it into machine-readable data (stored as key-value pairs or triples). It’s the vision of the semantic web outlined by Tim Berners-Lee.
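A minimal sketch of what that looks like in practice – the triples and the single transitivity rule below are my own toy examples, not Freebase’s or Wikidata’s actual schema:

```python
# Toy semantic-web-style reasoning over (subject, predicate, object) triples.
# The data and the inference rule are illustrative only.

triples = {
    ("tim_berners_lee", "born_in", "london"),
    ("london", "located_in", "england"),
    ("england", "located_in", "united_kingdom"),
}

def located_in_closure(triples):
    """Infer new 'located_in' facts by transitivity until nothing changes."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        for (a, p1, b) in list(inferred):
            for (c, p2, d) in list(inferred):
                if p1 == p2 == "located_in" and b == c:
                    if (a, "located_in", d) not in inferred:
                        inferred.add((a, "located_in", d))
                        changed = True
    return inferred

facts = located_in_closure(triples)
# A fact nobody wrote down explicitly, but which follows from the ones that were:
print(("london", "located_in", "united_kingdom") in facts)  # True
```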

The search engines we know and love are based on a different approach. This is less focused on logic and knowledge representation and more on statistics. Rather than attempting to represent and reason about the world, the statistical approach tries to get computers to learn how to perform a task based on data (usually generated as a by-product of human activity). For instance, the relevance of a response to a search query isn’t determined by the ‘meaning’ of the query and pre-digested statements about the world, but by the number of inbound links and clicks on a page. We gave up trying to get computers to understand what we’re talking about, and allowed them to guess what we’re after based on the sheer brute force of correlation.
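For a flavour of that statistical approach, here’s a bare-bones version of the link-based intuition behind PageRank, run over a made-up four-page web – a simplified power-iteration sketch, not Google’s production algorithm, which blends in many other signals (including clicks):

```python
# Simplified PageRank by power iteration over a tiny, made-up link graph.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

DAMPING = 0.85
pages = list(links)
rank = {page: 1 / len(pages) for page in pages}

for _ in range(50):  # iterate until the ranks settle
    new_rank = {page: (1 - DAMPING) / len(pages) for page in pages}
    for page, outlinks in links.items():
        share = rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += DAMPING * share
    rank = new_rank

# Pages with more (and better-ranked) inbound links float to the top.
print(sorted(rank.items(), key=lambda item: -item[1]))
```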

In the next post I’ll look at how Google might integrate these two approaches to improve search engine results.

‘Privacy and consumer markets’ – talk at 31c3

I just gave a talk at the 31st annual Chaos Communication Congress in Hamburg. The blurb:

“The internet may be the nervous system of the 21st century, but its main business purpose is helping marketers work out how to make people buy stuff. This talk maps out a possible alternative, where consumers co-ordinate online, pooling their data and resources to match demand with supply.”

It was live-streamed and the video should be up on ccc-tv soon. Slides from the talk are available here in PDF or ODP.

Thanks to all the organisers for running such a great event!

How to improve how we prove: from paper-and-ink to digital verified attributes

‘Stamp of Approval’ by Sudhamshu Hebbar, CC-BY 2.0

Personal information management services (PIMS) are an emerging class of digital tools designed to help people manage and use data about themselves. At the core of this is information about your identity and credentials, without which you cannot prove who you are or that you have certain attributes. This is a boring but necessary part of accessing services, claiming benefits and compensation, and a whole range of other general ‘life admin’ tasks.

Currently the infrastructure for managing these processes is stuck somewhere in the Victorian era, dominated by rubber stamps, handwritten signatures and paper forms, dealt with through face-to-face interactions with administrators and shipped around through snail mail. A new wave of technology aims to radically simplify this infrastructure through digital identities, certificates and credentials. Examples include GOV.UK Verify, the UK government identity scheme, and services like MiiCard and Mydex which allow individuals to store and re-use digital proofs of identity and status. The potential savings from these new services are estimated at £3 billion in the UK alone (disclosure: I was part of the research team behind this report).

Yesterday I learned a powerful first-hand lesson about the current state of identity management, and the dire need for PIMS to replace it. It all started when I realised that a train ticket, which I’d bought in advance, would be invalid because my discount railcard would expire before the date of travel. After discovering I could not simply pay the excess to upgrade to a regular ticket, I realised my only option would be to renew the railcard.

That may sound simple, but it was not. To be eligible for the discount, I’d need to prove to the railcard operator that I’m currently a post-graduate student. They require a specific class of (very busy) University official to fill in, sign and stamp their paper form and verify a passport photo. There is a semi-online application system, but this still requires a University administrator to complete the paperwork and send a scanned copy, and then there’s an additional waiting time while a new railcard is sent by post from an office in Scotland.

So I’d need to make a face-to-face visit to one of the qualified University administrators with all the documents, and hope that they are available and willing to deal with them. Like many post-graduate students, I live in a different city, so this involves a 190-minute, £38 round trip by train. When I arrived, the first administrator I asked to sign the documentation told me that I would have to leave it with their office for an unspecified number of days (days!) while they ‘checked their system’ to verify that I am who I say I am.

I tried to communicate the absurdity of the situation: I had travelled 60 miles to get a University-branded pattern of ink stamped on a piece of paper, in order to verify my identity to the railcard company, but the University administrators couldn’t stamp said paper because they needed several days to check a database to verify that I exist and I am me – while I stand before them with my passport, driver’s license, proof of address and my student identity card.

Finally I was lucky enough to speak to another administrator whom I know personally, who was able to deal with the paperwork in a matter of seconds. In the end, the only identity system which worked was a face-to-face interaction predicated on interpersonal trust; a tried-and-tested protocol which pre-dates the scanned passport, the Kafka-esque rubber stamp, and the pen-pushing Victorian clerk.

Here’s how an effective digital identity system would have solved this problem. Upon enrolment, the university would issue me with a digital certificate, verifying my status as a postgraduate, which would be securely stored and regularly refreshed in my personal data store (PDS). When the time comes to renew my discount railcard, I would simply log in to my PDS and accept a connection from the railcard operator’s site. I pay the fee and they extend the validity of my existing railcard.

From the user experience perspective, that’s all there is to it – a few clicks and it’s done. In the background, there’s a bit more complexity. My PDS would receive a request from the railcard operator’s system for the relevant digital certificate (essentially a cryptographically signed token generated by the University’s system). After verifying the authenticity of the request, my PDS sends a copy of the certificate. The operator’s back-end system then checks the validity of the certificate against the public key of the issuer (in this case, the university). If it all checks out, the operator has assurance from the University that I am eligible for the discount. It should take a matter of seconds.
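For the technically curious, here’s a heavily simplified sketch of that background step, using an Ed25519 signature from the Python cryptography library. This is my own illustration of the general pattern – not how Verify, MiiCard, Mydex or any real personal data store actually implements it – and the attribute string, key handling and dates are all made up; a real deployment would use standard certificate formats, expiry checks and revocation.

```python
# Sketch: a university signs a student-status attribute; the railcard operator
# verifies it against the university's public key. Illustrative only.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric import ed25519

# --- At enrolment: the university issues a signed attribute ---
university_key = ed25519.Ed25519PrivateKey.generate()
attribute = b"student_id=12345;status=postgraduate;valid_until=2016-09-30"
signature = university_key.sign(attribute)
# The attribute plus its signature is what would sit in my personal data store.

# --- At renewal time: the railcard operator checks the certificate ---
university_public_key = university_key.public_key()  # published by the university
try:
    university_public_key.verify(signature, attribute)
    print("Certificate valid: discount granted.")
except InvalidSignature:
    print("Certificate invalid: discount refused.")

# Tampering with the attribute (e.g. stretching the expiry date) breaks the signature:
forged = attribute.replace(b"2016-09-30", b"2026-09-30")
try:
    university_public_key.verify(signature, forged)
except InvalidSignature:
    print("Forged attribute detected.")
```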

From a security perspective, it’s harder to fake a signature made out of cryptography than one made out of ink (ironically, it would probably have been less effort for me to forge the ink signature than to obtain it legitimately). Digital proofs can also be better for privacy, as they reveal the minimal amount of information about me that the railcard operator needs to determine my eligibility, and the data is only shared when I permit it.

Identity infrastructure is important for reasons beyond convenience and security – it’s also about equality and access. I’m lucky that I can afford to pay the costs when these boring parts of ‘life admin’ go wrong – paying for a full price ticket wouldn’t have put my bank balance in the red. But if you’re at the bottom of the economic ladder, you have much more to lose when you can’t access the discounted services, benefits and compensation you are entitled to. Reforming our outdated systems could therefore have a disproportionately positive impact for the least well-off.

YouGov Profiles

I haven’t blogged here in a while. But I did write this piece on YouGov’s Profiler app – a rather fun but warped view of the research company’s consumer profiling data.

It’s published in The Conversation – if you haven’t come across them yet, I strongly recommend taking a look. They publish topical and well-informed opinion pieces from academics, and their motto is ‘academic rigour, journalistic flair’. Best of all, the articles are licensed under a Creative Commons (BY-ND) licence – ensuring they can be republished and shared as widely as possible.

Public Digital Infrastructure: Who Pays?

Glen Canyon Bridge & Dam, Page, Arizona, by flickr user Thaddeus Roan under CC-BY 2.0

Every day, we risk our personal security and privacy by relying on lines of code written by a bunch of under-funded non-profits and unpaid volunteers. These essential pieces of infrastructure go unnoticed and under-resourced; that is, until they fail.

Take OpenSSL, one of the most common tools for encrypting internet traffic. It means that things like confidential messages and credit card details aren’t transferred as plain text. It probably saves you from identity fraud, theft, stalking, blackmail, and general inconvenience dozens of times a day. When a critical security flaw (known as ‘Heartbleed’) was discovered in OpenSSL’s code last April, there was just one person paid to work full-time on the project – the rest of it being run largely by volunteers.

What about the Network Time Protocol? It keeps most of the world’s computers’ clocks synchronised so that everything is, you know, on time. NTP has been developed and maintained over the last 20 years by one university professor and a team of volunteers.

Then there is OpenSSH, which is used to securely log in to remote computers across a network – used every day by systems administrators to keep IT systems, servers, and websites working whilst keeping out intruders. That’s maintained by another under-funded team who recently started a fundraising drive because they could barely afford to keep the lights on in their office.

Projects like these are essential pieces of public digital infrastructure; they are the fire brigade of the internet, the ambulance service for our digital lives, the giant dam holding back a flood of digital sewage. But our daily dependence on them is largely invisible and unquantified, so it’s easy to ignore their importance. There is no equivalent to pictures of people being rescued from burning buildings. The image of a programmer auditing some code is not quite as visceral.

So these projects survive on handouts – mostly small ones, occasionally larger ones from big technology companies. Whilst it’s great that commercial players want to help secure the open source code they use in their products, this alone is not an ideal solution. Imagine if the ambulance service were funded by ad-hoc injections of cash from various private hospitals, who had no obligation to maintain their contributions. Or if firefighters only got new trucks and equipment when some automobile manufacturer thought it would be good PR.

There’s a good reason to make this kind of critical public infrastructure open-source. Proprietary code can only be audited behind closed doors, so that means everyone who relies on it has to trust the provider to discover its flaws, fix them, and be honest when they fail. Open source code, on the other hand, can be audited by anyone. The idea is that ‘many eyes make all bugs shallow’ – if everyone can go looking for them, bugs are much more likely to be found.

But just because anyone can, that doesn’t mean that someone will. It’s a little like the story of four people named Everybody, Somebody, Anybody, and Nobody:

There was an important job to be done and Everybody was sure that Somebody would do it. Anybody could have done it, but Nobody did it. Somebody got angry about that because it was Everybody’s job. Everybody thought that Anybody could do it, but Nobody realized that Everybody wouldn’t do it. It ended up that Everybody blamed Somebody when Nobody did what Anybody could have done.

Everybody would benefit if Somebody audited and improved OpenSSL/NTP/OpenSSH/etc, but Nobody has sufficient incentive to do so. Neither proprietary software nor the open source world is delivering the quality of critical public digital infrastructure we need.

One solution to this kind of market failure is to treat critical infrastructure as a public good, deserving of public funding. Public goods are traditionally defined as ‘non-rival’, meaning that one person’s use of the good does not reduce its availability to others, and ‘non-excludable’, meaning that it is not possible to exclude certain people from using it. The examples given above certainly meet these criteria. Code is infinitely reproducible at nearly zero marginal cost, and its use, absent any patents or copyrights, is impossible to constrain.

The costs of creating and sustaining a global, secure, open and free-as-in-freedom digital infrastructure are tiny in comparison to the benefits. But direct, ongoing public funding for those who maintain this infrastructure is rare. Meanwhile, we find that billions have been spent on intelligence agencies whose goal is to make security tools less secure. Rather than undermining such infrastructure, governments should be pooling their resources to improve it.


Related: The Linux Foundation has an initiative to address this situation, with the admirable backing of some industry heavyweights: http://www.linuxfoundation.org/programs/core-infrastructure-initiative/
While any attempt to list all the critical projects of the internet is likely to be incomplete and lead to disagreement, Jonathan Wilkes and volunteers have nevertheless begun one: https://wiki.pch.net/doku.php?id=pch:public:critical-internet-software

‘Surprise Minimisation’

A little while ago I wrote a short post for the IAPP on the notion of ‘surprise minimisation’. In summary, I’m not that keen on it:

I’m left struggling to see the point of introducing yet another term in an already jargon-filled debate. Taken at face-value, recommending surprise minimisation seems no better than simply saying “don’t use data in ways people might not like”—if anything, it’s worse because it unhelpfully equates surprise with objection, and vice-versa. The available elaborations of the concept don’t add much either, as they seem to boil down to an ill-defined mixture of existing principles.

Why Surprise Minimisation is a Misguided Principle

A Study of International Personal Data Transfers

Whilst researching open registers of data controllers, I was left with some interesting data on international data transfers which didn’t make it into my main research paper. This formed the basis of a short paper for the 2014 Web Science conference which took place last month.

The paper presents a brief analysis of the destinations of 16,000 personal data transfers from the UK. Each ‘transfer’ represents an arrangement by a data controller in the UK to send data to a country overseas. Many of these destinations are simply listed under the rather general categories of ‘European Economic Area’ or ‘Worldwide’, so the analysis focuses on just those transfers where specific countries were mentioned.

I found that even when we adjust for the size of their existing UK export market, countries whose data protection regimes are approved as ‘adequate’ by the European Commission had higher rates of data transfers. This indicates that easing legal restrictions on cross-border transfers does indeed positively correlate with a higher number of transfers (although the direction of causation can’t be established). I was asked by the organisers to produce a graphic to illustrate the findings, so I’m sharing that below.

[Infographic: destinations of UK international personal data transfers]
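For anyone curious about the mechanics, the adjustment described above amounts to something like the following – a schematic sketch with a hypothetical file and column names, not the actual analysis code or data from the paper:

```python
# Schematic sketch: compare transfer counts per destination country, normalised
# by the size of the existing UK export market to that country, split by whether
# the European Commission has granted the country an 'adequacy' decision.
# The CSV file and column names are hypothetical.
import pandas as pd

df = pd.read_csv("transfers_by_country.csv")
# expected columns: country, transfers, uk_exports_gbp, adequate (True/False)

df["transfers_per_bn_exports"] = df["transfers"] / (df["uk_exports_gbp"] / 1e9)

summary = df.groupby("adequate")["transfers_per_bn_exports"].mean()
print(summary)
# Higher export-adjusted transfer rates for 'adequate' countries would indicate
# the correlation described above; it says nothing about causation.
```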

What do they know about me? Open data on how organisations use personal data

I recently wrote a guest post for the Open Knowledge Foundation’s Personal Data and Privacy Working Group. It delves into the UK register of data controllers – a data source I’ve written about before and which forms the basis of a forthcoming research paper. This time, I’m looking through the data in light of some of the recent controversies we’ve seen in the media, including care.data and the construction workers’ blacklist fiasco…

Publishing this information in obscure, unreadable and hidden privacy policies and impact assessments is not enough to achieve meaningful transparency. There’s simply too much of it out there to capture in a piecemeal fashion, in hidden web pages and PDFs. To identify the good and bad things companies do with our personal information, we need more data, in a more detailed, accurate, machine-readable and open format. In the long run, we need to apply the tools of ‘big data’ to drive new services for better privacy management in the public and private sector, as well as for individuals themselves.

You can read the rest here. Thanks to the OKF/ORG for kick-starting such interesting discussions through the mailing list – I’m looking forward to continuing them at the OKF event in Berlin this summer and elsewhere. If you want to participate, do join the working group.