Barclays was in the news recently with regard to ‘anonymity’ and a potential big data solution. The anonymity of data is without doubt a big deal thanks to big data: it’s a substantial industry with an extensive body of scientific papers, and it can take a tremendous volume of analysis to effectively anonymise even a few hundred megabytes of data. This is due to the sheer number of cross-comparisons that might need to be done, especially for tables of data with hundreds or thousands of columns.
Big data and anonymity is a fascinating topic and at one time or another most people have wondered if, for example, Google can work out who they are and what they've been doing (it can). While the average person perhaps doesn't mind Google or Facebook knowing things about them, would that still be the case if they thought their data was being sold on to less scrupulous companies?
Personal health data being passed (sold) on has also been in the press again recently. Names might have been removed from such data, but it is not such a long stretch to imagine that an insurance company that buys it might be able to put two and two together by matching it with other data items in their possession.
Riskcare teams sometimes build risk systems away from clients’ sites and are able to advise on regulatory standards and on adhering to best practice with regard to anonymised data.
All of the big cloud companies gather an astonishing number of metrics on how people interact with their systems: how long they were on a page, their current location, what search terms they’ve used, the previous few sites they’ve visited (by looking at cookies) and potentially how long the mouse pointer hovered over an article or advert. Risk system exports can likewise run to many columns in their .csv files. How many pieces of information does it take to uniquely identify a specific row? It turns out to be surprisingly few; blame science and Hadoop.
The key concept here is ‘k-anonymity’. This concept was coined in 2002 and describes how well a row of data hides in the crowd: a data set is k-anonymous if every combination of identifying information (think columns in a spreadsheet) is shared by at least k rows, so no row can be narrowed down to fewer than k candidates. It’s written as ‘2-anonymity’ or ‘3-anonymity’ etc.
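As a minimal sketch, k can be measured as the size of the smallest group of rows that share the same values in the identifying columns, which is the usual formal reading of k-anonymity. The table, the column names and the k_anonymity helper below are all hypothetical.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows sharing the same identifying values."""
    groups = Counter(tuple(row[column] for column in quasi_identifiers) for row in rows)
    # An empty table has no groups at all; report 0 rather than raising an error.
    return min(groups.values(), default=0)


# Hypothetical table: 'age' and 'postcode' are treated as the identifying columns.
rows = [
    {"age": 34, "postcode": "AB1", "condition": "flu"},
    {"age": 34, "postcode": "AB1", "condition": "asthma"},
    {"age": 51, "postcode": "EF3", "condition": "flu"},
    {"age": 51, "postcode": "EF3", "condition": "diabetes"},
    {"age": 51, "postcode": "EF3", "condition": "flu"},
]

k = k_anonymity(rows, ["age", "postcode"])
```

Here the smallest group (age 34, postcode AB1) contains two rows, so the table is 2-anonymous on those two columns. Add 'condition' to the list of identifying columns and the result drops to 1, because some rows become unique.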
K-anonymity is improved by either suppressing data (e.g. using ‘*’ instead of data) or by generalisation (e.g. instead of exact ‘age’, use a range such as 20-30). This works to a degree, but falls down eventually due to ‘the curse of dimensionality’.
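Both techniques can be sketched in a few lines. The rows, the ten-year band width and the helper names below are assumptions chosen purely for illustration; real anonymisation tools pick which columns to generalise or suppress far more carefully.

```python
from collections import Counter


def generalise_age(age, width=10):
    """Replace an exact age with a range, e.g. 23 becomes '20-30' (a boundary age such as 30 falls in '30-40')."""
    low = (age // width) * width
    return f"{low}-{low + width}"


def anonymise(row):
    """Suppress the name with '*' and generalise the age into a range."""
    return {**row, "name": "*", "age": generalise_age(row["age"])}


def smallest_group(rows, columns):
    """Size of the smallest group of rows sharing the same values in the given columns."""
    groups = Counter(tuple(row[c] for c in columns) for row in rows)
    return min(groups.values(), default=0)


# Hypothetical raw data.
rows = [
    {"name": "A", "age": 23, "town": "Leeds"},
    {"name": "B", "age": 27, "town": "Leeds"},
    {"name": "C", "age": 31, "town": "York"},
    {"name": "D", "age": 38, "town": "York"},
]

anonymised = [anonymise(row) for row in rows]

k_before = smallest_group(rows, ["age", "town"])
k_after = smallest_group(anonymised, ["age", "town"])
```

With exact ages every row is unique on age and town, so k_before is 1. Once the ages are banded, each row shares its values with one other row and k_after is 2.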
Although this sounds like the title of a Doctor Who episode, it’s a real thing. The data in a single table will probably have hundreds of columns and millions of rows. Some of the rows and columns will be very specific to certain types of data, so the more columns of data there are, the more empty cells there will be and the less well the above techniques work. This means that, alongside Hadoop/big data analysis, we will soon be able to pinpoint anyone.
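One aspect of this can be seen with a small simulation. It does not model empty cells; it only shows the related effect that each extra column makes rows easier to tell apart. The data is random and entirely made up (2,000 rows, 12 columns, five possible values per column), and unique_fraction is a hypothetical helper.

```python
import random
from collections import Counter


def unique_fraction(rows, n_columns):
    """Fraction of rows that are unique when only the first n_columns are compared."""
    counts = Counter(tuple(row[:n_columns]) for row in rows)
    unique_rows = sum(1 for row in rows if counts[tuple(row[:n_columns])] == 1)
    return unique_rows / len(rows)


# Made-up data: each cell holds one of five possible values.
rng = random.Random(0)  # fixed seed so repeated runs see the same table
rows = [[rng.randrange(5) for _ in range(12)] for _ in range(2000)]

fractions = [unique_fraction(rows, n) for n in range(1, 13)]

for n, fraction in enumerate(fractions, start=1):
    print(f"{n:2d} columns: {fraction:.1%} of rows are unique")
```

The fraction can only go up as columns are added, because a row that is already unique stays unique however many more columns are compared. Suppression and generalisation have to fight that trend across every column at once, which is why they run out of road on very wide tables.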
In fact, just four pieces of information are enough to identify an individual with 95% certainty. Four. That's how hard anonymisation is. Armed with only four facts, Apache Spark could whittle through all the UK phone records and pinpoint said person.
Researchers at MIT have recently built a system called Chronos, and with only a single IP address they can geographically locate a person to within 4 cm in real time. This used to require four or five WiFi base stations and a chunk of processing time, but by using different IP packet routes and normalising the switches and routing delays, combined with a super-accurate clock, they say they can tell with 97% certainty whether a person is inside or just outside a coffee shop.
This is great news for cafés that want to limit free WiFi to customers inside, but not so great for personal privacy. The system is also sufficiently responsive to ensure that a drone under its control never gets closer than 4 cm to a human body.
It also follows that mobile phone networks can now tell which room of a house a person is in. But don’t worry, I’m sure the data will be anonymised…