About this talk
Sifting through big data can provide fantastic insights into customer behaviour. However, beyond MapReduce and rigorous scientific methodology, there are often interesting insights to be gleaned from the outliers. Zooming into small data can reveal either bugs in your big data scope or oddball user behaviour. Both can be nuggets of gold. From the Iraq War Logs to the FIFA Files, Nicola Hughes, an investigative data journalist turned developer, will go through the data principles learned from digging deeper into data: how to isolate human behaviour, why to be distrustful of your data, and how to interrogate your data like an investigative journalist. Ultimately, deciphering a bug from a story.
This is a story from my time at CNN, my first job out of university, which led me to realise that there are things in data, and ways to use the web, to find stories better than a traditional beat journalist would. It starts off with a series of conversations. These conversations were in the news gallery. What you see on screen is a news studio. You would think there are people behind the cameras, but the only person in the studio is the person on camera. There's a room behind from which someone controls the cameras, which are all on trolleys, all robots, plus audio and sound. Everything is really wonderfully done. It's a fantastic place to be, especially when things are going on. I was in the gallery because my first job was to turn the dial of the autocue. It was basically that. You could train a monkey to do it, or probably programme it. It was my first job out of university, it was the recession, and I was glad to be paid. In the gallery we had some very interesting characters. The people who worked in the sound studio really liked music. They really liked modern music, dance music, electro. They also went to all these parties, and they would talk to each other about the parties and the raves they went to. They would also talk about the new names and the different cocktails of drugs that were being made available. This was all news to me, these words and everything. It was a completely different space. But I was hearing about it, and I decided to experiment, because being a journalist in a studio, the only thing I had in front of me was a computer, as well as the knob. And when I wasn't turning that, I was on the computer. At that time Twitter was very nascent. I was listening in and thinking: I don't need to be in the middle of this. I can just listen in to what people are saying. But it's a two-way system. I can speak out, right? And I don't need to be me.
So I made Twitter accounts and verbatim copied these conversations that were going on about these events. People started following me. I started following them, and I got Twitter accounts that were involved in this community of underground raves and drug parties. This led to something I found out about called squat parties. At the time of the recession there were loads of buildings left abandoned, and what people realised was that if you put a squat notice up, you can get a thousand people there. You get a massive stage, you get a massive sound system, you bring all of your recreational drugs, and you have a massive party there, because when the bobbies on the beat turn up, to get through the door they have to go through a legal process to break that squat notice. There was that friction, and there were thousands of people raving, so they weren't going to call the riot squad. They just left it be. That's what I was hearing about on Twitter at the time. I also volunteered at a homeless shelter, and one of the accounts I was following, sometimes he spoke French, sometimes he spoke English, was basically complaining about how burnt out he was. He was like, I need someplace quiet. I can't keep on doing this, I'm totally frazzled. So one of my Twitter accounts sent him a link to this homeless shelter where he could go and get a sound night's sleep. Then I turn up at the homeless shelter. I was in no way linkable to that Twitter account, and there are two people speaking French. So I go up to them, I talk to them. They tell me about their nights out and they're like, give us your details and we'll let you know when this is going on. And I'm like, sorry, you know the deal here. I can't give you my details. All of them have smartphones, they've got tablets. So he takes out his smartphone.
He actually takes my smartphone and he puts in a link, and this link is to a newsletter for a company that organises rave parties. It's a legitimate company which I cannot name. He said, go sign up for the newsletter, but don't put in your email, put in your phone number. You put in your phone number, you get a text message, and you need to be there within three hours. That's how this is done. So I have a website. Because I have a website, I have an email: I do a whois lookup. For any website, you can do a whois lookup and it will give you the email of the owner. So now I have an email. What I do with this email is create my own email account, Gmail, Yahoo, I always create a brand new one, not my own, and I put one contact in the contacts list, and that's this person's email. With this new email I sign up to LinkedIn. When you sign up, they try to onboard you as quickly as possible and let you connect with all your friends by asking if they can have access to your contacts list. That's a backdoor into their database, so I said yes. And they find this individual, my only contact, on LinkedIn; an email is a unique identifier. I found him, I found the company he worked for. Not the website and the company I was given. No, the company that he works for, as a technologist and architect, is very much connected with the ATMs here and money transactions. I look up that company and the company's accounts. At that point in time I had also been looking through government contracts, and I remembered that company appearing in my list of government contracts. The company that he works for helps run our PA boarding system. So I got a very, very interesting connexion and story from just sitting in a newsroom, from seeding out data and tracing it back. I'm going to go through some of my journeys and my stories after that, once I got to the Guardian, once I got to the Times, the Sunday Times.
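As a rough sketch of that whois step, assuming the `whois` command-line client is installed: the lookup itself is one subprocess call, and pulling the owner's email out of the response is a regex. The domain and the response layout below are invented for illustration, and modern registrars often redact the email field.

```python
import re
import subprocess

def whois_raw(domain):
    """Run the system `whois` client and return its raw text output.
    (Needs the `whois` binary and network access.)"""
    return subprocess.run(["whois", domain], capture_output=True, text=True).stdout

def registrant_emails(whois_text):
    """Pull every email address out of a raw whois response."""
    return sorted(set(re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", whois_text)))

# Canned, made-up response instead of a live lookup:
sample = """Domain Name: example-raves.co.uk
Registrant Contact Email: owner@example-raves.co.uk
Admin Email: owner@example-raves.co.uk"""
print(registrant_emails(sample))  # ['owner@example-raves.co.uk']
```

In practice you would call `whois_raw("example-raves.co.uk")` and feed its output to `registrant_emails`.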
It is going to be interactive, because in many ways I coax out stories, and sometimes they aren't stories. So that story is not important: I had no proof there was any wrongdoing, and I cannot name an individual unless they are paid for by the taxpayer through government services. So I found something, but I couldn't act upon it until I got more information. I'm going to go through some of my case studies from my work as a data journalist, and I'll ask you: do you think it's a bug? Which means not whether it's true or not, but is this something you would act upon? Is this something you would make a business decision upon? Is this something you would publish and feel you are okay to publish? If not, it's a bug. Or is it a story? So without further ado, one of the last things I worked on was the FIFA Files, about the FIFA 2022 bid, which was a big splash. This was leaked emails from a person called Mohammed bin Hammam, talking about things and favours they could do in order to gain votes. We got access to around 700 emails. These were in the form of locked PDFs. I programmatically unlocked them, changed them to text and did some text mining. The from fields, the BCC and the CC gave me emails and names, and I made this network diagram. I cannot make this any bigger, and I cannot show you the names. In the middle is Mohammed bin Hammam's secretary, who would have sent all these emails. The square is Sepp Blatter. After this there was a series of stories and exposés, and that led to the resignation of Sepp Blatter from FIFA. What I want to ask you is: is it legitimate to publish this information I got, this thing I made? Let's say it wasn't so ugly, let's say I made it visually enticing and interactive; here it's just a PDF and you can't read the names. I would like hands up: who thinks this is legitimate? This is a legitimate piece of data, this is a legitimate story, you would publish this? Hands up, hands up, hands up, interactive.
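A minimal sketch of that text-mining step, with made-up addresses: pull every address out of the From/To/CC/BCC header lines, then count how many messages each address appears in. The address with the highest count is the node that ends up at the centre of a diagram like this one.

```python
import re
from collections import Counter

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def header_addresses(raw_email):
    """Collect addresses from the From/To/CC/BCC header lines of one message."""
    found = []
    for line in raw_email.splitlines():
        if re.match(r"(From|To|CC|BCC):", line, re.IGNORECASE):
            found.extend(EMAIL.findall(line))
    return found

def degree_centrality(messages):
    """Count, for each address, how many messages it appears in."""
    counts = Counter()
    for msg in messages:
        counts.update(set(header_addresses(msg)))  # dedupe within a message
    return counts

# Two invented messages standing in for the unlocked PDFs-turned-text:
msgs = [
    "From: secretary@example.org\nTo: delegate1@example.com\nCC: delegate2@example.com\nhello",
    "From: secretary@example.org\nBCC: delegate3@example.com\nfavours",
]
print(degree_centrality(msgs).most_common(1))  # [('secretary@example.org', 2)]
```

The bug discussed next applies here too: the counts only reflect the mailbox you have, not the whole network.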
Who thinks there's a bug? Who thinks there's something not quite right with this? All hands have to go up at some point. Why interactive? I'm not going to do anything, right? This is a bug. It's a bug because it's partial data: the source had access to an Outlook server and pulled the emails of one person. I don't have the emails of all the other people in that network. Let's say I did; it could be a very, very different picture. There could be a different stem, there could be different branches. It was a legitimate way to look through the data. It was a legitimate way to see how things are connected and what that individual did. But a picture like that could be too damning, because of the visualisation. You're immediately drawn to the centre and go, ha, that's the person. But that might not be the centre when you get the entire picture. So, partial data. Always think: do I have everything? You will always have things that are missing. Can you quantify that? It's the unknown unknowns, and you will always have something missing. Unless you are very, very sure you can quantify that it is a safe amount, you should be very, very careful with how you tread. This next one is from the Guardian, and it has to do with U.S. food aid. This was again at the height of the recession. The U.S. budget had been slashed, absolutely slashed. Everyone was cutting their budget. But U.S. food aid was ring-fenced. It was said that we are not slashing this because it bolsters our local business. It helps our farmers. They're going to be here through the recession in terms of supplying food. So we need this; it's going to help the economy. What we did was make it interactive, after getting all of the procurement contracts of who was supplying what food. We got that. I also tracked down the FTP site for shipping reports, on which people making shipments could double-check that their container was on the boat going to their destination.
That was an open site, so I was able to get on and search through it for each and every individual shipment of food aid, having to match them up with the procurements, because they didn't all go in the same container on the same ship. I had to do some algorithmic matching to see where they came from and their ultimate destination. So we made it interactive, having gone through CSV files, PDFs and regular text files. Knowing all that has been a process. Hands up, who thinks that this is a legitimate story? Hands up, legitimate? Who thinks there might be a bug in this, might be something up? I see this as a legitimate story, mostly for one reason: this is data that we gathered ourselves, which a lot of people are fearful of doing. They want something supplied and secured by someone else. That being said, we didn't have the exact data, but we had a metric to gauge by. The government itself did not log where the things were going, who was getting most of the money, where it was going and for what type of food. But they did have a budget. They did say, this is how much we're going to give in food aid this financial year. I was able to aggregate our numbers, from chasing all the individual shipments, and compare that total to what they said. It was off by around 10%. So I had another metric through which to gauge what we had lost, our unknown unknowns. And that was legitimate. If you are gathering your own data, if you're doing your own metrics, let's say it's analytics, always go back to whoever's running it and say: does this make sense? Do these numbers make sense? How much traffic, as a percentage, do we get every day? How much do we get every week? If it's off by a factor of 10, you're missing something. So, one: always double-check against another metric. Two: the interesting thing that came from this story was that we named agribusinesses, we named companies. So they came down on us hard.
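That cross-check against the official budget is worth having as a habit, something like the sketch below. The figures and the 10% tolerance are invented; the point is comparing a self-gathered aggregate against an independent metric.

```python
def within_tolerance(our_total, official_total, tolerance=0.10):
    """Compare a self-gathered aggregate against an official figure.
    Returns the relative gap and whether it is inside the tolerance."""
    gap = abs(our_total - official_total) / official_total
    return gap, gap <= tolerance

# Hypothetical numbers: our shipment-by-shipment total vs the stated budget.
gap, ok = within_tolerance(our_total=1.35e9, official_total=1.5e9)
print(f"off by {gap:.0%}, acceptable: {ok}")  # off by 10%, acceptable: True
```

If the gap came back as a factor of 10 rather than 10%, that's the signal you're missing data, not that the story is bigger.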
We had to go to them for comment before this was released, and the first thing they did was get their lawyers onto us and try to squash the story. Even if what we have done is legit, it's a tactic they use because it scares people and it costs money. They also think that if this is not something that is readily available, if we've done a lot of work to get it, they will make us do that work again. That will cost us time and money, and they can delay things or at least make it difficult enough that we'll give up. For anyone who was in my workshop: I work off a virtual machine, and all of this I did programmatically, in an open environment. So I was like, whoever wants to sue us, I will send them not just my programme or my files of what I did, because I'd taken a copy of the website with a date timestamp and everything, and I work off that in chunks. I can give them the whole environment with my run-down, my one little file with every step that I've done, completely transparent. They can spin it up and run it in two minutes. Knowing that, they backed down. They stopped threatening to take us to court and we published. That was a very interesting lesson to learn in terms of reproducibility and being able to back up what you've done. If you have made anything, if you've used data to make business decisions, and you can't reproduce that, getting the same decision even on half the data, you're not doing things right. So that was a lesson learned. Prescriptions. Again, please don't judge this by the title of the story. It's not a thing I'm necessarily proud of. But because we have the NHS in this country, we have a single source of medical data. It isn't split across different companies and different formats. We can get things clean, and one of the best, cleanest datasets is prescriptions. Every GP you go to is registered with the NHS. You get the same sort of file. And whenever you go to a chemist, they log it.
That is open data. It doesn't have your name on it, but it has certain breakdowns. This is every single prescription, which is around two gigs every single month. Me being a new computational journalist starting at the Times, this was probably the first story I worked on. We knew this was really good data, but by the time I got on board it had already been out for about a year. And the first thing we had to ask was: what did the Guardian publish on this? Had anybody published anything before? They always wanted to know what somebody else did. And no one had published anything on it. I found out why. A two-gig file is too big to open in Excel, especially if you're working off an old Microsoft machine. None of them could open the data; it's too big. They didn't have enough RAM. So I programmatically read it line by line. I also joined it with other datasets. The prescriptions dataset has codes and numbers for the chemical components inside the prescription; it doesn't actually have the name of the drug. However, there's another file that takes each chemical component code, details what the chemical component is, and says what area it's prescribed for: antidepressants, anticoagulants, ADHD, erectile dysfunction, all of the different things. So I was able to parse through 60 gigs' worth of data, pulling out individual types of medication. Along with each individual prescription was a code for the GP surgery it came from. They have to be registered; they have a unique code. And there's another piece of data which breaks down the demographics for each and every GP surgery: number of men, number of women, number of children, breakdown by age groups. So I could normalise the data by the type of people who would need each medication. Having found this out, I went to the editors and was like, this is fantastic. We can do things.
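The line-by-line approach might look like this. The file names and column layouts are hypothetical stand-ins for the NHS extracts; the point is that streaming with `csv.DictReader` never holds the two-gig file in memory, and the join to the chemical-name lookup happens one row at a time.

```python
import csv

# Hypothetical layouts, loosely modelled on the prescribing extracts:
#   prescriptions.csv: practice_code,chemical_code,items
#   chemicals.csv:     chemical_code,name,category

def load_lookup(chemicals_path):
    """Small file: load the chemical-code -> category lookup into memory."""
    with open(chemicals_path, newline="") as f:
        return {row["chemical_code"]: row["category"] for row in csv.DictReader(f)}

def totals_by_category(prescriptions_path, lookup):
    """Big file: stream it row by row and total prescription items
    per drug category, never loading the whole thing at once."""
    totals = {}
    with open(prescriptions_path, newline="") as f:
        for row in csv.DictReader(f):
            cat = lookup.get(row["chemical_code"], "unknown")
            totals[cat] = totals.get(cat, 0) + int(row["items"])
    return totals
```

The same pattern extends to the demographics join: one more small lookup keyed on the surgery code, applied per row.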
We can pull out the ADHD drugs, drugs for heart attacks, even erectile dysfunction. They were like, erectile dysfunction, that's what we want. I was like, no, no. But we were able to isolate, down to GP surgery, the areas which actually had the largest prescription rates of erectile dysfunction drugs, which is why they wanted to run with the headline. Oh God, it wasn't my proudest moment, but the way in which I got it was really interesting. So, that being said: this is clean data. It is linked data. It was medium-sized data, too big for Excel, not too big for a machine. How many people, hands up, feel that this is a story? Hands up, hands up, hands up. How many people think there might be a bug in this? Yeah, see, this always gets everyone. There is a bug in this, and the clue was GP surgery. A GP surgery can be incredibly small and incredibly focused. There is a GP surgery for the Yeomen of the Guard, who guard the crown jewels. There are GP surgeries inside some office buildings. We had it that granular. It just so happens that Her Majesty's guards are made up mostly of old men, so they were number one in terms of erectile dysfunction needs and prescriptions. Hence they wanted to run with the headline 'Her Majesty's Limp Service'. However, we realised there was a bug. We had deanonymised this data. This surgery was way too small: you could find individuals from it. And we fought and we fought, not necessarily to get the story squashed, but to get the data reaggregated at constituency level, because we had actually ended up getting it too granular. So if you have anonymised data and your analysis deanonymises it, you cannot legitimately work off that data. You really, really cannot. That was a very, very interesting bug and a very hard lesson learned. The next is about predictions. This is all census data: large, consistent, quite clean and trustworthy, because it is a full census.
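Re-aggregating to constituency level is mechanically trivial; the hard part was deciding to do it. A toy sketch with invented surgery codes and counts, rolling surgery-level figures up to a coarser unit so that tiny surgeries can no longer identify individuals:

```python
from collections import defaultdict

def to_constituency(per_surgery_counts, surgery_to_constituency):
    """Roll surgery-level prescription counts up to constituency level."""
    rolled = defaultdict(int)
    for surgery, count in per_surgery_counts.items():
        rolled[surgery_to_constituency[surgery]] += count
    return dict(rolled)

counts = {"G1": 2, "G2": 40, "G3": 17}  # G1 is dangerously small on its own
mapping = {"G1": "Cities of London", "G2": "Cities of London", "G3": "Hackney"}
print(to_constituency(counts, mapping))  # {'Cities of London': 42, 'Hackney': 17}
```

After rolling up, the two-patient surgery disappears inside a figure of 42, and no individual is findable from the published number.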
It's not a sample; it's a very detailed census. The census categories actually change every year, so we had to take the ones that were most consistent in their categorisation. And our data scientist made a prediction, from a regression analysis of the data we had, that by, I think, 2028 the number of non-U.K.-born residents in the U.K. is going to outnumber the U.K.-born residents. That being said, how many people think that is a legitimate story? Hands up, hands up, legitimate, legitimate. - [Female] I raised my hand up again. - How many people think there might be a bug in here? There's a bug in here, yeah. I hate, and refuse to publish, predictions. It's such a bad idea. Basically, the way I see it: there have been many news reports over the years taking Olympics data, sports data, and saying that by such-and-such a year women will be faster than men. If you look at it, it's fitting a line. It's fitting a line to data. So let's say we took the data up until 1945 and did a prediction on it. We would have predicted that women would outpace men by around 1950. And what if we took it from here? A prediction is fitting to the data. A census doesn't capture people's movement. It doesn't have a beginning and an end; it is continuous. You have to take a sample of it, and then you have to assume that it's been linear all along. But there are different behaviours. The times for male runners have plateaued, because to a certain extent they have almost exhausted all the things they can do in terms of training and in terms of food, and are waiting. The people that end up breaking these records do it in jumps and starts, because they are people who are really, biologically, fantastically unique. We see that in the high jump as well. Whereas women are at a different stage in the game; they can still push more. So it's different behaviour, and the U.K. born and the non-U.K.
born have a very different behaviour in how they live and settle in this country. So you cannot assume that they will have the same linearity, and you cannot assume there isn't anything that may be different. I hate predictions. If you're making predictions, they have to be near-field. If you're looking at predictions of football matches, you take the league that is currently running, because that has the players who are going to be in that league. That's something you do near-field. Doing the long view, what's going to happen in 20 or 30 years, is a very, very stupid idea. That being said, my next one has to do with algorithms: using algorithms to report news. This was for the general election. I know we have a new prime minister, but we haven't had a general election; this was for the last one. This was during election day, when we had the information coming in on how constituencies voted. We also had polling data that told you something about the type of people that lived there: their level of education, home ownership. And we used a random forest algorithm to build decision trees, not to predict, but to say which components, which factors and variables, were the biggest deciding factors in how people voted. You can clearly see, from the top: is the constituency in Scotland? Yes; it went SNP. But then it breaks down, which is a bit more interesting, into the Conservative-Labour divide, which actually had more to do with home ownership than employment rates: what people hold as assumed wealth rather than potential earnings. And we were able to come up with that the very next day after the election. That being said, it is algorithmic. Who here thinks this is a legitimate story, compared to the regression? Hands up? Who here thinks it's a bug, that you should never use advanced mathematics like this? I'm not going to tell people off, it's fine.
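To make the line-fitting trap concrete: fit a straight line to an early, perfectly linear slice of made-up winning times and it will cheerfully extrapolate into nonsense. A minimal least-squares sketch, with entirely invented numbers:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical winning times (seconds): the early slice trends down linearly.
years = [1900, 1910, 1920, 1930, 1940]
times = [12.0, 11.5, 11.0, 10.5, 10.0]
slope, intercept = fit_line(years, times)
# Extrapolating the fit to the year 2000 gives about 7 seconds --
# physically absurd, because the real curve plateaus.
print(slope * 2000 + intercept)
```

The fit is fine inside the data; it's the assumption of linearity outside the data that makes the prediction a bug.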
I think this is a legitimate story, because it is retrospective. It's also something you can put a human thought process to and ask: does that make sense? The first thing we saw was Scotland, SNP. Okay, that kind of makes sense, because human beings, and political correspondents especially, understand the political landscape. We can run things past them. With random forest algorithms, you need to put human decisions in there about what is a bit too granular. We got to a stage, working with our political correspondents, of what actually made sense. Interestingly, getting people to use something like algorithms to report was very difficult to sell to the newsroom. The first thing I did was go on Kaggle. I got the Titanic data and showed them the analysis, producing a random forest in R of the biggest deciding factors in whether someone on the Titanic lived or died. And we can see that it makes sense: it's class, and male or female, right? We kind of know that's the big story. After seeing that, they legitimately said, okay, we can try this. I've since left that team, but my editor told me that for the referendum the editors came back actually asking the team: please can you do this again? This is really, really popular; we want you to do this. We're interested in the deciding factors. So rather than predicting things, like saying if we do this our revenue will be X amount, where everyone always decides that whatever prediction model gives them the best revenue is the most correct, don't do that. Look at what's gone before. Look at what has led to one month being higher than another, and work on those factors. That being said, I was once in reporting, but now I'm in software, and I very much have this data mindset, this analytics mindset, and it can be applied to business as well. It can be applied to massive business decision-making.
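You don't need a full random forest to show the "biggest deciding factor" idea. A one-level sketch using Gini impurity on an invented Titanic-style table, in the same spirit as the splits a forest ranks at the top of its trees: the feature whose split most reduces impurity is the biggest deciding factor.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of outcome labels (0 = perfectly pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(rows, labels, feature):
    """How much splitting on one feature reduces impurity --
    a one-level stand-in for a decision tree's split ranking."""
    parent = gini(labels)
    weighted = 0.0
    for value in set(r[feature] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[feature] == value]
        weighted += len(subset) / len(labels) * gini(subset)
    return parent - weighted

# Tiny made-up table: sex separates survival perfectly, deck not at all.
rows = [{"sex": "f", "deck": "A"}, {"sex": "f", "deck": "B"},
        {"sex": "m", "deck": "A"}, {"sex": "m", "deck": "B"}]
survived = [1, 1, 0, 0]
print(gini_gain(rows, survived, "sex"), gini_gain(rows, survived, "deck"))  # 0.5 0.0
```

A real run on the Kaggle Titanic data would rank class and sex at the top for exactly this reason; the sketch just makes the mechanism visible and explainable to a layperson.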
Whatever you do, whatever you play with in technology, you have no idea how applicable it is to everything else. So don't give up, don't be impatient. Don't say, what on earth do I need lists of fruit for? What am I going to use lists of fruit for? Anyone who came to my workshop: you're going to use lists a lot, and you'll be amazed what you can use them for. That being said, I'm GirlMeetsCode on Twitter. That's my ThoughtWorks email if people want to contact me or talk afterwards. I've been told I can take questions. - I'm a purist; I don't like using software. I like doing everything myself. I really don't like using Tableau. I would rather programme it myself with something like D3. As in my workshop, it's mostly about control and reproducibility. I don't like doing drag-and-drop stuff any more. I will programme as much as possible by hand. - I actually have another talk I do called Data Science Bullshit. A big part of it is understanding when things aren't linear, and overfitting. What I say in that is that a lot of models use big data, and big data is this idea that if you take a telescopic view you don't see the cracks, and that's okay. Whereas I've actually learned to use data in the small to mid-field, microscopically. So I would say you always need to do sampling. The main thing is you need to find someone who can say whether this legitimately makes sense. The human mind is much better than any model. If you wrap it up in incredible, ridiculous terms like polynomial and machine learning and deep learning within your regression, a human isn't going to be able to clarify whether it makes sense or not. So you need to tell a story from it. You need to find individual cases. Rather than massive predictions of money, find something that predicts one thing. Tell it as a story and see if it makes sense to anyone. A lot of people fail to do that, and this is where you end up hitting bugs.
It's not that using big data means you can legitimately do things without bugs being there. Oftentimes it hides things. So always, always make sure it works on a granular level. Look deep inside it. If you know what you're doing as a data scientist, you should be able to communicate it to a layperson. That's really, really important. Don't wrap it in jargon. Make sure you run it by someone. If it doesn't make sense, your model doesn't make sense. - That's the million-dollar question. A lot of the time you should report it as correlation, because it's incredibly difficult to show causation. That being said, say you have a story that sounds like it's correlation and you find the data backing it up. A way to think about it with data: suppose a leak shows that an NHS trust has been performing badly and that they're fiddling their books. What happens is the trust points to a finance guy and says, that person hasn't been doing his job; it's not endemic in our group. We have fired this person, we are fixing it. That's one story, one leak from someone who's told you there's something not right going on. If you have only one story, they can legitimately do that. They can legitimately say it's one person. But then all of the figures show this correlation. If you then get the figures, and they correlate year in and year out, you can say there's something systemically wrong. That's more the journalistic side of things. So you confirm from two sides: the individual person's account of what they perceived as going on, and the data that backs it up. A lot of people look for data alone, and with data alone you really cannot say if it's causation or correlation. It's the same thing with unconscious bias as well. You get stories, you get people telling you.
You get people on Twitter recounting how they've been treated, and then you get the numbers as well. - [Facilitator] One more before we break. - First of all, you do your own filter. We learned quite early on not to bring things to them. I should not have mentioned erectile dysfunction: if they don't know they can have it, they won't ask for it. So with the predictions of the U.K. and non-U.K. born, I, like, whacked the data scientist. I go, why did you even do that? Such a bad idea. So a lot of it is just not bringing it to people in the first place. But there's always a legitimate question with what you publish, whether it's a financial report or a predictive model. Everyone can always question how much truth is in there. A lot of it is that we had to explain the maths to them, but not from on high. Not: you see this and this, I know bigger words than you, therefore I'm right. That's not a legitimate case. It's explaining to them: no, I know this is open data, but this is what I've done with it, and you know there are privacy laws. I'm sure you can find an individual here; this will get us sued. And there's to-ing and fro-ing over that. A lot of the time the legal aspect of things is a complete grey area. It was about standing your ground, oftentimes saving people from themselves. It's not just editors; it's business decision-makers as well. Oftentimes it's a whole group of people you're trying to save from themselves. So it is about having the knowledge of the data, having that pre-filter of what to bring to someone. Because it's not just what the data says; it's what people think the data says. It's about: what will this look like in the wild? As a person running the media, you can report a legitimate story, and it's true, but it sounds rather racist, or it sounds horrendously sexist, and the fact of the matter is these cases are highlighted, and what does it imply to the people that are reading it?
What does it look like in the wild? It's the same thing going from being a data scientist in an organisation and bringing it all the way up to the boardroom. You have to think about the Chinese whispers along the way. And sadly, that is your responsibility, because if they make a decision that turns out wrong, they will blame you. They will blame the advisers for what they were given. You've seen it happen, right? I do something wrong, so I blame the people who gave me the information, even though I didn't use it properly. It isn't just, I sit and I do numbers. There is definitely a responsibility side of things, a personality side of things, and you have to be part manager, part reporter and everything like that. Which, again, is why I find reading articles and interpretations by female data scientists really, really interesting, really consumable and quite approachable, in that they have a tendency to write about what they do in a way that makes it communicable and connectable. - [Facilitator] Thanks, great, thank you so much for that, Nicola.