OpenRefine – an experiment in data cleaning

Photo by r2hox (Flickr/Creative Commons)
In a recent blog post on Northern Ireland’s Renewal Heat Incentive (RHI) scandal [here] I spent quite a bit of time recording all of the changes, tweaks, and decisions I had to make to get the data into a usable format. With any dataset it is important to understand the transformations that went into bringing it to its final form. If other researchers are unable to follow your process and consistently achieve the same results from the same dataset it brings your analysis into question. Beyond that, it brings the whole endeavour of data science and data analysis into disrepute. If you can’t rely on the figures to tell a consistent story, you can’t make consistent decisions, and you can’t gain reliable insights. You certainly can’t trust the folks who are furnishing you this flawed and unreliable nonsense. If you can’t rely on the information you’re seeing on your dashboard, what is it other than a collection of interesting, but meaningless, colours and shapes?

While this should be a consideration for any dataset you look at, it’s particularly apposite in this instance. Even leaving aside the fact that the beneficiaries of the botched scheme went to the High Court to try and suppress it, it’s a dataset that essentially brought down the Northern Ireland Assembly, redrew the political map here, and will have long-lasting impacts on how we are governed.

The other thing that was on my mind was the fact that the Department of the Economy’s website said that they were preparing a similar dataset of the private individuals who received money from the scheme and that both lists would probably be updated on a six-monthly basis. If there is sustained interest in this scandal (and the dataset), I’ll probably be quite keen to keep the dashboard up to date. The problem here is, of course, that I’m peculiarly lazy and the thought of endlessly repeating all of my carefully outlined steps with each new release of data just fills me with dread.

There is a whole industry dedicated to data cleansing, and the brightest light in this firmament is Alteryx. I’ve seen these guys demo their product on a few occasions and it is simply stunning. Their website is littered with the logos of all the huge, instantly-recognisable corporations they serve. Understandably, their fee structure is commensurate to the work they do. Equally understandable is the fact that this is way beyond the financial reach of a blogger like me, so I need to find something else. I’ve been looking around at a number of more affordable options when I happened to fall upon OpenRefine (formerly Google Refine). You can find out all about it on their web page [here] where you can buy a book and watch some training videos. My initial reaction to the product is that it is really powerful and probably would require a significant expenditure of time and mental energy to get the fullest out of it. That said, I watched the three training videos on their site twice (the second time, taking notes). That has been the entirety of my training with the product. It was still sufficient to work through the list of changes I’d made to the original dataset and in about the same amount of time it originally took to complete. I’d particularly note the automated clustering feature that independently noted a number of the near identical Business names I’d spotted manually. It also discovered a couple that I’d missed the first time round.

But here’s the good bit – it’s not just done for now and the next time the Department make an updated version I’ve got to do it all again. OpenRefine allows the user to export the actions taken as a JSON script. It not only allows you to run through the process you created in a matter of seconds, it is easily publishable and can be reviewed by other researchers to ensure consistent, robust datasets and analyses.

It is, of course, early days, but I already feel that OpenRefine will become a permanent fixture of my data analysis. One thing I am certain of is that when the Department of the Economy publishes the next dataset of RHI beneficiaries, I’ll be ready, knowing that my inherent laziness is no impediment to decent quality data analysis!

In the meantime, check out OpenRefine [ Webpage | Facebook | Twitter ]
If you want to replicate my analysis or error check my process, I’ve made the JSON file available [here]

[... a little time passes ...]

The above blog has been sitting on the server for about a week or so and I've been toying with the idea as to when would be the best time to publish it. The thing that has been on my mind is that it's lovely (if a but gushy) and it doesn't have any follow up in terms of how repeatable the process is. Well ... I was browsing the internet yesterday afternoon and happened upon the Department of the Environment's website [here] (Yep! I'm totally Rock 'n' Roll). They note that the previously available list had a number of errors that they've now fixed and that the list has been republished. In their accompanying list of errata [here], it's clear that these are not big changes - Green Energy Engineering had seven boilers, not nine; Stephens Catering Co Ltd had only one boiler, not five; and two companies with three boilers were added to the list. While there are significant amounts of money involved, the changes to the dataset aren't huge. This is also the situation I had dreaded - the idea of starting over from scratch having to make the same edits and changes over and over for little reward ... it was just! so! dull!

My first instinct was to simply go to the excel sheet and do the edits manually, but then I thought: "OpenRefine! Let's see how well this works!" ... I downloaded the PDF (come on Dept of Economy! ... it data - publish it as a spreadsheet ... how is this difficult to understand?), converted it to Excel (thank you Smallpdf), started OpenRefine, opened the sheet, found the document where I'd saved off the JSON of the process, hit Apply > Perform Operations. Boom! it was done! Two hours of work last week reduced down to less time than it takes to type this sentence! ... Actually, it was done in less time than it takes to read this sentence! After that it was simply a case of exporting it out & badda bing, badda boom updated Tableau dashboard. Seriously, what is not to like?

The original blog post is [here] and the updated dashboard is below (or on TableauPublic if you have any difficulties [here]). You won't notice much change in the actual data (the scheme has already cost us over £27m and the number of boilers has only dropped from 869 to 866), and you certainly won't notice how easy OpenRefine made the process ... and that's as it should be!




Comments

Popular posts from this blog

Renewable Heat Incentive (RHI) non-domestic beneficiaries: an interactive analysis of the data

Follow the money! A Visual Analysis of the Community Hall Grant Scheme, NI

Just a little glimpse of a data story I’ll tell ’Bout a North Country singer that you all know well