OpenRefine – an experiment in data cleaning

Photo by r2hox (Flickr/Creative Commons)

In a recent blog post on Northern Ireland’s Renewable Heat Incentive (RHI) scandal [here] I spent quite a bit of time recording all of the changes, tweaks, and decisions I had to make to get the data into a usable format. With any dataset it is important to understand the transformations that went into bringing it to its final form. If other researchers are unable to follow your process and consistently achieve the same results from the same dataset, it brings your analysis into question. Beyond that, it brings the whole endeavour of data science and data analysis into disrepute. If you can’t rely on the figures to tell a consistent story, you can’t make consistent decisions, and you can’t gain reliable insights. You certainly can’t trust the folks who are furnishing you with this flawed and unreliable nonsense. If you can’t rely on the information you’re seeing on your dashboard, what is it other than a collection of interesting, but meaningles...