Posts

OpenRefine – an experiment in data cleaning

Photo by r2hox (Flickr/Creative Commons)

In a recent blog post on Northern Ireland’s Renewable Heat Incentive (RHI) scandal [here] I spent quite a bit of time recording all of the changes, tweaks, and decisions I had to make to get the data into a usable format. With any dataset it is important to understand the transformations that went into bringing it to its final form. If other researchers are unable to follow your process and consistently achieve the same results from the same dataset, it brings your analysis into question. Beyond that, it brings the whole endeavour of data science and data analysis into disrepute. If you can’t rely on the figures to tell a consistent story, you can’t make consistent decisions, and you can’t gain reliable insights. You certainly can’t trust the folks who are furnishing you with this flawed and unreliable nonsense. If you can’t rely on the information you’re seeing on your dashboard, what is it other than a collection of interesting, but meaningles...

Renewable Heat Incentive (RHI) non-domestic beneficiaries: an interactive analysis of the data

Screenshot of the Tableau Dashboard. Available [here] and at the end of this post. (Updated: see notes at end)

After much legal wrangling and foot-dragging, the Northern Ireland Department for the Economy have finally published a partial list of recipients of money from the botched Renewable Heat Incentive scheme. At present, only limited companies and limited liability partnerships that received in excess of £5,000 (cumulative) are listed. The data runs from the start of the scheme to 28 February 2017. After the list was published (16 March 2017), a number of people complained that they should be treated as individuals, and not as limited companies. These corrections were made and a second list was issued the same afternoon. The dataset used here is based on this second list. The first thing I want to note about the document made available by the Department for the Economy is that it is presented as a PDF. This is data! To analyse data you need it in a suitable format su...

Finding your Public Amenities in Northern Ireland

Screenshot of the Tableau Dashboard. Available [here] and at the end of this post.

I was recently thinking back to last year’s ODI Belfast unConference where someone mentioned one of the smallest (and, by implication, the least useful) datasets available on their website [here] – the list of 22 Bowling Pavilions in Belfast City. The question was ‘what can you possibly do with a list of 22 bowling pavilions?’ I seem to remember the suggestion that they might be useful in the event of a Zombie Apocalypse where they could be put to use as holding centres for the contaminated. I’m not a huge fan of the Zombie genre, but even I know that this approach will not end well. So … what do you do with a list of 22 Bowling Pavilions? This is a question that has been ticking away in the back of my head for a while now (wow … I really don’t have the richest internal life, do I?) … how do you use this kind of data in a useful way? Sure … I can build you a dashboard that shows all of...

European Testate Amoeba dataset: an interactive visualisation

Screenshot of the Tableau Dashboard. Available [here] and at the end of this post.

I’ve recently been working with Dr Graeme Swindles (University of Leeds) and Dr Matt Amesbury (University of Exeter) on producing an interactive visualisation based on their European Testate Amoeba dataset. Testate amoebae are microscopic, unicellular shelled protozoa that are abundant in a range of wetlands, including peatlands. Study of fossil testate amoebae allows for the reconstruction of ancient hydrological variability. Amesbury and Swindles were the lead authors of a group that published ‘Development of a new pan-European testate amoeba transfer function for reconstructing peatland palaeohydrology’ in Quaternary Science Reviews (Volume 152, 15 November 2016) [ https://goo.gl/15n4QW ]. They placed their dataset in the public domain under a Creative Commons licence, allowing others access to reanalyse and re-examine their data, and incorporate it into future research. The dataset is...

Notifiable Infectious Diseases Reports (NoIDs) Northern Ireland | Trends & Predictions

Screenshot of the Tableau Dashboard. Available [here] and at the end of this post. (Updated: see notes at end)

The OpenDataNI website holds Notifiable Infectious Diseases Reports, provided by the Public Health Agency. The available data runs from Week 50 2014 to (at the time of writing) Week 50 2016. It is a relatively simple dataset, reported weekly, giving the numbers of occurrences of some 35 Reportable Diseases in Northern Ireland. The individual reports are based on figures recorded by the Duty Room and Surveillance. The reports also indicate that: “Food poisoning notifications include those formally notified by clinicians and reports of Salmonella, Campylobacter, Cryptosporidium, Giardia, Listeria and E Coli O 157 informally ascertained from laboratories.” What I wanted to do with this dataset was examine a few different ways that the selection of timescales can result in different understandings of the data and how it can be used to examine trends an...