First impressions of the ONS’s new beta data services

Robin Linacre
6 min read · Feb 10, 2019

I was inspired by a recent tweet to experiment with some of the ONS’s new beta data functionality. There’s a lot to be excited about: these new offerings are a real game changer in how government statistical data can be used in modern analytical workflows (of the kind I describe here). It also makes it much easier to create interactive web-based graphics like this in just a few lines of code.

Here’s the tweet:

The rest of this post is quite a technical look at these services from the perspective of a data scientist wanting to use the data for statistical/analytical purposes. It’s great that the team have released the functionality in beta for real users to try, and I hope this feedback helps them understand real-world usage.

What I aimed to achieve

I wanted to demonstrate the potential of using these data services in a RAP-style workflow — creating analysis that is driven directly from the source data, and updates itself as the new data is released.

I figured I’d see if I could use the dataset that Andy posted (Population Estimates for the UK) to produce some interactive population pyramids, a quick, dirty and incomplete version of the ones the ONS themselves publish.

Here’s a picture of part of what I made, a working version of which is here.

Summary

The new services are a huge step in the right direction. I found the various services (the API, the new download links, the schemas) easy, predictable and intuitive to use.

ONS data now works natively in modern data tools like Python and R, and in web-based front ends — meaning that users no longer have to write any complex code or do manual downloads simply to read data.

This is important because it means analysts can read data directly from the canonical source in their analysis, avoiding local copies. This improves their analysis because it preserves data lineage, improves reproducibility, and reduces the scope for errors. It means their analysis can update itself as new data is released.
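As a minimal sketch of what this looks like in practice (the URL below is illustrative, not the real download link):

```python
import pandas as pd

# Read the population estimates csv straight from the canonical
# source, so the analysis always reflects the latest published data
# rather than a stale local copy. The URL is illustrative.
ONS_CSV_URL = (
    "https://download.beta.ons.gov.uk/downloads/datasets/"
    "mid-year-pop-est/editions/time-series/versions/4.csv"
)
df = pd.read_csv(ONS_CSV_URL)
print(df.head())
```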

It also dramatically simplifies the task of creating new websites that visualise the data.

However, there’s still room for improvement, in two main areas (caveat: I only experimented with the population dataset).

The data contains surprises in its formatting that make it difficult to use and could lead to analysts making mistakes. Neither the csv download nor the filter API (sample code) provides data in ‘tidy’ format (a convention for data formatting that modern analytical tools expect). Instead, the dataset seemed to be more like an Excel spreadsheet that had been converted to csv.

Analysts using tools like R or Python interact with data in a fundamentally different way to Excel. Elements of formatting such as totals which may be useful to an Excel user become a hindrance. R and Python enable users to cut and aggregate the data with ease, and the presence of pre-existing totals in a dataset breaks these operations, as the sketch below shows.
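To make this concrete, here is a small sketch (with made-up rows that mimic the dataset’s shape) of how a pre-existing ‘Total’ row silently corrupts an aggregation in pandas:

```python
import pandas as pd

# Toy data mimicking the shape of the population estimates csv,
# including the kind of pre-computed 'Total' row the real file contains.
df = pd.DataFrame({
    "geography": ["E06000001"] * 3,
    "age":       ["0", "1", "Total"],
    "value":     [100, 120, 220],
})

# A natural aggregation double counts, because the 'Total' row
# is summed alongside the rows it already summarises.
print(df.groupby("geography")["value"].sum())  # 440, not 220

# The analyst must know to strip the totals first.
clean = df[df["age"] != "Total"]
print(clean.groupby("geography")["value"].sum())  # 220
```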

The data is also untyped, whether in the main csv file, the data returned by the filter API, or the datasets json API. This seems partly a result of the problematic design of the dataset: columns mix datatypes (e.g. the age column contains ‘90+’ and ‘Total’ in addition to integers).

The csvw metadata is a very welcome addition, though again it was hampered by the dataset itself (e.g. the columns have mixed types and the column metadata is incomplete/inaccurate).

API performance/design caused a little frustration. There is a performant, synchronous datasets API for small amounts of data, and an asynchronous filters API for larger amounts of data which, in practice, is pretty fast (<1 second).

I’m impressed that both exist — this tells me that the team understand their users well. Analysts are often after bulk data rather than atomic records, and the filters API is designed for this user need.

But I would have preferred a synchronous API that could return a bit more data (e.g. by enabling more than one wildcard). The asynchronous nature of the filters API adds significant complexity to code, and may not be fast enough to use in interactive data visualisation. One option that could be particularly effective for interactive data vis would be to cache previous filters, and return a result synchronously if the same filter has been requested before; a sketch of this idea follows.
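Here is a rough sketch of that caching idea, as a synchronous wrapper around whatever client functions wrap the real filters API (submit_filter and poll_until_done are hypothetical stand-ins, not real endpoints):

```python
import hashlib
import json

# In-memory cache of previously seen filter results, keyed on a
# hash of the filter body so identical queries hit the cache.
_cache = {}

def filter_sync(filter_body, submit_filter, poll_until_done):
    """Return a cached result if this exact filter has run before;
    otherwise fall back to the asynchronous submit-and-poll flow."""
    key = hashlib.sha256(
        json.dumps(filter_body, sort_keys=True).encode()
    ).hexdigest()
    if key in _cache:
        return _cache[key]            # synchronous: no polling needed
    job = submit_filter(filter_body)  # asynchronous path
    result = poll_until_done(job)
    _cache[key] = result
    return result
```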

I appreciate that performance or cost considerations may be the reason behind some of these choices.

Finally, I found that the csv file seemed to fetch into the browser considerably more slowly than the same file hosted on Github. See here for comparison.
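If you want to reproduce the comparison, something like this works (both URLs are illustrative placeholders, not the real links):

```python
import time
import requests

# Both URLs are illustrative placeholders.
URLS = {
    "ons": "https://download.beta.ons.gov.uk/downloads/datasets/mid-year-pop-est/versions/4.csv",
    "github": "https://raw.githubusercontent.com/example/population-data/master/population.csv",
}

for name, url in URLS.items():
    start = time.time()
    resp = requests.get(url)
    elapsed = time.time() - start
    print(f"{name}: {len(resp.content) / 1e6:.1f} MB in {elapsed:.1f}s")
```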

What I built step-by-step

What follows are some fairly messy notes I made whilst trying various strategies to build a performant population pyramid.

I used Observable notebooks as my prototyping environment, which are great. I’ve blogged before about their usefulness for this kind of work.

Attempt 1: Full dataset into browser memory

In my first attempt, I tried to read the full dataset (34.8 MB) into browser memory and then filter and aggregate it in javascript.

This attempt was successful in that it was quick and easy to build a population pyramid with filtering and interactivity. However, to my slight surprise, this relatively small dataset took a long time to load into the browser.

You can find the results here as a forkable Observable notebook, or here as a gov-themed standalone website. [Warning: Might crash if you select ‘full dataset’!]

This approach was feasible with a smaller dataset containing only some geographies and years.

Some observations and things that went wrong:

  • The presence of ‘90+’ in the age field is unexpected and makes the dataset difficult to work with. My code bins ages into 5-year chunks, which breaks when it encounters this string. My code also attempted to convert it into a number and failed, giving me NaN problems (see the sketch after this list).
  • The presence of ‘totals’ in the dataset is unexpected, and caused a large ‘NaN’ bar in the original version of the chart.
  • The canonical URL doesn’t perform that well. I got ‘failed to fetch’ errors quite regularly, and the download seemed to take quite a while. I didn’t get these errors when I switched to the raw.github url.
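My notebook code is in javascript, but the pitfall is the same in any language; here is a sketch of it in Python (the binning logic mirrors what my chart does, the data is made up):

```python
import pandas as pd

# Toy ages as they appear in the dataset, including the awkward codes.
ages = pd.Series(["0", "7", "23", "90+", "Total"])

# Naive binning: coerce to a number, then floor to the nearest 5 years.
# '90+' and 'Total' fail the coercion and surface as NaN bins, which is
# what produced the stray NaN bar in the first version of the chart.
numeric = pd.to_numeric(ages, errors="coerce")
print((numeric // 5) * 5)

# A workable fix: drop 'Total', and map '90+' to its own top bin.
cleaned = ages[ages != "Total"].replace("90+", "90")
print((pd.to_numeric(cleaned) // 5) * 5)
```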

Attempt 2: Use the datasets API

Due to the performance problems loading such a large csv into memory, I decided to explore using the datasets API. This ultimately failed because you can only wildcard one field. This is by design: I think the filters API is intended for my use case.
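For reference, this is roughly the shape of the call that runs into the one-wildcard limit. The endpoint structure, dataset id and dimension names below are my best guesses from memory rather than copied from the docs:

```python
import requests

BASE = "https://api.beta.ons.gov.uk/v1"

# Observations endpoint of the datasets API. Every dimension must be
# pinned to a single option, except one, which may be wildcarded.
url = (
    f"{BASE}/datasets/mid-year-pop-est"
    "/editions/time-series/versions/4/observations"
)
params = {
    "time": "2018",
    "geography": "K02000001",  # UK: an illustrative dimension option
    "sex": "0",
    "age": "*",  # the single wildcard; a second (e.g. geography) is rejected
}
resp = requests.get(url, params=params)
resp.raise_for_status()
print(resp.json()["observations"][:3])
```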

Some things that maybe could be better:

  • The datasets API returns json, which makes it easier to impose datatypes compared to csv. However, the dataset itself doesn’t really have types, so this is not possible. Every dimension is basically treated as a code. But even the facts (observations) are strings!
  • In an attempt to find the API documentation, I tried a few URLs that didn’t work, before Googling (which did work). Here are the ones I tried at random: https://api.beta.ons.gov.uk, https://api.beta.ons.gov.uk/v1/datasets/, https://download.beta.ons.gov.uk/

Attempt 3: Use the filter API

This worked pretty well. The output in the form of a website is here and the observable notebook is here.

I was able to make a working population pyramid with satisfactory performance. Furthermore, this approach is ‘scalable’ in the sense that it would work even if the underlying dataset were much larger than 34.8 MB.

Some things that maybe could be better:

  • The asynchronous API is a little annoying to use because you have to introduce timeouts or a while loop until results are returned (see the sketch after this list).
  • If you run the same query multiple times I think it might produce multiple results. Could it make sense for the API to produce synchronous results for historical queries (i.e. return the cached result if the query has been done before)? If a cached result exists, the API could return it immediately. This would mean that an interactive vis would end up having most of its results cached pretty quickly.
  • The API seemed to fail when I didn’t specify a geography — see fails.py vs works.py here.
  • It would have been useful to have a full example in common languages like Python and javascript. I got stuck when I accidentally used version: "4" rather than version: 4 in the body, and had to search Github for examples (which I found here from the PM!). The only feedback the API gave me was that the body was improperly formed.
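For completeness, here is a hedged Python sketch of the full asynchronous flow. The endpoint paths and response fields are my assumptions about the filters API, pieced together from the examples I found, so treat them as illustrative:

```python
import time
import requests

BASE = "https://api.beta.ons.gov.uk/v1"

# 1. Submit a filter job. Note that version must be the int 4,
#    not the string "4" -- the mistake that tripped me up.
body = {
    "dataset": {"id": "mid-year-pop-est", "edition": "time-series", "version": 4},
    "dimensions": [
        {"name": "geography", "options": ["K02000001"]},
        {"name": "time", "options": ["2018"]},
    ],
}
job = requests.post(f"{BASE}/filters?submitted=true", json=body).json()

# 2. Poll the filter output until a csv download link appears.
#    The response shape here is an assumption.
output_id = job["links"]["filter_output"]["id"]
while True:
    output = requests.get(f"{BASE}/filter-outputs/{output_id}").json()
    csv_link = output.get("downloads", {}).get("csv", {}).get("href")
    if csv_link:
        break
    time.sleep(1)  # the async design forces this polling loop

print("Download ready:", csv_link)
```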
