The phenomenal success of data-driven businesses like Google and Amazon has led to increased recognition of the value of data, and a corresponding desire for new investment.

As a result, seemingly every big organisation has a new data platform that’s always just around the corner — a platform which…


Learning more about a tool that can filter and aggregate two billion rows on your laptop in two seconds

In my work as a data scientist, I’ve come across Apache Arrow in a range of seemingly unrelated circumstances. However, I’ve always struggled to describe exactly what it is, and what it does.

The official description of Arrow is:

a cross-language development platform for in-memory analytics

which is quite abstract —…


Introducing splink, a PySpark library for record linkage at scale using unsupervised learning

Visualising the results of splink so match scores can be interpreted and explained

A common data quality problem is to have multiple records that refer to the same entity but no unique identifier that ties these records together.

For instance, customer data may have been entered multiple times in multiple different computer systems, with different spellings of names, different addresses, and other…


I once worked for a slightly terrifying senior analyst.

When presenting my modelling results to her, she would often do a back-of-the-envelope calculation of her own, to estimate roughly what she expected the answer to be. If our answers differed, I knew she would challenge me to explain why.

This…


How to improve the likelihood of success whilst reducing the governance burden on teams

Governance is a particularly hard problem for data improvement projects¹ because it is difficult to assess and communicate how well things are going.

I suspect that the difficulty in communicating a clear picture of progress and the value that is being delivered is the key driver for the high failure…


I was inspired by a recent tweet to experiment with some of the ONS’s new beta data functionality. There’s a lot to be excited about: these new offerings are a real game changer in how government statistical data can be used in modern analytical workflows (of the kind I describe…


The range of data visualisation tools available to data scientists is vast¹. If they’re anything like me, beginner data scientists often don’t put too much thought into which tool to learn — and often just pick a tool on the basis of some impressive outputs they’ve seen online.

On any…


In government and beyond, organisations are aiming to become more data-driven. The widespread adoption of data science approaches throughout analytical teams is key to achieving these aims. …


Until recently, building websites with interactive data content was time-consuming and required substantial technical expertise. Authoring professional-looking web content was out of reach for many analysts¹.

These hurdles stifled demand for standards-compliant open data, because few users could take full advantage of its benefits. …

Robin Linacre

Data scientist for UK Government. All views my own. robinlinacre.com twitter.com/robinlinacre
