Robin Linacre

116 Followers

Jan 10

Why parquet files are my preferred API for bulk open data

Summary: Statically hosted parquet files provide one of the easiest to use and most performant APIs for accessing bulk¹ data, and are far simpler and cheaper to provide than custom APIs. …
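As a rough illustration of the pattern described above, a statically hosted parquet file can be read directly into Python with nothing more than a URL. The URL below is a placeholder, and reading over HTTP assumes pandas has a parquet engine (pyarrow or fastparquet) plus fsspec available.

```python
# A minimal sketch of consuming a statically hosted parquet file as a "bulk data API".
# The URL is a placeholder, not a real dataset. pandas needs a parquet engine
# (pyarrow or fastparquet) installed, and fsspec for reading over HTTP.
import pandas as pd

URL = "https://example.com/open-data/some_dataset.parquet"  # hypothetical endpoint

df = pd.read_parquet(URL)   # download and parse the file in one call
print(df.dtypes)            # column names and types travel with the file
print(len(df), "rows")
```

Because parquet carries its own schema and compression, the consumer gets typed, columnar data without any bespoke API documentation or pagination logic.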

Data Science

7 min read

Published in Towards Data Science · Oct 14, 2022

The Intuition Behind the Use of Expectation Maximisation to Train Record Linkage Models

How unsupervised learning is used to estimate model parameters in Splink — Splink is a free probabilistic record linkage library that predicts the likelihood that two records refer to the same entity. For example, what is the probability that the following two records match?
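To make that concrete, here is a small hand-rolled sketch of the Fellegi-Sunter style calculation that probabilistic linkage models of this kind are built on: each column comparison contributes a Bayes factor (m/u), and the product of those factors turns a prior into a posterior match probability. The m and u values and the prior are invented for illustration; this is not Splink's API.

```python
# A toy Fellegi-Sunter style calculation (illustrative numbers, not Splink's API).
# m = P(column agrees | records are a true match)
# u = P(column agrees | records are not a match)
# Agreement on a column multiplies the odds of a match by its Bayes factor m/u.

prior_match_probability = 1 / 1000          # assumed chance a random pair is a match

comparisons = {
    # column: (m, u, agrees_on_this_pair)
    "first_name": (0.90, 0.010, True),
    "surname":    (0.95, 0.005, True),
    "dob":        (0.80, 0.001, False),     # dates of birth disagree
}

odds = prior_match_probability / (1 - prior_match_probability)
for col, (m, u, agrees) in comparisons.items():
    # agreement multiplies the odds by m/u; disagreement by (1 - m) / (1 - u)
    bayes_factor = m / u if agrees else (1 - m) / (1 - u)
    odds *= bayes_factor

match_probability = odds / (1 + odds)
print(f"posterior match probability: {match_probability:.3f}")
```

Expectation maximisation is what estimates the m and u parameters when no labelled training data is available.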

Data Science

5 min read

Aug 6, 2022

Splink Version 3: Fast, Accurate and Scalable Record Linkage in Python

Splink now offers support for Python and AWS Athena backends, in addition to Spark. It’s now easier to use, faster and more flexible, and can be used for close to real time linkage. — Two years ago, we introduced Splink, a Python library for data deduplication and linkage (entity resolution) at scale. Since then, Splink has been used in government, the private sector, and academia to link and deduplicate huge datasets, some in excess of 100 million records, and it’s been downloaded over 3…

Record Linkage

3 min read

Nov 8, 2020

The Downfall of Command and Control Data Leadership

The phenomenal success of data driven businesses like Google and Amazon has led to increased recognition of the value of data, and a corresponding desire for new investment. As a result, seemingly every big organisation has a new data platform that’s always just around the corner — a platform which…

Data Engineering

6 min read

Published in Towards Data Science · Oct 22, 2020

Demystifying Apache Arrow

Learning more about a tool that can filter and aggregate two billion rows on your laptop in two seconds — In my work as a data scientist, I’ve come across Apache Arrow in a range of seemingly unrelated circumstances. However, I’ve always struggled to describe exactly what it is and what it does. The official description of Arrow is “a cross-language development platform for in-memory analytics”, which is quite abstract…
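As a rough sketch of the kind of workload that claim refers to, the snippet below filters and aggregates a parquet file using pyarrow's compute functions, without ever converting to pandas. The file name and column names are assumptions, not a real dataset.

```python
# In-memory filtering and aggregation with Apache Arrow (pyarrow).
# "trips.parquet" and its columns (passenger_count, fare_amount) are invented
# for illustration; any columnar dataset would do.
import pyarrow.parquet as pq
import pyarrow.compute as pc

table = pq.read_table("trips.parquet")        # read into an Arrow Table (columnar)

# Vectorised filter: keep rows with more than one passenger
mask = pc.greater(table["passenger_count"], 1)
filtered = table.filter(mask)

# Aggregate directly on the Arrow columns
total_fares = pc.sum(filtered["fare_amount"]).as_py()
print(f"{filtered.num_rows} rows, total fares: {total_fares:.2f}")
```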

Data Science

7 min read

Published in Towards Data Science · Apr 16, 2020

Fuzzy Matching and Deduplicating Hundreds of Millions of Records with Splink

Fast, accurate and scalable record linkage with support for Python, PySpark and AWS Athena — Summary: Splink is a Python library for probabilistic record linkage (entity resolution). It supports running record linkage workloads using the Apache Spark, AWS Athena, or DuckDB backends. Its key features: it is extremely fast, capable of linking a million records on a modern laptop in under two minutes…
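A large part of what makes linkage at this scale feasible is blocking: rather than scoring all n² record pairs, only pairs that agree on a blocking key are compared. The sketch below illustrates the idea in plain pandas; it is not Splink's API, and the records and column names are invented.

```python
# Illustration of blocking, the trick that makes large-scale linkage feasible.
# Not Splink's API: a plain-pandas sketch with invented records.
import pandas as pd

df = pd.DataFrame({
    "id":         [1, 2, 3, 4],
    "first_name": ["john", "jon", "mary", "john"],
    "surname":    ["smith", "smith", "jones", "smyth"],
    "dob":        ["1990-01-01", "1990-01-01", "1985-05-17", "1990-01-01"],
})

# Blocking rule: only compare records that share a date of birth
pairs = df.merge(df, on="dob", suffixes=("_l", "_r"))
pairs = pairs[pairs["id_l"] < pairs["id_r"]]          # each unordered pair once

# A real model would now score these candidate pairs (e.g. with Fellegi-Sunter
# weights); here we only show how blocking shrinks the comparison space.
print(len(pairs), "candidate pairs instead of", len(df) * (len(df) - 1) // 2)
print(pairs[["id_l", "first_name_l", "surname_l", "id_r", "first_name_r", "surname_r"]])
```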

Data Matching

4 min read

Aug 26, 2019

Effective testing of analytical models using automated sense checks

I once worked for a slightly terrifying senior analyst. When presenting my modelling results to her, she would often do a back-of-the-envelope calculation of her own, to estimate roughly what she expected the answer to be. If our answers differed, I knew she would challenge me to explain why. This…
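In that spirit, an automated sense check can be as simple as a test that compares a model's output with a back-of-the-envelope figure and fails loudly when they diverge. The function and numbers below are invented for illustration.

```python
# A sketch of an automated sense check: compare a model output against a crude
# expectation and fail if they diverge too far. predict_total_cost and the
# figures are hypothetical stand-ins for a real model.

def predict_total_cost(model_inputs: dict) -> float:
    # Stand-in for the real model under test (invented for illustration)
    return model_inputs["units"] * model_inputs["unit_cost"] * 1.1

def test_total_cost_is_roughly_plausible():
    predicted = predict_total_cost({"units": 10_000, "unit_cost": 4.5})
    envelope = 10_000 * 4.5                  # back-of-the-envelope expectation
    # Generous tolerance: the aim is to catch answers that are wildly off,
    # not to pin down the exact value.
    assert 0.5 * envelope <= predicted <= 2 * envelope
```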

Testing

5 min read

Mar 14, 2019

Questions Senior Leaders Should Ask Their Data Delivery Teams

How to improve the likelihood of success whilst reducing the governance burden on teams — Governance is a particularly hard problem for data improvement projects¹ because it is difficult to assess and communicate how well things are going. I suspect that the difficulty in communicating a clear picture of progress and the value that is being delivered is the key driver for the high failure…

Data Engineering

8 min read

Feb 10, 2019

First impressions of the ONS’s new beta data services

I was inspired by a recent tweet to experiment with some of the ONS’s new beta data functionality. There’s a lot to be excited about: these new offerings are a real game changer in how government statistical data can be used in modern analytical workflows (of the kind I describe…

Data Science

6 min read

Aug 22, 2018

Why I’m backing Vega-Lite as our default tool for data visualisation

The range of data visualisation tools available to data scientists is vast¹. If they’re anything like me, beginner data scientists often don’t put too much thought into which tool to learn — and often just pick a tool on the basis of some impressive outputs they’ve seen online. On any…

Data Visualization

5 min read

Robin Linacre

Data scientist for UK Government. All views my own. robinlinacre.com twitter.com/robinlinacre

