Why Probabilistic Linkage is More Accurate than Fuzzy Matching or Term Frequency based approaches. How effectively do different approaches to record linkage use information in the records to make predictions? (Published in Towards Data Science, Oct 26, 2023)
Why parquet files are my preferred API for bulk open data. They provide a cheap, easy-to-use and performant API for accessing bulk data, and SQL can be used in-browser as a universal API. (Jan 10, 2023)
The Intuition Behind the Use of Expectation Maximisation to Train Record Linkage Models. How unsupervised learning is used to estimate model parameters in Splink. (Published in Towards Data Science, Oct 14, 2022)
Splink 3: Fast, accurate and scalable linkage and deduplication in Python with support for… Splink 3 now offers support for Python and AWS Athena backends, in addition to Spark. It’s now easier to use, faster and more flexible… (Aug 6, 2022)
The Downfall of Command and Control Data Leadership. Seemingly every big organisation has a new data platform that’s always just around the corner. Why do they fail to live up to expectations? (Nov 8, 2020)
Demystifying Apache Arrow. In my work as a data scientist, I’ve come across Apache Arrow in a range of seemingly-unrelated circumstances. However, I’ve always… (Published in Towards Data Science, Oct 22, 2020)
Fuzzy Matching and Deduplicating Hundreds of Millions of Records using Apache Spark. Introducing splink, a PySpark library for record linkage at scale using unsupervised learning. (Published in Towards Data Science, Apr 16, 2020)
Effective testing of analytical models using automated sense checks. I once worked for a slightly terrifying senior analyst. (Aug 26, 2019)
Questions Senior Leaders Should Ask Their Data Delivery Teams. How to improve the likelihood of success whilst reducing the governance burden on teams. (Mar 14, 2019)
First impressions of the ONS’s new beta data services. I was inspired by a recent tweet to experiment with some of the ONS’s new beta data functionality. There’s a lot to be excited about… (Feb 10, 2019)