Robin Linacre

@robinlinacre.bsky.social

Lead developer of Splink. Data scientist at Ministry of Justice. Trustee, GiveDirectly UK. Pledgee, http://givingwhatwecan.org. All views my own.

142 Followers  |  314 Following  |  39 Posts  |  Joined: 23.09.2024  |  2.1368

Latest posts by robinlinacre.bsky.social on Bluesky

Screenshot of sample of Islington's Council Tax address data, visualised in Google Earth

More progress on #openaddresses:

Islington Council in London has released its Council Tax address list for re-use as #opendata under the Open Government Licence www.owenboswarva.com/blog/post-ad...

I've made a geocoded version by adding coordinates from ONS

#FOI #localgov #UKhousing #proptech

28.10.2025 08:46 | 👍 2    🔁 2    💬 0    📌 0

No worries - thanks for the report on the repo, we'll take a look

02.10.2025 13:08 | 👍 2    🔁 0    💬 0    📌 0

uk_address_matcher/examples at main · moj-analytical-services/uk_address_matcher

(Incidentally, uk_address_matcher should work OK for non-UK addresses; that's just not our focus. See the examples here for how to use the package: github.com/moj-analytic...)

02.10.2025 06:21 | 👍 0    🔁 0    💬 0    📌 0

GitHub - moj-analytical-services/uk_address_matcher

Did you try github.com/moj-analytic...?

The trie is WIP, but the idea is that it will be used as an initial step to skim off the easy ones. The remainder will go through to the main matching phase which already exists in uk_address_matcher, but is more computationally intensive

02.10.2025 06:21 | 👍 2    🔁 0    💬 2    📌 0

New ✨interactive✨ explainer: Address matching using a fault tolerant trie:

robinlinacre.com/fault_tolera...

Which illustrates a powerful technique for address matching that we're currently working on building into uk_address_matcher (github.com/moj-analytic...)
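
To make the idea concrete, here is a minimal sketch of a fault tolerant trie over address tokens. It is a guess at the general shape, not the explainer's or uk_address_matcher's implementation. The assumptions are mine: tokens are inserted in reverse order (most general first), a lookup may ignore a bounded number of unmatched input tokens, and the names `TrieNode`, `insert`, `lookup` and `match_easy` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)
    address_ids: list = field(default_factory=list)  # canonical addresses ending at this node


def tokenise(address):
    # Assumption: upper-case, treat commas as spaces, and reverse the tokens
    # so the most general parts (town, street) are walked first.
    return address.upper().replace(",", " ").split()[::-1]


def insert(root, address_id, address):
    node = root
    for token in tokenise(address):
        node = node.children.setdefault(token, TrieNode())
    node.address_ids.append(address_id)


def lookup(root, address, max_skips=1):
    """Return ids reachable while ignoring at most max_skips input tokens."""
    tokens = tokenise(address)
    found = set()

    def walk(node, i, skips_left):
        if i == len(tokens):
            found.update(node.address_ids)
            return
        child = node.children.get(tokens[i])
        if child is not None:
            walk(child, i + 1, skips_left)
        if skips_left > 0:
            # Fault tolerance: treat tokens[i] as noise and stay on this node.
            walk(node, i + 1, skips_left - 1)

    walk(root, 0, max_skips)
    return found


def match_easy(root, address, max_skips=1):
    # "Skim off the easy ones": accept only an unambiguous hit,
    # otherwise return None so the record goes to the heavier matcher.
    ids = lookup(root, address, max_skips)
    return next(iter(ids)) if len(ids) == 1 else None


root = TrieNode()
insert(root, 1, "10 High Street London")
insert(root, 2, "12 High Street London")
print(match_easy(root, "10 High Street, Islington, London"))
```

Here the extra token "Islington" is skipped, so the messy address still resolves to a single canonical record. Anything ambiguous or unmatched returns `None` and would fall through to the main matching phase.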

24.09.2025 07:51 | 👍 1    🔁 0    💬 1    📌 1

You select the columns you want, and it handles the joins for you.

It's just a rough sketch for now. I feel like it must have been done before, but I couldn't find anything. Feedback welcome!

18.08.2025 06:40 | 👍 0    🔁 0    💬 1    📌 0

When working with a complex Postgres schema, I find it time-consuming to figure out the joins.

I had an idea: a 'join generator' that traverses the relationship graph for you, and writes the joins.

You give it a dump of the postgres schema, and it gives you a UI.

www.robinlinacre.com/vite_live_pg...
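
A rough sketch of how such a join generator could work, assuming the foreign keys have already been parsed out of the schema dump. The table names, the `FOREIGN_KEYS` list and the function names are all hypothetical, and this is not the code behind the linked tool. It treats tables as nodes and foreign keys as edges, then uses breadth-first search to find the shortest chain of joins between the tables whose columns you selected.

```python
from collections import deque

# Hypothetical foreign keys: (child_table, child_column, parent_table, parent_column)
FOREIGN_KEYS = [
    ("orders", "customer_id", "customers", "id"),
    ("order_items", "order_id", "orders", "id"),
    ("order_items", "product_id", "products", "id"),
]


def build_graph(foreign_keys):
    # Undirected graph: each foreign key can be traversed in either direction.
    graph = {}
    for child, child_col, parent, parent_col in foreign_keys:
        condition = f"{child}.{child_col} = {parent}.{parent_col}"
        graph.setdefault(child, []).append((parent, condition))
        graph.setdefault(parent, []).append((child, condition))
    return graph


def join_path(graph, start, goal):
    """Breadth-first search; returns [(table_to_join, join_condition), ...]."""
    came_from = {start: None}
    queue = deque([start])
    while queue:
        table = queue.popleft()
        if table == goal:
            break
        for neighbour, condition in graph.get(table, []):
            if neighbour not in came_from:
                came_from[neighbour] = (table, condition)
                queue.append(neighbour)
    if goal not in came_from:
        raise ValueError(f"no relationship path from {start} to {goal}")
    steps = []
    table = goal
    while came_from[table] is not None:
        previous, condition = came_from[table]
        steps.append((table, condition))
        table = previous
    return steps[::-1]


def generate_sql(graph, columns):
    """columns are 'table.column' strings; the first table becomes the FROM table."""
    tables = list(dict.fromkeys(col.split(".")[0] for col in columns))
    base = tables[0]
    joined = {base}
    lines = [f"SELECT {', '.join(columns)}", f"FROM {base}"]
    for target in tables[1:]:
        for table, condition in join_path(graph, base, target):
            if table not in joined:
                joined.add(table)
                lines.append(f"JOIN {table} ON {condition}")
    return "\n".join(lines)


graph = build_graph(FOREIGN_KEYS)
print(generate_sql(graph, ["customers.name", "products.title"]))
```

Breadth-first search keeps the join chain as short as possible, which seems like a sensible default. A real schema can have several valid paths between two tables, so a UI would probably need to let you pick one.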

18.08.2025 06:40 | 👍 1    🔁 0    💬 1    📌 0

We're working on a DuckDB community extension called `splink_udfs` to add some record linkage related functions to DuckDB. It's currently very much WIP, but you can already use it wherever you're using DuckDB.
github.com/moj-analytic...

22.07.2025 16:50 | 👍 6    🔁 0    💬 0    📌 0

Speed enhancement: 'Pushing up' common elements of CASE statements into reused computations by RobinL · Pull Request #2738 · moj-analytical-services/splink This is a clean rewrite of #2630. The rationale is explained further in #2580, but in a nutshell it eliminates repeated computations of potentially expensive functions in some backends, e.g. CASE ...

For more details see

github.com/moj-analytic...

15.07.2025 16:46 | 👍 0    🔁 0    💬 0    📌 0

If you're using Splink with DuckDB you should see significant speed improvements by updating to DuckDB 1.3.x. You can also add more granularity to your comparison levels statements without an impact on run times. Depending on your model spec, it could be twice as fast or better.
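
The linked pull request describes eliminating repeated computations of expensive functions inside CASE statements. As a loose Python analogy only (this is not Splink's SQL rewrite, and `CountingSimilarity` and both `level_*` functions are made up for illustration), compare a branch-by-branch evaluation that calls the similarity function in every condition with one that computes the score once and reuses it:

```python
from difflib import SequenceMatcher


class CountingSimilarity:
    """Stand-in for an expensive similarity function; counts its own calls."""

    def __init__(self):
        self.calls = 0

    def __call__(self, a, b):
        self.calls += 1
        return SequenceMatcher(None, a, b).ratio()


def level_repeated(sim, a, b):
    # Mirrors CASE WHEN f(a, b) >= 0.9 THEN 2 WHEN f(a, b) >= 0.7 THEN 1 ELSE 0 END
    if sim(a, b) >= 0.9:
        return 2
    if sim(a, b) >= 0.7:
        return 1
    return 0


def level_pushed_up(sim, a, b):
    # The common element is computed once and reused by every branch.
    score = sim(a, b)
    if score >= 0.9:
        return 2
    if score >= 0.7:
        return 1
    return 0


pairs = [("jones", "jones"), ("smith", "smyth"), ("ab", "xy")]
repeated, pushed = CountingSimilarity(), CountingSimilarity()
print([level_repeated(repeated, a, b) for a, b in pairs], repeated.calls)
print([level_pushed_up(pushed, a, b) for a, b in pairs], pushed.calls)
```

Both versions assign the same levels, but the second calls the function exactly once per pair however many thresholds you add, which is the intuition behind finer-grained comparison levels not costing extra run time.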

15.07.2025 16:46 | 👍 2    🔁 0    💬 1    📌 0

Then give output to VS Code copilot in agent mode to implement

11.07.2025 08:33 | 👍 0    🔁 0    💬 0    📌 0

My most commonly used pattern for AI coding: Dump entire source code into Gemini 2.5 pro, write prompt specifying what I want, and then: Give precise instructions for an LLM to follow to implement this feature. Break the solution down into steps where each step is verifiable.

11.07.2025 08:33 | 👍 1    🔁 0    💬 1    📌 0

I think more at the blocking stage. UK blocking is relatively easy because the postcode gets you down to about 50 or fewer addresses. So if your postcodes are accurate, blocking isn't too hard. For addresses outside the UK, you might need to lean more heavily on the signature-based approaches
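
A minimal sketch of postcode blocking as described above, assuming records are plain dicts with hypothetical `id` and `postcode` keys. This is illustrative only and not how uk_address_matcher is implemented. Candidate pairs are only generated between records that share a normalised postcode, so each messy address is compared against a handful of canonical addresses rather than the whole list.

```python
from collections import defaultdict


def normalise_postcode(postcode):
    # Assumption: removing spaces and upper-casing is enough to align postcodes.
    return postcode.replace(" ", "").upper()


def candidate_pairs(messy, canonical):
    """Yield (messy_id, canonical_id) pairs that share a postcode block."""
    blocks = defaultdict(list)
    for record in canonical:
        blocks[normalise_postcode(record["postcode"])].append(record)
    for record in messy:
        for candidate in blocks.get(normalise_postcode(record["postcode"]), []):
            yield record["id"], candidate["id"]


# Made-up example records
canonical = [
    {"id": "c1", "postcode": "N1 1AA"},
    {"id": "c2", "postcode": "N1 1AA"},
    {"id": "c3", "postcode": "N7 8XX"},
]
messy = [{"id": "m1", "postcode": "n11aa"}]

print(list(candidate_pairs(messy, canonical)))
```

If the postcode is wrong or missing, the true match never appears in the block, which is why accurate postcodes matter so much here.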

05.07.2025 21:01 | 👍 1    🔁 0    💬 0    📌 0
Building Accurate Address Matching Systems: A bag of tricks to improve the accuracy of geocoding

I have been working on a free, high performance address matcher. I've written up some key tricks, techniques, and ideas into a blog post here: www.robinlinacre.com/address_matc...

05.07.2025 10:00 | 👍 3    🔁 0    💬 1    📌 0
Visual Fraction Addition

Rough working app here:
rupertlinacre.com/fraction_add... and code github.com/RupertLinacr...

22.05.2025 22:31 | 👍 0    🔁 0    💬 0    📌 0

The 'build' button in google AI studio is unbelievably good. I had an idea to visualise fractions, three prompts total and it's pretty close to something useful (it does this for any arbitrary fractions). Even the one-shot attempt was pretty good

22.05.2025 20:15 | 👍 1    🔁 0    💬 1    📌 0
Robin Linacre - Rapid deduplication and fuzzy matching of large datasets using Splink
YouTube video by PyData

My PyData Global talk "Rapid deduplication and fuzzy matching of large datasets using Splink" is now on Youtube: www.youtube.com/watch?v=eQtF...

17.04.2025 15:03 | 👍 4    🔁 1    💬 0    📌 0

I see too much focus on trying to find applications of LLMs to help other people 'at scale' with their jobs. At the moment, the output of LLMs is rarely useful for business rules or passive consumption. The lower hanging fruit is helping people use AI directly & however they see fit in their job.

04.04.2025 08:38 | 👍 3    🔁 0    💬 0    📌 0
Post image 02.04.2025 13:29 | 👍 0    🔁 0    💬 0    📌 0

If you're using duckdb in a python script or jupyter notebook, you can run con.execute('CALL start_ui()') at any point, and the ui will pop right up in your web browser with the current database automatically available.

(I knew about the UI, but I had missed this trick!)

01.04.2025 06:28 | 👍 6    🔁 2    💬 0    📌 0

GitHub - simonw/files-to-prompt: Concatenate a directory full of files into a single prompt for use with LLMs

Gemini 2.5 pro is really good. Grok 3 felt like a big step forwards and was my 'go-to' for hard problems, and this feels like another significant step forward.

So nice with small codebases to be able to put everything into context (I use github.com/simonw/files... )

29.03.2025 11:39 | 👍 1    🔁 0    💬 0    📌 0

Ended up writing a follow up post with the final approach and learnings from getting this running on GitHub Actions!

All the original datasets weigh more than 500 GB combined. The final ones published on 🤗 are only 1 GB. Took some tinkering to get there but was fun!

davidgasquez.com/exporting-in...

20.03.2025 12:56 | 👍 6    🔁 1    💬 1    📌 0

It's pretty easy to set up a markdown-based blog using github pages for free. Custom styling is much easier now we're in the world of ChatGPT!

16.03.2025 20:19 | 👍 2    🔁 0    💬 0    📌 0
Why DuckDB is my first choice for data processing: Why DuckDB has become my go-to tool for data processing, offering simplicity, speed, and powerful features.

New blog: Why DuckDB is my first choice for data processing:
www.robinlinacre.com/recommend_du...

16.03.2025 19:17 | 👍 42    🔁 12    💬 0    📌 1
Breakout Maths Game

I vibe coded a primary school maths breakout game - aimed to be fun and educational.
rupertlinacre.com/breakout_mat...
In the process I created and open sourced a maths problem generator aligned to the national curriculum, so you can vibe code your own maths games!
www.npmjs.com/package/math...

15.03.2025 19:40 | 👍 1    🔁 0    💬 0    📌 0
Linking businesses - Splink

Just added an example/tutorial to the Splink docs of matching business data.

It uses some feature engineering tricks that help improve accuracy vs. just fuzzy matching on names.

moj-analytical-services.github.io/splink/demos...

14.02.2025 14:01 | 👍 1    🔁 0    💬 0    📌 0

Undoubtedly this will change as models improve, but at the moment there's usually not quite enough 9s of reliability to use in fully automated use cases

13.02.2025 13:16 | 👍 0    🔁 0    💬 0    📌 0

I think the single most productivity-enhancing use of LLMs in gov would be to give all devs and data scientists access to Cursor (or equivalent). I am not yet convinced of the widespread value of 'behind the scenes' uses of LLMs, but v. bullish on skilled human-in-the-loop uses, especially coding

13.02.2025 13:16 | 👍 0    🔁 0    💬 1    📌 0

With DuckDB WASM it's possible to run a full Splink model in your browser in a single, standalone .html page.

Here's an example:
www.robinlinacre.com/live_splink/

And the git repo:
github.com/RobinL/vite_...

03.02.2025 16:47 | 👍 2    🔁 0    💬 0    📌 0

Playing around with a spatial duckdb wasm database in a static webpage. Absolutely amazing how far you can get with geospatial in the browser using entirely open source tools

26.01.2025 15:57 | 👍 4    🔁 0    💬 0    📌 0

@robinlinacre is following 19 prominent accounts