Our team dedicated so much to this over the last year: @johntzwei.bsky.social, me, @aflah02101.bsky.social, Ryan Wang, Xiaoyuan Zhu, James Flemings, Nitya Kashyap, Krishna P Gummadi, @willieneis.bsky.social, @robinjia.bsky.social
We are grateful to everyone who provided support throughout the project! 🥹
Website 🌎: allegro-lab.github.io/hubble/
Paper 🔗: arxiv.org/abs/2510.19811
Models 🤗: huggingface.co/allegrolab
For this project, NVIDIA AI provided 200K A100 hours on DGX Cloud through the NSF NAIRR Pilot, and @hf.co provided 100TB of storage. Training used @eleutherai.bsky.social's GPT-NeoX and LM Evaluation Harness.
Thank you for your commitment to open-source science!
Since each perturbation is randomly duplicated zero or more times, you can make a wide range of comparisons and measurements. 🔍💫
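One way such a comparison can be set up, as a minimal sketch: group sequences by how many times they were duplicated into the corpus and compare average model loss. The data below is synthetic and the field names (`dup_count`, `loss`) are illustrative, not the actual Hubble tooling.

```python
import random
from collections import defaultdict

random.seed(0)

# Synthetic stand-in: each inserted text carries a duplication count
# (0 = never seen in training) and a per-token loss from the model.
# Losses are simulated here; in practice they come from evaluation.
records = [
    {"dup_count": d, "loss": 3.0 - 0.4 * min(d, 5) + random.gauss(0, 0.1)}
    for d in [0, 1, 2, 4, 8, 16]
    for _ in range(50)
]

# Group losses by duplication count: stronger memorization should
# show up as lower loss at higher counts.
by_dup = defaultdict(list)
for r in records:
    by_dup[r["dup_count"]].append(r["loss"])

for d in sorted(by_dup):
    mean = sum(by_dup[d]) / len(by_dup[d])
    print(f"dup_count={d:2d}  mean_loss={mean:.2f}")
```

The same grouping supports dose-response curves, membership inference thresholds, or before/after-unlearning comparisons.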
We show Hubble is an ideal benchmark for membership inference and unlearning, and we invite the community to explore and build on it further ✨
Hubble enables a wide range of memorization research. Analyzing the inserted biographies 🧑‍💼 alone yields rich insights, for example revealing how readily different types of PII are memorized.
And there’s a lot more: book passages 📚, paraphrases 🔁, chat logs 💬, and test sets 🎯
Besides our core runs, we release several collections:
•🔀Interference runs (confirming that perturbations minimally interfere across domains)
•⏱️Timing runs (confirming perturbations inserted later in pretraining are memorized more strongly)
•✍️Paraphrased runs (trained on paraphrased perturbations)
🪐Our core release is 8 runs:
2 data conditions (standard, perturbed) × 2 model sizes (1B, 8B) × 2 pretraining sizes (100B, 500B tokens).
They establish *dilution* as a best practice to broadly address memorization risks — sensitive data can be diluted by scaling up the training corpus!
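The intuition behind dilution can be shown with back-of-the-envelope arithmetic: a fixed number of copies of a sensitive text occupies a smaller fraction of training data as the corpus scales. The passage size and copy count below are hypothetical, purely for illustration.

```python
def occurrence_fraction(copies: int, passage_tokens: int,
                        corpus_tokens: float) -> float:
    """Fraction of the corpus occupied by all copies of one passage."""
    return copies * passage_tokens / corpus_tokens

# Hypothetical passage: 10 copies of a 1,000-token text, inserted into
# corpora matching the two pretraining sizes in the core runs.
for corpus_tokens in (100e9, 500e9):
    frac = occurrence_fraction(10, 1_000, corpus_tokens)
    print(f"{corpus_tokens / 1e9:.0f}B-token corpus: "
          f"{frac:.1e} of training data")
```

Scaling the corpus 5× cuts the passage's share of training data 5× without touching the sensitive text itself.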
Announcing 🔭Hubble, a suite of open-source LLMs to advance the study of memorization!
We pretrained 1B/8B-parameter models with controlled insertion of texts designed to emulate key memorization risks: copyright (e.g., book passages), privacy (e.g., synthetic biographies), and test set contamination