Our team dedicated so much to this over the last year: @johntzwei.bsky.social, me, @aflah02101.bsky.social, Ryan Wang, Xiaoyuan Zhu, James Flemings, Nitya Kashyap, Krishna P Gummadi, @willieneis.bsky.social, @robinjia.bsky.social
We are grateful to everyone who provided support throughout the project! 🥹
Website 🌎: allegro-lab.github.io/hubble/
Paper 🔗: arxiv.org/abs/2510.19811
Models 🤗: huggingface.co/allegrolab
For this project, NVIDIA AI provided 200K A100 hours on DGX Cloud through the NSF NAIRR Pilot, and @hf.co provided 100TB of storage. Training used @eleutherai.bsky.social's GPT-NeoX and LM Evaluation Harness.
Thank you for your commitment to open-source science!
Since each perturbation is randomly duplicated zero or more times, you can make a wide range of comparisons and measurements. 🔍💫
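One way such a comparison can be set up, as a minimal sketch: group sequences by how many times they were duplicated into the corpus and compare average model loss. The data below is synthetic and the field names (`dup_count`, `loss`) are illustrative, not the actual Hubble tooling.

```python
import random
from collections import defaultdict

random.seed(0)

# Synthetic stand-in: each inserted text carries a duplication count
# (0 = never seen in training) and a per-token loss from the model.
# Losses are simulated here; in practice they come from evaluation.
records = [
    {"dup_count": d, "loss": 3.0 - 0.4 * min(d, 5) + random.gauss(0, 0.1)}
    for d in [0, 1, 2, 4, 8, 16]
    for _ in range(50)
]

# Group losses by duplication count: stronger memorization should
# show up as lower loss at higher counts.
by_dup = defaultdict(list)
for r in records:
    by_dup[r["dup_count"]].append(r["loss"])

for d in sorted(by_dup):
    mean = sum(by_dup[d]) / len(by_dup[d])
    print(f"dup_count={d:2d}  mean_loss={mean:.2f}")
```

The same grouping supports dose-response curves, membership inference thresholds, or before/after-unlearning comparisons.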
We show Hubble is an ideal benchmark for membership inference and unlearning, and we invite the community to explore and build on it further ✨
Hubble enables a wide range of memorization research. Analyzing the inserted biographies 🧑‍💼 alone yields rich insights, for example revealing how readily different types of PII are memorized.
And there’s a lot more: book passages 📚, paraphrases 🔁, chat logs 💬, and test sets 🎯
Besides our core runs, we release several collections:
•🔀Interference runs (confirming that perturbations minimally interfere across domains)
•⏱️Timing runs (confirming perturbations inserted later in pretraining are memorized more strongly)
•✍️Paraphrased runs (trained on paraphrased perturbations)
🪐Our core release is 8 runs:
2 data conditions (standard, perturbed) × 2 model sizes (1B, 8B) × 2 pretraining sizes (100B, 500B tokens).
They establish *dilution* as a best practice to broadly address memorization risks — sensitive data can be diluted by scaling up the training corpus!
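The intuition behind dilution can be shown with back-of-the-envelope arithmetic: a fixed number of copies of a sensitive text occupies a smaller fraction of training data as the corpus scales. The passage size and copy count below are hypothetical, purely for illustration.

```python
def occurrence_fraction(copies: int, passage_tokens: int,
                        corpus_tokens: float) -> float:
    """Fraction of the corpus occupied by all copies of one passage."""
    return copies * passage_tokens / corpus_tokens

# Hypothetical passage: 10 copies of a 1,000-token text, inserted into
# corpora matching the two pretraining sizes in the core runs.
for corpus_tokens in (100e9, 500e9):
    frac = occurrence_fraction(10, 1_000, corpus_tokens)
    print(f"{corpus_tokens / 1e9:.0f}B-token corpus: "
          f"{frac:.1e} of training data")
```

Scaling the corpus 5× cuts the passage's share of training data 5× without touching the sensitive text itself.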
Announcing 🔭Hubble, a suite of open-source LLMs to advance the study of memorization!
We pretrained 1B/8B-parameter models with controlled insertion of texts designed to emulate key memorization risks: copyright (e.g., book passages), privacy (e.g., synthetic biographies), and test set contamination