Ameya Godbole

@ameyagodbole.bsky.social

PhD student USC NLP working on generalization and reasoning, prev UMassAmherst, IITG (he/him)

65 Followers 219 Following 8 Posts Joined Nov 2024
4 months ago

Our team dedicated so much effort to this over the last year: @johntzwei.bsky.social, me, @aflah02101.bsky.social, Ryan Wang, Xiaoyuan Zhu, James Flemings, Nitya Kashyap, Krishna P Gummadi, @willieneis.bsky.social, @robinjia.bsky.social

We are grateful to everyone who provided support throughout the project! 🥹

4 months ago
Hubble Suite

Website 🌎: allegro-lab.github.io/hubble/
Paper 🔗: arxiv.org/abs/2510.19811
Models 🤗: huggingface.co/allegrolab

4 months ago

For this project, NVIDIA AI provided 200K A100 hours on DGX Cloud through the NSF NAIRR pilot, and @hf.co provided 100TB of storage. Training used @eleutherai.bsky.social's GPT-NeoX and LM Evaluation Harness.

Thank you for your commitment to open-source science!

4 months ago

Since the perturbations are randomly duplicated 0 or more times, you can make a wide range of comparisons and measurements. 🔍💫
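To make the setup concrete, here is a minimal sketch (not the authors' actual pipeline) of inserting perturbations with randomized duplication; the function name and the choice of duplicate levels are illustrative assumptions:

```python
import random

def insert_with_random_duplication(corpus, perturbations, seed=0):
    """Sketch: insert each perturbation into the corpus a random number of
    times, from 0 (never seen) upward, so memorization can later be measured
    as a function of duplicate count. Not the authors' actual pipeline."""
    rng = random.Random(seed)
    # Power-of-two duplicate levels, including 0 -- an illustrative assumption.
    levels = [0, 1, 2, 4, 8, 16, 32, 64, 128, 256]
    dup_counts = {}
    out = list(corpus)
    for text in perturbations:
        n = rng.choice(levels)
        dup_counts[text] = n
        out.extend([text] * n)
    rng.shuffle(out)  # interleave perturbations with the base corpus
    return out, dup_counts
```

Recording the sampled duplicate count per perturbation is what enables comparisons like "memorization strength vs. number of duplicates" downstream.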

We show Hubble is an ideal benchmark for membership inference and unlearning, and we invite the community to further explore and build on Hubble ✨
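As one concrete use, a minimal sketch of the classic loss-thresholding membership-inference baseline (in the spirit of Yeom et al., 2018), assuming access to a per-text average loss; the function names and toy values are illustrative, not Hubble's evaluation code:

```python
def loss_threshold_attack(avg_nll, candidates, threshold):
    """Minimal loss-thresholding membership inference: a text the model
    assigns unusually low average negative log-likelihood is predicted
    to be a training member. `avg_nll` stands in for a real LM's loss."""
    return {text: avg_nll(text) < threshold for text in candidates}

# Toy stand-in losses: pretend the inserted text received low loss.
toy_nll = {"inserted bio": 1.2, "held-out bio": 3.8}
preds = loss_threshold_attack(toy_nll.get,
                              ["inserted bio", "held-out bio"],
                              threshold=2.0)
```

Because Hubble controls exactly which texts were inserted (and how often), attacks like this can be scored against ground-truth membership labels.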

4 months ago
Various attack formats are used to infer private information from memorized biographies.

Hubble enables a wide range of memorization research. Analyzing the inserted biographies 🧑‍💼 alone yields rich insights, revealing, for example, how readily different types of PII are memorized.
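One way such a measurement could look, as a hedged sketch: prompt with each synthetic biography's prefix and check for verbatim emission of the held-out attribute. The function, the stub "model", and the example strings are all illustrative assumptions:

```python
def pii_extraction_rate(generate, bios):
    """Sketch of one measurement: prompt with each synthetic biography's
    prefix and count how often the model emits the held-out PII attribute
    verbatim. `generate` stands in for an LM decode call."""
    hits = sum(1 for prefix, secret in bios if secret in generate(prefix))
    return hits / len(bios)

# Toy stand-in "model" that has memorized one biography.
memorized = {"Jane Doe was born on": " 1990-04-12 in Springfield."}
rate = pii_extraction_rate(lambda p: memorized.get(p, " [unknown]"),
                           [("Jane Doe was born on", "1990-04-12"),
                            ("John Roe was born on", "1985-07-01")])
```

Running this per PII type (birth dates, addresses, and so on) would surface the kind of per-attribute differences the post describes.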

And there’s a lot more: book passages 📚, paraphrases 🔁, chat logs 💬, and test sets 🎯

4 months ago

Besides our core runs, we release several collections:
•🔀Interference runs (confirming that perturbations minimally interfere across domains)
•⏱️Timing runs (confirming perturbations inserted later in pretraining are memorized more strongly)
•✍️Paraphrased runs (trained on paraphrased perturbations)

4 months ago
Memorization of sensitive data can be diluted by training on larger corpora. We report evaluations on a subset of tasks for the core 8B models trained on 100B and 500B tokens. At the same duplicate level, memorization is weaker for the model trained on 500B tokens than for the one trained on 100B. A separate finding shows that larger models memorize at lower duplication counts.

🪐Our core release is 8 runs:
2 data conditions (standard, perturbed) × 2 model sizes (1B, 8B) × 2 pretraining sizes (100B, 500B tokens).

They establish *dilution* as a best practice to broadly address memorization risks — sensitive data can be diluted by scaling up the training corpus!
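The dilution arithmetic can be made explicit with a small back-of-the-envelope calculation (my framing, not a formula from the paper):

```python
def dilution_fraction(dup_count, passage_tokens, corpus_tokens):
    """Fraction of pretraining tokens a duplicated passage occupies --
    the quantity that scaling up the corpus drives down."""
    return dup_count * passage_tokens / corpus_tokens

# Same duplicate level (100 copies of a 1,000-token passage), two corpus sizes:
at_100b = dilution_fraction(100, 1_000, 100_000_000_000)
at_500b = dilution_fraction(100, 1_000, 500_000_000_000)
# A 5x larger corpus means the passage is a 5x smaller share of training tokens.
```

This mirrors the empirical finding above: holding the duplicate count fixed while growing the corpus weakens memorization.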

4 months ago
Hubble Suite logo (cloth patch with names of key organizations involved: USC, MPI, NVIDIA)

Announcing 🔭Hubble, a suite of open-source LLMs to advance the study of memorization!

Pretrained 1B/8B-parameter models, with controlled insertion of texts designed to emulate key memorization risks: copyright (e.g., book passages), privacy (e.g., synthetic biographies), and test set contamination.

8 months ago
Overview of the key questions, data, and idea driving our analysis.

🤔 We know what people are using LLMs for, but do we know how they collaborate with an LLM?

🔍 In a recent paper, we answered this by analyzing multi-turn sessions in 21 million interaction logs from Microsoft Copilot for consumers and WildChat: arxiv.org/abs/2505.16023
