Our team poured so much into this over the last year: @johntzwei.bsky.social, me, @aflah02101.bsky.social, Ryan Wang, Xiaoyuan Zhu, James Flemings, Nitya Kashyap, Krishna P Gummadi, @willieneis.bsky.social, @robinjia.bsky.social
We are grateful to everyone who provided support throughout the project! 🥹
24.10.2025 18:21
Hubble Suite
Website 🌐: allegro-lab.github.io/hubble/
Paper 📄: arxiv.org/abs/2510.19811
Models 🤗: huggingface.co/allegrolab
For this project, NVIDIA AI provided 200K A100 hours on DGX Cloud through the NSF NAIRR Pilot, and @hf.co provided 100TB of storage. Training used @eleutherai.bsky.social's GPT-NeoX and eval harness.
Thank you for your commitment to open-source science!
Since the perturbations are randomly duplicated 0 or more times, you can make a wide range of comparisons and measurements.
We show Hubble is an ideal benchmark for membership inference and unlearning, and we invite the community to further explore and build on Hubble ✨
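One comparison this design enables is a loss-based membership-inference attack: texts duplicated during pretraining tend to get lower loss, so an attacker can try to separate them from held-out texts. Below is a minimal sketch of scoring such an attack with AUC; the function name and the loss values are illustrative, not part of the Hubble release.

```python
# Sketch of a loss-based membership-inference evaluation. We assume
# per-example language-model losses have already been computed for
# inserted ("member") and held-out ("non-member") texts; the numbers
# here are toy values.

def mia_auc(member_losses, nonmember_losses):
    """AUC of the classifier "predict member if loss is lower".

    Computed directly as the probability that a random member text
    has lower loss than a random non-member text (ties count half).
    """
    wins = 0.0
    for m in member_losses:
        for n in nonmember_losses:
            if m < n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_losses) * len(nonmember_losses))

members = [1.9, 2.1, 2.0]      # losses on inserted (duplicated) texts
nonmembers = [2.6, 2.4, 2.8]   # losses on never-seen texts
print(mia_auc(members, nonmembers))  # 1.0: perfectly separated toy data
```

An AUC near 0.5 means the attack cannot distinguish inserted from held-out texts; sweeping the duplication level of the perturbations traces out how memorization strengthens the attack.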
Various attack formats are used to infer private information from memorized biographies.
Hubble enables a wide range of memorization research. Analyzing the inserted biographies 🧑‍💼 alone yields rich insights, revealing, for example, how readily different types of PII are memorized.
And there's a lot more: book passages 📚, paraphrases 📝, chat logs 💬, and test sets 🎯
Besides our core runs, we release several collections:
• Interference runs (confirming that perturbations minimally interfere across domains)
• ⏱️ Timing runs (confirming perturbations inserted later in pretraining are memorized more strongly)
• Paraphrased runs (trained on paraphrased perturbations)
Memorization of sensitive data can be diluted by training on larger corpora. We report evaluations on a subset of tasks for the core 8B models trained on 100B and 500B tokens. At the same duplication level, memorization is weaker for the model trained on 500B tokens than for the one trained on 100B. A separate finding is that larger models memorize at lower duplication counts.
💪 Our core release is 8 runs:
2 data conditions (standard, perturbed) × 2 model sizes (1B, 8B) × 2 pretraining sizes (100B, 500B).
They establish *dilution* as a best practice to broadly address memorization risks: sensitive data can be diluted by scaling up the training corpus!
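The full 2 × 2 × 2 grid can be written out as a small config sketch. The factor names follow the thread; the identifier format is ours, not the official run names on the Hubble release.

```python
# Enumerate the 8 core Hubble runs from the three experimental factors
# described in the thread (data condition x model size x pretraining size).
from itertools import product

DATA = ["standard", "perturbed"]
PARAMS = ["1B", "8B"]
TOKENS = ["100B", "500B"]

# Hypothetical identifiers combining one choice per factor.
runs = [f"{d}-{p}-{t}" for d, p, t in product(DATA, PARAMS, TOKENS)]
print(len(runs))   # 8 core runs
print(runs[0])     # standard-1B-100B
```

Pairing each perturbed run with its standard counterpart (same size, same token budget) isolates the effect of the inserted texts from everything else in training.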
Hubble Suite logo (cloth patch with names of key organizations involved: USC, MPI, NVIDIA)
Announcing 🔭 Hubble, a suite of open-source LLMs to advance the study of memorization!
Pretrained 1B/8B param models, with controlled insertion of texts designed to emulate key memorization risks: copyright (e.g., book passages), privacy (e.g., synthetic biographies), and test set contamination
Overview of the key questions, data, and idea driving our analysis.
🤔 We know what people are using LLMs for, but do we know how they collaborate with an LLM?
📝 In a recent paper we answered this by analyzing multi-turn sessions in 21 million interaction logs from Microsoft Copilot (consumer) and WildChat: arxiv.org/abs/2505.16023
27.06.2025 20:13
CS PhD student at UMass Amherst (IESL; UMass-NLP). Working on search, optimization, and discovery with LLMs. Website: https://people.cs.umass.edu/~dagarwal/.
3rd year PhD student @uscnlp. Interested in NLP x CSS | she/her
NLP/ML @usc, @uwcse. she/her. 👩🏻‍🍳. velocitycavalry.github.io. Multilinguality, retrieval, refining LLMs…
(she/her) | #NLProc PhD student | NSF GRFP | LLM bias evaluation | community-engaged ethical AI | advocating for women and LGBTQ+ in STEM | okie living in LA
I (try to) do NLP research. Antipodean abroad.
currently doing PhD @uwcse,
prev @usyd @ai2
🇦🇺🇨🇦🇬🇧
ivison.id.au
Breakthrough AI to solve the world's biggest problems.
► Join us: http://allenai.org/careers
► Get our newsletter: https://share.hsforms.com/1uJkWs5aDRHWhiky3aHooIg3ioxm
NLP research - PhD student at UW
Associate professor of social computing at UW CSE, leading @socialfutureslab.bsky.social
social.cs.washington.edu
Researcher & entrepreneur | Co-founder @cosmik.network | Building https://semble.so/ | collective sensemaking | https://ronentk.me/ | Prev- Open Science Fellow @asterainstitute.bsky.social
Senior Lecturer at @CseHuji. #NLPROC
schwartz-lab-huji.github.io
UC Berkeley/BAIR, AI2 || Prev: UWNLP, Meta/FAIR || sewonmin.com
https://ananyahjha93.github.io
Second year PhD at @uwcse.bsky.social with @hanna-nlp.bsky.social and @lukezettlemoyer.bsky.social
🤖 Building AI agents & interactive environments: AppWorld (https://appworld.dev) #NLProc PhD @stonybrooku. Past intern Allen AI & visitor CILVR at NYU.
🐦 https://x.com/harsh3vedi
🌐 https://harshtrivedi.me/
Training big models at @ai2.bsky.social.
https://cs.stanford.edu/~rewang
AI & Education ✨ On the academic+industry job market. CS PhD @stanfordnlp
prev: MIT 2020, Google Brain, Google Brain Robotics,
@allen_ai
{Creativity, AI, People} | HCI researcher & software eng | @allenai
Ph.D. student at University of Washington CSE. NLP. IBM Ph.D. fellow (2022-2023). Meta student researcher (2023-).