We’re excited about Oolong as a challenging benchmark for information aggregation! Let us know which models we should benchmark next 👀
Paper: arxiv.org/abs/2511.02817
Dataset: huggingface.co/oolongbench
Code: github.com/abertsch72/o...
Leaderboard: oolongbench.github.io
Can LLMs accurately aggregate information over long, information-dense texts? Not yet…
We introduce Oolong, a dataset of simple-to-verify information aggregation questions over long inputs. No model achieves >50% accuracy at 128K on Oolong!
Yes! tbh this method is probably much more immediately useful for helping one understand subtle differences between [models trained on] subtly different data subsets, vs a loftier goal of helping one find "the" best data mixture -- to anyone considering this method, please feel free to reach out :)
I almost never use these so I always thought that they were cute little things that let seatmates watch the same movie
Congrats Lucy!!
Come through! #492 in Hall 2, 10am-12:30pm
Our paper documenting the environmental impacts of creating OLMo language models is the most honest and comprehensive characterization I know of, including training, development (!) and inference costs. If you're at ICLR chat with @jacobcares.bsky.social & @clarana.bsky.social Sat morning 10-12:30!
📜Paper: arxiv.org/abs/2503.05804
✍️Thanks to my illustrious coauthors @clarana.bsky.social @jaredfern.bsky.social timdettmers.com @strubell.bsky.social @jessedodge.bsky.social, 'twas a fun project 🌏
I'm in Singapore for @iclr-conf.bsky.social ! Come check out our spotlight paper on the environmental impact of training OLMo (link in next tweet) during the Saturday morning poster session from 10-12:30 -- happy to chat about this or anything else! DMs should be open, email works too
We've received multiple notes that NOAA research services (Office of Oceanic and Atmospheric Research) may go offline at midnight. @safeguardingdata.bsky.social is working on web archiving, but if others want to nominate on this, that might be good: digital2.library.unt.edu/nomination/G...
How can we better think and talk about human-like qualities attributed to language technologies like LLMs? In our #CHI2025 paper, we taxonomize how text outputs from cases of user interactions with language technologies can contribute to anthropomorphism. arxiv.org/abs/2502.09870 1/n
Did you know? Gestures used to express universal concepts, like wishing for luck, vary DRAMATICALLY across cultures!
🤞 means luck in the US but is deeply offensive in Vietnam 🚨
📣 We introduce MC-SIGNS, a test bed to evaluate how LLMs/VLMs/T2I handle such nonverbal behavior!
📜: arxiv.org/abs/2502.17710
the science of LMs should be fully open✨
today @akshitab.bsky.social @natolambert.bsky.social and I are giving our #neurips2024 tutorial on language model development.
everything from data to training to adaptation. published or not, no secrets 🫡
tues, 12/10, 9:30am PT ☕️
neurips.cc/virtual/2024...
How open is “open” AI, really?
It isn’t just about making models reusable. If the origin of data is opaque, if labor is hidden & exploited, if frameworks are dominated by Big Tech, if computational power is mastered by an oligopoly…‘open’ is just a label.
Meredith Whittaker & friends in Nature.
I noticed a lot of starter packs skewed towards faculty/industry, so I made one of just NLP & ML students: go.bsky.app/vju2ux
Students do different research, go on the job market, and recruit other students. Ping me and I'll add you!
💬 Have you or a loved one compared LM probabilities to human linguistic acceptability judgments? You may be overcompensating for the effect of frequency and length!
🌟 In our new paper, we rethink how we should be controlling for these factors 🧵:
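For context on what "controlling for frequency and length" can look like, here is a minimal sketch of SLOR (syntactic log-odds ratio), a classic correction from the acceptability-judgment literature — not necessarily this paper's new proposal. The log-probabilities below are toy values, not from a real LM.

```python
# SLOR: subtract the unigram log-probability of a sentence from its LM
# log-probability, then normalize by sentence length (in tokens).
# All numbers here are invented for illustration.

def slor(model_logprob, unigram_logprobs):
    """Length- and frequency-corrected sentence score."""
    return (model_logprob - sum(unigram_logprobs)) / len(unigram_logprobs)

# a hypothetical 4-token sentence
score = slor(model_logprob=-10.0, unigram_logprobs=[-4.0, -3.0, -5.0, -2.5])
```

A sentence made of frequent words gets no credit for lexical frequency alone, and longer sentences aren't penalized simply for having more tokens.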
@jaredfern.bsky.social is at 162
Hi I am at 232 in the back of the riverfront room!
I'm at EMNLP! Presenting the poster for this paper on Thursday morning (10:30-12), Session F Riverfront Hall, come say hi :)
(Hehe first bsky post!) I'll be at #EMNLP2024 💃🌴! Happy to chat about (among other things):
✨linguistically+cognitively motivated evaluation
✨NLP for low-resource+endangered languages
✨figuring out what features of language data LMs are *actually* learning
I'll be presenting two posters 🧵:
scrolling,,, minimal doom ?!
Understanding "Democratization" in NLP and ML Research - joint work that @arjunsubgraph.bsky.social and I co-led with Dietrich Klakow and @zeerak.bsky.social
aclanthology.org/2024.emnlp-m...
hi ! :)
A starter pack for #NLP #NLProc researchers! 🎉
go.bsky.app/SngwGeS
I'll be presenting our paper at #EMNLP2024 next week -- see y'all in Miami🌴! This was my Summer 2023 work @ai2.bsky.social Grateful to my wonderful collaborators @ianmagnusson.bsky.social @ananyahjha93.bsky.social @tomsherborne.bsky.social & mentors @strubell.bsky.social, Jesse, and Pradeep (6/n)
Check out the paper for details and our specific recommendations!
🤗Data and models: huggingface.co/collections/...
👩💻Repo: github.com/clarana/ez-d...
📄Paper again: arxiv.org/abs/2410.15661
(5/n)
We can even predict larger model perplexity scores w/ smaller model proxy evals, AND the relationship holds even when the actual ppl scores are high (4/n)
What does this mean? We can simulate *comprehensive and fine-grained* data ablations on language corpora, at scale! Required training compute scales only linearly wrt *new* training data, i.e. work for previously seen train data is "cached" and reusable in subsequent evals (3/n)
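The "cached" proxy evals above can be sketched as a memo table over data partitions (hypothetical names throughout; `train_and_eval` stands in for the expensive proxy-model training step):

```python
# Hedged sketch of cached proxy evals: scores for previously seen partitions
# are reused, so only genuinely new data costs training compute.
cache = {}

def proxy_scores(partitions, train_and_eval):
    for p in partitions:
        if p not in cache:                 # pay the training cost once...
            cache[p] = train_and_eval(p)
    return [cache[p] for p in partitions]  # ...then reuse in every ablation
```

Adding one new partition to an ablation sweep then triggers exactly one new training run, matching the linear-in-new-data compute claim.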
We show that there is a reliable *linear correlation* between perplexity evaluation scores for a model trained on a data mixture, and proxy scores from models trained on partitions of the mixture -- f(🟦🟩🟪) vs. f(🟦) f(🟩) f(🟪)
❗️This also works on arbitrary eval data (2/n)
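As a toy illustration of the linear-correlation idea (a hedged sketch with invented perplexity numbers, not the paper's actual procedure): collapse each candidate mixture's per-partition proxy perplexities into one aggregate feature, then fit a line against the observed mixture perplexity.

```python
# Hypothetical sketch: predict f(mixture) from an aggregate of the
# per-partition proxy scores f(partition). All numbers are made up;
# real proxy scores would come from models trained on each partition.

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# one entry per candidate mixture: mean proxy perplexity over its partitions
proxy_ppl   = [12.0, 11.0, 13.0, 12.5]   # aggregated f(🟦) f(🟩) f(🟪)
mixture_ppl = [11.0, 10.5, 11.5, 11.25]  # observed f(🟦🟩🟪)

slope, intercept = fit_line(proxy_ppl, mixture_ppl)

def predict(aggregate_proxy_ppl):
    return slope * aggregate_proxy_ppl + intercept
```

Once fit, `predict` lets you rank unseen mixtures from their proxy scores alone, without training a model on each full mixture.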
Building/customizing your own LLM? You'll want to curate training data for it, but how do you know what makes the data good?
You can try out recipes👩🍳 iterate on ✨vibes✨ but we can't actually test all possible combos of tweaks,,, right?? 🙅♂️WRONG! arxiv.org/abs/2410.15661 (1/n) 🧵