Ben Newman

@benn9.bsky.social

NLP research - PhD student at UW

748 Followers  |  115 Following  |  6 Posts  |  Joined: 05.10.2023

Latest posts by benn9.bsky.social on Bluesky

Did you know that LLMs suffer from serious mode collapse?

For example, if you ask models to tell you a joke, they almost always tell you the same one. This is true across samples and even across model families!

Why does this happen? Can we improve it?
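A quick way to see this for yourself is to sample the same prompt many times and count distinct completions. Here is a minimal sketch, assuming an OpenAI-compatible client; the model name, prompt, and sample count are illustrative placeholders, not settings from the post:

```python
# Minimal sketch: quantify mode collapse by sampling one prompt repeatedly
# and counting distinct completions. Assumes an OpenAI-compatible client;
# model name, prompt, and sample count are illustrative placeholders.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def sample_jokes(n: int = 50, model: str = "gpt-4o-mini") -> Counter:
    jokes = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "Tell me a joke."}],
            temperature=1.0,  # stochastic sampling, not greedy decoding
        )
        jokes.append(resp.choices[0].message.content.strip())
    return Counter(jokes)

counts = sample_jokes()
# Under severe mode collapse, one completion dominates even at temperature 1.
print(f"{len(counts)} distinct jokes out of {sum(counts.values())} samples")
print("most common:", counts.most_common(1))
```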

08.10.2025 14:22 · 👍 4    🔁 2    💬 1    📌 0
A scatter plot comparing language models by performance (y-axis, measured in average performance on 10 benchmarks) versus training computational cost (x-axis, in approximate FLOPs). The plot shows OLMo 2 models (marked with stars) achieving Pareto-optimal efficiency among open models, with OLMo-2-13B and OLMo-2-7B sitting at the performance frontier relative to other open models like DCLM, Llama 3.1, StableLM 2, and Qwen 2.5. The x-axis ranges from 4x10^22 to 2x10^24 FLOPs, while the y-axis ranges from 35 to 70 benchmark points.

Excited to share OLMo 2!

🐟 7B and 13B weights, trained up to 4-5T tokens, fully open data, code, etc.
🐠 better architecture and recipe for training stability
🐑 staged training, with new data mix Dolmino🍕 added during annealing
🦈 state-of-the-art OLMo 2 Instruct models

#nlp #mlsky

links below 👇

26.11.2024 20:59 · 👍 68    🔁 12    💬 1    📌 1
A photo of Boulder, Colorado, shot from above the university campus and looking toward the Flatirons.

I'm recruiting 1-2 PhD students to work with me at the University of Colorado Boulder! Looking for creative students with interests in #NLP and #CulturalAnalytics.

Boulder is a lovely college town 30 minutes from Denver and 1 hour from Rocky Mountain National Park 😎

Apply by December 15th!

19.11.2024 10:38 · 👍 304    🔁 136    💬 9    📌 12
Abhilasha Ravichander - Home

✨I am on the faculty job market in the 2024-2025 cycle!✨

My research centers on advancing Responsible AI, specifically enhancing factuality, robustness, and transparency in AI systems.

If you have relevant positions, let me know: lasharavichander.github.io. Please share/RT!

11.11.2024 14:23 · 👍 51    🔁 22    💬 2    📌 1

Why and when do preference annotators disagree? And how do reward models + LLM-as-Judge evaluators handle disagreements?

Michael explored these questions in a new ✨preprint✨ from his @ai2.bsky.social internship with me!

07.11.2024 17:38 · 👍 29    🔁 8    💬 1    📌 1
ArxivDIGESTables: Synthesizing Scientific Literature into Tables using Language Models Benjamin Newman, Yoonjoo Lee, Aakanksha Naik, Pao Siangliulue, Raymond Fok, Juho Kim, Daniel S Weld, Joseph Chee Chang, Kyle Lo. Proceedings of the 2024 Conference on Empirical Methods in Natural Lang...

This is work with Yoonjoo Lee, @arnaik19.bsky.social, @paopow.bsky.social, @juhokim.bsky.social, Dan Weld, @josephc.bsky.social, and @kylelo.bsky.social at S2 @ai2.bsky.social, UW CSE, and KAIST.

For more, check out our
Dataset: github.com/bnewm0609/ar...
Paper: aclanthology.org/2024.emnlp-m...

11.11.2024 17:37 · 👍 5    🔁 0    💬 0    📌 0
Two plots of recall versus threshold for determining a match: one for GPT-3.5 Turbo and another for Mixtral 8x22B. There are five lines in each plot. Each line travels from the top left to bottom right of the plot with y-intercepts that are generally in increasing order by the following types of context: generated caption, baseline, gold caption, in-context examples, caption + in-text references.

We also find that providing more table context (captions, in-text references) to models leads to higher recall when generating columns but does not help when generating values.

11.11.2024 17:37 · 👍 1    🔁 0    💬 1    📌 0
A plot of recall versus threshold for determining a match between column headers. Llama3 has the highest recall because it hallucinates matches, but Sentence Transformers does better.

We find that using decontextualization with SBERT leads to a better evaluator than Llama 3, which hallucinates alignments.
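For the matching half of that evaluator, here is a minimal sketch using the sentence-transformers library; the model choice and threshold are illustrative rather than the paper's exact configuration, and the upstream decontextualization step (rewriting headers to be self-contained before embedding) is omitted:

```python
# Minimal sketch of embedding-based column alignment: a gold header counts
# as recalled if some generated header clears a cosine-similarity threshold.
# Model and threshold are illustrative, not the paper's exact configuration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def header_recall(gold: list[str], generated: list[str], threshold: float = 0.7) -> float:
    gold_emb = model.encode(gold, convert_to_tensor=True)
    gen_emb = model.encode(generated, convert_to_tensor=True)
    sims = util.cos_sim(gold_emb, gen_emb)  # shape: |gold| x |generated|
    # A gold column is recalled if its best-matching generated column is
    # similar enough; raising the threshold lowers recall.
    recalled = (sims.max(dim=1).values >= threshold).sum().item()
    return recalled / len(gold)

print(header_recall(["Dataset", "Size", "Task"], ["Corpus used", "# of examples"]))
```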

11.11.2024 17:37 · 👍 2    🔁 0    💬 1    📌 0
A diagram showing two steps of table generation. There is text that says "Step 1: Schema Generation" with an arrow pointing to the column headers of a generated table. Under it, there is text that says "Step 2: Value Generation" with an arrow pointing to the body of the generated table.

We propose a two-step procedure for generating tables given the input papers (sketched below):
1️⃣ Generate the schemas (sets of columns)
2️⃣ Fill in the values.
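A minimal sketch of that two-step pipeline, again assuming an OpenAI-compatible client; the prompts are illustrative paraphrases, not the paper's templates:

```python
# Minimal two-step sketch: generate a schema, then fill in the values.
# Prompts are illustrative paraphrases, not the paper's templates.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str, model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def generate_table(papers: list[str]) -> tuple[str, str]:
    docs = "\n\n".join(papers)
    # Step 1: propose a schema (set of columns) shared across the papers.
    schema = ask(
        "Propose column headers for a literature review table comparing "
        f"these papers, one header per line:\n\n{docs}"
    )
    # Step 2: fill in one value per paper per column.
    values = ask(
        f"Using these columns:\n{schema}\n\nFill in a markdown table with "
        f"one row per paper:\n\n{docs}"
    )
    return schema, values
```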

11.11.2024 17:37 · 👍 0    🔁 0    💬 1    📌 0
An example literature review table with four rows and four columns. Each row is a paper (labeled Paper 1, Paper 2, etc.). Each column is a different aspect: ("Dataset", "Size", "Task", and "Annotations").

This table generation task takes as input multiple papers, and synthesizes them into a single output table. We collect a dataset of such tables and associated papers, and augment the tables with additional context such as their captions and in-text references.
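Concretely, one record pairs the input papers with a gold table and its surrounding context. A minimal sketch of such a record; the field names are hypothetical, not necessarily the released dataset's schema:

```python
# Illustrative record shape for the task; field names are hypothetical
# and not necessarily the released dataset's schema.
example = {
    "papers": ["full text of Paper 1", "full text of Paper 2"],  # inputs
    "table": {  # gold literature review table to reproduce
        "columns": ["Dataset", "Size", "Task", "Annotations"],
        "rows": {
            "Paper 1": ["...", "...", "...", "..."],
            "Paper 2": ["...", "...", "...", "..."],
        },
    },
    # Additional context collected alongside each table:
    "caption": "the table's original caption",
    "in_text_references": ["sentences in the source paper that cite the table"],
}
```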

11.11.2024 17:37 · 👍 0    🔁 0    💬 1    📌 0
A screenshot of the first page of the paper discussed in the thread. Figure 1 contains a set of three cartoon papers with related text highlighted in three different colors. To its left, there's an arrow pointing to a cartoon table with a column corresponding to each color and a row corresponding to each paper.

✨EMNLP Paper!✨
Have you ever constructed a table to organize your literature review process? Can we use LMs to generate these automatically?

We are excited to present ArxivDIGESTables 🍽️, a study of collecting, generating, and evaluating 🎓 scientific literature review tables 📃!

11.11.2024 17:37 · 👍 29    🔁 2    💬 2    📌 3
