Ben Newman

@benn9.bsky.social

NLP research - PhD student at UW

748 Followers  |  115 Following  |  6 Posts  |  Joined: 05.10.2023

Latest posts by benn9.bsky.social on Bluesky

Did you know that LLMs suffer from serious mode collapse?

For example, if you ask models to tell you a joke, they almost always tell you the same one. This is true across samples and even across model families!

Why does this happen? Can we improve it?
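A quick way to see this for yourself is to sample the same prompt many times and count distinct completions. Here is a minimal sketch, assuming an OpenAI-compatible client; the model name, prompt, and sample count are illustrative placeholders, not settings from the post:

```python
# Minimal sketch: quantify mode collapse by sampling one prompt repeatedly
# and counting distinct completions. Assumes an OpenAI-compatible client;
# model name, prompt, and sample count are illustrative placeholders.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def sample_jokes(n: int = 50, model: str = "gpt-4o-mini") -> Counter:
    jokes = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "Tell me a joke."}],
            temperature=1.0,  # stochastic sampling, not greedy decoding
        )
        jokes.append(resp.choices[0].message.content.strip())
    return Counter(jokes)

counts = sample_jokes()
# Under severe mode collapse, one completion dominates even at temperature 1.
print(f"{len(counts)} distinct jokes out of {sum(counts.values())} samples")
print("most common:", counts.most_common(1))
```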

08.10.2025 14:22 · 👍 4    🔁 2    💬 1    📌 0
A scatter plot comparing language models by performance (y-axis, measured in average performance on 10 benchmarks) versus training computational cost (x-axis, in approximate FLOPs). The plot shows OLMo 2 models (marked with stars) achieving Pareto-optimal efficiency among open models, with OLMo-2-13B and OLMo-2-7B sitting at the performance frontier relative to other open models like DCLM, Llama 3.1, StableLM 2, and Qwen 2.5. The x-axis ranges from 4x10^22 to 2x10^24 FLOPs, while the y-axis ranges from 35 to 70 benchmark points.

Excited to share OLMo 2!

🐟 7B and 13B weights, trained up to 4-5T tokens, fully open data, code, etc.
🐠 better architecture and recipe for training stability
🐑 staged training, with new data mix Dolmino🍕 added during annealing
🦈 state-of-the-art OLMo 2 Instruct models

#nlp #mlsky

links below 👇

26.11.2024 20:59 · 👍 68    🔁 12    💬 1    📌 1
A photo of Boulder, Colorado, shot from above the university campus and looking toward the Flatirons.

I'm recruiting 1-2 PhD students to work with me at the University of Colorado Boulder! Looking for creative students with interests in #NLP and #CulturalAnalytics.

Boulder is a lovely college town 30 minutes from Denver and 1 hour from Rocky Mountain National Park 😎

Apply by December 15th!

19.11.2024 10:38 · 👍 304    🔁 136    💬 9    📌 12
Abhilasha Ravichander - Home

✨I am on the faculty job market in the 2024-2025 cycle!✨

My research centers on advancing Responsible AI, specifically enhancing factuality, robustness, and transparency in AI systems.

If you have relevant positions, let me know: lasharavichander.github.io. Please share/RT!

11.11.2024 14:23 · 👍 51    🔁 22    💬 2    📌 1

Why and when do preference annotators disagree? And how do reward models + LLM-as-Judge evaluators handle disagreements?

Michael explored these questions in a new ✨preprint✨ from his @ai2.bsky.social internship with me!

07.11.2024 17:38 · 👍 29    🔁 8    💬 1    📌 1
ArxivDIGESTables: Synthesizing Scientific Literature into Tables using Language Models Benjamin Newman, Yoonjoo Lee, Aakanksha Naik, Pao Siangliulue, Raymond Fok, Juho Kim, Daniel S Weld, Joseph Chee Chang, Kyle Lo. Proceedings of the 2024 Conference on Empirical Methods in Natural Lang...

This is work with Yoonjoo Lee, @arnaik19.bsky.social, @paopow.bsky.social, @juhokim.bsky.social, Dan Weld, @josephc.bsky.social, and @kylelo.bsky.social at S2 @ai2.bsky.social, UW CSE, and KAIST.

For more, check out our
Dataset: github.com/bnewm0609/ar...
Paper: aclanthology.org/2024.emnlp-m...

11.11.2024 17:37 · 👍 5    🔁 0    💬 0    📌 0
Two plots of recall versus threshold for determining a match: one for GPT-3.5 Turbo and another for Mixtral 8x22B. There are five lines in each plot. Each line travels from the top left to bottom right of the plot with y-intercepts that are generally in increasing order by the following types of context: generated caption, baseline, gold caption, in-context examples, caption + in-text references.

We also find that providing more table context (captions, in-text references) to models leads to higher recall when generating columns but does not help when generating values.

11.11.2024 17:37 · 👍 1    🔁 0    💬 1    📌 0
A plot of recall versus threshold for determining a match between column headers. Llama3 has the highest recall because it hallucinates matches, but Sentence Transformers does better.

We find that using decontextualization with SBERT leads to a better evaluator than Llama 3, which hallucinates alignments.
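For the matching half of that evaluator, here is a minimal sketch using the sentence-transformers library; the model choice and threshold are illustrative rather than the paper's exact configuration, and the upstream decontextualization step (rewriting headers to be self-contained before embedding) is omitted:

```python
# Minimal sketch of embedding-based column alignment: a gold header counts
# as recalled if some generated header clears a cosine-similarity threshold.
# Model and threshold are illustrative, not the paper's exact configuration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def header_recall(gold: list[str], generated: list[str], threshold: float = 0.7) -> float:
    gold_emb = model.encode(gold, convert_to_tensor=True)
    gen_emb = model.encode(generated, convert_to_tensor=True)
    sims = util.cos_sim(gold_emb, gen_emb)  # shape: |gold| x |generated|
    # A gold column is recalled if its best-matching generated column is
    # similar enough; raising the threshold lowers recall.
    recalled = (sims.max(dim=1).values >= threshold).sum().item()
    return recalled / len(gold)

print(header_recall(["Dataset", "Size", "Task"], ["Corpus used", "# of examples"]))
```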

11.11.2024 17:37 · 👍 2    🔁 0    💬 1    📌 0
A diagram showing two steps of table generation. There is text that says "Step 1: Schema Generation" with an arrow pointing to the column headers of a generated table. Under it, there is text that says "Step 2: Value Generation" with an arrow pointing to the body of the generated table.

We propose a two-step procedure for generating tables given the input papers (sketched below):
1️⃣ Generate the schemas (sets of columns)
2️⃣ Fill in the values.
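A minimal sketch of that two-step pipeline, again assuming an OpenAI-compatible client; the prompts are illustrative paraphrases, not the paper's templates:

```python
# Minimal two-step sketch: generate a schema, then fill in the values.
# Prompts are illustrative paraphrases, not the paper's templates.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str, model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def generate_table(papers: list[str]) -> tuple[str, str]:
    docs = "\n\n".join(papers)
    # Step 1: propose a schema (set of columns) shared across the papers.
    schema = ask(
        "Propose column headers for a literature review table comparing "
        f"these papers, one header per line:\n\n{docs}"
    )
    # Step 2: fill in one value per paper per column.
    values = ask(
        f"Using these columns:\n{schema}\n\nFill in a markdown table with "
        f"one row per paper:\n\n{docs}"
    )
    return schema, values
```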

11.11.2024 17:37 · 👍 0    🔁 0    💬 1    📌 0
An example literature review table with four rows and four columns. Each row is a paper (labeled Paper 1, Paper 2, etc.). Each column is a different aspect: ("Dataset", "Size", "Task", and "Annotations").

This table generation task takes as input multiple papers, and synthesizes them into a single output table. We collect a dataset of such tables and associated papers, and augment the tables with additional context such as their captions and in-text references.
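Concretely, one record pairs the input papers with a gold table and its surrounding context. A minimal sketch of such a record; the field names are hypothetical, not necessarily the released dataset's schema:

```python
# Illustrative record shape for the task; field names are hypothetical
# and not necessarily the released dataset's schema.
example = {
    "papers": ["full text of Paper 1", "full text of Paper 2"],  # inputs
    "table": {  # gold literature review table to reproduce
        "columns": ["Dataset", "Size", "Task", "Annotations"],
        "rows": {
            "Paper 1": ["...", "...", "...", "..."],
            "Paper 2": ["...", "...", "...", "..."],
        },
    },
    # Additional context collected alongside each table:
    "caption": "the table's original caption",
    "in_text_references": ["sentences in the source paper that cite the table"],
}
```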

11.11.2024 17:37 · 👍 0    🔁 0    💬 1    📌 0
A screenshot of the first page of the paper discussed in the thread. Figure 1 contains a set of three cartoon papers with related text highlighted in three different colors. To its left, there's an arrow pointing to a cartoon table with a column corresponding to each color and a row corresponding to each paper.

✨EMNLP Paper!✨
Have you ever constructed a table to organize your literature review process? Can we use LMs to generate these automatically?

We are excited to present ArxivDIGESTables 🍽️, a study of collecting, generating, and evaluating 🎓 scientific literature review tables 📃!

11.11.2024 17:37 · 👍 29    🔁 2    💬 2    📌 3
