Did you know that LLMs suffer from serious mode collapse?
For example, if you ask models to tell you a joke, they almost always tell you the same joke. This is true across samples and even across model families!
Why does this happen? Can we improve it?
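One simple way to quantify this kind of mode collapse (a hypothetical sketch, not from the thread) is to sample many completions for the same prompt and measure how concentrated they are; the joke strings below are toy stand-ins for actual model outputs.

```python
from collections import Counter

def mode_collapse_rate(samples):
    """Fraction of samples taken up by the single most common output.

    1.0 means total collapse (every sample identical);
    1/len(samples) means every sample is distinct.
    """
    counts = Counter(samples)
    return counts.most_common(1)[0][1] / len(samples)

# Toy illustration: 8 of 10 "samples" are the same joke.
samples = ["why did the chicken..."] * 8 + ["knock knock...", "a horse..."]
print(mode_collapse_rate(samples))  # → 0.8
```

In practice you would normalize the outputs (lowercase, strip punctuation) before counting, since near-duplicate jokes are still collapse.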
08.10.2025 14:22 · 4 likes · 2 reposts · 1 reply · 0 quotes
A scatter plot comparing language models by performance (y-axis, measured in average performance on 10 benchmarks) versus training computational cost (x-axis, in approximate FLOPs). The plot shows OLMo 2 models (marked with stars) achieving Pareto-optimal efficiency among open models, with OLMo-2-13B and OLMo-2-7B sitting at the performance frontier relative to other open models like DCLM, Llama 3.1, StableLM 2, and Qwen 2.5. The x-axis ranges from 4x10^22 to 2x10^24 FLOPs, while the y-axis ranges from 35 to 70 benchmark points.
Excited to share OLMo 2!
• 7B and 13B weights, trained up to 4-5T tokens, fully open data, code, etc.
• better architecture and recipe for training stability
• staged training, with new data mix Dolmino added during annealing
• state-of-the-art OLMo 2 Instruct models
#nlp #mlsky
links below!
26.11.2024 20:59 · 68 likes · 12 reposts · 1 reply · 1 quote
A photo of Boulder, Colorado, shot from above the university campus and looking toward the Flatirons.
I'm recruiting 1-2 PhD students to work with me at the University of Colorado Boulder! Looking for creative students with interests in #NLP and #CulturalAnalytics.
Boulder is a lovely college town 30 minutes from Denver and 1 hour from Rocky Mountain National Park.
Apply by December 15th!
19.11.2024 10:38 · 304 likes · 136 reposts · 9 replies · 12 quotes
Abhilasha Ravichander - Home
✨I am on the faculty job market in the 2024-2025 cycle!✨
My research centers on advancing Responsible AI, specifically enhancing factuality, robustness, and transparency in AI systems.
If you have relevant positions, let me know: lasharavichander.github.io. Please share/RT!
11.11.2024 14:23 · 51 likes · 22 reposts · 2 replies · 1 quote
Why and when do preference annotators disagree? And how do reward models + LLM-as-Judge evaluators handle disagreements?
Michael explored these questions in a new ✨preprint✨ from his @ai2.bsky.social internship with me!
07.11.2024 17:38 · 29 likes · 8 reposts · 1 reply · 1 quote
ArxivDIGESTables: Synthesizing Scientific Literature into Tables using Language Models
Benjamin Newman, Yoonjoo Lee, Aakanksha Naik, Pao Siangliulue, Raymond Fok, Juho Kim, Daniel S Weld, Joseph Chee Chang, Kyle Lo. Proceedings of the 2024 Conference on Empirical Methods in Natural Lang...
This is work with Yoonjoo Lee, @arnaik19.bsky.social @paopow.bsky.social, @juhokim.bsky.social, Dan Weld, @josephc.bsky.social, and @kylelo.bsky.social
at S2 @ai2.bsky.social, UW CSE and KAIST
For more, check out our
Dataset: github.com/bnewm0609/ar...
Paper: aclanthology.org/2024.emnlp-m...
11.11.2024 17:37 · 5 likes · 0 reposts · 0 replies · 0 quotes
Two plots of recall versus threshold for determining a match: one for GPT-3.5 Turbo and another for Mixtral 8x22B. There are five lines in each plot, each running from the top left to the bottom right. Their y-intercepts generally increase in the following order of context type: generated caption, baseline, gold caption, in-context examples, caption + in-text references.
We also find that providing more table context (captions, in-text references) to models leads to higher recall when generating columns but does not help when generating values.
11.11.2024 17:37 · 1 like · 0 reposts · 1 reply · 0 quotes
A plot of recall versus threshold for determining a match between column headers. Llama 3 shows the highest recall because it hallucinates matches; Sentence Transformers does better.
We find that using decontextualization with SBERT leads to a better evaluator than Llama 3, which hallucinates alignments.
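A minimal sketch of threshold-based header matching of the kind described here, assuming a greedy one-to-one alignment; `embed` is a placeholder for a real SBERT encoder (e.g. sentence-transformers), and the 2-d vectors are toy values, not real embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def header_recall(pred, gold, embed, threshold=0.7):
    """Greedily align predicted headers to gold headers; a gold header
    counts as recalled when some prediction clears the similarity threshold."""
    matched = set()
    for p in pred:
        best, best_sim = None, threshold
        for g in gold:
            if g in matched:
                continue
            sim = cosine(embed(p), embed(g))
            if sim >= best_sim:
                best, best_sim = g, sim
        if best is not None:
            matched.add(best)
    return len(matched) / len(gold)

# Toy 2-d "embeddings"; a real evaluator would call an SBERT encoder here.
toy = {"dataset": [1.0, 0.0], "corpus": [0.9, 0.1], "size": [0.0, 1.0]}
print(header_recall(["corpus"], ["dataset", "size"], toy.__getitem__))  # → 0.5
```

Sweeping `threshold` from 0 to 1 traces out recall-versus-threshold curves like the ones in the plot above.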
11.11.2024 17:37 · 2 likes · 0 reposts · 1 reply · 0 quotes
A diagram showing two steps of table generation. There is text that says "Step 1: Schema Generation" with an arrow pointing to the column headers of a generated table. Under it, there is text that says "Step 2: Value Generation" with an arrow pointing to the body of the generated table.
We propose a two-step procedure for generating tables given the input papers:
1οΈβ£ Generate the schemas (sets of columns)
2οΈβ£ Fill in the values.
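The two steps above can be sketched as two rounds of LLM calls (a hypothetical outline: `llm` stands in for any text-completion function, and the prompts are illustrative, not the paper's).

```python
def generate_table(papers, llm):
    """Step 1: propose a shared schema; Step 2: fill one row per paper."""
    # Step 1: schema generation, conditioning on all input papers jointly.
    schema_prompt = (
        "Propose comma-separated column headers for a literature review "
        "table comparing these papers:\n"
        + "\n".join(p["title"] for p in papers)
    )
    columns = [c.strip() for c in llm(schema_prompt).split(",")]

    # Step 2: value generation, one cell per (paper, column) pair.
    table = []
    for paper in papers:
        row = {
            col: llm(f"From the paper '{paper['title']}', extract: {col}")
            for col in columns
        }
        table.append(row)
    return columns, table
```

In the paper's setting, table context such as captions and in-text references would be appended to these prompts, which (per the thread) helps schema generation more than value generation.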
11.11.2024 17:37 · 0 likes · 0 reposts · 1 reply · 0 quotes
An example literature review table with four rows and four columns. Each row is a paper (labeled Paper 1, Paper 2, etc.). Each column is a different aspect: ("Dataset", "Size", "Task", and "Annotations").
This table generation task takes as input multiple papers, and synthesizes them into a single output table. We collect a dataset of such tables and associated papers, and augment the tables with additional context such as their captions and in-text references.
11.11.2024 17:37 · 0 likes · 0 reposts · 1 reply · 0 quotes
A screenshot of the first page of the paper discussed in the thread. Figure 1 contains a set of three cartoon papers with related text highlighted in three different colors. To its left, there's an arrow pointing to a cartoon table with a column corresponding to each color and a row corresponding to each paper.
✨EMNLP Paper!✨
Have you ever constructed a table to organize your literature review process? Can we use LMs to generate these automatically?
We are excited to present ArxivDIGESTables, a study of collecting, generating, and evaluating scientific literature review tables!
11.11.2024 17:37 · 29 likes · 2 reposts · 2 replies · 3 quotes
CS PhD Student @University of Washington, CSxPhilosophy @Dartmouth College
Interested in MARL, Social Reasoning, and Collective Decision making in people, machines, and other organisms
kjha02.github.io
Assistant Professor, Stanford Law School. Aussie struggling with °F & online speech stuff.
Asst Prof. @ UCSD | PI of LeMoN Lab | Former Postdoc at ETH Zürich, PhD @ NYU | computational linguistics, NLProc, CogSci, pragmatics | he/him 🏳️‍🌈
alexwarstadt.github.io
cs phd student and kempner institute graduate fellow at harvard.
interested in language, cognition, and ai
soniamurthy.com
Author: Verified: How to Think Straight, Get Duped Less, and Make Better Decisions about What to Believe Online (University of Chicago Press).
Researcher, infolit/misinfo/rhetoric/civic reasoning. Currently researching AI as tool for critical thinking.
PhD student at the University of Washington in social computing + human-AI interaction @socialfutureslab.bsky.social. kjfeng.me
Studying people and computers (https://www.nickmvincent.com/)
Blogging about data and steering AI (https://dataleverage.substack.com/)
associate prof at UMD CS researching NLP & LLMs
PhD student at UW iSchool | ai fairness, evaluation, and decision-making | she/her
kyrawilson.github.io/me
PhD Student @nyudatascience.bsky.social, working with He He on NLP and Human-AI Collaboration.
Also hanging out @ai2.bsky.social
Website - https://vishakhpk.github.io/
Anti-cynic. Towards a weirder future. Reinforcement Learning, Autonomous Vehicles, transportation systems, the works. Asst. Prof at NYU
https://emerge-lab.github.io
https://www.admonymous.co/eugenevinitsky
Stanford Professor of Linguistics and, by courtesy, of Computer Science, and member of @stanfordnlp.bsky.social and The Stanford AI Lab. He/Him/His. https://web.stanford.edu/~cgpotts/
Independent AI researcher, creator of datasette.io and llm.datasette.io, building open source tools for data journalism, writing about a lot of stuff at https://simonwillison.net/
UW-Seattle PhD candidate, studying online attention dynamics & creators/influencers @ HCDE/CIP, advised by @katestarbird.bsky.social . SETS Fellow @ Cornell Tech, working w/ @informor.bsky.social . UAW 4121 Steward. He/Him
Postdoc at UW and Doctor of NLP/Vision+Language from UCSB
Evals, metrics, multilinguality, multiculturality, multimodality, and (dabbling in) reasoning
100% Product of public schools
https://saxon.me/
PhD student @ Penn
alonj.github.io
Current Ph.D. student @ CMU LTI
Personal Website: https://colinzhaoust.github.io/
Ex: Student Researcher @ Google DeepMind, Intern @ Tencent Bellevue, Stanford NLP, HKUST