Nishant Balepur

@nbalepur.bsky.social

CS PhD Student. Trying to find that dog in me at UMD. Babysitting (aligning) + Bullying (evaluating) LLMs nbalepur.github.io

141 Followers  |  190 Following  |  19 Posts  |  Joined: 25.11.2024

Posts by Nishant Balepur (@nbalepur.bsky.social)

😂

25.09.2025 12:25 — 👍 0    🔁 0    💬 0    📌 0

🎉🎉 Excited to have two papers accepted to #ACL2025!

Our first paper designs a preference training method to boost LLM personalization 🎨
While the second outlines our position on why MCQA evals are terrible and how to make them better 🙏

Grateful for amazing collaborators!

21.05.2025 18:34 — 👍 6    🔁 0    💬 0    📌 0
Information-Guided Identification of Training Data Imprint in (Proprietary) Large Language Models High-quality training data has proven crucial for developing performant large language models (LLMs). However, commercial LLM providers disclose few, if any, details about the data used for training. ...

Want to know what training data has been memorized by models like GPT-4?

We propose information-guided probes, a method to uncover memorization evidence in *completely black-box* models,

without requiring access to
πŸ™…β€β™€οΈ Model weights
πŸ™…β€β™€οΈ Training data
πŸ™…β€β™€οΈ Token probabilities 🧡 (1/5)

21.03.2025 19:08 — 👍 97    🔁 27    💬 4    📌 8
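[The thread above doesn't spell out the probe mechanics, but a cloze-style reading gives the flavor: blank out the most informative word in a candidate passage and check whether the black-box model reproduces it verbatim. The sketch below is my own illustration under that assumption, not necessarily the paper's method; `query_model` is a hypothetical stand-in for whatever chat API is being probed.]

```python
# Illustrative cloze-style memorization probe (an assumption, not the
# paper's actual method). Word information is estimated with a simple
# unigram model; `query_model` is a hypothetical black-box chat call.
import math
import re
from collections import Counter
from typing import Callable

def self_information(word: str, counts: Counter, total: int) -> float:
    """-log2 p(word) under an add-one-smoothed unigram model."""
    return -math.log2((counts[word] + 1) / (total + len(counts) + 1))

def build_probe(passage: str, counts: Counter, total: int) -> tuple[str, str]:
    """Blank out the highest-information word; return (prompt, held-out word)."""
    words = re.findall(r"[A-Za-z']+", passage)
    target = max(words, key=lambda w: self_information(w.lower(), counts, total))
    prompt = passage.replace(target, "____", 1) + "\nFill in the blank with the exact original word."
    return prompt, target

def shows_memorization(passage: str, counts: Counter, total: int,
                       query_model: Callable[[str], str]) -> bool:
    prompt, target = build_probe(passage, counts, total)
    # Verbatim reconstruction of a rare word counts as memorization evidence;
    # no weights, training data, or token probabilities are needed.
    return query_model(prompt).strip().lower() == target.lower()
```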
Graph showing that simple text completion models more accurately imitate the unrhymed form of C20 verse, whereas instruction-tuned models lapse into rhyme more often. 

Caption to graph: Given the first 5 lines of 10-20 line poems from poets born in each century, 1600-2000, LLMs are prompted to "complete" the poem. Rhyme is measured by exact phoneme match in the rime of the final syllable (or syllables, if final syllable unstressed). Poems randomly sampled from Chadwyck-Healey poetry collections, with 600 poems for each model for each century. Results shown for actual poems as well as the LLM imitations. Poems "memorized" by the model are excluded.

Finally may have figured out why LLMs rhyme so compulsively: instruction-tuning. Training an LLM to respond "helpfully" to user queries may push models into more "pleasing" aesthetic forms.

21.03.2025 09:57 — 👍 29    🔁 8    💬 3    📌 3
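[The caption's rhyme metric is easy to approximate with the CMU Pronouncing Dictionary. A minimal sketch, assuming the `pronouncing` package: two lines rhyme when the phonemes from the last stressed vowel onward match exactly, which extends over multiple syllables when the final syllable is unstressed, as in the caption.]

```python
# A minimal sketch of the caption's rhyme metric, assuming the CMU
# Pronouncing Dictionary via the `pronouncing` package.
import re
import pronouncing

def rimes(word: str) -> set[str]:
    """Candidate rimes: phonemes from the last stressed vowel to the word's end."""
    out = set()
    for phones in pronouncing.phones_for_word(word.lower()):
        # rhyming_part() backs up to the last stressed vowel, so an unstressed
        # final syllable pulls the preceding syllable into the rime too.
        out.add(re.sub(r"\d", "", pronouncing.rhyming_part(phones)))  # drop stress digits
    return out

def lines_rhyme(line_a: str, line_b: str) -> bool:
    """Exact phoneme match between the rimes of the two lines' final words."""
    final = lambda line: re.findall(r"[A-Za-z']+", line)[-1]
    return bool(rimes(final(line_a)) & rimes(final(line_b)))

print(lines_rhyme("The day is done,", "beneath the setting sun."))      # True
print(lines_rhyme("The day is done,", "free verse resists the urge."))  # False
```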

Had a great time presenting my research on building more helpful QA systems @imperialcollegeldn.bsky.social! Thank you @joestacey.bsky.social for letting me invite myself 🫶

And loved visiting London + Edinburgh this week, hope to be back soon! 🙏

21.03.2025 12:07 — 👍 6    🔁 1    💬 0    📌 1

🚨 Our team at UMD is looking for participants to study how #LLM agent plans can help you answer complex questions

💰 $1 per question
🏆 Top-3 fastest + most accurate win $50
⏳ Questions take ~3 min => $20/hr+

Click here to sign up (please join, reposts appreciated 🙏): preferences.umiacs.umd.edu

11.03.2025 14:30 — 👍 2    🔁 3    💬 0    📌 0

🚨 New Position Paper 🚨

Multiple choice evals for LLMs are simple and popular, but we know they are awful 😬

We complain they're full of errors, saturated, and test nothing meaningful, so why do we still use them? 🫠

Here's why MCQA evals are broken, and how to fix them 🧡

24.02.2025 21:03 — 👍 46    🔁 13    💬 2    📌 0

if it is truly helpful, honest, and harmless, yes 🙏

26.02.2025 01:12 — 👍 1    🔁 0    💬 0    📌 0

The alignment is a system prompt saying "if the user asks X, do Y" 😝

26.02.2025 01:04 — 👍 0    🔁 0    💬 1    📌 0

⚠️ Current methods for generating instruction-following data fall short for long-range reasoning tasks like narrative claim verification.

We present CLIPPER ✂️, a compression-based pipeline that produces grounded instructions for ~$0.50 each, 34x cheaper than human annotations.

21.02.2025 16:25 — 👍 21    🔁 8    💬 1    📌 2

And huge thanks to my friends and labmates who let me bother them to find the right people, review the paper, and have useful discussions 🙏
@saxon.me @lasha.bsky.social @yysung.bsky.social @maharshigor.bsky.social @matthewshu.com @houyu0930.bsky.social

(and many more I'm forgetting, sorry!)

24.02.2025 21:03 — 👍 3    🔁 0    💬 0    📌 0

This was a really fun paper to put together with Rachel and @boydgraber.bsky.social, allowing me to vent many of my frustrations from working with MCQA over the past year 😪🫡

Please check out the paper, we would love to hear your feedback! 📄👇

24.02.2025 21:03 — 👍 0    🔁 1    💬 1    📌 0

In short, here's how to build better evals:
✅ Check if MCQA is the right format for what you want to test
✅ Use design choices to limit leakage/errors/shortcuts
✅ Keep questions easy for humans, hard for models

If we don't put in this effort, what is MCQA even testing? 🤷‍♂️

24.02.2025 21:03 — 👍 1    🔁 0    💬 1    📌 0

Lastly, we discuss persistent flaws of LLMs when running MCQA:
🔩 Robustness Issues
🌎 Biases
💬 Unfaithful Explanations

Many of our previous solutions to MCQA's format/datasets can better address or evaluate these issues 😁

24.02.2025 21:03 — 👍 0    🔁 0    💬 1    📌 0

Two of the most pressing and promising dataset improvements include:
📋 Writing MCQs using educators' rubrics to improve question quality
🧑‍🎓 Designing MCQs hard for models but easy for humans (adversarial), rather than creating needlessly impossible/obscure questions

24.02.2025 21:03 — 👍 0    🔁 0    💬 1    📌 0

Next, we show even when MCQA is a good format, our datasets still have issues 🥲

We discuss:
🔓 Dataset Leakage
❓ Unanswerable Questions
⚡️ Shortcuts
📈 Saturation

More good news: once again, educators already have solutions! We also discuss recent work tackling these problems! 💪

24.02.2025 21:03 — 👍 0    🔁 0    💬 1    📌 0

So what's better? ❤️‍🩹

We explore two possible improvements:
1️⃣ Constructed Response (short-form QA)
2️⃣ Explanation MCQA (justifying answers)

Both are grounded in education research, better align with LLM use cases, and test deeper knowledge levels than MCQA ⭐️

24.02.2025 21:03 — 👍 0    🔁 0    💬 1    📌 0
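[To make the two formats above concrete, here is a hypothetical example of how the same test item might be posed under each; these are illustrative templates, not the paper's actual evaluation prompts.]

```python
# Hypothetical prompts contrasting the two formats named above.
QUESTION = "Why does ice float on water?"
CHOICES = ["A) Ice is denser than water", "B) Ice is less dense than water",
           "C) Water is warmer", "D) Ice repels water"]

# 1) Constructed response: the model must generate the answer itself,
#    so option-elimination shortcuts disappear.
constructed_response = f"Answer in one or two sentences: {QUESTION}"

# 2) Explanation MCQA: the model still picks a choice, but must also
#    justify it, exposing shallow or unfaithful reasoning.
explanation_mcqa = (
    QUESTION + "\n" + "\n".join(CHOICES)
    + "\nSelect one option and explain why it is correct."
)
```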

First, we show MCQA is flawed as a standardized LLM eval format because it often fails to:
🔒 Test subjectivity and generation
👥 Align with real LLM use cases
🧠 Assess knowledge (based on education research)

When's the last time you asked ChatGPT to answer an MCQ? 🤔

24.02.2025 21:03 — 👍 1    🔁 0    💬 1    📌 0

We break our position into three points:
1️⃣ Flaws in MCQA's format
2️⃣ Issues in datasets
3️⃣ Weaknesses in how LLMs run MCQA

The good news? Best practices from education, designed for effective student testing, can help fix these 🧑‍🏫

Yet, we rarely use these insights in LLM evaluation 🤦

24.02.2025 21:03 — 👍 0    🔁 0    💬 1    📌 0

Namely, @boydgraber.bsky.social @lasha.bsky.social, Rachel, Feng, and folks from Adobe Research 🫡

31.01.2025 14:31 — 👍 0    🔁 0    💬 0    📌 0

Excited to share 2 papers at #NAACL2025 main!

📄✍️ MoDS: Multi-Doc Summarization for Debatable Queries (Adobe intern work, coming soon!)
🤔❓ Reverse QA: LLMs struggle with the simple task of giving questions for answers

Grateful for all my collaborators 😁

31.01.2025 14:31 — 👍 5    🔁 1    💬 1    📌 0

People often claim they know when ChatGPT wrote something, but are they as accurate as they think?

Turns out that while the general population is unreliable, those who frequently use ChatGPT for writing tasks can spot even "humanized" AI-generated text with near-perfect accuracy 🎯

28.01.2025 14:55 — 👍 189    🔁 66    💬 10    📌 19

Manifesting some good luck for my experiment running tonight 🤞

Best of luck to anyone submitting tmrw :)

15.12.2024 05:03 — 👍 3    🔁 0    💬 0    📌 0

Exciting research on an AI-driven mnemonic generator for easier vocabulary memorization by @nbalepur.bsky.social, Jordan Boyd-Graber, Rachel Rudinger, & @alexanderhoyle.bsky.social. Part of 21 CLIP projects at #EMNLP2024. 👉 Read more: go.umd.edu/1u48 #AI

03.12.2024 15:46 — 👍 3    🔁 1    💬 0    📌 0

OLMo 2 is out 🥳 7B and 13B trained on 5T tokens, and meticulously instruction-tuned using the Tulu 3 recipe.

Simply the best fully open models yet.

Really proud of the work & the amazing team at
@ai2.bsky.social

26.11.2024 21:12 — 👍 260    🔁 44    💬 9    📌 2