
Yekyung Kim

@yekyung.bsky.social

PhD student @ UMass NLP

39 Followers  |  108 Following  |  8 Posts  |  Joined: 07.11.2024

Latest posts by yekyung.bsky.social on Bluesky

📄 Paper: arxiv.org/pdf/2503.01996
💻 Code & Data: github.com/mungg/OneRuler
Thanks to my amazing coauthors @jennarussell.bsky.social, @markar.bsky.social and @miyyer.bsky.social

05.03.2025 18:10 — 👍 4    🔁 0    💬 0    📌 0

Reasoning models "overthink" simple tasks! 🤯

o3-mini-high and DeepSeek-R1 overthink on a word frequency task! Incorrect answers often had longer reasoning chains than correct ones.

More reasoning ≠ better accuracy!
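A minimal sketch of the kind of analysis behind this claim, assuming per-answer reasoning-token counts are available; the record format and numbers below are illustrative, not the paper's data:

```python
# Illustrative check: does a longer reasoning chain correlate with being wrong
# on a simple word-frequency task? All records below are made up.
from collections import Counter
from statistics import mean

def word_frequency(text: str, word: str) -> int:
    # Ground truth for the task: how often `word` occurs in `text`.
    return Counter(text.lower().split())[word.lower()]

# (answer_correct, reasoning_tokens) pairs, e.g. collected from an API that
# reports reasoning-token usage per request.
records = [(True, 950), (True, 1_200), (False, 4_800),
           (False, 6_100), (True, 1_400), (False, 5_300)]

print("avg reasoning tokens when correct:  ",
      round(mean(t for ok, t in records if ok)))
print("avg reasoning tokens when incorrect:",
      round(mean(t for ok, t in records if not ok)))
```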

05.03.2025 17:06 — 👍 1    🔁 0    💬 2    📌 0

Instruction language shifts accuracy by up to 20%! 🗝️

📉 🇺🇸 context + 🇰🇷 instruction: accuracy drops 91% → 71%
📈 🇰🇷 context + 🇺🇸 instruction: accuracy rises 67% → 75%

Instruction language matters more than expected for multilingual LLMs!
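A sketch of how this context/instruction language cross can be set up; the prompt templates and contexts below are illustrative, not the benchmark's actual ones:

```python
# Illustrative crossing of context language with instruction language.
INSTRUCTIONS = {
    "en": "A special magic number is hidden in the text below. What is it?",
    "ko": "아래 텍스트에 특별한 숫자가 숨겨져 있습니다. 그 숫자는 무엇입니까?",
}

CONTEXTS = {  # stand-ins for long documents with a needle inserted
    "en": "... The special magic number is 7412. ...",
    "ko": "... 특별한 숫자는 7412입니다. ...",
}

def build_prompt(ctx_lang: str, instr_lang: str) -> str:
    return f"{INSTRUCTIONS[instr_lang]}\n\n{CONTEXTS[ctx_lang]}"

# Four conditions: matched and mismatched context/instruction pairs.
for ctx_lang in ("en", "ko"):
    for instr_lang in ("en", "ko"):
        print(f"--- ctx={ctx_lang}, instr={instr_lang} ---")
        print(build_prompt(ctx_lang, instr_lang))
```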

05.03.2025 17:06 — 👍 2    🔁 0    💬 1    📌 0

πŸ† Gemini 1.5 Flash shines in Sesotho & Swahili, but struggles on non-Latin scripts like ZH, KO and HI.
🚨 o3-mini-high underperforms on English at long contexts.
πŸ“Š Qwen2.5 > LLaMA 3.3 across all context lengths.
🚩 Non-Latin & non-Cyrillic scripts remain a challenge.

05.03.2025 17:06 — 👍 1    🔁 0    💬 1    📌 0

The "nonexistent needle" problem πŸͺ‘

We added the option to answer "none" if the needle wasn’t in the context. 🚨 o3-mini-high especially struggled, accuracy dropped 32% at 128K! It frequently answers "none" even when the needle was there.
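A minimal sketch of scoring under this "none" option, assuming model answers are already collected; the needle format and matching rule are illustrative:

```python
# Illustrative scorer for the "nonexistent needle" setting: an example either
# contains a needle value or deliberately has none (needle=None).
def score(needle: str | None, model_answer: str) -> bool:
    answer = model_answer.strip().lower()
    if needle is None:
        return answer == "none"  # correct only if the model abstains
    return needle in answer      # otherwise the needle must be recovered

print(score("7412", "The magic number is 7412."))  # True
print(score(None, "none"))                         # True
print(score("7412", "none"))                       # False: a false "none"
```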

05.03.2025 17:06 — 👍 2    🔁 0    💬 1    📌 0

Performance gaps grow with context length! ⏳

At 8K tokens, the gap between high- and low-resource languages is 11%.
At 128K tokens, the gap triples to 34%! 📉

LLMs struggle to generalize long-context skills across diverse languages.

05.03.2025 17:06 — 👍 2    🔁 0    💬 1    📌 0

English ranks only 6th! 🤯

🇵🇱 Polish takes the top spot, while 🇨🇳 Chinese ranks 4th from the bottom, despite forming a large proportion of pretraining data.

Slavic, Romance & Germanic languages dominate, suggesting long-context strength isn't just about training data size!

05.03.2025 17:06 — 👍 2    🔁 0    💬 1    📌 0

Is the needle-in-a-haystack test still meaningful given the giant green heatmaps in modern LLM papers?

We create ONERULER 💍, a multilingual long-context benchmark that allows for nonexistent needles. Turns out NIAH isn't so easy after all!

Our analysis across 26 languages 🧵👇

05.03.2025 17:06 — 👍 14    🔁 5    💬 1    📌 3
Abhilasha Ravichander - Home

✨I am on the faculty job market in the 2024-2025 cycle!✨

My research centers on advancing Responsible AI, specifically enhancing factuality, robustness, and transparency in AI systems.

If you have relevant positions, let me know: lasharavichander.github.io. Please share/RT!

11.11.2024 14:23 — 👍 52    🔁 22    💬 2    📌 1

Long-form text generation with multiple stylistic and semantic constraints remains largely unexplored.

We present Suri 🦙: a dataset of 20K long-form texts & LLM-generated, backtranslated instructions with complex constraints.

📎 arxiv.org/abs/2406.19371
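A rough sketch of the backtranslation idea: given an existing long-form text, an LLM writes the multi-constraint instruction that could have produced it. The prompt wording, model choice, and function name below are assumptions for illustration, not the paper's actual setup:

```python
# Illustrative "instruction backtranslation" using the OpenAI SDK; the prompt
# and model are placeholders, not the ones used to build Suri.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def backtranslate_instruction(document: str) -> str:
    prompt = (
        "Below is a long-form text. Write a detailed instruction, including "
        "stylistic and semantic constraints, that could have produced it.\n\n"
        + document
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```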

11.11.2024 12:41 — 👍 36    🔁 6    💬 9    📌 1
Image showing the prompt token count according to the tokenizer (tiktoken), which is 117,609 tokens, versus what the OpenAI API claims it to be: 125,385 tokens. About 7,000 extra tokens are added, coming from who knows where.

I really wanted to run NEW #nocha benchmark claims on #o1 but it won't behave 😠
- 6K reasoning tokens is often not enough to get an answer, and allowing more means being able to process only short books
- OpenAI adds something to the prompt: ~8K extra tokens → less room for book + reasoning + generation! (sketch below)
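A sketch of the token-count check from the screenshot above, assuming o1 uses the o200k_base encoding; the file name and exact numbers are illustrative:

```python
# Count prompt tokens locally with tiktoken, then compare with what the API
# reports in `usage` -- the screenshot shows a ~7-8K token discrepancy.
import tiktoken
from openai import OpenAI

prompt = open("book_plus_question.txt").read()  # hypothetical long prompt

enc = tiktoken.get_encoding("o200k_base")       # encoding for o-series models
local_count = len(enc.encode(prompt))

client = OpenAI()
resp = client.chat.completions.create(
    model="o1",
    messages=[{"role": "user", "content": prompt}],
)
api_count = resp.usage.prompt_tokens

print(f"tiktoken: {local_count:,}  API: {api_count:,}  "
      f"extra: {api_count - local_count:,}")
```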

11.11.2024 17:11 — 👍 6    🔁 3    💬 1    📌 1

🌊 Heading to #EMNLP2024 tomorrow, presenting PostMark on Tue. morning! 🔗 arxiv.org/abs/2406.14517

Aside from this, I'd love to chat about:
• long-context training
• realistic & hard eval
• synthetic data
• tbh any cool projects people are working on

Also, I'm on the lookout for a summer 2025 internship!

10.11.2024 19:35 — 👍 6    🔁 4    💬 0    📌 0
