
Clara Na

@clarana.bsky.social

PhD student @ CMU LTI. efficiency/data in NLP/ML

2,192 Followers  |  390 Following  |  17 Posts  |  Joined: 10.08.2023

Posts by Clara Na (@clarana.bsky.social)

Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities As model context lengths continue to grow, concerns about whether models effectively use the full context length have persisted. While several carefully designed long-context evaluations have recently...

We’re excited about Oolong as a challenging benchmark for information aggregation! Let us know which models we should benchmark next πŸ‘€

Paper: arxiv.org/abs/2511.02817
Dataset: huggingface.co/oolongbench
Code: github.com/abertsch72/o...
Leaderboard: oolongbench.github.io

07.11.2025 17:07 β€” πŸ‘ 4    πŸ” 3    πŸ’¬ 1    πŸ“Œ 0
Performance of a sweep of models on Oolong-synth and Oolong-real. Performance decreases with increasing context length, sometimes steeply.


Can LLMs accurately aggregate information over long, information-dense texts? Not yet…

We introduce Oolong, a dataset of simple-to-verify information aggregation questions over long inputs. No model achieves >50% accuracy at 128K on Oolong!

07.11.2025 17:07 β€” πŸ‘ 50    πŸ” 20    πŸ’¬ 3    πŸ“Œ 3
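The kind of evaluation described above (simple-to-verify answers, with accuracy tracked as context length grows) can be sketched as an exact-match scorer bucketed by context length. This is an illustrative sketch only, not the Oolong harness; the field names (`context_len`, `answer`, `prediction`) are assumptions.

```python
# Hypothetical sketch: exact-match accuracy on aggregation questions,
# grouped by input context length. Field names are assumptions.
from collections import defaultdict

def accuracy_by_context_length(examples):
    """Exact-match accuracy per context-length bucket."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        bucket = ex["context_len"]
        total[bucket] += 1
        if ex["prediction"].strip() == ex["answer"].strip():
            correct[bucket] += 1
    return {b: correct[b] / total[b] for b in total}

examples = [
    {"context_len": "8K",   "answer": "42", "prediction": "42"},
    {"context_len": "128K", "answer": "17", "prediction": "12"},
]
print(accuracy_by_context_length(examples))  # {'8K': 1.0, '128K': 0.0}
```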

Yes! tbh this method is probably much more immediately useful for helping one understand subtle differences between [models trained on] subtly different data subsets, vs a loftier goal of helping one find "the" best data mixture -- to anyone considering this method, please feel free to reach out :)

06.05.2025 04:16 β€” πŸ‘ 2    πŸ” 1    πŸ’¬ 0    πŸ“Œ 0

I almost never use these so I always thought that they were cute little things that let seatmates watch the same movie

06.05.2025 04:06 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Congrats Lucy!!

05.05.2025 20:59 β€” πŸ‘ 5    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Come through! #492 in Hall 2, 10am-12:30pm!

26.04.2025 01:59 β€” πŸ‘ 6    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Our paper documenting the environmental impacts of creating OLMo language models is the most honest and comprehensive characterization I know of, including training, development (!) and inference costs. If you're at ICLR chat with @jacobcares.bsky.social & @clarana.bsky.social Sat morning 10-12:30!

25.04.2025 13:14 β€” πŸ‘ 21    πŸ” 3    πŸ’¬ 0    πŸ“Œ 0
Holistically Evaluating the Environmental Impact of Creating Language Models As the performance of artificial intelligence systems has dramatically increased, so too has the environmental impact of creating these systems. While many model developers release estimates of the po...

πŸ“œPaper: arxiv.org/abs/2503.05804
✍️Thanks to my illustrious coauthors @clarana.bsky.social @jaredfern.bsky.social timdettmers.com @strubell.bsky.social @jessedodge.bsky.social, 'twas a fun project 🌏

23.04.2025 15:21 β€” πŸ‘ 9    πŸ” 4    πŸ’¬ 0    πŸ“Œ 3

I'm in Singapore for @iclr-conf.bsky.social ! Come check out our spotlight paper on the environmental impact of training OLMo (link in next tweet) during the Saturday morning poster session from 10-12:30 -- happy to chat about this or anything else! DMs should be open, email works too

23.04.2025 15:21 β€” πŸ‘ 10    πŸ” 5    πŸ’¬ 1    πŸ“Œ 1
Nomination Tool: Project URL Nomination

We've received multiple notes that NOAA research services (Office of Oceanic and Atmospheric Research) may go offline at midnight. @safeguardingdata.bsky.social is working on web archiving, but if others want to nominate on this, that might be good: digital2.library.unt.edu/nomination/G...

03.04.2025 21:36 β€” πŸ‘ 46    πŸ” 22    πŸ’¬ 1    πŸ“Œ 1
Image of the first page of the CHI 2025 paper titled "A Taxonomy of Linguistic Expressions That Contribute To Anthropomorphism of Language Technologies" by authors Alicia DeVrio, Myra Cheng, Lisa Egede, Alexandra Olteanu, & Su Lin Blodgett


How can we better think and talk about human-like qualities attributed to language technologies like LLMs? In our #CHI2025 paper, we taxonomize how text outputs from cases of user interactions with language technologies can contribute to anthropomorphism. arxiv.org/abs/2502.09870 1/n

06.03.2025 03:43 β€” πŸ‘ 43    πŸ” 11    πŸ’¬ 2    πŸ“Œ 3
Figure showing that interpretations of gestures vary dramatically across regions and cultures. "Crossing your fingers," commonly used in the US to wish for good luck, can be deeply offensive to female audiences in parts of Vietnam. Similarly, the "fig gesture," a playful "got your nose" game with children in the US, carries strong sexual connotations in Japan and can be highly offensive.


Did you know? Gestures used to express universal conceptsβ€”like wishing for luckβ€”vary DRAMATICALLY across cultures?
🀞means luck in US but deeply offensive in Vietnam 🚨

πŸ“£ We introduce MC-SIGNS, a test bed to evaluate how LLMs/VLMs/T2I handle such nonverbal behavior!

πŸ“œ: arxiv.org/abs/2502.17710

26.02.2025 16:22 β€” πŸ‘ 33    πŸ” 7    πŸ’¬ 1    πŸ“Œ 3
NeurIPS Tutorial Opening the Language Model Pipeline: A Tutorial on Data Preparation, Model Training, and AdaptationNeurIPS 2024

the science of LMs should be fully open✨

today @akshitab.bsky.social @natolambert.bsky.social and I are giving our #neurips2024 tutorial on language model development.

everything from data, training, adaptation. published or not, no secrets 🫑

tues, 12/10, 9:30am PT β˜•οΈ

neurips.cc/virtual/2024...

10.12.2024 15:31 β€” πŸ‘ 147    πŸ” 17    πŸ’¬ 5    πŸ“Œ 3

How open is "open" AI, really?
It isn't just about making models reusable. If the origin of data is opaque, if labor is hidden & exploited, if frameworks are dominated by Big Tech, if computational power is concentrated in an oligopoly, "open" is just a label.

Meredith Whittaker & friends in Nature.

03.12.2024 17:49 β€” πŸ‘ 53    πŸ” 15    πŸ’¬ 0    πŸ“Œ 0

I noticed a lot of starter packs skewed towards faculty/industry, so I made one of just NLP & ML students: go.bsky.app/vju2ux

Students do different research, go on the job market, and recruit other students. Ping me and I'll add you!

23.11.2024 19:54 β€” πŸ‘ 176    πŸ” 54    πŸ’¬ 101    πŸ“Œ 4
Screenshot of the paper title "What Goes Into a LM Acceptability Judgment? Rethinking the Impact of Frequency and Length"


πŸ’¬ Have you or a loved one compared LM probabilities to human linguistic acceptability judgments? You may be overcompensating for the effect of frequency and length!
🌟 In our new paper, we rethink how we should be controlling for these factors 🧡:

20.11.2024 18:07 β€” πŸ‘ 84    πŸ” 19    πŸ’¬ 1    πŸ“Œ 4

@jaredfern.bsky.social is at 162

14.11.2024 15:30 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Hi I am at 232 in the back of the riverfront room!

14.11.2024 15:28 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

I'm at EMNLP! Presenting the poster for this paper on Thursday morning (10:30-12), Session F Riverfront Hall, come say hi :)

13.11.2024 15:08 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

(Hehe first bsky post!) I'll be at #EMNLP2024 πŸ’ƒπŸŒ΄! Happy to chat about (among other things):
✨linguistically+cognitively motivated evaluation
✨NLP for low-resource+endangered languages
✨figuring out what features of language data LMs are *actually* learning
I'll be presenting two posters 🧡:

08.11.2024 18:39 β€” πŸ‘ 29    πŸ” 6    πŸ’¬ 1    πŸ“Œ 0

scrolling,,, minimal doom ?!

09.11.2024 00:58 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
Understanding "Democratization" in NLP and ML Research Arjun Subramonian, Vagrant Gautam, Dietrich Klakow, Zeerak Talat. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024.

Understanding "Democratization" in NLP and ML Research - joint work that @arjunsubgraph.bsky.social and I co-led with Dietrich Klakow and @zeerak.bsky.social
aclanthology.org/2024.emnlp-m...

08.11.2024 23:23 β€” πŸ‘ 12    πŸ” 5    πŸ’¬ 5    πŸ“Œ 1

hi ! :)

08.11.2024 01:56 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

A starter pack for #NLP #NLProc researchers! πŸŽ‰

go.bsky.app/SngwGeS

04.11.2024 10:01 β€” πŸ‘ 251    πŸ” 99    πŸ’¬ 45    πŸ“Œ 13

I'll be presenting our paper at #EMNLP2024 next week -- see y'all in Miami🌴! This was my Summer 2023 work @ai2.bsky.social Grateful to my wonderful collaborators @ianmagnusson.bsky.social @ananyahjha93.bsky.social @tomsherborne.bsky.social & mentors @strubell.bsky.social, Jesse, and Pradeep (6/n)

05.11.2024 22:43 β€” πŸ‘ 6    πŸ” 1    πŸ’¬ 0    πŸ“Œ 0
Scalable Data Ablations - a claran Collection Datasets and models for EMNLP paper "Scalable Data Ablation Approximations for Language Models through Modular Training and Merging"

Check out the paper for details and our specific recommendations!
πŸ€—Data and models: huggingface.co/collections/...
πŸ‘©β€πŸ’»Repo: github.com/clarana/ez-d...
πŸ“„Paper again: arxiv.org/abs/2410.15661
(5/n)

05.11.2024 22:41 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

We can even predict larger model perplexity scores w/ smaller model proxy evals, AND the relationship holds even when the actual ppl scores are high (4/n)

05.11.2024 22:39 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

What does this mean? We can simulate *comprehensive and fine-grained* data ablations on language corpora, at scale! Required training compute scales only linearly wrt *new* training data, i.e. work for previously seen train data is "cached" and reusable in subsequent evals (3/n)

05.11.2024 22:39 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
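The "merged" proxy evals this thread relies on rest on parameter averaging of models trained on individual data partitions. A minimal sketch, assuming the component models share an architecture; weights are plain numpy arrays here rather than full LM checkpoints:

```python
# Minimal sketch of parameter averaging ("merging") component models:
# take the uniform mean of each parameter across models trained on
# separate data partitions.
import numpy as np

def merge_models(state_dicts):
    """Uniform parameter average of models with identical architectures."""
    keys = state_dicts[0].keys()
    return {k: np.mean([sd[k] for sd in state_dicts], axis=0) for k in keys}

m1 = {"w": np.array([1.0, 2.0])}  # model trained on partition 1
m2 = {"w": np.array([3.0, 4.0])}  # model trained on partition 2
merged = merge_models([m1, m2])
print(merged["w"])  # [2. 3.]
```

Because each component is trained once and reused in every merge, adding a new partition only costs training on the new data, which is the "caching" the post describes.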
Figure plots "Seq"uentially trained models' perplexity respective held-out evals against various proxy perplexity evals. Training data was random combinations of 1 academic field of study from S2ORC + 1 M2D2 Wiki topic. In this case the strongest linear correlation with the actual scores comes from the "micro-Merged" model evals i.e. parameter average of all base model components


We show that there is a reliable *linear correlation* between perplexity evaluation scores for a model trained on a data mixture, and proxy scores from models trained on partitions of the mixture -- f(🟦🟩πŸŸͺ) vs. f(🟦) f(🟩) f(πŸŸͺ)

❗️This also works on arbitrary eval data (2/n)

05.11.2024 22:38 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
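A linear correlation like the one described here can be exploited with an ordinary least-squares fit: compute proxy scores from partition-trained models, fit a line against actual mixture-model scores, then predict for new mixtures without training them. A sketch with made-up numbers purely for illustration:

```python
# Sketch: fit a linear map from proxy perplexities f(partition models)
# to actual perplexities f(mixture-trained model). All values below are
# fabricated for illustration, not results from the paper.
import numpy as np

proxy = np.array([12.0, 15.0, 18.0, 21.0])   # proxy scores per mixture
actual = np.array([10.1, 12.9, 16.2, 18.8])  # actual mixture-model scores

slope, intercept = np.polyfit(proxy, actual, 1)  # least-squares line
pred = slope * 24.0 + intercept  # predicted score for a new proxy value
```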

Building/customizing your own LLM? You'll want to curate training data for it, but how do you know what makes the data good?
You can try out recipesπŸ‘©β€πŸ³ iterate on ✨vibes✨ but we can't actually test all possible combos of tweaks,,, right?? πŸ™…β€β™‚οΈWRONG! arxiv.org/abs/2410.15661 (1/n) 🧡

05.11.2024 22:37 β€” πŸ‘ 49    πŸ” 8    πŸ’¬ 1    πŸ“Œ 3