
Marzena Karpinska

@markar.bsky.social

#nlp researcher interested in evaluation, including multilingual models, long-form input/output, and processing/generation of creative texts. Previously: postdoc @ umass_nlp; PhD from utokyo. https://marzenakrp.github.io/

3,846 Followers  |  945 Following  |  72 Posts  |  Joined: 13.01.2024

Latest posts by markar.bsky.social on Bluesky

Congratulations to all authors of @sfu-cs-ai.bsky.social papers accepted to @iclr-conf.bsky.social 2026 🥳🥳🥳🥳🥳

Please check out our work in 🧵

11.02.2026 00:03 – 👍 1    🔁 1    💬 1    📌 0
Screenshot showing the title and abstract of a talk by Peter West. The text says:
"Title: Mapping the (Jagged) Landscape of LLM Capabilities

Abstract: One key missing piece for the broad adoption of LLMs is intuition, specifically, human intuition about when and how models will succeed or fail across the diverse tasks we might apply them to. Your LLM might write a well-reasoned essay on 14th-century theology, but does that mean it can accurately answer questions on the same topic? This talk will focus on one aspect of my research, which is the characterization of model capabilities to begin to develop these intuitions. I will discuss recent projects that try to identify where these capabilities break down, with a particular focus on high-information examples, which will necessitate new hypotheses of how exactly artificial intelligence functions.

Speaker info: Peter West is an assistant professor at the University of British Columbia, broadly working on the capabilities and limits of LLMs. For example: the divergence of AI from human intuitions of intelligence, unpredictability and creativity in models, and studying LLMs with a non-interventional natural sciences lens. Peter completed his PhD at the University of Washington, Paul G. Allen School of Computer Science & Engineering. He completed a postdoc at the Stanford Institute for Human-Centered AI. His work has been recognized with best, outstanding, and spotlight papers at NLP and AI conferences."


This week we are excited to host Peter West from @cs.ubc.ca, who will talk about “Mapping the (Jagged) Landscape of LLM Capabilities.”

02.02.2026 17:43 – 👍 5    🔁 2    💬 0    📌 0

🚨 New Study 🚨

@arxiv.bsky.social has recently decided to prohibit any 'position' paper from being submitted to its CS servers.
Why? Because of the "AI slop" and the allegedly higher ratio of LLM-generated content in review papers compared to non-review papers.

29.01.2026 14:00 – 👍 29    🔁 9    💬 2    📌 2

I agree, though I'm afraid we have had conferences starting or taking place during other important holidays that are not US-centric (yet in countries from which many attendees come).

20.01.2026 23:18 – 👍 0    🔁 0    💬 1    📌 0
Apply Online

If you are interested in working with me apply here by Jan 19th (a bit last minute): sfu.ca/gradstudies/...
Feel free to reach out with any questions!

16.01.2026 21:18 – 👍 2    🔁 0    💬 0    📌 0

Now is probably a good time to share that I left my job at
@microsoft.com (will forever miss this team) and moved to Vancouver, Canada, where I'm starting my lab as an assistant professor at the gorgeous @sfu.ca 🏔️
I'm looking to hire 1-2 students starting in Fall 2026. Details in 🧵

16.01.2026 21:18 – 👍 12    🔁 1    💬 1    📌 1

Such a terrible idea to replace a bad metric with a potentially worse one. It's based on the assumptions that every field treats author order the same way + that the same order means the same thing. Metrification made worse :( + it discourages collaborations.

26.10.2025 05:04 – 👍 4    🔁 0    💬 0    📌 0
Post image

🚨New paper on AI & copyright

Authors have sued LLM companies for using books w/o permission for model training.

Courts, however, need empirical evidence of market harm. Our preregistered study addresses exactly this gap.

Joint work w/ Jane Ginsburg from Columbia Law and @dhillonp.bsky.social 1/n 🧵

22.10.2025 16:54 – 👍 22    🔁 12    💬 1    📌 1

Well, this is sure to be a blockbuster AI article... @jennarussell.bsky.social et al. are kicking ass and taking names in journalism, both individuals and organizations.

"AI use in American newspapers is widespread, uneven, and rarely disclosed"
arxiv.org/abs/2510.18774

23.10.2025 13:53 – 👍 22    🔁 8    💬 3    📌 0

We didn't; in fact, this research started with our previous project, where we tried to break AI detectors and Pangram was the only one that didn't fail. Since then, we have experimented with it, and it does an extremely good job.
(Some plots in the paper show it well, like Fig. 8 with historical data.)

23.10.2025 16:16 – 👍 2    🔁 0    💬 2    📌 0

AI is infiltrating American newsrooms.

Sadly, it is mostly *undisclosed*, meaning that readers are often unaware that they are consuming LLM text.

Even worse, we find some of these texts making it into the print press (undisclosed).

Can we at least be honest about using models for editing?

22.10.2025 22:32 – 👍 5    🔁 0    💬 0    📌 0

AI is already at work in American newsrooms.

We examine 186k articles published this summer and find that ~9% are either fully or partially AI-generated, usually without readers having any idea.

Here's what we learned about how AI is influencing local and national journalism:

22.10.2025 15:24 – 👍 55    🔁 29    💬 5    📌 2

📒 Announcing the First Workshop on Multilingual and Multicultural Evaluation (MME) at #EACL2026 🇲🇦

MME focuses on resources, metrics & methodologies for evaluating multilingual systems! multilingual-multicultural-evaluation.github.io

📅 Workshop Mar 24–29, 2026
🗓️ Submit by Dec 19, 2025

20.10.2025 10:37 – 👍 34    🔁 15    💬 1    📌 0
Screen cap from linked article, with heading Significance and then text:

Large Language Models (LLMs) are used in evaluative tasks across domains. Yet, what appears as alignment with human or expert judgments may conceal a deeper shift in how “judgment” itself is operationalized. Using news outlets as a controlled benchmark, we compare six LLMs to expert ratings and human evaluations under an identical, structured framework. While models often match expert outputs, our results suggest that they may rely on lexical associations and statistical priors rather than contextual reasoning or normative criteria. We term this divergence epistemia: the illusion of knowledge emerging when surface plausibility replaces verification. Our findings suggest not only performance asymmetries but also a shift in the heuristics underlying evaluative processes, raising fundamental questions about delegating judgment to LLMs.

Sentence starting with "While models often" is highlighted in blue.


I'd love to see someone try to estimate just how much time and money has gone into research that is either fully undermined by reliance on LLMs or fully pointless, because the answer is obvious if you start from an understanding of what LLMs actually are.

www.pnas.org/doi/10.1073/...

18.10.2025 10:11 – 👍 481    🔁 126    💬 15    📌 14

I'm not sure why people have lost the ability to do related work properly, but if you absolutely need to use AI, at least proofread it? (And they most likely edited with AI.)
www.pangram.com/history/01bf...

18.10.2025 16:18 – 👍 5    🔁 0    💬 0    📌 0

The viral "Definition of AGI" paper tells you to read fake references which do not exist!

Proof: different articles appear at the specified journal/volume/page numbers, and the cited titles exist nowhere in any searchable repository.

Take this as a warning to not use LMs to generate your references!

18.10.2025 00:54 – 👍 158    🔁 36    💬 6    📌 17
COLM 2025: 9 cool papers and some thoughts
Reflections on the 2025 COLM conference, and a discussion of 9 cool COLM papers on benchmarking and eval, personas, and improving models for better long-context performance and consistency.

π‘΅π’†π’˜ π’ƒπ’π’π’ˆπ’‘π’π’”π’•! A rundown of some cool papers I got to chat about at #COLM2025 and some scattered thoughts

saxon.me/blog/2025/co...

17.10.2025 05:24 – 👍 22    🔁 7    💬 1    📌 1

Probably for the best, they had serious overflows because of this...

14.10.2025 01:39 – 👍 1    🔁 0    💬 0    📌 0

Wasn't this what ACL did this year?

14.10.2025 01:03 – 👍 0    🔁 0    💬 1    📌 0

Come talk with us today about the evaluation of long-form multilingual generation at the second poster session #COLM2025

📍 4:30–6:30 PM / Room 710 – Poster #8

07.10.2025 17:54 – 👍 6    🔁 2    💬 0    📌 0

Off to #COLM. Fake Fuji looks really good today.
I've only ever seen the real one from below, but today I'm happy to at least get to see the fake one from above.

06.10.2025 15:01 – 👍 6    🔁 0    💬 0    📌 0

I feel like it was worth waking up early

06.10.2025 14:35 – 👍 4    🔁 0    💬 0    📌 0

Wait, how come? I'm flying direct at 7am...

06.10.2025 12:00 – 👍 0    🔁 0    💬 1    📌 0
Humans Perceive Wrong Narratives from AI Reasoning Texts
A new generation of AI models generates step-by-step reasoning text before producing an answer. This text appears to offer a human-readable window into their computation process, and is increasingly r...

When reading AI reasoning text (aka CoT), we (humans) form a narrative about the underlying computation process, which we take as a transparent explanation of model behavior. But what if our narratives are wrong? We measure this and find that they usually are.

Now on arXiv: arxiv.org/abs/2508.16599

27.08.2025 21:30 – 👍 84    🔁 22    💬 4    📌 2
Preliminary Ranking of WMT25 General Machine Translation Systems
We present the preliminary ranking of the WMT25 General Machine Translation Shared Task, in which MT systems have been evaluated using automatic metrics. As this ranking is based on automatic evaluati...

📊 The preliminary ranking of the WMT 2025 General Machine Translation benchmark is here!

But don't draw conclusions just yet – automatic metrics are biased in favor of techniques like using the metric as a reward model or MBR decoding. The official human ranking will be part of the General MT findings at WMT.

arxiv.org/abs/2508.14909

23.08.2025 09:28 – 👍 9    🔁 4    💬 1    📌 0

Happy to see this work accepted to #EMNLP2025! 🎉🎉🎉

20.08.2025 20:49 – 👍 12    🔁 1    💬 0    📌 0

✨We are thrilled to announce that over 3200 papers have been accepted to #EMNLP2025 ✨

This includes over 1800 main conference papers and over 1400 papers in findings!

Congratulations to all authors!! 🎉🎉🎉

20.08.2025 20:47 – 👍 29    🔁 2    💬 0    📌 4

The Echoes in AI paper showed quite the opposite, also with a story-continuation setup.
Additionally, we present evidence that both *syntactic* and *discourse* diversity measures show strong homogenization that the lexical and cosine-similarity measures used in this paper do not capture.

12.08.2025 21:01 – 👍 60    🔁 13    💬 2    📌 2

Definitely!

16.08.2025 17:46 – 👍 1    🔁 0    💬 0    📌 0

At the same time, I wish that whoever sparked this interest in data distribution would also help them with the design...

16.08.2025 03:24 – 👍 1    🔁 0    💬 1    📌 0
