
Marzena Karpinska ✈️ COLM'25

@markar.bsky.social

#nlp researcher interested in evaluation, including: multilingual models, long-form input/output, and the processing/generation of creative texts. Previously: postdoc @ umass_nlp; PhD from UTokyo. https://marzenakrp.github.io/

3,797 Followers  |  933 Following  |  63 Posts  |  Joined: 13.01.2024

Latest posts by markar.bsky.social on Bluesky


Come talk with us today about the evaluation of long-form multilingual generation at the second poster session! #COLM2025

📍4:30–6:30 PM / Room 710 – Poster #8

07.10.2025 17:54 — 👍 5    🔁 2    💬 0    📌 0

Off to #COLM! The fake Fuji looks really good today.
I've only ever seen the real one from below, but today I'm happy to at least see the fake one from above.

06.10.2025 15:01 — 👍 6    🔁 0    💬 0    📌 0

I feel like it was worth waking up early

06.10.2025 14:35 — 👍 4    🔁 0    💬 0    📌 0

Wait, how come? I'm flying direct at 7 AM...

06.10.2025 12:00 — 👍 0    🔁 0    💬 1    📌 0
Humans Perceive Wrong Narratives from AI Reasoning Texts A new generation of AI models generates step-by-step reasoning text before producing an answer. This text appears to offer a human-readable window into their computation process, and is increasingly r...

When reading AI reasoning text (aka CoT), we (humans) form a narrative about the underlying computation process, which we take as a transparent explanation of model behavior. But what if our narratives are wrong? We measure that and find that they usually are.

Now on arXiv: arxiv.org/abs/2508.16599

27.08.2025 21:30 — 👍 85    🔁 22    💬 4    📌 2
Preliminary Ranking of WMT25 General Machine Translation Systems We present the preliminary ranking of the WMT25 General Machine Translation Shared Task, in which MT systems have been evaluated using automatic metrics. As this ranking is based on automatic evaluati...

📊 The preliminary ranking of the WMT 2025 General Machine Translation benchmark is here!

But don't draw conclusions just yet: automatic metrics are biased toward systems using techniques like metric-as-reward-model or MBR decoding. The official human ranking will be part of the General MT findings at WMT.

arxiv.org/abs/2508.14909

23.08.2025 09:28 — 👍 9    🔁 4    💬 1    📌 0

Happy to see this work accepted to #EMNLP2025! 🎉🎉🎉

20.08.2025 20:49 — 👍 12    🔁 1    💬 0    📌 0

✨We are thrilled to announce that over 3200 papers have been accepted to #EMNLP2025 ✨

This includes over 1800 main conference papers and over 1400 papers in findings!

Congratulations to all authors!! 🎉🎉🎉

20.08.2025 20:47 — 👍 29    🔁 2    💬 0    📌 3

The Echoes in AI paper showed quite the opposite, also using a story-continuation setup.
Additionally, we present evidence that both *syntactic* and *discourse* diversity measures show strong homogenization that the lexical and cosine measures used in this paper do not capture.

12.08.2025 21:01 — 👍 60    🔁 13    💬 2    📌 2

Definitely!

16.08.2025 17:46 — 👍 1    🔁 0    💬 0    📌 0

At the same time I wish that whoever sparked this interest in data distribution would also help them with the design...

16.08.2025 03:24 — 👍 1    🔁 0    💬 1    📌 0

Absolutely! Looking forward to seeing QUDsim at COLM!

16.08.2025 03:19 — 👍 1    🔁 0    💬 2    📌 0

The issue is always: which humans, in what circumstances?

15.08.2025 05:33 — 👍 2    🔁 0    💬 1    📌 0

I think there are quite a few undergraduate students on this preprint, and maybe there was a need for a bit more mentoring. The comparison to WritingPrompts is just one of the issues (amateur writers working in very different conditions than normal writing, plus very short outputs).

15.08.2025 05:31 — 👍 3    🔁 0    💬 1    📌 0
NoCha leaderboard

Check out the full leaderboard here: novelchallenge.github.io

We'll be updating the dataset with new books and claims within the next few months!

08.08.2025 02:13 — 👍 1    🔁 0    💬 0    📌 0
Screenshot of benchmark with gpt-5 on top with 68.46% accuracy.


GPT-5 lands first place on NoCha, our long-context book understanding benchmark.

That said, this is a tiny improvement (~1%) over o1-preview, which was released almost one year ago. Have long-context models hit a wall?

Accuracy of human readers is >97%... Long way to go!

08.08.2025 02:13 — 👍 18    🔁 6    💬 1    📌 0

🗓️29 July, 4 PM: Automated main concept generation for narrative discourse assessment in aphasia. w/
@marisahudspeth.bsky.social, Polly Stokes, Jacquie Kurland, and @brenocon.bsky.social

📍Hall 4/5.

Come by to chat about argumentation, narrative texts, policy & law, and beyond! #ACL2025NLP

28.07.2025 10:56 — 👍 6    🔁 3    💬 0    📌 0

Excited to present two papers at #ACL2025!

🗓️30 July, 11 AM: 𝛿-Stance: A Large-Scale Real World Dataset of Stances in Legal Argumentation. w/ Douglas Rice and @brenocon.bsky.social

📍At Hall 4/5. 🧵👇

28.07.2025 10:56 — 👍 6    🔁 3    💬 1    📌 0
Kaiserslautern, Germany


📣 Life update: Thrilled to announce that I’ll be starting as faculty at the Max Planck Institute for Software Systems this Fall!

I’ll be recruiting PhD students in the upcoming cycle, as well as research interns throughout the year: lasharavichander.github.io/contact.html

22.07.2025 04:12 — 👍 89    🔁 12    💬 13    📌 4

Congratulations 👏🎉

23.07.2025 01:11 — 👍 1    🔁 0    💬 1    📌 0

For EMNLP 2025’s special theme of "Advancing our Reach: Interdisciplinary Recontextualization of NLP", we are organizing a panel of experts, and would like input from the community at large as we prepare. Please take a moment to fill in this survey: forms.office.com/r/pWFFA0Gss1

17.07.2025 20:24 — 👍 8    🔁 5    💬 0    📌 0

A new definition for AGI just dropped, and it is a bad one.

12.07.2025 18:04 — 👍 170    🔁 27    💬 8    📌 5

Now accepted to #COLM2025 @colmweb.org
🇨🇦🎉

08.07.2025 19:13 — 👍 4    🔁 0    💬 0    📌 0

I always had to apply for IRB approval in Japan (UTokyo), though the process was much longer than in the US (the committee met only a few times a year, and you were almost guaranteed to be asked to correct something, which extended the process). It could easily take 2-3 months.

07.07.2025 23:00 — 👍 2    🔁 0    💬 1    📌 0
An Interdisciplinary Approach to Human-Centered Machine Translation Machine Translation (MT) tools are widely used today, often in contexts where professional translators are not present. Despite progress in MT technology, a gap persists between system development and...

What should Machine Translation research look like in the age of multilingual LLMs?

Here’s one answer from researchers across NLP/MT, Translation Studies, and HCI.
"An Interdisciplinary Approach to Human-Centered Machine Translation"
arxiv.org/abs/2506.13468

18.06.2025 12:08 — 👍 18    🔁 7    💬 1    📌 0
Literary Evidence Retrieval via Long-Context Language Models How well do modern long-context language models understand literary fiction? We explore this question via the task of literary evidence retrieval, repurposing the RELiC dataset of Thai et al. (2022) t...

Extremely interesting new task that gives a model a literary text, plus a critical essay about it — with one quotation masked. Can the model figure out which quotation from the original work would support these claims? Best-performing models exceed human readers. #MLSky arxiv.org/abs/2506.030...

04.06.2025 15:50 — 👍 50    🔁 7    💬 3    📌 2

Tired of AI slop? Our work on "Frankentexts" shows how LLMs can stitch together random fragments of human writing into coherent, relevant responses to arbitrary prompts.

Frankentexts are weirdly creative, and they also pose problems for AI detectors: are they AI? human? More 👇

03.06.2025 16:16 — 👍 15    🔁 3    💬 0    📌 0

🤔 What if you gave an LLM thousands of random human-written paragraphs and told it to write something new -- while copying 90% of its output from those texts?

🧟 You get what we call a Frankentext!

💡 Frankentexts are surprisingly coherent and tough for AI detectors to flag.

03.06.2025 15:09 — 👍 33    🔁 7    💬 1    📌 1

Interested in crosslingual memorization? Check out our new work :) Congrats to Emir, Alisha, and Minh for putting together their first research paper 🎉

30.05.2025 15:40 — 👍 4    🔁 0    💬 0    📌 0

LLMs memorize novels 📚 in English. But what about existing translations? Or translations into new languages?

Our 🦉OWL dataset (31K/10 languages) shows GPT-4o recognizes books:
92% English
83% official translations
69% unseen translations
75% as audio (EN)

30.05.2025 15:37 — 👍 7    🔁 2    💬 1    📌 3
