
Vilém Zouhar #EMNLP

@zouharvi.bsky.social

PhD student @ ETH Zürich | all aspects of NLP but mostly evaluation and MT | go vegan | https://vilda.net

3,501 Followers  |  1,413 Following  |  265 Posts  |  Joined: 25.07.2023

Latest posts by zouharvi.bsky.social on Bluesky


The inspiration for the subset2evaluate poster comes from Henri Matisse's The Horse, the Rider and the Clown. 🐎🚴‍♀️🤡

28.10.2025 17:13 — 👍 3    🔁 0    💬 0    📌 0

- How to Select Datapoints for Efficient Human Evaluation of NLG Models? arxiv.org/abs/2501.18251

- Estimating Machine Translation Difficulty arxiv.org/abs/2508.10175

- COMET-poly: Machine Translation Metric Grounded in Other Candidates arxiv.org/abs/2508.18549

28.10.2025 09:45 — 👍 2    🔁 0    💬 1    📌 0

Let's talk about eval (automatic or human) and multilinguality at #EMNLP in Suzhou! 🇨🇳

- Efficient evaluation (Nov 5, 16:30, poster session 3)
- MT difficulty (Nov 7, 12:30, findings 3)
- COMET-poly (Nov 8, 11:00, WMT)

(DM to meet 🌿 )

28.10.2025 09:45 — 👍 11    🔁 1    💬 2    📌 0

...really interesting research problems I was passionate about and in planning my research future.

You should apply to these fellowships, even if it's for the exercise of periodically refining your research statement.

24.10.2025 12:32 — 👍 1    🔁 0    💬 1    📌 0
Preview
Google PhD Fellowships 2025: Yutong Chen, Benedict Schlüter and Vilém Zouhar, all three of them doctoral students at the Department of Computer Science, have been awarded the Google PhD Fellowship. The programme was created to re...

Grateful to receive the Google PhD Fellowship in NLP! 🙂

I am not secretive about having applied to 4 similar fellowships during my PhD before and not succeeding. Still, refining my research statement (part of the application) helped me tremendously in finding out the...

inf.ethz.ch/news-and-eve...

24.10.2025 12:32 — 👍 14    🔁 0    💬 1    📌 0

Congratulations, doctor! 🤓

22.10.2025 16:14 — 👍 1    🔁 0    💬 0    📌 0

Organizers:
@pinzhen.bsky.social, @hanxuhu.bsky.social, @simi97k.bsky.social, Wenhao Zhu, @bazril.bsky.social, Alexandra Birch, @afaji.bsky.social, @ricosennrich.bsky.social, @sarahooker.bsky.social.

20.10.2025 10:37 — 👍 0    🔁 0    💬 0    📌 0

... Further areas:
- Metrics, LLM judges & reward models 🧮
- Standardised multilingual reporting 📊
- AI-assisted evaluation (data, methods, metrics, standards) 🤖
- Position, application- or theory-focused contributions 💬

20.10.2025 10:37 — 👍 0    🔁 0    💬 1    📌 0

... Complex & nuanced evaluation topics:
- Multimodality 🎥
- Fairness ⚖️
- Long I/O 🧠
- Tool use 🧰
- Code-switching 🌍
- Literary & creative tasks ✍️

Also:
- Sociocultural & cognitive variation
- Scalable evaluation of cultural & factual knowledge

20.10.2025 10:37 — 👍 0    🔁 0    💬 1    📌 0

We welcome short & long, archival & non-archival submissions!

Topics include (but are not limited to):
- Evaluation resources beyond English or Western-centric views 🌐
- Annotation methodology & procedures ✏️
- Evaluation protocols: ranking vs. direct, rubric/reference-based, prompt variation, etc. ⚖️

20.10.2025 10:37 — 👍 0    🔁 0    💬 1    📌 0

📢 Announcing the First Workshop on Multilingual and Multicultural Evaluation (MME) at #EACL2026 🇲🇦

MME focuses on resources, metrics & methodologies for evaluating multilingual systems! multilingual-multicultural-evaluation.github.io

📅 Workshop Mar 24–29, 2026
🗓️ Submit by Dec 19, 2025

20.10.2025 10:37 — 👍 32    🔁 14    💬 1    📌 0
Preview
Estimating Machine Translation Difficulty: Machine translation quality has steadily improved over the years, achieving near-perfect translations in recent benchmarks. These high-quality outputs make it difficult to distinguish between state-of...

I'd love to hear your feedback 🌿 arxiv.org/abs/2508.101...

See you in Suzhou! 🇨🇳

16.09.2025 08:50 — 👍 3    🔁 0    💬 0    📌 0

My two biggest take-aways are:
- Standard testsets are too easy (Figure 1).
- We can make testsets that are not easy (Figure 2). 😎

16.09.2025 08:49 — 👍 17    🔁 3    💬 1    📌 0

Participation gained momentum this year: 36 unique teams competed to improve MT performance. On top of that, we collected outputs of 24 popular LLMs and online systems, reaching 50 evaluated systems in our annual benchmark.

23.08.2025 09:28 — 👍 3    🔁 1    💬 1    📌 0

It gets worse the more you look at it. Why is the height of 69.1 the same as the height of 30.8? Why are the labels rotated if there's enough space?

07.08.2025 19:32 — 👍 7    🔁 0    💬 1    📌 0
Shared task: Automated Translation Quality Evaluation Systems

Organizers are happy to help with any questions. 🙂
Website with all details and contacts: www2.statmt.org/wmt25/mteval...

25.07.2025 16:59 — 👍 0    🔁 0    💬 0    📌 0
QE-informed Segment-level Error Correction

📐Task 3: Quality-informed segment-level error correction

Automatically post-edit machine-translated text using quality annotations to generate minimal and accurate corrections.

Description: www2.statmt.org/wmt25/mteval...

Submission platform: www.codabench.org/competitions...

25.07.2025 16:59 — 👍 0    🔁 0    💬 1    📌 0
Fine-grained error span detection

📐Task 2: Span-level error detection

Identify and locate translation errors within each segment (start/end indices) and classify their severity.

Description: www2.statmt.org/wmt25/mteval...

Submission platform: www.codabench.org/competitions...

25.07.2025 16:59 — 👍 0    🔁 0    💬 1    📌 0
MT Evaluation Subtask 1: Segment-Level Quality Score Prediction

📐Task 1: Segment-level quality score prediction

Predict a quality score for each source–target segment pair, using document-level context and either ESA or MQM annotations.

Description: www2.statmt.org/wmt25/mteval...

Submission platform: www.codabench.org/competitions...

25.07.2025 16:59 — 👍 0    🔁 0    💬 1    📌 0

The 2025 MT Evaluation shared task brings together the strengths of the previous Metrics and Quality Estimation tasks under a single, unified evaluation framework.

The following tasks are now open for participants (deadline July 31st but participation has never been easier 🙂 ):

25.07.2025 16:59 — 👍 1    🔁 2    💬 1    📌 0

Faster but an extra dependency. 🤷

18.07.2025 14:59 — 👍 0    🔁 0    💬 0    📌 0

Not possible post-hoc but possible for the other direction! Thanks for your paper. 🙂

15.07.2025 23:24 — 👍 1    🔁 0    💬 0    📌 0

Thank you everyone who helped. 😊

Special thanks to @mrinmaya.bsky.social and Peng Cui from @csateth.bsky.social and all my friends I bugged with proofreading. 😁

15.07.2025 13:03 — 👍 2    🔁 0    💬 1    📌 0
Preview
How to Select Datapoints for Efficient Human Evaluation of NLG Models? Human evaluation is the gold standard for evaluating text generation models. However, it is expensive. In order to fit budgetary constraints, a random subset of the test data is often chosen in practi...

"How to Select Datapoints for Efficient Human Evaluation of NLG Models?" has now been accepted to TACL (a)! 🌿

📃 Paper (with nuances and caveats): arxiv.org/abs/2501.182...
📦 Package: github.com/zouharvi/sub...
Feedback welcome!

15.07.2025 13:03 — 👍 6    🔁 0    💬 1    📌 0

Recommendation based on translation and summarization:
1️⃣ if you have a good automatic metric, use variance/consistency
2️⃣ if not, use model output diversity
3️⃣ if outputs not available, use artificial crowd/distilled predictors
4️⃣ if those are not available, use source diversity
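To make the decision order explicit, here is a tiny hypothetical helper; the names and flags are illustrative only, not part of the subset2evaluate package:

```python
def pick_selection_strategy(has_good_metric: bool,
                            has_model_outputs: bool,
                            has_artificial_crowd: bool) -> str:
    """Encodes the recommendation order above; purely illustrative."""
    if has_model_outputs and has_good_metric:
        return "1) metric variance / consistency"
    if has_model_outputs:
        return "2) model output diversity"
    if has_artificial_crowd:
        return "3) artificial crowd / distilled utility predictor"
    return "4) source-side diversity"
```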

15.07.2025 13:03 — 👍 0    🔁 0    💬 1    📌 0

We frame this as a 0/1 Knapsack problem: find a subset Y ⊆ X with maximum utility while staying under budget B. 🤓

maximize: ∑ zₓ · Utility(x)
subject to: ∑ zₓ · Cost(x) ≤ B
zₓ ∈ {0, 1}

The Utility(x) can be metric average, variance, diversity, etc.
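As a minimal sketch (not the paper's exact method), the selection can be approximated greedily by utility-per-cost ratio; the `utility`, `cost`, and `budget` values are assumed to be precomputed:

```python
# Greedy 0/1 knapsack heuristic: take items by utility-per-cost ratio
# until the annotation budget B is spent. An exact solution would use
# dynamic programming; the ratio rule is a standard approximation.
def select_subset(items, budget):
    """items: list of dicts with 'id', 'utility', 'cost' keys."""
    ranked = sorted(items, key=lambda x: x["utility"] / x["cost"], reverse=True)
    chosen, spent = [], 0.0
    for item in ranked:
        if spent + item["cost"] <= budget:
            chosen.append(item["id"])
            spent += item["cost"]
    return chosen

# Toy example: utility = metric variance, cost = annotation time estimate.
items = [
    {"id": 0, "utility": 0.12, "cost": 35},
    {"id": 1, "utility": 0.40, "cost": 120},
    {"id": 2, "utility": 0.31, "cost": 40},
]
print(select_subset(items, budget=100))  # -> [2, 0]
```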

15.07.2025 13:03 — 👍 0    🔁 0    💬 1    📌 0

This works even if you don't have the model outputs yet.
1️⃣ "artificial crowd" simulate what model outputs would look like; apply the previous methods.
2️⃣ "utility predictors" estimate usefulness from the source text.
3️⃣ "source-based diversity" remove similar inputs.

15.07.2025 13:03 — 👍 0    🔁 0    💬 1    📌 0

So what works? Selecting inputs that expose model differences:
1️⃣ high variance in metric scores
2️⃣ diversity in model outputs
3️⃣ high metric consistency with the rest of the dataset

We now need almost 30% fewer annotated examples to get the same model ranking.
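As an illustration of the first option, a hedged sketch of variance-based selection: given a matrix of automatic metric scores (items × systems), keep the items where systems disagree the most. The array shapes and names are my assumptions, not the subset2evaluate API:

```python
import numpy as np

def top_variance_items(scores: np.ndarray, k: int) -> np.ndarray:
    """scores: shape (n_items, n_systems), e.g. COMET scores.
    Returns indices of the k items whose scores vary most across systems,
    i.e. the items most likely to separate the systems."""
    per_item_var = scores.var(axis=1)
    return np.argsort(-per_item_var)[:k]

# Toy example: 5 items scored for 3 systems.
scores = np.array([
    [0.90, 0.91, 0.89],   # all systems good -> uninformative
    [0.95, 0.60, 0.80],   # large disagreement -> informative
    [0.70, 0.72, 0.71],
    [0.30, 0.85, 0.55],   # large disagreement -> informative
    [0.88, 0.87, 0.86],
])
print(top_variance_items(scores, k=2))  # -> [3 1]
```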

15.07.2025 13:03 — 👍 1    🔁 0    💬 1    📌 0

We frame this as finding the smallest subset of data (Y ⊆ X) that gives the same model ranking as on the full dataset.

Simply picking the hardest examples (lowest average metric score) is a step up but can backfire by selecting the most expensive items to annotate.
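To make "same model ranking" concrete, here is a minimal sketch of the evaluation target, assuming scipy's Kendall's tau as the agreement measure (the paper's exact setup may differ):

```python
import numpy as np
from scipy.stats import kendalltau

def ranking_agreement(scores: np.ndarray, subset_idx) -> float:
    """scores: (n_items, n_systems) quality scores.
    Returns Kendall's tau between the system ranking induced by the
    subset and the ranking induced by the full dataset."""
    full_means = scores.mean(axis=0)
    subset_means = scores[subset_idx].mean(axis=0)
    tau, _ = kendalltau(full_means, subset_means)
    return tau

rng = np.random.default_rng(0)
scores = rng.random((10_000, 8))                        # 10,000 items, 8 systems
random_subset = rng.choice(10_000, size=100, replace=False)
print(ranking_agreement(scores, random_subset))          # random-subset baseline
```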

15.07.2025 13:03 — 👍 0    🔁 0    💬 1    📌 0

You have a budget to human-evaluate 100 inputs to your models, but your dataset has 10,000 inputs. Do not just pick 100 at random! 🙅

We can do better. "How to Select Datapoints for Efficient Human Evaluation of NLG Models?" shows how.🕵️
(random is still a devilishly good baseline)

15.07.2025 13:03 — 👍 33    🔁 3    💬 2    📌 0
