The inspiration for the subset2evaluate poster comes from Henri Matisse's The Horse, the Rider and the Clown. 🐎🚴♀️🤡
28.10.2025 17:13 — 👍 3 🔁 0 💬 0 📌 0
- How to Select Datapoints for Efficient Human Evaluation of NLG Models? arxiv.org/abs/2501.18251
- Estimating Machine Translation Difficulty arxiv.org/abs/2508.10175
- COMET-poly: Machine Translation Metric Grounded in Other Candidates arxiv.org/abs/2508.18549
Let's talk about eval (automatic or human) and multilinguality at #EMNLP in Suzhou! 🇨🇳
- Efficient evaluation (Nov 5, 16:30, poster session 3)
- MT difficulty (Nov 7, 12:30, findings 3)
- COMET-poly (Nov 8, 11:00, WMT)
(DM to meet 🌿 )
...really interesting research problems I was passionate about, and in planning my research future.
You should apply to these fellowships, even if only for the exercise of periodically refining your research statement.
Grateful to receive the Google PhD Fellowship in NLP! 🙂
I am not secretive about having applied to 4 similar fellowships earlier in my PhD without success. Still, refining my research statement (part of the application) helped me tremendously in finding out the...
inf.ethz.ch/news-and-eve...
Congratulations, doctor! 🤓
22.10.2025 16:14 — 👍 1 🔁 0 💬 0 📌 0
Organizers:
@pinzhen.bsky.social, @hanxuhu.bsky.social, @simi97k.bsky.social, Wenhao Zhu, @bazril.bsky.social, Alexandra Birch, @afaji.bsky.social, @ricosennrich.bsky.social, @sarahooker.bsky.social.
... Further areas:
- Metrics, LLM judges & reward models 🧮
- Standardised multilingual reporting 📊
- AI-assisted evaluation (data, methods, metrics, standards) 🤖
- Position, application- or theory-focused contributions 💬
... Complex & nuanced evaluation topics:
- Multimodality 🎥
- Fairness ⚖️
- Long I/O 🧠
- Tool use 🧰
- Code-switching 🌍
- Literary & creative tasks ✍️
Also:
- Sociocultural & cognitive variation
- Scalable evaluation of cultural & factual knowledge
We welcome short & long, archival & non-archival submissions!
Topics include (but are not limited to):
- Evaluation resources beyond English or Western-centric views 🌐
- Annotation methodology & procedures ✏️
- Evaluation protocols: ranking vs. direct, rubric/reference-based, prompt variation, etc. ⚖️
📢 Announcing the First Workshop on Multilingual and Multicultural Evaluation (MME) at #EACL2026 🇲🇦
MME focuses on resources, metrics & methodologies for evaluating multilingual systems! multilingual-multicultural-evaluation.github.io
📅 Workshop Mar 24–29, 2026
🗓️ Submit by Dec 19, 2025
I'd love to hear your feedback 🌿 arxiv.org/abs/2508.101...
See you in Suzhou! 🇨🇳
My two biggest take-aways are:
- Standard testsets are too easy (Figure 1).
- We can make testsets that are not easy (Figure 2). 😎
Participation momentum grew again this year, with 36 unique teams competing to improve the performance of MT. We furthermore added collected outputs of 24 popular LLMs and online systems, reaching 50 evaluated systems in our annual benchmark.
23.08.2025 09:28 — 👍 3 🔁 1 💬 1 📌 0
It gets worse the more you look at it. Why is the bar for 69.1 the same height as the bar for 30.8? Why are the labels rotated if there's enough space?
07.08.2025 19:32 — 👍 7 🔁 0 💬 1 📌 0
Organizers are happy to help with any questions. 🙂
Website with all details and contacts: www2.statmt.org/wmt25/mteval...
📐Task 3: Quality-informed segment-level error correction
Automatically post-edit machine-translated text using quality annotations to generate minimal and accurate corrections.
Description: www2.statmt.org/wmt25/mteval...
Submission platform: www.codabench.org/competitions...
📐Task 2: Span-level error detection
Identify and locate translation errors within each segment (start/end indices) and classify their severity.
Description: www2.statmt.org/wmt25/mteval...
Submission platform: www.codabench.org/competitions...
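To make "start/end indices plus severity" concrete, here is a tiny hypothetical data structure for one detected error span. This is only an illustrative sketch: the field names and severity labels are my assumptions, and the official submission format is the one described on the task page.

```python
from dataclasses import dataclass

@dataclass
class ErrorSpan:
    # Character offsets into the translated segment (end exclusive).
    start: int
    end: int
    # e.g. "minor", "major", "critical"; the official label set is on the task page.
    severity: str

# Example: flag "no sense at all maybe" in a hypothetical target segment.
segment = "It makes no sense at all maybe."
span = ErrorSpan(start=9, end=30, severity="minor")
print(segment[span.start:span.end], span.severity)
```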
📐Task 1: Segment-level quality score prediction
Predict a quality score for each source–target segment pair, using document-level context and either ESA or MQM annotations.
Description: www2.statmt.org/wmt25/mteval...
Submission platform: www.codabench.org/competitions...
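For anyone wondering what a minimal Task 1 baseline could look like: a reference-free COMET model scores source–translation pairs directly. The sketch below is only an illustration, not the official baseline; the checkpoint name is an assumption on my part, and the CometKiwi checkpoint is gated on Hugging Face (login required).

```python
# Minimal sketch of a segment-level QE baseline using the unbabel-comet package.
# Assumes `pip install unbabel-comet`; NOT the official shared-task baseline.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-cometkiwi-da")  # reference-free QE model
model = load_from_checkpoint(model_path)

segments = [
    {"src": "Dnes je krásný den.", "mt": "Today is a beautiful day."},
    {"src": "To nedává smysl.", "mt": "It makes no sense at all maybe."},
]

# One quality score per source–target segment pair.
output = model.predict(segments, batch_size=8, gpus=0)
for seg, score in zip(segments, output.scores):
    print(f"{score:.3f}  {seg['mt']}")
```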
The 2025 MT Evaluation shared task brings together the strengths of the previous Metrics and Quality Estimation tasks under a single, unified evaluation framework.
The following tasks are now open for participants (deadline July 31st, but participation has never been easier 🙂):
Faster but an extra dependency. 🤷
18.07.2025 14:59 — 👍 0 🔁 0 💬 0 📌 0
Not possible post-hoc but possible for the other direction! Thanks for your paper. 🙂
15.07.2025 23:24 — 👍 1 🔁 0 💬 0 📌 0
Thank you to everyone who helped. 😊
Special thanks to @mrinmaya.bsky.social and Peng Cui from @csateth.bsky.social and all my friends I bugged with proofreading. 😁
"How to Select Datapoints for Efficient Human Evaluation of NLG Models?" has now been accepted to TACL (a)! 🌿
📃 Paper (with nuances and caveats): arxiv.org/abs/2501.182...
📦 Package: github.com/zouharvi/sub...
Feedback welcome!
Recommendation based on translation and summarization:
1️⃣ if you have a good automatic metric, use variance/consistency
2️⃣ if not, use model output diversity
3️⃣ if outputs not available, use artificial crowd/distilled predictors
4️⃣ if those are not available, use source diversity
We frame this as a 0/1 Knapsack problem: find a subset Y ⊆ X with maximum utility while staying under budget B. 🤓
maximize: ∑ zₓ · Utility(x)
subject to: ∑ zₓ · Cost(x) ≤ B
zₓ ∈ {0, 1}
The Utility(x) can be metric average, variance, diversity, etc.
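A minimal sketch of how this selection could be implemented. The exact solver is up to you; a greedy utility-per-cost heuristic is shown here purely as an illustration (the actual package may solve it differently), and `utility` / `cost` are assumed to be precomputed per datapoint.

```python
# Greedy 0/1-knapsack-style selection: pick datapoints by utility-per-cost
# until the annotation budget B is exhausted. Purely illustrative; an exact
# DP or ILP solver could be substituted for the greedy heuristic.
def select_subset(items, budget):
    """items: list of dicts with 'id', 'utility', 'cost' (e.g. annotation time)."""
    ranked = sorted(items, key=lambda x: x["utility"] / x["cost"], reverse=True)
    chosen, spent = [], 0.0
    for item in ranked:
        if spent + item["cost"] <= budget:
            chosen.append(item["id"])
            spent += item["cost"]
    return chosen

items = [
    {"id": "seg1", "utility": 0.9, "cost": 3.0},
    {"id": "seg2", "utility": 0.5, "cost": 1.0},
    {"id": "seg3", "utility": 0.7, "cost": 2.5},
]
print(select_subset(items, budget=4.0))  # ['seg2', 'seg1']
```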
This works even if you don't have the model outputs yet.
1️⃣ "artificial crowd" simulate what model outputs would look like; apply the previous methods.
2️⃣ "utility predictors" estimate usefulness from the source text.
3️⃣ "source-based diversity" remove similar inputs.
So what works? Selecting inputs that expose model differences:
1️⃣ high variance in metric scores
2️⃣ diversity in model outputs
3️⃣ high metric consistency with the rest of the dataset
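A minimal sketch of how the first two signals could be computed from existing system outputs and automatic metric scores. Variable names and the word-overlap dissimilarity are mine; the actual implementation in the subset2evaluate package may differ.

```python
# Per-datapoint utilities from automatic metric scores and system outputs.
# Illustrative only.
import itertools
import statistics

def metric_variance(scores_per_system):
    # 1️⃣ prefer datapoints where systems disagree the most in metric score
    return statistics.pvariance(scores_per_system)

def output_diversity(outputs_per_system):
    # 2️⃣ average pairwise dissimilarity between system outputs
    # (crude word-overlap proxy; any text-dissimilarity measure would do)
    def dissim(a, b):
        a_set, b_set = set(a.split()), set(b.split())
        return 1 - len(a_set & b_set) / max(len(a_set | b_set), 1)
    pairs = list(itertools.combinations(outputs_per_system, 2))
    return sum(dissim(a, b) for a, b in pairs) / max(len(pairs), 1)

scores = [0.82, 0.31, 0.77]                       # three systems, one datapoint
outputs = ["the cat sat", "a cat sits down", "the cat sat"]
print(metric_variance(scores), output_diversity(outputs))
```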
We now need almost 30% fewer annotated examples to get the same model ranking.
We frame this as finding the smallest subset of data (Y ⊆ X) that gives the same model ranking as on the full dataset.
Simply picking the hardest examples (lowest average metric score) is a step up but can backfire by selecting the most expensive items to annotate.
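To make "gives the same model ranking" measurable, one common sanity check (a sketch under my own assumptions, not necessarily the paper's exact evaluation) is to compare the system ranking induced by the subset against the full-dataset ranking, e.g. with Kendall's tau.

```python
# Compare the system ranking on a subset against the full-dataset ranking.
# full_scores / subset_scores: average human (or metric) score per system.
from scipy.stats import kendalltau

full_scores   = {"sysA": 0.71, "sysB": 0.69, "sysC": 0.55}
subset_scores = {"sysA": 0.74, "sysB": 0.66, "sysC": 0.58}

systems = sorted(full_scores)
tau, _ = kendalltau(
    [full_scores[s] for s in systems],
    [subset_scores[s] for s in systems],
)
print(f"ranking agreement (Kendall tau): {tau:.2f}")  # 1.0 = identical ranking
```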
You have a budget to human-evaluate 100 inputs to your models, but your dataset is 10,000 inputs. Do not just pick 100 randomly! 🙅
We can do better. "How to Select Datapoints for Efficient Human Evaluation of NLG Models?" shows how.🕵️
(random is still a devilishly good baseline)