Barry Haddow

@bazril.bsky.social

29 Followers  |  54 Following  |  7 Posts  |  Joined: 05.12.2024

Posts by Barry Haddow (@bazril.bsky.social)

An Expanded Massive Multilingual Dataset for High-Performance Language Technologies
Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present ...

Some details in our HPLT report arxiv.org/abs/2503.10267 . Code released here: github.com/hplt-project...

09.05.2025 17:10 | 👍 1    🔁 0    💬 1    📌 0

Call for participation: We just opened the registration for this year's MT Marathon in August in Helsinki, Finland: blogs.helsinki.fi/language-tec..., featuring:

- Ayodele Awokoya
- Wilker Aziz
- Marta Costa-Jussa
- Barry Haddow
- Amit Moryossef
- Sara Papi
- Jörg Tiedemann
- Marco Turchi

18.03.2025 12:57 | 👍 5    🔁 3    💬 0    📌 2

I'm part of this! There's also a paper: arxiv.org/abs/2503.10267

17.03.2025 13:27 | 👍 6    🔁 3    💬 0    📌 0
Multilingual Instruction Shared Task

Big news from WMT! 🎉 We are expanding beyond MT and launching a new multilingual instruction shared task. Our goal is to foster truly multilingual LLM evaluation and best practices in automatic and human evaluation. Join us and build the winning multilingual system!
www2.statmt.org/wmt25/multil...

11.03.2025 18:26 | 👍 12    🔁 7    💬 1    📌 2
A note on formats: TMX files contain only unique translation units. Moses downloads include all non-empty alignment units including duplicates. Token counts for each language also include duplicate sentences and documents.

We also release MultiHPLT, with a total of 1275 language pairs (not English-centric). opus.nlpl.eu/MultiHPLT/en...

28.02.2025 14:03 | 👍 0    🔁 0    💬 0    📌 0
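
The format note above (TMX keeps only unique translation units, while Moses downloads keep every non-empty alignment unit, duplicates included) can be illustrated with a small sketch. It assumes Moses-style data means two line-aligned plain-text files, one sentence per line, and that a unit counts as "non-empty" only when both sides contain text. The function name `count_alignment_units` and the sample sentences are hypothetical and not part of any HPLT or OPUS tooling.

```python
from typing import Iterable, Tuple


def count_alignment_units(
    src_lines: Iterable[str], tgt_lines: Iterable[str]
) -> Tuple[int, int]:
    """Return (total non-empty pairs, unique pairs) for line-aligned text.

    Assumption: a pair is skipped when either side is empty after stripping.
    """
    total = 0
    seen = set()
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # skip empty alignment units
        total += 1
        seen.add((src, tgt))  # identical pairs collapse to one unique unit
    return total, len(seen)


# Hypothetical sample data: one duplicated pair and one empty pair.
src = ["Hello.", "Hello.", "Thanks.", ""]
tgt = ["Hei.", "Hei.", "Kiitos.", ""]
total, unique = count_alignment_units(src, tgt)
print(f"Moses-style count: {total}, unique (TMX-style) count: {unique}")
```

On real downloads the two iterables would be open file handles for the source and target files, so the same function gives a rough idea of how far the Moses and TMX counts for a language pair may differ.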

We will shortly be adding document alignments to this data. This means that we will release sets of aligned complete documents, with additional details of the sentence alignments that we found in the documents.

28.02.2025 13:34 | 👍 0    🔁 0    💬 1    📌 0
HPLT - High Performance Language Technologies
A space that combines petabytes of natural language data with large-scale model training

** New parallel data set ** . We've just released HPLT v2.0, a parallel data set of 50 languages paired with English, 380M sentence pairs in total. Extracted from the Internet Archive and Common Crawl hplt-project.org/datasets/v2.0

28.02.2025 13:34 | 👍 4    🔁 3    💬 1    📌 1

Can't believe WMT General MT shared task is 20 years old!

21.02.2025 09:00 | 👍 1    🔁 0    💬 0    📌 0
MT Summit 2025
The 20th Machine Translation Summit (MT Summit 2025) will take place in Geneva, Switzerland from June 23-27 2025.

MT Summit 2025 - deadline extended!
The deadline for all papers (technical/user/translator/products/projects) has been extended to February 10th. MT Summit will be in Geneva, June 23-27. mtsummit2025.unige.ch/index.html

01.02.2025 17:17 | 👍 0    🔁 0    💬 0    📌 0
The Anthony C. Clarke Award for the 2024 EAMT Best Thesis – European Association for Machine Translation

EAMT best thesis award - closes on January 31st. Completed an MT-related PhD in 2024 in Europe, Africa or the Middle East? Then why not submit your thesis? eamt.org/2024/11/28/t...

26.01.2025 20:37 | 👍 4    🔁 0    💬 0    📌 0