Linguistic Data Consortium ldcupenn

The 50th Penn Linguistics Conference (PLC) is Feb 28–Mar 1. PLC brings together students, faculty & researchers interested in languages & linguistics to share new work and connect with peers. We wish everyone a great and productive conference. @pennlinguistics.bsky.social tinyurl.com/verswp3z

27.02.2026 13:42 — 👍 0 🔁 0 💬 0 📌 0

More LDC data in the LORELEI series: LORELEI Russian Representative Language Pack features monolingual and parallel text, annotations, software tools and more for human language technology development to address emergent situations bit.ly/3MHDr4v

24.02.2026 16:02 — 👍 1 🔁 0 💬 0 📌 0

International Mother Language Day 21 February

Happy International #MotherLanguageDay! This year’s theme celebrates youth voices on multilingual education – emphasizing that language is central to identity, learning, well-being and participation in society. Let’s celebrate every language, every voice. www.unesco.org/en/days/moth...

20.02.2026 17:41 — 👍 0 🔁 0 💬 0 📌 0

KAIROS Schema Learning Background Source Data: 14K English & Spanish multimodal resources collected by LDC for a Schema Learning Corpus; schemas were used with event extraction to characterize & make predictions about real-world events in the corpus bit.ly/4tPVeYa

19.02.2026 15:22 — 👍 0 🔁 0 💬 0 📌 0

2022 NIST Language Recognition Evaluation Test and Development Sets: 222 hours of telephone speech and broadcast narrowband speech in 14 languages, plus turnkey evaluation documentation, emphasizing African languages and related English and French dialects bit.ly/4rIEJLs

18.02.2026 14:29 — 👍 0 🔁 0 💬 0 📌 0

Catch up on 2026 membership discounts, spring data scholarship awards and the release of three new publications in LDC’s February newsletter ldc-upenn.blogspot.com

17.02.2026 15:03 — 👍 1 🔁 0 💬 0 📌 0

MATERIAL Swahili-English Language Pack has 112 hours of Swahili conversational telephone speech, transcripts, English translations, annotations and queries designed to support cross language information retrieval bit.ly/49SWG3R

23.01.2026 16:46 — 👍 0 🔁 0 💬 0 📌 0

CALLHOME Japanese Lexicon Second Edition: morphological, phonological and stress information for 80,688 Japanese words from transcripts of telephone conversations between native Japanese speakers, along with a pronunciation dictionary and G2P tools bit.ly/3NlxvhC

22.01.2026 15:28 — 👍 0 🔁 0 💬 0 📌 0

CALLHOME Japanese Second Edition brings original speech and transcript datasets up to date with new transcripts and revised directories, file formats and documentation bit.ly/49kSdqz

21.01.2026 15:28 — 👍 0 🔁 0 💬 0 📌 0

LDC welcomes 2026 with its January newsletter featuring three publications and membership renewal information ldc-upenn.blogspot.com

20.01.2026 15:21 — 👍 0 🔁 0 💬 0 📌 0

LORELEI Sinhala Incident Language Pack: monolingual and parallel text, annotations, software tools and more for human language technology development in this under-resourced language bit.ly/4iVnJP1

18.12.2025 15:47 — 👍 0 🔁 0 💬 0 📌 0

2021 NIST SRE Test Set: 447 hours of Cantonese, Mandarin, and English conversational telephone speech, audio from video, and selfie image data for development and test, along with answer keys, enrollment, trial files and documentation bit.ly/4q35JV4

17.12.2025 15:43 — 👍 0 🔁 0 💬 0 📌 0

Check out LDC’s December newsletter for the latest news and publications and join us in celebrating the release of our 1000th corpus! ldc-upenn.blogspot.com

16.12.2025 15:38 — 👍 0 🔁 0 💬 0 📌 0

#18.9 Interspeech 2025 Impressions - Denise DiPersio. Meet Denise DiPersio, Associate Director at Linguistic Data Consortium, sharing her experience with us. Host: Pascal Hecker. Post-production: Wei Xue.

Check out ISCA-SAC’s Speech Pitch podcast to hear from LDC’s Denise DiPersio #18.9. This session was recorded during Interspeech 2025. Listen to Denise talk about LDC’s past, present and future and LDC’s involvement in Interspeech since the 2009 conference in Brighton. tinyurl.com/488rske4

05.12.2025 15:00 — 👍 1 🔁 0 💬 0 📌 0

LORELEI Ilocano Incident Language Pack: monolingual and parallel text, annotations, software tools and more for human language technology development in this under-resourced language bit.ly/43moVEw

20.11.2025 14:58 — 👍 1 🔁 0 💬 0 📌 0

AnnoDIFP CTS Audio and Transcripts: 242.52 hours of English telephone audio and transcripts from 1179 calls involving 327 participants, paired with scores from two self-reported personality assessments bit.ly/47J6JHX

19.11.2025 15:13 — 👍 0 🔁 0 💬 0 📌 0

LDC’s November newsletter has details on 2026 membership renewal, the spring data scholarship deadline and two new publications ldc-upenn.blogspot.com

18.11.2025 14:46 — 👍 0 🔁 0 💬 0 📌 0

BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Transcripts and Translations: transcripts and English translations for 116 hours of BOLT CTS telephone recordings; all speech was transcribed; 99% of the transcripts were translated bit.ly/4ockuEo

21.10.2025 13:30 — 👍 0 🔁 0 💬 0 📌 0

BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Audio: 116 hours of telephone speech from 274 conversations between native speakers; developed by LDC for the DARPA BOLT program; contains previously unexposed calls from the CF/CH collections bit.ly/42rsg4S

20.10.2025 14:32 — 👍 0 🔁 0 💬 0 📌 0

KAIROS Phase 2 Quizlet contains English and Spanish web data annotated for events, relations and arguments, a reference knowledge graph and a knowledge base; quizlets were defined tasks to explore evaluation objectives before the full program evaluation bit.ly/3WqvYYR

17.10.2025 14:47 — 👍 0 🔁 0 💬 0 📌 0

See LDC’s October newsletter for a preview of 2026 publications, fall data scholarship recipients and three new publications ldc-upenn.blogspot.com

16.10.2025 15:09 — 👍 0 🔁 0 💬 0 📌 0

More LDC data in the LORELEI series: LORELEI Hindi Representative Language Pack features monolingual and parallel text, annotations, software tools and more for human language technology development to address emergent situations bit.ly/4nCp3ar

22.09.2025 20:37 — 👍 1 🔁 0 💬 0 📌 0

AIDA Scenario 1 Evaluation Topic Source Data, Annotation & Assessment: 10k+ English, Russian & Ukrainian web docs on political relations between Russia & Ukraine in the 2010s annotated for entities & cross-reference, w/ judgments for scoring submissions bit.ly/3K7ynoA

22.09.2025 16:01 — 👍 0 🔁 0 💬 0 📌 0

Mixer 7 English Speech has 12,321 hours of telephone conversations, interviews and transcript readings from 222 English speakers, some collected using a 14-microphone array; speaker metadata is included bit.ly/4nvSYkG

19.09.2025 15:24 — 👍 0 🔁 0 💬 0 📌 0

Check out our September newsletter for three new LDC publications: Mixer 7 English Speech, AIDA Scenario 1 Evaluation Topic Source Data, Annotation and Assessment, and LORELEI Hindi Representative Language Pack ldc-upenn.blogspot.com

18.09.2025 15:08 — 👍 0 🔁 0 💬 0 📌 0

KAIROS Phase 1 Quizlet contains English and Spanish web data annotated for events, relations and arguments and a reference knowledge graph; quizlets were defined tasks to explore evaluation objectives before the full program evaluation bit.ly/3HvDU7k

26.08.2025 18:39 — 👍 1 🔁 0 💬 0 📌 0

Abstract Meaning Representation 2.0 - Machine Translations translates 1,371 English sentences from LDC’s AMR 2.0 corpus into Spanish, German, Italian and Mandarin Chinese using Google Translate bit.ly/4n1m8bp

26.08.2025 14:50 — 👍 0 🔁 0 💬 0 📌 0

Mixer 6 - CHiME 8 Transcribed Calls and Interviews: 80 hours of Mixer 6 English interviews and telephone speech across 13 channels (1063 hours) with transcripts divided into training, development and test sets bit.ly/4oyUCn5

25.08.2025 18:33 — 👍 0 🔁 0 💬 0 📌 0

LDC’s August newsletter has the last call for fall data scholarship applications and details on new publications: Mixer 6 - CHiME 8 Transcribed Calls and Interviews, Abstract Meaning Representation 2.0 - Machine Translations and KAIROS Phase 1 Quizlet ldc-upenn.blogspot.com

25.08.2025 13:09 — 👍 1 🔁 0 💬 0 📌 0

What a great conference #Interspeech2025! There is still time to stop by our booth and grab a limited-edition TIMIT word poetry magnet. Also don’t miss our colleague’s oral session on TELVID: A multilingual, multi-modal corpus for speaker recognition at 13:30, A04, Port 1A @interspeech.bsky.social

21.08.2025 09:40 — 👍 1 🔁 0 💬 0 📌 0

Posts by Linguistic Data Consortium (@ldcupenn.bsky.social)