Linguistic Data Consortium

@ldcupenn.bsky.social

LDC creates and distributes language resources to universities, labs, companies and libraries for linguistic education, research and technology development.

31 Followers  |  1 Following  |  31 Posts  |  Joined: 12.03.2025

Latest posts by ldcupenn.bsky.social on Bluesky

BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Transcripts and Translations: transcripts and English translations for 116 hours of BOLT CTS telephone recordings; all speech was transcribed; 99% of the transcripts were translated bit.ly/4ockuEo

21.10.2025 13:30 | 👍 0 | 🔁 0 | 💬 0 | 📌 0

BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Audio: 116 hours of telephone speech from 274 conversations between native speakers; developed by LDC for the DARPA BOLT program; contains previously unexposed calls from the CF/CH collections bit.ly/42rsg4S

20.10.2025 14:32 | 👍 0 | 🔁 0 | 💬 0 | 📌 0

KAIROS Phase 2 Quizlet contains English and Spanish web data annotated for events, relations and arguments, a reference knowledge graph and a knowledge base; quizlets were defined tasks to explore evaluation objectives before the full program evaluation bit.ly/3WqvYYR

17.10.2025 14:47 | 👍 0 | 🔁 0 | 💬 0 | 📌 0

See LDC’s October newsletter for a preview of 2026 publications, fall data scholarship recipients and three new publications ldc-upenn.blogspot.com

16.10.2025 15:09 | 👍 0 | 🔁 0 | 💬 0 | 📌 0

More LDC data in the LORELEI series: LORELEI Hindi Representative Language Pack features monolingual and parallel text, annotations, software tools and more for human language technology development to address emergent situations bit.ly/4nCp3ar

22.09.2025 20:37 | 👍 1 | 🔁 0 | 💬 0 | 📌 0

AIDA Scenario 1 Evaluation Topic Source Data, Annotation & Assessment: 10k+ English, Russian & Ukrainian web docs on political relations between Russia & Ukraine in the 2010s annotated for entities & cross-reference, w/ judgments for scoring submissions bit.ly/3K7ynoA

22.09.2025 16:01 | 👍 0 | 🔁 0 | 💬 0 | 📌 0

Mixer 7 English Speech has 12,321 hours of telephone conversations, interviews and transcript readings from 222 English speakers, some collected using a 14-microphone array; speaker metadata is included bit.ly/4nvSYkG

19.09.2025 15:24 | 👍 0 | 🔁 0 | 💬 0 | 📌 0

Check out our September newsletter for three new LDC publications: Mixer 7 English Speech, AIDA Scenario 1 Evaluation Topic Source Data, Annotation and Assessment, and LORELEI Hindi Representative Language Pack ldc-upenn.blogspot.com

18.09.2025 15:08 | 👍 0 | 🔁 0 | 💬 0 | 📌 0

KAIROS Phase 1 Quizlet contains English and Spanish web data annotated for events, relations and arguments and a reference knowledge graph; quizlets were defined tasks to explore evaluation objectives before the full program evaluation bit.ly/3HvDU7k

26.08.2025 18:39 | 👍 1 | 🔁 0 | 💬 0 | 📌 0

Abstract Meaning Representation 2.0 - Machine Translations translates 1,371 English sentences from LDC’s AMR 2.0 corpus into Spanish, German, Italian and Mandarin Chinese using Google Translate bit.ly/4n1m8bp

26.08.2025 14:50 | 👍 0 | 🔁 0 | 💬 0 | 📌 0

Mixer 6 - CHiME 8 Transcribed Calls and Interviews: 80 hours of Mixer 6 English interviews and telephone speech across 13 channels (1063 hours) with transcripts divided into training, development and test sets bit.ly/4oyUCn5

25.08.2025 18:33 | 👍 0 | 🔁 0 | 💬 0 | 📌 0

LDC’s August newsletter has the last call for fall data scholarship applications and details on new publications: Mixer 6 - CHiME 8 Transcribed Calls and Interviews, Abstract Meaning Representation 2.0 - Machine Translations and KAIROS Phase 1 Quizlet ldc-upenn.blogspot.com

25.08.2025 13:09 | 👍 1 | 🔁 0 | 💬 0 | 📌 0

What a great conference #Interspeech2025! There is still time to stop by our booth and grab a limited-edition TIMIT word poetry magnet. Also don’t miss our colleague’s oral session on TELVID: A multilingual, multi-modal corpus for speaker recognition at 13:30, A04, Port 1A @interspeech.bsky.social

21.08.2025 09:40 | 👍 1 | 🔁 0 | 💬 0 | 📌 0

Good morning, #Interspeech2025! Stop by our booth during the coffee breaks today to say hello. Also don't miss today's special session co-organized by LDC on Challenges in Speech Collection, Curation and Annotation, in two parts beginning at 13:30, Dock 15. @interspeech.bsky.social

20.08.2025 07:11 | 👍 1 | 🔁 0 | 💬 0 | 📌 0

Good morning Interspeech. It's a great second day. Come by and grab one of our limited giveaways. @interspeech.bsky.social
#Interspeech2025

19.08.2025 07:22 | 👍 0 | 🔁 0 | 💬 0 | 📌 0

We are excited to be here at Interspeech 2025 @interspeech.bsky.social Come see us at the first coffee break today to learn more about the latest developments at LDC. #Interspeech2025

18.08.2025 08:11 | 👍 1 | 🔁 1 | 💬 0 | 📌 0

LDC will be exhibiting at #Interspeech2025, August 17-21 in Rotterdam. Stop by our booth to say hello and learn about the latest developments at the Consortium. LDC work will also be featured in presentations, posters and a special session. We look forward to seeing you there. www.interspeech2025.org

12.08.2025 15:51 | 👍 0 | 🔁 0 | 💬 0 | 📌 0

From the LORELEI companion project: LoReHLT Uzbek Representative Language Pack features monolingual and parallel text, annotations, audio recordings, software tools and more for human language technology development to address emergent situations bit.ly/4lL0zuL

22.07.2025 14:08 | 👍 1 | 🔁 0 | 💬 0 | 📌 0

Penn Parsed Corpora of Historical English Second Release: POS-tagged & syntactically annotated British English text (1100-1914 CE); updates the 2020 release with new annotation, revised guidelines, philological information & the Corpus2 search tool bit.ly/46zR1hR

18.07.2025 14:34 | 👍 0 | 🔁 0 | 💬 0 | 📌 0

AnnoDIFP Session Audio and Transcripts: 438.34 hours of English audio and transcripts from in-person interviews of 366 participants paired with scores from two self-reported personality assessments bit.ly/4nEYQJr

17.07.2025 15:16 | 👍 0 | 🔁 0 | 💬 0 | 📌 0

Check out the July newsletter for Fall 2025 data scholarship application deadlines & 3 new publications: AnnoDIFP Session Audio and Transcripts, Penn Parsed Corpora of Historical English Second Release & LoReHLT Uzbek Representative Language Pack ldc-upenn.blogspot.com

16.07.2025 14:29 | 👍 0 | 🔁 0 | 💬 0 | 📌 0

KAIROS Schema Learning Complex Event Annotation has English and Spanish web text, audio, video and image data labeled for 93 real-world complex events with event, relation and argument annotations linking to document provenance bit.ly/4jNrDIq

25.06.2025 13:07 | 👍 0 | 🔁 0 | 💬 0 | 📌 0

IWSLT 2022-2023 Shared Task Training, Development and Test Set: 210 hours of Tunisian Arabic conversational telephone speech, transcripts, English translations, speaker metadata, and documentation used in IWSLT dialectal speech and low-resource tracks bit.ly/3HEO4lL

24.06.2025 14:24 | 👍 0 | 🔁 0 | 💬 0 | 📌 0

Chinese Sentence Pattern Structure Treebank contains 5,016 sentences and 119,627 tokens from modern and ancient Chinese works annotated for lexical sense, syntactic structure and inter-clause relations bit.ly/4kZVGh3

23.06.2025 13:57 | 👍 0 | 🔁 0 | 💬 0 | 📌 0

LDC’s June newsletter has the latest on three new publications: Chinese Sentence Pattern Structure Treebank, IWSLT 2022-2023 Shared Task Training, Development and Test Set, and KAIROS Schema Learning Complex Event Annotation ldc-upenn.blogspot.com

17.06.2025 13:39 | 👍 0 | 🔁 0 | 💬 0 | 📌 0

BOLT CTS CALLFRIEND CALLHOME Mandarin Chinese Transcripts and Translations: transcripts and English translations for 93 hours of BOLT CTS telephone recordings; all speech was transcribed; 89% of the transcripts were translated bit.ly/4jKul2j

20.05.2025 13:29 | 👍 0 | 🔁 0 | 💬 0 | 📌 0

BOLT CTS CALLFRIEND CALLHOME Mandarin Chinese Audio: 93 hours of telephone speech from 236 conversations between native speakers; developed by LDC for the DARPA BOLT program; contains previously unexposed calls from the CF/CH collections bit.ly/4kbsBPy

19.05.2025 14:12 | 👍 0 | 🔁 0 | 💬 0 | 📌 0

Check out LDC’s May newsletter for two new companion releases developed by LDC to support the DARPA BOLT program, BOLT CTS CALLFRIEND CALLHOME Mandarin Chinese Audio and BOLT CTS CALLFRIEND CALLHOME Mandarin Chinese Transcripts and Translations ldc-upenn.blogspot.com

16.05.2025 14:07 | 👍 0 | 🔁 0 | 💬 0 | 📌 0

MATERIAL Kazakh-English Language Pack has 57 hours of Kazakh conversational telephone speech, transcripts, English translations, annotations and queries designed to support cross-language information retrieval bit.ly/42cwe01

18.04.2025 13:53 | 👍 0 | 🔁 0 | 💬 0 | 📌 0

DEFT Spanish Light and Rich ERE Annotation: 158 Latin American discussion forum and Spanish newswire documents annotated for entities, relations and events, including coreference (light) and event hoppers (rich), developed by LDC for the DARPA DEFT program bit.ly/3YcGCnd

17.04.2025 14:39 | 👍 0 | 🔁 0 | 💬 0 | 📌 0