From the LORELEI companion project: LoReHLT Uzbek Representative Language Pack features monolingual and parallel text, annotations, audio recordings, software tools and more for human language technology development to address emergent situations bit.ly/4lL0zuL
22.07.2025 14:08
Penn Parsed Corpora of Historical English Second Release: POS-tagged & syntactically annotated British English text (1100-1914 CE); updates the 2020 release with new annotation, revised guidelines, philological information & the Corpus2 search tool bit.ly/46zR1hR
18.07.2025 14:34
AnnoDIFP Session Audio and Transcripts: 438.34 hours of English audio and transcripts from in-person interviews of 366 participants paired with scores from two self-reported personality assessments bit.ly/4nEYQJr
17.07.2025 15:16
Check out the July newsletter for Fall 2025 data scholarship application deadlines & 3 new publications: AnnoDIFP Session Audio and Transcripts, Penn Parsed Corpora of Historical English Second Release & LoReHLT Uzbek Representative Language Pack ldc-upenn.blogspot.com
16.07.2025 14:29
KAIROS Schema Learning Complex Event Annotation has English and Spanish web text, audio, video and image data labeled for 93 real-world complex events with event, relation and argument annotations linking to document provenance bit.ly/4jNrDIq
25.06.2025 13:07
IWSLT 2022-2023 Shared Task Training, Development and Test Set: 210 hours of Tunisian Arabic conversational telephone speech, transcripts, English translations, speaker metadata, and documentation used in IWSLT dialectal speech and low resource tracks bit.ly/3HEO4lL
24.06.2025 14:24
Chinese Sentence Pattern Structure Treebank contains 5,016 sentences and 119,627 tokens from modern and ancient Chinese works annotated for lexical sense, syntactic structure and inter-clause relations bit.ly/4kZVGh3
23.06.2025 13:57
LDC's June newsletter has the latest on three new publications: Chinese Sentence Pattern Structure Treebank, IWSLT 2022-2023 Shared Task Training, Development and Test Set, and KAIROS Schema Learning Complex Event Annotation ldc-upenn.blogspot.com
17.06.2025 13:39
BOLT CTS CALLFRIEND CALLHOME Mandarin Chinese Transcripts and Translations: transcripts and English translations for 93 hours of BOLT CTS telephone recordings; all speech was transcribed; 89% of the transcripts were translated bit.ly/4jKul2j
20.05.2025 13:29
BOLT CTS CALLFRIEND CALLHOME Mandarin Chinese Audio: 93 hours of telephone speech from 236 conversations between native speakers; developed by LDC for the DARPA BOLT program; contains previously unexposed calls from the CF/CH collections bit.ly/4kbsBPy
19.05.2025 14:12
Check out LDC's May newsletter for two new companion releases developed by LDC to support the DARPA BOLT program, BOLT CTS CALLFRIEND CALLHOME Mandarin Chinese Audio and BOLT CTS CALLFRIEND CALLHOME Mandarin Chinese Transcripts and Translations ldc-upenn.blogspot.com
16.05.2025 14:07
MATERIAL Kazakh-English Language Pack has 57 hours of Kazakh conversational telephone speech, transcripts, English translations, annotations and queries designed to support cross language information retrieval bit.ly/42cwe01
18.04.2025 13:53
DEFT Spanish Light and Rich ERE Annotation: 158 Latin American discussion forum and Spanish newswire documents annotated for entities, relations and events, including coreference (light) and event hoppers (rich), developed by LDC for the DARPA DEFT program bit.ly/3YcGCnd
17.04.2025 14:39
The April newsletter introduces LDC's upgraded website, welcomes Bluesky to our social media channels and has the latest on LDC's two new publications, DEFT Spanish Light and Rich ERE Annotation and MATERIAL Kazakh-English Language Pack ldc-upenn.blogspot.com
16.04.2025 14:27