Abstract(s)
We introduce TED-Multilingual Discourse Bank, a corpus of TED talks transcripts in 6 languages (English, German, Polish, European Portuguese, Russian and Turkish), where the ultimate aim is to provide a clearly described level of discourse structure and semantics in multiple languages. The corpus is manually annotated following the goals and principles of PDTB, involving explicit and implicit discourse connectives, entity relations, alternative lexicalizations and no relations. In the corpus, we also aim to capture the characteristics of spoken language that exist in the transcripts and adapt the PDTB scheme according to our aims; for example, we introduce hypophora. We spot other aspects of spoken discourse such as the discourse marker use of connectives to keep them distinct from their discourse connective use. TED-MDB is, to the best of our knowledge, one of the few multilingual discourse treebanks and is hoped to be a source of parallel data for contrastive linguistic analysis as well as language technology applications. We describe the corpus, the annotation procedure and provide preliminary corpus statistics.
Keywords
Discourse Parallel Multilingual corpus
Citation
Zeyrek, Deniz, Amália Mendes, Murathan Kurfalı (2018) Multilingual Extension of PDTB-Style Annotation: The Case of TED Multilingual Discourse Bank. In Proceedings of the 11th Language Resources and Evaluation Conference - LREC’2018, 7-12 May 2018, Miyazaki, Japan, pp. 1913-1919.