Publication

Multilingual Extension of PDTB-Style Annotation: The Case of TED Multilingual Discourse Bank


Abstract(s)

We introduce TED-Multilingual Discourse Bank, a corpus of TED talks transcripts in 6 languages (English, German, Polish, European Portuguese, Russian and Turkish), where the ultimate aim is to provide a clearly described level of discourse structure and semantics in multiple languages. The corpus is manually annotated following the goals and principles of PDTB, involving explicit and implicit discourse connectives, entity relations, alternative lexicalizations and no relations. In the corpus, we also aim to capture the characteristics of spoken language that exist in the transcripts and adapt the PDTB scheme according to our aims; for example, we introduce hypophora. We spot other aspects of spoken discourse such as the discourse marker use of connectives to keep them distinct from their discourse connective use. TED-MDB is, to the best of our knowledge, one of the few multilingual discourse treebanks and is hoped to be a source of parallel data for contrastive linguistic analysis as well as language technology applications. We describe the corpus, the annotation procedure and provide preliminary corpus statistics.

Keywords

Discourse; Parallel; Multilingual corpus

Citation

Zeyrek, Deniz, Amália Mendes, Murathan Kurfalı (2018) Multilingual Extension of PDTB-Style Annotation: The Case of TED Multilingual Discourse Bank. In Proceedings of the 11th Language Resources and Evaluation Conference - LREC’2018, 7-12 May 2018, Miyazaki, Japan, pp. 1913-1919.

Publisher

European Language Resources Association
