A Portuguese Native Language Identification Dataset

del Río, Iria; Zampieri, Marcos; Malmasi, Shervin

http://hdl.handle.net/10451/33644

Use this identifier to cite this record.

Name:	Description:	Size:	Format:
A Portuguese Native Language Identification Dataset.pdf		97.93 KB	Adobe PDF

Authors

del Río, Iria

Zampieri, Marcos

Malmasi, Shervin

Abstract

In this paper we present NLI-PT, the first Portuguese dataset compiled for Native Language Identification (NLI), the task of identifying an author’s first language based on their second language writing. The dataset includes 1,868 student essays written by learners of European Portuguese, native speakers of the following L1s: Chinese, English, Spanish, German, Russian, French, Japanese, Italian, Dutch, Tetum, Arabic, Polish, Korean, Romanian, and Swedish. NLI-PT includes the original student text and four different types of annotation: POS, fine-grained POS, constituency parses, and dependency parses. NLI-PT can be used not only in NLI but also in research on several topics in the field of Second Language Acquisition and educational NLP. We discuss possible applications of this dataset and present the results obtained for the first lexical baseline system for Portuguese NLI.
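To make the idea of a lexical baseline for NLI concrete, the following is a minimal sketch of a word-based classifier that guesses an author's L1 from their text. The abstract does not say which model the authors used, so a multinomial Naive Bayes over word unigrams is swapped in here purely for illustration. The function names (`tokenize`, `train`, `predict`), the labels `L1_A` and `L1_B`, and the toy sentences are all hypothetical and are not taken from NLI-PT.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercased whitespace split; a real system would use a proper tokenizer.
    return text.lower().split()


def train(essays):
    """Collect per-label word counts from a list of (text, l1_label) pairs."""
    word_counts = defaultdict(Counter)  # word frequencies per L1 label
    doc_counts = Counter()              # number of essays per L1 label
    vocab = set()
    for text, label in essays:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def predict(model, text):
    """Return the label with the highest Naive Bayes log-probability."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    tokens = tokenize(text)
    best_label, best_score = None, -math.inf
    for label, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # log prior
        for tok in tokens:
            # Add-one (Laplace) smoothing so unseen words do not zero out a label.
            score += math.log((counts[tok] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy "essays" with placeholder labels; NOT data from NLI-PT.
TOY_ESSAYS = [
    ("eu gosto muito de estudar portugues em lisboa", "L1_A"),
    ("eu gosto de viajar e de estudar", "L1_A"),
    ("ontem fui na praia com mi familia", "L1_B"),
    ("mi familia vive na cidade", "L1_B"),
]

model = train(TOY_ESSAYS)
print(predict(model, "mi familia vive na praia"))  # prints L1_B
```

With real data, the (text, label) pairs would come from the learner essays and their L1 labels, and evaluation would need a held-out split; the sketch only shows how word choice alone can carry a signal about the author's first language.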

Citation

del Río, Iria; Zampieri, Marcos; Malmasi, Shervin (2018). A Portuguese Native Language Identification Dataset. In "The Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications". The Association for Computational Linguistics: New Orleans.

Research projects

DETECÇÃO E CORREÇÃO AUTOMÁTICA DE ERROS EM PORTUGUÊS SEGUNDA LÍNGUA/LÍNGUA ESTRANGEIRA

Publisher

The Association for Computational Linguistics

Collections

FL - CLUL - Livros de Actas
