Developing reliability metrics and validation tools for datasets with deep linguistic information

Castro, Sérgio Ricardo de

http://hdl.handle.net/10451/13908

Use this identifier to cite this record.

Name:	Description:	Size:	Format:
1011rf_29479.pdf		2.65 MB	Adobe PDF

Authors

Castro, Sérgio Ricardo de

Advisor(s)

Branco, António

Abstract(s)

The purpose of this dissertation is to propose a reliability metric and corresponding validation tools for corpora annotated with deep linguistic information. Annotating a corpus with deep linguistic information is a complex task and is therefore aided by a computational grammar. This grammar generates all the possible grammatical representations for each sentence. The human annotators select the most correct analysis for each sentence, or reject the sentence if no suitable representation is available. This task is performed independently by two human annotators under a double-blind annotation scheme, and the resulting annotations are adjudicated by a third annotator. This process should result in reliable datasets, since the main purpose of such a dataset is to serve as training and validation data for other natural language processing tools. It is therefore necessary to have a metric that assures such reliability and quality. In most cases, the metrics used for shallow annotation or parser evaluation have been applied to this task. However, the increased complexity demands finer granularity in order to properly measure the reliability of the dataset. With that in mind, I suggest a metric based on Cohen’s Kappa that, instead of considering the assignment of tags to parts of the sentence, considers the decisions at the level of the semantic discriminants, the most granular unit available for this task. By comparing each annotator’s choices, it is possible to evaluate with a high degree of granularity how close their analyses were for any given sentence. An application was developed to apply this model to the data resulting from the annotation process, which was aided by the LOGON framework. The output of this application includes not only the metric for the annotated dataset but also information about divergent decisions, with the intent of aiding the adjudication process.
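
To make the idea in the abstract concrete, the following is a minimal sketch of Cohen's Kappa computed over per-discriminant decisions from two annotators. It is an illustration only and not the application described in the dissertation: the function name `cohen_kappa`, the labels "accept" and "reject", and the sample decisions are all assumptions made for this example, and no LOGON data format is assumed.

```python
from collections import Counter


def cohen_kappa(decisions_a, decisions_b):
    """Cohen's Kappa for two equally long sequences of categorical decisions.

    Each position is assumed to hold the two annotators' decisions on the
    same discriminant (hypothetical labels, e.g. "accept" or "reject").
    """
    if len(decisions_a) != len(decisions_b) or not decisions_a:
        raise ValueError("both annotators need the same non-empty set of items")

    n = len(decisions_a)

    # Observed agreement: share of discriminants with identical decisions.
    observed = sum(a == b for a, b in zip(decisions_a, decisions_b)) / n

    # Chance agreement: for each label, the product of the two annotators'
    # marginal proportions, summed over all labels.
    freq_a = Counter(decisions_a)
    freq_b = Counter(decisions_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    # Convention chosen for this sketch: if both annotators used a single
    # identical label throughout, report perfect agreement instead of 0/0.
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical decisions on six discriminants of one sentence.
annotator_1 = ["accept", "reject", "accept", "accept", "reject", "accept"]
annotator_2 = ["accept", "reject", "reject", "accept", "reject", "accept"]

kappa = cohen_kappa(annotator_1, annotator_2)

# Positions where the annotators diverge, the kind of information that
# could be passed on to an adjudicator.
disagreements = [
    i for i, (a, b) in enumerate(zip(annotator_1, annotator_2)) if a != b
]

print(f"kappa = {kappa:.3f}, divergent discriminants: {disagreements}")
```

In this toy data the annotators agree on five of six discriminants, and chance agreement is 0.5, so Kappa is lower than the raw agreement rate. Listing the divergent positions mirrors, in a simplified way, the adjudication aid mentioned in the abstract.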

Keywords

Natural language processing; corpora annotation with deep linguistic information; inter-annotator agreement

URI

http://hdl.handle.net/10451/13908
http://repositorio.ul.pt/handle/10455/6753

Collections

FC-DI - Master Thesis (dissertation)
