| Name | Description | Size | Format |
|---|---|---|---|
| | | 2.65 MB | Adobe PDF |
Authors
Advisor(s)
Abstract(s)
The purpose of this dissertation is to propose a reliability metric and respective
validation tools for corpora annotated with deep linguistic information.
Annotating a corpus with deep linguistic information is a complex task and
is therefore aided by a computational grammar. This grammar generates all the possible grammatical representations for each sentence. The human annotators select the
most appropriate analysis for each sentence, or reject the sentence if no suitable representation is
available. This task is performed independently by two human annotators under a double-blind annotation
scheme, and the resulting annotations are adjudicated by a third annotator.
This process should result in reliable datasets, since their main purpose
is to serve as training and validation data for other natural language processing
tools. It is therefore necessary to have a metric that assures such reliability and
quality.
In most cases, the metrics used for shallow annotation or parser evaluation have
been applied to this task as well. However, the increased complexity demands finer
granularity in order to properly measure the reliability of the dataset.
With that in mind, I suggest using a metric based on Cohen’s Kappa
that, instead of considering the assignment of tags to parts of the sentence,
considers the decisions made at the level of the semantic discriminants, the most granular
unit available for this task. By comparing each annotator’s choices, it is possible
to evaluate with a high degree of granularity how close their analyses were for any given sentence.
An application was developed that applies this model to the
data resulting from the annotation process, which was aided by the LOGON framework.
The output of this application includes not only the metric for the annotated dataset
but also information about divergent decisions, with the intent of aiding the
adjudication process.
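As a rough illustration of the idea described above, the following Python sketch computes Cohen's Kappa over per-discriminant decisions and lists the discriminants on which two annotators diverge. It is not the application developed in the dissertation: the function names, the discriminant identifiers (`d1`, `d2`, ...), the decision labels (`"yes"`, `"no"`), and the dictionary layout are all assumptions made for the example, and it does not read LOGON data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two equally long sequences of categorical decisions."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators decided alike.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: derived from each annotator's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1.0 - expected)


def compare_annotators(decisions_a, decisions_b):
    """Return (kappa, divergent_ids) over the discriminants both annotators saw.

    Each argument maps a hypothetical discriminant id to a decision label.
    """
    shared = sorted(decisions_a.keys() & decisions_b.keys())
    labels_a = [decisions_a[d] for d in shared]
    labels_b = [decisions_b[d] for d in shared]
    divergent = [d for d in shared if decisions_a[d] != decisions_b[d]]
    return cohen_kappa(labels_a, labels_b), divergent


if __name__ == "__main__":
    # Hypothetical decisions for the discriminants of a single sentence.
    annotator_a = {"d1": "yes", "d2": "yes", "d3": "no", "d4": "no"}
    annotator_b = {"d1": "yes", "d2": "no", "d3": "no", "d4": "no"}
    kappa, divergent = compare_annotators(annotator_a, annotator_b)
    print(kappa, divergent)
```

In this toy case the annotators agree on three of four discriminants (observed agreement 0.75, chance agreement 0.5), so Kappa is 0.5, and `d2` is reported as the divergent decision that an adjudicator would need to review.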
Description
Keywords
Natural language processing; corpora annotation with deep linguistic information; inter-annotator agreement
