
Morphological Tagging of a Spoken Portuguese Corpus Using Available Resources

dc.contributor.author: Mendes, Amália
dc.contributor.author: Amaro, Raquel
dc.contributor.author: Nascimento, Maria Fernanda Bacelar do
dc.date.accessioned: 2019-03-19T15:22:50Z
dc.date.available: 2019-03-19T15:22:50Z
dc.date.issued: 2004
dc.description.abstract: This paper discusses the experience of reusing annotation tools developed for written corpora to tag a spoken corpus with POS information. Eric Brill’s tagger, initially trained over a written and tagged corpus of 250,000 words, is being used to tag the Portuguese C-ORAL-ROM spoken corpus, of 300,000 words. First, we address issues related to the tagset definition as well as the tagger performance over the written corpus. We discuss important options concerning the spoken corpus transcription, with direct impact on the tagging task, as well as the additional tags required. Transcription options allow in some cases for automatic tag identification and replacement, through a post-tagger process. Other cases, like the annotation of discourse markers, are more complex and require manual revision (and possibly listening). Since the final annotation will not only include the POS tag but also the wordform lemma, the paper also addresses issues related to the lemmatisation task. The positive results obtained show that the process of tagging and lemmatising a spoken Portuguese corpus through the reuse of already available resources may constitute an example of how to minimize the costs of such a task, without compromising the results. Finally, we discuss some possible developments to improve the tagger’s performance. [pt_PT]
dc.description.version: info:eu-repo/semantics/publishedVersion [pt_PT]
dc.identifier.citation: Mendes, A., Amaro, R. & Bacelar do Nascimento, M. F. (2004): "Morphological Tagging of a Spoken Portuguese Corpus Using Available Resources", in Branco, A., Mendes, A. & Ribeiro, R. (eds.) Language Technology for Portuguese: Shallow processing tools and resources. Lisboa: Colibri [pt_PT]
dc.identifier.uri: http://hdl.handle.net/10451/37588
dc.language.iso: eng [pt_PT]
dc.publisher: Colibri [pt_PT]
dc.relation.publisherversion: http://www.edi-colibri.pt/Detalhes.aspx?ItemID=1505 [pt_PT]
dc.title: Morphological Tagging of a Spoken Portuguese Corpus Using Available Resources [pt_PT]
dc.type: book part
dspace.entity.type: Publication
oaire.citation.conferencePlace: Lisboa [pt_PT]
oaire.citation.title: Language Technology for Portuguese: Shallow processing tools and resources [pt_PT]
person.familyName: Mendes
person.givenName: Amália
person.identifier.ciencia-id: 4018-7A6F-1873
person.identifier.orcid: 0000-0001-6815-2674
person.identifier.scopus-author-id: 14035817100
rcaap.rights: openAccess [pt_PT]
rcaap.type: bookPart [pt_PT]
relation.isAuthorOfPublication: 94be597b-a42a-42f4-8f1d-822fa454b910
relation.isAuthorOfPublication.latestForDiscovery: 94be597b-a42a-42f4-8f1d-822fa454b910

Files

Main
Name: mendes_etal_livrotasha.pdf
Size: 96.7 KB
Format: Adobe Portable Document Format

License
Name: license.txt
Size: 1.2 KB
Format: Item-specific license agreed upon to submission