Please use this identifier to cite or link to this item: http://hdl.handle.net/10451/37497
Title: Corpus-based extraction and identification of Portuguese Multiword Expressions
Author: Antunes, Sandra
Nascimento, Maria Fernanda Bacelar do
Casteleiro, João Miguel
Mendes, Amália
Pereira, Luísa
Sá, Tiago
Keywords: Multiword expressions
Collocations
Information extraction
Lexical database
Lexical association measures
Typology of multiword expressions
Issue Date: 2006
Publisher: Université Catholique de Louvain
Citation: Antunes, S., Bacelar do Nascimento, M. F., Casteleiro, J. M., Mendes, A., Pereira, L. & Sá, T. (2006): "Corpus-based extraction and identification of Portuguese Multiword Expressions", in Traitement Automatique des Langues Naturelles - TALN 2006, Leuven, April 10-13, 2006.
Abstract: This presentation reports the methodology followed and the results attained in an ongoing project that aims to build a large lexical database of corpus-extracted multiword (MW) expressions for the Portuguese language. MW expressions were automatically extracted from a balanced 50-million-word corpus compiled for this project and statistically interpreted using lexical association measures; they are now undergoing manual validation. The lexical database covers different types of MW expressions, from named entities to lexical associations with different degrees of cohesion, ranging from totally frozen idioms to favoured co-occurring forms, such as collocations. We aim to achieve two main objectives with this resource: first, to build on the large set of data on different types of MW expressions in order to revise existing typologies of collocations and integrate them into a larger theory of MW units; second, to use the extensive hand-checked data as training data to evaluate existing statistical lexical association measures.
This article presents the methodology followed and the results obtained in a project whose goal is to build a large database of multiword expressions for the Portuguese language. These multiword expressions were automatically extracted from a balanced corpus of 50 million words, statistically interpreted using lexical association measures, and then manually verified. The lexical database covers different types of multiword expressions with different degrees of cohesion, ranging from near-total fixedness to groups of words that preferentially occur together, such as collocations. The large set of data in this resource will allow a revision of the typologies of multiword units in Portuguese and the evaluation of different lexical association measures.
URI: http://hdl.handle.net/10451/37497
Appears in Collections: FL - CLUL - Livros de Actas

Files in This Item:
File: paper_taln_2006_antunes_final_version.pdf
Size: 766,81 kB
Format: Adobe PDF



Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.