Repository logo
 
Loading...
Project Logo
Research Project

Word combinations in portuguese language COMBINA-PT

Authors

Publications

COMBINA-PT: a Large Corpus-extracted and Hand-checked Lexical Database of Portuguese Multiword Expressions
Publication . Mendes, Amália; Antunes, Sandra; Bacelar do Nascimento, Maria Fernanda; Casteleiro, João Miguel; Pereira, Luísa; Sá, Tiago
This paper presents the COMBINA-PT project, a study of corpus-extracted Portuguese Multiword (MW) expressions. The objective of this on-going project is to compile a large lexical database of multiword (MW) units of the Portuguese language, automatically extracted from a balanced 50 million word corpus, interpreted with lexical association measures and manually validated. MW expressions considered in the database include named entities and lexical associations with different degrees of cohesion, ranging from frozen groups, which undergo little or no variation, to lexical collocations composed of words that tend to occur together and that constitute syntactic dependencies, although with a low degree of fixedness. This new resource has a two-fold objective: (i) to be an important research tool which supports the development of MW expressions typologies and their lexicographic treatment; (ii) to be of major help in developing and evaluating language processing tools able of dealing with MW expressions
Corpus-based extraction and identification of Portuguese Multiword Expressions
Publication . Antunes, Sandra; Nascimento, Maria Fernanda Bacelar do; Casteleiro, João Miguel; Mendes, Amália; Pereira, Luísa; Sá, Tiago
This presentation reports the methodology followed and the results attained on an on-going project aiming at building a large lexical database of corpus-extracted multiword (MW) expressions for the Portuguese language. MW expressions were automatically extracted from a balanced 50 million word corpus compiled for this project, furthermore statistically interpreted using lexical association measures and are undergoing a manual validation process. The lexical database covers different types of MW expressions, from named entities to lexical associations with different degrees of cohesion, ranging from totally frozen idioms to favoured co-occurring forms, like collocations. We aim to achieve two main objectives with this resource: to build on the large set of data of different types of MW expressions to revise existing typologies of collocations and to integrate them in a larger theory of MW units; to use the extensive hand-checked data as training data to evaluate existing statistical lexical association measures.
A Lexical Database of Portuguese Multiword Expressions
Publication . Antunes, Sandra; Nascimento, Maria Fernanda Bacelar do; Casteleiro, João Miguel; Mendes, Amália; Pereira, Luísa; Sá, Tiago
This presentation focuses on an ongoing project which aims at the creation of a large lexical database of Portuguese multiword (MW) units, automatically extracted through the analysis of a balanced 50 million word corpus, statistically interpreted with lexical association measures and validated by hand. This database covers different types of MW units, like named entities, and lexical associations ranging from sets of favoured co-occurring forms with high corpus frequency and low cohesion to strongly lexicalized expressions with no, or minimum, variation. This new resource has a two-fold objective: to be an important research tool which supports the development of collocation typologies and their integration in a larger theory of MW units; to be of major help in developing and evaluating language processing tools able of dealing with MW expressions.

Organizational Units

Description

Keywords

Contributors

Funders

Funding agency

Fundação para a Ciência e a Tecnologia

Funding programme

POCI

Funding Award Number

POCTI/LIN/48465/2002

ID