Dealing with language emergent behavior using vectors of reduced dimension

File: TM_Luís_Cordeiro.pdf (4.01 MB, Adobe PDF)

Abstract

Natural languages, as typical complex systems, exhibit distinctive properties arising from the relationships between their elements, such as nonlinearity and emergence. Such properties, together with the high dimensionality inherent in extensive vocabularies, make natural languages intrinsically difficult to model. Word embedding models have tackled these difficulties by using distributional semantics along with neural-based models for computing vector representations of words in a space of reduced dimension. In particular, the word2vec model makes use of a 3-layer neural network that generates a vector space, Γ, where a quantitative notion of meaning is recovered. In this work, we use the word2vec architecture to show that, in the space of reduced dimension, in addition to meaning, it is also possible to recover a notion of word attractiveness. In this framework, we define in Γ the quantity mass, M, for each of the V words that form the vocabulary. It was found that M is positively correlated with the word frequencies in the text, f, and that both f and M are distributed according to power laws. It was also found that when the text is shuffled, that is, keeping word frequencies but changing their order, practically all words have M = 0, which suggests that mass is a property that does not bypass the text’s emergent structure. In addition, we have extended the definition of mass to serve as a connection criterion for a new linguistic network (a model for languages in terms of a graph structure). It was found that this network exhibits scale-free and small-world properties and that its topology is significantly affected by text shuffling, in contrast to what is observed for other unsupervised linguistic networks. We also suggest that the total mass of the system may function as a measure that represents an intuitive concept of information and that is uniquely defined.

Description

Integrated Master's thesis, Engineering Physics, 2022, Universidade de Lisboa, Faculdade de Ciências

Keywords

Quantitative Linguistics; Complex Systems; Linguistic Networks; Lexical Vectorization; Master's theses - 2022
