| Name | Description | Size | Format |
|---|---|---|---|
| | | 4.01 MB | Adobe PDF |
Abstract
Natural languages, as typical complex systems, exhibit distinctive properties, such as nonlinearity
and emergence, that arise from the relationships between their elements. Such properties, together with the high dimensionality inherent in extensive vocabularies, make natural languages
intrinsically difficult to model. Word embedding models have tackled these difficulties by using
distributional semantics along with neural-based models for computing vector representations of
words in a space of reduced dimension. In particular, the word2vec model makes use of a
three-layer neural network that generates a vector space, Γ, where a quantitative notion of meaning is
recovered. In this work, we use the word2vec architecture to show that, in the space of reduced
dimension, in addition to meaning, it is also possible to recover a notion of word attractiveness.
In this framework, we define in Γ the quantity mass, M, for each of the V words that form the
vocabulary. It was found that M is positively correlated with the word frequencies in the text,
f, and that both f and M are distributed according to power laws. It was also found that when
the text is shuffled, that is, when word frequencies are kept but word order is changed, practically all
words have M = 0, which suggests that mass is a property that depends on the text's emergent
structure. In addition, we have extended the definition of mass to serve as a connection criterion
for a new linguistic network (a model for languages in terms of a graph structure). It was found
that this network exhibits scale-free and small-world properties and that its topology is significantly affected by text shuffling, in contrast to what is observed for other unsupervised linguistic
networks. We also suggest that the total mass of the system may function as a measure that
represents an intuitive concept of information and that is uniquely defined.
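
The shuffling control described in the abstract (same word frequencies, different word order) can be illustrated with a minimal sketch. This is not the thesis code: the toy sentence, the whitespace tokenization, and the helper name `shuffle_tokens` are assumptions made only for illustration, and the sketch does not compute M, whose definition is not given here.

```python
import random
from collections import Counter


def shuffle_tokens(tokens, seed=0):
    """Return a shuffled copy of `tokens` (hypothetical helper, not from the thesis)."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    shuffled = list(tokens)    # copy, so the original order is left untouched
    rng.shuffle(shuffled)
    return shuffled


# Toy corpus with naive whitespace tokenization (an assumption for illustration).
text = "the cat sat on the mat and the dog sat on the rug"
tokens = text.split()
shuffled = shuffle_tokens(tokens)

# Word frequencies f are identical before and after shuffling;
# only the order of the words, and hence their contexts, changes.
original_freq = Counter(tokens)
shuffled_freq = Counter(shuffled)
same_frequencies = original_freq == shuffled_freq
```

Because the frequency counts are unchanged by construction, any quantity that differs between the original and the shuffled text must depend on word order rather than on f alone.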
Description
Integrated Master's thesis, Engineering Physics, 2022, Universidade de Lisboa, Faculdade de Ciências
Keywords
Quantitative Linguistics; Complex Systems; Linguistic Networks; Word Embeddings (lexical vectorization); Master's theses - 2022
