| Name | Description | Size | Format |
|---|---|---|---|
| | | 7.59 MB | Adobe PDF |
Advisor(s)
Abstract(s)
This master's thesis studies the relations between the words written in the abstracts of European Space Agency (ESA) Invitations to Tender (ITTs) and, in particular, whether there is any correlation between those words and the chance of a given country winning the tender. A dataset covering 2013 to 2016 was organized and compiled from the tender status dashboards and the Emits website, both provided by ESA. The code needed to analyze this dataset was then developed in R. We built matrices and graphical representations of the relations between the winning countries, the ESA offices, and the different ESA programs. Based on these, the first points were raised and analyzed. We then selected five countries, based on the number of ITTs awarded and their representation in the ESA offices, for the development of statistical models. These countries are Germany, France, Great Britain (United Kingdom), Italy, and Belgium. Using text mining packages, such as the “TM” package for R, the original abstracts were cleaned to remove irrelevant information that could hinder the analysis. Numbers, white space, and the most frequent words were removed, and all text was converted to lowercase. After these steps, the document-term matrix (DTM) was built. In this matrix, each row is a document (here, the abstract of each ITT) and each column is a variable (here, the most frequent words in the dataset). The DTM is the basis of the entire textual analysis. For each of the five countries with the most ITTs, logistic models were created and stepwise selection methods were applied. The resulting models relate words to the chance of a given country winning an ITT.
The validity of the models was assessed using statistical measures such as the sensitivity versus specificity curve (cut-off point), the area under the ROC curve, and the odds ratio. We then investigated whether the ITTs group into clusters defined by these variables. Different methods were used. The silhouette measure was used to validate the clusters, but the results were not satisfactory. Principal component analysis (PCA) was also applied, but it still left gaps, suggesting that more advanced studies are needed to understand this question.
From this study, we can infer that there are relations between the words written in the ITT abstracts and the chance of a given country winning a given ITT. For this reason, the topic deserves further development in future work.
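As an illustration of the cleaning and document-term matrix steps described in the abstract, the following is a minimal sketch in Python using only the standard library. It is not the thesis code, which was written in R with the TM package. The function names `clean` and `build_dtm`, the stop-word list standing in for the removed frequent words, and the three example abstracts are all hypothetical.

```python
import re
from collections import Counter


def clean(text, stopwords):
    """Lowercase, drop digits and punctuation, collapse white space, remove stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # digits and punctuation become spaces
    return [tok for tok in text.split() if tok not in stopwords]


def build_dtm(docs, stopwords=frozenset(), min_docs=1):
    """Return (vocab, dtm): one row per document, one column per retained term."""
    tokenized = [clean(doc, stopwords) for doc in docs]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vocab = sorted(tok for tok, n in doc_freq.items() if n >= min_docs)
    column = {tok: j for j, tok in enumerate(vocab)}
    dtm = [[0] * len(vocab) for _ in docs]
    for i, toks in enumerate(tokenized):
        for tok in toks:
            if tok in column:
                dtm[i][column[tok]] += 1
    return vocab, dtm


# Hypothetical abstracts, invented for illustration only.
abstracts = [
    "Study of 2 antenna designs for satellite payloads",
    "Software tools for satellite data processing",
    "Antenna test campaign 2015",
]
stop = {"of", "for", "the"}  # assumed stop-word list
vocab, dtm = build_dtm(abstracts, stop)
```

Each row of `dtm` can then serve as the predictor vector for one ITT, with a 0/1 response indicating whether a given country won it.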
This master's thesis studies the relations between the words written in European Space Agency (ESA) Invitation to Tender (ITT) abstracts and, in particular, whether there is any correlation between those words and the chance of a certain country winning a bid. An intermediate task was to compile and organize a suitable dataset. The dataset was created from the ESA Dashboards and ESA Emits for 2013 to 2016. We then developed the code needed to analyze this dataset in R. We constructed matrices and graphical representations of the relations between the winner countries, the ESA offices, and the different ESA programs. Based on these, our first points were raised and analyzed. Five countries were selected based on the number of awarded ITTs: Germany, France, Great Britain, Italy, and Belgium. These countries were examined using text mining techniques and statistical models. Using our dataset, we analyzed the full text of the abstracts with R packages for text mining, such as the TM package. The original abstracts were cleaned by removing numbers, white space, and the most frequent words. After these steps, the document-term matrix (DTM) was constructed. The DTM is a matrix in which the rows are the documents (ITT abstracts) and the columns are the variables (most frequent words). The DTM was the basis for the entire textual analysis. Logistic regression models were created for these five countries, and stepwise methods were used for variable selection. The resulting models relate words to the chance of a certain country winning an ITT. The validity of the models was assessed using statistical measures such as the sensitivity versus specificity curve (cut-off point), the area under the ROC curve, the odds ratio, and fitted values. Afterwards, we investigated whether the ITTs cluster in the space defined by the DTM. Different methods were used to define clusters. We checked whether clusters formed in the word-frequency space and also in a space transformed by principal component analysis (PCA).
However, no method yielded a clear clustering under the silhouette criterion, suggesting that more advanced techniques may be needed to determine the true number of clusters. Applying PCA did not reveal agglomeration either, leaving the clustering tendency of the data unresolved. Finally, we conclude that there appear to be some relations between words and winner countries, the reasons for which remain to be studied in future work.
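For readers unfamiliar with the silhouette criterion mentioned above, the sketch below computes the mean silhouette width for a given cluster assignment. For each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the score is (b - a) / max(a, b). Values near 1 indicate well-separated clusters, and values near 0 or below indicate weak or wrong assignments. This is a plain-Python illustration with Euclidean distance, not the thesis code; the function names and the toy points are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def mean_silhouette(points, labels):
    """Average silhouette width; points in singleton clusters score 0."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy rows standing in for DTM rows (hypothetical data).
points = [[3, 0], [2, 0], [0, 3], [0, 2]]
good = mean_silhouette(points, [0, 0, 1, 1])  # groups nearby rows together
bad = mean_silhouette(points, [0, 1, 0, 1])   # mixes distant rows
```

Comparing this average across candidate numbers of clusters is one common way to choose a clustering. A uniformly low value, as the abstract reports, indicates that none of the candidate partitions is well supported.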
Description
Master's project work, Applied Mathematics for Economics and Management, Universidade de Lisboa, Faculdade de Ciências, 2019
Keywords
ESA; Text mining; R; DTM; Logistic regression; Stepwise methods; ITT; Winner country; Clusters; Kmeans; PCA; Master's theses - 2019
