Research project

Untitled

Publications

O forro: a construção de um corpus
Publication . Tiny, Abigail; Amaro, Haldane; Hendrickx, Iris; Hagemeijer, Tjerk
This paper presents the process of building a corpus of spoken and written material in Forro (Santome), a Portuguese-lexified creole spoken on the island of São Tomé. The corpus comprises data from the second half of the 19th century to the present. We address the difficulties typical of non-official languages that are predominantly oral, such as orthographic normalization and a more limited data set. For the corpus compilation we followed corpus linguistics standards, and to encode the metadata we used UTF-8 character encoding and XML. We define a set of metadata and present the tag set developed to annotate the data with linguistic information.
The Gulf of Guinea creoles: A case-study of syntactic reconstruction
Publication . Hagemeijer, Tjerk
This paper argues that creole languages do not face some of the typical problems that have been discussed with respect to syntactic reconstruction of older languages. Creoles often belong to young language families and are therefore expected to show a significant amount of syntactic identity among sister languages. Other factors, such as their isolating typology and geographical isolation, may be additional advantages in the success of syntactic reconstruction. This hypothesis is tested on the four Portuguese-related Gulf of Guinea creoles, where a high degree of identity and the use of other processes, such as directionality, prove to provide good insights into the syntactic features of the proto-language.
A Corpus of Santome
Publication . Hagemeijer, Tjerk; Hendrickx, Iris; Amaro, Haldane; Tiny, Abigail
We present the process of constructing a corpus of spoken and written material for Santome, a Portuguese-related creole language spoken on the island of S. Tomé in the Gulf of Guinea (Africa). Since the language lacks official status, we faced the typical difficulties, such as language variation, lack of standard spelling, lack of basic language instruments, and only a limited data set. The corpus comprises data from the second half of the 19th century until the present. For the corpus compilation we followed corpus linguistics standards and used UTF-8 character encoding and XML to encode metadata. We discuss how we normalized all material to one spelling, how we dealt with cases of language variation, and what type of metadata is used. We also present a POS tag set developed for the Santome language that will be used to annotate the data with linguistic information.
Creole languages and genes: The case of São Tomé and Príncipe
Publication . Hagemeijer, Tjerk; Rocha, Jorge
This article focuses on the gene-language connection between the Portuguese-related Gulf of Guinea creole-speaking populations in São Tomé and Príncipe. The Gulf of Guinea creoles constitute a young language family of four languages spoken on three islands: Santome (ST) and Angolar (AN) on the island of São Tomé; Principense (PR) on Príncipe; and Fa d’Ambo (FA) on Annobón. The latter island, which is part of Equatorial Guinea, is not included in our genetic case-study because its population has not yet been sampled.
The Gulf of Guinea Creole Corpora
Publication . Hagemeijer, Tjerk; Généreux, Michel; Hendrickx, Iris; Mendes, Amália; Tiny, Abigail; Zamora, Armando
We present the process of building linguistic corpora of the Portuguese-related Gulf of Guinea creoles, a cluster of four historically related languages: Santome, Angolar, Principense and Fa d’Ambô. We faced the typical difficulties of languages lacking an official status, such as lack of standard spelling, language variation, lack of basic language instruments, and small data sets, which comprise data from the late 19th century to the present. In order to tackle these problems, the compiled written data and the transcribed spoken data collected during fieldwork trips were adapted to a normalized spelling that was applied to the four languages. For the corpus compilation we followed corpus linguistics standards. We recorded metadata for each file and added morphosyntactic information based on a part-of-speech tag set that was designed to deal with the specificities of these languages. The corpora of three of the four creoles are already available and searchable via an online web interface.

Funders

Funding entity

Fundação para a Ciência e a Tecnologia

Funding programme

3599-PPCDT

Grant number

PTDC/CLE-LIN/111494/2009
