Research project
iBiogeography: harnessing and measuring the power of big, unstructured data for biogeographical monitoring (Portuguese title: “iBiogeography: aproveitar o poder dos Big Data para a monitorização biogeográfica”)
Funder
Authors
Publications
Who is reporting non-native species and how? A cross-expert assessment of practices and drivers of non-native biodiversity reporting in species regional listing
Publication. Castro, Andry; Ribeiro, Joana; Reino, Luís; Capinha, César
Each year, hundreds of scientific works with species' geographical data are published. However, these data can be challenging to identify, collect, and integrate into analytical workflows due to differences in reporting structures, storage formats, and the omission or inconsistency of relevant information and terminology. These difficulties tend to be aggravated for non-native species, given varying attitudes toward non-native species reporting and the existence of an additional layer of invasion-related terminology. Thus, our objective is to identify the current practices and drivers of the geographical reporting of non-native species in the scientific literature. We conducted an online survey targeting authors of species regional checklists, a widely published source of biogeographical data, in which we asked about reporting habits and perceptions regarding non-native taxa. The responses and the relationships between response variables and predictors were analyzed using descriptive statistics and ordinal logistic regression models. With a response rate of 22.4% (n = 113), we found that nearly half of respondents (45.5%) do not always report non-native taxa, and of those who report, many (44.7%) do not always differentiate them from native taxa. Close to half of respondents (46.4%) also view the terminology of biological invasions as an obstacle to the reporting of non-native taxa. The ways in which checklist information is provided are varied, but mainly correspond to descriptive text and embedded tables, with non-native species (when given) mentioned alongside native species. Only 13.4% of respondents report always providing the data in automation-friendly formats or publishing them in biodiversity data repositories. Data on the distribution of non-native species are essential for monitoring global biodiversity change and preventing biological invasions. Despite their importance, our results show an urgent need to improve the frequency, accessibility, and consistency of publication of these data.
Large language models overcome the challenges of unstructured text data in ecology
Publication. Castro, Andry; Pinto, João; Reino, Luís; Pipek, Pavel; Capinha, César
The vast volume of currently available unstructured text data, such as research papers, news, and technical report data, shows great potential for ecological research. However, manual processing of such data is labour-intensive, posing a significant challenge. In this study, we aimed to assess the application of three state-of-the-art prompt-based large language models (LLMs), GPT-3.5, GPT-4, and LLaMA-2-70B, to automate the identification, interpretation, extraction, and structuring of relevant ecological information from unstructured textual sources. We focused on species distribution data from two sources: news outlets and research papers. We assessed the LLMs for four key tasks: classification of documents with species distribution data, identification of regions where species are recorded, generation of geographical coordinates for these regions, and supply of results in a structured format. GPT-4 consistently outperformed the other models, demonstrating a high capacity to interpret textual data and extract relevant information, with the percentage of correct outputs often exceeding 90% (average accuracy across tasks: 87–100%). Its performance also depended on the data source type and task, with better results achieved with news reports, in the identification of regions with species reports and presentation of structured output. Its predecessor, GPT-3.5, exhibited slightly lower accuracy across all tasks and data sources (average accuracy across tasks: 81–97%), whereas LLaMA-2-70B showed the worst performance (37–73%). These results demonstrate the potential benefit of integrating prompt-based LLMs into ecological data assimilation workflows as essential tools to efficiently process large volumes of textual data.
Organizational units
Description
Keywords
Contributors
Funders
Funding entity
Fundação para a Ciência e a Tecnologia
Funding programme
Award number
PRT/BD/152100/2021
