Publication

Large language models overcome the challenges of unstructured text data in ecology

dc.contributor.author: Castro, Andry
dc.contributor.author: Pinto, João
dc.contributor.author: Reino, Luís
dc.contributor.author: Pipek, Pavel
dc.contributor.author: Capinha, César
dc.date.accessioned: 2025-01-06T12:18:13Z
dc.date.available: 2025-01-06T12:18:13Z
dc.date.issued: 2024
dc.description.abstract: The vast volume of currently available unstructured text data, such as research papers, news, and technical report data, shows great potential for ecological research. However, manual processing of such data is labour-intensive, posing a significant challenge. In this study, we aimed to assess the application of three state-of-the-art prompt-based large language models (LLMs), GPT-3.5, GPT-4, and LLaMA-2-70B, to automate the identification, interpretation, extraction, and structuring of relevant ecological information from unstructured textual sources. We focused on species distribution data from two sources: news outlets and research papers. We assessed the LLMs for four key tasks: classification of documents with species distribution data, identification of regions where species are recorded, generation of geographical coordinates for these regions, and supply of results in a structured format. GPT-4 consistently outperformed the other models, demonstrating a high capacity to interpret textual data and extract relevant information, with the percentage of correct outputs often exceeding 90% (average accuracy across tasks: 87–100%). Its performance also depended on the data source type and task, with better results achieved with news reports, in the identification of regions with species reports and presentation of structured output. Its predecessor, GPT-3.5, exhibited slightly lower accuracy across all tasks and data sources (average accuracy across tasks: 81–97%), whereas LLaMA-2-70B showed the worst performance (37–73%). These results demonstrate the potential benefit of integrating prompt-based LLMs into ecological data assimilation workflows as essential tools to efficiently process large volumes of textual data. [pt_PT]
dc.description.version: info:eu-repo/semantics/publishedVersion [pt_PT]
dc.identifier.citation: Castro, A., Pinto, J., Reino, L., Pipek, P., & Capinha, C. (2024). Large language models overcome the challenges of unstructured text data in ecology. Ecological Informatics, 82, 102742. https://doi.org/10.1016/j.ecoinf.2024.102742 [pt_PT]
dc.identifier.doi: 10.1016/j.ecoinf.2024.102742 [pt_PT]
dc.identifier.eissn: 1878-0512
dc.identifier.issn: 1574-9541
dc.identifier.uri: http://hdl.handle.net/10400.5/96850
dc.language.iso: eng [pt_PT]
dc.peerreviewed: yes [pt_PT]
dc.relation: UID/04413/2020 [pt_PT]
dc.relation: “iBiogeography: aproveitar o poder dos Big Data para a monitorização biogeográfica” iBiogeography: harnessing and measuring the power of big, unstructured data for biogeographical monitoring
dc.relation: 10.54499/PTDC/BIA-ECO/0207/2020 [pt_PT]
dc.relation: Centre of Geographical Studies
dc.relation: Not Available
dc.relation.publisherversion: https://www.sciencedirect.com/science/article/pii/S157495412400284X?via%3Dihub [pt_PT]
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/ [pt_PT]
dc.subject: AI [pt_PT]
dc.subject: Automation [pt_PT]
dc.subject: Data integration [pt_PT]
dc.subject: GPT [pt_PT]
dc.subject: LLaMA [pt_PT]
dc.subject: Unstructured data [pt_PT]
dc.title: Large language models overcome the challenges of unstructured text data in ecology [pt_PT]
dc.type: journal article
dspace.entity.type: Publication
oaire.awardTitle: “iBiogeography: aproveitar o poder dos Big Data para a monitorização biogeográfica” iBiogeography: harnessing and measuring the power of big, unstructured data for biogeographical monitoring
oaire.awardTitle: Centre of Geographical Studies
oaire.awardTitle: Not Available
oaire.awardURI: info:eu-repo/grantAgreement/FCT//PRT%2FBD%2F152100%2F2021/PT
oaire.awardURI: info:eu-repo/grantAgreement/FCT/Concurso de avaliação no âmbito do Programa Plurianual de Financiamento de Unidades de I&D (2017%2F2018) - Financiamento Base/UIDB%2F00295%2F2020/PT
oaire.awardURI: info:eu-repo/grantAgreement/FCT/6817 - DCRRNI ID/UIDP%2F00295%2F2020/PT
oaire.awardURI: info:eu-repo/grantAgreement/FCT/CEEC IND 2017/CEECIND%2F00445%2F2017%2FCP1423%2FCP1645%2FCT0003/PT
oaire.citation.startPage: 102742 [pt_PT]
oaire.citation.title: Ecological Informatics [pt_PT]
oaire.citation.volume: 82 [pt_PT]
oaire.fundingStream: Concurso de avaliação no âmbito do Programa Plurianual de Financiamento de Unidades de I&D (2017/2018) - Financiamento Base
oaire.fundingStream: 6817 - DCRRNI ID
oaire.fundingStream: CEEC IND 2017
person.familyName: Castro
person.familyName: Capinha
person.givenName: Andry
person.givenName: César
person.identifier.ciencia-id: E519-A431-3F59
person.identifier.ciencia-id: 7714-2A88-CDE3
person.identifier.orcid: 0000-0001-5635-5271
person.identifier.orcid: 0000-0002-0666-9755
person.identifier.rid: K-6439-2017
person.identifier.scopus-author-id: 32867555000
project.funder.identifier: http://doi.org/10.13039/501100001871
project.funder.identifier: http://doi.org/10.13039/501100001871
project.funder.identifier: http://doi.org/10.13039/501100001871
project.funder.identifier: http://doi.org/10.13039/501100001871
project.funder.name: Fundação para a Ciência e a Tecnologia
project.funder.name: Fundação para a Ciência e a Tecnologia
project.funder.name: Fundação para a Ciência e a Tecnologia
project.funder.name: Fundação para a Ciência e a Tecnologia
rcaap.rights: openAccess [pt_PT]
rcaap.type: article [pt_PT]
relation.isAuthorOfPublication: c93dd827-f2f4-4491-b800-aa5f1aea6064
relation.isAuthorOfPublication: 4c666e7e-4ba8-4a41-8064-d26b3b9fc0f8
relation.isAuthorOfPublication.latestForDiscovery: 4c666e7e-4ba8-4a41-8064-d26b3b9fc0f8
relation.isProjectOfPublication: cb71fcb3-4eae-4faa-9f4a-097e3610fc88
relation.isProjectOfPublication: 64dca69f-9476-4071-8b46-6ac3dfe01c6b
relation.isProjectOfPublication: 0b403611-7396-429d-9c07-4f3dd8a81d98
relation.isProjectOfPublication: 8110f160-946a-47eb-97f1-5775b736bec4
relation.isProjectOfPublication.latestForDiscovery: 8110f160-946a-47eb-97f1-5775b736bec4

Files

Main
Showing 1 - 1 of 1
Name: Castro_Pinto_Reino_Pipek_Capinha_2024.pdf
Size: 1.98 MB
Format: Adobe Portable Document Format

License
Showing 1 - 1 of 1
Name: license.txt
Size: 1.2 KB
Format: Item-specific license agreed upon to submission