Örebro University Publications (oru.se)
How Reliable Are GPT-4o and LLAMA3.3-70B in Classifying Natural Language Requirements? The Impact of the Temperature Setting
Örebro University, Örebro University School of Business (CERIS - Centre for Empirical Research on Information Systems). ORCID iD: 0000-0002-3265-7627
Örebro University, Örebro University School of Business (CERIS - Centre for Empirical Research on Information Systems). ORCID iD: 0000-0002-0311-1502
Örebro University, Örebro University School of Business (CERIS - Centre for Empirical Research on Information Systems). ORCID iD: 0000-0002-3722-6797
Örebro University, Örebro University School of Business (CERIS - Centre for Empirical Research on Information Systems). ORCID iD: 0000-0001-8604-8862
2025 (English). In: IEEE Software, ISSN 0740-7459, E-ISSN 1937-4194, Vol. 42, no 6, p. 97-104. Article in journal (Refereed). Published.
Abstract [en]

Classifying natural language requirements (NLRs) plays a crucial role in software engineering, helping us distinguish between functional and non-functional requirements. While large language models offer automation potential, we should address concerns about their consistency, meaning their ability to produce the same results over time. In this work, we share experiences from experimenting with how well GPT-4o and LLAMA3.3-70B classify NLRs using a zero-shot learning approach. Moreover, we explore how the temperature parameter influences classification performance and consistency for these models. Our results show that large language models like GPT-4o and LLAMA3.3-70B can support automated NLR classification. GPT-4o performs well in identifying functional requirements, with the highest consistency occurring at a temperature setting of one. Additionally, non-functional requirements classification improves at higher temperatures, indicating a trade-off between determinism and adaptability. LLAMA3.3-70B is more consistent than GPT-4o, and its classification accuracy varies less with temperature adjustments.
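The abstract refers to zero-shot classification of requirements and to run-to-run consistency. As an illustration only, the sketch below shows one plausible shape of such a setup: a zero-shot prompt (no labelled examples) and a simple agreement-based consistency measure over repeated runs. The prompt wording, the F/NF labels, and the consistency metric are assumptions for illustration, not the authors' actual protocol.

```python
# Hedged sketch of a zero-shot NLR classification setup and a
# consistency measure. The classify() call itself is stubbed out;
# in practice it would query a model (e.g. GPT-4o) at a chosen
# temperature and return one label per run.
from collections import Counter


def build_zero_shot_prompt(requirement: str) -> str:
    """Zero-shot: the model sees only the task description, no examples."""
    return (
        "Classify the following software requirement as functional (F) "
        "or non-functional (NF). Answer with F or NF only.\n\n"
        f"Requirement: {requirement}"
    )


def consistency(labels: list[str]) -> float:
    """Share of repeated runs that agree with the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Example: five repeated classifications of the same requirement.
runs = ["F", "F", "NF", "F", "F"]
print(consistency(runs))  # 0.8
```

Under this (assumed) metric, a perfectly deterministic model scores 1.0; the paper's finding that consistency varies with temperature would show up as this score changing as the temperature parameter is adjusted.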

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2025. Vol. 42, no 6, p. 97-104
Keywords [en]
Software engineering, Predictive models, Accuracy, Transformers, Training, Natural languages, Temperature measurement, Software reliability, Natural language processing
National Category
Information Systems, Social aspects
Research subject
Informatics
Identifiers
URN: urn:nbn:se:oru:diva-122267
DOI: 10.1109/MS.2025.3572561
ISI: 001600046500002
OAI: oai:DiVA.org:oru-122267
DiVA, id: diva2:1981195
Available from: 2025-07-03. Created: 2025-07-03. Last updated: 2025-11-12. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text

Authority records

Karlsson, Fredrik; Chatzipetrou, Panagiota; Gao, Shang; Havstorm, Tanja Elina
