REvolve: Reward Evolution with Large Language Models using Human Feedback
Örebro University, School of Science and Technology (Centre for Applied Autonomous Sensor Systems (AASS)). ORCID iD: 0000-0003-3422-2085
Örebro University, School of Science and Technology (Centre for Applied Autonomous Sensor Systems (AASS)). ORCID iD: 0009-0007-4357-9533
Örebro University, School of Science and Technology (Centre for Applied Autonomous Sensor Systems (AASS)).
Örebro University, School of Science and Technology (Centre for Applied Autonomous Sensor Systems (AASS)). ORCID iD: 0000-0002-3122-693X
2025 (English). In: 13th International Conference on Learning Representations (ICLR 2025): Proceedings, International Conference on Learning Representations, ICLR, 2025, p. 25710-25751. Conference paper, Published paper (Refereed)
Abstract [en]

Designing effective reward functions is crucial to training reinforcement learning (RL) algorithms. However, this design is non-trivial, even for domain experts, due to the subjective nature of certain tasks that are hard to quantify explicitly. In recent works, large language models (LLMs) have been used for reward generation from natural language task descriptions, leveraging their extensive instruction tuning and commonsense understanding of human behavior. In this work, we hypothesize that LLMs, guided by human feedback, can be used to formulate reward functions that reflect human implicit knowledge. We study this in three challenging settings - autonomous driving, humanoid locomotion, and dexterous manipulation - wherein notions of “good” behavior are tacit and hard to quantify. To this end, we introduce REvolve, a truly evolutionary framework that uses LLMs for reward design in RL. REvolve generates and refines reward functions by utilizing human feedback to guide the evolution process, effectively translating implicit human knowledge into explicit reward functions for training (deep) RL agents. Experimentally, we demonstrate that agents trained on REvolve-designed rewards outperform other state-of-the-art baselines. 
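
The evolutionary loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: a reward function is reduced to a dictionary of weights, the LLM step is replaced by a random crossover-and-mutation stand-in (`propose_variation`), and the human evaluators are replaced by a simulated scorer (`human_fitness`) that compares a candidate against a hidden preference. All names and numbers are hypothetical, and the RL training that would happen between generating a reward and collecting feedback is omitted.

```python
import random
from typing import Callable, Dict, List

# Stand-in for an LLM-written reward function: a dictionary of term weights.
Reward = Dict[str, float]

# Hypothetical "tacit" preference known only to the simulated human.
HIDDEN_PREFERENCE: Reward = {"speed": 0.3, "smoothness": 0.9, "safety": 1.0}


def human_fitness(candidate: Reward) -> float:
    """Simulated human score; higher is better, 0.0 is a perfect match."""
    return -sum((candidate[k] - HIDDEN_PREFERENCE[k]) ** 2 for k in HIDDEN_PREFERENCE)


def propose_variation(a: Reward, b: Reward, rng: random.Random) -> Reward:
    """Stand-in for the LLM step: crossover of two parents plus one small mutation."""
    child = {k: (a[k] if rng.random() < 0.5 else b[k]) for k in a}
    key = rng.choice(sorted(child))
    child[key] += rng.gauss(0.0, 0.1)
    return child


def evolve(
    fitness: Callable[[Reward], float],
    generations: int = 30,
    pop_size: int = 8,
    n_parents: int = 4,
    seed: int = 0,
) -> List[float]:
    """Run the loop and return the best fitness seen in each generation."""
    rng = random.Random(seed)
    population = [{k: rng.random() for k in HIDDEN_PREFERENCE} for _ in range(pop_size)]
    best_per_generation: List[float] = []
    for _ in range(generations):
        # Feedback step: rank candidates by the (simulated) human score.
        ranked = sorted(population, key=fitness, reverse=True)
        best_per_generation.append(fitness(ranked[0]))
        # Selection: the top candidates survive unchanged (elitism).
        parents = ranked[:n_parents]
        # Variation: fill the rest of the population with new proposals.
        children = [
            propose_variation(*rng.sample(parents, 2), rng)
            for _ in range(pop_size - n_parents)
        ]
        population = parents + children
    return best_per_generation


if __name__ == "__main__":
    history = evolve(human_fitness)
    print(f"best score, first generation: {history[0]:.4f}")
    print(f"best score, last generation:  {history[-1]:.4f}")
```

Because the top candidates are carried over unchanged, the best score in this sketch never decreases from one generation to the next. In a real setting the scoring step would come from people judging the behavior of trained agents rather than from a closed-form function.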

Place, publisher, year, edition, pages
International Conference on Learning Representations, ICLR, 2025. p. 25710-25751
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:oru:diva-123277
DOI: 10.48550/arXiv.2406.01309
Scopus ID: 2-s2.0-105010222426
ISBN: 9798331320850 (electronic)
OAI: oai:DiVA.org:oru-123277
DiVA, id: diva2:1993704
Conference
13th International Conference on Learning Representations (ICLR 2025), Singapore, April 24-28, 2025
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Knut and Alice Wallenberg Foundation
Available from: 2025-09-01 Created: 2025-09-01 Last updated: 2026-01-16 Bibliographically approved
In thesis
1. Neurosymbolic Decision-Making with Large Language Models
2025 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Reasoning and decision-making are foundational challenges in artificial intelligence (AI). These processes are closely linked – an intelligent agent must reason about its environment and goals in order to make decisions and select actions. Two principal frameworks for sequential decision-making are AI planning and reinforcement learning (RL). Planning assumes access to a known model of the environment and uses symbolic representations to compute a sequence of actions that leads from an initial state to a desired goal. In contrast, RL focuses on learning behavior through interaction, enabling agents to develop policies that maximize long-term reward under uncertainty. Despite methodological differences, both approaches aim to generate intelligent, goal-directed action sequences.

The rise of Large Language Models (LLMs) has sparked significant interest in their potential to perform reasoning, planning, and decision-making tasks. Despite their impressive performance in natural language understanding and generalization, there is growing skepticism about whether LLMs genuinely reason or merely leverage statistical correlations. This dissertation investigates this question through a principled evaluation grounded in computational theory, using 3-SAT – the canonical NP-complete problem – as a testbed. The findings demonstrate that LLMs fail to exhibit sound and complete reasoning, especially on complex instances where shallow heuristics fail, and that their apparent reasoning abilities often stem from overfitting to statistical patterns.

To address these limitations, this dissertation proposes a range of neurosymbolic architectures that combine the generative flexibility of LLMs with the rigor and reliability of symbolic methods. Empirical evaluations across planning, reward design, and plan verification tasks show that such integration yields systems that are more robust and accurate. This work advances our theoretical and practical understanding of LLM-based reasoning, provides concrete design principles for neurosymbolic systems, and charts a path toward AI agents that integrate world knowledge with logical precision.

Place, publisher, year, edition, pages
Örebro: Örebro University, 2025. p. 67
Series
Örebro Studies in Technology, ISSN 1650-8580 ; 106
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:oru:diva-122456
ISBN: 9789175296869
Public defence
2025-10-17, Örebro universitet, Långhuset, Hörsal L2, Fakultetsgatan 1, Örebro, 13:00 (English)
Available from: 2025-07-22 Created: 2025-07-22 Last updated: 2025-09-04 Bibliographically approved

Authority records

Hazra, Rishi; Sygkounas, Alkis; Persson, Andreas; Loutfi, Amy; Zuidberg dos Martires, Pedro