Örebro University Publications (oru.se)
EgoTV: Egocentric Task Verification from Natural Language Task Descriptions
Örebro University, School of Science and Technology (MPI, AASS). ORCID iD: 0000-0003-3422-2085
Meta.
2023 (English). In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV): Proceedings, IEEE, 2023, p. 15371-15383. Conference paper, Published paper (Refereed).
Abstract [en]

To enable progress towards egocentric agents capable of understanding everyday tasks specified in natural language, we propose a benchmark and a synthetic dataset called Egocentric Task Verification (EgoTV). The goal in EgoTV is to verify the execution of tasks from egocentric videos based on the natural language description of these tasks. EgoTV contains pairs of videos and their task descriptions for multi-step tasks -- these tasks contain multiple sub-task decompositions, state changes, object interactions, and sub-task ordering constraints. In addition, EgoTV also provides abstracted task descriptions that contain only partial details about ways to accomplish a task. Consequently, EgoTV requires causal, temporal, and compositional reasoning across video and language modalities, which is missing in existing datasets. We also find that existing vision-language models struggle with the all-round reasoning needed for task verification in EgoTV. Inspired by the needs of EgoTV, we propose a novel Neuro-Symbolic Grounding (NSG) approach that leverages symbolic representations to capture the compositional and temporal structure of tasks. We demonstrate NSG's capability towards task tracking and verification on our EgoTV dataset and a real-world dataset derived from CrossTask (CTV). We open-source the EgoTV and CTV datasets and the NSG model for future research on egocentric assistive agents.

Place, publisher, year, edition, pages
IEEE, 2023. p. 15371-15383
Series
IEEE International Conference on Computer Vision (ICCV), ISSN 1550-5499, E-ISSN 2380-7504
Keywords [en]
Video Task Verification, Computer Vision, Language Understanding, Neuro-Symbolic Reasoning
National Category
Computer graphics and computer vision
Identifiers
URN: urn:nbn:se:oru:diva-108102
DOI: 10.1109/ICCV51070.2023.01414
ISI: 001169499007076
Scopus ID: 2-s2.0-85180427181
ISBN: 9798350307184 (electronic)
ISBN: 9798350307191 (print)
OAI: oai:DiVA.org:oru-108102
DiVA, id: diva2:1794433
Conference
International Conference on Computer Vision (ICCV 2023), Paris, France, October 2-6, 2023
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Available from: 2023-09-05. Created: 2023-09-05. Last updated: 2025-09-01.
In thesis
1. Neurosymbolic Decision-Making with Large Language Models
2025 (English)Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Reasoning and decision-making are foundational challenges in artificial intelligence (AI). These processes are closely linked – an intelligent agent must reason about its environment and goals in order to make decisions and select actions. Two principal frameworks for sequential decision-making are AI planning and reinforcement learning (RL). Planning assumes access to a known model of the environment and uses symbolic representations to compute a sequence of actions that leads from an initial state to a desired goal. In contrast, RL focuses on learning behavior through interaction, enabling agents to develop policies that maximize long-term reward under uncertainty. Despite methodological differences, both approaches aim to generate intelligent, goal-directed action sequences.

The rise of Large Language Models (LLMs) has sparked significant interest in their potential to perform reasoning, planning, and decision-making tasks. Despite their impressive performance in natural language understanding and generalization, there is growing skepticism about whether LLMs genuinely reason or merely leverage statistical correlations. This dissertation investigates this question through a principled evaluation grounded in computational theory, using 3-SAT – the canonical NP-complete problem – as a testbed. The findings demonstrate that LLMs fail to exhibit sound and complete reasoning, especially on complex instances where shallow heuristics fail, and that their apparent reasoning abilities often stem from overfitting to statistical patterns.
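To make the 3-SAT testbed concrete: a 3-SAT instance is a conjunction of three-literal clauses, and a sound-and-complete decision procedure must classify every instance correctly. As an illustration only (this is a minimal brute-force sketch, not the dissertation's actual evaluation harness), a satisfiability check over DIMACS-style clauses might look like:

```python
import itertools

def is_satisfiable(num_vars, clauses):
    """Brute-force 3-SAT check: try every truth assignment.

    A clause is a tuple of nonzero ints; literal k means variable k
    is true, -k means variable k is false (DIMACS-style encoding).
    """
    for assignment in itertools.product([False, True], repeat=num_vars):
        def lit_true(lit):
            value = assignment[abs(lit) - 1]
            return value if lit > 0 else not value
        # The formula is satisfied when every clause has a true literal.
        if all(any(lit_true(lit) for lit in clause) for clause in clauses):
            return True
    return False

# (x1 or x2 or not x3) and (not x1 or x3 or x2) -- satisfiable
print(is_satisfiable(3, [(1, 2, -3), (-1, 3, 2)]))    # True
# (x1) and (not x1), padded with repeated literals -- unsatisfiable
print(is_satisfiable(1, [(1, 1, 1), (-1, -1, -1)]))   # False
```

The exponential loop over assignments is exactly what makes 3-SAT a useful probe: shallow heuristics can guess easy instances, but soundness and completeness require handling the hard ones too.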

To address these limitations, this dissertation proposes a range of neurosymbolic architectures that combine the generative flexibility of LLMs with the rigor and reliability of symbolic methods. Empirical evaluations across planning, reward design, and plan verification tasks show that such integration yields systems that are more robust and accurate. This work advances our theoretical and practical understanding of LLM-based reasoning, provides concrete design principles for neurosymbolic systems, and charts a path toward AI agents that integrate world knowledge with logical precision.

Place, publisher, year, edition, pages
Örebro: Örebro University, 2025. p. 67
Series
Örebro Studies in Technology, ISSN 1650-8580 ; 106
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:oru:diva-122456
ISBN: 9789175296869
Public defence
2025-10-17, Örebro universitet, Långhuset, Hörsal L2, Fakultetsgatan 1, Örebro, 13:00 (English)
Available from: 2025-07-22. Created: 2025-07-22. Last updated: 2025-09-04. Bibliographically approved.

Open Access in DiVA

EgoTV: Egocentric Task Verification from Natural Language Task Descriptions (4510 kB). 486 downloads.
File information
File name: FULLTEXT01.pdf
File size: 4510 kB
Checksum: SHA-512
9b276419a3b1499b3375320fe8f6c8ded0cc0014ed408933a80a0d4afe0cdc0eab9c3a12846912d843f6cfcf38ffae756a53965c2854568d2b15f20e6b4dba6c
Type: fulltext
Mimetype: application/pdf

Other links

Publisher's full text
Scopus
arXiv

Authority records

Hazra, Rishi

Total: 487 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.
