Abstract
Eye tracking scanpaths encode the temporal sequence and spatial distribution of eye movements, offering insights into visual attention and aesthetic perception. However, analysing scanpaths still requires substantial manual effort and specialised expertise, which limits the scalability and constrains the objectivity of eye tracking methods. This paper examines whether and how multimodal large language models (MLLMs) can provide objective, expert-level scanpath interpretations. Using GPT-4o as a case study, we developed an eye tracking scanpath analysis (ETSA) approach that integrates (1) structural information extraction to parse scanpath events, (2) a knowledge base of visual-behaviour expertise, and (3) least-to-most and few-shot chain-of-thought prompt engineering to guide reasoning. We conducted two studies to evaluate the reliability and effectiveness of the approach, an ablation analysis to quantify the contribution of the knowledge base, and a cross-model evaluation to assess generalisability across different MLLMs. The results of a repeated-measures experiment show a high semantic similarity of 0.884 and moderate feature-level agreement (F1 = 0.476) with expert scanpath interpretations, and no significant difference from expert annotations based on the exact McNemar test (p = 0.545). Together with the ablation and cross-model findings, this study contributes a generalisable and reliable pipeline for MLLM-based scanpath interpretation, supporting efficient analysis of complex eye tracking data.
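The two agreement statistics named in the abstract, feature-level F1 and the exact McNemar test, can be illustrated with a minimal sketch. The abstract does not describe how the authors computed them, so the function names, the feature labels, and the discordant counts below are hypothetical and serve only to show the standard definitions.

```python
from math import comb


def f1_score(predicted: set, expert: set) -> float:
    """F1 between model-reported and expert-reported feature labels.

    Both arguments are sets of (hypothetical) feature labels for one scanpath.
    """
    true_positives = len(predicted & expert)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expert)
    return 2 * precision * recall / (precision + recall)


def exact_mcnemar_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b and c are the discordant pair counts: cases where only one of the
    two raters (for example, model vs. expert) marked a feature as present.
    Under the null hypothesis, min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Illustrative values only; these are not data from the study.
model_features = {"central fixation", "left-to-right sweep", "revisit"}
expert_features = {"left-to-right sweep", "revisit", "edge scan", "long dwell"}
print(round(f1_score(model_features, expert_features), 4))
print(exact_mcnemar_p(3, 7))
```

A large p-value from the exact McNemar test, such as the one reported in the abstract, means the paired disagreements between the two raters are not significantly skewed towards either one. It does not by itself show that they agree, which is why it is reported alongside F1 and semantic similarity.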
Publication Info
- Year: 2025
- Type: article
- Volume: 6
- Issue: 4
- Pages: 164-164
- Citations: 0
- Access: Closed
Identifiers
- DOI: 10.3390/modelling6040164