ST-DiffEye: Diffusion-based Continuous Gaze Generation via Joint Scanpath-Trajectory Modeling

Preprint
University of Illinois Urbana-Champaign
† Corresponding author

Abstract

We study the problem of human gaze modeling, which aims to generate the gaze patterns a viewer produces while observing a visual stimulus. Gaze is primarily captured through two modalities: continuous eye-tracking trajectories, which describe fine-grained motion dynamics, and discrete scanpaths, which describe high-level fixation structure. Because gaze varies substantially across viewers and trials, we treat this variability as a defining property rather than noise and model gaze as a stochastic generative process. Existing generative gaze models supervise on only one of these two representations in isolation. We hypothesize that trajectories and scanpaths describe gaze at complementary scales and are jointly informative during training, and test this hypothesis through ST-DiffEye, a joint trajectory-scanpath diffusion framework that couples both modalities by concatenating them as an additional raw input channel, requiring no architectural overhead beyond an input and output channel expansion. We further introduce a principled evaluation framework based on the Continuous Ranked Probability Score (CRPS), which generalizes any existing sequence similarity metric into a proper scoring rule that jointly assesses the accuracy and diversity of generated gaze. Experiments on task-driven visual search, covering both target-present and target-absent scenarios, and on free-viewing benchmarks demonstrate state-of-the-art performance. These results, along with detailed ablations, confirm the benefit of joint modeling and the value of distribution-aware evaluation in capturing the intrinsic variability of human gaze.

Contributions

  • A joint scanpath-trajectory diffusion framework that couples both modalities via channel-level concatenation, enabling end-to-end training that exploits their complementarity without additional architectural complexity.
  • A family of CRPS-based metrics that turns any existing sequence similarity measure into a proper probabilistic scoring rule, evaluating both the distributional accuracy and the diversity of generative gaze models.
  • State-of-the-art performance on free-viewing and task-driven benchmarks, with ablations confirming the contribution of joint training and the informativeness of the proposed metrics in distinguishing models that capture human gaze variability from those that do not.
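As a rough illustration of the channel-level coupling in the first contribution, the sketch below stacks a per-point fixation index onto an (x, y) trajectory, so each token grows from two channels to three, and splits a model output back into its two parts. The function names, array shapes, and channel order are assumptions made for this example, not the authors' code.

```python
import numpy as np

def augment_with_fixation_channel(trajectory, fixation_index):
    """Stack a per-point fixation index onto a (T, 2) trajectory, giving (T, 3).

    Hypothetical helper: the names and the channel order are assumptions.
    """
    trajectory = np.asarray(trajectory, dtype=np.float32)
    fixation_index = np.asarray(fixation_index, dtype=np.float32)
    if trajectory.ndim != 2 or trajectory.shape[1] != 2:
        raise ValueError("trajectory must have shape (T, 2)")
    if fixation_index.shape != (trajectory.shape[0],):
        raise ValueError("fixation_index must have shape (T,)")
    # Channel-level concatenation: one extra input (and output) channel.
    return np.concatenate([trajectory, fixation_index[:, None]], axis=1)

def split_channels(tokens):
    """Split (T, 3) tokens back into the (T, 2) trajectory and (T,) index."""
    tokens = np.asarray(tokens)
    return tokens[:, :2], tokens[:, 2]

# Toy example: three gaze samples and their fixation indices.
toy_xy = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]])
toy_index = np.array([0.0, 0.5, 1.0])
toy_tokens = augment_with_fixation_channel(toy_xy, toy_index)
```

Under these assumptions, the denoiser would only need its input and output projections widened by one channel; everything between them stays unchanged.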

Figure 1: Complementary representations of human gaze.

Scanpaths, trajectories, and their combination overlaid on a natural image

Prior work models gaze using either (a) scanpaths for fixation-level structure or (b) continuous trajectories for fine-grained motion dynamics. Our approach jointly leverages (c) both, aligning fixation anchors with the underlying trajectory along a shared time axis.

Motivation: Scanpaths provide a coarse structural summary of where attention dwells, while eye-tracking trajectories provide a fine-grained signal for how the eye moves between fixations. Existing generative gaze models supervise on only one of these modalities, discarding complementary cues. ST-DiffEye trains a single diffusion model jointly on both, so fixation-level structure and continuous dynamics inform a shared latent representation.

Figure 2: Overview of ST-DiffEye.

ST-DiffEye pipeline: multimodal conditioning, fixation index construction, DiT denoiser training, and sampling

(1) The visual stimulus I and task condition c are encoded and fused into a joint representation Vjoint. (2) Each trajectory point is assigned a continuous fixation index by anchoring fixations to their nearest trajectory points and linearly interpolating between anchors. (3) The augmented trajectory tokens are denoised by a DiT conditioned on Vjoint, then decoded into a trajectory with a predicted valid length. (4) At sampling time, the generated trajectory is post-processed via fixation extraction to produce the final scanpath.
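Step (2) can be sketched as follows. This is a minimal version under stated assumptions: "nearest" is taken to mean Euclidean distance in image space, anchors are searched in temporal order so they stay strictly increasing, and points before the first or after the last anchor are clamped to the end values. The page does not specify these details, and the function name is hypothetical.

```python
import numpy as np

def continuous_fixation_index(trajectory, fixations):
    """Assign each trajectory point a continuous fixation index.

    trajectory: (T, 2) gaze samples in temporal order.
    fixations:  (K, 2) fixation centers in temporal order, with K <= T.
    Returns a (T,) array that equals k at the anchor of fixation k and is
    linearly interpolated between consecutive anchors.
    """
    traj = np.asarray(trajectory, dtype=float)
    fix = np.asarray(fixations, dtype=float)
    num_points, num_fix = len(traj), len(fix)
    if not 1 <= num_fix <= num_points:
        raise ValueError("need 1 <= number of fixations <= number of points")

    anchors = []
    start = 0
    for k in range(num_fix):
        # Leave room for the remaining fixations so anchors keep increasing.
        stop = num_points - (num_fix - 1 - k)
        dists = np.linalg.norm(traj[start:stop] - fix[k], axis=1)
        anchor = start + int(np.argmin(dists))
        anchors.append(anchor)
        start = anchor + 1

    # np.interp clamps to the first and last index outside the anchor range.
    return np.interp(np.arange(num_points), anchors, np.arange(num_fix))

# Toy example: a straight sweep with fixations at both ends and the middle.
toy_traj = np.stack([np.linspace(0.0, 10.0, 11), np.zeros(11)], axis=1)
toy_fix = np.array([[0.0, 0.0], [5.0, 0.0], [10.0, 0.0]])
toy_index = continuous_fixation_index(toy_traj, toy_fix)
```

In the toy example the anchors fall on points 0, 5, and 10, so the index rises linearly from 0 to 2 along the sweep.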

Figure 3: Why CRPS — failure modes of mean, best, and KLD protocols.

Synthetic Gaussian generators scored under mean, best, KLD, and CRPS protocols

Four generators are evaluated against the same ground-truth Gaussian: zero-variance, low-variance, mid-variance, and realistic. Each non-CRPS protocol is won by a different degenerate generator (mean → zero-variance; best → low-variance; KLD → mid-variance), while only CRPS correctly identifies the realistic generator. CRPS jointly penalizes inaccuracy and insufficient diversity in a single interpretable number.
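The contrast between the mean protocol and CRPS can be reproduced in a few lines. The sketch below uses the standard sample-based estimate of CRPS for scalars (mean distance to the references minus half the mean pairwise distance between samples) with the distance left as a pluggable callable. How the paper extends this to sequence metrics and to similarity scores is not spelled out on this page, so the function names, the choice of absolute difference, and the toy setup are all assumptions.

```python
import numpy as np

def mean_distance(samples, references, distance):
    """'Mean' protocol: average distance from every sample to every reference."""
    return float(np.mean([distance(s, r) for s in samples for r in references]))

def sample_crps(samples, references, distance):
    """Sample-based CRPS-style score (lower is better).

    accuracy: mean distance between generated samples and references.
    spread:   mean distance between distinct pairs of generated samples.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate the spread")
    accuracy = mean_distance(samples, references, distance)
    spread = np.mean([distance(samples[i], samples[j])
                      for i in range(n) for j in range(n) if i != j])
    return float(accuracy - 0.5 * spread)

def abs_diff(a, b):
    return abs(a - b)

rng = np.random.default_rng(0)
references = rng.normal(0.0, 1.0, size=300)   # stand-in for human data
generators = {
    "zero-variance": np.zeros(50),            # always outputs the mean
    "realistic": rng.normal(0.0, 1.0, size=50),
}
scores = {
    name: {
        "mean": mean_distance(samples, references, abs_diff),
        "crps": sample_crps(samples, references, abs_diff),
    }
    for name, samples in generators.items()
}
```

In this toy setting the zero-variance generator obtains the lower mean-protocol distance, while the realistic generator obtains the lower CRPS, which matches the failure mode described for the mean protocol above.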

Figure 4: Qualitative comparison with baselines.

Generated scanpaths compared to baselines across visual search and free-viewing benchmarks

Generated scanpaths on COCO-Search18 (top: target-present and target-absent rows) and the free-viewing benchmarks COCO-FreeView and MIT1003 (bottom). ST-DiffEye produces accurate, task-consistent scanpaths in target-present search and exploratory, human-like coverage in free-viewing, while several baselines collapse to low-diversity outputs or drift from salient regions.

Results

CRPS-based metrics on free-viewing and visual-search datasets. Distance metrics (LD, DFD, DTW, TDE) are lower-is-better (↓); similarity metrics (MM, SM, SS, SSS) are higher-is-better (↑). † marks duration-aware variants. Bold = best within each dataset/comparison block. ST-DiffEye leads especially on distance-based and duration-aware metrics.

Method  LD↓  DFD↓  DTW↓  TDE↓  MM↑  SM↑  SM†↑  SS↑  SS†↑  SSS↑  SSS†↑

COCO-FreeView: Scanpath comparison
DiffEye  56.368  121.355  1116.594  0.051  0.370  0.025  −0.018  0.197  −0.029
ScanDiff  41.856  104.744  910.433  0.049  0.387  0.102  0.089  0.274  0.075
ST-DiffEye (Ours)  47.108  103.009  851.792  0.048  0.377  0.089  0.108  0.206  0.088

MIT1003: Scanpath comparison
DiffEye  27.530  111.506  516.821  0.062  0.377  0.112  0.125  0.328  0.121
ScanDiff  25.189  102.216  496.327  0.058  0.379  0.111  0.118  0.342  0.128
ST-DiffEye (Ours)  24.961  100.903  456.985  0.058  0.380  0.129  0.145  0.321  0.146

COCO-Search18, Target Present: Scanpath comparison
GazeFormer  15.115  181.157  284.039  0.099  0.327  0.044  0.008  0.540  0.052  0.001  −0.045
HAT  10.784  122.199  227.450  0.080  0.398  0.201  0.197  0.185
ScanDiff  9.653  124.455  219.855  0.076  0.371  0.165  0.104  0.521  0.148  0.164  0.089
ST-DiffEye (Ours)  8.716  110.038  196.631  0.074  0.381  0.206  0.146  0.504  0.180  0.211  0.134

COCO-Search18, Target Absent: Scanpath comparison
GazeXplain  28.302  152.507  514.880  0.074  0.350  −0.009  0.001  0.347  0.010  0.002  0.009
HAT  24.371  117.266  439.956  0.064  0.399  0.095  −0.051  0.100
ScanDiff  21.252  119.689  426.350  0.069  0.378  0.086  0.077  0.305  0.090  0.109  0.092
ST-DiffEye (Ours)  21.128  115.683  410.444  0.064  0.381  0.110  0.095  0.288  0.109  0.130  0.109

Excerpt of Table 1 in the paper (scanpath comparison blocks). Full results — including the trajectory comparison blocks and the mean, best-of-N, and KLD protocols — are in the paper.

BibTeX

@article{zhao2026stdiffeye,
  title   = {ST-DiffEye: Diffusion-based Continuous Gaze Generation via Joint Scanpath-Trajectory Modeling},
  author  = {Zhao, Brian Nlong and Kara, Ozgur and Kim, Junho and Rehg, James M.},
  journal = {arXiv preprint},
  year    = {2026}
}
