ST-DiffEye: Diffusion-based Continuous Gaze Generation via Joint Scanpath-Trajectory Modeling

Abstract

We study the problem of human gaze modeling, which aims to generate the gaze patterns a viewer produces while observing a visual stimulus. Gaze is primarily captured through two modalities: continuous eye-tracking trajectories, which describe fine-grained motion dynamics, and discrete scanpaths, which describe high-level fixation structure. Because gaze varies substantially across viewers and trials, we treat this variability as a defining property rather than noise and model gaze as a stochastic generative process. Existing generative gaze models supervise on only one of these two representations in isolation. We hypothesize that trajectories and scanpaths describe gaze at complementary scales and are jointly informative during training, and test this hypothesis through ST-DiffEye, a joint trajectory-scanpath diffusion framework that couples both modalities by concatenating them as an additional raw input channel, requiring no architectural overhead beyond an input and output channel expansion. We further introduce a principled evaluation framework based on the Continuous Ranked Probability Score (CRPS), which generalizes any existing sequence similarity metric into a proper scoring rule that jointly assesses the accuracy and diversity of generated gaze. Experiments on task-driven visual search, covering both target-present and target-absent scenarios, and on free-viewing benchmarks demonstrate state-of-the-art performance. These results, along with detailed ablations, confirm the benefit of joint modeling and the value of distribution-aware evaluation in capturing the intrinsic variability of human gaze.
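The channel-level coupling described above can be illustrated with a minimal sketch. The abstract does not say how a discrete scanpath is aligned to a continuous trajectory, so the encoding here is an assumption: each fixation is held at its (x, y) position for its duration in samples, giving a dense signal of the same length as the trajectory. The names `rasterize_scanpath`, `couple`, and `decouple` are hypothetical and are not taken from the paper.

```python
import numpy as np

def rasterize_scanpath(fixations, num_steps):
    """Turn fixations [(x, y, duration_in_samples), ...] into a (num_steps, 2) array.

    Assumed encoding (not from the paper): hold each fixation position for
    its duration, then truncate or pad to the trajectory length.
    Durations are assumed to be positive integers.
    """
    points = np.array([(x, y) for x, y, _ in fixations], dtype=np.float32)
    durations = np.array([d for _, _, d in fixations], dtype=int)
    dense = np.repeat(points, durations, axis=0)[:num_steps]
    if len(dense) < num_steps:
        # Pad with the last fixation when durations do not cover the sequence.
        pad = np.repeat(dense[-1:], num_steps - len(dense), axis=0)
        dense = np.concatenate([dense, pad], axis=0)
    return dense

def couple(trajectory, fixations):
    """Concatenate trajectory (T, 2) and rasterized scanpath (T, 2) into (T, 4)."""
    scan = rasterize_scanpath(fixations, len(trajectory))
    return np.concatenate([trajectory, scan], axis=-1)

def decouple(joint):
    """Split a joint (T, 4) sample back into trajectory and scanpath channels."""
    return joint[..., :2], joint[..., 2:]

trajectory = np.zeros((6, 2))
fixations = [(1.0, 2.0, 2), (3.0, 4.0, 3)]  # toy values for illustration only
joint = couple(trajectory, fixations)
print(joint.shape)
```

Under this assumption a denoiser that previously consumed 2 channels would consume and emit 4, which is the only change the coupling requires in this sketch.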

Contributions

A joint scanpath-trajectory diffusion framework that couples both modalities via channel-level concatenation, enabling end-to-end training that exploits their complementarity without additional architectural complexity.
A CRPS-based metrics family that generalizes any existing sequence similarity measure into a proper probabilistic scoring rule for evaluating the distributional accuracy and diversity of generative gaze models.
State-of-the-art performance on free-viewing and task-driven benchmarks, with ablations confirming the contribution of joint training and the informativeness of the proposed metrics in distinguishing models that capture human gaze variability from those that do not.
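To make the second contribution concrete, the sketch below shows one common sample-based form of CRPS, with a sequence distance d substituted for the absolute difference: the mean of d(sample, reference) minus half the mean of d(sample, sample'). Whether this matches the paper's exact estimator is an assumption. The function `crps_from_distance` is a hypothetical name, and Levenshtein distance over letter-coded fixation sequences is used only as an example base metric.

```python
import numpy as np

def levenshtein(a, b):
    """Edit distance between two sequences via the standard row-by-row DP."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def crps_from_distance(samples, reference, distance):
    """Sample-based CRPS-style score (lower is better) for a generic distance.

    accuracy: mean distance from each generated sample to the reference.
    spread:   mean pairwise distance among generated samples
              (includes i == j pairs, i.e. the simple biased estimate).
    """
    accuracy = np.mean([distance(s, reference) for s in samples])
    spread = np.mean([distance(s, t) for s in samples for t in samples])
    return float(accuracy - 0.5 * spread)

reference = "ABC"  # toy letter-coded scanpath, for illustration only
diverse = crps_from_distance(["ABC", "ABD"], reference, levenshtein)
collapsed = crps_from_distance(["ABD", "ABD"], reference, levenshtein)
print(diverse, collapsed)
```

In this toy case the set that covers the reference and retains some spread scores lower (better) than the collapsed set that misses it, which is the kind of joint accuracy-and-diversity behavior the metrics family is meant to capture.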

Results

CRPS-based metrics on free-viewing and visual-search datasets. Distance metrics (LD: Levenshtein Distance; DFD: Discrete Fréchet Distance; DTW: Dynamic Time Warping; TDE: Time-Delayed Embedding) are lower-is-better (↓); similarity metrics (MM: MultiMatch; SM: ScanMatch; SS: Sequence Score; SSS: Semantic Sequence Score) are higher-is-better (↑). † marks duration-aware variants. Bold = best within each dataset/comparison block. In the blocks shown, ST-DiffEye is best or tied for best on every distance metric except LD on COCO-FreeView, and best on every duration-aware metric.

Method	LD↓	DFD↓	DTW↓	TDE↓	MM↑	SM↑	SM†↑	SS↑	SS†↑	SSS↑	SSS†↑
COCO-FreeView — Scanpath comparison
DiffEye	56.368	121.355	1116.594	0.051	0.370	0.025	−0.018	0.197	−0.029	–	–
ScanDiff	41.856	104.744	910.433	0.049	0.387	0.102	0.089	0.274	0.075	–	–
ST-DiffEye (Ours)	47.108	103.009	851.792	0.048	0.377	0.089	0.108	0.206	0.088	–	–
MIT1003 — Scanpath comparison
DiffEye	27.530	111.506	516.821	0.062	0.377	0.112	0.125	0.328	0.121	–	–
ScanDiff	25.189	102.216	496.327	0.058	0.379	0.111	0.118	0.342	0.128	–	–
ST-DiffEye (Ours)	24.961	100.903	456.985	0.058	0.380	0.129	0.145	0.321	0.146	–	–
COCO-Search18, Target Present — Scanpath comparison
GazeFormer	15.115	181.157	284.039	0.099	0.327	0.044	0.008	0.540	0.052	0.001	−0.045
HAT	10.784	122.199	227.450	0.080	0.398	0.201	–	0.197	–	0.185	–
ScanDiff	9.653	124.455	219.855	0.076	0.371	0.165	0.104	0.521	0.148	0.164	0.089
ST-DiffEye (Ours)	8.716	110.038	196.631	0.074	0.381	0.206	0.146	0.504	0.180	0.211	0.134
COCO-Search18, Target Absent — Scanpath comparison
GazeXplain	28.302	152.507	514.880	0.074	0.350	−0.009	0.001	0.347	0.010	0.002	0.009
HAT	24.371	117.266	439.956	0.064	0.399	0.095	–	−0.051	–	0.100	–
ScanDiff	21.252	119.689	426.350	0.069	0.378	0.086	0.077	0.305	0.090	0.109	0.092
ST-DiffEye (Ours)	21.128	115.683	410.444	0.064	0.381	0.110	0.095	0.288	0.109	0.130	0.109

Excerpt of Table 1 in the paper (scanpath comparison blocks). Full results — including the trajectory comparison blocks and the mean, best-of-N, and KLD protocols — are in the paper.

BibTeX

@article{zhao2026stdiffeye,
  title   = {ST-DiffEye: Diffusion-based Continuous Gaze Generation via Joint Scanpath-Trajectory Modeling},
  author  = {Zhao, Brian Nlong and Kara, Ozgur and Kim, Junho and Rehg, James M.},
  journal = {arXiv preprint},
  year    = {2026}
}
