
SurgMotion: A Video-Native Foundation Model for Universal Understanding of Surgical Videos

Jinlin Wu1,6, Felix Holm3, Chuxi Chen1, An Wang4, Yaxin Hu1, Xiaofan Ye7, Zelin Zang1, Miao Xu1,5,6, Lihua Zhou1, Huai Liao8, Danny T. M. Chan9, Ming Feng10, Wai S. Poon7, Hongliang Ren4, Dong Yi1, Nassir Navab3, Gaofeng Meng1,5,6, Hongbin Liu1,6, Jiebo Luo2, and Zhen Lei*1,5,6
1Center for Artificial Intelligence and Robotics, Hong Kong Institute of Science and Innovation, CAS, Hong Kong, China
2Hong Kong Institute of Science and Innovation, CAS, Hong Kong, China
3Computer Aided Medical Procedures, Technical University of Munich, Munich, Germany
4Electronic Engineering Department, The Chinese University of Hong Kong, Hong Kong, China
5University of Chinese Academy of Sciences, Beijing, China
6State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS, Beijing, China
7Neuromedical Centre, Hong Kong University Shenzhen Hospital, Shenzhen, China
8Department of Respiratory Medicine, The First Affiliated Hospital of Sun Yat-sen University, Guangzhou, China
9Department of Surgery, The Chinese University of Hong Kong, Hong Kong, China
10Department of Neurosurgery, China Pituitary Disease Registry Center, PUMCH, CAMS & PUMC, Beijing, China

February 11, 2026

Abstract

Current surgical foundation models remain trapped in static, image-based paradigms, failing to grasp the complex temporal dynamics essential for surgical understanding. We present SurgMotion, a video-native foundation model that shifts the paradigm from pixel-level reconstruction to latent motion prediction. Built upon the Video Joint Embedding Predictive Architecture (V-JEPA), SurgMotion learns robust spatiotemporal representations without the computational overhead of generative decoding. To unlock its potential, we curate SurgMotion-15M, the largest multi-modal surgical video dataset to date, spanning 13 anatomical regions and 3,658 hours.

We further introduce a Flow-Guided Latent Prediction objective to prevent feature collapse in homogeneous tissues. Extensive experiments demonstrate that SurgMotion outperforms state-of-the-art methods by significant margins: +14.6% F1-score on EgoSurgery and +10.3% on PitVis. Our work establishes a new standard for data-efficient, motion-aware surgical intelligence.
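For concreteness, here is a minimal PyTorch sketch of one plausible form such a flow-guided latent prediction objective could take, assuming an EMA target encoder and precomputed per-token optical-flow magnitudes. The module names (context_encoder, target_encoder, predictor) and the weighting scheme are illustrative assumptions, not the released SurgMotion implementation.

# Hypothetical sketch of a flow-guided latent prediction loss in the spirit
# of V-JEPA; all names and the exact weighting are illustrative assumptions.
import torch
import torch.nn.functional as F

def flow_guided_latent_loss(context_encoder, target_encoder, predictor,
                            video, flow_mag, mask):
    # video:    (B, T, C, H, W) clip
    # flow_mag: (B, N) per-token optical-flow magnitude, precomputed
    # mask:     (B, N) boolean mask over spatiotemporal tokens to predict
    with torch.no_grad():
        # Targets come from a frozen/EMA encoder (stop-gradient), the usual
        # JEPA-style guard against trivial representational collapse.
        targets = target_encoder(video)              # (B, N, D)

    context = context_encoder(video, mask=mask)      # sees unmasked tokens
    preds = predictor(context, mask=mask)            # predicts masked latents

    # Per-token L1 distance in latent space: no pixel-level decoding needed.
    per_token = F.l1_loss(preds, targets, reduction="none").mean(-1)  # (B, N)

    # Flow-guided weighting: up-weight high-motion tokens so the loss cannot
    # be minimized by collapsing on static, homogeneous tissue.
    weights = 1.0 + flow_mag / (flow_mag.mean(dim=1, keepdim=True) + 1e-6)
    m = mask.float()
    return (per_token * weights * m).sum() / (weights * m).sum().clamp(min=1.0)

The key design point is that prediction happens entirely in latent space: there is no pixel decoder, and the flow-based weights keep the loss from being dominated by static, texture-poor tissue regions.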

SurgMotion-15M Coverage

SurgMotion-15M spans 13+ anatomical regions, creating a diverse landscape for pan-surgical learning.

15M frames · 13+ organs · 3.6k hours · 6+ tasks

Experimental Results

We evaluate SurgMotion on standard laparoscopic benchmarks, cross-domain generalization tasks, and fine-grained action understanding. By leveraging Flow-Guided V-JEPA, our model achieves state-of-the-art performance, recording a +14.6% F1-score improvement on EgoSurgery and +10.3% on PitVis compared to previous methods.
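For reference, the snippet below sketches one common way such frame-level phase-recognition F1 scores are computed: macro-averaged over classes, then averaged over videos. The exact EgoSurgery and PitVis protocols (boundary relaxation, class sets) may differ; this only illustrates the metric itself.

# Minimal sketch of frame-level phase-recognition scoring with scikit-learn.
import numpy as np
from sklearn.metrics import f1_score

def video_macro_f1(gt_videos, pred_videos):
    # gt_videos, pred_videos: lists of per-frame phase-label arrays, one per video
    scores = [f1_score(gt, pred, average="macro")
              for gt, pred in zip(gt_videos, pred_videos)]
    return float(np.mean(scores))

# Toy usage: two short "videos" with integer phase labels per frame.
gt = [np.array([0, 0, 1, 1, 2]), np.array([0, 1, 1, 2, 2])]
pred = [np.array([0, 0, 1, 2, 2]), np.array([0, 1, 1, 1, 2])]
print(video_macro_f1(gt, pred))  # ~0.80 on this toy example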

Figure: Dataset duration comparison (bar chart).
Workflow

Scaling pre-training data and model capacity together leads to a clear jump in performance. Trained on 3,658 hours of surgical video with 1.01B parameters, SurgMotion sets a new scale for surgical foundation models and achieves the highest workflow average F1 of 72.0 among compared methods.

Figure: Per-task results for all compared models (bar chart).
This advantage carries over to public benchmarks: across six representative surgical tasks, SurgMotion achieves the best results on five, leading in workflow analysis, action recognition, segmentation, triplet recognition, and skill assessment, while remaining competitive on depth estimation.

Qualitative Visualization


(a) Segmentation

The temporal consistency of V-JEPA features ensures stable masks across frames on the Polyp dataset.

(b) Depth Estimation

Comparison of ground truth (left) and SurgMotion prediction (right). Our model produces smooth, geometrically coherent depth maps on the C3VD dataset.

(c) Phase Recognition

We assess SurgMotion on 8 datasets spanning laparoscopic, open, endonasal, neurosurgical, and ophthalmic procedures. Our model achieves state-of-the-art phase-recognition performance on several challenging surgical workflow recognition datasets.
t-SNE Visualization

A t-SNE embedding of the latent feature space shows clear separation between semantically distinct surgical phases.

(d) Frame Segmentation via Phase Recognition

This panel visualizes temporal phase predictions across 12 representative cases from 8 datasets. Each row shows a predicted phase sequence as a colored bar, with the ground truth at the top. SurgMotion produces smoother predictions that closely track the ground-truth phase boundaries, whereas baselines exhibit frequent fragmentation and phase confusion.
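A plot of this kind can be reproduced with standard tooling once per-frame features and phase labels are extracted; the sketch below assumes they are already available as NumPy arrays (the feature dimension 768 and the call signature are arbitrary illustrations, not our released pipeline).

# Illustrative sketch of a t-SNE plot over per-frame latent features.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_phase_tsne(features, phase_labels, out_path="tsne_phases.png"):
    # features:     (num_frames, feat_dim) per-frame latent features
    # phase_labels: (num_frames,) integer phase label per frame
    coords = TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=0).fit_transform(features)
    plt.figure(figsize=(5, 5))
    plt.scatter(coords[:, 0], coords[:, 1], c=phase_labels, cmap="tab10", s=4)
    plt.title("Per-frame latents, colored by surgical phase")
    plt.tight_layout()
    plt.savefig(out_path, dpi=200)

# Toy usage with random features standing in for encoder outputs.
plot_phase_tsne(np.random.randn(500, 768), np.random.randint(0, 8, 500))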

BibTeX

If you find this work helpful, please cite our paper as follows:

@article{SurgMotion2026,
  title={SurgMotion: A Video-Native Foundation Model for Universal Understanding of Surgical Videos},
  author={Wu, Jinlin and Holm, Felix and Chen, Chuxi and Wang, An and Hu, Yaxin and Ye, Xiaofan and Zang, Zelin and Xu, Miao and Zhou, Lihua and Liao, Huai and Chan, Danny T. M. and Feng, Ming and Poon, Wai S. and Ren, Hongliang and Yi, Dong and Navab, Nassir and Meng, Gaofeng and Luo, Jiebo and Liu, Hongbin and Lei, Zhen},
  journal={arXiv preprint},
  year={2026}
}

Collaborating Institutions

We thank our partners for their support in clinical data and academic research. (Listed in no particular order)

Peking Union Medical College Hospital

King's College Hospital

First Affiliated Hospital of SYSU

Prince of Wales Hospital

HKU-Shenzhen Hospital

Technical University of Munich

The Chinese University of Hong Kong
