
SurgMotion: A Video-Native Foundation Model for Universal Understanding of Surgical Videos

Jinlin Wu1,6, Felix Holm3, Chuxi Chen1, An Wang4, Yaxin Hu1, Xiaofan Ye7, Zelin Zang1, Miao Xu1,5,6, Lihua Zhou1, Huai Liao8, Danny T. M. Chan9, Ming Feng10, Wai S. Poon7, Hongliang Ren4, Dong Yi1, Nassir Navab3, Gaofeng Meng1,5,6, Hongbin Liu1,6, Jiebo Luo2, and Zhen Lei*1,5,6
1Center for Artificial Intelligence and Robotics, Hong Kong Institute of Science and Innovation, CAS, Hong Kong, China
2Hong Kong Institute of Science and Innovation, CAS, Hong Kong, China
3Computer Aided Medical Procedures, Technical University of Munich, Munich, Germany
4Electronic Engineering Department, The Chinese University of Hong Kong, Hong Kong, China
5University of Chinese Academy of Sciences, Beijing, China
6State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS, Beijing, China
7Neuromedical Centre, Hong Kong University Shenzhen Hospital, Shenzhen, China
8Department of Respiratory Medicine, The First Affiliated Hospital of Sun Yat-sen University, Guangzhou, China
9Department of Surgery, The Chinese University of Hong Kong, Hong Kong, China
10Department of Neurosurgery, China Pituitary Disease Registry Center, PUMCH, CAMS & PUMC, Beijing, China

February 11, 2026

Abstract

Current surgical foundation models remain trapped in static, image-based paradigms, failing to grasp the complex temporal dynamics essential for surgical understanding. We present SurgMotion, a video-native foundation model that shifts the paradigm from pixel-level reconstruction to latent motion prediction. Built upon the Video Joint Embedding Predictive Architecture (V-JEPA), SurgMotion learns robust spatiotemporal representations without the computational overhead of generative decoding. To unlock its potential, we curate SurgMotion-15M, the largest multi-modal surgical video dataset to date, spanning 13 anatomical regions and 3,658 hours.

We further introduce a Flow-Guided Latent Prediction objective to prevent feature collapse in homogeneous tissues. Extensive experiments demonstrate that SurgMotion outperforms state-of-the-art methods by significant margins: +14.6% F1-score on EgoSurgery and +10.3% on PitVis. Our work establishes a new standard for data-efficient, motion-aware surgical intelligence.
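For concreteness, here is a minimal PyTorch sketch of one plausible form such a flow-guided latent prediction objective could take, assuming an EMA target encoder and precomputed per-token optical-flow magnitudes. The module names (context_encoder, target_encoder, predictor) and the weighting scheme are illustrative assumptions, not the released SurgMotion implementation.

# Hypothetical sketch of a flow-guided latent prediction loss in the spirit
# of V-JEPA; all names and the exact weighting are illustrative assumptions.
import torch
import torch.nn.functional as F

def flow_guided_latent_loss(context_encoder, target_encoder, predictor,
                            video, flow_mag, mask):
    # video:    (B, T, C, H, W) clip
    # flow_mag: (B, N) per-token optical-flow magnitude, precomputed
    # mask:     (B, N) boolean mask over spatiotemporal tokens to predict
    with torch.no_grad():
        # Targets come from a frozen/EMA encoder (stop-gradient), the usual
        # JEPA-style guard against trivial representational collapse.
        targets = target_encoder(video)              # (B, N, D)

    context = context_encoder(video, mask=mask)      # sees unmasked tokens
    preds = predictor(context, mask=mask)            # predicts masked latents

    # Per-token L1 distance in latent space: no pixel-level decoding needed.
    per_token = F.l1_loss(preds, targets, reduction="none").mean(-1)  # (B, N)

    # Flow-guided weighting: up-weight high-motion tokens so the loss cannot
    # be minimized by collapsing on static, homogeneous tissue.
    weights = 1.0 + flow_mag / (flow_mag.mean(dim=1, keepdim=True) + 1e-6)
    m = mask.float()
    return (per_token * weights * m).sum() / (weights * m).sum().clamp(min=1.0)

The key design point is that prediction happens entirely in latent space: there is no pixel decoder, and the flow-based weights keep the loss from being dominated by static, texture-poor tissue regions.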

SurgMotion-15M Coverage

SurgMotion-15M spans 13+ anatomical regions, creating a diverse landscape for pan-surgical learning.

15M frames · 13+ organs · 3.6k hours · 6+ tasks

Experimental Results

We evaluate SurgMotion on standard laparoscopic benchmarks, cross-domain generalization tasks, and fine-grained action understanding. By leveraging Flow-Guided V-JEPA, our model achieves state-of-the-art performance, recording a +14.6% F1-score improvement on EgoSurgery and +10.3% on PitVis compared to previous methods.
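For reference, the snippet below sketches one common way such frame-level phase-recognition F1 scores are computed: macro-averaged over classes, then averaged over videos. The exact EgoSurgery and PitVis protocols (boundary relaxation, class sets) may differ; this only illustrates the metric itself.

# Minimal sketch of frame-level phase-recognition scoring with scikit-learn.
import numpy as np
from sklearn.metrics import f1_score

def video_macro_f1(gt_videos, pred_videos):
    # gt_videos, pred_videos: lists of per-frame phase-label arrays, one per video
    scores = [f1_score(gt, pred, average="macro")
              for gt, pred in zip(gt_videos, pred_videos)]
    return float(np.mean(scores))

# Toy usage: two short "videos" with integer phase labels per frame.
gt = [np.array([0, 0, 1, 1, 2]), np.array([0, 1, 1, 2, 2])]
pred = [np.array([0, 0, 1, 2, 2]), np.array([0, 1, 1, 1, 2])]
print(video_macro_f1(gt, pred))  # ~0.80 on this toy example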

Figure: Dataset duration comparison (bar chart).
Workflow

Scaling pre-training data and model capacity together leads to a clear jump in performance. Trained on 3,658 hours of surgical video with 1.01B parameters, SurgMotion sets a new scale for surgical foundation models and achieves the highest workflow average F1 of 72.0 among compared methods.

Figure: Per-task results for all compared models (bar chart).
This advantage carries over to public benchmarks: across six representative surgical tasks, SurgMotion achieves the best results on five, leading in workflow analysis, action recognition, segmentation, triplet recognition, and skill assessment, while remaining competitive on depth estimation.

Qualitative Visualization


(a) Segmentation

The temporal consistency of V-JEPA features ensures stable masks across frames on the Polyp dataset.

(b) Depth Estimation

Comparison of ground truth (left) and SurgMotion prediction (right). Our model produces smooth, geometrically coherent depth maps on the C3VD dataset.

(c) Phase Recognition

We assess SurgMotion on 8 datasets spanning laparoscopic, open, endonasal, neurosurgical, and ophthalmic procedures. Our model achieves state-of-the-art phase-recognition performance on several challenging surgical workflow recognition datasets.
t-SNE Visualization

A t-SNE embedding of the latent feature space shows clear separation between semantically distinct surgical phases.

(d) Frame Segmentation via Phase Recognition

This panel visualizes temporal phase predictions across 12 representative cases from 8 datasets. Each row shows a predicted phase sequence as a colored bar, with the ground truth at the top. SurgMotion produces smoother predictions that closely track the ground-truth phase boundaries, whereas baselines exhibit frequent fragmentation and phase confusion.
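A plot of this kind can be reproduced with standard tooling once per-frame features and phase labels are extracted; the sketch below assumes they are already available as NumPy arrays (the feature dimension 768 and the call signature are arbitrary illustrations, not our released pipeline).

# Illustrative sketch of a t-SNE plot over per-frame latent features.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_phase_tsne(features, phase_labels, out_path="tsne_phases.png"):
    # features:     (num_frames, feat_dim) per-frame latent features
    # phase_labels: (num_frames,) integer phase label per frame
    coords = TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=0).fit_transform(features)
    plt.figure(figsize=(5, 5))
    plt.scatter(coords[:, 0], coords[:, 1], c=phase_labels, cmap="tab10", s=4)
    plt.title("Per-frame latents, colored by surgical phase")
    plt.tight_layout()
    plt.savefig(out_path, dpi=200)

# Toy usage with random features standing in for encoder outputs.
plot_phase_tsne(np.random.randn(500, 768), np.random.randint(0, 8, 500))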

BibTeX

If you find this work helpful, please cite our paper as follows:

@article{SurgMotion2026,
  title={SurgMotion: A Video-Native Foundation Model for Universal Understanding of Surgical Videos},
  author={Wu, Jinlin and Holm, Felix and Chen, Chuxi and Wang, An and Hu, Yaxin and Ye, Xiaofan and Zang, Zelin and Xu, Miao and Zhou, Lihua and Liao, Huai and Chan, Danny T. M. and Feng, Ming and Poon, Wai S. and Ren, Hongliang and Yi, Dong and Navab, Nassir and Meng, Gaofeng and Luo, Jiebo and Liu, Hongbin and Lei, Zhen},
  journal={arXiv preprint},
  year={2026}
}

Collaborating Institutions

We thank our partners for their support in clinical data and academic research. (Listed in no particular order)

Peking Union Medical College Hospital

King's College Hospital

First Affiliated Hospital of SYSU

Prince of Wales Hospital

HKU-Shenzhen Hospital

Technical University of Munich

The Chinese University of Hong Kong
