Accepted at ICML 2026

VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation

The first LLM-based framework for open-domain text-to-SVG animation, modeling motion as sparse state updates on a persistent SVG DOM tree for stronger identity consistency, structural validity, and geometry-level deformation.

Guotao Liang1 Zhangcheng Wang2 Chuang Wang1 Juncheng Hu1 Haitao Zhou1 Junhua Liu3 Jing Zhang1 Dong Xu4 Qian Yu1

1Beihang University · 24Paradigm · 3Zhejiang University · 4The University of Hong Kong

Corresponding author

arXiv · 54 demos · Code (coming soon) · Dataset page (coming soon)
Highlights: SVGAnim-134k benchmark · over 9.8× sequence compression · Identification-First Motion Planning · Rendering-Aware GRPO
Core Idea

Sparse State Updates on a persistent SVG DOM tree

VAnim predicts differential updates instead of regenerating a full SVG at every step, reducing token length while preserving non-participating elements by construction.
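The idea can be sketched in a few lines: keep one persistent DOM tree and, per frame, emit only the attribute changes of the elements that move. The update format below (element id mapped to attribute deltas) is a hypothetical illustration, not the paper's exact serialization.

```python
# Sketch of sparse state updates on a persistent SVG DOM tree.
# Assumption: each frame is a dict {element_id: {attribute: new_value}}.
import xml.etree.ElementTree as ET

SVG = """<svg xmlns="http://www.w3.org/2000/svg">
  <circle id="sun" cx="20" cy="20" r="10" fill="gold"/>
  <rect id="house" x="40" y="40" width="30" height="20" fill="gray"/>
</svg>"""

# Each keyframe lists only the elements that participate in the motion;
# everything else is preserved by construction.
updates = [
    {"sun": {"cx": "30"}},             # frame 1: sun drifts right
    {"sun": {"cx": "40", "cy": "15"}}, # frame 2: sun rises
]

def apply_updates(root, frame_updates):
    """Mutate only the referenced elements; the rest of the tree persists."""
    for elem in root.iter():
        if elem.get("id") in frame_updates:
            for attr, value in frame_updates[elem.get("id")].items():
                elem.set(attr, value)

root = ET.fromstring(SVG)
for frame in updates:
    apply_updates(root, frame)

print(root.find(".//*[@id='sun']").get("cx"))   # → 40
print(root.find(".//*[@id='house']").get("x"))  # → 40 (untouched)
```

Note how the house never appears in any update, so its geometry cannot drift: structural preservation falls out of the representation rather than being enforced by a loss.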

Why It Matters

Structure-preserving motion beyond rigid transforms

The framework supports geometry-level motion and path deformation, unlocking richer vector animation than CSS or SMIL transform-only generation.
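A minimal sketch of what "geometry-level" means here: deforming the coordinates inside a path's `d` attribute, rather than wrapping the whole shape in a rigid transform. The linear interpolation below is an illustrative assumption; VAnim's actual path updates may differ.

```python
# Sketch: geometry-level path deformation by blending the numeric
# coordinates of two structurally identical SVG path "d" strings.
import re

NUM = re.compile(r"-?\d+(?:\.\d+)?")

def interpolate_path(d_src, d_dst, t):
    """Blend two paths with identical command structure at time t in [0, 1]."""
    src = [float(m) for m in NUM.findall(d_src)]
    dst = [float(m) for m in NUM.findall(d_dst)]
    assert len(src) == len(dst), "paths must share command structure"
    blended = iter(a + t * (b - a) for a, b in zip(src, dst))
    # Re-emit the path with interpolated numbers, keeping the commands.
    return NUM.sub(lambda _: format(next(blended), "g"), d_src)

# A triangle whose apex lifts upward as t goes from 0 to 1.
print(interpolate_path("M 0 100 L 50 100 L 25 60 Z",
                       "M 0 100 L 50 100 L 25 20 Z", 0.5))
# → M 0 100 L 50 100 L 25 40 Z
```

A transform-only pipeline (CSS/SMIL `translate`, `rotate`, `scale`) cannot express this per-point motion, which is why path-level updates are the harder and richer regime.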

Scale

134k training samples and 54 public demo cases

The project introduces SVGAnim-134k, the first large-scale vector animation benchmark, and pairs it with a curated on-page case gallery for qualitative inspection.

Abstract

Open-domain text-to-SVG animation with structural control

Scalable Vector Graphics animation is valuable because it remains editable, lightweight, and resolution independent, yet it is difficult to generate automatically: the model must bridge discrete SVG code with continuous visual dynamics while preserving topology and identity. VAnim addresses this by representing animation as sparse state updates over a persistent SVG DOM tree, grounding motion through Identification-First Motion Planning, and aligning generation with rendered visual feedback using Rendering-Aware Reinforcement Learning via GRPO.

Contributions

Three ingredients behind VAnim

01

SVGAnim-134k benchmark

A large-scale benchmark for training and evaluating vector animation models, built from linguistically annotated SVG animation sequences.

02

Identification-First sparse generation

Motion planning is explicitly grounded to persistent SVG identities, then executed through sparse updates that retain the original structure by design.

03

Rendering-aware optimization

GRPO introduces visual reward feedback so the model can learn code updates that better match the intended motion and semantics after rendering.
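The group-relative step at the core of GRPO can be sketched as follows: sample several candidate animations for one prompt, score each after rendering, and normalize rewards within the group so the policy is pushed toward the better-than-average samples. The rewards here are stand-in numbers; the paper's visual reward comes from rendered frames.

```python
# Sketch of GRPO's group-relative advantage computation.
# Assumption: `rewards` holds rendering-based scores for one prompt's group.
import statistics

def group_relative_advantages(rewards):
    """Center and scale each reward by its group's mean and std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in rewards]

# Four sampled rollouts for one prompt, scored after rendering.
rewards = [0.2, 0.5, 0.8, 0.5]
print([round(a, 3) for a in group_relative_advantages(rewards)])
# → [-1.414, 0.0, 1.414, 0.0]
```

Because advantages are relative within the group, no learned value function is needed: a rollout is reinforced only insofar as its rendered result beats its siblings for the same prompt.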

Method

From SVG identity grounding to rendering-aware training

Overview figure of the VAnim framework.
Framework overview: identify and plan motion on persistent SVG entities, execute sparse updates, then optimize with rendering-aware GRPO.
Construction pipeline of the SVGAnim-134k dataset.
SVGAnim-134k construction pipeline from Lottie sources to sparse state updates and dual-stream annotations.
Token efficiency analysis for sparse state updates.
Sparse state updates keep long-horizon animation generation tractable by compressing the sequence representation.

Dataset

SVGAnim-134k captures real vector motion diversity

The benchmark spans UI icons, loading indicators, narrative illustrations, and character dynamics. The resulting distribution contains a substantial share of path-level updates, making the task meaningfully harder than rigid transform animation.

  • First large-scale dataset for vector animation generation.
  • Dual-stream annotations pair user prompts with structure-bound reasoning traces.
  • Designed to support SFT, RL optimization, and held-out evaluation.
Representative samples from the SVGAnim-134k dataset.
Representative samples from SVGAnim-134k across diverse visual domains.
Attribute update distribution in the SVGAnim-134k dataset.
Attribute update statistics reveal strong geometry-level motion complexity, including non-rigid path manipulations.

Results

Better semantic alignment with stronger structure preservation

Qualitative comparison between VAnim and baseline methods.
Qualitative comparison against Gemini-3-Pro, GPT-5.2, and LiveSketch. VAnim preserves identity and topology while executing non-rigid and sequential motions more faithfully.
Ablation study figure for VAnim.
Ablations isolate the contributions of sparse updates, input grounding, and rendering-aware optimization.
Takeaway

What the paper emphasizes

The main gains come from two design decisions: a sparse representation that naturally stabilizes structure, and visual reward feedback that teaches the model to align rendered motion with prompt semantics.

Together, they let VAnim handle long-horizon animation with richer deformation patterns than rigid transform pipelines.

Citation

Use this BibTeX entry

@inproceedings{liang2026vanim,
  title={VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation},
  author={Liang, Guotao and Wang, Zhangcheng and Wang, Chuang and Hu, Juncheng and Zhou, Haitao and Liu, Junhua and Zhang, Jing and Xu, Dong and Yu, Qian},
  booktitle={Proceedings of the 43rd International Conference on Machine Learning},
  year={2026}
}