SpotEdit: Selective Region Editing in Diffusion Transformers

¹National University of Singapore, ²Shanghai Jiao Tong University
SpotEdit Results Preview

Abstract

Diffusion Transformer models have significantly advanced image editing by encoding conditional images and integrating them into transformer layers. However, most edits involve modifying only small regions, while current methods uniformly process and denoise all tokens at every timestep, causing redundant computation and potentially degrading unchanged areas. This raises a fundamental question: Is it truly necessary to regenerate every region during editing? To address this, we propose SpotEdit, a training-free diffusion editing framework that selectively updates only the modified regions. SpotEdit comprises two key components: SpotSelector identifies stable regions via perceptual similarity and skips their computation by reusing conditional image features; SpotFusion adaptively blends these features with edited tokens through a dynamic fusion mechanism, preserving contextual coherence and editing quality. By reducing unnecessary computation and maintaining high fidelity in unmodified areas, SpotEdit achieves efficient and precise image editing.

Motivation

Most image edits modify only a small region, yet existing diffusion methods regenerate the entire image at every denoising step, wasting computation on content that does not change.

By reconstructing the clean image at intermediate timesteps, we can directly observe which parts of the image have already stabilized and which ones are still being edited.

\( \hat{x}_0^t = x_t - t \, v_{\theta}(x_t, c, t) \).
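This estimate is a single tensor operation given the model's velocity prediction. Below is a minimal PyTorch sketch, assuming the rectified-flow parameterization implied by the equation (x_t = (1 − t)·x₀ + t·noise); the function name predict_clean_image is illustrative, not from the paper.

import torch

@torch.no_grad()
def predict_clean_image(x_t: torch.Tensor, v_pred: torch.Tensor, t: float) -> torch.Tensor:
    # One-step estimate of the clean image from the current latent x_t.
    # With x_t = (1 - t) * x_0 + t * noise and v_theta(x_t, c, t) = noise - x_0,
    # we have x_t - t * v = x_0.
    return x_t - t * v_pred

Comparing these clean estimates across consecutive timesteps shows which regions have already stabilized and which are still being edited.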

Visualization: the original image, the predicted clean image x̂₀ᵗ as the timestep t decreases, and the final edited result.

By observing that non-edited regions converge early during diffusion, we ask:

Is it truly necessary to regenerate every region during editing?

We therefore propose SpotEdit, which follows a simple principle: edit only what needs to be edited.

Pipeline

SpotEdit pipeline overview

The process consists of three stages:

(1) Initial Steps: the model performs standard DiT denoising on all image tokens under the editing instruction, while caching key-value (KV) features for SpotFusion.

(2) Spot Steps: SpotSelector dynamically separates the tokens into a regenerated region and a non-edited region using LPIPS-like perceptual scores. Non-edited tokens are excluded from DiT computation, while the regenerated tokens continue to be denoised with SpotFusion, which builds a temporally consistent condition cache by fusing the cached KV features of non-edited tokens with those of the conditional image.

(3) Token Replacement: regenerated tokens are updated by the DiT, and non-edited tokens are directly replaced with the corresponding reused tokens before decoding into an image, ensuring background fidelity at reduced computational cost.
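To make the selective loop concrete, here is a minimal PyTorch sketch of the idea under our own simplifying assumptions, not the official implementation: the names spotedit_denoise, velocity_fn, tau, and num_initial_steps are illustrative, a per-token latent distance stands in for SpotSelector's LPIPS-like perceptual score, and SpotFusion's KV-cache fusion inside the DiT is omitted.

import torch

@torch.no_grad()
def spotedit_denoise(x_t, cond_tokens, velocity_fn, timesteps,
                     num_initial_steps=4, tau=0.1):
    # x_t:          noisy latent tokens, shape (N, D)
    # cond_tokens:  clean conditional-image tokens, shape (N, D)
    # velocity_fn:  callable (tokens, t) -> velocity prediction, shape (N, D)
    # timesteps:    decreasing list of flow-matching times, e.g. [1.0, ..., 0.05]
    active = torch.ones(x_t.shape[0], dtype=torch.bool, device=x_t.device)

    for i, t in enumerate(timesteps):
        v = velocity_fn(x_t, t)      # DiT velocity prediction (computed for all tokens
                                     # here; the real method skips non-edited tokens)
        x0_hat = x_t - t * v         # one-step clean estimate (see equation above)

        if i >= num_initial_steps:
            # SpotSelector stand-in: tokens whose clean estimate is already close
            # to the conditional image are marked non-edited and frozen.
            dist = (x0_hat - cond_tokens).norm(dim=-1)
            active &= dist > tau

        # Euler step of the flow ODE toward the next (smaller) timestep;
        # frozen tokens keep their current value.
        t_next = timesteps[i + 1] if i + 1 < len(timesteps) else 0.0
        x_next = x_t + (t_next - t) * v
        x_t = torch.where(active.unsqueeze(-1), x_next, x_t)

    # Token Replacement: non-edited tokens reuse the conditional-image tokens.
    return torch.where(active.unsqueeze(-1), x_t, cond_tokens)

In the actual pipeline, the savings come from skipping non-edited tokens inside the DiT rather than merely freezing them, with SpotFusion supplying the cached KV context so that the regenerated tokens remain coherent with the rest of the image.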

Gallery

BibTeX

@article{qin2025spotedit,
  title={SpotEdit: Selective Region Editing in Diffusion Transformers},
  author={Qin, Zhibin and Tan, Zhenxiong and Wang, Zeqing and Liu, Songhua and Wang, Xinchao},
  journal={arXiv preprint arXiv:2512.22323},
  year={2025},
  url={https://github.com/Biangbiang0321/SpotEdit}
}