EditCrafter: Tuning-free High-Resolution Image Editing via Pretrained Diffusion Model

CVPRW 2026

1NC AI 2Medipixel, Inc. 3MAUM.AI 4EverEx

EditCrafter can edit images at resolutions up to 4K by leveraging pre-trained Text-to-Image diffusion models without additional fine-tuning or optimization.

Abstract

We propose EditCrafter, a tuning-free high-resolution image editing method that leverages pretrained text-to-image (T2I) diffusion models to process images at resolutions far exceeding those used during training. The generative priors of large-scale T2I diffusion models have enabled a wide array of novel generation and editing applications. Although numerous diffusion-based image editing methods produce high-quality results, they are difficult to apply to images with arbitrary aspect ratios or higher resolutions, since they only work at their training resolutions (512×512 or 1024×1024). Naively applying patch-wise editing fails, producing unrealistic object structures and repetition. To address these challenges, we introduce EditCrafter, a simple yet effective editing pipeline. EditCrafter first performs tiled inversion, which preserves the identity of the input high-resolution image. We further propose a noise-damped manifold-constrained classifier-free guidance (NDCFG++) tailored for high-resolution image editing from the inverted latent. Our experiments show that EditCrafter achieves impressive editing results across various resolutions without fine-tuning or optimization.


Pipeline


Since the noise estimator (U-Net) in Stable Diffusion is trained on low-resolution images, directly inverting an encoded high-resolution image z0 = E(x0) into a high-resolution latent zt for subsequent editing results in poor identity preservation. We therefore first perform tiled DDIM inversion to generate a high-resolution latent representation. Using this latent, the reverse diffusion process is carried out with a re-dilated noise estimator. To enhance the quality of text-guided editing, we propose noise-damped manifold-constrained classifier-free guidance (NDCFG++). In this figure, the editing prompt P is “A raccoon peeking out from behind a bush”.
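The exact re-dilation and NDCFG++ formulations are not given on this page. As a rough illustration of the tiled-processing pattern the pipeline relies on, the sketch below runs a low-resolution noise estimator over overlapping latent tiles and blends the overlapping predictions by uniform averaging, then applies standard classifier-free guidance as a stand-in for NDCFG++. All function names, tile sizes, and the uniform blending weights are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def split_tiles(latent, tile, stride):
    """Yield (y, x, patch) views over the last two (spatial) axes
    of a latent array, with overlap controlled by stride."""
    H, W = latent.shape[-2:]
    ys = list(range(0, H - tile + 1, stride))
    xs = list(range(0, W - tile + 1, stride))
    # Ensure the last tile reaches the image border.
    if ys[-1] + tile < H:
        ys.append(H - tile)
    if xs[-1] + tile < W:
        xs.append(W - tile)
    for y in ys:
        for x in xs:
            yield y, x, latent[..., y:y + tile, x:x + tile]

def tiled_eps(latent, eps_fn, tile=64, stride=48):
    """Run a low-resolution noise estimator eps_fn tile-by-tile over a
    high-resolution latent and average the overlapping predictions."""
    out = np.zeros_like(latent)
    weight = np.zeros(latent.shape[-2:])
    for y, x, patch in split_tiles(latent, tile, stride):
        out[..., y:y + tile, x:x + tile] += eps_fn(patch)
        weight[y:y + tile, x:x + tile] += 1.0
    return out / weight  # weight broadcasts over leading channel axes

def cfg(eps_uncond, eps_cond, w):
    """Standard classifier-free guidance on two noise predictions."""
    return eps_uncond + w * (eps_cond - eps_uncond)
```

With an identity `eps_fn`, `tiled_eps` reconstructs the input exactly, since every spatial location averages identical overlapping copies; a real noise estimator would be called once for the conditional and once for the unconditional prompt, with the two tiled predictions combined via the guidance rule at each inversion or sampling step.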



Qualitative Comparisons


EditCrafter generates highly faithful edited images that are well-aligned with the editing prompts while preserving the intricate details of the originals. CSD [1] frequently exhibits repeated objects due to its patch-wise generation scheme.

Stable Diffusion 2.1

Using a pretrained model trained at a resolution of 512×512, our method can edit images at resolutions up to 2048×2048 without fine-tuning or optimization.

SDXL 1.0

Using a pretrained model trained at a resolution of 1024×1024, our method can edit images at resolutions up to 4096×4096 without fine-tuning or optimization.

References

[1] Subin Kim, Kyungmin Lee, June Suk Choi, Jongheon Jeong, Kihyuk Sohn, Jinwoo Shin. Collaborative Score Distillation for Consistent Visual Synthesis. NeurIPS, 2023.

BibTeX

@inproceedings{kim2026editcrafter,
    title={{EditCrafter: Tuning-free High-Resolution Image Editing via Pretrained Diffusion Model}},
    author={Kunho Kim and Sumin Seo and Yongjun Cho and Hyungjin Chung},
    booktitle={CVPR 2nd Workshop on Human-Interactive Generation and Editing},
    year={2026},
}