EditCrafter can edit images at resolutions up to 4K by leveraging pre-trained text-to-image diffusion models, without additional fine-tuning or optimization.
We propose EditCrafter, a tuning-free high-resolution image editing method that leverages pretrained text-to-image (T2I) diffusion models to process images at resolutions far exceeding those seen during training. The generative priors of large-scale T2I diffusion models enable a wide array of novel generation and editing applications. Although numerous diffusion-based image editing methods achieve high-quality results, they are difficult to apply to images with arbitrary aspect ratios or higher resolutions, since they only operate at their training resolutions (512×512 or 1024×1024). Naively applying patch-wise editing fails, producing unrealistic object structures and repeated content. To address these challenges, we introduce EditCrafter, a simple yet effective editing pipeline. EditCrafter first performs tiled inversion, which preserves the identity of the input high-resolution image. We further propose noise-damped manifold-constrained classifier-free guidance (NDCFG++), tailored for high-resolution image editing from the inverted latent. Our experiments show that EditCrafter achieves impressive editing results across various resolutions without fine-tuning or optimization.
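NDCFG++ builds on classifier-free guidance (CFG). Its exact noise-damped, manifold-constrained form is specific to this work and is not spelled out here, but the vanilla CFG baseline it modifies can be sketched as follows (a minimal illustration, not the authors' code; the function name and the stub noise-estimator interface are our assumptions):

```python
import torch

def cfg_noise(eps_model, z, t, cond, w=7.5):
    """Vanilla classifier-free guidance: extrapolate from the
    unconditional noise prediction toward the text-conditional one
    with guidance scale w."""
    eps_uncond = eps_model(z, t, None)   # prediction without the text prompt
    eps_cond = eps_model(z, t, cond)     # prediction with the text prompt
    return eps_uncond + w * (eps_cond - eps_uncond)
```

In a real pipeline, `eps_model` would be the frozen Stable Diffusion U-Net and `cond` the encoded editing prompt; NDCFG++ replaces this plain extrapolation with a guided update tailored to high-resolution editing.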
Since the noise estimator (U-Net) in Stable Diffusion is trained on low-resolution images, directly inverting an encoded high-resolution image z0 = E(x0) into a high-resolution latent zt for subsequent editing results in poor identity preservation. Therefore, we first perform tiled DDIM inversion to obtain a high-resolution latent representation. From this latent, the reverse diffusion process is carried out with a re-dilated noise estimator. To enhance the quality of text-guided editing, we propose noise-damped manifold-constrained classifier-free guidance (NDCFG++). In this figure, the editing prompt P is “A raccoon peeking out from behind a bush”.
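The tiled-inversion idea above can be illustrated with a short sketch: deterministic DDIM inversion where each noise estimate is assembled from overlapping low-resolution tiles. This is not the authors' implementation; the function names, tile size, and simple averaging of overlapping predictions are our assumptions (the actual pipeline may blend tiles differently and uses the frozen Stable Diffusion U-Net as `eps_model`):

```python
import torch

def tile_positions(latent, tile, stride):
    """Top-left (y, x) positions of overlapping tiles covering the latent."""
    _, _, H, W = latent.shape
    ys = list(range(0, max(H - tile, 0) + 1, stride))
    xs = list(range(0, max(W - tile, 0) + 1, stride))
    if ys[-1] != H - tile:
        ys.append(H - tile)  # ensure the bottom edge is covered
    if xs[-1] != W - tile:
        xs.append(W - tile)  # ensure the right edge is covered
    return [(y, x) for y in ys for x in xs]

def tiled_eps(eps_model, z, t, tile=64, stride=32):
    """Noise estimate on a high-res latent: run the low-res estimator on
    overlapping tiles and average the predictions where tiles overlap."""
    out = torch.zeros_like(z)
    count = torch.zeros_like(z)
    for y, x in tile_positions(z, tile, stride):
        patch = z[:, :, y:y + tile, x:x + tile]
        out[:, :, y:y + tile, x:x + tile] += eps_model(patch, t)
        count[:, :, y:y + tile, x:x + tile] += 1
    return out / count

def tiled_ddim_invert(eps_model, z0, alphas, tile=64, stride=32):
    """Deterministic DDIM inversion z0 -> zT using the tiled noise estimate.
    `alphas` holds the cumulative alpha-bar schedule from t=0 to t=T."""
    z = z0
    for t in range(len(alphas) - 1):
        a_t, a_next = alphas[t], alphas[t + 1]
        eps = tiled_eps(eps_model, z, t, tile, stride)
        x0_pred = (z - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        z = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return z
```

Overlapping strides (stride < tile) are what keep tile seams from appearing in the inverted latent; with non-overlapping tiles each patch would be inverted independently.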
EditCrafter generates highly faithful edited images that are well aligned with the editing prompts while preserving the intricate details of the original images.
CSD [1] frequently exhibits repeated objects due to its patch-wise generation scheme.
Using a pretrained model trained at a resolution of 512×512, our method can edit images at resolutions up to 2048×2048 without fine-tuning or optimization.
Stable Diffusion 2.1
4× (1024×1024)
...on a rainy street, reflecting city lights. → ...in a desert setting at sunset.
Stable Diffusion 2.1
4× (1024×1024)
...lying on a cozy blanket → ...lying on a grassy field
Stable Diffusion 2.1
4× (1024×1024)
moon → earth
Stable Diffusion 2.1
4× (1024×1024)
cactus → aloe
Stable Diffusion 2.1
4× (1024×1024)
...with a lemon slice → ...with a cucumber slice
Stable Diffusion 2.1
4× (1024×1024)
tulips → roses
Stable Diffusion 2.1
4× (1024×1024)
village → castle
Stable Diffusion 2.1
8× (2048×1024)
fox → lion
Stable Diffusion 2.1
8× (2048×1024)
barn owl → barn hawk
Stable Diffusion 2.1
8× (2048×1024)
palm tree → beach umbrella
Stable Diffusion 2.1
8× (2048×1024)
whale shark → dolphin
Stable Diffusion 2.1
8× (2048×1024)
...overlooking a village → ...surrounded by autumn forests
Stable Diffusion 2.1
8× (2048×1024)
elephant → zebra
Stable Diffusion 2.1
16× (2048×2048)
cat → goat
Stable Diffusion 2.1
16× (2048×2048)
...topped with berries → ...topped with roses
Stable Diffusion 2.1
16× (2048×2048)
soccer ball → crystal ball
Stable Diffusion 2.1
16× (2048×2048)
chameleon → koala
Using a pretrained model trained at a resolution of 1024×1024, our method can edit images at resolutions up to 4096×4096 without fine-tuning or optimization.
SDXL 1.0
4× (2048×2048)
dandelion seeds → balloon
SDXL 1.0
4× (2048×2048)
phoenix → chicken
SDXL 1.0
4× (2048×2048)
...overflowing with jewels → ...overflowing with bones
SDXL 1.0
4× (2048×2048)
...on an open road → ...on a desert
SDXL 1.0
8× (4096×2048)
cherry blossom → maple
SDXL 1.0
8× (4096×2048)
lion → tiger
SDXL 1.0
8× (4096×2048)
beanstalk → mushroom
SDXL 1.0
8× (4096×2048)
seashell → crab
SDXL 1.0
8× (4096×2048)
snow globe → jungle globe
SDXL 1.0
8× (4096×2048)
humpback whale → green sea turtle
SDXL 1.0
16× (4096×4096)
forest → burning forest
SDXL 1.0
16× (4096×4096)
apple → pink peach
SDXL 1.0
16× (4096×4096)
bee → hummingbird
SDXL 1.0
16× (4096×4096)
bird → owl
SDXL 1.0
16× (4096×4096)
mountain → sand dune
SDXL 1.0
16× (4096×4096)
stone → Stonehenge
SDXL 1.0
16× (4096×4096)
waterfall → lava flow
[1] Subin Kim, Kyungmin Lee, June Suk Choi, Jongheon Jeong, Kihyuk Sohn, Jinwoo Shin. Collaborative Score Distillation for Consistent Visual Synthesis. NeurIPS, 2023.
@inproceedings{
kim2026editcrafter,
title={{EditCrafter: Tuning-free High-Resolution Image Editing via Pretrained Diffusion Model}},
author={Kunho Kim and Sumin Seo and Yongjun Cho and Hyungjin Chung},
booktitle={CVPR 2nd Workshop on Human-Interactive Generation and Editing},
year={2026},
}