EditCrafter can edit images at resolutions up to 4K by leveraging pre-trained text-to-image diffusion models, without additional fine-tuning or optimization.
We propose EditCrafter, a tuning-free high-resolution image editing method that leverages pretrained text-to-image (T2I) diffusion models to process images at resolutions far exceeding those seen during training. The generative priors of large-scale T2I diffusion models enable a wide array of novel generation and editing applications. Although numerous diffusion-based image editing methods achieve high-quality results, they are difficult to apply to images with arbitrary aspect ratios or higher resolutions, since they only operate at their training resolutions (512×512 or 1024×1024). Naively applying patch-wise editing fails, producing unrealistic object structures and repeated content. To address these challenges, we introduce EditCrafter, a simple yet effective editing pipeline. EditCrafter first performs tiled inversion, which preserves the identity of the input high-resolution image. We further propose noise-damped manifold-constrained classifier-free guidance (NDCFG++), tailored for high-resolution image editing from the inverted latent. Our experiments show that EditCrafter achieves impressive editing results across various resolutions without fine-tuning or optimization.
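NDCFG++ builds on classifier-free guidance (CFG). Its exact noise-damped, manifold-constrained form is specific to this work and is not spelled out here, but the vanilla CFG baseline it modifies can be sketched as follows (a minimal illustration, not the authors' code; the function name and the stub noise-estimator interface are our assumptions):

```python
import torch

def cfg_noise(eps_model, z, t, cond, w=7.5):
    """Vanilla classifier-free guidance: extrapolate from the
    unconditional noise prediction toward the text-conditional one
    with guidance scale w."""
    eps_uncond = eps_model(z, t, None)   # prediction without the text prompt
    eps_cond = eps_model(z, t, cond)     # prediction with the text prompt
    return eps_uncond + w * (eps_cond - eps_uncond)
```

In a real pipeline, `eps_model` would be the frozen Stable Diffusion U-Net and `cond` the encoded editing prompt; NDCFG++ replaces this plain extrapolation with a guided update tailored to high-resolution editing.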
Since the noise estimator (U-Net) in Stable Diffusion is trained on low-resolution images, directly inverting an encoded high-resolution image z0 = E(x0) into a high-resolution latent zt for subsequent editing results in poor identity preservation. Therefore, we first perform tiled DDIM inversion to obtain a high-resolution latent representation. From this latent, the reverse diffusion process is carried out with a re-dilated noise estimator. To enhance the quality of text-guided editing, we propose noise-damped manifold-constrained classifier-free guidance (NDCFG++). In this figure, the editing prompt P is “A raccoon peeking out from behind a bush”.
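The tiled-inversion idea above can be illustrated with a short sketch: deterministic DDIM inversion where each noise estimate is assembled from overlapping low-resolution tiles. This is not the authors' implementation; the function names, tile size, and simple averaging of overlapping predictions are our assumptions (the actual pipeline may blend tiles differently and uses the frozen Stable Diffusion U-Net as `eps_model`):

```python
import torch

def tile_positions(latent, tile, stride):
    """Top-left (y, x) positions of overlapping tiles covering the latent."""
    _, _, H, W = latent.shape
    ys = list(range(0, max(H - tile, 0) + 1, stride))
    xs = list(range(0, max(W - tile, 0) + 1, stride))
    if ys[-1] != H - tile:
        ys.append(H - tile)  # ensure the bottom edge is covered
    if xs[-1] != W - tile:
        xs.append(W - tile)  # ensure the right edge is covered
    return [(y, x) for y in ys for x in xs]

def tiled_eps(eps_model, z, t, tile=64, stride=32):
    """Noise estimate on a high-res latent: run the low-res estimator on
    overlapping tiles and average the predictions where tiles overlap."""
    out = torch.zeros_like(z)
    count = torch.zeros_like(z)
    for y, x in tile_positions(z, tile, stride):
        patch = z[:, :, y:y + tile, x:x + tile]
        out[:, :, y:y + tile, x:x + tile] += eps_model(patch, t)
        count[:, :, y:y + tile, x:x + tile] += 1
    return out / count

def tiled_ddim_invert(eps_model, z0, alphas, tile=64, stride=32):
    """Deterministic DDIM inversion z0 -> zT using the tiled noise estimate.
    `alphas` holds the cumulative alpha-bar schedule from t=0 to t=T."""
    z = z0
    for t in range(len(alphas) - 1):
        a_t, a_next = alphas[t], alphas[t + 1]
        eps = tiled_eps(eps_model, z, t, tile, stride)
        x0_pred = (z - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        z = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return z
```

Overlapping strides (stride < tile) are what keep tile seams from appearing in the inverted latent; with non-overlapping tiles each patch would be inverted independently.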
EditCrafter generates highly faithful edited images that are well aligned with the editing prompts while preserving the intricate details of the original images.
CSD [1] frequently exhibits repeated objects due to its patch-wise generation scheme.
Using a pretrained model trained at a resolution of 512×512, our method can edit images at resolutions up to 2048×2048 without fine-tuning or optimization.
Stable Diffusion 2.1
4× (1024×1024)
...on a rainy street, reflecting city lights. → ...in a desert setting at sunset.
Stable Diffusion 2.1
4× (1024×1024)
...lying on a cozy blanket → ...lying on a grassy field
Stable Diffusion 2.1
4× (1024×1024)
moon → earth
Stable Diffusion 2.1
4× (1024×1024)
cactus → aloe
Stable Diffusion 2.1
4× (1024×1024)
...with a lemon slice → ...with a cucumber slice
Stable Diffusion 2.1
4× (1024×1024)
tulips → roses
Stable Diffusion 2.1
4× (1024×1024)
village → castle
Stable Diffusion 2.1
8× (2048×1024)
fox → lion
Stable Diffusion 2.1
8× (2048×1024)
barn owl → barn hawk
Stable Diffusion 2.1
8× (2048×1024)
palm tree → beach umbrella
Stable Diffusion 2.1
8× (2048×1024)
whale shark → dolphin
Stable Diffusion 2.1
8× (2048×1024)
...overlooking a village → ...surrounded by autumn forests
Stable Diffusion 2.1
8× (2048×1024)
elephant → zebra
Stable Diffusion 2.1
16× (2048×2048)
cat → goat
Stable Diffusion 2.1
16× (2048×2048)
...topped with berries → ...topped with roses
Stable Diffusion 2.1
16× (2048×2048)
soccer ball → crystal ball
Stable Diffusion 2.1
16× (2048×2048)
chameleon → koala
Using a pretrained model trained at a resolution of 1024×1024, our method can edit images at resolutions up to 4096×4096 without fine-tuning or optimization.
SDXL 1.0
4× (2048×2048)
dandelion seeds → balloon
SDXL 1.0
4× (2048×2048)
phoenix → chicken
SDXL 1.0
4× (2048×2048)
...overflowing with jewels → ...overflowing with bones
SDXL 1.0
4× (2048×2048)
...on an open road → ...on a desert
SDXL 1.0
8× (4096×2048)
cherry blossom → maple
SDXL 1.0
8× (4096×2048)
lion → tiger
SDXL 1.0
8× (4096×2048)
beanstalk → mushroom
SDXL 1.0
8× (4096×2048)
seashell → crab
SDXL 1.0
8× (4096×2048)
snow globe → jungle globe
SDXL 1.0
8× (4096×2048)
humpback whale → green sea turtle
SDXL 1.0
16× (4096×4096)
forest → burning forest
SDXL 1.0
16× (4096×4096)
apple → pink peach
SDXL 1.0
16× (4096×4096)
bee → hummingbird
SDXL 1.0
16× (4096×4096)
bird → owl
SDXL 1.0
16× (4096×4096)
mountain → sand dune
SDXL 1.0
16× (4096×4096)
stone → Stonehenge
SDXL 1.0
16× (4096×4096)
waterfall → lava flow
[1] Subin Kim, Kyungmin Lee, June Suk Choi, Jongheon Jeong, Kihyuk Sohn, Jinwoo Shin. Collaborative Score Distillation for Consistent Visual Synthesis. NeurIPS, 2023.
@inproceedings{
kim2026editcrafter,
title={{EditCrafter: Tuning-free High-Resolution Image Editing via Pretrained Diffusion Model}},
author={Kunho Kim and Sumin Seo and Yongjun Cho and Hyungjin Chung},
booktitle={CVPR 2nd Workshop on Human-Interactive Generation and Editing},
year={2026},
}