StageDiffusion
A training-free, stagewise diffusion framework that decomposes complex layouts into simpler sub-scenes for better compositional coherence.
Background and Objectives
Controllable image generation is critical for interactive human-AI co-creation systems, where users require fine-grained spatial control. Existing training-free layout-to-image methods struggle with overlapping regions because they enforce mutually exclusive attention, leading to conflicting optimization signals and degraded compositional coherence.
Methods
We introduced a training-free, stagewise diffusion framework that decomposes complex layouts into simpler sub-scenes. The method progressively grounds objects through joint autoregressive sampling, with earlier stages persisting as a scene scaffold. A constraint-aware attention mechanism balances emphasis on the currently grounded object against global scene consistency.
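To make the factorization concrete, below is a minimal, self-contained PyTorch sketch of how a stagewise sampling loop with a constraint-aware attention bias might look. The names (constraint_aware_bias, stagewise_sample, dummy_denoise), the bias weights, and the toy denoiser are illustrative assumptions, not the paper's implementation; a real system would add the bias to cross-attention logits inside the UNet.

    import torch

    def constraint_aware_bias(region_masks, stage_idx, w_current=2.0, w_scaffold=0.5):
        # Build an additive bias over spatial tokens: the current stage's
        # region is emphasized (w_current), while regions grounded in earlier
        # stages keep a softer positive bias (w_scaffold) so the scaffold is
        # preserved rather than masked out with mutually exclusive attention.
        h, w = region_masks[0].shape
        bias = torch.zeros(h * w)
        for i, mask in enumerate(region_masks[: stage_idx + 1]):
            weight = w_current if i == stage_idx else w_scaffold
            bias = bias + weight * mask.flatten()
        return bias  # (h*w,), added to attention logits before the softmax

    def stagewise_sample(denoise_step, latents, region_masks, steps_per_stage=10):
        # Ground one sub-scene per stage; the bias keeps earlier stages intact.
        for stage_idx in range(len(region_masks)):
            bias = constraint_aware_bias(region_masks, stage_idx)
            for _ in range(steps_per_stage):
                latents = denoise_step(latents, attn_bias=bias)
        return latents

    if __name__ == "__main__":
        H = W = 16
        masks = [torch.zeros(H, W), torch.zeros(H, W)]
        masks[0][:8, :8] = 1.0      # sub-scene 1: top-left object
        masks[1][4:12, 4:12] = 1.0  # sub-scene 2: overlaps sub-scene 1

        def dummy_denoise(z, attn_bias):
            # Stand-in for a UNet denoising step; a real implementation would
            # inject attn_bias into the model's cross-attention layers.
            gate = torch.sigmoid(attn_bias).view(H, W)
            return z - 0.1 * z * gate

        out = stagewise_sample(dummy_denoise, torch.randn(H, W), masks)
        print(out.shape)  # torch.Size([16, 16])

Because earlier sub-scenes retain a softer positive bias instead of being masked out, overlapping regions receive compatible rather than conflicting signals, which is the intended contrast with mutually exclusive attention.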
My Role
I empirically analyzed how compositional signals emerge in intermediate representations using attention analysis and linear probing, which helped inform key design decisions in our stagewise factorization.
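As a rough illustration of the probing protocol, the sketch below fits a linear classifier to intermediate features and reports held-out accuracy; fit_linear_probe and the synthetic data are hypothetical stand-ins for activations cached from the diffusion model at a chosen layer and timestep.

    import torch
    import torch.nn as nn

    def fit_linear_probe(feats, labels, epochs=200, lr=1e-2):
        # Fit a logistic-regression probe on cached activations and report
        # held-out accuracy: above-chance accuracy suggests the compositional
        # signal is linearly decodable at that layer/timestep.
        n_train = int(0.8 * len(feats))
        probe = nn.Linear(feats.shape[1], 1)
        opt = torch.optim.Adam(probe.parameters(), lr=lr)
        loss_fn = nn.BCEWithLogitsLoss()
        for _ in range(epochs):
            opt.zero_grad()
            loss = loss_fn(probe(feats[:n_train]).squeeze(1), labels[:n_train])
            loss.backward()
            opt.step()
        with torch.no_grad():
            preds = probe(feats[n_train:]).squeeze(1) > 0
            acc = (preds == labels[n_train:].bool()).float().mean().item()
        return probe, acc

    if __name__ == "__main__":
        torch.manual_seed(0)
        # Simulated stand-ins for intermediate UNet features, labeled by a
        # binary compositional property (e.g. whether two boxes overlap).
        X = torch.randn(512, 64)
        y = (X[:, 0] + 0.5 * X[:, 1] > 0).float()  # synthetic, decodable
        _, test_acc = fit_linear_probe(X, y)
        print(f"probe test accuracy: {test_acc:.2f}")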
Results & Outputs
This work was submitted to CVPR 2026.