M&M VTO: Multi-Garment Virtual Try-On and Editing

Luyang Zhu1,2, Yingwei Li1, Nan Liu1, Hao Peng1,

Dawei Yang1, Ira Kemelmacher-Shlizerman1,2

1Google Research 2University of Washington

CVPR 2024 (Highlight)

We present M&M VTO, a mix-and-match virtual try-on method that takes as input multiple garment images, a text description of the garment layout, and an image of a person. An example input is: an image of a shirt, an image of a pair of pants, the text "rolled sleeves, shirt tucked in", and an image of a person. The output is a visualization of how those garments, in the desired layout, would look on the given person. The key contributions of our method are: 1) a single-stage diffusion-based model, with no super-resolution cascading, that mixes and matches multiple garments at 1024x512 resolution while preserving and warping intricate garment details; 2) an architecture design (VTO UNet Diffusion Transformer) that disentangles denoising from person-specific features, allowing a highly effective fine-tuning strategy for identity preservation (a 6MB model per individual vs. 4GB with, e.g., DreamBooth fine-tuning) and solving a common identity-loss problem in current virtual try-on methods; 3) layout control for multiple garments via text inputs, specifically fine-tuned over PaLI-3 for the virtual try-on task. Experimental results indicate that M&M VTO achieves state-of-the-art performance both qualitatively and quantitatively, and opens up new opportunities for language-guided and multi-garment virtual try-on.

Approach

Overview of M&M VTO. Left: Given multiple garments (top and bottom in this case; a full-body garment is not shown for this example), a layout description, and a person image, our method enables multi-garment virtual try-on. Right: With all model parameters frozen, we optimize the person feature embeddings extracted by the person encoder to improve identity preservation for a specific input image. This fine-tuning recovers information lost during the clothing-agnostic preprocessing.
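
To make this fine-tuning strategy concrete, here is a minimal sketch of the idea of optimizing only the person features while every network weight stays frozen. It assumes a PyTorch-style setup; `person_encoder`, `denoiser`, their signatures, and the noise schedule are hypothetical stand-ins rather than the released implementation.

```python
# A minimal sketch of person-feature fine-tuning (not the authors' code).
# `person_encoder` and `denoiser` are hypothetical stand-ins; person_img is
# assumed to be a (B, C, H, W) tensor of the target person image.
import torch
import torch.nn.functional as F

def finetune_person_features(person_encoder, denoiser, person_img,
                             garment_imgs, layout_emb, steps=500, lr=1e-3):
    # Freeze every network weight; only the cached person features are trained.
    for module in (person_encoder, denoiser):
        for p in module.parameters():
            p.requires_grad_(False)

    # Initialize the trainable features from the frozen person encoder.
    with torch.no_grad():
        person_feats = person_encoder(person_img)
    person_feats = person_feats.detach().clone().requires_grad_(True)

    opt = torch.optim.Adam([person_feats], lr=lr)
    for _ in range(steps):
        t = torch.rand(person_img.shape[0], device=person_img.device)  # diffusion time in [0, 1]
        noise = torch.randn_like(person_img)
        a = torch.cos(0.5 * torch.pi * t).view(-1, 1, 1, 1)            # generic cosine schedule
        z_t = a * person_img + (1.0 - a**2).sqrt() * noise             # noised target image
        pred_noise = denoiser(z_t, t, person_feats, garment_imgs, layout_emb)
        loss = F.mse_loss(pred_noise, noise)                           # standard denoising loss
        opt.zero_grad()
        loss.backward()
        opt.step()

    # The optimized features are a few MB per person, versus GBs for full-model tuning.
    return person_feats
```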

VTO-UDiT Architecture. For image inputs, UNet encoders ($\mathbf{E}_{\mathbf{z}_t}$, $\mathbf{E}_{p}$, $\mathbf{E}_{g}$) extract feature maps ($\mathcal{F}_{\mathbf{z}_t}$, $\mathcal{F}_{p}$, $\mathcal{F}_{g}^{\kappa}$) from $\mathbf{z}_t$, $I_{a}$, $I_{c}^{\kappa}$, respectively, with $\kappa \in \{\text{upper}, \text{lower}, \text{full}\}$. The diffusion timestep $t$ and garment attributes $y_{\text{gl}}$ are embedded with sinusoidal positional encoding followed by a linear layer. The resulting embeddings ($\mathcal{F}_{t}$ and $\mathcal{F}_{y_{\text{gl}}}$) either modulate features with FiLM or are concatenated to the key-value features of self-attention in the DiT blocks, following Imagen. Following TryOnDiffusion, spatially aligned features ($\mathcal{F}_{\mathbf{z}_t}$, $\mathcal{F}_{p}$) are concatenated, whereas $\mathcal{F}_{g}^{\kappa}$ are implicitly warped with cross-attention blocks. The final denoised image $\hat{\mathbf{x}}_0$ is obtained with decoder $\mathbf{D}_{\mathbf{z}_t}$, which is architecturally symmetric to $\mathbf{E}_{\mathbf{z}_t}$.
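
The fusion pattern described above can be illustrated with a short sketch: spatially aligned noisy-image and person features are concatenated along channels, FiLM applies a scale and shift derived from the timestep/layout embedding, and garment features are brought in through cross-attention (the implicit warping). The module below is an assumption-laden toy with made-up names and dimensions, not the VTO-UDiT implementation.

```python
# Illustrative sketch of the fusion pattern (not the released model).
# Channel sizes, block structure, and names are assumptions for clarity.
import torch
import torch.nn as nn

class VTOFusionBlock(nn.Module):
    # One block: concatenate spatially aligned noisy-image and person features,
    # modulate with FiLM from the (t, y_gl) embedding, then cross-attend to
    # garment features so they are implicitly warped onto the person.
    def __init__(self, dim, cond_dim, heads=8):
        super().__init__()
        self.proj_in = nn.Linear(2 * dim, dim)    # fuse concatenated z_t / person features
        self.film = nn.Linear(cond_dim, 2 * dim)  # per-channel scale and shift
        self.norm = nn.LayerNorm(dim)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, f_zt, f_person, f_garment, cond):
        # f_zt, f_person: (B, N, dim) tokens from spatially aligned feature maps
        # f_garment:      (B, M, dim) tokens from a garment encoder (upper/lower/full)
        # cond:           (B, cond_dim) embedding of diffusion step t and layout attrs y_gl
        x = self.proj_in(torch.cat([f_zt, f_person], dim=-1))
        scale, shift = self.film(cond).chunk(2, dim=-1)
        x = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        warped, _ = self.cross(query=x, key=f_garment, value=f_garment)
        return x + warped

# Usage sketch with toy tensors standing in for encoder outputs.
B, N, M, dim, cond_dim = 2, 64, 64, 128, 32
block = VTOFusionBlock(dim, cond_dim)
out = block(torch.randn(B, N, dim), torch.randn(B, N, dim),
            torch.randn(B, M, dim), torch.randn(B, cond_dim))
print(out.shape)  # torch.Size([2, 64, 128])
```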

Interactive Try-on Demo

Panels: Top Garment, Person or Try-on, Bottom Garment.

Qualitative Results

Posed Garment VTO

Columns: Input Person, Input Garments, TryOnDiffusion, Ours.

Layflat Garment VTO

Columns: Input Person, Input Garments, TryOnDiffusion, GP-VTON, LaDI-VTON, Ours-DressCode, Ours.

Person Identity Preservation

Columns: Input Person, Input Garments, Fine-tuned Full Model, Fine-tuned Person Encoder, Ours Without Fine-tuning, Ours With Fine-tuning.

Garment Layout Editing

Instructions: "Tuck in the shirt", "Tuck out the shirt", "Roll up the sleeve", "Roll down the sleeve".

Columns: Input Person, Input Garments, Imagen Editor, SDXL Inpainting, DiffEdit, InstructP2P, P2P + NI, Ours.

Dress VTO

Columns: Input Person, Input Garment, Try-on Result.

Limitations

M&M VTO has several limitations. First, our approach is not designed for layout editing tasks such as "Open the outer top", since the inputs provide no information about what should be inpainted in the newly exposed region. Second, our method struggles with garment combinations that are uncommon in the real world, such as a long coat paired with a skirt. Third, our model faces challenges when upper-body garments come from different images, e.g., pairing a shirt from one photo with an outer coat from another. This issue mainly stems from the difficulty of finding training pairs in which one image clearly shows a shirt without any cover while another shows the same shirt under an outer layer; as a result, the model struggles to accurately remove the parts of the shirt that should be covered by the outer layer at test time. Finally, our method visualizes how an item might look on a person, accounting for their body shape, but it does not yet incorporate size information nor solve for exact fit.

BibTex

@InProceedings{Zhu_2024_CVPR_mmvto,
  author    = {Zhu, Luyang and Li, Yingwei and Liu, Nan and Peng, Hao and Yang, Dawei and Kemelmacher-Shlizerman, Ira},
  title     = {{M\&M} VTO: Multi-Garment Virtual Try-On and Editing},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2024},
}

Special Thanks

This work was done when all authors were at Google. We would like to thank Chris Lee, Andreas Lugmayr, Innfarn Yoo, Chunhui Gu, Alan Yang, Varsha Ramakrishnan, Tyler Zhu, Srivatsan Varadharajan, Yasamin Jafarian and Ricardo Martin-Brualla for their insightful discussions. We are grateful for the kind support of the whole Google ARML Commerce organization. We thank Aurelia Di for her professional assistance on the garment layering Q&A survey design.