Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts

Yuyang Zhao1 Enze Xie2 Lanqing Hong3 Zhenguo Li3 Gim Hee Lee1

1 National University of Singapore 2 The University of Hong Kong
3 Huawei Noah's Ark Lab




Text-driven image and video diffusion models have achieved unprecedented success in generating realistic and diverse content. Recently, editing and varying existing images and videos with diffusion-based generative models has garnered significant attention. However, previous works are limited to editing content with text or to coarse personalization from a single visual clue, rendering them unsuitable for indescribable content that requires fine-grained and detailed control. To this end, we propose a generic video editing framework called Make-A-Protagonist, which utilizes both textual and visual clues to edit videos, with the goal of empowering individuals to become the protagonists. Specifically, we leverage multiple experts to parse the source video and the target visual and textual clues, and propose a visual-textual video generation model that employs mask-guided denoising sampling to generate the desired output. Extensive results demonstrate the versatile and remarkable editing capabilities of Make-A-Protagonist.
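The core idea behind mask-guided denoising sampling can be illustrated with a minimal sketch: at each denoising step, the latent for the edited region (driven by the visual and textual clues) is blended with the latent of the source video, so that edits apply only inside the protagonist mask while the rest of the frame is preserved. The code below is a simplified, hypothetical illustration of this blending step, not the paper's actual implementation; `mask_guided_step` and the toy latents are assumptions for demonstration.

```python
import numpy as np

def mask_guided_step(edit_latent, source_latent, mask):
    """Blend two latents at one denoising step: keep the generated
    content inside the mask and the source-video content outside it.
    This is a simplified sketch of mask-guided denoising sampling."""
    return mask * edit_latent + (1.0 - mask) * source_latent

# Toy example: a 1x4x4 latent where only the top-left 2x2 region is edited.
rng = np.random.default_rng(0)
edit = rng.normal(size=(1, 4, 4))     # latent proposed by the editing branch
source = rng.normal(size=(1, 4, 4))   # latent carrying the source video
mask = np.zeros((1, 4, 4))
mask[:, :2, :2] = 1.0                 # protagonist region to be edited

blended = mask_guided_step(edit, source, mask)
```

Inside the masked region `blended` equals the edited latent, and outside it the source latent is untouched; iterating this blend across all sampling steps keeps the background consistent with the original video.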




The first framework for generic video editing with both visual and textual clues.



Text-to-Video Editing with Protagonist

Protagonist Editing

Background Editing



@article{zhao2023makeaprotagonist,
        title={Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts},
        author={Zhao, Yuyang and Xie, Enze and Hong, Lanqing and Li, Zhenguo and Lee, Gim Hee},
        journal={arXiv preprint arXiv:2305.08850},
        year={2023}
}