Creative Tech Digest

Multi-ControlNet & Open Source AI Video Generation


Full breakdown + takeaways from using ControlNet to make a video2video workflow. I use NeRFs here, but it'll apply to any 3D rendered or live action input. Let's get into it!

Bilawal Sidhu
Feb 27, 2023

ControlNet continues to capture the imagination of the generative AI community — myself included! This post is a continuation of my deep dive into ControlNet and its implications for creators and entrepreneurs.

ICYMI, here’s the last post, where I used ControlNet to redecorate a 3D scan of a room. I plan to make more videos + posts like the one below, so stay tuned!

AI Room Makeover: Reskinning Reality With ControlNet, Stable Diffusion & EbSynth
Hey Creative Technologists! Today we’ll be covering an AI video experiment I created to learn and prepare a deep dive on ControlNet…

🪄 Honestly though, this wave of generative AI makes me feel like I'm 11 again, discovering the VFX & 3D animation software that gave me powers of digital sorcery to blend reality & imagination. It was fun to do this interview on my journey as a YouTube creator & product manager, where I go deeper 🙏🏾

No doubt generative AI will bring out the child in all of us. It’s like we haven’t yet put together all the primitives at our disposal in every possible combination, and we’re learning new things every day.

Plus, new primitives keep dropping. A few of you realize we need multi-ControlNet, make a feature request, and bam, it’s implemented in a few days. And we didn’t even have to write a PRD or sit through reviews 😉


Now on to the workflow to make 🔥 videos with the latest in open source AI!

Bilawal Sidhu @bilawalsidhu · Feb 25, 2023

Multi-ControlNet is a game changer for making an open source video2video pipeline. I spent some time hacking this NeRF2Depth2Image workflow using a combination of ControlNet methods + SD 1.5 + EbSynth. 🧵 Full breakdown of my workflow & detailed tips shared in the thread below ⬇

Subscribe to the Creative Tech Digest and get AI workflows like this right to your inbox

Here's an overview of the workflow we're going to deconstruct! At a high level: Capture video (used my iPhone) ➡️ Train NeRF (used Luma AI) ➡️ Animate & Render RGB + Depth ➡️ Multi-ControlNet (Depth + HED) ➡️ EbSynth ➡️ Blending & Compositing. Now let's break it down step by step:
For the input, I wanted to see if I could exploit the crispy depth maps you can get out of a Neural Radiance Field (NeRF) 3D scan.
- Left: 3D flythrough rendered from a NeRF (iPhone video ➡️ trained w/ Luma AI)
- Right: the corresponding depth map (notice the immaculate detail!)
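NeRF tools typically render depth as floating-point distance, while ControlNet's depth conditioning expects an 8-bit image with near surfaces bright. Here's a minimal NumPy sketch of that normalization step (the function name and inversion convention are my own illustration, not from the thread):

```python
import numpy as np

def depth_to_controlnet_map(depth, invert=True):
    """Normalize a raw float depth render to an 8-bit map.

    ControlNet's depth conditioning reads near = bright and
    far = dark, so we invert distance after normalizing.
    """
    d = depth.astype(np.float32)
    d = (d - d.min()) / (d.max() - d.min() + 1e-8)  # scale to 0..1
    if invert:
        d = 1.0 - d  # nearest point becomes brightest
    return (d * 255).astype(np.uint8)
```

Save the result as a grayscale PNG and feed it straight into the depth ControlNet module.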
Dialing in the look was easy with ControlNet + SD. I tested different methods and liked the combination of HED boundary + Depth the most. Almost went for the GTA look lol! Next, let's turn these into smooth video, then merge the strengths of different ControlNet methods together.
With the look dialed in, I ran all video frames through ControlNet's depth module, then cherry-picked a subset to serve as keyframes for EbSynth. If you flipbook 'em, this already looks pretty good! I suspect this is why Runway's Gen-1 output is lower FPS: it hides temporal artifacts.
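The cherry-picking is a judgment call, but a common starting point is evenly spaced keyframes, densified by hand wherever the scene changes fast. A tiny sketch of that baseline (function name and step size are illustrative, not the author's exact process):

```python
def pick_keyframes(num_frames, step=10):
    """Pick evenly spaced keyframe indices for EbSynth,
    always including the first and last frame so the whole
    clip can be interpolated. Assumes num_frames >= 1."""
    keys = list(range(0, num_frames, step))
    if keys[-1] != num_frames - 1:
        keys.append(num_frames - 1)
    return keys
```

For a 10-frame step at 30 fps that's a stylized keyframe roughly every third of a second; shrink the step for fast camera moves.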
Once you have your keyframes, you can use EbSynth to interpolate between them using your original video as a guide. But if you cut naively between them, you'll notice the results are pretty jumpy, because the contents of the scene still change a fair bit between keyframes. Case in point:
The simplest way to make this less jarring is to render overlapping segments of the keyframes from EbSynth and blend them together in your video editing tool of choice. 💡 Tip: render more keyframes and more overlap than you need. You can always refine/discard later.
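In a video editor this is just two overlapping clips with an opacity ramp; programmatically, the same blend looks roughly like this (the frame-list representation and linear ramp are my own illustration):

```python
import numpy as np

def crossfade(seg_a, seg_b, overlap):
    """Blend two EbSynth segments whose last/first `overlap`
    frames cover the same source frames, ramping opacity
    linearly from seg_a to seg_b across the overlap."""
    out = list(seg_a[:-overlap])
    for i in range(overlap):
        t = (i + 1) / (overlap + 1)  # 0 -> 1 toward seg_b
        out.append((1 - t) * seg_a[len(seg_a) - overlap + i] + t * seg_b[i])
    out.extend(seg_b[overlap:])
    return out
```

A longer overlap hides the style "pop" between keyframes better, at the cost of some smearing where the two stylizations disagree.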
ControlNet methods have their pros/cons based on your subject matter:
- Left: Depth does a good job picking up the 3D structure in a scene, but struggles with textures and thinner structures
- Right: HED boundary finds all the contrasty edges on the facade graffiti textures
To get a more coherent result, we can fuse these ControlNet methods. I took the crispy depth map from my NeRF scan and used it to composite the Depth + HED boundary passes together. This gives me the spatial "foundation" of the depth pass, then layers on the edge work from HED.
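The thread doesn't spell out the exact compositing math, but a depth-as-matte fuse can be sketched like this (the function name, 0..1 conventions, and the choice to favor the depth pass on near geometry are my assumptions):

```python
import numpy as np

def fuse_passes(depth_pass, hed_pass, depth_map):
    """Composite the depth-ControlNet render and the
    HED-ControlNet render using the NeRF depth map as a matte.

    depth_pass, hed_pass: HxWx3 floats in 0..1.
    depth_map: HxW floats in 0..1, near = 1. Near geometry keeps
    the solid structure of the depth pass; the rest takes the
    edge-rich HED pass.
    """
    matte = np.clip(depth_map, 0.0, 1.0)[..., None]  # broadcast over RGB
    return matte * depth_pass + (1.0 - matte) * hed_pass
```

In a compositor this is just a luma matte: depth pass on top, HED pass underneath, NeRF depth as the mask.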
And voilà! We have our final result below. I wanted a stylized "painterly" quality, so I experimented with blending modes and liked "overlay" for adding back higher-frequency detail from ControlNet's HED pass. I'm quite happy with the end result! Very clean.
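"Overlay" has a standard formula: it multiplies in the shadows and screens in the highlights, which is why it reads as layering high-frequency detail onto the base image. A NumPy version of that standard blend (not code from the thread):

```python
import numpy as np

def overlay(base, blend):
    """Photoshop-style 'overlay' blend for floats in 0..1:
    multiply where the base is dark, screen where it is
    bright. A flat 0.5 blend layer leaves the base unchanged."""
    return np.where(base < 0.5,
                    2.0 * base * blend,
                    1.0 - 2.0 * (1.0 - base) * (1.0 - blend))
```

Here the fused composite is the base and the HED pass is the blend layer, so edges push the darks darker and the lights lighter.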
Bonus tip: I also learned that you can use the z-depth pass as an inpainting mask inside automatic1111 to create some very cool/trippy effects. This one looks like a portal opening up to the painting world of Bob Ross :)
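automatic1111's inpainting takes a black-and-white mask where white regions get repainted, so thresholding the z-depth pass gives you exactly that "portal in the distance" matte. A sketch (the threshold value and near = 1 convention are my assumptions):

```python
import numpy as np

def depth_portal_mask(depth_map, threshold=0.5):
    """Turn a 0..1 depth map (near = 1) into an 8-bit inpainting
    mask: everything farther than `threshold` becomes white (255),
    i.e. gets repainted, opening a 'portal' in the far scene
    while near geometry is kept."""
    return (depth_map < threshold).astype(np.uint8) * 255
```

Slide the threshold toward the camera to grow the portal, or blur the mask slightly for a softer transition at the portal's rim.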

And that's a wrap! I’d love to keep sharing what I'm learning with the open source AI & creator community, so if you found this helpful I'd appreciate it if you:

1. RT this thread or share this article with your creative tech frenz

2. Follow me on Twitter for more dank content

3. And if you aren’t already, subscribe below to get these right to your inbox

Creative Technology Digest by Bilawal Sidhu

© 2023 Bilawal "Billyfx" Sidhu