This tutorial demonstrates how to set up and use the Z-Image Turbo diffusion model within ComfyUI, focusing on its efficiency and performance, especially for users with limited VRAM.
[00:00:01.000] - [00:00:11.000] The video begins by showcasing a real-time generated image of a woman eating ramen in a vibrant, neon-lit street scene. This serves as an example of what can be achieved with the Z-Image Turbo model.
[00:00:11.000] - [00:00:28.000] The presenter highlights that this model is optimized for speed and low VRAM usage, making it accessible even on hardware with 6-8GB of VRAM. They mention that while it’s not on par with the largest models like SDXL, it’s a “great alternative to Plus 2.”
[00:00:29.000] - [00:00:43.000] The presenter notes that by using the same prompt repeatedly, the generated images exhibit significant similarity. To achieve variations, it’s recommended to “spice up that prompt” or use an LCM (Latent Consistency Model) to iterate through different ideas quickly.
[00:00:44.000] - [00:00:59.000] The video then delves into the technical aspects, showing the ComfyUI workflow. It emphasizes the “Load Diffusion Model” node, where users can select between different model versions. The presenter points out that using the “FP8 (low vram)” version can improve performance on less powerful machines.
[00:01:00.000] - [00:01:04.000] The presenter displays a performance metric showing image generation times, with the FP8 model generating an image in “1.950s”.
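The BF16-vs-FP8 decision the presenter describes can be sketched as a small helper. This is only an illustration: the file names and the 12 GB cutoff are assumptions, not from the video, which only recommends FP8 for roughly 6-8 GB cards.

```python
def pick_checkpoint(vram_gb: float) -> str:
    """Choose a Z-Image Turbo variant by available VRAM.

    File names are hypothetical placeholders, and the 12 GB threshold
    is an assumption -- the video only suggests FP8 for ~6-8 GB cards.
    """
    if vram_gb < 12:
        return "z_image_turbo_fp8.safetensors"   # smaller, faster on low-VRAM cards
    return "z_image_turbo_bf16.safetensors"      # higher-precision weights

print(pick_checkpoint(8))   # low-VRAM card selects the FP8 file
print(pick_checkpoint(24))  # high-VRAM card selects the BF16 file
```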
[00:01:05.000] - [00:01:15.000] The workflow also includes “Load CLIP” and “Load VAE” nodes, which are essential for processing text prompts and decoding the generated latent images into viewable outputs.
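In ComfyUI's API-format workflow export, these three loader nodes might look like the sketch below. The `class_type` names follow ComfyUI's API export convention, but the node IDs, file names, and the CLIP `type` value are placeholders, not taken from the video.

```python
import json

# Loader nodes as they could appear in an API-format workflow export.
# Node IDs, file names, and the CLIP "type" are illustrative placeholders.
loaders = {
    "1": {"class_type": "UNETLoader",   # "Load Diffusion Model"
          "inputs": {"unet_name": "z_image_turbo_fp8.safetensors",
                     "weight_dtype": "default"}},
    "2": {"class_type": "CLIPLoader",   # "Load CLIP" (text encoder)
          "inputs": {"clip_name": "text_encoder.safetensors",
                     "type": "stable_diffusion"}},
    "4": {"class_type": "VAELoader",    # "Load VAE"
          "inputs": {"vae_name": "vae.safetensors"}},
}
print(json.dumps(loaders, indent=2))
```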
[00:01:15.000] - [00:01:29.000] The presenter refers to a “Markdown Note” section, which provides crucial “Model links” for downloading the necessary components. These include links for the diffusion model (BF16 or FP8 versions), the text encoder, and the VAE.
[00:01:30.000] - [00:01:40.000] The presenter meticulously details the “Model Storage Location” within the ComfyUI directory structure, specifying where each downloaded model file should be placed: ComfyUI/models/diffusion_models/, ComfyUI/models/text_encoders/, and ComfyUI/models/vae/.
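The file placement described above can be summarized in a short script. The three destination folders come from the video; the example file names are placeholders, so substitute the actual files from the “Model links” note.

```python
from pathlib import Path

COMFYUI = Path("ComfyUI")  # adjust to your install root

# Destination folder for each component, per the tutorial.
# The example file names are placeholders for the downloaded files.
placement = {
    COMFYUI / "models" / "diffusion_models": "z_image_turbo_fp8.safetensors",
    COMFYUI / "models" / "text_encoders": "text_encoder.safetensors",
    COMFYUI / "models" / "vae": "vae.safetensors",
}

for folder, example_file in placement.items():
    print(f"{example_file} -> {folder}/")
```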
[00:01:40.000] - [00:01:52.000] The presenter advises users to update ComfyUI to its latest version via the ComfyUI Manager, opened with the ‘R’ (or ‘Ctrl+R’) shortcut, and then clicking the “Update All” button.
[00:01:52.000] - [00:02:08.000] The video explains the selection of the diffusion model, recommending the FP8 version for users with less VRAM for better performance.
[00:02:08.000] - [00:02:19.000] It’s also mentioned that if you already have the necessary models downloaded, you can skip the download step.
[00:02:19.000] - [00:02:29.000] The presenter shows the downloaded models being placed into their respective folders, matching the paths listed earlier: diffusion model, text encoder, and VAE.
[00:02:29.000] - [00:02:40.000] The ComfyUI models folder structure is displayed on screen, confirming the exact path for each model type.
[00:02:40.000] - [00:03:00.000] The presenter reiterates that each model file must go into its corresponding folder, and that the download step can be skipped if the files are already present.
[00:03:00.000] - [00:03:16.000] To ensure everything is up-to-date, the presenter suggests using the ComfyUI Manager to “Update All” or “Update existing.”
[00:03:16.000] - [00:03:42.000] The tutorial then shows how to configure the workflow. The “EmptySD3LatentImage” node is set to 1536x1536 for the dimensions, and the batch size remains at 1. The prompt is entered as: “A penguin with a top hat. He holds a sign that reads: ‘Penguin’. Background is tropical paradise.”
[00:03:42.000] - [00:04:01.000] With these settings, the model generates an image of a penguin wearing a top hat on a tropical beach, holding a sign that reads “Penguin.” The presenter highlights the key sampler settings: 8 steps with the “euler” sampler and “simple” scheduler.
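As a sketch, here is how these settings could map onto the core generation nodes of an API-format workflow. The 8 steps, “euler”/“simple” sampling, and the 1536x1536 latent come from the video; the node IDs, seed, and cfg value are assumptions.

```python
# Core generation nodes in API-format workflow JSON. Steps, sampler,
# scheduler, and latent size are from the video; node IDs, seed, and the
# cfg value are assumptions (a low cfg is typical for turbo-style models).
generation = {
    "5": {"class_type": "EmptySD3LatentImage",
          "inputs": {"width": 1536, "height": 1536, "batch_size": 1}},
    "3": {"class_type": "KSampler",
          "inputs": {"seed": 42,
                     "steps": 8,
                     "cfg": 1.0,
                     "sampler_name": "euler",
                     "scheduler": "simple",
                     "denoise": 1.0,
                     "model": ["1", 0],        # from the diffusion model loader
                     "positive": ["6", 0],     # from the positive text encode
                     "negative": ["7", 0],     # from the negative conditioning
                     "latent_image": ["5", 0]}},
}
print(generation["3"]["inputs"]["steps"], generation["3"]["inputs"]["sampler_name"])
```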
[00:04:01.000] - [00:04:17.000] The presenter points out that while the FP8 model is efficient, there might be a slight loss in “precision” compared to larger models, but the overall results are still “good.”
[00:04:17.000] - [00:04:44.000] The demonstration then shifts to generating an image of a cat sleeping in the snow, with a dog sleeping next to it, and a mountain background. The prompt used is: “A cat sleeping in the snow. A dog is sleeping next to it. Background is mountains. A snow leopard is also there. The sun is shining, golden hour light. A red ball is in front of the dog. The leopard wears a top hat. Green magic potion is in front of the red ball.”
[00:04:44.000] - [00:05:17.000] The presenter emphasizes that no negative prompt is used here: a “conditioning zero out” node effectively disables negative prompting, and the workflow still achieves the desired results.
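The zeroed-negative wiring can be sketched in API-format JSON: the positive prompt's conditioning is passed through ConditioningZeroOut and the result is fed into the sampler's negative input. Node IDs here are placeholders, not from the video.

```python
# ConditioningZeroOut zeroes the conditioning tensor it receives, so the
# sampler's "negative" input is satisfied without any negative prompt text.
# Node IDs are illustrative placeholders.
zero_negative = {
    "7": {"class_type": "ConditioningZeroOut",
          "inputs": {"conditioning": ["6", 0]}},  # "6" = positive CLIP Text Encode node
}
print(zero_negative["7"]["class_type"])
```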
[00:05:17.000] - [00:05:56.000] The generated image shows the detailed scene described in the prompt, with a snow leopard wearing a top hat, a cat and dog in the snow, and the mentioned props.
[00:05:56.000] - [00:06:28.000] The presenter then tries a more complex prompt: “A woman is riding a snow leopard. The snow leopard wears a top hat. Another snow leopard is lying in the snow next to it. They have a red ball in front of them and a green potion. Background is mountains.” The model successfully generates this intricate scene.
[00:06:28.000] - [00:08:17.000] The presenter highlights the ability to combine elements and details, such as placing a “red ball” and “green potion” in the scene, and modifying the positioning of subjects. They also experiment with adding more elements to the prompt, like having the leopard wear a top hat. The results are shown to be “pretty amazing.”
[00:08:17.000] - [00:09:17.000] The presenter concludes by stating that this workflow and the Z-Image Turbo model provide a great way to generate “amazing stuff” with “low VRAM” and encourages viewers to experiment and have fun with it.