How to Use Flux 2.0: Local Text-to-Image GGUF Model for Low VRAM

4 min read

The video showcases the use of a complex node-based workflow, likely within a tool like ComfyUI, to generate images using the FLUX diffusion models. The process involves loading various models, including a UNet model, a VAE, and CLIP text encoders, and then feeding them into a sampling pipeline.
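
For orientation, this loading stage can be sketched in ComfyUI's API-format workflow JSON, expressed below as a Python dict. The node class names follow ComfyUI core and the ComfyUI-GGUF node pack; the file names are placeholders, not necessarily the exact files used in the video.

```python
import json

# Minimal sketch of the loader stage of a ComfyUI workflow in API format.
# Node class names follow ComfyUI core / the ComfyUI-GGUF node pack;
# file names are placeholders, not the exact files from the video.
workflow = {
    "1": {"class_type": "UnetLoaderGGUF",     # GGUF UNet loader (ComfyUI-GGUF)
          "inputs": {"unet_name": "flux2-dev-Q2_K.gguf"}},
    "2": {"class_type": "CLIPLoader",         # text encoder loader
          "inputs": {"clip_name": "text_encoder.safetensors",
                     "type": "flux"}},        # "type" value depends on your ComfyUI version (assumption)
    "3": {"class_type": "VAELoader",          # VAE loader
          "inputs": {"vae_name": "flux2-vae.safetensors"}},
}
print(json.dumps(workflow, indent=2))
```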

[00:00:01.000] - [00:00:07.000] The initial part of the video demonstrates the successful generation of an image using the “smallest GGUF model”. The process took “10 seconds on my 4060 TI 16GB”.

[00:00:07.000] - [00:00:17.000] The total time to generate the image was “235 seconds”. The generated image features a fox-like character singing on a stage.

[00:00:25.000] - [00:00:34.000] The workflow involves several nodes to process images and text. Two “Load Image” nodes are used to bring images into the workflow as conditions.

[00:00:34.000] - [00:00:54.000] The video then highlights the necessary “model links” for the workflow, including links to “Flux.2 Model,” “Flux.2 GGUF Models,” “Wan.2.2 Text Encoder,” and “VAE.”

[00:00:54.000] - [00:01:15.000] The speaker explains that they need a “Flux.2 GGUF model” that fits within their “16 gigabytes” of VRAM. The largest available GGUF model is “35.5 gigabytes,” but the speaker opts for the “16 gigabyte” version.
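
If you are unsure which quant will fit, a quick PyTorch check of the card's total VRAM gives a rough estimate; this is a small illustrative snippet, with the sizes taken from the video.

```python
import torch

# Print the GPU's total VRAM so you can pick a GGUF quant that fits.
props = torch.cuda.get_device_properties(0)
total_gb = props.total_memory / 1024**3
print(f"{props.name}: {total_gb:.1f} GB VRAM")

# Per the video: the largest Flux.2 GGUF is 35.5 GB, far beyond a 16 GB card,
# so a smaller quantization (e.g. Q2_K or Q3_K_M) is used instead.
```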

[00:01:15.000] - [00:01:57.000] The speaker also mentions using the “Wan.2.2 Text Encoder,” which is “18 gigabytes,” and the “VAE” which is “336 megabytes.”

[00:01:57.000] - [00:02:15.000] The speaker notes that the “GGUF version” of the text encoder is “not out yet” but they are using the “18 gigabyte” version. They then proceed to use the “flux2-vae” which is “336 MB.”

[00:02:15.000] - [00:02:23.000] The workflow appears to be configured to generate an image based on a prompt. The prompt includes “two women hugging each other.”

[00:02:23.000] - [00:02:39.000] The generated image from the first prompt shows “two women hugging each other,” but the “fingers are fused,” and “the shoulder looks big.”

[00:02:49.000] - [00:02:54.000] The speaker then loads the “Flux2_dev_Q2_K_gguf” model into the “Unet Loader.”

[00:02:54.000] - [00:03:01.000] They mention that “if you are not using the GGUF model, then you can use the safetensors file.” They then save the model inside the “unet” folder.
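
Assuming a standard ComfyUI install, the downloaded files usually land in the model subfolders below; folder names can vary slightly between ComfyUI versions, and the file names are placeholders.

```
ComfyUI/models/
├── unet/            flux2-dev-Q2_K.gguf      (GGUF UNet, or a .safetensors UNet if not using GGUF)
├── text_encoders/   <text encoder, ~18 GB>
└── vae/             flux2-vae.safetensors    (~336 MB)
```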

[00:03:01.000] - [00:03:21.000] The workflow demonstrates how the “GGUF model goes to the guider, then to the sampler to generate the final image.”

[00:03:21.000] - [00:03:42.000] The “VAE” is used to decode the “latent space” into “pixel space.” The prompt from the “CLIP Text Encode” node is “two women hugging each other.”
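
Putting the pieces together, the text-to-image path the speaker describes (prompt → guider → sampler → VAE decode) maps roughly onto the core ComfyUI nodes below. This continues the hypothetical API-format sketch from the introduction; node IDs, seed, steps, and resolution are illustrative, not the video's exact settings.

```python
# Continuation of the hypothetical graph: prompt -> guider -> sampler -> decode.
# A reference like ["1", 0] means "output 0 of node 1" (the GGUF UNet loader above).
workflow.update({
    "4":  {"class_type": "CLIPTextEncode",
           "inputs": {"clip": ["2", 0], "text": "two women hugging each other"}},
    "5":  {"class_type": "BasicGuider",
           "inputs": {"model": ["1", 0], "conditioning": ["4", 0]}},
    "6":  {"class_type": "RandomNoise", "inputs": {"noise_seed": 0}},
    "7":  {"class_type": "KSamplerSelect", "inputs": {"sampler_name": "euler"}},
    "8":  {"class_type": "BasicScheduler",
           "inputs": {"model": ["1", 0], "scheduler": "simple",
                      "steps": 20, "denoise": 1.0}},
    "9":  {"class_type": "EmptySD3LatentImage",
           "inputs": {"width": 1024, "height": 1024, "batch_size": 1}},
    "10": {"class_type": "SamplerCustomAdvanced",
           "inputs": {"noise": ["6", 0], "guider": ["5", 0], "sampler": ["7", 0],
                      "sigmas": ["8", 0], "latent_image": ["9", 0]}},
    "11": {"class_type": "VAEDecode",        # latent space -> pixel space
           "inputs": {"samples": ["10", 0], "vae": ["3", 0]}},
    "12": {"class_type": "SaveImage",
           "inputs": {"images": ["11", 0], "filename_prefix": "flux2"}},
})
```

A graph like this can be queued through ComfyUI's HTTP API (the /prompt endpoint) or rebuilt visually in the editor.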

[00:03:42.000] - [00:04:04.000] The workflow uses “two image nodes” to incorporate “image conditions.” The prompt is fed into the “CLIP Text Encode” node, and the “VAE Encode” node converts pixels to latent space.

[00:04:04.000] - [00:04:29.000] The “ReferenceLatent” nodes are used “to give ref images.” The video also shows how to use “multiple reference images.”
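
The image-conditioning branch can be sketched the same way: each reference image is loaded, VAE-encoded into latent space, and attached to the conditioning through a ReferenceLatent node; chaining a second ReferenceLatent node is one way to pass multiple reference images. Again, this continues the hypothetical sketch rather than reproducing the video's exact graph.

```python
# Continuation of the hypothetical graph: reference images as conditions.
workflow.update({
    "20": {"class_type": "LoadImage", "inputs": {"image": "ref1.png"}},
    "21": {"class_type": "VAEEncode",         # pixels -> latent space
           "inputs": {"pixels": ["20", 0], "vae": ["3", 0]}},
    "22": {"class_type": "ReferenceLatent",   # attach first reference image
           "inputs": {"conditioning": ["4", 0], "latent": ["21", 0]}},
    "23": {"class_type": "LoadImage", "inputs": {"image": "ref2.png"}},
    "24": {"class_type": "VAEEncode",
           "inputs": {"pixels": ["23", 0], "vae": ["3", 0]}},
    "25": {"class_type": "ReferenceLatent",   # chain second reference image
           "inputs": {"conditioning": ["22", 0], "latent": ["24", 0]}},
})
# The guider would then take ["25", 0] as its conditioning instead of ["4", 0].
```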

[00:04:29.000] - [00:04:53.000] The speaker explains that if you are not using the GGUF model, you should use the safetensors file. They then switch to the “Flux2_dev_Q3_K_M.gguf” model.

[00:04:53.000] - [00:05:15.000] The “Q3 model” results in “lower quality” compared to the “original Flux model,” but it is “much faster.” The Q3 model “took 300 seconds to generate an image.”

[00:05:15.000] - [00:05:29.000] The speaker then compares the “Flux.2 model” with “Flux.3 model,” stating that “Flux.3 is much better.”

[00:05:29.000] - [00:06:14.000] The video shows how to use the “Prompting Guide” and “Quick Reference” for “FLUX text-to-image prompting.” It explains the “FLUX Prompt Framework” which uses “Subject + Action + Style + Context.”

[00:06:14.000] - [00:07:42.000] An example prompt is provided: “Red fox sitting in tall grass, wildlife documentary photography, misty dawn.” The prompt is broken down into “Subject,” “Action,” “Style,” and “Context.”
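
The framework is easy to see if the guide's example is rebuilt from its four parts (a small illustrative snippet, not from the video):

```python
# Compose a FLUX prompt from the Subject + Action + Style + Context parts.
subject = "Red fox"
action = "sitting in tall grass"
style = "wildlife documentary photography"
context = "misty dawn"

prompt = f"{subject} {action}, {style}, {context}"
print(prompt)
# -> Red fox sitting in tall grass, wildlife documentary photography, misty dawn
```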

[00:07:42.000] - [00:08:12.000] The speaker demonstrates how to create a prompt for a “product image” using a template. They then try a different model, “Flux2_dev_Q2_K_gguf,” which is “faster but lower quality.”

[00:08:12.000] - [00:08:38.000] The “Q3 model” is then tried again, and the generated image is “okay,” but not as good as expected. The speaker mentions that “more advanced models” might be needed for better results.

[00:08:38.000] - [00:09:19.000] Finally, the speaker refers to the “Prompting Guide” for more information on “FLUX prompting” and demonstrates how to use “JSON structured prompting” for more control over image generation. They also show examples of “Typography and Design” and “Infographics and Data Visualization” using FLUX.
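
As an illustration, JSON structured prompting passes the same kind of fields as a JSON object instead of a single sentence; the field names below are assumptions, not the guide's exact schema.

```python
import json

# Illustrative JSON-structured prompt; field names are assumed, not the guide's schema.
structured_prompt = {
    "subject": "red fox",
    "action": "sitting in tall grass",
    "style": "wildlife documentary photography",
    "context": "misty dawn",
}
# The serialized JSON string is what gets pasted in as the prompt text.
print(json.dumps(structured_prompt, indent=2))
```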