Diffusion models, LoRA fine-tuning, ControlNet, inpainting, image generation pipelines, and ComfyUI workflows.
Hugging Face Diffusers is the go-to library for diffusion models. It provides pre-trained models for image generation, inpainting, super-resolution, and more.
from diffusers import StableDiffusionXLPipeline
import torch
# ── Basic Text-to-Image ──
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")
image = pipe(
    "A serene Japanese garden with cherry blossoms",
    num_inference_steps=30,
    guidance_scale=7.5,
    width=1024,
    height=1024,
).images[0]
image.save("garden.png")
# ── Image-to-Image ──
# SDXL checkpoints need the XL pipeline classes
from diffusers import StableDiffusionXLImg2ImgPipeline
from PIL import Image
i2i_pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    torch_dtype=torch.float16,
).to("cuda")
input_image = Image.open("garden.png").convert("RGB")
result = i2i_pipe(
    prompt="Oil painting style",
    image=input_image,
    strength=0.7,  # 0.0 = keep original, 1.0 = ignore original
    num_inference_steps=30,
).images[0]
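What `strength` actually does in diffusers' img2img is skip the early part of the noise schedule: the input image is noised partway, and only the last `strength` fraction of `num_inference_steps` runs. A minimal sketch of that bookkeeping, mirroring the pipeline's internal timestep logic:

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    # diffusers' img2img noises the input image `strength` of the way along
    # the schedule, then denoises from there -- so only this many steps run.
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    return init_timestep

print(effective_steps(30, 0.7))  # 21 of the 30 requested steps actually run
print(effective_steps(30, 1.0))  # 30: full generation, input image ignored
```

This is why very low `strength` values look almost unchanged: with `strength=0.1` and 30 steps, only 3 denoising steps execute.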
# ── Inpainting ──
from diffusers import StableDiffusionXLInpaintPipeline
inpaint_pipe = StableDiffusionXLInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-inpainting-1.0",
    torch_dtype=torch.float16,
).to("cuda")
result = inpaint_pipe(
    prompt="a red sports car",
    image=base_image,    # PIL image to edit
    mask_image=mask,     # White = inpaint, Black = keep
    num_inference_steps=30,
).images[0]

| Parameter | Default | Description | Effect |
|---|---|---|---|
| num_inference_steps | 50 | Number of denoising steps | More = better quality, slower (20-50 ideal) |
| guidance_scale | 7.5 | How closely to follow prompt | Higher = more prompt adherence, less creative |
| width / height | 512 | Output image dimensions | Must be divisible by 8. 1024x1024 for SDXL |
| negative_prompt | None | What to avoid | Reduce artifacts: "blurry, ugly, bad quality" |
| seed | random | Reproducibility | Same seed = same image (deterministic) |
| strength | 0.8 | Img2Img transformation amount | 0.0=original, 1.0=fully new |
| num_images_per_prompt | 1 | Images per generation | Batch multiple images |
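The `seed` row works through `torch.Generator`: the pipeline draws its initial latent noise from the generator you pass, so a fixed seed yields identical starting noise and hence an identical image. A small sketch of the mechanism (the actual pipeline call is shown as a comment):

```python
import torch

def seeded_latents(seed: int, shape=(1, 4, 128, 128)) -> torch.Tensor:
    # Pipelines sample their starting latents from a torch.Generator;
    # same seed -> same noise -> same image.
    gen = torch.Generator("cpu").manual_seed(seed)
    return torch.randn(shape, generator=gen)

assert torch.equal(seeded_latents(42), seeded_latents(42))
# In a real pipeline call:
# image = pipe(prompt, generator=torch.Generator("cuda").manual_seed(42)).images[0]
```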
Low-Rank Adaptation (LoRA) lets you train custom styles and concepts for Stable Diffusion on a single consumer GPU with as few as 10-20 images.
import torch
from diffusers import StableDiffusionXLPipeline
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from peft import LoraConfig
# ── Train LoRA with SDXL (using diffusers + PEFT) ──
model_id = "stabilityai/stable-diffusion-xl-base-1.0"
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")
# ── Use PEFT LoRA config ──
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    init_lora_weights="gaussian",
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # UNet attention projections
)
# For SDXL fine-tuning, use accelerate for multi-GPU
# accelerate launch train_lora_sdxl.py \
# --pretrained_model_name_or_path=$MODEL_DIR \
# --instance_data_dir=$INSTANCE_DIR \
# --output_dir=$OUTPUT_DIR \
# --instance_prompt="a photo of sks person" \
# --resolution=1024 \
# --train_batch_size=1 \
# --gradient_accumulation_steps=4 \
# --learning_rate=1e-4 \
# --lr_scheduler="constant" \
# --lr_warmup_steps=0 \
# --max_train_steps=500 \
# --mixed_precision="fp16"
# ── Inference with LoRA ──
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")
pipe.load_lora_weights("./my-lora-weights", weight_name="pytorch_lora_weights.safetensors")
image = pipe("a photo of sks person as a cyborg", num_inference_steps=30,
             guidance_scale=7.5).images[0]
# ── Combine multiple LoRAs ──
# (each must first be loaded via pipe.load_lora_weights(..., adapter_name=...))
pipe.set_adapters(["style_lora", "character_lora"], adapter_weights=[0.7, 1.0])

ControlNet adds spatial conditioning to Stable Diffusion, enabling precise control over composition, pose, depth, and edge structure of generated images.
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel
from controlnet_aux import CannyDetector, OpenposeDetector, MidasDetector
import torch
# ── Edge/Canny ControlNet ──
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0",
    torch_dtype=torch.float16,
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")
# Detect edges
canny = CannyDetector()
control_image = canny(input_image)
result = pipe(
    "A futuristic city with neon lights",
    image=control_image,
    controlnet_conditioning_scale=0.8,
    num_inference_steps=30,
).images[0]
# ── Multi-ControlNet (pose + depth) ──
controlnet_pose = ControlNetModel.from_pretrained(
    "thibaud/controlnet-openpose-sdxl-1.0", torch_dtype=torch.float16)
controlnet_depth = ControlNetModel.from_pretrained(
    "diffusers/controlnet-depth-sdxl-1.0", torch_dtype=torch.float16)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=[controlnet_pose, controlnet_depth],  # a list is wrapped into a multi-ControlNet
    torch_dtype=torch.float16,
).to("cuda")
result = pipe(
    "A woman in a red dress walking in a garden",
    image=[pose_image, depth_image],            # one control image per ControlNet
    controlnet_conditioning_scale=[0.9, 0.7],   # one scale per ControlNet
    num_inference_steps=30,
).images[0]

| Type | Model | Use For | Input |
|---|---|---|---|
| Canny Edge | controlnet-canny-sdxl-1.0 | Edge-guided generation | Canny edge map |
| Depth | controlnet-depth-sdxl-1.0 | Depth-aware generation | Grayscale depth map |
| Pose (OpenPose) | controlnet-openpose-sdxl-1.0 | Pose-guided humans | Skeleton pose image |
| Scribble | controlnet-scribble-sdxl-1.0 | Draw your composition | Simple line scribble |
| Segmentation | controlnet-seg-sdxl-1.0 | Region-based generation | Semantic segmentation mask |
| Normal | controlnet-normal-sdxl-1.0 | Surface normal guidance | Normal map |
| Tile | controlnet-tile-sdxl-1.0 | Upscale with detail preservation | Tiled/resized image |
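The preprocessors in `controlnet_aux` produce these control inputs (e.g. `CannyDetector` wraps OpenCV's Canny). As a dependency-light illustration of what a canny-style control image is, here is a gradient-magnitude edge map in plain NumPy; this is a rough stand-in for intuition, not the real detector:

```python
import numpy as np

def simple_edge_map(gray: np.ndarray, thresh: float = 0.2) -> np.ndarray:
    # White-on-black edge map, the shape of input controlnet-canny expects:
    # threshold the gradient magnitude of a grayscale image.
    gy, gx = np.gradient(gray.astype(np.float32))
    mag = np.hypot(gx, gy)
    mag /= mag.max() + 1e-8
    return (mag > thresh).astype(np.uint8) * 255

# Synthetic image: black left half, white right half -> one vertical edge
img = np.zeros((64, 64), dtype=np.uint8)
img[:, 32:] = 255
edges = simple_edge_map(img)   # edge pixels light up around column 32
```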
ComfyUI is a node-based interface for Stable Diffusion that enables complex workflows through visual programming; it is a popular choice among power users.
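ComfyUI also runs headless: a running server accepts workflow graphs (the JSON you export with "Save (API Format)") via `POST /prompt`, by default on port 8188. A minimal sketch of queueing a job from Python; the two-node graph fragment, checkpoint filename, and server address are illustrative, not a complete workflow:

```python
import json
import urllib.request

def build_prompt_payload(workflow: dict, client_id: str = "cheatsheet") -> bytes:
    # ComfyUI's /prompt endpoint expects {"prompt": <graph>, "client_id": ...}
    return json.dumps({"prompt": workflow, "client_id": client_id}).encode("utf-8")

# Fragment of an API-format graph: node ids map to {class_type, inputs};
# ["1", 1] references output slot 1 (the CLIP model) of node "1".
workflow = {
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "sd_xl_base_1.0.safetensors"}},
    "2": {"class_type": "CLIPTextEncode",
          "inputs": {"text": "a serene Japanese garden", "clip": ["1", 1]}},
}

payload = build_prompt_payload(workflow)
# req = urllib.request.Request("http://127.0.0.1:8188/prompt", data=payload)
# urllib.request.urlopen(req)  # queues the graph on a running ComfyUI server
```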
| Model | Type | Resolution | Best For | License |
|---|---|---|---|---|
| Stable Diffusion XL 1.0 | Latent Diffusion | 1024x1024 | General image generation | OpenRAIL-M |
| Stable Diffusion 3.5 | Flow Matching | 1MP+ | Prompt adherence, text rendering | Stability Community License |
| FLUX.1 (Black Forest) | Flow Matching | 1024x1024 | Prompt following, photorealism | Apache 2.0 |
| DALL-E 3 | Diffusion (API) | 1024x1024 | Best text rendering, easy API | Proprietary |
| Midjourney v6 | Proprietary | 1024x1024 | Artistic quality, aesthetics | Commercial license |
| PixArt-Sigma | Transformer | 1024x1024 | Open-source SDXL alternative | Apache 2.0 |
| Kandinsky 3.0 | Latent Diffusion | 1024x1024 | Russian/English bilingual | Apache 2.0 |
| Playground v2.5 | Latent Diffusion | 1024x1024 | Photorealism, composition | Custom license |