Just an idea.
It's not a problem or anything...
I've been using a custom offload for my potato GPU. Maybe there is another way to do it or so...
In short, I've been using sequential offloading for a long time. When I enable it, it uses a minimal amount of VRAM, but I know it could use more VRAM to do less IO. So I created a mixin for partial CPU offload, where the model keeps several layers on the GPU and offloads only the rest.
See code here: https://gist.github.com/rodjjo/20e2e842fea9ed58114adb560a4566b6
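Roughly, the idea looks like this (a simplified sketch of the mixin, not the exact gist code; `go_gpu`, `enable_partial_cpu_offload` and the class attributes match the usage below, but the internals here are only illustrative):

```python
import torch


class PartialOffloadMixin:
    LAYERS_KEEP_GPU = 0            # how many layers stay resident on the GPU
    MODEL_ATTR_NAME = ""           # inner module holding the layer stack, e.g. "model"
    MODEL_LAYERS_ATTR_NAME = "layers"
    OFFLOAD_ON_CALL = False        # wrap forward() in go_gpu(True)/go_gpu(False)

    def _layers(self):
        inner = getattr(self, self.MODEL_ATTR_NAME) if self.MODEL_ATTR_NAME else self
        return getattr(inner, self.MODEL_LAYERS_ATTR_NAME)

    def go_gpu(self, enable: bool):
        # Only the offloaded tail of the layer stack moves; the first
        # LAYERS_KEEP_GPU layers never leave the GPU, which is where the
        # IO saving over full sequential offloading comes from.
        target = self._gpu if enable else torch.device("cpu")
        for layer in self._layers()[self.LAYERS_KEEP_GPU:]:
            layer.to(target)

    def enable_partial_cpu_offload(self, device="cuda"):
        self._gpu = torch.device(device)
        # Naive placement for brevity: put everything on the GPU, then park
        # the offloaded tail on the CPU (a careful implementation would avoid
        # the transient full-model VRAM spike this causes).
        self.to(self._gpu)
        self.go_gpu(False)
        if self.OFFLOAD_ON_CALL:
            inner_forward = self.forward

            def wrapped_forward(*args, **kwargs):
                self.go_gpu(True)
                try:
                    return inner_forward(*args, **kwargs)
                finally:
                    self.go_gpu(False)

            # Instance attribute shadows the class-level forward.
            self.forward = wrapped_forward
```

The point is that the pinned head never moves, so each call only pays the IO for the offloaded tail, and calling `go_gpu(True)`/`go_gpu(False)` around a loop pays it once instead of on every call.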
```python
import torch
from transformers import Qwen3ForCausalLM

# PartialOffloadMixin comes from the gist linked above.
class MyQwen3ForCausalLM(Qwen3ForCausalLM, PartialOffloadMixin):
    LAYERS_KEEP_GPU = 22
    MODEL_ATTR_NAME = "model"
    MODEL_LAYERS_ATTR_NAME = "layers"
    OFFLOAD_ON_CALL = True

model = MyQwen3ForCausalLM.from_pretrained(
    repo_id,
    subfolder="text_encoder",
    local_files_only=True,
    torch_dtype=torch.bfloat16,
)
model.eval()
model.enable_partial_cpu_offload()

# Pseudo code of inference: the call is wrapped so it runs
# go_gpu(True) before the forward pass and go_gpu(False) after.
result = model(...)
```
Example with a transformer:
```python
import torch
from diffusers import ZImageTransformer2DModel

class MyZImageTransformer(ZImageTransformer2DModel, PartialOffloadMixin):
    MODEL_LAYERS_ATTR_NAME = "layers"
    LAYERS_KEEP_GPU = 22

model = MyZImageTransformer.from_pretrained(
    repo_id,
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)
model.eval()
model.enable_partial_cpu_offload()

# Denoise step (pseudo code): keep the offloaded layers on the GPU
# for the whole loop instead of moving them on every call.
model.go_gpu(True)
while denoising:
    predicted = model(...)
model.go_gpu(False)
```

It's saving me 12 to 13 seconds of inference in Z-Image Turbo (my custom pipeline with this partial layer offloading).