Sayak Paul’s Post

I am bullish and biased, but the best way to use flash attention 3 or 4 is via 🤗 kernels:

```
from kernels import get_kernel

kernel_module = get_kernel("kernels-community/flash-attn3", version=1)
flash_attn_func = kernel_module.flash_attn_func
flash_attn_func(...)
```
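
If you want a safety net for machines where the kernel can't be loaded, here is a minimal sketch of a fallback wrapper. Everything in it besides the `get_kernel(repo_id, version=...)` call shape and the `flash_attn_func` attribute is my own assumption: `load_flash_attn`, `fake_get_kernel`, `broken_get_kernel`, and `eager_attention` are hypothetical names, and the stand-in loaders exist only so the snippet runs without a GPU or network.

```python
from types import SimpleNamespace


def load_flash_attn(get_kernel_fn, repo_id="kernels-community/flash-attn3",
                    version=1, fallback=None):
    """Fetch flash_attn_func via get_kernel_fn, or return fallback if loading fails."""
    try:
        kernel_module = get_kernel_fn(repo_id, version=version)
    except Exception:
        # Assumed policy: any load failure (no network, unsupported GPU) uses the fallback.
        return fallback
    return getattr(kernel_module, "flash_attn_func", fallback)


# Hypothetical stand-in for kernels.get_kernel that always succeeds.
def fake_get_kernel(repo_id, version):
    return SimpleNamespace(flash_attn_func=lambda *args, **kwargs: "flash")


# Hypothetical stand-in that simulates a failed load.
def broken_get_kernel(repo_id, version):
    raise RuntimeError("kernel unavailable")


# Placeholder for whatever attention implementation you already have.
def eager_attention(*args, **kwargs):
    return "eager"


attn = load_flash_attn(fake_get_kernel, fallback=eager_attention)
attn_fallback = load_flash_attn(broken_get_kernel, fallback=eager_attention)
```

In real use you would pass `get_kernel` from `kernels` as `get_kernel_fn` and your existing attention function as `fallback`. Whether silently falling back is what you want is a judgment call; for benchmarking you may prefer to let the error surface.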

Should we be replacing uv.lock and pixi.lock with... lines of Python? I'm trying to think: would this help us reproduce training code from a B300 node to a B200 node and then even to an H100 node? Solving dependencies has been painful for us because build times take forever, and we may need to solve the env differently depending on the node we're running on, to properly leverage the newer instruction sets in the more recent NVIDIA chips.

/me slaps his 4090 with a large trout y u not a 5090?

This is nice! Genuine question: when is the actual kernel compiled? When you import it or is there some sort of JIT sorcery to compile when the function is called?

The get_kernel API makes this look almost too easy. The bigger question is whether version pinning stays stable enough for reproducible training environments. That's usually where these conveniences break down at scale.

Cool! Is there a way to use this in an air-gapped environment?

Great! Can we get a notebook example for this?

Open collaboration accelerates progress in machine learning. Your work on open models, evaluation, and tooling benefits both research and production.

What's the most advanced model architecture you've seen someone use in production via a Hugging Face kernel?

This said don't we all hate how we can't zoom on photos with LinkedIn on a phone?

The CMake errors are a rite of passage at this point 😭 Didn't know kernels made this that clean. Adding this to my toolkit. Thanks for sharing!
