Conversation

@mmathew23
Collaborator

The Llama MLP kernels produce NaNs with extremely long context lengths. This happens when num_elements is greater than 2**31, in which case offsets need to be calculated with tl.int64 instead of int32. This PR routes to int64 kernels when num_elements is large enough.
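
A minimal sketch of how an int64 variant can compute offsets without overflowing 32-bit arithmetic (the kernel name and the SwiGLU body below are assumptions based on the surrounding discussion, not the PR's actual code):

import triton
import triton.language as tl

@triton.jit
def _swiglu_fwd_kernel_int64(e_ptr, g_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Illustrative kernel; name and exact element-wise op are assumed, not taken from the PR.
    # Cast the program id to int64 so block_start and the offsets are computed
    # in 64-bit arithmetic even when n_elements exceeds 2**31.
    pid = tl.program_id(0).to(tl.int64)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements

    e = tl.load(e_ptr + offsets, mask=mask, other=0.0).to(tl.float32)
    g = tl.load(g_ptr + offsets, mask=mask, other=0.0).to(tl.float32)

    # SwiGLU forward: silu(e) * g
    tl.store(out_ptr + offsets, e * tl.sigmoid(e) * g, mask=mask)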

device = gate.device
out = torch.empty((batch, seq_len, hd), dtype = gate.dtype, device = device)
grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
if n_elements <= (2**31) - 1024:
Contributor

Why -1024? Is it maybe hd?

Collaborator Author

Yes, I forgot to account for hd. The idea is that I wanted to add a buffer just to be safe.

Collaborator Author

Wait, actually it is 1024, i.e. the BLOCK_SIZE.

batch_seq_len, hd = e.shape
n_elements = e.numel()
grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
if n_elements <= (2**31) - 1024:
Contributor

Maybe move (2**31) to a global var
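
One way that suggestion could look (the constant names below are hypothetical, not from the PR):

MAX_INT32 = 2**31                     # hypothetical name: offsets must stay below this to fit in int32
INT32_SAFE_LIMIT = MAX_INT32 - 1024   # hypothetical name: keep a one-BLOCK_SIZE buffer below the limit

def use_int64_kernel(n_elements: int) -> bool:
    # Route to the int64 kernel once n_elements gets too close to the int32 limit.
    return n_elements > INT32_SAFE_LIMIT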

e,
g,
n_elements,
BLOCK_SIZE: tl.constexpr,
Contributor

There is actually a way to use just one kernel and dispatch inside it, but for now this is fine - we can refactor later.
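
One possible shape of that single-kernel refactor, using a constexpr flag to choose the offset width at compile time (a sketch of the idea only; names and kernel body are illustrative, not code from this PR):

import triton
import triton.language as tl

@triton.jit
def _swiglu_fwd_kernel(e_ptr, g_ptr, out_ptr, n_elements,
                       BLOCK_SIZE: tl.constexpr, USE_INT64: tl.constexpr):
    # Illustrative single-kernel dispatch; names and element-wise op are assumed.
    pid = tl.program_id(0)
    if USE_INT64:
        # Resolved at compile time: Triton specializes the kernel per constexpr
        # value, so the int32 path pays nothing for this branch.
        pid = pid.to(tl.int64)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements

    e = tl.load(e_ptr + offsets, mask=mask, other=0.0).to(tl.float32)
    g = tl.load(g_ptr + offsets, mask=mask, other=0.0).to(tl.float32)
    tl.store(out_ptr + offsets, e * tl.sigmoid(e) * g, mask=mask)

The launcher would then pass something like USE_INT64 = n_elements > (2**31) - BLOCK_SIZE at launch time.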

mmathew23 force-pushed the tiled/contextlen branch 2 times, most recently from c008eca to 262ada3 on November 19, 2025 17:24
mmathew23 marked this pull request as ready for review on November 19, 2025 22:16
@mmathew23
Collaborator Author

> Why -1024? Is it maybe hd?

So the idea is that offsets cannot exceed 2**31 - 1, which means n_elements <= 2**31. I want to add a buffer before that point, and since we process in BLOCK_SIZE blocks rather than hidden_dim blocks, subtracting BLOCK_SIZE seemed better than subtracting hd. Plus we get the added benefit that the behavior stays consistent across models.
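
A minimal numeric check of that bound (illustrative only, assuming BLOCK_SIZE = 1024):

BLOCK_SIZE = 1024
n_elements = 2**31 - BLOCK_SIZE        # the largest value still routed to the int32 kernel

# The last block starts at the highest multiple of BLOCK_SIZE below n_elements,
# and its largest offset adds BLOCK_SIZE - 1 on top of that start.
last_block_start = (n_elements - 1) // BLOCK_SIZE * BLOCK_SIZE
largest_offset = last_block_start + BLOCK_SIZE - 1

assert largest_offset <= 2**31 - 1     # still representable with int32 offsets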

I've updated the PR to reflect your comments and finalized it. Let me know if there's anything else to address.

danielhanchen merged commit ac82560 into unslothai:main on Nov 20, 2025
1 check passed