Trending

See what the GitHub community is most excited about today.

deepseek-ai / DeepGEMM

DeepGEMM: clean and efficient BLAS kernel library on GPU

Cuda, 7,459 stars, 1,079 forks

9 stars today

alibaba / rtp-llm

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

Cuda, 1,244 stars, 222 forks

5 stars today

thu-ml / SageAttention

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.

Cuda, 3,451 stars, 437 forks

3 stars today

BBuf / how-to-optim-algorithm-in-cuda

How to optimize some algorithms in CUDA.

Cuda, 3,116 stars, 283 forks

1 star today

deepseek-ai / DeepEP

DeepEP: an efficient expert-parallel communication library

Cuda, 9,796 stars, 1,306 forks

9 stars today

NVIDIA / cub

[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl

Cuda, 1,834 stars, 462 forks

1 star today

Dao-AILab / causal-conv1d

Causal depthwise conv1d in CUDA, with a PyTorch interface

Cuda, 909 stars, 195 forks

0 stars today

mirage-project / mirage

Mirage Persistent Kernel: Compiling LLMs into a MegaKernel

Cuda, 2,347 stars, 226 forks

5 stars today

pyscf / gpu4pyscf

A plugin for using NVIDIA GPUs in the PySCF package

Cuda, 318 stars, 66 forks

0 stars today

NVIDIA / cuvs

cuVS - a library for vector search and clustering on the GPU

Cuda, 795 stars, 197 forks

1 star today

carlinds / splatad

SplatAD: Real-Time Lidar and Camera Rendering with 3D Gaussian Splatting for Autonomous Driving

Cuda, 400 stars, 30 forks

0 stars today

NVIDIA / nccl-tests

NCCL Tests

Cuda, 1,570 stars, 385 forks

1 star today

HazyResearch / ThunderKittens

Tile primitives for speedy kernels

Cuda, 3,501 stars, 299 forks

4 stars today