karpathy / llm.c
LLM training in simple, raw C/CUDA
See what the GitHub community is most excited about this month.
GPU accelerated decision optimization
Tile primitives for speedy kernels
Mirage Persistent Kernel: Compiling LLMs into a MegaKernel
RAPIDS Accelerator JNI For Apache Spark
cuVS - a library for vector search and clustering on the GPU
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
How to optimize some algorithms in CUDA.
Instant neural graphics primitives: lightning fast NeRF and more
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.
DeepGEMM: clean and efficient BLAS kernel library on GPU
CUDA Kernel Benchmarking Library
Lightning fast differentiable SSIM.
NCCL Tests
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl