BBuf/how-to-optim-algorithm-in-cuda

how-to-optim-algorithm-in-cuda

CUDA, GPU kernel, and AI infrastructure optimization notes.

This repository collects hands-on CUDA kernels, CUTLASS/CuTe notes, Triton examples, PTX ISA notes, PyTorch internals notes, and LLM inference/training optimization material. It is one of my main public study and engineering notebooks for GPU systems work.

Repository Map

  • cuda-kernels/: handwritten CUDA kernels for reduce, softmax, elementwise, GEMV, indexing, atomic add, upsampling, and linear attention.
  • cuda-mode/: notes and code from the CUDA-MODE lecture series.
  • cutlass/: CUTLASS and CuTe DSL notes, including GEMM, TMA, WGMMA, swizzling, and instruction-level material.
  • triton/: Triton kernels, PyTorch interop examples, and meetup notes.
  • large-language-model/: LLM serving, training, and systems optimization notes.
  • pytorch/: PyTorch internals and CUDA-related notes.
  • papers/: GPU architecture and ML systems paper notes.
  • ptx-isa/: PTX ISA study notes.
  • tools/: small helper scripts.
  • deprecated/: older material kept for reference.

Related Repositories

Status

Actively curated around CUDA kernels, LLM inference optimization, and AI infrastructure. Older Chinese-language notes are being consolidated or replaced with English entry points.
