Question
I am working on an experimental project where I aim to have a large language model (LLM) automatically optimize CUDA kernels’ nested loops. The key requirement is to extract static loop and memory access features directly from CUDA source code into a structured, LLM-friendly representation (e.g., loop bounds, memory access patterns, dependencies, parallelism information).
Here’s my current setup and challenges:
1. Environment and Context
The CUDA kernels are handwritten, without any higher-level DSL or scheduling framework.
NVCC (or the NVIDIA libNVVM Compiler SDK) can produce NVVM IR, which is essentially a constrained dialect of LLVM IR.
I am familiar with LLVM/MLIR pipelines and want to leverage them to analyze loops and memory accesses.
2. Desired Loop Features (example)
Loop depth, bounds, and step size
Array access patterns (stride, coalescing, shared/global memory usage)
Data dependencies (RAW/WAR/WAW)
Parallelism potential (thread-safety, tiling/fusion opportunities)
Optimization hints (tiling, loop interchange, vectorization)
The ideal output is a JSON or structured vector that can be fed directly to an LLM for kernel optimization guidance.
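To make the target concrete, here is a minimal sketch of the JSON I have in mind. The schema and field names are my own strawman, not taken from any existing tool:

```python
import json

# Strawman schema for one kernel's loop nest; all field names and the
# kernel name "gemm_tiled" are illustrative inventions.
features = {
    "kernel": "gemm_tiled",
    "loops": [
        {
            "depth": 0,
            "induction_var": "i",
            "lower_bound": 0,
            "upper_bound": "N",        # symbolic bounds kept as strings
            "step": 1,
            "accesses": [
                {
                    "array": "A",
                    "space": "global",       # global / shared / local
                    "index_expr": "i*N + k",
                    "stride_in_k": 1,        # unit stride: coalescing candidate
                },
            ],
            "dependencies": [],  # e.g. [{"kind": "RAW", "distance": [0, 1]}]
            "parallelizable": True,
            "hints": ["tile", "vectorize"],
        }
    ],
}

print(json.dumps(features, indent=2))
```

Anything along these lines would work, as long as the fields are stable enough for an LLM to consume.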
3. Technical Questions
CUDA → MLIR workflow
Is it necessary to go through NVVM IR / LLVM IR before converting to MLIR GPU/affine dialect?
NVCC emits PTX by default; is there a way to obtain LLVM IR or MLIR directly instead?
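My current understanding (please correct me if this is wrong) is that clang's CUDA frontend can emit device-side LLVM IR without going through nvcc or PTX at all. A sketch of the invocation I am planning to script, wrapped as a command builder so the flags are visible:

```python
import shlex

def device_ir_cmd(src, arch="sm_80"):
    # Build (but do not run) a clang invocation that compiles only the
    # device side of a .cu file and emits textual LLVM IR to stdout.
    # The arch value "sm_80" is just an example target.
    return [
        "clang++", "-x", "cuda", src,
        "--cuda-device-only",
        f"--cuda-gpu-arch={arch}",
        "-emit-llvm", "-S",
        "-o", "-",
    ]

print(shlex.join(device_ir_cmd("kernel.cu")))
```

If that IR is a valid starting point, the remaining question is how best to lift it into the MLIR affine/GPU dialects.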
Toolchain choices
Can LLVM's LoopInfo and ScalarEvolution analyses, or Polly/isl, directly extract nested-loop structure and dependence information?
Are there any open-source tools that can generate LLM-friendly loop feature vectors?
Polyhedral analysis limitations
How should non-affine loops be handled (conservative approximations)?
What is the recommended approach to handle GPU built-in thread/block variables in polyhedral analysis?
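The modeling I currently have in mind (and would like feedback on) is to treat threadIdx/blockIdx as symbolic affine parameters, so that a per-thread access like A[tid + i*blockDim.x] stays affine in (tid, i) and stride/coalescing falls out of the coefficients. A toy sketch of that idea, with a hand-written coefficient map standing in for real analysis output:

```python
def stride_wrt(coeffs, var):
    """coeffs maps variable name -> integer coefficient of a linear
    index expression; the stride w.r.t. `var` is just its coefficient
    (0 if the variable does not appear)."""
    return coeffs.get(var, 0)

# Hypothetical access A[1*tid + 256*i], i.e. blockDim.x == 256:
access = {"tid": 1, "i": 256}

print(stride_wrt(access, "tid"))  # unit stride across a warp: coalesced
print(stride_wrt(access, "i"))    # inter-iteration stride of 256
```

Is this parameter treatment what Polly-ACC-style GPU polyhedral work does, or is there a better-established convention?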
4. Goal
Perform static analysis of CUDA kernels to generate structured loop features
Optionally combine with runtime profiling (Nsight Compute / CUPTI) for optimization hints
Produce output that can be directly used by an LLM or auto-scheduling tool for tiling/fusion/vectorization suggestions
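For the optional profiling step, the merge I picture is just attaching runtime counters to the static record before handing it to the LLM. A made-up example (the metric names here are illustrative placeholders, not actual Nsight Compute metric identifiers):

```python
# Static features from the analysis pass, plus runtime counters from a
# profiler run; both dicts below are fabricated illustrations.
static = {"kernel": "gemm", "loop_depth": 3, "global_stride": 1}
profile = {"achieved_occupancy": 0.62, "dram_throughput_pct": 41.0}

# One merged record per kernel, with profiling kept in its own sub-object
# so the static schema stays valid when no profile is available.
record = {**static, "profile": profile}
print(record)
```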
5. Request for Guidance
Recommended pipeline or example: CUDA → LLVM IR / NVVM IR → MLIR → loop feature vector
Any open-source tools to directly extract loops + memory access features from CUDA kernels
Best practices for handling thread/block variables in GPU polyhedral analysis