Question
I am working on an experimental project where I aim to have a large language model (LLM) automatically optimize CUDA kernels’ nested loops. The key requirement is to extract static loop and memory access features directly from CUDA source code into a structured, LLM-friendly representation (e.g., loop bounds, memory access patterns, dependencies, parallelism information).
Here’s my current setup and challenges:
1. Environment and Context
The CUDA kernels are handwritten, without any higher-level DSL or scheduling framework.
NVCC (or the NVIDIA libNVVM Compiler SDK) can produce NVVM IR, which is essentially a constrained dialect of LLVM IR.
I am familiar with LLVM/MLIR pipelines and want to leverage them to analyze loops and memory accesses.
2. Desired Loop Features (example)
Loop depth, bounds, and step size
Array access patterns (stride, coalescing, shared/global memory usage)
Data dependencies (RAW/WAR/WAW)
Parallelism potential (thread-safety, tiling/fusion opportunities)
Optimization hints (tiling, loop interchange, vectorization)
The ideal output is a JSON or structured vector that can be fed directly to an LLM for kernel optimization guidance.
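To make the target concrete, here is a minimal sketch of the JSON I have in mind. The schema and field names are my own strawman, not taken from any existing tool:

```python
import json

# Strawman schema for one kernel's loop nest; all field names and the
# kernel name "gemm_tiled" are illustrative inventions.
features = {
    "kernel": "gemm_tiled",
    "loops": [
        {
            "depth": 0,
            "induction_var": "i",
            "lower_bound": 0,
            "upper_bound": "N",        # symbolic bounds kept as strings
            "step": 1,
            "accesses": [
                {
                    "array": "A",
                    "space": "global",       # global / shared / local
                    "index_expr": "i*N + k",
                    "stride_in_k": 1,        # unit stride: coalescing candidate
                },
            ],
            "dependencies": [],  # e.g. [{"kind": "RAW", "distance": [0, 1]}]
            "parallelizable": True,
            "hints": ["tile", "vectorize"],
        }
    ],
}

print(json.dumps(features, indent=2))
```

Anything along these lines would work, as long as the fields are stable enough for an LLM to consume.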
3. Technical Questions
CUDA → MLIR workflow
Is it necessary to go through NVVM IR / LLVM IR before converting to MLIR GPU/affine dialect?
NVCC emits PTX by default; is there a way to obtain LLVM IR or MLIR directly instead?
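My current understanding (please correct me if this is wrong) is that clang's CUDA frontend can emit device-side LLVM IR without going through nvcc or PTX at all. A sketch of the invocation I am planning to script, wrapped as a command builder so the flags are visible:

```python
import shlex

def device_ir_cmd(src, arch="sm_80"):
    # Build (but do not run) a clang invocation that compiles only the
    # device side of a .cu file and emits textual LLVM IR to stdout.
    # The arch value "sm_80" is just an example target.
    return [
        "clang++", "-x", "cuda", src,
        "--cuda-device-only",
        f"--cuda-gpu-arch={arch}",
        "-emit-llvm", "-S",
        "-o", "-",
    ]

print(shlex.join(device_ir_cmd("kernel.cu")))
```

If that IR is a valid starting point, the remaining question is how best to lift it into the MLIR affine/GPU dialects.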
Toolchain choices
Can LLVM's LoopInfo and ScalarEvolution analyses, or Polly/isl, directly extract nested-loop structure and dependence information?
Are there any open-source tools that can generate LLM-friendly loop feature vectors?
Polyhedral analysis limitations
How should non-affine loops be handled (conservative approximations)?
What is the recommended approach to handle GPU built-in thread/block variables in polyhedral analysis?
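The modeling I currently have in mind (and would like feedback on) is to treat threadIdx/blockIdx as symbolic affine parameters, so that a per-thread access like A[tid + i*blockDim.x] stays affine in (tid, i) and stride/coalescing falls out of the coefficients. A toy sketch of that idea, with a hand-written coefficient map standing in for real analysis output:

```python
def stride_wrt(coeffs, var):
    """coeffs maps variable name -> integer coefficient of a linear
    index expression; the stride w.r.t. `var` is just its coefficient
    (0 if the variable does not appear)."""
    return coeffs.get(var, 0)

# Hypothetical access A[1*tid + 256*i], i.e. blockDim.x == 256:
access = {"tid": 1, "i": 256}

print(stride_wrt(access, "tid"))  # unit stride across a warp: coalesced
print(stride_wrt(access, "i"))    # inter-iteration stride of 256
```

Is this parameter treatment what Polly-ACC-style GPU polyhedral work does, or is there a better-established convention?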
4. Goal
Perform static analysis of CUDA kernels to generate structured loop features
Optionally combine with runtime profiling (Nsight Compute / CUPTI) for optimization hints
Produce output that can be directly used by an LLM or auto-scheduling tool for tiling/fusion/vectorization suggestions
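For the optional profiling step, the merge I picture is just attaching runtime counters to the static record before handing it to the LLM. A made-up example (the metric names here are illustrative placeholders, not actual Nsight Compute metric identifiers):

```python
# Static features from the analysis pass, plus runtime counters from a
# profiler run; both dicts below are fabricated illustrations.
static = {"kernel": "gemm", "loop_depth": 3, "global_stride": 1}
profile = {"achieved_occupancy": 0.62, "dram_throughput_pct": 41.0}

# One merged record per kernel, with profiling kept in its own sub-object
# so the static schema stays valid when no profile is available.
record = {**static, "profile": profile}
print(record)
```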
5. Request for Guidance
Recommended pipeline or example: CUDA → LLVM IR / NVVM IR → MLIR → loop feature vector
Any open-source tools to directly extract loops + memory access features from CUDA kernels
Best practices for handling thread/block variables in GPU polyhedral analysis