Question

I am working on an experimental project in which a large language model (LLM) automatically optimizes the nested loops of CUDA kernels. The key requirement is to extract static loop and memory-access features directly from CUDA source code into a structured, LLM-friendly representation (e.g., loop bounds, memory access patterns, dependencies, parallelism information).

Here’s my current setup and challenges:


1. Environment and Context

  • The CUDA kernels are handwritten, without any higher-level DSL or scheduling framework.

  • NVCC / NVIDIA Compiler SDK can generate NVVM IR or LLVM IR.

  • I am familiar with LLVM/MLIR pipelines and want to leverage them to analyze loops and memory accesses.


2. Desired Loop Features (example)

  • Loop depth, bounds, and step size

  • Array access patterns (stride, coalescing, shared/global memory usage)

  • Data dependencies (RAW/WAR/WAW)

  • Parallelism potential (thread-safety, tiling/fusion opportunities)

  • Optimization hints (tiling, loop interchange, vectorization)

The ideal output is a JSON document or structured feature vector that can be fed directly to an LLM for kernel-optimization guidance.
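
To make the target concrete, here is a minimal sketch of the kind of record I have in mind. The field names and values are purely illustrative, not the schema of any existing tool:

```json
{
  "kernel": "matmul_naive",
  "loops": [{
    "depth": 2,
    "bounds": { "lower": 0, "upper": "N", "step": 1 },
    "accesses": [
      { "array": "A", "space": "global", "stride_elems": 1,   "coalesced": true  },
      { "array": "B", "space": "global", "stride_elems": "N", "coalesced": false }
    ],
    "dependencies": [ { "kind": "RAW", "array": "C", "distance": [0, 1] } ],
    "hints": [ "tile(32,32)", "stage_B_in_shared_memory" ]
  }]
}
```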


3. Technical Questions

  1. CUDA → MLIR workflow

    • Is it necessary to go through NVVM IR / LLVM IR before lowering to the MLIR gpu/affine dialects?

    • NVCC emits PTX by default; is there a way to obtain LLVM IR or MLIR directly instead? (A possible clang-based route is sketched after this list.)

  2. Toolchain choices

    • Can LLVM analyses such as LoopInfo, ScalarEvolution, and Polly/ISL directly extract nested-loop structure and dependence information? (A minimal pass sketch follows this list.)

    • Are there any open-source tools that can generate LLM-friendly loop feature vectors?

  3. Polyhedral analysis limitations

    • How should non-affine loops be handled (conservative approximations)?

    • What is the recommended way to model the GPU built-in thread/block variables (threadIdx, blockIdx) in polyhedral analysis? (An illustration follows this list.)
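
On the CUDA → MLIR workflow: my current understanding is that clang (rather than NVCC) can compile CUDA sources directly and emit the device-side LLVM IR, which would be the entry point for LoopInfo / ScalarEvolution / Polly. This is what I plan to try (the file name and GPU architecture are placeholders); corrections welcome:

```bash
# Emit human-readable LLVM IR for the device (GPU) side only.
clang++ -x cuda --cuda-device-only --cuda-gpu-arch=sm_80 \
        -O1 -emit-llvm -S kernel.cu -o kernel.ll
```

As far as I know there is no in-tree CUDA-source → MLIR frontend, so the practical routes seem to be either LLVM IR as above or a source-level raiser such as Polygeist, which can lift CUDA C++ into MLIR's affine/gpu dialects.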
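
On the toolchain: from the docs, LoopInfo exposes the nest structure and ScalarEvolution can recover trip counts and per-access strides, with Polly/ISL adding exact dependence testing for the affine subset. I have drafted a minimal new-pass-manager pass along these lines (plugin registration, stores, and JSON emission omitted; it assumes IR produced as in the previous snippet). Does this cover the stride/coalescing part, or is Polly needed here too?

```cpp
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/Analysis/ScalarEvolutionExpressions.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/PassManager.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

struct LoopFeaturePass : PassInfoMixin<LoopFeaturePass> {
  PreservedAnalyses run(Function &F, FunctionAnalysisManager &FAM) {
    auto &LI = FAM.getResult<LoopAnalysis>(F);
    auto &SE = FAM.getResult<ScalarEvolutionAnalysis>(F);
    for (Loop *L : LI.getLoopsInPreorder()) {
      // 0 means ScalarEvolution could not prove a constant trip count.
      errs() << "loop depth=" << L->getLoopDepth()
             << " const-trip=" << SE.getSmallConstantTripCount(L) << "\n";
      for (BasicBlock *BB : L->getBlocks())
        for (Instruction &I : *BB)
          if (auto *Ld = dyn_cast<LoadInst>(&I)) {
            // If the address is an add-recurrence {base,+,step}, the step
            // is the per-iteration stride in bytes.
            const SCEV *Addr = SE.getSCEV(Ld->getPointerOperand());
            if (auto *AR = dyn_cast<SCEVAddRecExpr>(Addr))
              errs() << "  load stride(bytes)=" << *AR->getStepRecurrence(SE)
                     << "\n";
          }
    }
    return PreservedAnalyses::all();
  }
};
```

Shared vs. global usage should be visible in the same IR through the pointer's address space (3 vs. 1 for NVPTX), which the pass could also record.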
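
On the polyhedral questions: from what I have read, the usual convention is to model threadIdx/blockIdx (and blockDim/gridDim) as symbolic parameters of the iteration domain, i.e. loop-invariant unknowns with known bounds, and to over-approximate non-affine subscripts (e.g. indirect accesses like a[idx[i]]) as possibly touching the whole array, which keeps dependence results sound but coarse. Here is a hand-written illustration of how I picture the parameter view (the comments show my intended model, not tool output):

```cuda
#define N 1024

// Each thread scales one row. Inside the kernel, `row` never changes,
// so a polyhedral model can treat it as a symbolic parameter p with
// 0 <= p < gridDim.x * blockDim.x, rather than as an induction variable.
__global__ void row_scale(float a, float (*m)[N]) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  // Per-thread domain:  { [j] : 0 <= j < N }
  // Access relation:    m[p][j] -> element p*N + j (affine, since N is constant)
  for (int j = 0; j < N; ++j)
    m[row][j] *= a;
}
```

Grid-stride loops seem harder: the stride gridDim.x * blockDim.x is a product of parameters, which falls outside the affine model and presumably needs a conservative fallback.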


4. Goal

  • Perform static analysis of CUDA kernels to generate structured loop features

  • Optionally combine with runtime profiling (Nsight Compute / CUPTI) to refine the optimization hints (an example invocation is sketched after this list)

  • Produce output that can be directly used by an LLM or auto-scheduling tool for tiling/fusion/vectorization suggestions
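
On the profiling side, my plan is to start from Nsight Compute's predefined sections, which already aggregate the per-kernel memory and occupancy metrics that pair naturally with the static features. For example (section names are from recent ncu releases; the binary name is a placeholder):

```bash
# Collect memory-workload and occupancy sections for every kernel launch.
ncu --section MemoryWorkloadAnalysis --section Occupancy -o profile ./my_app
```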


5. Request for Guidance

  • Recommended pipeline or example: CUDA → LLVM IR / NVVM IR → MLIR → loop feature vector

  • Any open-source tools that directly extract loop and memory-access features from CUDA kernels

  • Best practices for handling thread/block variables in GPU polyhedral analysis
