AI Execution Stack: JAX vs PyTorch 2.x

🎯 AI Execution Stack (JAX vs PyTorch 2.x): From Model to Machine Code

Earlier this week, I attended the 2025 JAX & OpenXLA DevLabs, and it was incredibly insightful. The deep dives into JAX's lowering pipeline, StableHLO, and the broader OpenXLA ecosystem inspired me to visualize the full AI execution stack. Comparing JAX/XLA with the PyTorch ecosystem helped me better understand the low-level architecture of ML systems, including core concepts like intermediate representations (IRs), ML compilers, and runtime execution.

🔍 This visualization covers:
🔹 JAX → jaxpr → StableHLO → XLA (HLO) → TPU/GPU/CPU
🔹 PyTorch 2.x → FX → Inductor → Triton/NvFuser/C++ → GPU/CPU
🔹 PyTorch → ONNX → TensorRT / ONNX Runtime → GPU/CPU

It's fascinating to see how ML compilation is evolving toward a modular, backend-agnostic design, enabling portable and efficient execution across diverse hardware.

🙏 Special thanks to Han Qi (PyTorch/XLA expert) for generously sharing insights and helping clarify the internals of the stack. I'm also grateful to my teammates for the ongoing technical discussions and encouragement.

💬 Feel free to share feedback or correct anything in the diagram; I'm still learning too!

#JAX #XLA #StableHLO #HLO #PyTorch #TorchInductor #ONNX #TensorRT #AIInfrastructure #MachineLearning #DeepLearning

[Diagram: the AI execution stack, comparing the JAX/XLA, PyTorch 2.x, and ONNX paths]
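
To make the JAX path concrete, here is a minimal sketch in Python, assuming a recent JAX release with the ahead-of-time lowering API; the toy function f is my own illustration, not from the post. It shows one program at each stage: Python → jaxpr → StableHLO → compiled executable.

import jax
import jax.numpy as jnp

# Toy function (hypothetical example, not from the original post).
def f(x):
    return jnp.tanh(x) * 2.0

x = jnp.ones((4,))

# Stage 1: trace the Python function into JAX's functional IR, a jaxpr.
print(jax.make_jaxpr(f)(x))

# Stage 2: lower through jax.jit to StableHLO, the portable IR the
# OpenXLA stack consumes before compiling for TPU/GPU/CPU.
lowered = jax.jit(f).lower(x)
print(lowered.as_text())  # StableHLO module as MLIR text

# Stage 3: compile with XLA (HLO optimization happens here) and run.
compiled = lowered.compile()
print(compiled(x))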
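
For the PyTorch 2.x path, a similar sketch, assuming PyTorch >= 2.0; the toy function and the print_fx backend name are mine. torch.compile accepts a custom backend callable, which is a convenient way to see the FX graph TorchDynamo captures; the default "inductor" backend lowers that same graph to Triton kernels on GPU or C++ on CPU.

import torch

def f(x):
    return torch.tanh(x) * 2.0

x = torch.ones(4)

# A custom torch.compile backend: receives the FX graph Dynamo traced.
def print_fx(gm: torch.fx.GraphModule, example_inputs):
    print(gm.graph)    # the captured FX IR
    return gm.forward  # run the graph unoptimized

traced = torch.compile(f, backend=print_fx)
traced(x)

# Default path: the "inductor" backend compiles the same FX graph to
# Triton (GPU) or C++ (CPU) kernels.
fast = torch.compile(f)
print(fast(x))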
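
Finally, the export path, sketched under the assumption that torch and onnxruntime are installed; the two-layer model and the model.onnx filename are placeholders, not from the post. torch.onnx.export serializes the graph to ONNX, and ONNX Runtime (or TensorRT, which consumes the same format) executes it outside the PyTorch runtime.

import numpy as np
import torch
import onnxruntime as ort

# Placeholder model, not from the original post.
model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Tanh())
x = torch.ones(1, 4)

# Serialize the traced graph to the ONNX interchange format.
torch.onnx.export(model, (x,), "model.onnx")

# Execute the exported graph with ONNX Runtime instead of PyTorch.
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
print(sess.run(None, {input_name: np.ones((1, 4), dtype=np.float32)}))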

Great post! I can relate to a quantization task I worked on recently.
