Seattle, Washington, United States
744 followers
500+ connections
Experience & Education
-
Google
Explore more posts
James Hongyi Zeng
Meta • 2K followers
Last week at PyTorch Conference 2025, we announced we are open sourcing torchcomms and NCCLX/CTran. Today, we share more details about the NCCLX/CTran design and how we used them in production GenAI training and inference. Check out our white paper on this topic: https://lnkd.in/gySuXi6Y

Some features we covered in this paper:
- Host-driven collectives
- Zero-copy data transfer
- CTran/network co-design (DQPLB)
- Zero-copy and SM-free Send/Receive for PP
- RMA Put for TP
- Fault-tolerant AllReduce
- GPU-resident collectives for EP
- Low-latency optimization
- Scalable initialization in training
- GPU memory management for comms
- Fault localization and performance observability
- CPU emulation

This paper covers years of innovation and production experience from the GPU communication teams at Meta, supporting generations of Llama models in training and inference. Hope you enjoy reading it!
313 · 9 Comments
Mrukant Popat
Yantra • 5K followers
Mixture of Experts (MoE): Scaling LLMs Efficiently with Sparse Computation

Large Language Models (LLMs) continue to grow in size, pushing the limits of AI capabilities but also introducing challenges in cost, memory, and inference speed. Mixture of Experts (MoE) offers an innovative approach by using sparse computation, activating only a subset of parameters per input. Let's explore recent advances in MoE architectures and how models like DeepSeek-v2 and DeepSeek-v3 are optimizing efficiency.

🔹 Challenges in MoE: Routing Bottlenecks & Performance Trade-offs
While MoE improves efficiency, it also faces key challenges:
- Token dropping in long sequences: OpenMoE struggles with routing stability, sometimes losing tokens in long sequences.
- Fixed routing in pretraining: early routing patterns can be inefficient post-training.
- Domain shift issues: MoE models may struggle to generalize across different data distributions. A recommended solution is incorporating instruction-following data in pretraining to enhance routing adaptability.

🚀 DeepSeek MoE: Smarter Scaling for AI Models
The DeepSeek series addresses these issues with innovative optimizations:

🔸 DeepSeek-v2: 236B parameters, 21B active
1️⃣ Multi-Head Latent Attention (MLA): cuts memory use by 93% with efficient KV cache storage.
2️⃣ Fine-grained expert allocation: balances shared and specialized experts across devices.
3️⃣ Device-level load balancing loss: ensures even routing across devices, improving stability.

🔸 DeepSeek-v3: a 671B-parameter model with new enhancements
1️⃣ Multi-Token Prediction (MTP): predicts multiple tokens at once for better efficiency.
2️⃣ Auxiliary-loss-free load balancing: dynamically adjusts expert selection without added inefficiencies.
3️⃣ FP8 mixed-precision training: reduces training costs significantly (~$5.6M for full training).
4️⃣ Extensive post-training: includes context extension (128K tokens), SFT, RLHF, and knowledge distillation.

📊 Key Takeaways
✅ Trained with 2.78M H800 GPU hours
✅ Performance rivals top closed-source LLMs
✅ Practical, scalable MoE for real-world deployment

🔮 The Future of MoE: Efficient AI Scaling
MoE is revolutionizing LLM training, making sparse computation viable at scale. While early MoE models had challenges, recent breakthroughs like MLA, MTP, and smarter load balancing are proving MoE's potential. DeepSeek-v3 shows that sparse models can match dense models, signaling a shift in AI scaling strategies.

What's your take on MoE architectures? Will they define the future of AI, or do dense models still have an edge? Let's discuss! 👇

Credit: Cameron R. Wolfe, Ph.D.
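The sparse-computation idea the post describes can be sketched with a minimal top-k gate: route each token to its k highest-scoring experts and mix their outputs by a softmax over just those k logits. Everything here (`topk_moe`, the toy experts, the shapes) is an illustrative assumption, not DeepSeek's actual routing code.

```python
import numpy as np

def topk_moe(x, gate_w, experts, k=2):
    """Route each token to its top-k experts and mix their outputs.

    x:       (tokens, d_model) input activations
    gate_w:  (d_model, n_experts) router weights
    experts: list of callables, experts[i](x) -> (tokens, d_model)

    Only k of n_experts run per token; skipping the rest is where
    the compute savings of sparse MoE come from.
    """
    logits = x @ gate_w                            # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -k:]      # indices of top-k experts
    sel = np.take_along_axis(logits, top, axis=-1) # their logits
    w = np.exp(sel - sel.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)             # softmax over k only

    out = np.zeros_like(x)
    for t in range(x.shape[0]):                    # per-token dispatch
        for j in range(k):
            e = top[t, j]
            out[t] += w[t, j] * experts[e](x[t:t+1])[0]
    return out

# toy demo: 4 "experts" that just scale their input by different factors
rng = np.random.default_rng(0)
d, n = 8, 4
experts = [lambda v, s=s: s * v for s in (1.0, 2.0, 3.0, 4.0)]
x = rng.normal(size=(5, d))
y = topk_moe(x, rng.normal(size=(d, n)), experts, k=2)
```

Real systems add the pieces the post calls out on top of this gate: an auxiliary (or auxiliary-loss-free) load-balancing term so experts are used evenly, and capacity limits that cause the token dropping mentioned above.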
185 · 20 Comments
Ben (Xiaojun) Li
Microsoft • 32K followers
Google Research Director Denny Zhou, who founded the LLM Reasoning Team at DeepMind, recently gave a great talk at Stanford’s CS25 class: Large Language Model Reasoning. This was one of the most intuitive talks about reasoning model training and application. He outlined new directions for training models to handle questions where answers aren’t easily verifiable. My team is currently working on an enterprise LLM project focused on using reasoning to extract relevant context and generate responses for unverifiable cases. It’s a tough but promising area, and it’s an interesting time to be building. [Link to Denny Zhou's talk: https://lnkd.in/gdJ6Mzi8]
1,026 · 8 Comments
Zhilin Wang
NVIDIA • 1K followers
We built ProfBench to raise the bar for LLMs - literally.

At NVIDIA, we worked with domain experts to create a benchmark that goes far beyond trivia and short answers. ProfBench tests LLMs on complex, multi-step tasks that demand the kind of reasoning, synthesis, and clarity you'd expect from a PhD physicist or MBA consultant.

🌎 This isn't just a dataset drop. It's a global collaboration: 38 professionals across 8 countries contributed over 7,000 expert-written rubrics across finance MBA 💵, consulting MBA 📊, chemistry PhD 🧪, and physics PhD 🚀.

🧗 Every prompt and grading rubric was handcrafted, requiring tens of hours of dedicated and focused work.

Now fully supported in the NeMo Evaluator SDK, ProfBench enables reproducible, rubric-based evaluations and side-by-side model comparisons.

🔗 ProfBench on Hugging Face: https://lnkd.in/g2mCMcnc
🔗 NeMo Evaluator SDK: https://lnkd.in/gF6SQCwt

I'm so proud of the team that made this happen. Let's keep pushing what AI can do.

#ProfBench #LLM #AIevaluation #NeMo #NVIDIA #OpenSourceAI #AIresearch #AgenticAI #GenerativeAI #BuiltByExperts #GTCDC

Work done with Jaehun Jung, Ximing Lu, Shizhe Diao, Ellie Evans, Jiaqi Zeng, Pavlo Molchanov, Yejin Choi, Jan Kautz, and Yi Dong. Collaborators: Vivienne Zhang, Isabel Hulseman, MBA, Seph Mard, Pablo Ribalta, PhD, Grzegorz Chlebus, Wojciech Prazuch, Ankit J. Patel.
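Rubric-based evaluation like the post describes ultimately reduces each response to an aggregate over per-criterion judgments. This is a generic sketch of that aggregation, assuming weighted pass/fail criteria; it is not ProfBench's or the NeMo Evaluator SDK's actual scoring code.

```python
def rubric_score(judgments):
    """Aggregate per-criterion judgments into one score.

    judgments: list of (weight, satisfied) pairs produced by a
    grader (LLM judge or human) for a single response.
    Returns the weighted fraction of rubric criteria satisfied.
    """
    total = sum(w for w, _ in judgments)
    if total == 0:
        return 0.0
    return sum(w for w, ok in judgments if ok) / total

# e.g. a hypothetical three-criterion rubric: the heavily weighted
# criterion and one light one pass, one light one fails
score = rubric_score([(2.0, True), (1.0, False), (1.0, True)])
print(score)  # 0.75
```

Scoring responses this way, rather than by exact-match answers, is what lets a benchmark grade long-form professional work where there is no single correct output.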
205 · 3 Comments
Sameer Bhardwaj
Layrs • 43K followers
You are in a system design interview at Google for the L5 Senior Engineer role, and the interviewer leans in and asks: "Why does Spotify keep playing when I drive into a tunnel with no signal, but YouTube Music often stops or buffers? If you were designing a music streaming system, what different design choices would lead to these two behaviors?"

Here is how you break it down.

Btw, if you're preparing for system design/coding interviews, check out our mock interview tool. You can use it for free here: https://lnkd.in/gpCn7t2T
We've added new features as well:
- Company-specific interviews
- In-built interview scheduler
- Performance insights & trends

Both apps look like simple music players. Under the hood, they are optimized for very different priorities.

[1] Spotify style - cache first, stream second

Idea: the client behaves like a smart offline player that happens to stream. The backend is built to support aggressive prefetching.

What happens when you hit play:
- The client requests the track from a CDN or edge node.
- Instead of tiny chunks, it downloads a big buffer ahead of the playhead.
- In parallel, it starts pre-downloading the next few tracks in the queue.
- Data is written to an encrypted cache on disk, not just memory.
- Playback reads from that local cache, not directly from the network socket.

What this means for tunnels and bad networks:
- When the network drops, the player already has tens of seconds, or entire tracks, cached.
- Because the next one or two songs are already downloaded, you can be offline for a while and never notice.
- Cold-start cost is a bit higher: first play might take slightly longer, but then everything feels smooth.
- It burns more local storage and possibly more data, because not every prefetched song will be listened to fully.

In a design answer, you can mention:
- A local disk cache with eviction policies (LRU per user, per device).
- Background prefetch of N upcoming tracks based on the queue.
- A download manager that adapts how aggressively it prefetches based on network quality and user settings.
- A CDN tuned for larger object delivery and range requests.
- An explicit offline mode that pins playlists into the cache.

[2] YouTube Music style - stream first, cache is minimal

Idea: treat audio like video streams. Cost and bandwidth are optimized first.

When you hit play:
- The player requests audio (and sometimes video) via HLS or DASH-style chunks.
- Each chunk is only a few seconds long.
- The client keeps a small rolling buffer in memory, not a large queue on disk.
- Prefetch of future songs is limited, because video tracks are large and expensive to fetch speculatively.

What this means for tunnels and bad networks:
- If your connection dies, the player only has a few seconds of buffered data.
- As soon as those chunks are consumed, playback stalls.
- Startup can feel snappy and data usage is controlled, especially for casual listeners.
- It works very well on stable networks but feels fragile in spotty coverage.
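The cache-first design above can be sketched in a few lines: play from a local LRU cache, prefetch the next N queued tracks while online, and stall only if an uncached track is requested offline. `TrackCache` and its `fetch` callback are hypothetical names for illustration, not any real client's API.

```python
from collections import OrderedDict

class TrackCache:
    """Toy Spotify-style client cache: play from local cache,
    prefetch upcoming queue entries, evict least-recently-used."""

    def __init__(self, capacity=3, prefetch_ahead=2, fetch=None):
        self.cache = OrderedDict()     # track_id -> audio bytes, in LRU order
        self.capacity = capacity
        self.prefetch_ahead = prefetch_ahead
        self.fetch = fetch             # network fetch, e.g. a CDN range request

    def _store(self, track_id, data):
        self.cache[track_id] = data
        self.cache.move_to_end(track_id)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)        # evict LRU entry

    def play(self, queue, pos, online=True):
        """Return audio for queue[pos]; prefetch ahead while online."""
        track = queue[pos]
        if track not in self.cache:
            if not online:
                raise LookupError("offline and not cached: playback stalls")
            self._store(track, self.fetch(track))
        self.cache.move_to_end(track)             # mark as recently used
        if online:                                # background prefetch
            for nxt in queue[pos + 1 : pos + 1 + self.prefetch_ahead]:
                if nxt not in self.cache:
                    self._store(nxt, self.fetch(nxt))
        return self.cache[track]

# demo: playing track "a" online also prefetches "b" and "c",
# so "b" still plays after the connection drops (the tunnel)
fetched = []
cache = TrackCache(fetch=lambda t: fetched.append(t) or f"audio:{t}")
queue = ["a", "b", "c", "d"]
cache.play(queue, 0)                              # fetches a, b, c
offline_audio = cache.play(queue, 1, online=False)
```

The stream-first design is the same class with `prefetch_ahead=0` and a capacity of a few seconds of audio, which is exactly why it stalls once the rolling buffer drains.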
1,221 · 38 Comments
Srinivas Narayanan
97K followers
We released an improved speech-to-speech model and new API capabilities to help build production-grade voice agents. Highlights:
- The Realtime API is now GA.
- gpt‑realtime: our most advanced speech-to-speech model is more natural, expressive, and better at following complex instructions (even mid-sentence language switches) and tool calling.
- New integrations: support for remote MCP servers, image input, plus phone calling via SIP.
- Two fresh voices: Cedar and Marin.
- 20% price reduction.
https://lnkd.in/gnzHiNjd
866 · 23 Comments
Jürgen Schmidhuber
KAUST (King Abdullah… • 22K followers
Our Huxley-Gödel Machine learns to rewrite its own code, estimating its own long-term self-improvement potential. It generalizes to new tasks (SWE-Bench Lite), matching the best officially checked human-engineered agents. With Wenyi Wang, Piotr Piękos, Li Nanbo, Firas Laakom, Yimeng Chen, Mateusz Ostaszewski, Mingchen Zhuge.
arXiv: https://lnkd.in/e3zbgJQe
GitHub: https://lnkd.in/e5_UW2MK
729 · 34 Comments
Mayank Goel
OpenAI • 6K followers
Finally finished going through the much-discussed paper "AlphaGo Moment for Model Architecture Discovery." I'm thoroughly impressed, but not fully converted yet.

What's great: automated loops that ideate → implement → evaluate at scale are real productivity multipliers, and they aren't weighed down by "we've always done it this way" thinking.

Where I push back: building AI architectures isn't the same as winning at Go. Unlike AlphaGo, AI architectures may not always have a perfect, rule-based verifier. Move 37 was one of a finite set of possible moves (and however large, that finite number doesn't matter at scale). Real problems, on the other hand, are multi-objective, noisy, and not confined to a finite set of possibilities. Brute force will happily optimize for the benchmark, even if that doesn't translate to real-world value.

The biggest leverage still lies in problem selection, framing, and evaluation design, which are deeply human decisions. If research were purely compute-bound, the only winner would be Nvidia, and my "research roadmap" would just be new purchase orders for GB200s. :)

Humans set the questions, shape the rewards, design the guardrails, and extract the theory. At least for now, this automation is a force multiplier for that work, not a substitute.

#AI https://lnkd.in/gZf8ZyEf #ArtificialIntelligence
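The benchmark-overfitting worry in the post is easy to make concrete: an automated ideate → implement → evaluate loop maximizes whatever scoring function you hand it, so the human-designed evaluation is where the leverage sits. A toy sketch (generic, not the paper's actual system; the candidate format and benchmark are made up):

```python
def automated_search(candidates, benchmark):
    """Score every candidate 'architecture' on the benchmark and
    return the best one with its score. The loop has no notion of
    real-world value beyond what benchmark() encodes."""
    best = max(candidates, key=benchmark)
    return best, benchmark(best)

# toy benchmark that rewards larger hidden sizes but saturates at 300:
# the search still picks the biggest (most expensive) model, because
# nothing in the score penalizes cost once the benchmark plateaus
cands = [{"hidden": h} for h in (64, 128, 256, 512)]
best, score = automated_search(cands, lambda c: min(c["hidden"], 300))
print(best, score)  # {'hidden': 512} 300
```

Swapping in a benchmark with a cost term changes the winner, which is exactly the "evaluation design is the human decision" point.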
11 · 1 Comment