Training AI Models With Limited Data

Explore top LinkedIn content from expert professionals.

  • View profile for Jim Fan
    Jim Fan is an Influencer

    NVIDIA Director of AI & Distinguished Scientist. Co-Lead of Project GR00T (Humanoid Robotics) & GEAR Lab. Stanford Ph.D. OpenAI's first intern. Solving Physical AGI, one motor at a time.

    235,515 followers

    Robotics has a data scarcity problem - you simply can't scrape robot control data from webpages. Introducing GR00T-Mimic and GR00T-Gen: using both Graphics 1.0 & Graphics 2.0 to multiply your robot datasets by 1,000,000x. We trade compute for synthetic data, so we are not capped by the fundamental physical limit of 24 hrs/robot/day.

    Robotics is right in the thick of Moravec's paradox: things that are easy for humans turn out to be incredibly hard for machines. We are crushing Moravec's paradox, one token at a time.

    > Graphics 1.0: Isaac simulators with manually written, GPU-accelerated physics and rendering equations.
    > Graphics 2.0: big neural nets (Cosmos) that repaint the pixels from sim textures to real, given an open-ended prompt.

    Robot data multiplier workflow:
    1. GR00T-Teleop: use an XR device like Apple Vision Pro to map human finger poses to humanoid hands.
    2. GR00T-Mimic: given a human-collected task demonstration, we augment the actions in Isaac and filter out ones that fail the task.
    3. GR00T-Gen: apply Graphics 1.0 and then Graphics 2.0 to produce tons of visual variations.

    The above is an exponential pipeline, adding orders of magnitude at each step.
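The Mimic step (augment a human demo, filter out failures) can be sketched in a few lines of Python. Everything here is an illustrative toy, not NVIDIA's code: `check_success` stands in for Isaac's physics-based task evaluation.

```python
import random

def augment_demo(demo, n_variants, check_success, noise=0.05):
    """GR00T-Mimic-style multiplier (illustrative): perturb a demo's
    actions in simulation and keep only variants that still succeed."""
    kept = []
    for _ in range(n_variants):
        variant = [a + random.gauss(0.0, noise) for a in demo]
        if check_success(variant):  # e.g. a task-success check from the simulator
            kept.append(variant)
    return kept

# Toy stand-in for success: actions stay within the valid range.
demo = [0.2, 0.5, 0.8]
variants = augment_demo(demo, n_variants=1000,
                        check_success=lambda v: all(0.0 <= a <= 1.0 for a in v))
print(len(variants), "variants kept from 1 demonstration")
```

The filter is what keeps the multiplier honest: you only pay compute, not labeling effort, and failed augmentations never enter the dataset.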

  • View profile for Aishwarya Srinivasan
    Aishwarya Srinivasan is an Influencer
    621,610 followers

    As LLMs become the core engine behind more and more AI products, customizing them with precision becomes critical. But the question I get most often is: “Should we use Supervised Fine-Tuning (SFT) or Reinforcement Fine-Tuning (RFT)?” Let’s break it down.

    👀 Supervised Fine-Tuning (SFT)
    Use when:
    → You have a clean, labeled dataset (preferably >100k examples).
    → The task is verifiable and deterministic: think classification, factual QA, structured output.
    Why it works:
    → Efficient and reproducible.
    → Offline training with minimal infra orchestration.
    → Works well with modular fine-tuning (e.g., LoRA adapters).
    Limitations:
    → Doesn't adapt well to subjective or multi-objective tasks.
    → Plateaus with small or noisy datasets.

    ♾️ Reinforcement Fine-Tuning (RFT / RLHF)
    Use when:
    → You’re optimizing for subjective quality, human preference, or multi-turn reasoning.
    → Labeled data is limited, but you can evaluate outputs programmatically (via reward models or heuristics).
    Why it works:
    → Incorporates task-specific feedback loops (e.g., correctness, engagement, success).
    → Allows for dynamic alignment with non-differentiable objectives.
    → Crucial for tasks like dialogue, summarization, tool use, and creativity.
    Limitations:
    → More complex training pipelines.
    → Reward model design is critical and often brittle.

    Fireworks AI just dropped major updates to support both paradigms:
    💡 SFT v2: https://lnkd.in/dZM8d54N
    → Optimized for speed, multi-token training, and massive context lengths
    → Supports modular LoRA, function-calling fine-tuning, and quantization-aware training
    🚀 RFT (Beta): https://lnkd.in/dWqB8WEh
    → Simplifies RLHF for open models (Llama, Qwen, Phi, DeepSeek)
    → Write your reward function, and Fireworks handles the rest
    Already showing performance on par with GPT-4o, at a fraction of the latency! If you’re building with open models and need production-grade tuning, these tools lower the barrier significantly.
Start customizing your models on Fireworks AI: fireworks.ai/models
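The "write your reward function" workflow can be illustrated with a toy scorer. The signature and checks below are generic assumptions for illustration, not Fireworks' actual API: the point is that a completion is scored programmatically instead of labeled by hand.

```python
def reward_fn(prompt: str, completion: str) -> float:
    """Illustrative reward function for reinforcement fine-tuning:
    combine a verifiable check with a heuristic, return a scalar score."""
    score = 0.0
    stripped = completion.strip()
    # Verifiable check: output should look like a JSON object.
    if stripped.startswith("{") and stripped.endswith("}"):
        score += 0.5
    # Heuristic check: penalize overly long answers.
    if len(completion) <= 200:
        score += 0.5
    return score

print(reward_fn("Return the user as JSON.", '{"name": "Ada"}'))  # 1.0
```

In an RFT loop, the trainer samples completions, scores them with a function like this, and updates the policy toward higher-scoring outputs, which is why brittle reward design is the main failure mode.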

  • View profile for Andrés Marafioti

    Multimodal Research Lead @ Hugging Face | 10+ YOE in AI R&D

    24,414 followers

    🚀 SmolVLA is live! We just released a 450M parameter model to control robots with natural language. And it's fully open-source.

    SmolVLA delivers:
    ✅ Real-time inference
    ✅ Strong performance across diverse tasks
    ✅ Training and deployment recipes that fit on a single consumer GPU

    How? We gathered all the open @LeRobotHF robotics datasets on the Hugging Face Hub, cleaned them up, and used them to pretrain SmolVLA. This step alone improved downstream success rates by 26%. We also introduced asynchronous inference, so robots can act and react at the same time, a game-changer for fast control.

    But this isn’t just a model release. It’s a step towards accessible, community-driven robotics. Base models should be built on public data, reproducible code, and affordable hardware!

    🛠️ Everything’s open:
    • Model weights
    • Code
    • Data
    • Demo
    • Blog
    📖 Dive into the blog post to explore the architecture, benchmarks, and how to get started; check the comments!
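The idea behind asynchronous inference, decoupling acting from predicting, can be sketched with a producer/consumer queue. This is a schematic toy, not SmolVLA's implementation: `policy` stands in for a VLA forward pass that emits a chunk of actions.

```python
import queue, threading, time

def policy(observation):
    """Stand-in for a model forward pass returning a chunk of 4 actions."""
    time.sleep(0.01)  # pretend inference latency
    return [observation + i for i in range(4)]

action_queue = queue.Queue()

def inference_worker(observations):
    # Compute the next action chunk while the robot executes the current one.
    for obs in observations:
        action_queue.put(policy(obs))

t = threading.Thread(target=inference_worker, args=(range(3),))
t.start()

executed = []
for _ in range(3):
    chunk = action_queue.get()   # the control loop consumes chunks as they arrive
    executed.extend(chunk)       # "execute" the actions (stand-in)
t.join()
print(len(executed), "actions executed")
```

Because execution drains one chunk while the worker computes the next, inference latency is hidden from the control loop instead of stalling it.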

  • View profile for Sahar Mor

    I help researchers and builders make sense of AI | ex-Stripe | aitidbits.ai | Angel Investor

    41,675 followers

    If you're working on AI projects with limited training data, building domain-specific AI applications, or struggling with the economics of data labeling, you should know about this new approach from the DeepSeek team.

    Reinforcement Fine-Tuning (RFT) is a new technique for fine-tuning large language models, cutting the required labeled data from thousands to just tens of examples. Traditional supervised fine-tuning (SFT) approaches have always been hampered by their dependence on vast amounts of labeled data. RFT takes a fundamentally different approach by utilizing a reward function to evaluate response correctness, enabling the model to learn more effectively than through simple mimicry of examples. This is the same technique that was used to develop DeepSeek-R1.

    This method proves particularly powerful in three key scenarios:
    (1) When no labeled data exists but correctness can be verified - such as code transpilation, where outputs can be automatically tested.
    (2) When only limited labeled examples are available - fewer than 100 examples, where traditional methods typically overfit.
    (3) For tasks that benefit from chain-of-thought reasoning - where step-by-step logical thinking significantly improves results.

    A well-written post from Predibase here (they also added support for RFT on their platform recently!) https://lnkd.in/gHBdW5De

    P.S. Predibase just released an open-source model that outperforms OpenAI o1 by 67% for PyTorch-to-Triton transpilation tasks, enabling more efficient and intelligent AI models (link in comments).
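Scenario (1) can be made concrete: when outputs are automatically testable, the reward is simply "run the candidate against known cases." A toy sketch, where the function name `f` and the harness are illustrative assumptions, not DeepSeek's or Predibase's code:

```python
def correctness_reward(candidate_src: str, test_cases) -> float:
    """Score generated code by executing it against known input/output pairs.
    No labeled reasoning traces are needed; only the final answer is verified."""
    namespace = {}
    try:
        # NOTE: exec on untrusted model output must be sandboxed in practice.
        exec(candidate_src, namespace)
        fn = namespace["f"]
        passed = sum(fn(x) == y for x, y in test_cases)
        return passed / len(test_cases)
    except Exception:
        return 0.0  # code that crashes or doesn't define f earns nothing

candidate = "def f(x):\n    return x * 2\n"
print(correctness_reward(candidate, [(1, 2), (3, 6), (5, 10)]))  # 1.0
```

This is exactly why transpilation is a natural fit: a reward like this replaces thousands of hand-labeled examples with an automatic correctness check.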

  • View profile for Asad Ansari

    Founder | Data & AI Transformation Leader | Driving Digital & Technology Innovation across UK Government and Financial Services | Board Member | Commercial Partnerships | Proven success in Data, AI, and IT Strategy

    29,467 followers

    You cannot train AI on reality alone anymore. There is not enough of it.

    Jensen Huang explains why NVIDIA built Cosmos, an AI world model that generates synthetic training data grounded in physics.

    The problem is simple. Teaching physical AI like robotics requires vast amounts of diverse interaction data. Videos exist, but not nearly enough to capture the variety of situations robots will encounter.

    So NVIDIA transformed compute into data. Using synthetic data generation grounded by laws of physics, they can selectively generate training scenarios that would be impossible to capture otherwise.

    The example Huang shows is remarkable. A basic traffic simulator output gets fed into Cosmos. What emerges is physically plausible surround video that AI can learn from.

    This solves a fundamental limitation. You cannot train autonomous systems on every possible scenario by recording reality. There are not enough cameras or time. But you can simulate physics accurately enough that AI trained on synthetic data generalises to real environments.

    This applies beyond robotics. Any AI learning physical interactions, from manufacturing to logistics to infrastructure monitoring, faces the same data scarcity problem. Synthetic data generation grounded in physics laws is how you create training sets reality cannot provide.

    The organisations building AI for physical systems will either master synthetic data generation or get limited by whatever reality they can record.

    Watch the full presentation to hear Huang explain how Cosmos generates training data for physical AI. What physical AI application needs synthetic data because reality cannot provide enough examples?

    #AI #SyntheticData #Robotics #NVIDIA #MachineLearning

  • View profile for Vik Pant, PhD

    Applied AI and Quantum Information @ PwC, Synthetic Intelligence Forum, University of Toronto

    12,420 followers

    Thank you to the University of Toronto Machine Intelligence Student Team for inviting me to present a keynote on augmenting human-labeled datasets using Large Language Models (LLMs).

    Human-labeled data is crucial for testing, tuning, customizing, and validating LLMs in organizations. This is because human-labeled data provides the ground truth for developing trustworthy #GenerativeAI applications and #AgenticAI systems. Yet acquiring sufficient human-labeled data is often a bottleneck in many organizations. Subject matter experts and domain specialists typically have limited time for labeling tasks due to competing professional demands, making large-scale manual labeling difficult to sustain.

    My talk focused on how LLMs can be used not to substitute human labels, but to systematically augment them, extending the utility of existing human-labeled data and improving model robustness without proportionally increasing manual labeling effort. I described practical methods for implementing two augmentation techniques with strong empirical grounding:

    • Negative Reinforcement with Counterfactual Examples – This technique involves analyzing labeled examples to generate counterfactual examples, outputs that are intentionally incorrect or undesirable, and using them to teach the model what not to generate. By guiding the model with these negative samples, the model learns sharper decision boundaries, increasing robustness against hallucinations and confabulations.

    • Contrastive Learning with Controlled Perturbations – This technique creates diverse, label-preserving variants of human-labeled examples by introducing controlled modifications to the prompts and/or completions. These perturbations maintain core semantic meaning while varying surface-level features such as syntax, phrasing, or structure, encouraging the model to generalize beyond shallow lexical or syntactic cues.

    These techniques have been shown to drive measurable improvements in model behavior:
    • Lower Perplexity → More predictable completions and improved alignment with ground-truth targets.
    • Reduced Token Entropy → More focused and efficient completions, reducing inference complexity.
    • Higher Self-Consistency → More stable completions across repeated generations of the same prompt, a key requirement for dependable downstream use.

    These are not theoretical constructs; they are practical techniques for overcoming constraints in human-labeled data availability and scaling #LLM applications with greater efficiency and rigor.

    Appreciate the University of Toronto Machine Intelligence Student Team (UTMIST) for a well-curated conference, and the UofT AI group for their initiatives in the space. Grateful to my research partner, Olga, for her contributions in collaboratively developing content for this presentation. Kudos to my PwC Canada teammates including Michelle B, Annie, Chris M, Michelle G, Chris D, Brenda, Bahar, Danielle, and Abhinav for their partnership on our PwC #AI portfolio.
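The second technique, label-preserving perturbation, can be sketched with simple prompt templates. The templates below are hypothetical illustrations; in practice an LLM would generate richer variants, but the invariant is the same: the surface form changes while the label does not.

```python
def perturb(example):
    """Generate label-preserving variants of a labeled prompt by varying
    surface-level phrasing while keeping semantics (and the label) fixed."""
    prompt, label = example
    templates = [
        "{q}",
        "Please answer: {q}",
        "Question: {q} Answer concisely.",
    ]
    return [(t.format(q=prompt), label) for t in templates]

augmented = perturb(("Is the invoice overdue?", "yes"))
for p, y in augmented:
    print(y, "<-", p)
```

One expert-labeled example becomes several training examples, which is the whole point: more coverage of surface variation without more expert labeling time.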

  • NVIDIA’s Physical AI Data Factory Blueprint is Designed to Improve Robot Training Data

    One of the biggest hurdles standing between physical AI and its “ChatGPT moment” is a lack of quality data. A big part of the reason LLMs have been such a massive – and often surprising – success is the fact that humans have essentially been creating training data for 100,000 years or so. The same can’t be said for the input required to train robots.

    NVIDIA is among the companies working to address the gap, and this morning at GTC the company announced the Physical AI Data Factory Blueprint, an open reference architecture designed to improve how both real-world and simulated data is gathered, shaped, and assessed. The company has already recruited some big names from across autonomous driving and robotics, including FieldAI, Hexagon AB Robotics, Linker Vision, Milestone Systems, Skild AI, Uber, and Teradyne Robotics.

    The platform is host to a number of processes designed to do right by the real and synthetic robot data. There’s Cosmos Curator, which processes and annotates datasets; Cosmos Transfer, which is designed to address edge cases and long-tail scenarios; and Cosmos Evaluator, which, you know, evaluates data.

    “Physical AI is the next frontier of the AI revolution, where success depends on the ability to generate massive amounts of data,” says Omniverse VP Rev Lebaredian. “Together with cloud leaders, we’re providing a new kind of agentic engine that transforms compute into the high-quality data required to bring the next generation of autonomous systems and robots to life. In this new era, compute is data.”

    #nvidia #gtc #nvidiagtc #robotics #physicalai

  • View profile for Anima Anandkumar
    Anima Anandkumar is an Influencer
    226,899 followers

    How do we build AI for science? Augment with AI or replace with AI?

    The popular prescription is to augment AI into existing workflows rather than replace them, e.g., keep the approximate numerical solver for simulations, and use AI only to correct its errors at every time step. The other extreme is to completely discard the existing workflow and replace it fully with AI. We have seen this approach win in areas like weather forecasting. Such end-to-end AI is significantly better for speed: 1,000x to 1,000,000x faster.

    In our latest paper, we show end-to-end learning also wins in data efficiency, which is counterintuitive. Where do these savings come from? The former approach that augments AI relies only on fully accurate training data, which is expensive. But end-to-end learning can use both approximate and accurate training data, if the model can learn how to mix them correctly. In many physical systems, coarse-grid numerical solvers yield approximate data, while fine-grid solvers fully resolve the scales and yield exact answers.

    It turns out that Neural Operators offer a perfect solution when such multi-fidelity and multi-resolution data is available: they can learn with high data efficiency, requiring only a small amount of fully resolved data, since they can also utilize approximate training data. In contrast, the standard approach of augmenting AI with a coarse-grid numerical solver (a closure model) can only train on fully resolved simulations, making it very expensive and hard to train.

    Our results are applicable in multi-scale chaotic systems that have traditionally required running long simulations at high resolution, such as climate change or plasma in nuclear fusion and astrophysics. Now you can replace expensive simulation fully with AI (Neural Operators), and in many scenarios train it without requiring large numbers of such simulations.
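The multi-fidelity argument can be illustrated with a toy least-squares model rather than a neural operator: many cheap "coarse" labels that carry a systematic bias, plus only ten exact "fine" labels, jointly identify both the model and the coarse solver's bias. This is a schematic illustration of the data-mixing idea, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 1000)
y_true = 3.0 * x

# Plenty of cheap, approximate labels (coarse solver with a systematic bias)...
y_coarse = 3.0 * x + 0.5
# ...and only a handful of expensive, exact labels (fine solver).
idx_fine = rng.choice(1000, size=10, replace=False)

# Jointly fit a slope w and a coarse-solver bias b:
#   coarse rows:  y ≈ w*x + b      fine rows:  y ≈ w*x
# One least-squares system: the few exact points pin down w,
# while the abundant approximate points still contribute signal.
A_coarse = np.stack([x, np.ones_like(x)], axis=1)
A_fine = np.stack([x[idx_fine], np.zeros(10)], axis=1)
A = np.concatenate([A_coarse, A_fine])
b_vec = np.concatenate([y_coarse, y_true[idx_fine]])
w, bias = np.linalg.lstsq(A, b_vec, rcond=None)[0]
print(round(w, 2), round(bias, 2))  # slope ≈ 3.0, learned coarse bias ≈ 0.5
```

The analogy to the post: accurate data is scarce, approximate data is abundant, and a model that learns how the two relate needs far fewer exact samples than one trained on exact data alone.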

  • View profile for Sohrab Rahimi

    Director, AI/ML Lead @ Google

    23,104 followers

    I recently delved into some intriguing research about the often-overlooked potential of Small Language Models (SLMs). While LLMs usually grab the headlines with their impressive capabilities, studies on SLMs fascinate me because they challenge the “bigger is better” mindset. They highlight scenarios where smaller, specialized models not only hold their own but actually outperform their larger counterparts.

    Here are some key insights from the research:

    1. Real-Time, Privacy-Focused Applications: SLMs excel in situations where data privacy and low latency are critical. Imagine mobile apps that need to process personal data locally or customer support bots requiring instant, accurate responses. SLMs can deliver high-quality results without sending sensitive information to the cloud, thus enhancing data security and reducing response times.

    2. Specialized, Domain-Specific Tasks: In industries like healthcare, finance, and law, accuracy and relevance are paramount. SLMs can be fine-tuned on targeted datasets, often outperforming general LLMs for specific tasks while using a fraction of the computational resources. For example, an SLM trained on medical terminology can provide precise and actionable insights without the overhead of a massive model.

    3. Advanced Techniques for Lightweight AI: SLMs leverage sophisticated methods to maintain high performance despite their smaller size:
    • Pruning: Eliminates redundant parameters to streamline the model.
    • Knowledge Distillation: Transfers essential knowledge from larger models to smaller ones, capturing the “best of both worlds.”
    • Quantization: Reduces memory usage by lowering the precision of non-critical parameters without sacrificing accuracy.
    These techniques enable SLMs to run efficiently on edge devices where memory and processing power are limited.

    Despite these advantages, the industry often defaults to LLMs due to a few prevalent mindsets:
    • “Bigger is Better” Mentality: There’s a common belief that larger models are inherently superior, even when an SLM could perform just as well or better for specific tasks.
    • Familiarity Bias: Teams accustomed to working with LLMs may overlook the advanced techniques that make SLMs so effective.
    • One-Size-Fits-All Approach: The allure of a universal solution often overshadows the benefits of a tailored model.

    Perhaps it’s time to rethink our approach and adopt a “right model for the right task” mindset. By making AI faster, more accessible, and more resource-efficient, SLMs open doors across industries that previously found LLMs too costly or impractical.

    What are your thoughts on the role of SLMs in the future of AI? Have you encountered situations where a smaller model outperformed a larger one? I’d love to hear your experiences and insights.
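Of the three lightweight-AI techniques, quantization is the easiest to show in a few lines: store weights as int8 plus one float scale, roughly a 4x memory saving over float32 at a small precision cost. A schematic sketch, not a production quantizer:

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric int8 quantization: weights become int8 plus one float scale."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights for use at inference time."""
    return q.astype(np.float32) * scale

w = np.array([0.12, -0.5, 0.33, 0.07], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(q.dtype, float(np.abs(w - w_hat).max()))  # int8, small reconstruction error
```

Rounding bounds the per-weight error at half a quantization step, which is why carefully applied quantization shrinks models for edge deployment with little accuracy loss.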
