Top LinkedIn Content on Advanced Computer Vision Techniques

Turning physics & data into insights | Engineer • Algorithm Developer • Researcher

4,405 followers 6mo

Rarely do I miss an opportunity to apply techniques from work to things I enjoy doing in my free time. Often that thing is cycling, the greatest sport in the world. But when autumn comes around and the streets turn cold and dark, I prefer the iron of the weight room over the carbon of the road. A technical sport like Olympic weightlifting is of course best mastered with the guidance of an experienced coach. To the less fortunate or ambitious, video analysis may help hone skills and track progress. Various apps offer basic video tools, but I’ve yet to find one that provides the data I am truly interested in: ground reaction forces (GRFs), forces applied to the barbell, accurate kinematics, and mechanical power output. That is why I developed my own weightlifting video analyzer. The demo below is the result of several techniques from work combined: AI & computer vision (Raidyn), state-input estimation & multibody dynamics (KU Leuven Mecha(tro)nic System Dynamics (LMSD)), and biomechanical impact modelling (Classified Cycling). At the core is a flexible multibody model of the barbell that feeds into a combined state-input-parameter estimator to infer the forces applied to the bar. With accurate barbell forces, GRFs can be estimated with higher precision than methods based solely on accelerations derived from video. In addition, the model unlocks a virtually unlimited supply of synthetic training data for the AI model, ensuring robust segmentation of barbell motion and deformation. Initially I cast myself as the hero of the demo, but I soon realized that a video starring the Pogačar of weightlifting would make for a more spectacular analysis. Enter Bulgarian Karlos Nasar, 20 years of age at this year’s European Championships in Moldova, lifting a record-breaking 229 kg overhead. 
Running my analyzer on Nasar’s low-resolution YouTube video allowed me to test its performance on a subject it hadn’t seen before in training, with frame rate and resolution far below what I use in my own sessions – and yet the algorithms didn’t miss a beat. Sure, it helps that weightlifting consists of only two precisely defined movement patterns, but I was still pleasantly surprised. The demo below is just the start. Belgium’s cold season still has a way to go and I’ve got plenty of ideas to improve and expand the algorithms. Next up: joint and muscle force estimation.
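
The acceleration-only baseline that the post contrasts itself against can be sketched in a few lines. This is a minimal illustration, not the analyzer described above: it assumes a rigid point-mass bar (no bar flexion, no state-input-parameter estimator), bar heights already extracted from video in meters, and a hypothetical helper name `bar_force_and_power`. Force is taken as F = m(a + g) and power as P = F·v, with velocity and acceleration from central finite differences.

```python
G = 9.81  # gravitational acceleration in m/s^2


def bar_force_and_power(heights_m, fps, mass_kg):
    """Estimate vertical bar force and mechanical power for interior frames.

    Assumption: the bar is a rigid point mass, so F = m * (a + g) and P = F * v.
    Velocity and acceleration come from central finite differences of height.
    Returns a list of (force_N, power_W) tuples, one per interior frame.
    """
    dt = 1.0 / fps
    results = []
    for i in range(1, len(heights_m) - 1):
        velocity = (heights_m[i + 1] - heights_m[i - 1]) / (2 * dt)
        accel = (heights_m[i + 1] - 2 * heights_m[i] + heights_m[i - 1]) / dt ** 2
        force = mass_kg * (accel + G)
        results.append((force, force * velocity))
    return results


# Hypothetical track: bar rising at a constant 1 m/s, sampled at 4 fps.
for force, power in bar_force_and_power([0.0, 0.25, 0.5, 0.75], fps=4, mass_kg=100.0):
    print(round(force, 1), round(power, 1))
```

Differentiating a noisy video track twice amplifies the noise, which is one reason a model-based estimator can be expected to give more precise forces than this baseline.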

Niels Rogge

Machine Learning Engineer at ML6 & Hugging Face

69,605 followers 6mo

Big news for the 3D computer vision community! 🙌 ByteDance released Depth Anything 3 on Hugging Face 🔥. This is the world's most powerful model for 3D understanding: it predicts spatially consistent geometry (depth and ray maps) from an arbitrary number of visual inputs, with or without known camera poses. In other words, it lets you reconstruct a 3D scene from 2D inputs alone. DA3 extends monocular depth estimation to any-view scenarios, so the model can take in single images, multi-view images, and video. Interestingly, the authors reveal two key insights:
- A plain transformer (e.g., vanilla DINO) is enough. No specialized architecture is required.
- A single depth-ray representation objective is enough. The model does not require complex multi-task training.
Three series of models have been released: the main DA3 series, a monocular metric estimation series, and a monocular depth estimation series. Metric estimation, also called absolute estimation, determines the distance from the camera in meters, whereas monocular (relative) depth estimation only determines how far pixels are relative to one another. The authors also released a new visual geometry benchmark covering camera pose estimation, any-view geometry, and visual rendering. DA3 sets a new state of the art across all 10 tasks, surpassing the prior SOTA, Meta's VGGT, by an average of 35.7% in camera pose accuracy and 23.6% in geometric accuracy. Furthermore, DA3 facilitates SLAM (Simultaneous Localization and Mapping) and 3D Gaussian Splatting by providing a robust and generalizable method for predicting spatially consistent geometry from various visual inputs. Links:
- Models: https://lnkd.in/eFFHJhJx
- Paper: https://lnkd.in/ewtxy7p6
- Demo: https://lnkd.in/e7Qr3tnG
- Code: https://lnkd.in/e89B6JpR
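
To make the metric-versus-relative distinction concrete, here is a minimal sketch of one common way to map relative depth onto meters: fit a scale and shift by least squares against a few pixels whose true distance is known. This is not DA3's method or API. The function name `fit_scale_shift` and the sample values are hypothetical, and the sketch assumes the relative values are affine in true depth (some models predict inverse depth instead, in which case the fit would be done in that space).

```python
def fit_scale_shift(relative, metric):
    """Least-squares fit of metric ≈ scale * relative + shift.

    relative: relative depth values at pixels with known distance (unitless)
    metric:   the known distances at those same pixels, in meters
    """
    n = len(relative)
    mean_r = sum(relative) / n
    mean_m = sum(metric) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))
    scale = cov / var_r
    shift = mean_m - scale * mean_r
    return scale, shift


# Hypothetical anchors: three pixels with known distances of 1 m, 2 m, and 3 m.
scale, shift = fit_scale_shift([0.0, 0.5, 1.0], [1.0, 2.0, 3.0])
print(scale, shift)  # 2.0 1.0
```

A metric model needs no such anchors because it predicts meters directly. A relative model only fixes the ordering and ratios, so the two free parameters have to come from somewhere else.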

Alexey Navolokin

FOLLOW ME for breaking tech news & content • helping usher in tech 2.0 • GM @ AMD • Turning AI, Cloud & Emerging Tech into Revenue

782,497 followers 10mo

Robots on the pitch....You better believe it. Will you be able to play with this one? No more standing cones or passive drills. Athletes today are dodging dynamic robots—machines that track, move, and react in real time. These aren’t gimmicks; they’re next-gen training partners. ⚽ In football, systems like SKILLSLAB, Rezzil, and Trailblazer Training Bots are already used by top clubs to simulate high-pressure situations, improve decision-making, and measure milliseconds of reaction time. 🏀 In basketball, robotic arms help perfect shooting arcs, while AI vision tools break down footwork frame by frame. 🎾 In tennis, smart ball machines adjust spin, speed, and placement in unpredictable sequences—training the brain as much as the body. Why it matters: + Athletes improve reaction speed by up to 20% using adaptive robotic drills. + Training bots allow 3x more touches per minute compared to traditional drills. + Machine-learning platforms track thousands of data points per session—customizing feedback instantly. This isn’t just tech—it’s transformation. Robots are helping players train faster, smarter, and with a grin on their face. #Innovation #Tech #Robots

Jeremy Park, PhD

Founding DevRel Engineer @ VLM Run | Computer Vision PhD

2,749 followers 3mo

I made a computer vision app to measure back curvature during the deadlift! My previous post focused on measuring deadlift rep timing by tracking the barbell plate. Since then, I’ve added the ability to quantify back roundness, so that we can determine whether a lift was completed with a straight back or a rounded back. This could be used to help guide proper technique during training. Here are the main technical highlights:
- RF-DETR (Roboflow) to segment the person (great performance out of the box with no additional training!)
- YOLO11n (Ultralytics) for bounding box prediction around the barbell weight plate (trained on my own small dataset)
- Mediapipe (Google) for pose landmark detection to guide the bounds for the line fitting
- Custom logic to fit a line across the back
I shared this work at AI Tinkerers Raleigh last night and got 2nd place demo! It was great to hear everyone’s feedback and to connect with this inspiring group. I’m excited for how this work can enable AI-assisted feedback and for what could be the beginning of an “AI coach.” I’m very passionate about fitness, and excited to connect with more people in this AI and fitness space. Happy to chat if you have any questions or if you just want to connect!
To explain a bit more about the plots on the right side: The Back Roundness Map shows the deviation of the estimated back curve from a line of best fit. I was very happy to see that the two reps that I intentionally did with a rounded back show up so clearly in the data! The Average Back Shape plot shows how the rounded-back reps look visually distinct from the straight-back reps.
#computervision #ai #ml #machinelearning #deeplearning #phd #roboflow #google #ultralytics #yolo #gym #fitness #deadlift #aitinkerers
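
The "deviation from a line of best fit" idea can be sketched as follows. The post does not publish its custom logic, so this is only one plausible reading. It assumes the back contour is already available as (x, y) points, uses a total-least-squares line so the result does not depend on how upright the torso is, and the name `back_roundness` is hypothetical.

```python
import math


def back_roundness(points):
    """Signed perpendicular deviation of each point from the best-fit line.

    points: list of (x, y) samples along the back contour.
    Returns (deviations, roundness), where roundness is the largest
    absolute deviation. The line is the principal axis through the centroid.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    # Orientation of the direction of greatest spread (the fitted line).
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    # Project each centered point onto the line's unit normal.
    deviations = [
        -(x - mx) * math.sin(theta) + (y - my) * math.cos(theta)
        for x, y in points
    ]
    return deviations, max(abs(d) for d in deviations)


# Hypothetical contours: a perfectly straight back and an arched one.
print(back_roundness([(0, 0), (1, 1), (2, 2)])[1])
print(back_roundness([(0, 0), (1, 1), (2, 0)])[1])
```

Stacking the per-frame deviations over time would give something like the roundness map described in the post, with rounded reps showing up as bands of larger deviation.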

Swami Sivasubramanian

VP, AWS Agentic AI

194,292 followers 3mo

For most of football’s history, much of what we watched on the field went unmeasured. Today, nearly every player and ball movement throughout the game is measured, modeled, and analyzed in real time. This data is improving fan experiences and giving them richer sport insights. It's also changing how professionals approach the game, from improving player safety to unlocking new training environments. The results speak for themselves: a 35% reduction in lower-extremity injuries from the redesigned kickoff format, informed by Next Gen Stats data. Innovations like completion probability and rush yards over expectation that make broadcasts more engaging. And now, pose-tracking technology that captures full skeletal data 60 times per second is opening doors to VR training that could accelerate player development from years to months. I'm proud of how we've expanded our partnership with the NFL on Next Gen Stats, powered by AI tools like Amazon SageMaker and Amazon Quick. What started as a tracking experiment in 2015 has become a critical part of the NFL’s infrastructure that uses machine learning models on AWS to process data from 22 players, generating 500-1,000 stats per play, instantly.
What a win for the Hawks last night! If you're still riding the excitement, take a few minutes to read through this deep dive into the science that powers the complex stats you see on screen throughout the season.
Cool look at the history of our partnership with the NFL through Next Gen Stats! https://lnkd.in/gX8Mpe7T

Bhavishya Pandit

85,667 followers 7mo

You can now generate infinite-length videos!? Yes, literally infinite. Let me quickly explain why it's a problem to begin with: AI models generate videos frame by frame, and each new frame depends on the previous one. The problem? Tiny errors stack up. By frame 100, your subject starts distorting. By frame 500, everything's a mess 💩 This happens because the model was trained on clean data, but during generation, it has to build on top of its own imperfect outputs. That gap kills quality over time (a loose analogy: vanishing gradients). Plus, existing methods only handle one prompt, so you get repetitive scenes with no real story progression. Here's where Stable Video Infinity from EPFL shines 💡: Instead of fighting errors, it learns from them. The breakthrough is Error-Recycling Fine-Tuning. During training, the model deliberately injects its own past errors into clean frames, watches what goes wrong, and figures out how to fix it. Here's the process:
→ inject historical errors to simulate real generation conditions
→ predict where drift will happen
→ bank those errors in memory
→ learn to correct them before they compound.
This creates three powerful results:
• Videos can extend infinitely without quality collapse
• Scene transitions happen naturally with controllable storylines
• Works with multiple conditions like audio, skeleton poses, and text streams
They've generated 10-minute Tom & Jerry videos from a single image. Not stitched clips, but continuous generation. The efficiency comes from only training LoRA adapters, not the full model. You can customise it without massive compute. The challenges? Real-time streaming isn't there yet. The model generates clip-by-clip with bidirectional attention for quality, which means you can't stream live outputs instantly. You still need decent hardware to train custom versions, though inference is manageable.
And while error recycling is clever, the model needs to bank enough error patterns during training to handle diverse scenarios. But the future's interesting. They're working on Wan 2.2 5B-based SVI and true streaming generation. If they can achieve real-time inference while maintaining quality, this becomes viable for live content creation and gaming. The bigger idea here is training models on their own mistakes, rather than just clean data. That could apply beyond video to any autoregressive generation task. What's the longest AI-generated video you've successfully created without quality degradation, and what method did you use? Follow me, Bhavishya Pandit, for honest takes on AI breakthroughs that actually work 🔥

Henry Ajder

AI and Deepfake Cartographer

17,179 followers 1y

OpenAI's Sora is dominating the news, but Tencent's latest generative video model Hunyuan has been much less discussed. Here's why I think it's significant: Hunyuan is a 13bn-parameter model providing text-to-video, avatar animation, and notably video-to-audio capabilities. Tencent claims outputs are "comparable to, if not superior to" other leading generative video models, with independent evaluations finding Hunyuan outperformed Runway Gen-3 alpha and Luma 1.6. I've found the quality impressive but inconsistent. Compared to Sora, the outputs lagged on motion fluidity and human subjects, although others have had better results. So why is it so significant? Hunyuan is open source and represents the most powerful and dynamic open generative video model currently available. There has been progress in OS generative video (such as Mochi 1), but most advances and multi-functional capabilities are seen in proprietary/closed models. Accessible closed models like Sora may be in the hands of many users right now, but open models like Hunyuan unlock the foundations for the global OS community to experiment and develop novel applications. As we've seen with other open generative models and modalities, these permutations could reshape how we view what's possible with generative video, for better and/or for worse. https://lnkd.in/eKYc6KvW

Arjun Jain

Founder & CEO, Fast Code AI | Research-grade AI for enterprises with hard problems | Dad

37,138 followers 10mo

#MIT's new "Radial Attention" makes Generative Video 4.4x cheaper to train and 3.7x faster to run. Here's why: The problem with current AI video? It's BRUTALLY expensive. Every frame must "pay attention" to every other frame. With thousands of frames, costs grow quadratically. Training one model? $100K+ Running it? Painfully slow. Massachusetts Institute of Technology, NVIDIA, Princeton, UC Berkeley, Stanford, and First Intelligence just changed the game. Their breakthrough insight: Video attention works like physics.
- Sound gets quieter with distance
- Light dims as it travels
- Heat dissipates over space
Turns out, AI video tokens follow the same rules. Why waste compute power on distant, irrelevant connections? Enter Radial Attention: Instead of checking EVERY connection:
• Nearby frames → full attention
• Distant frames → sparse attention
• Computation scales as n log n, not quadratically
Technical result: O(n log n) vs O(n²) Translation: MASSIVE efficiency gains Real-world results on production models:
📊 HunyuanVideo (Tencent):
• 2.78x training speedup
• 2.35x inference speedup
📊 Mochi 1:
• 1.78x training speedup
• 1.63x inference speedup
Quality? Maintained or IMPROVED. What this unlocks:
• 4x longer videos, same resources
• 4.4x cheaper training costs
• 3.7x faster generation
• Works with existing models (no retraining!)
And, MIT open-sourced everything: https://lnkd.in/gETYw8eT The bigger picture: The internet is transforming. BEFORE: A place to store videos from the real world NOW: A machine that generates synthetic content on demand Think about it:
• TikTok filled with AI-generated content
• YouTube creators using AI for entire videos
• Streaming services producing personalized shows
• Educational content generated for each student
This changes everything. Remember when only big tech could afford image AI?
2020: GPT-3 → Only OpenAI
2022: Stable Diffusion → Everyone
2024: Midjourney everywhere
Video AI is next.
Radial Attention probably just accelerated the timeline. The future isn't coming. It's here. And it's more accessible than ever. Want to ride this wave? → Follow me for weekly AI breakthroughs → Share if this opened your eyes → Try the code: https://lnkd.in/gETYw8eT What will YOU create when video AI costs 4x less? #AI #VideoGeneration #MachineLearning #TechInnovation #FutureOfContent
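
To give a feel for the O(n log n) versus O(n²) gap, the sketch below counts query-key pairs for dense attention and for a toy log-sparse pattern in which each token attends to itself and to tokens at power-of-two distances. This toy pattern is an assumption chosen for illustration. It is not the mask used by Radial Attention, and the function names are hypothetical.

```python
def dense_pairs(n):
    """Every token attends to every token: n^2 query-key pairs."""
    return n * n


def log_sparse_pairs(n):
    """Toy pattern: each token attends to itself and to tokens at
    distances 1, 2, 4, 8, ... on both sides, giving O(n log n) pairs."""
    total = 0
    for i in range(n):
        total += 1  # the token itself
        distance = 1
        while distance < n:
            if i - distance >= 0:
                total += 1
            if i + distance < n:
                total += 1
            distance *= 2
    return total


for n in (8, 1024):
    print(n, dense_pairs(n), log_sparse_pairs(n))
# 8 64 42
# 1024 1048576 19458
```

At 1,024 tokens the toy pattern already evaluates roughly 54x fewer pairs than dense attention, and the gap widens as sequences get longer.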

Sahar Mor

I help researchers and builders make sense of AI | ex-Stripe | aitidbits.ai | Angel Investor

42,073 followers 1y

The last few weeks have been huge for open-source video generation and research. After two years of limited usability in open-source video generative models, we’re finally seeing major advancements. These new models outperform commercial ones, including Runway, Pika Labs, and Luma Labs.
→ Mochi, released by Genmo a week ago, ranks 2nd among top generative video models and permits commercial use under an Apache 2.0 license
→ CogVideoX-5B from Tsinghua University, released last month, supports both text2video and image2video and allows commercial use for companies with <1M users
→ Allegro from Rhymes AI is a small model capable of generating a wide range of content, from human close-ups to diverse, dynamic scenes, and permits commercial use under an Apache 2.0 license
Also, over the last few weeks, Meta announced MovieGen for generating HD personalized videos with synchronized audio, and Peking University openly released Pyramid Flow. On the proprietary generative video side of things, Runway released a new tool for transforming simple video and voice inputs into expressive character performances, Pika Labs released Pikaffects to transform video subjects with surreal effects, and Luma Labs announced API access to its generative video models. As for OpenAI’s Sora, who knows, it might launch soon after the US elections are out of the way (this Tuesday). Generative video models leaderboard https://lnkd.in/grWJDVkd Links to the mentioned models are in the comments.

Kuldeep Singh Sidhu

Senior Data Scientist @ Walmart | BITS Pilani

16,490 followers 2y

🤔 Why don't we have abundant video-language models (VLMs)? The simple answer: compute! 💻 The pre-training process for video-related tasks demands exceptionally large computational and data resources, which hinders the progress of video-language models. 📊 This is the very problem tackled in "PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning" 📝 The paper discusses the significant improvement in performance across a wide range of image-language applications due to vision-language pre-training. 🚀 The authors propose a straightforward, highly efficient, and resource-light approach to adapting an existing image-language pre-trained model for dense video understanding. 👍 Preliminary experiments reveal that directly fine-tuning pre-trained image-language models with multiple frames as inputs on video datasets leads to performance saturation or even a drop. 😕 The authors attribute this to the bias of learned high-norm visual features. 📉 To address this, they propose a simple but effective pooling strategy that smooths the feature distribution along the temporal dimension and thus reduces the dominant impact of extreme features. 🔄 The new model is termed Pooling LLaVA. 🌊 Pooling LLaVA achieves new state-of-the-art performance on modern benchmark datasets for both video question-answering and captioning tasks. 🏆 Notably, on the recent popular Video ChatGPT benchmark, Pooling LLaVA achieves a score of 3.48 out of 5 averaged over five evaluated dimensions, exceeding the previous SOTA results from GPT4V (IG-VLM) by 9%. 📈 On the latest multi-choice benchmark MVBench, Pooling LLaVA achieves 58.1% accuracy on average across 20 sub-tasks, 14.5% higher than GPT4V (IG-VLM). 🔝 The model and its code are open-sourced as well! 🌐 Paper: https://lnkd.in/gcWZMP6d 📜 Code: https://lnkd.in/gQbYaTXn 💻 Models: https://lnkd.in/gZgE6gwk 🤖
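
The smoothing effect of temporal pooling can be illustrated with a tiny example. This is a simplified sketch, not the PLLaVA implementation. It assumes each frame is a flat feature vector and uses non-overlapping average pooling over time only, and the name `temporal_avg_pool` and the sample values are hypothetical.

```python
def temporal_avg_pool(frames, window):
    """Average frame feature vectors over non-overlapping temporal windows.

    frames: list of per-frame feature vectors (lists of floats, equal length)
    window: number of consecutive frames merged into one pooled vector
    """
    pooled = []
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        dim = len(chunk[0])
        pooled.append(
            [sum(frame[d] for frame in chunk) / len(chunk) for d in range(dim)]
        )
    return pooled


# Hypothetical features: the third frame has one extreme, high-norm value.
frames = [[1.0, 10.0], [3.0, 10.0], [100.0, 10.0], [0.0, 10.0]]
print(temporal_avg_pool(frames, window=2))  # [[2.0, 10.0], [50.0, 10.0]]
```

The outlier of 100.0 is halved to 50.0 after pooling, and the sequence shrinks from four frames to two, which is the general sense in which pooling dampens extreme features while also cutting the number of visual tokens.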
