Advancements in Open-Source Video Models


Summary

Advancements in open-source video models are rapidly expanding what’s possible with AI-generated video, making it easier for anyone to create high-quality content without relying on proprietary tools. These models are freely accessible frameworks that use artificial intelligence to generate, extend, or manipulate videos in new ways, unlocking creative opportunities and driving innovation for a wider community.

  • Access new capabilities: Experiment with cutting-edge features like error-recycling fine-tuning, which allows you to generate longer videos without losing quality over time.
  • Boost content creation: Take advantage of open-source frameworks that support text-to-video, audio integration, and enhanced motion quality to create engaging and customized video projects.
  • Reduce production costs: Use techniques such as radial attention to lower the compute needed for training and running video generation models, making advanced tools more affordable and practical for everyday creators.
Summarized by AI based on LinkedIn member posts
  • Henry Ajder

    AI and Deepfake Cartographer

    17,178 followers

    OpenAI's Sora is dominating the news, but Tencent's latest generative video model, Hunyuan, has been much less discussed. Here's why I think it's significant.

    Hunyuan is a 13bn-parameter model providing text-to-video, avatar animation, and, notably, video-to-audio capabilities. Tencent claims outputs are "comparable to, if not superior to" other leading generative video models, with independent evaluations finding that Hunyuan outperformed Runway Gen-3 Alpha and Luma 1.6. I've found the quality impressive but inconsistent. Compared to Sora, the outputs lagged on motion fluidity and human subjects, although others have had better results.

    So why is it so significant? Hunyuan is open source and represents the most powerful and dynamic open generative video model currently available. There has been progress in OS generative video (such as Mochi 1), but most advances and multi-functional capabilities are seen in proprietary, closed models. Accessible closed models like Sora may be in the hands of many users right now, but open models like Hunyuan unlock the foundations for the global OS community to experiment and develop novel applications. As we've seen with other open generative models and modalities, these permutations could reshape how we view what's possible with generative video, for better and/or for worse. https://lnkd.in/eKYc6KvW

  • Arjun Jain

    Founder & CEO, Fast Code AI | Research-grade AI for enterprises with hard problems | Dad

    37,136 followers

    #MIT's new "Radial Attention" makes Generative Video 4.4x cheaper to train and 3.7x faster to run. Here's why:

    The problem with current AI video? It's BRUTALLY expensive. Every frame must "pay attention" to every other frame. With thousands of frames, costs grow quadratically. Training one model? $100K+. Running it? Painfully slow.

    Massachusetts Institute of Technology, NVIDIA, Princeton, UC Berkeley, Stanford, and First Intelligence just changed the game.

    Their breakthrough insight: video attention works like physics.
    - Sound gets quieter with distance
    - Light dims as it travels
    - Heat dissipates over space

    Turns out, AI video tokens follow the same rules. Why waste compute power on distant, irrelevant connections?

    Enter Radial Attention. Instead of checking EVERY connection:
    • Nearby frames → full attention
    • Distant frames → sparse attention
    • Computation scales as O(n log n), not quadratically

    Technical result: O(n log n) vs O(n²)
    Translation: MASSIVE efficiency gains

    Real-world results on production models:
    📊 HunyuanVideo (Tencent):
    • 2.78x training speedup
    • 2.35x inference speedup
    📊 Mochi 1:
    • 1.78x training speedup
    • 1.63x inference speedup

    Quality? Maintained or IMPROVED.

    What this unlocks:
    • 4x longer videos, same resources
    • 4.4x cheaper training costs
    • 3.7x faster generation
    • Works with existing models (no retraining!)

    And MIT open-sourced everything: https://lnkd.in/gETYw8eT

    The bigger picture: the internet is transforming.
    BEFORE: A place to store videos from the real world
    NOW: A machine that generates synthetic content on demand

    Think about it:
    • TikTok filled with AI-generated content
    • YouTube creators using AI for entire videos
    • Streaming services producing personalized shows
    • Educational content generated for each student

    This changes everything. Remember when only big tech could afford image AI?
    2020: GPT-3 → Only OpenAI
    2022: Stable Diffusion → Everyone
    2024: Midjourney everywhere

    Video AI is next. Radial Attention probably just accelerated the timeline. The future isn't coming. It's here. And it's more accessible than ever.

    Want to ride this wave?
    → Follow me for weekly AI breakthroughs
    → Share if this opened your eyes
    → Try the code: https://lnkd.in/gETYw8eT

    What will YOU create when video AI costs 4x less? #AI #VideoGeneration #MachineLearning #TechInnovation #FutureOfContent
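
The scaling claim in the post above (full attention for nearby frames, sparser attention for distant ones, O(n log n) overall) can be illustrated with a toy cost model. This is only a sketch under stated assumptions, not the mask from the released code: the rule that attention density halves each time the temporal distance doubles, and the names `band_index`, `sparse_cost`, `dense_cost`, and `cost_ratio`, are all hypothetical.

```python
def band_index(distance: int) -> int:
    """Band 0 covers temporal distances 0 and 1; band k covers [2**k, 2**(k+1))."""
    return 0 if distance <= 1 else distance.bit_length() - 1


def sparse_cost(num_frames: int, tokens_per_frame: int) -> float:
    """Attention cost when density halves each time temporal distance doubles (assumed rule)."""
    cost = 0.0
    for i in range(num_frames):
        for j in range(num_frames):
            density = 1.0 / (2 ** band_index(abs(i - j)))
            cost += density * tokens_per_frame ** 2
    return cost


def dense_cost(num_frames: int, tokens_per_frame: int) -> float:
    """Full attention: every token attends to every other token."""
    return float((num_frames * tokens_per_frame) ** 2)


def cost_ratio(num_frames: int, tokens_per_frame: int = 4) -> float:
    """Fraction of the dense cost that the sparse pattern still pays."""
    return sparse_cost(num_frames, tokens_per_frame) / dense_cost(num_frames, tokens_per_frame)


if __name__ == "__main__":
    # The ratio shrinks as the clip gets longer, which is the point of the sparsity.
    for frames in (16, 32, 64, 128):
        print(frames, round(cost_ratio(frames), 3))
```

Each distance band holds twice as many frames as the previous one at half the density, so every band contributes roughly the same cost and the number of bands grows with log(frames). That is where the extra log factor comes from in this simplified model.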

  • Massimiliano Viola

    ML @Bedrock Robotics | Ex Stanford, ETH Zurich | Computer Vision • 3D • Generative Models

    14,566 followers

    This DEFINITELY flew under the radar: just a few days ago, AI at Meta released V-JEPA 2.1, taking a massive step toward closing the gap between image and video domains. For a long time, image backbones were the only option for solving dense vision tasks. This model disagrees, showing that universal spatial understanding also emerges from large-scale video models! 🎥 Quick recap on V-JEPA: it is a joint embedding predictive architecture built on a classic teacher-student setup. The teacher sees the full video, and its weights slowly update as an exponential moving average of the student. The student sees a masked input and predicts the latent features of the missing regions rather than reconstructing them in pixel space. What changed between V1 and V2 was largely a matter of scale. The encoder grew to a 1B-parameter ViT-g, the dataset from 2M to 22M videos, training got longer and progressive, and clips were pushed to higher temporal and spatial resolution. V2 also introduced images into the mix via temporal duplication, training on 1M ImageNet samples. But the difference between V2 and V2.1 is conceptual, on top of just scaling. Sure, they pushed the model to 2B parameters and expanded the image dataset from 1M to 142M, but the real breakthrough lies in the training loss. In V-JEPA 2, supervision was only applied to the masked regions, despite the predictor outputting a token for every input, masked or not. Thus, the visible tokens were free to ignore local structure and aggregate global information if that would minimize the loss, similar to register tokens. V-JEPA 2.1 fixes this by extending supervision to the visible tokens too. Every patch, masked or visible, now has a training signal forcing it to encode where things actually are in space and time. This results in feature maps that look nothing like before: spatially structured, semantically coherent, and temporally consistent. 
Looking at the features below, you would almost think this is some small variant of DINOv3 (with due respect), except these results came from video pretraining! 🤯 This feature quality obviously translates to downstream tasks. Motion benchmarks got only a small buff, but spatial tasks are where the gains are staggering, with improvements ranging anywhere from 30 to 95%. The idea that we now basically have a SOTA image encoder baked into video features is crazy to me, and as someone working with video models on a daily basis, I could not be happier to put this to the test and distill it down into even smaller and faster variants than the smallest 80M. Resources are down in the comments. Try it out if you were using the previous version, and let me know how it goes! ⏬
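
The two mechanisms described in the post above, the exponential-moving-average teacher and the change from masked-only to all-token supervision, can be sketched in a few lines. This is a toy illustration, not Meta's training code: weights and token features are plain Python lists, the L1 distance is an assumption made for the example, and `ema_update` and `latent_loss` are hypothetical names.

```python
def ema_update(teacher, student, momentum=0.99):
    """Move each teacher weight slowly toward the matching student weight."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def latent_loss(pred, target, masked, supervise_visible):
    """Mean L1 distance in latent space over the supervised token positions.

    pred, target: one feature vector per token.
    masked: True where the student did not see the token.
    supervise_visible: False supervises masked tokens only (the V2 behaviour
    described above); True supervises every token (the V2.1 behaviour).
    """
    total, count = 0.0, 0
    for p, t, is_masked in zip(pred, target, masked):
        if is_masked or supervise_visible:
            total += sum(abs(a - b) for a, b in zip(p, t))
            count += 1
    return total / max(count, 1)


if __name__ == "__main__":
    pred = [[0.0, 0.0], [5.0, 5.0]]    # second token is visible and badly predicted
    target = [[0.0, 0.0], [1.0, 1.0]]
    masked = [True, False]
    print(latent_loss(pred, target, masked, supervise_visible=False))
    print(latent_loss(pred, target, masked, supervise_visible=True))
```

With masked-only supervision, the poorly predicted visible token contributes nothing to the loss. Turning on `supervise_visible` is what gives every patch a training signal.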

  • Sachin Kumar

    Senior Data Scientist III at LexisNexis | Experienced Agentic AI and Generative AI Expert

    8,711 followers

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    In this paper, the authors present HunyuanVideo, a novel open-source video foundation model that exhibits performance in video generation that is comparable to, if not superior to, leading closed-source models. HunyuanVideo features a comprehensive framework that integrates several key contributions, including data curation, advanced architecture design, progressive model scaling and training, and an efficient infrastructure designed to facilitate large-scale model training and inference.

    𝗗𝗮𝘁𝗮 𝗣𝗿𝗲-𝗽𝗿𝗼𝗰𝗲𝘀𝘀𝗶𝗻𝗴
    - use an image-video joint training strategy, with videos divided into five distinct groups and images categorized into two groups, each tailored to fit the specific requirements of their respective training processes
    - raw data pool initially comprised videos spanning a wide range of domains including people, animals, plants, landscapes, vehicles, objects, buildings, and animation
    i) Data Filtering
    - employ various filters for data filtering and progressively increase their thresholds to build 4 training datasets, i.e., 256p, 360p, 540p, and 720p, while the final SFT dataset is built through manual annotation
    ii) Data Annotation
    - developed and implemented an in-house Vision Language Model (VLM) designed to generate structured captions for images and videos

    𝗠𝗼𝗱𝗲𝗹 𝗔𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝘂𝗿𝗲 𝗗𝗲𝘀𝗶𝗴𝗻
    i) Unified Image and Video Generative Architecture
    - In the dual-stream phase, video and text tokens are processed independently through multiple Transformer blocks, enabling each modality to learn its own appropriate modulation mechanisms without interference
    - In the single-stream phase, the video and text tokens are concatenated and fed into subsequent Transformer blocks for multimodal information fusion
    ii) MLLM Text Encoder
    - utilize a pretrained Multimodal Large Language Model (MLLM) with a Decoder-Only structure as text encoder
    iii) 3D VAE
    - trains a 3D VAE with CausalConv3D to compress pixel-space videos and images into a compact latent space
    iv) Prompt Rewrite
    - provide two rewrite modes: Normal mode and Master mode
    - Normal mode is designed to enhance the video generation model's comprehension of user intent, for a more accurate interpretation of the instructions provided
    - Master mode enhances the description of aspects such as composition, lighting, and camera movement, for generating videos with a higher visual quality

    𝗖𝗼𝗺𝗽𝗮𝗿𝗶𝘀𝗼𝗻 𝘄𝗶𝘁𝗵 𝗦𝗢𝗧𝗔 𝗺𝗼𝗱𝗲𝗹𝘀
    - comparisons made with five strong baselines from closed-source video generation models using three criteria: Text Alignment, Motion Quality, and Visual Quality
    - HunyuanVideo demonstrated the best overall performance, particularly excelling in motion quality

    𝗧𝗲𝗰𝗵𝗻𝗶𝗰𝗮𝗹 𝗥𝗲𝗽𝗼𝗿𝘁: https://lnkd.in/e8fYi3yp
    𝗖𝗼𝗱𝗲: https://lnkd.in/eCp-SfZZ
    𝗠𝗼𝗱𝗲𝗹 𝗰𝗵𝗲𝗰𝗸𝗽𝗼𝗶𝗻𝘁𝘀: https://lnkd.in/ekMGehv7
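
The "causal" part of the CausalConv3D mentioned above can be illustrated along the time axis alone. The sketch below is a deliberate simplification, not the HunyuanVideo VAE: each frame is reduced to a single number, the zero left-padding is an assumption, and `causal_conv_time` is a hypothetical name.

```python
def causal_conv_time(frames, kernel):
    """Convolve along time so that output[t] depends only on frames[0..t]."""
    k = len(kernel)
    # Pad on the left only, so no output position can look at a future frame.
    padded = [0.0] * (k - 1) + list(frames)
    return [
        sum(kernel[i] * padded[t + i] for i in range(k))
        for t in range(len(frames))
    ]


if __name__ == "__main__":
    kernel = [0.5, 0.5]
    original = causal_conv_time([1.0, 2.0, 3.0, 4.0], kernel)
    edited = causal_conv_time([1.0, 2.0, 3.0, 99.0], kernel)
    # Changing the last frame leaves all earlier outputs untouched.
    print(original[:3] == edited[:3])
```

Because the first output depends only on the first frame, a single image can be encoded as a one-frame video by the same layer, which fits the joint image and video training described in the post.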

  • Bhavishya Pandit

    Turning AI into enterprise value | $20 M in Business Impact | Speaker - MHA/IITs/IIMs/NITs | Google AI Expert | 50 Million+ views | MS in ML - UoA

    85,668 followers

    You can now generate infinite-length videos!? Yes, literally infinite.

    Let me quickly explain why it's a problem to begin with: AI models generate videos frame by frame, and each new frame depends on the previous one. The problem? Tiny errors stack up. By frame 100, your subject starts distorting. By frame 500, everything's a mess 💩

    This happens because the model was trained on clean data, but during generation, it has to build on top of its own imperfect outputs. That gap kills quality over time [vanishing gradient analogy]. Plus, existing methods only handle one prompt, so you get repetitive scenes with no real story progression.

    Here's where Stable Video Infinity from EPFL shines 💡: Instead of fighting errors, it learns from them. The breakthrough is Error-Recycling Fine-Tuning. During training, the model deliberately injects its own past errors into clean frames, watches what goes wrong, and figures out how to fix it.

    Here's the process:
    → inject historical errors to simulate real generation conditions
    → predict where drift will happen
    → bank those errors in memory
    → learn to correct them before they compound

    This creates three powerful results:
    • Videos can extend infinitely without quality collapse
    • Scene transitions happen naturally with controllable storylines
    • Works with multiple conditions like audio, skeleton poses, and text streams

    They've generated 10-minute Tom & Jerry videos from a single image. Not stitched clips, but continuous generation. The efficiency comes from only training LoRA adapters, not the full model. You can customise it without massive computing.

    The challenges? Real-time streaming isn't there yet. The model generates clip-by-clip with bidirectional attention for quality, which means you can't stream live outputs instantly. You still need decent hardware to train custom versions, though inference is manageable. And while error recycling is clever, the model needs to bank enough error patterns during training to handle diverse scenarios.

    But the future's interesting. They're working on Wan 2.2 5B-based SVI and true streaming generation. If they can achieve real-time inference while maintaining quality, this becomes viable for live content creation and gaming.

    The bigger idea here is training models on their own mistakes, rather than just clean data. That could apply beyond video to any autoregressive generation task.

    What's the longest AI-generated video you've successfully created without quality degradation, and what method did you use?

    Follow me, Bhavishya Pandit, for honest takes on AI breakthroughs that actually work 🔥
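
The "bank errors, then inject them into clean frames" loop described above can be sketched as a small data structure. This is a toy under stated assumptions, not the Stable Video Infinity implementation: frames are flat lists of numbers, the bank is a fixed-size queue, and `ErrorBank`, `add`, and `corrupt` are hypothetical names.

```python
import random
from collections import deque


class ErrorBank:
    """Fixed-size memory of past prediction errors (prediction minus clean target)."""

    def __init__(self, capacity):
        self.errors = deque(maxlen=capacity)  # oldest errors fall out first

    def add(self, predicted, clean):
        """Store the residual the model made on one frame."""
        self.errors.append([p - c for p, c in zip(predicted, clean)])

    def corrupt(self, clean, rng):
        """Return a clean frame with one banked error added, to mimic drifted input."""
        if not self.errors:
            return list(clean)
        error = rng.choice(list(self.errors))
        return [c + e for c, e in zip(clean, error)]


if __name__ == "__main__":
    rng = random.Random(0)
    bank = ErrorBank(capacity=100)
    bank.add(predicted=[1.2, 0.9], clean=[1.0, 1.0])
    # During fine-tuning the model would see this drifted input and be trained
    # to map it back to the clean target.
    print(bank.corrupt([2.0, 2.0], rng))
```

Training on inputs corrupted this way narrows the gap the post describes between clean training data and the model's own imperfect outputs at generation time.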

  • Kuldeep Singh Sidhu

    Senior Data Scientist @ Walmart | BITS Pilani

    16,490 followers

    🤔 Why don't we have abundant video-language models (VLMs)? The simple answer, compute! 💻 The pre-training process for video-related tasks demands exceptionally large computational and data resources, which hinders the progress of video-language models. 📊 This is the very problem that is tackled in "PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning" 📝 The paper discusses the significant improvement in performance across a wide range of image-language applications due to vision-language pre-training. 🚀 The authors propose a straightforward, highly efficient, and resource-light approach to adapting an existing image-language pre-trained model for dense video understanding. 👍 Preliminary experiments reveal that directly fine-tuning pre-trained image-language models with multiple frames as inputs on video datasets leads to performance saturation or even a drop. 😕 The authors attribute this to the bias of learned high-norm visual features. 📉 To address this, they propose a simple but effective pooling strategy to smooth the feature distribution along the temporal dimension and thus reduce the dominant impacts from the extreme features. 🔄 The new model is termed Pooling LLaVA. 🌊 Pooling LLaVA achieves new state-of-the-art performance on modern benchmark datasets for both video question-answer and captioning tasks. 🏆 Notably, on the recent popular Video ChatGPT benchmark, Pooling LLaVA achieves a score of 3.48 out of 5 on average of five evaluated dimensions, exceeding the previous SOTA results from GPT4V (IG-VLM) by 9%. 📈 On the latest multi-choice benchmark MVBench, Pooling LLaVA achieves 58.1% accuracy on average across 20 sub-tasks, 14.5% higher than GPT4V (IG-VLM). 🔝 The model and the code for the model are open-sourced as well! 🌐 Paper: https://lnkd.in/gcWZMP6d 📜 Code: https://lnkd.in/gQbYaTXn 💻 Models: https://lnkd.in/gZgE6gwk 🤖
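
The pooling idea described above, smoothing features along the temporal dimension so that extreme high-norm features have less influence, can be shown on a tiny example. This sketch covers only the time axis and uses plain averaging over contiguous groups of frames as an assumption. It is not the PLLaVA code, and `temporal_avg_pool` and `max_norm` are hypothetical names.

```python
import math


def temporal_avg_pool(frames, out_frames):
    """Average contiguous groups of per-frame feature vectors down to out_frames vectors."""
    t = len(frames)
    if out_frames <= 0 or t % out_frames != 0:
        raise ValueError("out_frames must be positive and divide the number of frames")
    group = t // out_frames
    dim = len(frames[0])
    pooled = []
    for g in range(out_frames):
        chunk = frames[g * group:(g + 1) * group]
        pooled.append([sum(f[d] for f in chunk) / group for d in range(dim)])
    return pooled


def max_norm(frames):
    """Largest Euclidean norm among the feature vectors."""
    return max(math.sqrt(sum(x * x for x in f)) for f in frames)


if __name__ == "__main__":
    features = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [9.0, 9.0]]  # last frame is an outlier
    pooled = temporal_avg_pool(features, out_frames=2)
    print(max_norm(features), max_norm(pooled))
```

Averaging can never increase the largest norm, so an outlier frame is pulled toward its neighbours while no learned parameters are added, which matches the "parameter-free" framing in the paper title.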

  • Bilawal Sidhu

    Creator (1.6M+) | TED Tech Curator | Ex-Google PM (XR & 3D Maps) | Spatial Intelligence, World Models & Visual Effects

    59,100 followers

    This is some quietly impressive work on making video world models actually controllable in 4D space. VerseCrafter lets you take an input image, use something like Blender to animate the 3D camera path and object trajectories, then uses that to condition generation. Scribbling in 2D feels so crude in comparison. The authors represent everything in a shared 4D world state - static background as a point cloud, moving objects as 3D gaussian trajectories. The gaussians are an interesting choice because they capture position, shape, and orientation probabilistically rather than forcing rigid bounding boxes or category specific models like SMPL-X for human bodies. They bolt this onto frozen Wan2.1 with a lightweight adapter, so they get a strong video prior. They also built a pipeline to auto extract 4D annotations from real world videos to train this puppy. It doesn't look sexy yet, but IMO this is the interface video world models need - actual 3D authoring tools to exert control rather than crude scribbles and prompt incantations. #ai #3d

  • Sahar Mor

    I help researchers and builders make sense of AI | ex-Stripe | aitidbits.ai | Angel Investor

    42,073 followers

    The last few weeks have been huge for open-source video generation and research. After two years of limited usability in open-source video generative models, we’re finally seeing major advancements. These new models outperform commercial ones, including Runway, Pika Labs, and Luma Labs.

    → Mochi, released by Genmo a week ago, ranks 2nd among top generative video models and permits commercial use under an Apache 2.0 license
    → CogVideoX-5B from Tsinghua University, released last month, supports both text2video and image2video and allows commercial use for companies with <1M users
    → Allegro from Rhymes AI is a small model capable of generating a wide range of content, from human close-ups to diverse, dynamic scenes, and permits commercial use under an Apache 2.0 license

    Also, over the last few weeks, Meta announced MovieGen for generating HD personalized videos with synchronized audio, and Peking University openly released Pyramid Flow.

    On the proprietary generative video side of things, Runway released a new tool for transforming simple video and voice inputs into expressive character performances, Pika Labs released Pikaffects to transform video subjects with surreal effects, and Luma Labs announced API access to its generative video models. As for OpenAI’s Sora, who knows, it might launch soon after the US elections are out of the way (this Tuesday).

    Generative video models leaderboard: https://lnkd.in/grWJDVkd
    Links to the mentioned models are in the comments.

  • Niels Rogge

    Machine Learning Engineer at ML6 & Hugging Face

    69,605 followers

    Let's go!! Meta released a new video LLM on Hugging Face, and it sets a new SOTA (state-of-the-art) for open-source video understanding. 🔥

    The model is called LongVU, a new multimodal large language model capable of processing long videos (for things like answering questions about them, summarizing them, identifying important passages, etc.). LongVU is capable of processing very long videos thanks to various clever compression techniques, which progressively reduce the number of tokens used to represent a video (and which a Transformer needs to process in parallel).

    First, the authors employ DINOv2, a self-supervised image model also open-sourced by Meta, to remove redundant frames that exhibit high feature similarity across time. Next, features for the remaining frames are combined with features from SigLIP, an important vision encoder open-sourced by Google. The large language model (the text decoder part of LongVU) is conditioned on these features.

    Next, after temporal reduction, the authors employ spatial reduction (reducing the width and height dimensions of certain video features). Based on the embeddings of the text query (e.g. "What did this man put on the pizza?"), less important frames get their features' resolution reduced, whereas the most important frames' features keep their original resolution.

    Finally, spatial token compression (STC) is performed to further reduce the number of tokens. This is based on a technique where a non-overlapping window is slid over the tokens, and tokens that exhibit high cosine similarity with the first frame of each window are removed.

    In terms of performance, the model gets SOTA results on EgoSchema, MVBench, VideoMME, and MLVU. Only on VideoMME is the gap with closed-source models (GPT-4o and Gemini) still large, but it's quite impressive to see the results.

    Resources:
    * paper: https://lnkd.in/eG-rC8Fg
    * Gradio demo for you to try: https://lnkd.in/e8Ey9ci7
    * checkpoints: https://lnkd.in/eJr8WWQB
    * project page: https://lnkd.in/ee7dirPR

    #huggingface #video #largelanguagemodels #generativeai #ai
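
The first compression step described above, dropping frames whose features are highly similar to a frame already kept, can be sketched as follows. This is a minimal illustration, not the LongVU code: the per-frame features are tiny hand-written vectors standing in for DINOv2 outputs, the 0.95 threshold is an arbitrary assumption, and `cosine` and `drop_redundant_frames` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either has zero norm)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def drop_redundant_frames(features, threshold=0.95):
    """Return indices of frames to keep.

    A frame is kept only if its similarity to the most recently kept frame
    falls below the threshold, so near-duplicates in time are removed.
    """
    if not features:
        return []
    kept = [0]
    for i in range(1, len(features)):
        if cosine(features[i], features[kept[-1]]) < threshold:
            kept.append(i)
    return kept


if __name__ == "__main__":
    frame_features = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 1.0]]
    print(drop_redundant_frames(frame_features))
```

Comparing against the last kept frame rather than the immediately preceding one avoids a slow drift in which each frame resembles its neighbour yet the scene has changed substantially since the last frame that was kept.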

  • Avani Rajput

    Helping businesses scale with AI | Sales Leader

    14,153 followers

    AI video creation has crossed a new milestone! Alibaba Group has released Wan2.2-S2V-14B, an open-source video generation model that feels like it is finally ready for real-world use.

    Here’s what you can do with Wan2.2:
    1. Film-style control: You can adjust lighting, color, and overall “cinematic feel” just like a director would.
    2. Voice-to-video: Record a short audio clip, upload an image, and Wan2.2 turns it into a full cinematic sequence. Not just a talking head, but expressive, dynamic scenes.
    3. Realistic motion: Wan2.2 was trained on one of the largest video datasets yet, which means smoother, more natural movements.
    4. Works on consumer hardware: This is really amazing. You can run it on a single 4090 GPU and still get 720P @ 24fps video. Before, you’d need data-center-level hardware for that.
    5. One tool, many modes: Text-to-video, image-to-video, or hybrid, all supported in the same model.

    Think about who could use this:
    - A marketer can create a brand video in hours, not weeks
    - An e-commerce team can turn product images into lifestyle clips instantly
    - An educator can narrate a lesson and have it auto-converted into an engaging video
    - A filmmaker can prototype entire scenes before ever stepping on set

    This isn’t just about faster video. It’s about who gets to participate in video creation.

    If you had access to film-grade video AI on your laptop tomorrow, what’s the very first thing you’d create? Let me know in the comments section.

    Follow Avani Rajput for more such AI insights.
