Experience the next generation of AI video creation with UniVideo from the Kling Team (KwaiVGI). It unifies text-to-video generation, image-to-video conversion, in-context video editing, and multimodal understanding in one powerful open-source framework.
UniVideo is a groundbreaking open-source unified video foundation model developed by the Kling Team at KwaiVGI. Unlike traditional task-specific video AI models, UniVideo combines video understanding, generation, and editing capabilities into a single cohesive framework.
Built on an innovative dual-stream architecture that pairs a Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), UniVideo achieves state-of-the-art performance across multiple video AI benchmarks while enabling seamless task composition and zero-shot generalization.
MLLM (Qwen2.5-VL) + MMDiT (HunyuanVideo) for unified processing
Text-to-video, image-to-video, editing, and understanding in one model
Transfer image capabilities to video tasks without additional training
Pipeline: Text / Image / Video Instructions → Multimodal Understanding & Instruction Processing → Video Generation & Editing Engine → Generated / Edited Video Output
Discover the powerful features that make UniVideo the most versatile unified video AI model available. From text-to-video generation to advanced in-context editing, UniVideo delivers exceptional results.
Generate high-quality videos directly from text prompts using UniVideo's advanced natural language understanding and video synthesis capabilities.
Transform static images into dynamic videos with UniVideo's image-to-video AI capabilities, maintaining visual consistency and natural motion.
Create videos with multiple consistent characters and objects using in-context learning, enabling multi-ID video generation without additional training.
Edit videos without masks using UniVideo's instruction-based editing. Replace backgrounds, change materials, insert objects, and more with natural language commands.
Apply artistic styles to videos while preserving content structure. Transform video aesthetics with UniVideo's powerful style transfer capabilities.
Use images and diagrams as visual prompts to guide video generation. UniVideo understands visual references for precise creative control.
Explore real examples of UniVideo's capabilities across text-to-video generation, video editing, style transfer, and in-context video generation.
From filmmaking to marketing, UniVideo transforms how professionals create and edit video content across diverse industries.
Create pre-viz sequences, concept videos, and storyboard animations for film and TV production.
Generate cinematics, character animations, and promotional content for games and interactive media.
Produce engaging video ads, social media content, and marketing videos at scale.
Create educational videos, training materials, and explainer content for learning applications.
UniVideo serves diverse professionals from AI researchers to content creators, providing tools tailored to each user's needs.
Access source code, model weights, and technical documentation for research and development.
Generate and edit professional video content with intuitive tools and workflow templates.
Integrate UniVideo into applications and services with comprehensive API documentation.
Accelerate creative workflows with AI-powered video generation and style transfer tools.
UniVideo employs an innovative dual-stream architecture that combines the strengths of Multimodal Large Language Models (MLLM) for understanding with Multimodal Diffusion Transformers (MMDiT) for generation.
The MLLM branch, based on Qwen2.5-VL-7B-Instruct, processes multimodal instructions and provides semantic understanding. The MMDiT branch, derived from HunyuanVideo-T2V-13B, handles the actual video generation and editing through diffusion-based synthesis.
This architecture enables UniVideo to handle diverse tasks including text-to-video generation, image-to-video conversion, in-context generation, and free-form video editing through a unified instruction paradigm.
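For readers who want a concrete picture, the sketch below illustrates the dual-stream idea in PyTorch: an understanding branch encodes the instruction into conditioning embeddings, and a generation branch denoises video latent tokens that cross-attend to those embeddings. All module names, layer counts, and dimensions are illustrative placeholders, not UniVideo's actual implementation.

```python
import torch
import torch.nn as nn

class DualStreamSketch(nn.Module):
    """Toy illustration of the MLLM + MMDiT pairing (not UniVideo's real code)."""

    def __init__(self, text_dim=1024, latent_dim=64, hidden_dim=512):
        super().__init__()
        # Stand-in for the MLLM branch: maps instruction tokens to condition embeddings
        self.mllm_proj = nn.Linear(text_dim, hidden_dim)
        # Stand-in for the MMDiT branch: transformer blocks over video latent tokens
        self.latent_proj = nn.Linear(latent_dim, hidden_dim)
        self.blocks = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.out = nn.Linear(hidden_dim, latent_dim)  # predicts the denoising target

    def forward(self, instruction_emb, noisy_latents):
        # instruction_emb: (batch, instr_tokens, text_dim) from the understanding branch
        # noisy_latents:  (batch, video_tokens, latent_dim) from the generation branch
        cond = self.mllm_proj(instruction_emb)
        x = self.latent_proj(noisy_latents)
        # Video tokens cross-attend to the instruction embeddings (conditioning)
        x = self.blocks(tgt=x, memory=cond)
        return self.out(x)

model = DualStreamSketch()
noise_pred = model(torch.randn(1, 16, 1024), torch.randn(1, 128, 64))
print(noise_pred.shape)  # torch.Size([1, 128, 64])
```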
UniVideo achieves state-of-the-art results across multiple video AI benchmarks, demonstrating superior performance in understanding, generation, and editing tasks.
| Benchmark | UniVideo Score | Previous SOTA | Improvement |
|---|---|---|---|
| MMBench (Understanding) | 83.5 | 81.2 | +2.8% |
| VBench (Generation) | 82.58 | 80.1 | +3.1% |
| CLIP-I (Identity Alignment) | 0.728 | 0.695 | +4.7% |
| Video Editing Quality | 87.3 | 82.6 | +5.7% |
Get UniVideo running in minutes with our streamlined installation process. Follow these steps to start generating AI videos.
Download UniVideo source code from GitHub and set up your development environment with Python 3.11 and CUDA 12.1.
Fetch UniVideo model weights from Hugging Face. The model includes both MLLM and MMDiT components.
Execute the provided inference scripts to generate videos from text prompts or edit existing video content.
```bash
# Clone the UniVideo repository
git clone https://github.com/KwaiVGI/UniVideo.git
cd UniVideo

# Install dependencies
pip install -r requirements.txt

# Download model weights from Hugging Face
huggingface-cli download KwaiVGI/UniVideo --local-dir ./checkpoints

# Run text-to-video generation
python inference.py --prompt "A cat walking through a beautiful garden"
```
Ready-to-use workflow templates for common video generation scenarios. Copy and customize for your specific needs.
Create compelling 5-15 second product showcase videos from text descriptions and product images.
Transform lesson scripts into engaging educational videos with visual explanations and animations.
Generate consistent game cinematics from concept art with style-unified video sequences.
UniVideo's text-to-video generation capability transforms natural language descriptions into high-quality video content. Using the powerful combination of MLLM for text understanding and MMDiT for video synthesis, UniVideo generates coherent, visually appealing videos that match your text prompts.
"A majestic eagle soaring through mountain peaks at golden hour, cinematic lighting, 4K quality"
Transform static images into dynamic video content with UniVideo's image-to-video (I2V) capability. The model preserves visual consistency while adding natural motion and animation to your source images.
Input: Portrait photograph
Instruction: "Make the person smile and turn their head slightly"
Output: Animated video with natural facial movement
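Using the same Python API, the portrait example maps onto the image_to_video call. A sketch; the file names are placeholders.

```python
from univideo import UniVideoModel

model = UniVideoModel(checkpoint_path="./checkpoints")

# Animate the portrait using the instruction from the example above
video = model.image_to_video(
    image="portrait.jpg",
    motion_prompt="Make the person smile and turn their head slightly",
)
video.save("animated_portrait.mp4")
```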
UniVideo enables powerful video editing capabilities through natural language instructions, eliminating the need for masks or complex selection tools.
Replace video backgrounds instantly with natural language commands. No manual masking required.
Transform object materials and textures throughout the video while maintaining temporal consistency.
Add or remove objects from videos seamlessly with instruction-based editing powered by AI.
UniVideo's in-context generation capability enables creating videos with multiple consistent characters, objects, or references without additional training. Simply provide reference images, and UniVideo maintains identity consistency throughout the generated video.
References: [Image of person A], [Image of person B]
Prompt: "Person A and Person B having a conversation at a café"
Result: Video with consistent character identities
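The exact interface for passing references isn't documented on this page, so the following is a hypothetical sketch: the reference_images parameter is an assumption for illustration, not a confirmed part of the API.

```python
from univideo import UniVideoModel

model = UniVideoModel(checkpoint_path="./checkpoints")

# Hypothetical: pass identity references alongside the prompt.
# `reference_images` is an assumed parameter, not part of the documented API.
video = model.generate(
    prompt="Person A and Person B having a conversation at a café",
    reference_images=["person_a.jpg", "person_b.jpg"],
    num_frames=48,
)
video.save("cafe_conversation.mp4")
```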
Use images, diagrams, and visual references to guide video generation with unprecedented precision. UniVideo understands visual prompts to create videos that match your creative vision.
Input: Sketch diagram of scene layout
Style Ref: Oil painting reference image
Text: "Sunset over mountains"
Output: Video matching layout in oil painting style
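As a hypothetical sketch of how this could look in code, the layout_image and style_image parameters below are assumed names for illustration, not confirmed API.

```python
from univideo import UniVideoModel

model = UniVideoModel(checkpoint_path="./checkpoints")

# Hypothetical sketch: `layout_image` and `style_image` are assumed
# parameter names, not confirmed by UniVideo's documented API.
video = model.generate(
    prompt="Sunset over mountains",
    layout_image="scene_layout_sketch.png",    # rough spatial composition
    style_image="oil_painting_reference.jpg",  # target aesthetic
)
video.save("sunset_oil_painting.mp4")
```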
UniVideo's unified architecture enables combining multiple tasks in single operations, such as editing + style transfer or generation + in-context references.
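A composed operation might look like the following sketch: an edit instruction and a style reference handled in one call. The style_image parameter is an assumption added for illustration.

```python
from univideo import UniVideoModel

model = UniVideoModel(checkpoint_path="./checkpoints")

# Hypothetical composition of editing and style transfer in one call;
# `style_image` is an assumed parameter for illustration.
edited = model.edit_video(
    video="source_video.mp4",
    instruction=(
        "Replace the background with a snowy forest, "
        "then render the whole clip in watercolor style"
    ),
    style_image="watercolor_reference.jpg",
)
edited.save("composed_edit.mp4")
```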
Integrate UniVideo into your applications with our comprehensive API. Build custom video generation and editing workflows programmatically.
```python
from univideo import UniVideoModel

# Initialize the UniVideo model from local checkpoints
model = UniVideoModel(checkpoint_path="./checkpoints")

# Text-to-video generation
t2v_video = model.generate(
    prompt="A beautiful sunset over the ocean with waves",
    num_frames=48,
    resolution=(1280, 720),
)
t2v_video.save("text_to_video.mp4")

# Image-to-video conversion
i2v_video = model.image_to_video(
    image="input_image.jpg",
    motion_prompt="Camera slowly pans right",
)
i2v_video.save("image_to_video.mp4")

# Instruction-based video editing
edited_video = model.edit_video(
    video="source_video.mp4",
    instruction="Replace the background with a forest scene",
)
edited_video.save("edited_video.mp4")
```
Comprehensive documentation to help you get the most out of UniVideo's capabilities.
Step-by-step instructions for setting up UniVideo on your local machine or cloud environment.
Read Guide →
Complete API documentation with function signatures, parameters, and usage examples.
View API →
Watch video walkthroughs covering UniVideo features and best practices.
Watch Now →
Dive into the research behind UniVideo's groundbreaking unified video model architecture.
Comprehensive paper detailing the dual-stream architecture, training methodology, and benchmark results.
See how UniVideo compares to other leading AI video generation models in terms of capabilities and features.
| Feature | UniVideo | Kling v1.6 | Sora | Pika |
|---|---|---|---|---|
| Text-to-Video | ✓ | ✓ | ✓ | ✓ |
| Image-to-Video | ✓ | ✓ | ✓ | ✓ |
| In-Context Generation | ✓ | — | — | — |
| Free-Form Editing | ✓ | — | ✓ | ✓ |
| Task Composition | ✓ | — | — | — |
| Open Source | ✓ | — | — | — |
| Video Understanding | ✓ | — | — | — |
Connect with researchers, developers, and creators working with UniVideo and AI video generation.
Star the repo, contribute code, and report issues.
Visit GitHub
Access model weights and try online demos.
Visit HF
Follow for updates and community highlights.
Follow Us
Join discussions with the AI video community.
Join Discord
Find answers to common questions about UniVideo installation, usage, capabilities, and requirements.
UniVideo is an open-source unified video foundation model developed by the Kling Team at KwaiVGI. It uses a dual-stream architecture combining a Multimodal Large Language Model (MLLM based on Qwen2.5-VL) for understanding with a Multimodal Diffusion Transformer (MMDiT based on HunyuanVideo) for generation. This enables UniVideo to handle video understanding, generation, and editing tasks in a single unified framework.
UniVideo requires a GPU with at least 24GB VRAM for basic usage. For optimal performance, 40GB or 80GB GPUs are recommended. The software requirements include Python 3.11, PyTorch 2.4.1, and CUDA 12.1. Cloud platforms like Google Colab Pro or Hugging Face Spaces can also be used to run UniVideo without local GPU hardware.
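Before downloading the checkpoints, a quick environment check like the one below (standard Python and PyTorch calls) can confirm a machine meets these requirements:

```python
import sys
import torch

# Check Python version (UniVideo targets Python 3.11)
print(f"Python: {sys.version_info.major}.{sys.version_info.minor}")

# Check PyTorch and CUDA availability (UniVideo targets PyTorch 2.4.1 / CUDA 12.1)
print(f"PyTorch: {torch.__version__}, CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / (1024 ** 3)
    print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")
    if vram_gb < 24:
        print("Warning: below the 24 GB VRAM minimum recommended for UniVideo.")
```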
Yes, UniVideo is released under the Apache-2.0 license, which permits both personal and commercial use. You can freely use, modify, and distribute UniVideo in your projects. The source code is available on GitHub and model weights are hosted on Hugging Face for easy access.
UniVideo differentiates itself by being fully open-source and offering unified capabilities that most other models lack. While Sora, Runway, and Pika are proprietary and task-specific, UniVideo combines video understanding, generation, and editing in one model. UniVideo uniquely supports in-context generation, task composition, and zero-shot generalization from images to videos.
UniVideo supports comprehensive free-form video editing through natural language instructions. Capabilities include green screen background replacement, material and texture changes, object insertion and removal, style transfer, and in-context editing. Unlike traditional tools, UniVideo doesn't require manual masking or complex selection processes.
Yes, UniVideo can be run on cloud platforms. Hugging Face Spaces offers hosted demos for quick testing. For more control, you can use Google Colab Pro with A100 GPUs or cloud providers like AWS, GCP, or Azure with appropriate GPU instances. The community has also created various notebooks and deployment guides for cloud setup.
In-context video generation allows you to provide reference images of specific characters, objects, or styles, and UniVideo will maintain their consistency throughout the generated video. This enables multi-ID video generation where multiple distinct characters appear with preserved identities, without requiring additional model training or fine-tuning.
Installation involves three main steps: 1) Clone the GitHub repository, 2) Install Python dependencies using pip install -r requirements.txt, and 3) Download model weights from Hugging Face using huggingface-cli. Detailed installation instructions are available in the GitHub repository README and our Getting Started guide above.
Track the development progress of UniVideo and upcoming features planned by the Kling Team.
UniVideo model weights, source code, and documentation released on GitHub and Hugging Face.
Improved video editing with better temporal consistency and expanded instruction support.
Extended support for 4K video generation and longer video durations.
Planned audio generation and synchronization capabilities for complete video production.
Discover how creators and businesses are using UniVideo for AI video generation and editing.
How a film studio reduced pre-viz production time by 60% using UniVideo for rapid scene conceptualization.
A marketing agency created 50+ personalized video ads in a week using UniVideo's batch generation capabilities.
An EdTech startup built an entire course library with AI-generated explainer videos using UniVideo.
Choose the right hardware configuration for your UniVideo deployment based on your usage needs.
UniVideo is open source and welcomes contributions from the community. Here's how you can get involved.
Submit pull requests to improve UniVideo's functionality, fix bugs, or add new features.
Help improve documentation, write tutorials, or translate content to other languages.
Report bugs, suggest features, or share feedback to help improve UniVideo.
UniVideo is developed by the Kling Team at KwaiVGI (Kuaishou Vision Generation & Intelligence), a leading research group focused on AI video generation and multimodal understanding.
The team has previously released influential models including the Kling video generation series, and continues to push the boundaries of what's possible with AI-powered video creation and editing.
Join thousands of creators and researchers using UniVideo for AI video generation and editing.