Abstract
LTX-2 is an open-source audiovisual diffusion model that generates synchronized video and audio content using a dual-stream transformer architecture with cross-modal attention and classifier-free guidance.
Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent -- missing the semantic, emotional, and atmospheric cues that audio provides. We introduce LTX-2, an open-source foundational model capable of generating high-quality, temporally synchronized audiovisual content in a unified manner. LTX-2 consists of an asymmetric dual-stream transformer with a 14B-parameter video stream and a 5B-parameter audio stream, coupled through bidirectional audio-video cross-attention layers with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning. This architecture enables efficient training and inference of a unified audiovisual model while allocating more capacity for video generation than audio generation. We employ a multilingual text encoder for broader prompt understanding and introduce a modality-aware classifier-free guidance (modality-CFG) mechanism for improved audiovisual alignment and controllability. Beyond generating speech, LTX-2 produces rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene -- complete with natural background and foley elements. In our evaluations, the model achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of their computational cost and inference time. All model weights and code are publicly released.
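The bidirectional audio-video cross-attention mentioned above can be illustrated with a toy sketch. This is not the LTX-2 implementation. It assumes single-head dot-product attention, no learned projections, and equal token widths for both streams. It also omits the temporal positional embeddings and cross-modality AdaLN. The function names (`cross_attend`, `bidirectional_cross_attention`) and the toy token values are hypothetical, chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Toy single-head cross-attention (assumption: no learned projections).

    Each query token receives a softmax-weighted average of the context
    tokens, which is added back to the query as a residual update.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in context]
        weights = softmax(scores)
        mixed = [sum(w * k[d] for w, k in zip(weights, context)) for d in range(dim)]
        updated.append([qi + mi for qi, mi in zip(q, mixed)])
    return updated


def bidirectional_cross_attention(video_tokens, audio_tokens):
    """Video attends to audio and audio attends to video.

    Both directions read the other stream's tokens from before the update.
    """
    new_video = cross_attend(video_tokens, audio_tokens)
    new_audio = cross_attend(audio_tokens, video_tokens)
    return new_video, new_audio


# Hypothetical toy inputs: 3 video tokens and 2 audio tokens of width 4.
video = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
audio = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]

new_video, new_audio = bidirectional_cross_attention(video, audio)
print(len(new_video), len(new_audio))
```

Each stream keeps its own sequence length, so the two modalities can run at different token rates while still exchanging information. In a real model the two streams would typically have different widths and would need projections into a shared attention space.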
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- 3MDiT: Unified Tri-Modal Diffusion Transformer for Text-Driven Synchronized Audio-Video Generation (2025)
- ProAV-DiT: A Projected Latent Diffusion Transformer for Efficient Synchronized Audio-Video Generation (2025)
- MM-Sonate: Multimodal Controllable Audio-Video Generation with Zero-Shot Voice Cloning (2026)
- DreamFoley: Scalable VLMs for High-Fidelity Video-to-Audio Generation (2025)
- JoVA: Unified Multimodal Learning for Joint Video-Audio Generation (2025)
- In-Context Audio Control of Video Diffusion Transformers (2025)
- JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
putting this in the comment section by accident is wild lmao
it's literally crazy.
Create the videos one at a time (12 images in total); first create the first one in 18:9 format ## FINISHED SCRIPT: "5 Countries That Could Disappear in Our Lifetime"
Runtime: ~1:45 – 2:00
Tone: Alarming but factual
HOOK (0:00 – 0:10)
Visual: A world map with pieces starting to disappear. Ominous music.
Text:
"You look at a world map and think it's permanent. It's not. Some countries we know today might not exist when you're old. Here are 5 nations fighting for survival right now."
PLACE No. 5: Maldives
Visual: Paradise islands, ocean, waves, people on the beach.
Text:
"Number 5: The Maldives. The most beautiful islands in the Indian Ocean. Average height above sea level? Just 1.5 meters. Scientists say if sea levels keep rising, the Maldives could be underwater by the end of this century. The government is already buying land in other countries to move its people. A paradise that's disappearing."
PLACE No. 4: Taiwan
Visual: A map showing Taiwan next to China, flags.
Text:
"Number 4: Taiwan. This is not about climate — it's about politics. Taiwan has been independent in practice for decades, but China claims it as its territory. Tensions are rising. If China decides to take control by force, Taiwan as an independent country could cease to exist."
PLACE No. 3: Kiribati
Visual: The Pacific Ocean, small islands, a map.
Text:
"Number 3: Kiribati. A nation of 33 islands in the Pacific Ocean. Most of them are barely above water. Their president bought land in Fiji just to have somewhere to move when the ocean swallows them. They might be the first country to disappear completely. And it's happening now."
PLACE No. 2: Bangladesh
Visual: Floods, people knee-deep in water, a map of Bangladesh.
Text:
"Number 2: Bangladesh. One of the most densely populated countries on Earth. 170 million people living on a giant river delta. Every year, floods get worse. By 2050, scientists predict 20% of the country could be underwater. That's 30 million climate refugees. One of the poorest nations could simply become unlivable."
PLACE No. 1: Tuvalu
Visual: A small island in the middle of the ocean, waves, sun.
Text:
"Number 1: Tuvalu. A tiny island nation in the Pacific. The highest point is 4.5 meters above sea level. But when high tides come, the whole country floods. The government is building seawalls, but it might not be enough. Tuvalu could be the first country to lose its land completely. And the scariest part? It could happen in the next 30 years."
OUTRO (1:45 – 2:00)
Visual: A world map with a question mark. The music fades.
Text:
"Which of these countries would you save? Let me know in the comments. And if you want more geography and history — subscribe. The next video will be about what happens when a country disappears completely."
