This directory contains writeups of four unified multimodal models. Read in chronological order, they look like four separate research lines. Read as a family, they look like one ongoing argument about which architectural pieces of the original "Transfusion recipe" are actually necessary and which are workarounds that should be removed.
This page introduces the family before the individual writeups so you can read each paper as a move within a shared design conversation, not as an isolated artifact.
A model belongs to the Transfusion family if and only if it satisfies all five of the following properties:
These five together form a coherent design point. Drop any one of them and you land in a different family of models with different properties.
The clearest way to read the family is as four moves in a single argument:
Each step does one of two things: it adds machinery to handle a shortcoming (BAGEL adds MoT and a second encoder), or it removes machinery the previous step relied on (Tuna removes MoT, Tuna-2 removes encoders). After BAGEL the pattern is subtractive: each later paper asks, "Do we really need this part?"
| | Transfusion | BAGEL | Tuna | Tuna-2 |
|---|---|---|---|---|
| Vision input | VAE patches | SigLIP 2 ViT | VAE → SigLIP 2 cascade | raw pixels → linear patchify |
| Vision output | VAE patches | FLUX VAE patches | same cascade (re-encoded each step) | raw pixels (no VAE) |
| Encoder count | 1 (VAE) | 2 (decoupled) | 1 (cascade) | 0 |
| Generation space | VAE latent | VAE latent | VAE latent | pixel space |
| Diffusion variant | DDPM (predict ε) | rectified flow (predict v) | rectified flow (predict v) | rectified flow (predict clean image, x-pred) |
| Per-modality params | none — fully shared | MoT split | none — fully shared | none — fully shared |
| Generation head | linear / U-Net on Transformer output | velocity prediction in same Transformer | separate DiT-style head with AdaLN-Zero | separate flow-matching head, x-prediction |
| LLM init | from scratch | Qwen 2.5 | Qwen 2.5 | Qwen 2.5 |
| Cost per generation step | 1 LLM forward | 1 LLM forward | 1 SigLIP 2 + 1 LLM + 1 head | 1 patchify + 1 LLM + 1 head |
| Training stages | 1 | 3+ (data curricula) | 3 (incl. encoder warm-up) | 2 (no warm-up) |
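
The "Diffusion variant" row is easier to compare once the three prediction targets are written against one noising path. The sketch below is a minimal illustration, not code from any of the four papers. It assumes the rectified-flow interpolation `x_t = (1 - t) * x + t * noise`, with `t = 0` as data and `t = 1` as pure noise; papers differ on this convention, and Transfusion's DDPM schedule is not a straight line, so its ε-prediction only matches this picture loosely. All function names are hypothetical.

```python
import math


def interpolate(x, noise, t):
    """Assumed rectified-flow path: x_t = (1 - t) * x + t * noise."""
    return [(1.0 - t) * xi + t * ni for xi, ni in zip(x, noise)]


def velocity_target(x, noise):
    """v-prediction target: d(x_t)/dt = noise - x, constant along the path."""
    return [ni - xi for xi, ni in zip(x, noise)]


def x_from_v(x_t, v, t):
    """Recover the clean sample from a velocity prediction."""
    return [xti - t * vi for xti, vi in zip(x_t, v)]


def v_from_x(x_t, x_pred, t):
    """Convert an x-prediction into a velocity; undefined at t = 0."""
    if t <= 0.0:
        raise ValueError("t must be > 0 to convert an x-prediction to a velocity")
    return [(xti - xi) / t for xti, xi in zip(x_t, x_pred)]


def x_from_eps(x_t, eps, t):
    """Recover the clean sample from a noise prediction; undefined at t = 1."""
    if t >= 1.0:
        raise ValueError("t must be < 1 to convert a noise prediction to x")
    return [(xti - t * ei) / (1.0 - t) for xti, ei in zip(x_t, eps)]


def allclose(a, b, tol=1e-9):
    """Elementwise float comparison for the small demo below."""
    return all(math.isclose(ai, bi, abs_tol=tol) for ai, bi in zip(a, b))


# Toy two-element "image" and noise sample (illustrative values only).
x = [0.5, -1.0]
noise = [0.2, 0.3]
t = 0.25
x_t = interpolate(x, noise, t)
v = velocity_target(x, noise)
```

Under this assumed path, a perfect prediction of any one target determines the other two, so the choice among them is mostly about which quantity the network is asked to regress at each noise level, and where the conversions become ill-conditioned (near `t = 0` for `v_from_x`, near `t = 1` for `x_from_eps`).
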
A nice exercise is to look across the family and tease apart what's fixed from what's up for debate. The five defining properties above never change. But within those constraints, every paper picks differently on:
You can imagine combinations not yet explored, such as MoT + raw pixels or x-prediction + VAE latent. The design space is far from exhausted.
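
To make the unexplored combinations concrete, the sketch below enumerates a deliberately simplified version of the design space. It assumes only three axes taken from the comparison table (generation space, per-modality parameters, prediction target) and treats Transfusion's DDPM ε-prediction as one value on the same axis as the rectified-flow targets, which is a simplification. The names `AXES`, `EXPLORED`, and `unexplored` are illustrative, not from any of the papers.

```python
from itertools import product

# Three axes read off the comparison table; the real design space has more.
AXES = {
    "generation_space": ["VAE latent", "pixel space"],
    "per_modality_params": ["shared", "MoT"],
    "prediction_target": ["epsilon", "velocity", "x"],
}

# Where each of the four models sits on those axes, per the table.
EXPLORED = {
    "Transfusion": ("VAE latent", "shared", "epsilon"),
    "BAGEL": ("VAE latent", "MoT", "velocity"),
    "Tuna": ("VAE latent", "shared", "velocity"),
    "Tuna-2": ("pixel space", "shared", "x"),
}


def unexplored(axes, explored):
    """Return every axis combination that no listed model occupies."""
    taken = set(explored.values())
    return [combo for combo in product(*axes.values()) if combo not in taken]


open_points = unexplored(AXES, EXPLORED)
for combo in open_points:
    print(" + ".join(combo))
```

Even with only these three axes, the four papers cover 4 of 12 combinations. The open points include the two examples mentioned above: MoT with pixel-space generation, and x-prediction in VAE latent space.
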
The boundaries of the family are mostly sharp. The nearby models below look superficially similar, but each breaks at least one of the five defining properties; JanusFlow is the one borderline case:
| Model | Why it's not in the family |
|---|---|
| Chameleon (Meta) | Uses VQ-VAE discrete image tokens. Violates property 4 (continuous representations). Different family: "quantized AR". |
| Emu3 (BAAI) | Same as Chameleon: discrete image tokens, with a single next-token prediction loss over text and image tokens. Property 4 violated. |
| Janus / Janus-Pro (DeepSeek) | Discrete image tokens for generation, separate semantic encoder for understanding. Property 4 violated. |
| Show-o (1.x) | Discrete image tokens with masked generation. Property 4 violated. |
| MetaQuery (Saining Xie group) | External diffusion module conditioned on LLM-emitted latent tokens. Two separate models. Property 1 violated. |
| DreamLLM, Emu, NExT-GPT | Same external-diffuser pattern as MetaQuery. Property 1 violated. |
| JanusFlow (DeepSeek) | Borderline. Uses rectified flow and joint training, but separate vision encoders with tight latent bottleneck connectors. Closer to a cousin than a member. |
The four writeups can be read in any order, but two reading paths make the most sense:
Best if you want to watch the design space evolve in real time and see each paper's contribution as a response to its predecessors.
Best if you want to start from the most elaborate architecture and watch successive papers strip pieces away.
Each writeup is structured the same way: a one-line idea, the design space context, an architecture walkthrough, a concrete tensor flow with example shapes, a per-block detail, and a comparison to the rest of the family.
The Transfusion family is a single research conversation playing out across four papers. Each paper makes a sharp claim about what the recipe actually needs, by adding or (more often) removing machinery. The argument isn't settled, and the open questions sit right at the disputed parameters above.