The Transfusion Family

An overview of four papers building on a shared recipe, August 2024 → April 2026

The framing

This directory contains writeups of four unified multimodal models. Read in chronological order they look like four separate research lines. Read as a family, they look like one ongoing argument about which architectural pieces of the original "Transfusion recipe" are actually necessary, and which are workarounds that should be removed.

This page introduces the family before the individual writeups so you can read each paper as a move within a shared design conversation, not as an isolated artifact.

The defining recipe

A model belongs to the Transfusion family if and only if it satisfies all five of the following properties:

  1. One Transformer processes all modalities. (Single shared backbone, possibly with parameter splits internally.)
  2. Two losses, summed: cross-entropy on text token positions + a diffusion / flow-matching loss on image positions.
  3. End-to-end joint training on mixed-modality sequences — no separate "first train the LLM, then the image module" handoff.
  4. Continuous image representations. No VQ-VAE, no quantization, no discrete image tokens.
  5. Mixed attention mask: causal across modalities (and within text), bidirectional within an image.

These five together form a coherent design point. Drop any one of them and you land in a different family of models with different properties.
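Properties 2 and 5 are the most mechanical of the five, so a small sketch may help. The code below is an illustration only, not taken from any of the four papers: the function names `mixed_attention_mask` and `joint_loss`, the `(modality, length)` segment format, and the `image_weight` coefficient are all assumptions made for this example (the recipe above only says the two losses are summed, which corresponds to `image_weight=1.0`).

```python
from typing import List, Tuple


def mixed_attention_mask(segments: List[Tuple[str, int]]) -> List[List[bool]]:
    """Build a boolean mask where mask[q][k] is True if query q may attend to key k.

    `segments` is a hypothetical description of one mixed-modality sequence,
    e.g. [("text", 2), ("image", 3), ("text", 1)].
    """
    segment_id: List[int] = []
    modality: List[str] = []
    for index, (kind, length) in enumerate(segments):
        if kind not in ("text", "image"):
            raise ValueError(f"unknown modality: {kind!r}")
        segment_id.extend([index] * length)
        modality.extend([kind] * length)

    n = len(segment_id)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            causal = k <= q  # everything may look at the past
            same_image = modality[q] == "image" and segment_id[q] == segment_id[k]
            mask[q][k] = causal or same_image  # patches of one image see each other
    return mask


def joint_loss(text_ce: float, image_diffusion: float, image_weight: float = 1.0) -> float:
    """Sum the text cross-entropy and the image diffusion / flow-matching loss.

    `image_weight` is an assumed balancing coefficient; 1.0 gives the plain sum.
    """
    return text_ce + image_weight * image_diffusion
```

For a sequence of two text tokens, a three-patch image, and one more text token, the first image patch can attend to the last patch of the same image, but a text token still cannot attend to anything that comes after it.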

The four members

The lineage as a thesis-refinement loop

The clearest way to read the family is as four moves in a single argument:

  Transfusion (2024)
      "One Transformer with two losses works."
          │
          ▼
  BAGEL (2025)
      "Scale it. Add MoT and a second encoder to handle the understanding-vs-generation tension."
          │
          ▼
  Tuna (2025)
      "The tension was a symptom of using two mismatched encoders. Fix the encoder (cascade VAE → SigLIP 2) and you don't need MoT."
          │
          ▼
  Tuna-2 (2026)
      "The encoder itself was a workaround. Drop it and let the Transformer learn vision from raw pixels; it works better."

Each step does one of two things: add machinery to handle a shortcoming (BAGEL adds MoT and a second encoder), or remove machinery the previous step relied on (Tuna removes MoT, Tuna-2 removes encoders). The pattern across the family is mostly subtractive: both papers after BAGEL are asking "do we really need this part?"

Members at a glance

| Dimension | Transfusion | BAGEL | Tuna | Tuna-2 |
|---|---|---|---|---|
| Vision input | VAE patches | SigLIP 2 ViT | VAE → SigLIP 2 cascade | raw pixels → linear patchify |
| Vision output | VAE patches | FLUX VAE patches | same cascade (re-encoded each step) | raw pixels (no VAE) |
| Encoder count | 1 (VAE) | 2 (decoupled) | 1 (cascade) | 0 |
| Generation space | VAE latent | VAE latent | VAE latent | pixel space |
| Diffusion variant | DDPM (predict ε) | rectified flow (predict v) | rectified flow (predict v) | rectified flow (predict clean image, x-pred) |
| Per-modality params | none (fully shared) | MoT split | none (fully shared) | none (fully shared) |
| Generation head | linear / U-Net on Transformer output | velocity prediction in same Transformer | separate DiT-style head with AdaLN-Zero | separate flow-matching head, x-prediction |
| LLM init | from scratch | Qwen 2.5 | Qwen 2.5 | Qwen 2.5 |
| Cost per generation step | 1 LLM forward | 1 LLM forward | 1 SigLIP 2 + 1 LLM + 1 head | 1 patchify + 1 LLM + 1 head |
| Training stages | 1 | 3+ (data curricula) | 3 (incl. encoder warm-up) | 2 (no warm-up) |
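The "Diffusion variant" row separates the members by what the network is asked to predict (noise ε, velocity v, or the clean image x). As a rough sketch of why these are different parameterizations of the same information, the code below uses one common linear interpolation path, x_t = (1 - t)·x + t·ε with velocity v = ε - x. This convention is an assumption for illustration; the individual papers may use a different time direction or noise schedule (DDPM in particular uses a non-linear schedule), and the function names are hypothetical.

```python
def interpolate(x: float, noise: float, t: float) -> float:
    """Noisy sample on the assumed linear path: t=0 is clean data, t=1 is pure noise."""
    return (1.0 - t) * x + t * noise


def velocity_target(x: float, noise: float) -> float:
    """Regression target for v-prediction on the assumed path."""
    return noise - x


def x_from_velocity(x_t: float, v: float, t: float) -> float:
    """Recover the clean sample from a velocity prediction."""
    return x_t - t * v


def velocity_from_x(x_t: float, x_pred: float, t: float) -> float:
    """Convert an x-prediction back into a velocity; undefined at t=0."""
    if t <= 0.0:
        raise ValueError("t must be positive to convert x-prediction to velocity")
    return (x_t - x_pred) / t


# Scalar example; real models apply the same arithmetic elementwise to tensors.
x, noise, t = 2.0, -1.0, 0.25
x_t = interpolate(x, noise, t)
v = velocity_target(x, noise)
```

On this path a perfect v-prediction and a perfect x-prediction carry the same information, so the choice between them is mainly about which target is easier to learn in a given space.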

The disputed parameters of the recipe

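A nice exercise is to look across the family and separate what is fixed from what is up for debate. The five defining properties above never change. Within those constraints, though, the papers pick differently on the rows of the "Members at a glance" table: vision input and output, encoder count, generation space, diffusion variant, per-modality parameters, generation head, and LLM initialization.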

You can imagine combinations not yet explored — e.g., MoT + raw pixels, or x-prediction + VAE latent. The design space is far from exhausted.
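To make "not yet explored" concrete, here is a small sketch that enumerates a simplified version of the design space. Reducing the space to three axes (generation space, prediction target, parameter sharing) is an assumption made for this example, as are the variable names; the member entries are read off the "Members at a glance" table, and nothing here says that every remaining combination is sensible to build.

```python
from itertools import product

# Simplified axes, taken from three rows of the comparison table.
GENERATION_SPACES = ("VAE latent", "pixel")
PREDICTION_TARGETS = ("epsilon", "v", "x")
PARAM_SHARING = ("shared", "MoT")

# Where each family member sits on those axes.
EXPLORED = {
    "Transfusion": ("VAE latent", "epsilon", "shared"),
    "BAGEL": ("VAE latent", "v", "MoT"),
    "Tuna": ("VAE latent", "v", "shared"),
    "Tuna-2": ("pixel", "x", "shared"),
}

all_points = list(product(GENERATION_SPACES, PREDICTION_TARGETS, PARAM_SHARING))
taken = set(EXPLORED.values())
unexplored = [point for point in all_points if point not in taken]

for space, target, sharing in unexplored:
    print(f"{space:10s} | {target:7s} | {sharing}")
```

The two combinations named above, MoT with raw pixels and x-prediction in VAE latent space, both show up in the `unexplored` list.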

What's not in this family (for contrast)

The boundaries of the family are mostly sharp. These nearby models look superficially similar but break at least one of the five defining properties (JanusFlow is the one borderline case):

| Model | Why it's not in the family |
|---|---|
| Chameleon (Meta) | Uses VQ-VAE discrete image tokens. Violates property 4 (continuous representations). Different family: "quantized AR". |
| Emu3 (BAAI) | Same as Chameleon: discrete image tokens, single-loss next-token prediction over text + image tokens. |
| Janus / Janus-Pro (DeepSeek) | Discrete image tokens for generation, separate semantic encoder for understanding. Property 4 violated. |
| Show-o (1.x) | Discrete image tokens with masked generation. Property 4 violated. |
| MetaQuery (Saining Xie group) | External diffusion module conditioned on LLM-emitted latent tokens. Two separate models. Property 1 violated. |
| DreamLLM, Emu, NextGPT | Same external-diffuser pattern as MetaQuery. |
| JanusFlow (DeepSeek) | Borderline. Uses rectified flow and joint training, but separate vision encoders with tight latent bottleneck connectors. Closer to a cousin than a member. |

How to read the rest of this directory

The four writeups can be read in any order, but two reading paths make the most sense:

Chronological (the historical path)

Best if you want to watch the design space evolve in real time and see each paper's contribution as a response to its predecessors.

  1. Transfusion (Aug 2024)
  2. BAGEL (May 2025)
  3. Tuna (Dec 2025)
  4. Tuna-2 (Apr 2026)

By complexity (the simplifying path)

Best if you want to start from the most elaborate architecture and watch each writeup strip pieces away.

  1. BAGEL: most machinery (MoT + 2 encoders + 14B params)
  2. Tuna: drops MoT, consolidates to 1 cascaded encoder
  3. Transfusion: fully shared params, 1 simple VAE instead of the cascade
  4. Tuna-2: drops the encoder entirely, raw pixels in and out

Each writeup is structured the same way: a one-line idea, the design-space context, an architecture walkthrough, a concrete tensor flow with example shapes, per-block details, and a comparison to the rest of the family.


The Transfusion family is a single research conversation playing out across four papers. Each paper is making a sharp claim — adding or (more often) removing machinery — about what's actually necessary in the recipe. The argument isn't settled, and the open questions sit right at the disputed parameters above.