The Transfusion Family

An overview of four papers building on a shared recipe, August 2024 → April 2026

The framing

This directory contains writeups of four unified multimodal models. Read in chronological order they look like four separate research lines. Read as a family, they look like one ongoing argument about which architectural pieces of the original "Transfusion recipe" are actually necessary, and which are workarounds that should be removed.

This page introduces the family before the individual writeups so you can read each paper as a move within a shared design conversation, not as an isolated artifact.

The defining recipe

A model belongs to the Transfusion family if and only if it satisfies all five of the following properties:

1. One Transformer processes all modalities. (Single shared backbone, possibly with parameter splits internally.)
2. Two losses, summed: cross-entropy on text token positions + a diffusion / flow-matching loss on image positions.
3. End-to-end joint training on mixed-modality sequences, with no separate "first train the LLM, then the image module" handoff.
4. Continuous image representations. No VQ-VAE, no quantization, no discrete image tokens.
5. Mixed attention mask: causal across modalities (and within text), bidirectional within an image.
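
The mixed attention mask in the last property is easiest to see on a toy sequence. The following is a minimal NumPy sketch, not code from any of the four papers: the function name `mixed_attention_mask` and the `image_ids` convention (-1 for text positions, a non-negative index for the patches of each image) are assumptions made for illustration.

```python
import numpy as np

def mixed_attention_mask(image_ids):
    """Boolean (T, T) mask; mask[q, k] is True if query q may attend to key k.

    image_ids: one int per position. -1 marks a text token; a value k >= 0
    marks a patch of the k-th image (hypothetical convention for this sketch).
    """
    ids = np.asarray(image_ids)
    n = ids.shape[0]
    # Causal part: every position sees itself and everything before it.
    causal = np.tril(np.ones((n, n), dtype=bool))
    # Bidirectional part: patches of the same image see each other fully.
    same_image = (ids[:, None] == ids[None, :]) & (ids[:, None] >= 0)
    return causal | same_image

# Two text tokens, a three-patch image, then one more text token.
image_ids = [-1, -1, 0, 0, 0, -1]
mask = mixed_attention_mask(image_ids)
print(mask.astype(int))
```

In the printed matrix the 3×3 block covering the image patches is all ones, while every other entry above the diagonal is zero: text never sees the future, and an image never sees a later image or later text.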

These five together form a coherent design point. Drop any one of them and you land in a different family of models with different properties.
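
The two-loss property can be written as a single objective. The notation below is assumed for this page rather than taken from the papers: $y$ denotes text tokens, $x_t$ a noised image at noise level $t$, $c$ the preceding context, and $\lambda$ a hypothetical weighting factor (with $\lambda = 1$ this is the plain sum described above). The image term is shown in its ε-prediction form.

```latex
\mathcal{L}
  \;=\;
  \underbrace{-\sum_{i \,\in\, \text{text}} \log p_\theta\!\left(y_i \mid y_{<i}\right)}_{\mathcal{L}_{\text{text}}}
  \;+\;
  \lambda\,
  \underbrace{\mathbb{E}_{t,\,\epsilon}\!\left[\,\bigl\lVert \epsilon - \epsilon_\theta(x_t, t, c) \bigr\rVert_2^2\,\right]}_{\mathcal{L}_{\text{image}}}
```

For the flow-matching members the image term regresses a velocity or a clean image instead of $\epsilon$; the structure of the sum stays the same.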

The four members

August 2024 · Transfusion (Meta + Waymo + USC) The original. One Transformer trained from scratch with LM + DDPM losses on text + VAE patches. Establishes the recipe and reports matching the image-generation quality (FID) of a Chameleon-style quantized-token baseline with roughly 34× less compute.
May 2025 · BAGEL (ByteDance) Adds machinery: MoT parameter split (separate FFN / QKV per modality, shared self-attention) plus two decoupled encoders (semantic ViT for understanding, generation VAE for synthesis). Initializes from Qwen2.5 and trains on trillions of interleaved tokens to expose emergent abilities.
December 2025 · Tuna (Meta + HKU + Waterloo + KAUST) Removes the MoT split and the two-encoder design. Replaces them with a single cascaded encoder: VAE → modified SigLIP 2 → unified visual representation. Argues that the format mismatch BAGEL was patching over with MoT was the bug, not the feature.
April 2026 · Tuna-2 (Meta + HKU + Waterloo) Removes the cascaded encoder entirely. Just a linear patch-embedding layer on raw RGB pixels. Generation moves to pixel space (no VAE decoder). Pretrained vision encoders, the paper argues, are workarounds that limit the model — strip them away and the Transformer learns better visual features end-to-end.
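
To make BAGEL's MoT split concrete: each token is routed through the weights of its own modality, while self-attention still runs over the whole sequence. The sketch below illustrates only the routing step, under loud assumptions: `mot_ffn` is a hypothetical name, and a single weight matrix stands in for each modality's full FFN. It is not BAGEL's implementation.

```python
import numpy as np

def mot_ffn(hidden, is_image, w_text, w_image):
    """Route each token through the weights of its own modality.

    hidden:   (T, d) token states coming out of the shared self-attention
    is_image: (T,) boolean array, True at image positions
    w_text, w_image: (d, d) single-matrix stand-ins for per-modality FFNs
                     (an assumption made to keep the sketch short)
    """
    out = np.empty_like(hidden)
    out[~is_image] = hidden[~is_image] @ w_text   # text tokens use text weights
    out[is_image] = hidden[is_image] @ w_image    # image tokens use image weights
    return out

hidden = np.arange(12, dtype=float).reshape(4, 3)
is_image = np.array([False, True, True, False])
# Identity for text and zeros for image make the routing visible in the output.
routed = mot_ffn(hidden, is_image, w_text=np.eye(3), w_image=np.zeros((3, 3)))
```

Tuna and Tuna-2 drop this routing and pass every token through the same weights.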

The lineage as a thesis-refinement loop

The clearest way to read the family is as four moves in a single argument:

Transfusion (2024)  "One Transformer with two losses works."
    │
    ▼
BAGEL (2025)        "Scale it. Add MoT and a second encoder to handle the understanding-vs-generation tension."
    │
    ▼
Tuna (2025)         "The tension was a symptom of using two mismatched encoders. Fix the encoder (cascade VAE → SigLIP 2) and you don't need MoT."
    │
    ▼
Tuna-2 (2026)       "The encoder itself was a workaround. Drop it and let the Transformer learn vision from raw pixels; it works better."

Each step after the original does one of two things: it adds machinery to handle a shortcoming (BAGEL adds MoT and a second encoder), or it removes machinery the previous step relied on (Tuna removes MoT, Tuna-2 removes encoders). Since BAGEL the trend has been subtractive: each later paper asks "do we really need this part?"

Members at a glance

Property	Transfusion	BAGEL	Tuna	Tuna-2
Vision input	VAE patches	SigLIP 2 ViT	VAE → SigLIP 2 cascade	raw pixels → linear patchify
Vision output	VAE patches	FLUX VAE patches	same cascade (re-encoded each step)	raw pixels (no VAE)
Encoder count	1 (VAE)	2 (decoupled)	1 (cascade)	0
Generation space	VAE latent	VAE latent	VAE latent	pixel space
Diffusion variant	DDPM (predict ε)	rectified flow (predict v)	rectified flow (predict v)	rectified flow (predict clean image, x-pred)
Per-modality params	none — fully shared	MoT split	none — fully shared	none — fully shared
Generation head	linear / U-Net on Transformer output	velocity prediction in same Transformer	separate DiT-style head with AdaLN-Zero	separate flow-matching head, x-prediction
LLM init	from scratch	Qwen 2.5	Qwen 2.5	Qwen 2.5
Cost per generation step	1 LLM forward	1 LLM forward	1 SigLIP 2 + 1 LLM + 1 head	1 patchify + 1 LLM + 1 head
Training stages	1	3+ (data curricula)	3 (incl. encoder warm-up)	2 (no warm-up)
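
The "Diffusion variant" row distinguishes three regression targets: the noise ε, a velocity v, or the clean image x. The NumPy sketch below shows how each target is formed from a clean sample and a noise draw. It assumes one common set of conventions (a variance-preserving DDPM path, and a rectified-flow path that runs from data at t = 0 to noise at t = 1); the papers may use different conventions, and all function names here are hypothetical.

```python
import numpy as np

def ddpm_pair(x0, eps, alpha_bar):
    """DDPM-style noising; the network is trained to predict eps."""
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return x_t, eps

def rectified_flow_pair(x0, eps, t):
    """Straight-line path from data (t=0) to noise (t=1); target is the velocity."""
    x_t = (1.0 - t) * x0 + t * eps
    v = eps - x0
    return x_t, v

def x_from_v(x_t, v, t):
    """On the same straight path, a velocity converts to a clean-image estimate."""
    return x_t - t * v

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))    # stand-in for image latents or pixels
eps = rng.standard_normal((4, 8))   # Gaussian noise
t = 0.3

x_t, v = rectified_flow_pair(x0, eps, t)
x0_rec = x_from_v(x_t, v, t)        # recovers x0 when v is exact
```

Under these assumptions v-prediction and x-prediction describe the same path and differ only in what the network is asked to output, which is why the choice can be treated as a separate design axis.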

The disputed parameters of the recipe

It is useful to separate what is fixed across the family from what is still debated. The five defining properties above never change. Within those constraints, the papers differ on:

Encoder count: 0 (Tuna-2), 1 (Transfusion, Tuna), or 2 decoupled (BAGEL).
Parameter sharing: fully shared (Transfusion, Tuna, Tuna-2) or split via MoT (BAGEL).
Generation space: VAE latent (Transfusion, BAGEL, Tuna) or raw pixels (Tuna-2).
Diffusion parameterization: predict ε (DDPM, Transfusion), predict v (rectified flow, BAGEL/Tuna), predict clean image x (x-prediction, Tuna-2).
Generation head: same Transformer output (Transfusion, BAGEL) or separate DiT-style head (Tuna, Tuna-2).
LLM initialization: from scratch (Transfusion) or pretrained Qwen2.5 (everyone after).

You can imagine combinations not yet explored — e.g., MoT + raw pixels, or x-prediction + VAE latent. The design space is far from exhausted.
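
To get a feel for how much of the space is unexplored, the six axes above can be enumerated directly. The sketch below encodes each member's choices as listed on this page; the axis labels and variable names are assumptions made for illustration, and many enumerated combinations may not be coherent designs (for example, zero encoders together with a VAE latent generation space).

```python
from itertools import product

# Axis values as listed in the disputed-parameters list above.
AXES = {
    "encoders": ["0", "1", "2 decoupled"],
    "params": ["shared", "MoT"],
    "gen_space": ["VAE latent", "pixels"],
    "target": ["eps", "v", "x"],
    "head": ["same Transformer", "separate head"],
    "init": ["scratch", "Qwen2.5"],
}

# One tuple per family member, in the same axis order as AXES.
EXPLORED = {
    "Transfusion": ("1", "shared", "VAE latent", "eps", "same Transformer", "scratch"),
    "BAGEL": ("2 decoupled", "MoT", "VAE latent", "v", "same Transformer", "Qwen2.5"),
    "Tuna": ("1", "shared", "VAE latent", "v", "separate head", "Qwen2.5"),
    "Tuna-2": ("0", "shared", "pixels", "x", "separate head", "Qwen2.5"),
}

all_combos = list(product(*AXES.values()))
explored_set = set(EXPLORED.values())
unexplored = [c for c in all_combos if c not in explored_set]

# The two examples named in the text: MoT + raw pixels, x-prediction + VAE latent.
mot_pixels = [c for c in unexplored if c[1] == "MoT" and c[2] == "pixels"]
x_latent = [c for c in unexplored if c[3] == "x" and c[2] == "VAE latent"]

print(len(all_combos), len(unexplored))  # 144 140
```

Four points out of 144 nominal combinations is a rough count, not a claim that the other 140 are all sensible, but it makes the "far from exhausted" remark concrete.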

What's not in this family (for contrast)

The boundaries of the family are mostly sharp. These nearby models look superficially similar but break at least one of the five defining properties (JanusFlow is the one borderline case):

Model	Why it's not in the family
Chameleon (Meta)	Uses VQ-VAE discrete image tokens. Violates property 4 (continuous representations). Different family: "quantized AR".
Emu3 (BAAI)	Same as Chameleon: discrete image tokens, with a single next-token-prediction loss over text + image tokens. Violates properties 2 and 4.
Janus / Janus-Pro (DeepSeek)	Discrete image tokens for generation, separate semantic encoder for understanding. Property 4 violated.
Show-o (1.x)	Discrete image tokens with masked generation. Property 4 violated.
MetaQuery (Saining Xie group)	External diffusion module conditioned on LLM-emitted latent tokens. Two separate models. Property 1 violated.
DreamLLM, Emu, NExT-GPT	Same external-diffuser pattern. Property 1 violated.
JanusFlow (DeepSeek)	Borderline. Uses rectified flow and joint training, but separate vision encoders with tight latent bottleneck connectors. Closer to a cousin than a member.

How to read the rest of this directory

The four writeups can be read in any order, but two reading paths make the most sense:

Chronological (the historical path)

Best if you want to watch the design space evolve in real time and see each paper's contribution as a response to its predecessors.

Transfusion (Aug 2024)
BAGEL (May 2025)
Tuna (Dec 2025)
Tuna-2 (Apr 2026)

By complexity (the simplifying path)

Best if you want to start from the most elaborate architecture and work down to the simplest. This order is not chronological: Transfusion predates BAGEL but sits third here.

BAGEL: most machinery (MoT + 2 encoders + 14B params)
Tuna: drops MoT, consolidates to 1 cascaded encoder
Transfusion: fully shared params, 1 simple VAE
Tuna-2: no encoder at all, raw pixels in and out

Each writeup is structured the same way: a one-line idea, the design space context, an architecture walkthrough, a concrete tensor flow with example shapes, per-block details, and a comparison to the rest of the family.

The Transfusion family is a single research conversation playing out across four papers. Each paper is making a sharp claim — adding or (more often) removing machinery — about what's actually necessary in the recipe. The argument isn't settled, and the open questions sit right at the disputed parameters above.