luminary.blog
by Oz Akan

Diffusion-Based Language Models

Instead of writing sequentially, DLMs start with something like noisy or scrambled text and gradually denoise it over several steps.

Traditional language models (like GPT) create text one word or token at a time, from left to right. Think of it like writing a sentence sequentially. These are called autoregressive models (ARMs).
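
As a rough illustration of that token-by-token loop, here is a minimal Python sketch; the model call is a hypothetical placeholder, not any real library’s API:

```python
# Minimal sketch of autoregressive (left-to-right) generation.
# `score_next_token` is a hypothetical placeholder for a real model's
# forward pass, which scores every vocabulary token given the prefix.
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def score_next_token(prefix_tokens):
    # Placeholder scores; a trained ARM would compute these from the prefix.
    return {tok: random.random() for tok in VOCAB}

def generate(prompt_tokens, max_new_tokens=5):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = score_next_token(tokens)           # one forward pass per new token
        tokens.append(max(scores, key=scores.get))  # append, strictly left to right
    return tokens

print(" ".join(generate(["the"])))
```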

Diffusion-based language models (DLMs) are a newer approach. Instead of writing sequentially, they start with something like noisy or scrambled text and gradually clean it up or “denoise” it over several steps until it becomes a clear, coherent sentence or passage.

How it works

  1. Adding “Noise” (Forward Process): You start with a clean piece of text and corrupt it bit by bit. For text, this usually means masking out or replacing tokens until the text is highly corrupted and unreadable. This corruption follows a fixed schedule; the model is not involved in this step.
  2. Removing “Noise” (Reverse Process): The model is trained to reverse this process. Given a noisy version, it learns to predict and remove the noise, step by step, getting closer to the original text. This “denoising” is how it generates new text from scratch, starting from pure noise and refining it iteratively (see the toy sketch after this list).
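
To make the two steps concrete, here is a toy Python sketch under loud assumptions: the sequence is six fixed tokens, the corruption is simple random masking, and the “denoiser” just looks up the original sentence so the example runs without a trained model.

```python
# Toy sketch of a masked-diffusion forward and reverse process.
# The "denoiser" below simply looks up the original sentence so the example
# is runnable; a real DLM would be a trained network predicting every masked
# token from its surrounding context.
import random

MASK = "[MASK]"
original = "the cat sat on the mat".split()

def add_noise(tokens, noise_level):
    """Forward process: independently mask each token with probability noise_level."""
    return [MASK if random.random() < noise_level else t for t in tokens]

def predict_masked(tokens):
    """Stand-in denoiser: propose a token for every masked position at once."""
    return [original[i] if t == MASK else t for i, t in enumerate(tokens)]

def denoise(tokens, num_steps=4):
    """Reverse process: each step keeps the predictions for a fraction of the
    still-masked positions, so the text sharpens gradually over the steps."""
    for remaining in range(num_steps, 0, -1):
        predictions = predict_masked(tokens)
        keep_fraction = 1.0 / remaining          # unmask more as steps run out
        tokens = [
            pred if tok == MASK and random.random() < keep_fraction else tok
            for tok, pred in zip(tokens, predictions)
        ]
        print(f"step {num_steps - remaining + 1}: {' '.join(tokens)}")
    return tokens

start = add_noise(original, noise_level=1.0)     # "pure noise": fully masked
print("start :", " ".join(start))
print("final :", " ".join(denoise(start)))
```

Note that each denoising step updates many positions at once; that parallelism is what the next section builds on.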

Why are they interesting?

  • Non-Sequential Writing: Unlike ARMs, DLMs don’t just go token by token. They can work on larger parts of the text at once.
  • Potential for Speed: Because parts of the denoising process can happen at the same time (in parallel), DLMs have the potential to generate text faster than ARMs; newer models such as Mercury are built around this advantage (see the rough comparison after this list).
  • Better Understanding of Context: By refining the text iteratively and often working on the whole sequence or blocks, DLMs can better consider the overall context, potentially leading to more coherent text.
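
As a rough back-of-the-envelope comparison (the numbers below are arbitrary assumptions, not benchmarks of any real model), the speed argument comes down to how many forward passes each approach needs:

```python
# Illustrative forward-pass counts; the numbers are assumptions, not measurements.
sequence_length = 512    # tokens to generate
denoising_steps = 32     # steps a hypothetical DLM might use

arm_passes = sequence_length   # ARM: one forward pass per generated token
dlm_passes = denoising_steps   # DLM: one pass per denoising step, each pass
                               # refining many token positions in parallel

print(f"ARM forward passes: {arm_passes}")
print(f"DLM forward passes: {dlm_passes}")
```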

What are the challenges?

  • Can Be Slow: The step-by-step denoising process, while parallelizable, can still take many steps, making generation slower than some ARMs in practice, although this is rapidly improving with new research and models.
  • Performance: While improving rapidly, DLMs have historically lagged behind top-performing ARMs on some language tasks.
  • Training: Training diffusion models can be complex and potentially require significant computation.

Examples of Diffusion Language Models include LLaDA (Large Language Diffusion Model) and Mercury by Inception Labs, which is noted for its speed.

Conclusion

In summary, diffusion-based language models offer an alternative, non-sequential way to generate text, with promising advantages in speed and contextual understanding. They remain an active area of research aiming to match or exceed the performance of traditional autoregressive models.