
UE8M0 FP8 Number Format
Training a 671-billion parameter AI model requires massive computational resources—typically hundreds of H100 GPUs costing millions of dollars. But what if you could achieve similar results with half the memory and double the throughput using a clever 8-bit number format?
DeepSeek’s UE8M0 FP8 represents a breakthrough in efficient AI training, enabling massive language models to run on alternative hardware while maintaining competitive performance. This specialized 8-bit floating-point format trades precision for unprecedented efficiency, allowing models like DeepSeek-V3.1 to be trained without relying entirely on expensive Nvidia hardware.
But UE8M0 isn’t just about cost savings—it’s a fundamental rethink of how we represent numbers in AI computation, prioritizing range over precision in ways that challenge conventional wisdom about floating-point arithmetic.
The FP8 Landscape
First, it’s helpful to understand where UE8M0 fits in the growing family of 8-bit floating-point formats:
Format | Sign | Exponent | Mantissa | Key Characteristic | Primary Use |
---|---|---|---|---|---|
E4M3 | 1 bit | 4 bits | 3 bits | Balanced precision/range | General AI training |
E5M2 | 1 bit | 5 bits | 2 bits | Extended range | Large dynamic ranges |
UE8M0 | 0 bits | 8 bits | 0 bits | Maximum range, powers of 2 only | Scaling factors |
UE8M0 stands out as the most extreme design choice—eliminating both sign and mantissa bits to maximize the exponent range. This makes it uniquely suited as a scaling factor rather than a general-purpose number format.
Breaking Down UE8M0 FP8
The name UE8M0 describes the structure of the format:
- U (Unsigned): The format does not include a sign bit. This trade-off eliminates negative numbers but extends the range of positive values.
- E8 (Exponent 8): Eight bits are allocated to the exponent, giving the format a wide dynamic range to represent values of vastly different magnitudes.
- M0 (Mantissa 0): No bits are used for the mantissa (or significand). This extreme choice maximizes range at the expense of precision.
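To make the layout concrete, here is a minimal Python sketch of decoding a UE8M0 byte. The bias of 127 matches the convention used in the worked examples later in this post, and treating 0xFF as NaN follows NVIDIA's PTX definition mentioned in the appendix; the function name is purely illustrative.

```python
BIAS = 127  # same exponent bias convention the examples below use

def ue8m0_to_float(byte: int) -> float:
    """Decode one UE8M0 byte: the whole byte is a biased exponent,
    so every representable value is a power of 2."""
    if byte == 0xFF:                   # NVIDIA's PTX reserves 0xFF for NaN
        return float("nan")
    return 2.0 ** (byte - BIAS)

print(ue8m0_to_float(0))      # 2^-127, the smallest value
print(ue8m0_to_float(127))    # 1.0
print(ue8m0_to_float(254))    # 2^127, the largest finite value
```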
The Scaling Factor Strategy
UE8M0’s unique design makes it ideal for a specific role in AI training: scaling factors. Here’s how this works:
Mixed-Precision Training
Instead of doing all computation in FP8, UE8M0 enables a clever mixed-precision approach:
- Core computation happens in FP8 (fast, memory-efficient)
- Scaling factors use UE8M0 to maintain numerical stability
- Critical operations fall back to higher precision when needed
This hybrid approach delivers the speed benefits of 8-bit arithmetic while avoiding the numerical instability that typically plagues low-precision training.
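One reason powers of two make good scale values: multiplying or dividing a binary floating-point number by 2^k only shifts its exponent, so applying and then removing a UE8M0-style scale introduces no rounding error of its own (barring overflow or underflow). A tiny illustration in Python, with NumPy used purely for the float32 type:

```python
import numpy as np

x = np.float32(3.1415927)

pow2_scale = np.float32(2.0 ** 5)   # the kind of value a UE8M0 byte can hold
other_scale = np.float32(37.0)      # not a power of two, so not UE8M0-representable

print((x * pow2_scale) / pow2_scale == x)    # True: only the exponent changed
print((x * other_scale) / other_scale == x)  # not guaranteed: the mantissa is re-rounded twice
```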
Hardware Optimization
UE8M0 was engineered specifically for Chinese-made accelerators, a design choice that offers several advantages:
- Native support eliminates conversion overhead
- Optimized pipelines designed around the format’s constraints
- Reduced dependency on Nvidia’s hardware ecosystem
- Cost efficiency for large-scale domestic AI development
Understanding UE8M0 Through Examples
The key insight about UE8M0 is simple: it can only represent powers of 2. With no mantissa bits, every value must be exactly 2^n for some integer n.
The Conversion Process
Converting to UE8M0 follows these steps:
- Handle negatives: Take absolute value (UE8M0 is unsigned)
- Find nearest power of 2: Round the number to the closest 2^n
- Calculate storage value: Add bias of 127 to the exponent
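A minimal Python sketch of these three steps (the function name is illustrative, and edge cases such as zero inputs or out-of-range exponents are ignored); it reproduces the results of the worked examples that follow:

```python
import math

BIAS = 127

def to_ue8m0(x: float) -> int:
    """Convert a value to its UE8M0 byte following the steps above."""
    x = abs(x)                                            # step 1: drop the sign
    lo = math.floor(math.log2(x))                         # power of 2 just below x
    n = lo if x - 2**lo <= 2**(lo + 1) - x else lo + 1    # step 2: nearest 2^n
    return n + BIAS                                       # step 3: add the bias of 127

values = [0.01, -0.5, 3.14, 1.25, 1000.0, -9000.0]
print([to_ue8m0(v) for v in values])   # [120, 126, 129, 127, 137, 140]
```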
Let’s work through some examples with the tensor: [0.01, -0.5, 3.14, 1.25, 1000.0, -9000.0]
Handling Negative Numbers
Since UE8M0 is an unsigned format, negative values like -0.5 and -9000.0 cannot be directly represented. In practical applications, the handling of negative numbers depends on the specific implementation. A common approach is to take the absolute value before conversion, since the sign is often handled separately or is irrelevant for a scaling factor. For this example, we will use the absolute value.
Our tensor now becomes: [0.01, 0.5, 3.14, 1.25, 1000.0, 9000.0]
Converting Each Value
Value: 0.01
- Find the True Exponent (E): We need to find E such that 2^E is close to 0.01.
  - 2^-7 = 1/128 ≈ 0.0078
  - 2^-6 = 1/64 ≈ 0.0156
  0.01 is closer to 0.0078 (0.01 - 0.0078 ≈ 0.0022, while 0.0156 - 0.01 ≈ 0.0056), so we round to 2^-7. The true exponent is E = -7.
- Calculate Biased Exponent: Stored Value = -7 + 127 = 120.
- UE8M0 Representation: The 8-bit unsigned integer is 120.
Value: 0.5
- Find the True Exponent (E): 0.5 is exactly 1/2, which is 2^-1. The true exponent is E = -1.
- Calculate Biased Exponent: Stored Value = -1 + 127 = 126.
- UE8M0 Representation: The 8-bit unsigned integer is 126.
Value: 3.14
- Find the True Exponent (E): We need to find the nearest power of 2.
  - 2^1 = 2
  - 2^2 = 4
  3.14 is closer to 4 (4 - 3.14 = 0.86, while 3.14 - 2 = 1.14), so we round to 2^2. The true exponent is E = 2.
- Calculate Biased Exponent: Stored Value = 2 + 127 = 129.
- UE8M0 Representation: The 8-bit unsigned integer is 129.
Value: 1.25
- Find the True Exponent (E):
  - 2^0 = 1
  - 2^1 = 2
  1.25 is closer to 1, so we round to 2^0. The true exponent is E = 0.
- Calculate Biased Exponent: Stored Value = 0 + 127 = 127.
- UE8M0 Representation: The 8-bit unsigned integer is 127.
Value: 1000.0
- Find the True Exponent (E):
  - 2^9 = 512
  - 2^10 = 1024
  1000.0 is closer to 1024, so we round to 2^10. The true exponent is E = 10.
- Calculate Biased Exponent: Stored Value = 10 + 127 = 137.
- UE8M0 Representation: The 8-bit unsigned integer is 137.
Value: 9000.0 (from -9000.0)
- Find the True Exponent (E):
  - 2^13 = 8192
  - 2^14 = 16384
  9000.0 is closer to 8192, so we round to 2^13. The true exponent is E = 13.
- Calculate Biased Exponent: Stored Value = 13 + 127 = 140.
- UE8M0 Representation: The 8-bit unsigned integer is 140.
Conversion Result
Taking our sample tensor and converting each value:
Original | Absolute | Nearest Power of 2 | Exponent | UE8M0 Value |
---|---|---|---|---|
0.01 | 0.01 | 2^-7 (0.0078) | -7 | 120 (-7+127) |
-0.5 | 0.5 | 2^-1 (0.5) | -1 | 126 (-1+127) |
3.14 | 3.14 | 2^2 (4.0) | 2 | 129 (2+127) |
1.25 | 1.25 | 2^0 (1.0) | 0 | 127 (0+127) |
1000.0 | 1000.0 | 2^10 (1024) | 10 | 137 (10+127) |
-9000.0 | 9000.0 | 2^13 (8192) | 13 | 140 (13+127) |
The key insight: we’re essentially storing the exponent of the nearest power of 2, plus a bias of 127 (the same exponent bias used by IEEE 754 single precision).
Our original tensor [0.01, -0.5, 3.14, 1.25, 1000.0, -9000.0] becomes: [120, 126, 129, 127, 137, 140]
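Decoding those bytes back to the powers of 2 they stand for (a small sketch reusing the bias convention above) makes the precision loss explicit:

```python
BIAS = 127

encoded = [120, 126, 129, 127, 137, 140]
original = [0.01, -0.5, 3.14, 1.25, 1000.0, -9000.0]

for value, byte in zip(original, encoded):
    recovered = 2.0 ** (byte - BIAS)      # the only thing UE8M0 can give back
    print(f"{value:>8} -> {recovered}")   # 0.01 -> 0.0078125, -0.5 -> 0.5, 3.14 -> 4.0, ...
```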
What We Lose and Gain
This conversion reveals UE8M0’s fundamental trade-off:
Lost:
- Precision: 1.25 becomes 1.0, 3.14 becomes 4.0
- Sign information: All values become positive
- Exact representations: Only powers of 2 are exact
Gained:
- Massive range: Can represent from 2^-127 to 2^127
- Memory efficiency: 8 bits vs 16+ bits for higher precision formats
- Computational speed: Simpler arithmetic operations
Real-World Applications
Training Large Language Models
DeepSeek has proven UE8M0’s effectiveness by training their 671-billion-parameter DeepSeek-V3.1 model using this format. The results demonstrate that you can achieve competitive performance while:
- Reducing memory usage by ~50% compared to BF16
- Doubling training throughput on supported hardware
- Maintaining model quality through careful scaling factor management
Practical Implementation
In practice, UE8M0 works as part of a mixed-precision training pipeline:
Input (BF16) → Scale into FP8 range (UE8M0 scale factor) → Compute (FP8) → Scale back → Output (BF16)
The scaling factors (stored in UE8M0) ensure that the reduced precision doesn’t destabilize training, while the bulk of computation happens in efficient 8-bit arithmetic.
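Below is a toy, per-tensor version of that pipeline in Python. It is only a sketch under simplifying assumptions: DeepSeek's published recipe uses finer-grained (per-block) scaling and real FP8 kernels, the 448 constant is the E4M3 maximum, and the FP8 matmul itself is left as a placeholder.

```python
import numpy as np

FP8_E4M3_MAX = 448.0   # largest finite E4M3 value
BIAS = 127

def ue8m0_scale_for(t: np.ndarray) -> float:
    """Pick a power-of-two scale so that t / scale fits the FP8 range.
    Rounding the exponent up guarantees nothing clips, and the result is
    exactly representable as a UE8M0 byte (exponent + 127)."""
    amax = float(np.abs(t).max())
    exponent = int(np.ceil(np.log2(amax / FP8_E4M3_MAX)))
    return 2.0 ** exponent

activations = (np.random.randn(4, 8) * 1000).astype(np.float32)  # stand-in for BF16 input
scale = ue8m0_scale_for(activations)

scaled = activations / scale     # this is what would be cast to FP8
# ... FP8 matmul on `scaled` would happen here ...
restored = scaled * scale        # scale back before the next higher-precision step

assert np.abs(scaled).max() <= FP8_E4M3_MAX    # fits the FP8 range
assert np.array_equal(restored, activations)   # power-of-two scaling round-trips exactly
```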
Limitations and Trade-offs
UE8M0 isn’t suitable for all use cases. Key limitations include:
Precision Loss:
- Only represents powers of 2 exactly
- Significant rounding errors for most real numbers
- Not suitable for applications requiring fine-grained precision
Implementation Complexity:
- Requires careful tuning of scaling factors
- Mixed-precision pipelines are more complex to implement
- Hardware support is limited to specific accelerators
Limited Applicability:
- Best suited for scaling factors, not general computation
- Requires domain expertise to implement effectively
- May not provide benefits on non-optimized hardware
The Bigger Picture
UE8M0 represents more than just a clever number format—it’s part of a broader shift toward specialized computing for AI. As models grow larger and computational demands increase, we’re seeing:
- Hardware diversification: Moving beyond GPU monoculture toward specialized AI accelerators
- Format innovation: Custom number formats optimized for specific workloads rather than general-purpose computing
- Geopolitical implications: Domestic hardware ecosystems reducing dependency on single vendors
The success of DeepSeek-V3.1 proves that innovative number formats can unlock new possibilities in AI training, potentially democratizing access to large-scale AI development through more efficient computation.
Whether UE8M0 becomes widely adopted remains to be seen, but it demonstrates that there’s still room for fundamental innovation in how we represent and compute with numbers—even in domains as mature as floating-point arithmetic.
Appendix
Logarithmic Number Systems (LNS)
Here are several examples to illustrate how the Logarithmic Number System (LNS) works, especially compared to the usual floating-point approach:
1. Basic Representation:
- In traditional floating-point (FP), a number is represented as m × 2^e, where m is the mantissa and e the exponent.
- In LNS, a number x is stored as its logarithm:
  - Store L = log_b(x) (usually b = 2), so x = 2^L
  - Only the exponent/log value is stored.
2. Simple Examples:
Decimal Value | LNS Value (log₂) | Binary (FP32) |
---|---|---|
1 | 0 | 0x3F800000 |
2 | 1 | 0x40000000 |
4 | 2 | 0x40800000 |
0.5 | -1 | 0x3F000000 |
8 | 3 | 0x41000000 |
0.25 | -2 | 0x3E800000 |
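If you want to reproduce the bit patterns in that table, a short standard-library Python snippet prints both the log₂ values and the FP32 encodings:

```python
import math
import struct

def fp32_hex(x: float) -> str:
    """Hex form of the IEEE 754 single-precision encoding of x."""
    return f"0x{struct.unpack('>I', struct.pack('>f', x))[0]:08X}"

for v in [1, 2, 4, 0.5, 8, 0.25]:
    print(f"{v:>5}  log2 = {math.log2(v):>3.0f}  FP32 = {fp32_hex(v)}")
# prints 0x3F800000 for 1, 0x40000000 for 2, 0x40800000 for 4, ...
```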
3. LNS Math Operations:
- Multiplication:
  - FP: z = x × y
  - LNS: L_z = L_x + L_y, where x = 2^(L_x) and y = 2^(L_y)
  - You just add log values: log₂(z) = log₂(x) + log₂(y)
- Division:
  - FP: z = x / y
  - LNS: L_z = L_x - L_y, so division becomes subtraction.
- Exponentiation:
  - FP: z = x^k
  - LNS: L_z = k × L_x
- Addition is not trivial in LNS and requires special handling (using log-add-exp or similar tricks).
4. Example Calculation (Multiplication):
Suppose you want to multiply 2 and 8:
- In FP: 2 × 8 = 16
- In LNS:
  - log₂(2) = 1, log₂(8) = 3
  - Add logs: 1 + 3 = 4
  - Back to real: 2^4 = 16
5. Non-Power-of-Two:
- Number: 3 → log₂(3) ≈ 1.585
- Number: 5 → log₂(5) ≈ 2.322
Multiply 3 and 5:
- Add logs: 1.585 + 2.322 = 3.907
- Back to real: 2^3.907 ≈ 15 (True: 3 × 5 = 15)
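The same arithmetic is easy to sketch in Python; the helper names below are illustrative, and positive inputs are assumed since logarithms of non-positive numbers are undefined:

```python
import math

def to_lns(x: float) -> float:
    """Store a (positive) number as its base-2 logarithm."""
    return math.log2(x)

def lns_mul(lx: float, ly: float) -> float:
    """Multiplication in LNS is just addition of the stored logs."""
    return lx + ly

def from_lns(l: float) -> float:
    """Convert a stored log back to an ordinary number."""
    return 2.0 ** l

# 2 * 8: logs are 1 and 3, their sum is 4, and 2^4 = 16
print(from_lns(lns_mul(to_lns(2), to_lns(8))))   # 16.0
# 3 * 5: logs are ~1.585 and ~2.322, sum ~3.907, and 2^3.907 ≈ 15
print(from_lns(lns_mul(to_lns(3), to_lns(5))))   # ~15.0
```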
Summary Table:
Operation | FP Approach | LNS Approach |
---|---|---|
Multiply | Multiply two numbers | Add their logs: L_z = L_x + L_y |
Divide | Divide two numbers | Subtract logs: L_z = L_x - L_y |
Power | Raise to a power | Multiply the log by a scalar: L_z = k × L_x |
Practical Use:
- In hardware or AI chips, using LNS means multipliers can be replaced with simple adders, making them much more efficient.
NVIDIA’s Implementation
NVIDIA has also adopted UE8M0 in their PTX (Parallel Thread Execution) instruction set architecture, with some key implementation details:1
- Packed storage: UE8M0 values are stored as ue8m0x2 format (two 8-bit values in a 16-bit register)
- Special values: Uses 0xff for NaN; no support for infinity
- Instruction-level support: Available as a source/destination format for specific operations
- Range-first philosophy: Prioritizes dynamic range over precision for specialized workloads
This suggests that UE8M0 is gaining recognition as a format for efficient AI computation across different hardware ecosystems.