luminary.blog
by Oz Akan

Understanding ML Numerical Formats

Understanding INT4, INT8, FP16, BF16, and TF32 formats in machine learning - their precision, speed, and memory trade-offs for training and inference.


Numerical formats such as INT4, INT8, INT16, FP16, BF16, and TF32 form the foundation of modern machine learning. They directly impact a model’s speed, memory footprint, accuracy, and even the design of the hardware it runs on. A thoughtful choice of numerical format can mean the difference between a model that is fast, efficient, and deployable, and one that is too slow and costly to be practical.

First let’s talk about floating-point numbers.

What is a Floating-Point number?

Floating-point numbers (like FP32, FP16, BF16, TF32) follow a structure defined by the IEEE-754 standard. Every floating-point number has three parts:

  1. Sign bit

    • 1 bit.
    • 0 → positive, 1 → negative.
  2. Exponent

    • Determines the scale / magnitude of the number.

    • Think of it like scientific notation’s power of 10.

      • Example: in 1.23 × 10⁵, the 5 is the exponent.
    • Larger exponent bits = wider dynamic range (you can represent both very small and very large numbers).

  3. Significand

    • Determines the precision (how many digits of detail you can keep).
    • In 1.23 × 10⁵, the 1.23 is the significand.
    • In 1.23 × 10⁵, the .23 is the fraction.
    • More significand bits = more fine-grained accuracy.

    Mantissa is a legacy term with an ambiguous meaning: it can refer either to the whole significand or just to its fractional part.

How the significand controls decimal precision

  • The significand determines how many significant digits of a number can be represented.

  • Rough rule of thumb:

    Decimal digits of precision ≈ significand bits × log₁₀(2)

    because each extra bit adds about 0.301 decimal digits.

So:

  • FP32 (23 significand bits): 23 × 0.301 ≈ 7 decimal digits. Can store numbers like 3.1415926 (7 digits).

  • FP16 (10 significand bits): 10 × 0.301 ≈ 3 decimal digits. Can store numbers like 3.14 or 2.718.

  • BF16 (7 significand bits): 7 × 0.301 ≈ 2.1 decimal digits. Can store numbers like 3.1, but not 3.141.
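
These estimates are easy to check in code. A minimal Python sketch of the rule of thumb above:

```python
import math

# Rule of thumb: decimal digits ≈ significand bits × log10(2) ≈ bits × 0.301
significand_bits = {"FP32": 23, "FP16": 10, "BF16": 7}

for name, bits in significand_bits.items():
    digits = bits * math.log10(2)
    print(f"{name}: {bits} bits ≈ {digits:.1f} decimal digits")
# FP32: 23 bits ≈ 6.9 decimal digits
# FP16: 10 bits ≈ 3.0 decimal digits
# BF16: 7 bits ≈ 2.1 decimal digits
```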

Example

Take FP32 (32 bits total):

  • 1 bit → sign
  • 8 bits → exponent (range ≈ 10⁻³⁸ to 10³⁸)
  • 23 bits → significand (about 7 decimal digits of precision)

Take BF16 (16 bits total):

  • 1 bit → sign
  • 8 bits → exponent (same range as FP32)
  • 7 bits → significand (≈ 2 decimal digits of precision)

So BF16 can still represent very tiny and very huge numbers like FP32, but it loses detail in between because it has fewer significand bits.
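
This trade-off is easy to see directly. A small PyTorch sketch (the example values are arbitrary):

```python
import torch

# Same exponent width as FP32, so BF16 keeps the huge dynamic range...
big = torch.tensor([1e30])
print(big.to(torch.bfloat16))   # ~1.0039e+30 (still representable)
print(big.to(torch.float16))    # inf (FP16 overflows; its max is about 65,504)

# ...but with only 7 fraction bits, BF16 keeps far less detail.
x = torch.tensor([1.2345678])
print(x.to(torch.float32).item())   # ~1.2345678
print(x.to(torch.bfloat16))         # ~1.2344 (rounded to the nearest BF16 value)
```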


Why Precision Matters in ML

In machine learning, and especially in the domain of deep learning, numerical precision is not a one-size-fits-all concept. Different stages of a model’s lifecycle have distinct requirements:

  • Training: This phase involves iteratively adjusting the model’s parameters (weights) based on very small updates called gradients. Higher precision is critical here to accurately accumulate these tiny gradients. If the precision is too low, these updates can get rounded to zero (a problem known as “underflow”), effectively stalling the learning process.
  • Inference: After a model is trained, its primary job is to make predictions. This process is generally more tolerant of lower precision. By reducing precision during inference, we can create models that are significantly smaller and faster, often with a negligible, or even zero, drop in accuracy.
  • Hardware Accelerators (GPUs, TPUs, NPUs): Modern processors are not just general-purpose computers; they are highly specialized. They contain dedicated execution units optimized for specific numerical formats. For instance, NVIDIA’s Tensor Cores are designed to accelerate FP16 and TF32 matrix math, Google’s TPUs are built to excel at BF16 computations, and mobile NPUs (Neural Processing Units) have hardware specifically for ultra-fast INT8 operations.

This leads to a fundamental trade-off:

  • Higher Precision (like FP32) = More accurate and stable representation of numbers, but slower computations and higher memory and power consumption.
  • Lower Precision (like INT8 or FP16) = Faster computations, smaller memory footprint, and lower power usage, but at the risk of numerical errors like overflow (number becomes too large) or underflow (number becomes too small).

Data Types in Detail

1. INT4 (4-bit Integer)

  • Range: A tiny range of 16 distinct values, typically from -8 to +7.
  • Usage: Primarily used for extreme quantization of very large models, especially Large Language Models (LLMs) like GPT and LLaMA, for inference.
  • Benefits: The memory savings are enormous. An INT4 model is 8 times smaller than its original FP32 version and 4 times smaller than an FP16 version. This is what allows a massive 70-billion-parameter model, which would need over 140GB of VRAM in FP16, to shrink to roughly 35GB at 4 bits and become runnable on consumer-grade hardware.
  • Downside: This aggressive level of quantization can significantly impact accuracy if not handled properly. To mitigate this, sophisticated techniques are used, such as Quantization-Aware Training (QAT), where the model learns to adapt to the precision loss during the training phase itself.
  • Real-World Example: The open-source llama.cpp project and libraries like bitsandbytes in the Hugging Face ecosystem use 4-bit quantization to allow powerful LLMs to run on CPUs and consumer-grade GPUs, making advanced AI accessible to enthusiasts and researchers without access to large-scale data centers.
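
For illustration, this is roughly what 4-bit loading looks like with bitsandbytes through the Hugging Face transformers API. Treat it as a hedged sketch: the checkpoint name is only an example and the exact flags can vary across library versions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit weight quantization via bitsandbytes (assumes a CUDA GPU and that
# transformers + bitsandbytes are installed; the model id is just an example).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # 4-bit "NormalFloat" quantization
    bnb_4bit_compute_dtype=torch.float16,  # weights stored in 4-bit, matmuls in FP16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",            # example checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
```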

2. INT8 (8-bit Integer)

  • Range: -128 to +127.
  • Usage: Considered the “sweet spot” for optimized inference, INT8 is widely supported across hardware and software platforms, including NVIDIA TensorRT, Intel OpenVINO, and mobile runtimes like ONNX Runtime and TensorFlow Lite.
  • Benefits: Delivers a nearly 4x reduction in model size from FP32 and a significant speedup in inference, often with a minimal accuracy drop of less than 1%. This makes it ideal for production environments where throughput and latency are critical.
  • Real-World Example: A self-driving car’s perception system relies on computer vision models to detect pedestrians, vehicles, and traffic lanes in real-time. These models are often quantized to INT8 to run on the car’s embedded NVIDIA Drive AGX system, ensuring decisions are made in milliseconds. Similarly, recommendation engines at companies like Meta and Google use INT8 to serve billions of users efficiently.

3. INT16 (16-bit Integer)

  • Range: A much larger range from -32,768 to +32,767.
  • Usage: Less common in mainstream deep learning but serves as a useful middle ground when INT8 is too restrictive and floating-point precision is not strictly necessary. It can be found in signal processing and some audio models where data has a wide dynamic range.
  • Relevance: It’s an option for quantization in tasks where preserving a wider dynamic range of integer values is more important than the absolute smallest model size.

4. FP16 (16-bit Floating-Point, “Half Precision”)

  • Format: 1 sign bit, 5 exponent bits, 10 significand (precision) bits. The 5-bit exponent severely limits its dynamic range (the largest representable value is about 65,504), making it prone to overflow and underflow.
  • Usage: The default format for mixed-precision training on NVIDIA GPUs. In this technique, a model’s weights are stored in FP32, but computations (like matrix multiplications) are performed in FP16 for speed.
  • Benefits: FP16 operations can be roughly twice as fast and use half the memory of FP32. To combat the underflow issue, a technique called Loss Scaling is used: it multiplies the loss value by a large factor (e.g., 1024) to shift the small gradient values into the representable range of FP16. Before the weights are updated, the gradients are scaled back down (see the training-loop sketch after this list).
  • Real-World Example: Training state-of-the-art generative models like Stable Diffusion or large NLP models like BERT is almost exclusively done using FP16 mixed-precision on NVIDIA A100 or H100 GPUs. This allows for larger batch sizes and drastically reduces training time.
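
Below is a minimal sketch of FP16 mixed precision with loss scaling using PyTorch AMP. The tiny model, data, and hyperparameters are placeholders, and a CUDA GPU is assumed.

```python
import torch
from torch import nn

model = nn.Linear(1024, 10).cuda()                 # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()               # implements loss scaling
loss_fn = nn.CrossEntropyLoss()

for _ in range(10):                                # placeholder training loop
    x = torch.randn(32, 1024, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(x), y)                # matmuls run in FP16
    scaler.scale(loss).backward()                  # scale up so tiny grads don't underflow
    scaler.step(optimizer)                         # unscales grads, then updates FP32 weights
    scaler.update()                                # adapts the scale factor over time
```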

5. BF16 (Brain Floating-Point, 16-bit)

  • Format: 1 sign bit, 8 exponent bits, 7 significand bits.
  • Usage: Originally developed by Google for its TPUs, BF16 has become an industry standard for training, supported by NVIDIA, Intel, and ARM.
  • Benefits: The key innovation of BF16 is its 8-bit exponent, the same as FP32. This gives it the same massive dynamic range as FP32, making it highly resistant to overflow and underflow problems. It sacrifices precision (significand) to maintain this range, a trade-off that works exceptionally well for deep learning, where the magnitude of a number is often more important than its exact value. This makes training more stable and eliminates the need for techniques like Loss Scaling.
  • Real-World Example: The training of foundational models like Google’s Gemini and PaLM, and Meta’s LLaMA 2/3, heavily relies on BF16 to ensure numerical stability during training runs that can last for weeks or months.
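
The BF16 equivalent is even simpler, because no loss scaling is required. A minimal sketch with the same kind of placeholder model and data as above (assumes a GPU with BF16 support, e.g., Ampere or newer):

```python
import torch
from torch import nn

model = nn.Linear(1024, 10).cuda()                 # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

x = torch.randn(32, 1024, device="cuda")
y = torch.randint(0, 10, (32,), device="cuda")

optimizer.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = nn.functional.cross_entropy(model(x), y)
loss.backward()        # no GradScaler needed: BF16 shares FP32's exponent range
optimizer.step()
```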

6. FP32 (32-bit Floating-Point, “Single Precision”)

  • Format: 1 sign bit, 8 exponent bits, 23 significand (precision) bits. This format strikes a balance between precision and range, making it both practical and widely adopted in computing and machine learning.
  • Usage: FP32 has long been the “gold standard” for training deep neural networks, serving as the default numerical representation for model weights, activations, and gradients across frameworks like PyTorch, TensorFlow, and JAX.
  • Benefits: With 23 bits dedicated to the significand, FP32 provides about 7 decimal digits of precision. Its dynamic range (≈ 10⁻³⁸ to 10³⁸) is broad enough to cover the vast spectrum of values encountered during both training and inference. This high precision ensures very stable accumulation of tiny gradients in deep learning, making it resilient to problems like underflow or overflow.
  • Real-World Example: The majority of research papers in deep learning from the past decade, including the original work on AlexNet, ResNet, and GPT, train their models using FP32 by default. Scientific computations, signal processing, and many enterprise AI workloads also rely on FP32 to guarantee numerical accuracy in both research and production.

7. TF32 (TensorFloat-32)

  • Format: An internal 19-bit format used by NVIDIA’s Tensor Cores, with 1 sign bit, 8 exponent bits (like FP32), and 10 significand bits (like FP16).
  • Usage: TF32 is an internal, “invisible” accelerator on NVIDIA Ampere and newer GPUs. It’s not a format you store data in, but rather a computational mode.
  • Benefits: It offers a brilliant compromise: it provides the same dynamic range as FP32 and the precision of FP16. For deep learning developers using frameworks like PyTorch or TensorFlow, TF32 is typically enabled by default. It accelerates FP32 matrix multiplications by internally casting them to TF32, providing a significant speedup (up to 6x on A100 GPUs) with zero code changes and negligible accuracy impact for most AI workloads (a small toggle sketch follows this list).
  • Real-World Example: Any researcher or engineer training a model with standard FP32 data on an NVIDIA RTX 3080 or A100 GPU is likely already benefiting from TF32’s automatic performance boost.
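
In PyTorch the behavior can be toggled explicitly. A small sketch of the relevant switches (these flags exist in recent PyTorch releases; the defaults have shifted between versions):

```python
import torch

# TF32 is a compute mode, not a storage dtype: tensors stay FP32,
# but Ampere+ Tensor Cores round matmul/convolution inputs to TF32.
torch.backends.cuda.matmul.allow_tf32 = True   # FP32 matmuls may use TF32
torch.backends.cudnn.allow_tf32 = True         # cuDNN convolutions may use TF32

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
c = a @ b                                      # runs on Tensor Cores in TF32 mode
print(c.dtype)                                 # torch.float32 — storage is unchanged
```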

Comparison

| Format | Bits | Speed | Memory | Accuracy / Stability | Common Use Case |
|---|---|---|---|---|---|
| INT4 | 4 | Fastest | Smallest | Risky, needs careful calibration | Extreme quantization for LLM inference |
| INT8 | 8 | Very Fast | Small | Good for inference | Production inference (CV, NLP, RecSys) |
| INT16 | 16 | Fast | Medium | Better integer precision | Special quantization, audio |
| FP16 | 16 | Fast | Small | Prone to underflow, needs loss scaling | Mixed-precision training & inference (NVIDIA) |
| FP32 | 32 | Standard | Large | Highest, very stable | Default for training, debugging, high-precision research |
| BF16 | 16 | Fast | Small | Highly stable, wide range | Gold standard for training large models (TPUs, GPUs) |
| TF32 | 19* | Fast | Medium | High, same range as FP32 | Default “hidden” accelerator on NVIDIA GPUs |

(*TF32 is an internal 19-bit computational format, not a storage format.)

Relevance in ML Workflows

  1. Training:

    • Old Standard: FP32 (safe, but slow and memory-intensive).
    • Modern Standard: BF16 or FP16 with mixed precision are the go-to choices.
    • On NVIDIA Hardware: The choice is often between FP16 (slightly faster but requires loss scaling) and BF16 (more stable, “just works”).
    • On Google TPUs: BF16 is the native and optimal format.
  2. Inference:

    • A typical workflow: A model is trained in BF16/FP16 and then converted to INT8 for deployment. This process, known as Post-Training Quantization (PTQ), drastically reduces its size and latency (a minimal PTQ sketch follows this list).
    • For cutting-edge LLMs, researchers are pushing the boundaries with INT4 and even more aggressive formats to make these massive models runnable on consumer hardware.
  3. Hardware Dictates the Choice:

    • NVIDIA GPUs: Excel at FP16/TF32 for training and INT8 for inference using Tensor Cores.
    • Google TPUs: Natively built for BF16 matrix operations.
    • Mobile/Edge NPUs: Highly optimized for INT8 integer math to maximize speed and battery life.
  4. Concrete Workflow Example:

    • Step 1 (Training): A team trains a massive 100-billion parameter language model on a cluster of Google Cloud TPUs using BF16 for its numerical stability.
    • Step 2 (Cloud Deployment): To serve this model in a product like a chatbot, they quantize it to INT8 and deploy it on servers with NVIDIA GPUs using TensorRT. This maximizes throughput, allowing them to serve more users with fewer servers.
    • Step 3 (Edge Deployment): To create an on-device version for a smartphone app, they further quantize the model to INT4. This reduces the model size from hundreds of gigabytes to a manageable size that can be included in the app bundle and run directly on the phone’s NPU.
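
As a concrete (if simplified) illustration of the PTQ step, PyTorch’s dynamic post-training quantization converts the Linear layers of a trained model to INT8 weights in a single call. A minimal sketch with a placeholder model:

```python
import torch
from torch import nn

# Placeholder FP32 model standing in for a trained network.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

# Post-Training Quantization (dynamic): Linear weights are stored as INT8,
# activations are quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)   # same interface; smaller weights, faster CPU inference
```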

In Short:

  • BF16/FP16/TF32: Primarily for Training, where numerical stability and precision are key.
  • INT8/INT4: Primarily for Inference, where speed, memory, and power efficiency are paramount.
  • Making the right choice for the right task can save millions in computational costs.

The “True” Value of π

First, let’s establish the mathematical constant we are trying to represent: π ≈ 3.1415926535897932384626433832795…

This is an irrational number, meaning it has an infinite number of digits that never repeat. We can only ever represent an approximation of it in any finite-precision format like FP32, FP16, or BF16.

1. π in FP32 (Single Precision)

FP32 uses 32 bits: 1 bit for the sign, 8 bits for the exponent, and 23 bits for the fraction (significand).

The Math: How to Find the FP32 Representation

  1. Sign Bit: π is positive, so the sign bit is 0.
  2. Convert to Binary Scientific Notation:
    • 3.1415926535… in binary is approximately 11.00100100001111110110101010001000...
    • Normalize this: 1.100100100001111110110101010001000... × 2^1
    • The leading 1 is implicit (it’s the “hidden bit” and not stored).
    • The fraction part is the bits after the binary point, rounded to 23 bits: 10010010000111111011011 (the 24th bit is a 1, so the last kept bit rounds up).
  3. Exponent:
    • The exponent is 1. In FP32, we add a bias of 127 to the true exponent.
    • Biased Exponent = 1 + 127 = 128.
    • Convert 128 to an 8-bit binary: 10000000.

Putting it all together (Sign | Exponent | Fraction): 0 | 10000000 | 10010010000111111011011

This is the binary representation. It’s often written in hexadecimal for compactness: 0x40490FDB.

The Decimal Value Stored: When this bit pattern is interpreted by the FP32 standard, its exact decimal value is: 3.1415927410125732421875

Compare this to the true π: True π: 3.14159265358979323846… FP32 π: 3.14159274101257324218…

The error is about 8.74 × 10⁻⁸, which is incredibly accurate for most applications.
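
You can verify both the bit pattern and the stored value in a few lines of Python, using the standard struct module:

```python
import struct

pi = 3.14159265358979323846

# Reinterpret the 32-bit FP32 encoding of pi as an unsigned integer.
bits = struct.unpack(">I", struct.pack(">f", pi))[0]
print(hex(bits))                          # 0x40490fdb

sign     = bits >> 31                     # 0
exponent = (bits >> 23) & 0xFF            # 128 → true exponent 128 - 127 = 1
fraction = bits & 0x7FFFFF                # the 23 stored fraction bits

value = (-1) ** sign * (1 + fraction / 2**23) * 2 ** (exponent - 127)
print(value)                              # 3.1415927410125732 (the stored FP32 value)
```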

2. π in FP16 (Half Precision)

FP16 uses 16 bits: 1 bit for the sign, 5 bits for the exponent, and 10 bits for the fraction.

The Math: How to Find the FP16 Representation

  1. Sign Bit: Positive, so 0.
  2. Convert to Binary Scientific Notation:
    • Same binary number: 11.00100100001111110110...
    • Normalize: 1.100100100001111110110... × 2^1
    • The fraction part is the first 10 bits after the binary point: 1001001000 (the next bit is a 0, so no rounding up is needed)
  3. Exponent:
    • The exponent is 1. In FP16, the bias is 15.
    • Biased Exponent = 1 + 15 = 16.
    • Convert 16 to a 5-bit binary: 10000.

Putting it all together (Sign | Exponent | Fraction): 0 | 10000 | 1001001000

In hexadecimal, this is 0x4248.

The Decimal Value Stored: The exact value represented by this FP16 bit pattern is: 3.140625

Compare this to the true π: True π: 3.1415926535… FP16 π: 3.140625

The error is about 0.00096765, or roughly 0.03%. This is a significant loss of precision compared to FP32.
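
The same check for FP16, using NumPy’s half-precision type (a small sketch; assumes numpy is installed):

```python
import numpy as np

fp16_pi = np.float16(np.pi)
print(float(fp16_pi))              # 3.140625 (the exact stored value)
print(float(fp16_pi) == 3.140625)  # True

# Reinterpret the 16 bits as an unsigned integer to see the hex pattern.
bits = np.array([np.pi], dtype=np.float16).view(np.uint16)[0]
print(hex(int(bits)))              # 0x4248
```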

3. π in BF16 (Brain Float 16)

BF16 is a hybrid format designed for machine learning. It uses 16 bits: 1 bit for the sign, 8 bits for the exponent (like FP32), and 7 bits for the fraction.

The Math: How to Find the BF16 Representation

  1. Sign Bit: Positive, so 0.
  2. Convert to Binary Scientific Notation:
    • Same binary number: 11.00100100001111110110...
    • Normalize: 1.100100100001111110110... × 2^1
    • The fraction part is only the first 7 bits after the binary point: 1001001 (again, the next bit is a 0, so no rounding up)
  3. Exponent:
    • The exponent is 1. BF16 uses the same exponent bias as FP32: 127.
    • Biased Exponent = 1 + 127 = 128.
    • Convert 128 to an 8-bit binary: 10000000.

Putting it all together (Sign | Exponent | Fraction): 0 | 10000000 | 1001001

In hexadecimal, this is 0x4049.

The Decimal Value Stored: The exact value represented by this BF16 bit pattern is: 3.140625 (Wait, that’s the same as FP16? Let’s see why.)

Let’s calculate it:

  • Value = (-1)^sign × (1 + fraction) × 2^(exponent - bias)
  • Fraction: 1001001 (binary) = 0.5703125 (decimal)
    • Calculation: (1/2) + (0/4) + (0/8) + (1/16) + (0/32) + (0/64) + (1/128) = 0.5 + 0.0625 + 0.0078125 = 0.5703125
  • Value = 1 × (1 + 0.5703125) × 2^(1) = 1.5703125 × 2 = 3.140625

Yes, in this specific case, the values for FP16 and BF16 are identical: 3.140625. However, this is a coincidence because of how the bits aligned for π. For other numbers, especially very large or very small ones, BF16 will behave very differently from FP16 due to its much larger dynamic range (from the 8-bit exponent).
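
And the same check for BF16, using PyTorch (NumPy has no native bfloat16 type); a small sketch:

```python
import math
import torch

bf16_pi = torch.tensor([math.pi], dtype=torch.bfloat16)
print(bf16_pi.item())                         # 3.140625 — identical to FP16 for pi

# Reinterpret the 16 bits as an integer to see the hex pattern.
print(hex(bf16_pi.view(torch.int16).item()))  # 0x4049
```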

Summary Table & Explanation

| Format | Sign | Exponent Bits | Fraction Bits | Bias | Hex Value | Decimal Value (Stored) |
|---|---|---|---|---|---|---|
| FP32 | 1 | 8 | 23 | 127 | 0x40490FDB | 3.1415927410125732421875 |
| FP16 | 1 | 5 | 10 | 15 | 0x4248 | 3.140625 |
| BF16 | 1 | 8 | 7 | 127 | 0x4049 | 3.140625 |

Key Takeaways:

  1. Precision vs. Range: FP32 has high precision (23-bit fraction) and high range (8-bit exponent). It is the “gold standard” for accuracy.
  2. FP16’s Trade-off: FP16 sacrifices both precision (10-bit fraction) and range (5-bit exponent) to save memory and increase computational speed. This is often sufficient for storing gradients and activations in deep learning, where exact values are less critical than the overall direction of the optimization.
  3. BF16’s Trade-off: BF16 takes a different approach. It sacrifices almost all precision (only a 7-bit fraction) but keeps the range of FP32 (8-bit exponent). This is crucial in deep learning because it prevents gradients from underflowing (becoming zero) or overflowing (becoming infinity) during training, which is a common problem with FP16. The loss of precision is often acceptable for the training process.

So, while FP16 and BF16 might store π as the same value, they are fundamentally different tools designed to solve different problems in the world of high-performance computing, particularly ML.

π in INT4 and INT8?

The key difference is that INT4 and INT8 are integer formats. They can only represent whole numbers (…, -2, -1, 0, 1, 2, …). π is not a whole number.

Therefore, we cannot directly represent π in a pure integer format like INT4 or INT8.

To represent a non-integer like π using integers, we must use a technique called fixed-point representation.

Fixed-Point Representation

Fixed-point arithmetic pretends an integer represents a fractional number by implicitly placing a decimal point (the “point” in fixed-point) at a predetermined location in the bit sequence.

  • The integer value is called the quantized value.
  • A scale factor (S) defines the resolution—the value of the smallest step (the Least Significant Bit or LSB).
  • The real-world value is calculated as: Real Value = Quantized Integer * Scale Factor

Example: Let’s say we have an 8-bit integer and we define a scale factor of S = 0.1.

  • The integer 31 would represent the real number 31 * 0.1 = 3.1.
  • The integer 32 would represent 3.2.
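
In code, this is just a multiply, a round, and a clamp. A minimal sketch of the scheme above (the 0.1 scale and 8-bit range are taken from the example):

```python
def quantize(x, scale, qmin, qmax):
    """Map a real value to the nearest representable fixed-point integer."""
    q = round(x / scale)
    return max(qmin, min(qmax, q))   # clamp to the integer range

def dequantize(q, scale):
    return q * scale

# Unsigned 8-bit, scale S = 0.1, as in the example above:
q = quantize(3.14159, scale=0.1, qmin=0, qmax=255)
print(q)                   # 31
print(dequantize(q, 0.1))  # 3.1 (up to floating-point rounding)
```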

Let’s explore how to represent π using INT4 and INT8 in this fixed-point context.

1. “π” in INT4 (4-bit Integer)

INT4 is extremely constrained. It uses 4 bits to represent an integer.

  • Range: With 4 bits, you can represent 2⁴ = 16 possible values.
  • For unsigned integers (uint4): The range is 0 to 15.
  • For signed integers (int4): Using two’s complement, the range is -8 to 7.

How to Represent π: Let’s choose a common unsigned representation for positive values like π. We need to define a scale factor.

  1. Define the Scale (S): What should the value of the smallest step represent?

    • A reasonable first choice might be S = 0.2. This gives us a range of 0 to 15 * 0.2 = 3.0.
    • But π is ~3.14, which falls just outside this range. To cover π at all, the scale must be larger (at least π / 15 ≈ 0.21).
    • A larger scale, however, means coarser steps: every increase in range costs precision.
    • With only 16 levels, no scale can reach ~3.14 while keeping fine-grained steps.
    • The limited range of INT4 makes it impossible to represent numbers near 3.0 with useful precision. This is why INT4 is almost always used in a signed context for differential values (like weight deltas) rather than for absolute values like π.
  2. A More Plausible Use Case - Signed: Let’s try a signed INT4 with a larger scale. We need to represent ~3.14. Our range is -8 to 7.

    • Let’s set S = 0.5. Our representable values are now -4.0, -3.5, …, 3.0, 3.5 (the integers -8 through 7).
    • To represent π (~3.14), we must round to the nearest representable value. The choices are 3.0 (integer 6) or 3.5 (integer 7).
    • 3.14 - 3.0 = 0.14
    • 3.5 - 3.14 = 0.36
    • Therefore, 3.0 is closer. We would quantize π to the integer value 6 with a scale of 0.5.

Math for INT4 (Signed) Representation of π:

  • Scale (S): 0.5
  • Quantized Value (Q): round(π / S) = round(3.14159 / 0.5) = round(6.28318) = 6
  • Dequantized Value: Q * S = 6 * 0.5 = 3.0
  • Error: |π - 3.0| = 0.14159

Representing a value like π with INT4 is difficult and results in very low accuracy (an absolute error of ~0.14, or about 4.5%, in this case). It highlights the severe trade-off between range and precision in ultra-low-bit integers.

2. “π” in INT8 (8-bit Integer)

INT8 is much more practical. It uses 8 bits to represent an integer.

  • Range: 2⁸ = 256 possible values.
  • For unsigned integers (uint8): The range is 0 to 255.
  • For signed integers (int8): The range is -128 to 127.

How to Represent π: We can use an unsigned format since π is positive. We need to choose a scale factor that lets us represent numbers around 3.0 with good precision.

  1. Choose a Scale Factor (S):

    • We want our range to cover at least 0 to just over 3.14.
    • With a uint8 max value of 255, the maximum value we can represent is 255 * S.
    • Let’s choose a scale that allows for good precision. A common choice is S = 0.0125.
    • Let’s check the range: 255 * 0.0125 ≈ 3.1875. This is perfect, as it covers 0 to ~3.19.
  2. Quantization Process:

    • We calculate the integer to store: Q = round(π / S)
    • To recover the value, we calculate: Dequantized Value = Q * S

Math for INT8 (Unsigned) Representation of π:

  • Scale (S): 0.0125
  • Quantized Value (Q): round(π / S) = round(3.1415926535 / 0.0125) = round(251.32741228) = 251
  • Dequantized Value: Q * S = 251 * 0.0125 = 3.1375
  • Error: |π - 3.1375| = 0.00409265 (about 0.13% error)

This is a much better approximation than INT4.

What if we want a “cleaner” scale? We can define the scale based on the integer range.

  • Often, the scale is set as (max_value - min_value) / (2^bits - 1).
  • For a range from 0 to π, a common scale would be S = π / 255.
    • Scale (S): π / 255 ≈ 0.012320
    • Quantized Value (Q): round(π / S) = round(π / (π/255)) = round(255) = 255
    • Dequantized Value: 255 * (π / 255) = π (too good to be true?)

Yes, it’s a trick. The scale is defined by the value we’re trying to represent. In practice, the scale is chosen to cover the entire range of a tensor (e.g., all weights in a neural network layer, which might range from -2.5 to +4.0), not just for a single value like π.
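
In practice, then, the scale is derived from the whole tensor’s min/max. A hedged sketch of one common per-tensor scheme (the zero-point offset is an extra detail beyond the text above, and the random weights are placeholders):

```python
import numpy as np

def quantize_tensor(x, bits=8):
    """Per-tensor affine quantization: scale comes from the tensor's own min/max."""
    qmax = 2**bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / qmax                 # e.g. (4.0 - (-2.5)) / 255
    zero_point = round(-lo / scale)          # which integer maps to real 0.0
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_tensor(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

# Placeholder "layer weights" spanning roughly -2.5 to +4.0, as in the text.
weights = np.random.uniform(-2.5, 4.0, size=1000).astype(np.float32)
q, scale, zp = quantize_tensor(weights)
recovered = dequantize_tensor(q, scale, zp)
print(scale, np.abs(weights - recovered).max())  # max error is about one quantization step at most
```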

Summary

Pure integers (INT4/INT8) cannot represent fractions. They require fixed-point quantization with a scale factor to represent non-integer values.

The Range-Precision Trade-off: This is the central challenge.

  • A large scale factor gives you a wide range of representable values but poor precision (big steps between values).
  • A small scale factor gives you high precision (small steps) but a narrow range. Choosing the right scale is critical.

This process of quantization (converting high-precision FP32 numbers to low-precision INT8/INT4) is how modern AI models are shrunk down to run efficiently on phones and other edge devices. The model’s weights and activations are converted from FP32 to INT8, drastically reducing memory usage and speeding up calculations, with a usually acceptable small loss of accuracy. INT4 is used for even more aggressive compression.