What is a TFLOP on a GPU?

TFLOPS stands for Tera Floating-Point Operations Per Second, a measure of raw math throughput. For GPUs, the headline number is FP32 TFLOPS (32-bit floating point), calculated as shader cores × boost clock (in GHz) × 2 ÷ 1,000. An RTX 5070 with 6,144 CUDA cores at roughly 2.5 GHz produces ~30.7 FP32 TFLOPS on paper.

Does more TFLOPS mean more FPS?

Not directly. FPS in games depends on memory bandwidth, architecture efficiency, driver maturity, RT cores, Tensor cores, and game-specific shader optimisation — not just raw FP32 throughput. The classic example: an RX 7600 has ~22 TFLOPS while an RTX 4060 has ~15 TFLOPS, yet they trade blows in gaming benchmarks because Ada's architecture extracts more useful work per TFLOP than RDNA 3.

Why do AMD GPUs have higher TFLOPS but similar FPS to NVIDIA?

AMD's RDNA 3 and later architectures count dual-issue FP32 operations in their theoretical maximum, even though many shaders cannot use that path, while NVIDIA's Ada and Blackwell architectures extract higher utilisation from each clock cycle. AMD's drivers have also historically been optimised less aggressively for niche titles. The net effect: AMD's TFLOPS number looks bigger on paper but lands at similar real-world FPS in most games.

What is FP32 vs FP16 vs INT8?

These are different numeric precisions. FP32 (32-bit floating point) is the gaming-relevant number, since most shaders use FP32 math. FP16 (16-bit, half precision) doubles throughput on modern GPUs and is used by DLSS, FSR, and AI workloads. INT8 (8-bit integer) is an integer format rather than a floating-point one, and is used heavily in AI inference. A GPU's FP32 number is for gaming; FP16 and INT8 numbers are for AI / compute.

How do I calculate TFLOPS for a GPU?

FP32 TFLOPS = (number of shader cores × boost clock in MHz × 2) ÷ 1,000,000. For example, an RTX 5080 with 10,752 CUDA cores at 2.62 GHz (2,620 MHz) boost gives (10,752 × 2,620 × 2) ÷ 1,000,000 = ~56.3 TFLOPS FP32. The × 2 reflects the fused multiply-add (FMA) instruction, which counts as two operations per cycle.

When do TFLOPS actually matter?

TFLOPS is a useful comparison for compute workloads — AI training/inference, scientific computing, video encoding via compute shaders, Blender Cycles GPU renders, and machine-learning research. For gaming, ignore the headline TFLOPS number and look at per-game benchmark FPS in the titles you actually play.

What about ray tracing — is that FP32 too?

No. Ray tracing on modern GPUs runs through dedicated RT cores (NVIDIA RT cores, AMD Ray Accelerators, Intel RT units) — not the FP32 shader pipeline. A GPU with strong FP32 TFLOPS but weak RT cores will perform poorly in ray-traced games. This is another reason TFLOPS alone is a poor gaming predictor.

What about DLSS and frame generation?

DLSS upscaling and frame generation run on Tensor cores (AI accelerators), not FP32 shaders. AMD's FSR, up to version 3, runs on general FP32 shader cores, which is one reason DLSS quality often beats it: Tensor cores are purpose-built for the workload. Two GPUs with identical TFLOPS but different Tensor / RT core counts will deliver very different gaming experiences in 2026 titles.

GPU TFLOPS Explained — Do They Matter - Editorial

Hardware Explainer · GPU Spec

GPU TFLOPS, explained. — Math throughput is not gaming FPS.

The headline spec on every GPU box — and one of the most misleading numbers in PC hardware. Higher TFLOPS doesn't reliably equal higher FPS, and we've got the benchmarks to prove it.

8 min read
Updated May 2026
Reviewed by Evetech Hardware Team

By the end of this guide, you'll know what FP32 TFLOPS actually measures, why architecture and memory bandwidth matter more for gaming, and the one comparison you should trust instead.

what it measures: Math throughput
the misconception: FP32 ≠ FPS
what matters more: Architecture

What TFLOPS actually measures

TFLOPS stands for Tera Floating-Point Operations Per Second — one trillion floating-point math operations every second. It's a measure of raw arithmetic throughput, derived directly from the GPU's shader core count and clock speed. Nothing in the calculation involves a game engine, a resolution, a render pipeline, or a driver.

That last sentence is why TFLOPS is so often misunderstood. The number on a GPU box describes peak theoretical math throughput in ideal lab conditions, with no memory bottlenecks, no cache misses, no shader complexity. Real games never hit that ceiling.

When NVIDIA or AMD print "30 TFLOPS" or "60 TFLOPS" on marketing material, they're quoting FP32 (single-precision floating point) throughput — the precision most gaming shader workloads use. That's the headline number. Other TFLOPS figures (FP16, FP64, INT8) exist for different workloads and almost never appear on the box.

FP32 vs FP16 vs INT8 — the precision ladder

Modern GPUs handle multiple math precisions, each suited to a different workload. Understanding which one you care about is the first step.

Precision	Typical use	Relative throughput
FP64 (double)	Scientific, CAD, simulation	1/32 to 1/64 of FP32 on gaming GPUs
FP32 (single)	Gaming shaders — the headline number	Baseline (the quoted TFLOPS)
FP16 (half)	DLSS, FSR upscaling, AI training	2× FP32 on modern GPUs
INT8	AI inference, Tensor cores	4× FP32 (via Tensor pipeline)
FP4 / FP8	LLM inference, Blackwell era	8-16× FP32 (specialised)

For a gaming GPU buying decision, FP32 is the relevant number, but only as a rough sanity check against same-architecture cards. For AI work, FP16 and INT8 throughput matter more. For scientific compute that needs double precision, FP64 throughput matters (and most gaming GPUs are deliberately limited here). GPU renderers such as Blender Cycles run mostly in FP32, so the headline figure applies to them.
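To make the ladder concrete, here is a minimal sketch that applies the table's relative multipliers to an FP32 figure. The 32 TFLOPS card is hypothetical, the multipliers are the rough ratios from the table rather than any specific GPU's specification, and the names `RELATIVE_THROUGHPUT` and `precision_range` are made up for illustration.

```python
# Rough multipliers relative to FP32, copied from the table above.
# Each entry is a (low, high) range; real ratios vary by architecture.
RELATIVE_THROUGHPUT = {
    "FP64": (1 / 64, 1 / 32),
    "FP32": (1.0, 1.0),
    "FP16": (2.0, 2.0),
    "INT8": (4.0, 4.0),      # strictly integer ops, usually quoted as TOPS
    "FP4/FP8": (8.0, 16.0),
}


def precision_range(fp32_tflops: float, precision: str) -> tuple:
    """Return the (low, high) estimated throughput for a given precision."""
    low, high = RELATIVE_THROUGHPUT[precision]
    return fp32_tflops * low, fp32_tflops * high


# Hypothetical card with 32 FP32 TFLOPS (an assumption, not a real product).
for name in RELATIVE_THROUGHPUT:
    low, high = precision_range(32.0, name)
    print(f"{name}: {low:g} to {high:g} T(FL)OPS")
```

The output is only as good as the multipliers fed in, so for a real purchase the vendor's per-precision figures should replace these ranges.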

The TFLOPS calculation formula

The math is simple. For FP32:

FP32 TFLOPS = (Shader cores × Boost clock in MHz × 2) ÷ 1,000,000

The × 2 reflects the fused multiply-add (FMA) instruction: modern shader cores execute a multiply and an add in a single clock cycle, counting as two floating-point operations. Cores × MHz × 2 yields millions of operations per second (MFLOPS), so the ÷ 1,000,000 converts that figure to TFLOPS.

Worked examples:

RTX 5070: 6,144 cores × 2,512 MHz × 2 ÷ 1,000 ≈ 30,867 GFLOPS = 30.9 TFLOPS
RTX 5080: 10,752 cores × 2,620 MHz × 2 ÷ 1,000 ≈ 56,340 GFLOPS = 56.3 TFLOPS
RTX 5090: 21,760 cores × 2,407 MHz × 2 ÷ 1,000 ≈ 104,753 GFLOPS = 104.8 TFLOPS
RX 9070 XT: 4,096 stream processors × 2,970 MHz × 2 × 2 (dual-issue) ÷ 1,000 ≈ 48,660 GFLOPS = ~48.7 TFLOPS

Notice the AMD calculation includes an extra × 2 for dual-issue FP32 — RDNA 3 and later architectures can co-issue two FP32 operations per cycle under specific conditions. This is partly why AMD's TFLOPS numbers look high relative to real-world gaming. The dual-issue path is workload-dependent; many shaders can't take advantage of it.
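The formula and worked examples above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `fp32_tflops` and the `dual_issue` flag are illustrative choices, and the core counts and clocks are the ones quoted in this guide.

```python
def fp32_tflops(shader_cores: int, boost_mhz: float, dual_issue: bool = False) -> float:
    """Peak theoretical FP32 throughput in TFLOPS."""
    ops_per_core_per_clock = 2          # one fused multiply-add = 2 operations
    if dual_issue:
        ops_per_core_per_clock *= 2     # vendor-quoted dual-issue peak (RDNA 3 and later)
    mflops = shader_cores * boost_mhz * ops_per_core_per_clock
    return mflops / 1_000_000           # MFLOPS -> TFLOPS


# (shader cores, boost clock in MHz, dual-issue counted?) as quoted above
cards = {
    "RTX 5070": (6_144, 2_512, False),
    "RTX 5080": (10_752, 2_620, False),
    "RTX 5090": (21_760, 2_407, False),
    "RX 9070 XT": (4_096, 2_970, True),
}

for name, (cores, mhz, dual) in cards.items():
    print(f"{name}: {fp32_tflops(cores, mhz, dual):.1f} TFLOPS")
```

Setting `dual_issue=False` for the RX 9070 XT halves its result to about 24.3 TFLOPS, which shows how much of the quoted figure depends on the dual-issue path.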

Why TFLOPS doesn't equal FPS

This is the heart of the matter. A GPU with more TFLOPS on paper can — and often does — render fewer frames per second than a GPU with fewer TFLOPS. Six reasons:

1. Memory bandwidth bottleneck. A shader core sitting idle while it waits for data from VRAM doesn't accomplish anything, no matter how fast it could compute. A GPU with 30 TFLOPS but slow GDDR6 will be starved before it hits theoretical peak; a GPU with 25 TFLOPS and fast GDDR7 + larger L2 cache often keeps its cores fed and outperforms it.

2. Architecture efficiency. Different GPU generations extract different amounts of useful work per TFLOP. NVIDIA's Ada and Blackwell architectures consistently extract higher real-world utilisation than AMD's RDNA 3 — a TFLOP on Blackwell does more visible work in a game than a TFLOP on RDNA 3.

3. Driver optimisation. A well-optimised driver schedules shader work efficiently, reduces stalls, and applies per-title fixes. NVIDIA has historically had more aggressive per-game driver optimisations than AMD, though this gap has narrowed in 2025-2026.

4. Game-specific shader compilation. Some games extract dual-issue or vector throughput well; some don't. A title compiled with NVIDIA's developer tooling may run substantially better on NVIDIA hardware even at lower TFLOPS.

5. RT cores and Tensor cores aren't FP32. Ray tracing runs on dedicated RT cores; DLSS upscaling and frame generation run on Tensor cores. None of that work is captured by the FP32 TFLOPS figure.

6. ROPs, TMUs, and the rest of the graphics pipeline. A GPU does more than shader math — it rasterises, samples textures, blends pixels, and writes framebuffers. The ROP and TMU counts matter at high resolution.
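Reason 1 can be illustrated with the roofline model, a standard back-of-envelope technique that is not part of this guide's own method: attainable throughput is the lower of the compute peak and what the memory system can feed. In the sketch below the bandwidth figures and the 40 FLOPs-per-byte workload are assumptions chosen for illustration, and cache effects are ignored.

```python
def attainable_tflops(peak_tflops: float, bandwidth_gbs: float, flops_per_byte: float) -> float:
    """Roofline estimate: capped by compute or by memory, whichever is lower."""
    memory_bound_tflops = bandwidth_gbs * flops_per_byte / 1_000  # GFLOPS -> TFLOPS
    return min(peak_tflops, memory_bound_tflops)


# Two hypothetical cards running the same hypothetical workload
# (40 floating-point operations per byte fetched from VRAM).
card_a = attainable_tflops(peak_tflops=30, bandwidth_gbs=300, flops_per_byte=40)
card_b = attainable_tflops(peak_tflops=25, bandwidth_gbs=900, flops_per_byte=40)

print(f"Card A (30 TFLOPS, slow memory): {card_a:g} TFLOPS attainable")
print(f"Card B (25 TFLOPS, fast memory): {card_b:g} TFLOPS attainable")
```

Under these assumed numbers the 30 TFLOPS card is starved down to 12 TFLOPS while the 25 TFLOPS card reaches its full peak, which is the pattern described in reason 1.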

RDNA vs Ada vs Blackwell — architecture efficiency

Five architectures from three vendors power most modern GPUs:

AMD RDNA 3 (RX 7000 series). Quoted FP32 TFLOPS includes dual-issue throughput that's hard to fully utilise in gaming shaders. Real-world per-TFLOP gaming efficiency lags Ada by 15-25% in most titles.

AMD RDNA 4 (RX 9000 series). Significantly improved per-TFLOP efficiency over RDNA 3, plus revamped RT cores. Closes the gap with NVIDIA's Blackwell in raster, still trails in ray tracing.

NVIDIA Ada (RTX 40 series). Highly efficient per TFLOP. Strong RT and DLSS performance. 4th-gen Tensor cores enable DLSS 3 frame generation.

NVIDIA Blackwell (RTX 50 series). Latest generation. 5th-gen Tensor cores. DLSS 4 multi-frame generation. Even higher per-TFLOP efficiency in titles that use the newer rendering pipeline.

Intel Arc Battlemage. Intel's second-gen Arc architecture. Quoted TFLOPS is roughly competitive with NVIDIA's xx60 tier, but driver maturity and per-title optimisation lag both NVIDIA and AMD, meaning TFLOPS overstates real-world gaming performance more here than on either competitor.

Memory bandwidth — the actual bottleneck

If shader cores are the engine, memory bandwidth is the fuel line. The fastest engine in the world produces no power if it can't get fuel. Modern GPUs are bandwidth-constrained more often than compute-constrained — especially at 1440p and 4K.

A GPU's peak memory bandwidth (GB/s) is the memory bus width (bits) multiplied by the per-pin data rate (MT/s), divided by 8 to convert bits to bytes and by 1,000 to convert MB/s to GB/s. Recent NVIDIA cards have offset narrower buses with larger L2 caches (NVIDIA's counterpart to AMD's Infinity Cache), keeping more textures and shader data closer to the cores and reducing trips to VRAM.
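As a rough sketch of the bandwidth arithmetic: the bus width in bits is divided by 8 to get bytes per transfer, then multiplied by the per-pin data rate. The function name `memory_bandwidth_gbs` and the two memory configurations below are illustrative assumptions, not figures taken from this guide.

```python
def memory_bandwidth_gbs(bus_width_bits: int, data_rate_mtps: float) -> float:
    """Peak memory bandwidth in GB/s from bus width (bits) and data rate (MT/s)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * data_rate_mtps / 1_000  # MB/s -> GB/s


# Illustrative configurations (assumed values):
wide_fast = memory_bandwidth_gbs(bus_width_bits=256, data_rate_mtps=30_000)
narrow_slow = memory_bandwidth_gbs(bus_width_bits=128, data_rate_mtps=17_000)

print(f"256-bit @ 30,000 MT/s: {wide_fast:g} GB/s")
print(f"128-bit @ 17,000 MT/s: {narrow_slow:g} GB/s")
```

This gives the raw VRAM figure only. A large L2 cache raises the effective bandwidth the cores see, which this calculation does not capture.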

Practical implication: when comparing two cards with similar FP32 TFLOPS, the one with higher memory bandwidth (and/or larger cache) almost always wins at higher resolutions. The card with more TFLOPS but starved memory wins on paper and loses in games.

RT cores and Tensor cores — separate silicon, separate performance

Modern GPUs aren't a single homogeneous compute fabric. They contain multiple specialised processing units:

Shader cores (CUDA / Stream Processors) — execute the FP32 work that defines the headline TFLOPS number.
RT cores (NVIDIA) / Ray Accelerators (AMD) / RT Units (Intel) — handle bounding-volume hierarchy traversal and ray-triangle intersection tests. Ray tracing FPS depends on these, not FP32 TFLOPS.
Tensor cores (NVIDIA) / AI Accelerators (AMD) / XMX Units (Intel) — accelerate matrix multiply operations used by DLSS, FSR 4, XeSS, and AI workloads.
ROPs (render output units) — write final pixels to the framebuffer; matter at high resolutions.
TMUs (texture mapping units) — sample textures; matter for texture-heavy titles.

The takeaway: two GPUs can have the same FP32 TFLOPS and drastically different real-world gaming performance because the rest of the silicon doesn't match. NVIDIA's DLSS quality advantage in 2026 isn't because they have more TFLOPS — it's because their Tensor cores are dedicated, fast, and well-supported by the AI upscaling models.

When TFLOPS does matter

TFLOPS isn't useless — it's just frequently misapplied. Three workloads where the headline number does correlate well with real performance:

AI training and inference. Large language models, image generation, machine learning research — these workloads are compute-dense and frequently capable of saturating shader cores. FP16, BF16 and INT8 TFLOPS matter here, but FP32 TFLOPS is a reasonable first-pass proxy.

Scientific computing and simulation. Computational fluid dynamics, molecular dynamics, finite-element analysis: workloads that map cleanly onto shader cores benefit from raw throughput. For FP64 work, look at the FP64 TFLOPS figure, which gaming GPUs deliberately limit; high-rate FP64 is mostly found on data-centre accelerators such as NVIDIA's A100 / H100 class and AMD's Instinct line.

3D rendering (Blender Cycles, OctaneRender, V-Ray GPU). Path-traced offline rendering does saturate shader cores. Render time scales close to linearly with FP32 TFLOPS within an architecture, less reliably across architectures.

Common TFLOPS mistakes

Comparing TFLOPS across vendors. The single biggest mistake. A 22 TFLOPS AMD card is not faster than a 15 TFLOPS NVIDIA card by any meaningful gaming metric. Different architectures count TFLOPS differently and extract different per-TFLOP performance.

Comparing TFLOPS across generations. An RTX 4090's 82.6 TFLOPS against an RTX 3090's 35.6 TFLOPS is a 2.3× gap on paper, but the real FPS gap is closer to 1.8×. Per-TFLOP output shifts from one architecture to the next, so the ratio of two TFLOPS figures does not translate into the ratio of two frame rates.

Assuming higher TFLOPS = better for ray tracing. RT performance lives on RT cores, not FP32 shaders. An RX 7900 XTX has more FP32 TFLOPS than an RTX 4080 but loses badly in heavy RT workloads — because Ada's 3rd-gen RT cores outpace RDNA 3's Ray Accelerators.

Ignoring memory bandwidth. A card with high TFLOPS but narrow memory bus chokes at 4K. Always check effective bandwidth and cache size alongside TFLOPS.

Comparing console TFLOPS to PC GPU TFLOPS. PS5's 10.3 TFLOPS and Xbox Series X's 12.1 TFLOPS map to roughly RTX 3060 Ti / 3070 gaming performance — not an RTX 4060 (15 TFLOPS) or RX 7600 (22 TFLOPS). Console GPUs are RDNA 2, and the TFLOPS number significantly understates their gaming output thanks to fixed-platform optimisation.
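The cross-generation mistake is easy to quantify. The sketch below takes the RTX 4090 and RTX 3090 TFLOPS figures and the roughly 1.8× FPS gap quoted above, and compares the paper ratio with the observed one. The function name `paper_vs_real` is hypothetical, and the 1.8× value is this guide's approximation rather than a measured benchmark.

```python
def paper_vs_real(tflops_new: float, tflops_old: float, fps_ratio: float) -> tuple:
    """Return (paper TFLOPS ratio, fraction of that ratio seen in real FPS)."""
    paper_ratio = tflops_new / tflops_old
    realised = fps_ratio / paper_ratio
    return paper_ratio, realised


paper, realised = paper_vs_real(tflops_new=82.6, tflops_old=35.6, fps_ratio=1.8)
print(f"Paper ratio: {paper:.2f}x")
print(f"Share of the paper gap that shows up as FPS: {realised:.0%}")
```

Roughly three quarters of the paper gap shows up in frame rates here, so a TFLOPS ratio on its own overstates the upgrade.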

Sapphire Nitro+ 9070 and 9070 XT side-by-side

Key takeaways

TFLOPS = math throughput, calculated as cores × boost clock × 2. FP32 is the headline figure on gaming GPUs.
TFLOPS doesn't directly equal FPS — architecture, memory bandwidth, RT cores, Tensor cores and drivers all matter more.
AMD TFLOPS numbers count dual-issue paths aggressively; cross-vendor comparisons mislead.
Ray tracing runs on RT cores; DLSS and FSR 4 run on Tensor / AI cores, while earlier FSR versions run on ordinary shaders. None of the dedicated-core work is captured by FP32 TFLOPS.
For gaming, trust per-game FPS benchmarks in your titles. For AI / compute / Blender, TFLOPS is a useful first comparison.

Frequently asked questions

What is a TFLOP on a GPU?
TFLOPS = Tera Floating-Point Operations Per Second. For gaming GPUs the headline number is FP32 TFLOPS, calculated as cores × boost clock × 2. It measures peak theoretical math throughput, not gaming FPS.
Does more TFLOPS mean more FPS?
Not directly. FPS depends on memory bandwidth, architecture efficiency, RT cores, Tensor cores and driver quality. Same-architecture cards scale with TFLOPS; cross-architecture cards often don't.
Why do AMD GPUs have higher TFLOPS but similar FPS to NVIDIA?
AMD counts dual-issue FP32 in peak numbers — workloads can't always use it. NVIDIA's Ada / Blackwell extracts higher utilisation per cycle, and driver / per-game optimisation is more mature.
What is FP32 vs FP16 vs INT8?
Different math precisions. FP32 is the gaming-relevant number. FP16 doubles throughput, used by DLSS / FSR / AI training. INT8 is for AI inference (Tensor cores). Higher precision = lower throughput.
How do I calculate TFLOPS for a GPU?
FP32 TFLOPS = (cores × boost clock in MHz × 2) ÷ 1,000,000. The × 2 reflects fused multiply-add (FMA). For AMD RDNA 3+, multiply again by 2 for dual-issue paths (workload-dependent in practice).
When do TFLOPS actually matter?
Compute-heavy workloads: AI training/inference, Blender Cycles, scientific computing. For gaming, ignore TFLOPS and trust per-game benchmark FPS in your titles.
What about ray tracing — is that FP32 too?
No. Ray tracing runs on dedicated RT cores (NVIDIA RT cores, AMD Ray Accelerators, Intel RT Units). A GPU's RT FPS is not predicted by FP32 TFLOPS.
What about DLSS and frame generation?
DLSS and frame generation run on Tensor cores (AI accelerators), not FP32 shaders. AMD's FSR ran on general shaders until FSR 4, which is one reason DLSS quality often beats FSR at equivalent TFLOPS.