Hardware Explainer · GPU Spec
GPU TFLOPS, explained. — Math throughput is not gaming FPS.
The headline spec on every GPU box — and one of the most misleading numbers in PC hardware. Higher TFLOPS doesn't reliably equal higher FPS, and we've got the benchmarks to prove it.
- what it measures
- Math throughput
- the misconception
- FP32 ≠ FPS
- what matters more
- Architecture
What TFLOPS actually measures
TFLOPS stands for Tera Floating-Point Operations Per Second — one trillion floating-point math operations every second. It's a measure of raw arithmetic throughput, derived directly from the GPU's shader core count and clock speed. Nothing in the calculation involves a game engine, a resolution, a render pipeline, or a driver.
That last sentence is why TFLOPS is so often misunderstood. The number on a GPU box describes peak theoretical math throughput in ideal lab conditions, with no memory bottlenecks, no cache misses, no shader complexity. Real games never hit that ceiling.
When NVIDIA or AMD print "30 TFLOPS" or "60 TFLOPS" on marketing material, they're quoting FP32 (single-precision floating point) throughput — the precision most gaming shader workloads use. That's the headline number. Other TFLOPS figures (FP16, FP64, INT8) exist for different workloads and almost never appear on the box.
FP32 vs FP16 vs INT8 — the precision ladder
Modern GPUs handle multiple math precisions, each suited to a different workload. Understanding which one you care about is the first step.
| Precision | Typical use | Relative throughput |
|---|---|---|
| FP64 (double) | Scientific, CAD, simulation | 1/32 to 1/64 of FP32 on gaming GPUs |
| FP32 (single) | Gaming shaders — the headline number | Baseline (the quoted TFLOPS) |
| FP16 (half) | DLSS, FSR upscaling, AI training | 2× FP32 on modern GPUs |
| INT8 | AI inference, Tensor cores | 4× FP32 (via Tensor pipeline) |
| FP4 / FP8 | LLM inference, Blackwell era | 8-16× FP32 (specialised) |
For a gaming GPU buying decision, FP32 is the relevant number — but only as a rough sanity check against same-architecture cards. For AI work, FP16 and INT8 throughput matter more. For Blender or scientific compute, FP64 throughput matters (and most gaming GPUs are crippled here intentionally).
The TFLOPS calculation formula
The math is simple. For FP32:
FP32 TFLOPS = (Shader cores × Boost clock in MHz × 2) ÷ 1,000,000
The × 2 reflects the fused multiply-add (FMA) instruction — modern shader cores execute a multiply and an add in a single clock cycle, counting as two floating-point operations. The /1,000,000 converts from FLOPS-per-second to TFLOPS.
Worked examples:
- RTX 5070: 6,144 cores × 2,512 MHz × 2 = 30,867 GFLOPS = 30.9 TFLOPS
- RTX 5080: 10,752 cores × 2,620 MHz × 2 = 56,346 GFLOPS = 56.3 TFLOPS
- RTX 5090: 21,760 cores × 2,407 MHz × 2 = 104,750 GFLOPS = 104.8 TFLOPS
- RX 9070 XT: 4,096 stream processors × 2,970 MHz × 2 × 2 (dual-issue) = ~48.7 TFLOPS
Notice the AMD calculation includes an extra × 2 for dual-issue FP32 — RDNA 3 and later architectures can co-issue two FP32 operations per cycle under specific conditions. This is partly why AMD's TFLOPS numbers look high relative to real-world gaming. The dual-issue path is workload-dependent; many shaders can't take advantage of it.
Why TFLOPS doesn't equal FPS
This is the heart of the matter. A GPU with more TFLOPS on paper can — and often does — render fewer frames per second than a GPU with fewer TFLOPS. Six reasons:
1. Memory bandwidth bottleneck. A shader core sitting idle while it waits for data from VRAM doesn't accomplish anything, no matter how fast it could compute. A GPU with 30 TFLOPS but slow GDDR6 will be starved before it hits theoretical peak; a GPU with 25 TFLOPS and fast GDDR7 + larger L2 cache often keeps its cores fed and outperforms it.
2. Architecture efficiency. Different GPU generations extract different amounts of useful work per TFLOP. NVIDIA's Ada and Blackwell architectures consistently extract higher real-world utilisation than AMD's RDNA 3 — a TFLOP on Blackwell does more visible work in a game than a TFLOP on RDNA 3.
3. Driver optimisation. A well-optimised driver schedules shader work efficiently, reduces stalls, and applies per-title fixes. NVIDIA has historically had more aggressive per-game driver optimisations than AMD, though this gap has narrowed in 2025-2026.
4. Game-specific shader compilation. Some games extract dual-issue or vector throughput well; some don't. A title compiled with NVIDIA's developer tooling may run substantially better on NVIDIA hardware even at lower TFLOPS.
5. RT cores and Tensor cores aren't FP32. Ray tracing runs on dedicated RT cores; DLSS upscaling and frame generation run on Tensor cores. None of that work is captured by the FP32 TFLOPS figure.
6. ROPs, TMUs, and the rest of the graphics pipeline. A GPU does more than shader math — it rasterises, samples textures, blends pixels, and writes framebuffers. The ROP and TMU counts matter at high resolution.
RDNA vs Ada vs Blackwell — architecture efficiency
Three architectures power most modern GPUs:
AMD RDNA 3 (RX 7000 series). Quoted FP32 TFLOPS includes dual-issue throughput that's hard to fully utilise in gaming shaders. Real-world per-TFLOP gaming efficiency lags Ada by 15-25% in most titles.
AMD RDNA 4 (RX 9000 series). Significantly improved per-TFLOP efficiency over RDNA 3, plus revamped RT cores. Closes the gap with NVIDIA's Blackwell in raster, still trails in ray tracing.
NVIDIA Ada (RTX 40 series). Highly efficient per TFLOP. Strong RT and DLSS performance. 4th-gen Tensor cores enable DLSS 3 frame generation.
NVIDIA Blackwell (RTX 50 series). Latest generation. 5th-gen Tensor cores. DLSS 4 multi-frame generation. Even higher per-TFLOP efficiency in titles that use the newer rendering pipeline.
Intel Arc Battlemage. Intel's second-gen Arc architecture. Quoted TFLOPS is roughly competitive with NVIDIA's xx60 tier, but driver maturity and per-title optimisation lags both NVIDIA and AMD — meaning TFLOPS overstates real-world gaming performance more here than on either competitor.
Memory bandwidth — the actual bottleneck
If shader cores are the engine, memory bandwidth is the fuel line. The fastest engine in the world produces no power if it can't get fuel. Modern GPUs are bandwidth-constrained more often than compute-constrained — especially at 1440p and 4K.
A GPU's effective memory bandwidth (GB/s) is the product of memory bus width (bits) and memory clock speed (MT/s). Recent NVIDIA cards have offset narrower buses with larger L2 caches (NVIDIA's "Infinity Cache" equivalent) — keeping more textures and shader data closer to the cores, reducing trips to VRAM.
Practical implication: when comparing two cards with similar FP32 TFLOPS, the one with higher memory bandwidth (and/or larger cache) almost always wins at higher resolutions. The card with more TFLOPS but starved memory wins on paper and loses in games.
RT cores and Tensor cores — separate silicon, separate performance
Modern GPUs aren't a single homogeneous compute fabric. They contain multiple specialised processing units:
- Shader cores (CUDA / Stream Processors) — execute the FP32 work that defines the headline TFLOPS number.
- RT cores (NVIDIA) / Ray Accelerators (AMD) / RT Units (Intel) — handle bounding-volume hierarchy traversal and ray-triangle intersection tests. Ray tracing FPS depends on these, not FP32 TFLOPS.
- Tensor cores (NVIDIA) / AI Accelerators (AMD) / XMX Units (Intel) — accelerate matrix multiply operations used by DLSS, FSR 4, XeSS, and AI workloads.
- ROPs (render output units) — write final pixels to the framebuffer; matter at high resolutions.
- TMUs (texture mapping units) — sample textures; matter for texture-heavy titles.
The takeaway: two GPUs can have the same FP32 TFLOPS and drastically different real-world gaming performance because the rest of the silicon doesn't match. NVIDIA's DLSS quality advantage in 2026 isn't because they have more TFLOPS — it's because their Tensor cores are dedicated, fast, and well-supported by the AI upscaling models.
When TFLOPS does matter
TFLOPS isn't useless — it's just frequently misapplied. Three workloads where the headline number does correlate well with real performance:
AI training and inference. Large language models, image generation, machine learning research — these workloads are compute-dense and frequently capable of saturating shader cores. FP16, BF16 and INT8 TFLOPS matter here, but FP32 TFLOPS is a reasonable first-pass proxy.
Scientific computing and simulation. Computational fluid dynamics, molecular dynamics, finite-element analysis — workloads that map cleanly onto shader cores benefit from raw throughput. For FP64 work, look at the FP64 TFLOPS figure (which gaming GPUs deliberately cripple — that's why the Radeon Pro / RTX A-series exist).
3D rendering (Blender Cycles, OctaneRender, V-Ray GPU). Path-traced offline rendering does saturate shader cores. Render time scales close to linearly with FP32 TFLOPS within an architecture, less reliably across architectures.
Common TFLOPS mistakes
Comparing TFLOPS across vendors. The single biggest mistake. A 22 TFLOPS AMD card is not faster than a 15 TFLOPS NVIDIA card by any meaningful gaming metric. Different architectures count TFLOPS differently and extract different per-TFLOP performance.
Comparing TFLOPS across generations. An RTX 4090's 82.6 TFLOPS is a very different beast from an RTX 3090's 35.6 TFLOPS — not just 2.3× faster, because Ada extracts more work per TFLOP than Ampere. Real FPS gap is closer to 1.8×.
Assuming higher TFLOPS = better for ray tracing. RT performance lives on RT cores, not FP32 shaders. An RX 7900 XTX has more FP32 TFLOPS than an RTX 4080 but loses badly in heavy RT workloads — because Ada's 3rd-gen RT cores outpace RDNA 3's Ray Accelerators.
Ignoring memory bandwidth. A card with high TFLOPS but narrow memory bus chokes at 4K. Always check effective bandwidth and cache size alongside TFLOPS.
Comparing console TFLOPS to PC GPU TFLOPS. PS5's 10.3 TFLOPS and Xbox Series X's 12.1 TFLOPS map to roughly RTX 3060 Ti / 3070 gaming performance — not an RTX 4060 (15 TFLOPS) or RX 7600 (22 TFLOPS). Console GPUs are RDNA 2, and the TFLOPS number significantly understates their gaming output thanks to fixed-platform optimisation.




Key takeaways
- TFLOPS = math throughput, calculated as cores × boost clock × 2. FP32 is the headline figure on gaming GPUs.
- TFLOPS doesn't directly equal FPS — architecture, memory bandwidth, RT cores, Tensor cores and drivers all matter more.
- AMD TFLOPS numbers count dual-issue paths aggressively; cross-vendor comparisons mislead.
- Ray tracing runs on RT cores; DLSS / FSR runs on Tensor / AI cores. Neither is captured by FP32 TFLOPS.
- For gaming, trust per-game FPS benchmarks in your titles. For AI / compute / Blender, TFLOPS is a useful first comparison.
Frequently asked questions
What is a TFLOP on a GPU?
TFLOPS = Tera Floating-Point Operations Per Second. For gaming GPUs the headline number is FP32 TFLOPS, calculated as cores × boost clock × 2. It measures peak theoretical math throughput, not gaming FPS.Does more TFLOPS mean more FPS?
Not directly. FPS depends on memory bandwidth, architecture efficiency, RT cores, Tensor cores and driver quality. Same-architecture cards scale with TFLOPS; cross-architecture cards often don't.Why do AMD GPUs have higher TFLOPS but similar FPS to NVIDIA?
AMD counts dual-issue FP32 in peak numbers — workloads can't always use it. NVIDIA's Ada / Blackwell extracts higher utilisation per cycle, and driver / per-game optimisation is more mature.What is FP32 vs FP16 vs INT8?
Different math precisions. FP32 is the gaming-relevant number. FP16 doubles throughput, used by DLSS / FSR / AI training. INT8 is for AI inference (Tensor cores). Higher precision = lower throughput.How do I calculate TFLOPS for a GPU?
FP32 TFLOPS = (cores × boost clock in MHz × 2) ÷ 1,000,000. The × 2 reflects fused multiply-add (FMA). For AMD RDNA 3+, multiply again by 2 for dual-issue paths (workload-dependent in practice).When do TFLOPS actually matter?
Compute-heavy workloads: AI training/inference, Blender Cycles, scientific computing. For gaming, ignore TFLOPS and trust per-game benchmark FPS in your titles.What about ray tracing — is that FP32 too?
No. Ray tracing runs on dedicated RT cores (NVIDIA RT cores, AMD Ray Accelerators, Intel RT Units). A GPU's RT FPS is not predicted by FP32 TFLOPS.What about DLSS and frame generation?
DLSS and frame generation run on Tensor cores (AI accelerators), not FP32 shaders. AMD's FSR runs on general shaders, which is one reason DLSS quality often beats FSR at equivalent TFLOPS.