The post FlashAttention-4 Hits 1,605 TFLOPS on NVIDIA Blackwell GPUs appeared on BitcoinEthereumNews.com.

FlashAttention-4 Hits 1,605 TFLOPS on NVIDIA Blackwell GPUs



Alvin Lang
Jan 22, 2026 23:03

NVIDIA’s FlashAttention-4 achieves 71% hardware efficiency on Blackwell chips, delivering 3.6x speedup over FA2 for AI training workloads.

NVIDIA has released FlashAttention-4, the latest optimization for transformer neural networks that squeezes 1,605 TFLOPS out of its Blackwell architecture—capturing 71% of the hardware’s theoretical maximum performance.

The announcement matters for anyone watching AI infrastructure investments. As large language models push toward longer context windows, the attention mechanism’s quadratic memory complexity becomes a brutal bottleneck. FlashAttention-4 attacks this problem directly, and the benchmark numbers suggest meaningful gains for production AI workloads.

What the Numbers Show

On the B200 GPU, FA4 delivers a 3.6x speedup over FlashAttention-2 on forward passes at a sequence length of 32,768, and runs the backward pass 3.15x faster under the same conditions. Against existing kernels, FA4 posts a 1.3x improvement over cuDNN and 2.4x over Triton-based implementations.
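As a sanity check, the 1,605 TFLOPS figure and the 71% efficiency claim are mutually consistent: dividing one by the other implies a theoretical peak of roughly 2.26 PFLOPS per B200, in line with NVIDIA's commonly quoted ~2.25 PFLOPS dense BF16 spec (that peak figure is an assumption here, not stated in the article):

```python
# Back-of-the-envelope check: achieved throughput / efficiency = implied hardware peak.
achieved_tflops = 1605        # reported FA4 throughput
efficiency = 0.71             # reported fraction of theoretical maximum
implied_peak_tflops = achieved_tflops / efficiency
print(f"Implied theoretical peak: {implied_peak_tflops:.0f} TFLOPS")  # ~2261 TFLOPS
```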

The memory efficiency gains are equally significant. Standard attention scales at O(N²) with sequence length—meaning doubling your context window quadruples memory requirements. FA4 brings this down to O(N) through tiling and incremental softmax normalization. NVIDIA claims 20x lower memory usage compared to PyTorch baselines.
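The tiling and incremental (online) softmax idea can be sketched in a few lines of NumPy. This is an illustrative reference implementation, not FA4's CUDA kernel: it streams K/V in tiles and keeps only a running row-max and softmax denominator, so the full N×N score matrix is never materialized.

```python
import numpy as np

def naive_attention(q, k, v):
    # Materializes the full (N, N) score matrix: O(N^2) memory.
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

def tiled_attention(q, k, v, tile=128):
    # Streams K/V in tiles; per-query state is O(N * d) regardless of N.
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full((n, 1), -np.inf)       # running row maximum
    l = np.zeros((n, 1))               # running softmax denominator
    acc = np.zeros_like(q)             # running (unnormalized) output
    for j in range(0, k.shape[0], tile):
        s = (q @ k[j:j + tile].T) * scale
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        correction = np.exp(m - m_new)          # rescale old partial sums
        p = np.exp(s - m_new)
        l = l * correction + p.sum(axis=-1, keepdims=True)
        acc = acc * correction + p @ v[j:j + tile]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 64)) for _ in range(3))
assert np.allclose(naive_attention(q, k, v), tiled_attention(q, k, v))
```

The correction factor is the key trick: whenever a new tile raises the running maximum, previously accumulated sums are rescaled so the final result matches the exact softmax.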

Hardware-Software Co-Design

FA4 was built specifically for Blackwell’s quirks. The architecture presents an asymmetric scaling problem: compute power roughly doubles while memory bandwidth doesn’t keep pace. Traditional approaches leave tensor cores sitting idle while waiting for data.

The solution leverages Blackwell’s dedicated Tensor Memory (TMEM)—256 KB of on-chip memory per streaming multiprocessor. By storing intermediate calculations directly in TMEM instead of shared memory, FA4 sidesteps the bandwidth bottleneck that would otherwise throttle the faster compute units.

Larger tile sizes (up to 128×128) and deeper pipelines keep the hardware busy. The backward pass—typically the slower half of training—benefits from bypassing register accumulation entirely.

Production Integration

Major inference frameworks including SGLang and vLLM already support FA4 prefill operations. NVIDIA has incorporated these techniques into cuDNN 9.14, making the optimizations accessible to developers without custom kernel work.
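Most developers never call these kernels directly. In PyTorch, for example, `torch.nn.functional.scaled_dot_product_attention` dispatches to whichever fused flash-style backend is available for the current device and dtype, falling back to a plain math implementation otherwise. A minimal sketch (the shapes are illustrative, not from the article):

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# (batch, heads, sequence, head_dim) — arbitrary example shapes.
q = torch.randn(1, 8, 1024, 64, dtype=dtype, device=device)
k = torch.randn_like(q)
v = torch.randn_like(q)

# PyTorch picks a fused attention kernel (flash-style or cuDNN) when one
# supports this device/dtype combination; no custom kernel work required.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```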

For AI companies burning through compute budgets, the efficiency gains translate directly to cost savings. A 3x+ speedup on training passes means either faster iteration cycles or the ability to train larger models within existing infrastructure constraints.

The broader trend here: as transformer models grow, algorithmic efficiency at the kernel level becomes as important as raw hardware capability. FlashAttention-4 represents the current frontier of that optimization work.

Image source: Shutterstock

Source: https://blockchain.news/news/flashattention-4-nvidia-blackwell-gpu-performance
