DeepSeek V4 Released with Million-Token Architecture, Closing the Open-Prem Gap
- David Borish


The Efficiency Numbers That Matter
The DeepSeek V4 technical report opens with a problem statement that every enterprise running long-document analysis or multi-step agent workflows will recognize immediately. Standard attention scales quadratically with sequence length, so as context windows grow, inference costs grow faster. For cloud API customers, that cost passes through to the bill. For organizations running models on their own hardware, it shows up directly as GPU time and hardware utilization.
DeepSeek's V4 series addresses this with a hybrid attention architecture combining two new mechanisms: Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). CSA compresses every four key-value tokens into a single cache entry, then applies sparse attention to select the most relevant compressed entries for each query. HCA applies even more aggressive compression at a ratio of 128-to-1, keeping dense attention over the resulting compressed set. The two mechanisms alternate across the model's layers, with sliding window attention added at the local level to preserve fine-grained dependencies among nearby tokens.
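To make the division of labor concrete, here is a minimal sketch of the two compression paths, assuming mean pooling as the compression operator and top-k scoring for the sparse selection step. The report does not publish the kernels in this form, so the function names, the pooling choice, and the selection rule below are illustrative placeholders rather than DeepSeek's implementation.

```python
import torch
import torch.nn.functional as F

def compress_kv(k, v, ratio):
    """Pool every `ratio` key/value tokens into one compressed cache entry."""
    B, T, D = k.shape
    pad = (-T) % ratio
    if pad:  # pad the sequence so it divides evenly into compression blocks
        k = F.pad(k, (0, 0, 0, pad))
        v = F.pad(v, (0, 0, 0, pad))
    k_c = k.view(B, -1, ratio, D).mean(dim=2)  # (B, ceil(T/ratio), D)
    v_c = v.view(B, -1, ratio, D).mean(dim=2)
    return k_c, v_c

def csa_attention(q, k, v, ratio=4, top_k=64):
    """Compressed Sparse Attention: 4-to-1 compression, then attend only to the
    top-k most relevant compressed entries for each query."""
    k_c, v_c = compress_kv(k, v, ratio)
    scores = q @ k_c.transpose(-1, -2) / q.shape[-1] ** 0.5  # (B, Tq, Tc)
    top_k = min(top_k, scores.shape[-1])
    idx = scores.topk(top_k, dim=-1).indices
    mask = torch.full_like(scores, float("-inf")).scatter(-1, idx, 0.0)
    return F.softmax(scores + mask, dim=-1) @ v_c

def hca_attention(q, k, v, ratio=128):
    """Heavily Compressed Attention: 128-to-1 compression, dense attention over
    the small set that survives."""
    k_c, v_c = compress_kv(k, v, ratio)
    scores = q @ k_c.transpose(-1, -2) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v_c

# At a million-token context, HCA attends over roughly 8,000 compressed entries
# instead of a million raw keys; CSA keeps ~250,000 entries but touches only top_k.
q, k, v = torch.randn(1, 8, 64), torch.randn(1, 4096, 64), torch.randn(1, 4096, 64)
out_csa, out_hca = csa_attention(q, k, v), hca_attention(q, k, v)
```

The point of the sketch is the asymmetry: CSA keeps a moderately compressed cache and spends its budget choosing where to look, while HCA compresses so aggressively that attending to everything that remains is cheap.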
At a one-million-token context window, the result: DeepSeek-V4-Pro requires 27% of the single-token inference FLOPs of DeepSeek-V3.2 and keeps only 10% of V3.2's KV cache. The smaller DeepSeek-V4-Flash, with 284 billion total parameters and 13 billion activated per token, pushes further, reaching 10% of V3.2's FLOPs and 7% of its KV cache at the million-token horizon.
These are not theoretical figures under ideal conditions. They reflect the model's production architecture, trained on 33 trillion tokens for V4-Pro and 32 trillion for V4-Flash, with context length extended progressively through training from 4,000 tokens up to the full one million.
What the Model Actually Does
Benchmark performance locates DeepSeek-V4-Pro-Max, the maximum-reasoning mode, in a specific position relative to the current frontier. On the SimpleQA world-knowledge evaluation, it scores 57.9%, compared to 46.2% for GPT-5.4 and 45.3% for Gemini-3.1-Pro.
On long-context benchmarks requiring full one-million-token processing, it surpasses Gemini-3.1-Pro. On general reasoning, the technical report places it roughly three to six months behind GPT-5.4 and Gemini-3.1-Pro but ahead of GPT-5.2 and Gemini-3.0-Pro.
On agentic tasks, V4-Pro-Max outperforms Claude Sonnet 4.5 in internal evaluations and approaches Claude Opus 4.5. On public benchmarks, it matches models like Kimi-K2.6 and GLM-5.1.
DeepSeek-V4-Flash-Max, the smaller variant under extended thinking budget, reaches comparable performance to GPT-5.2 and Gemini-3.0-Pro on reasoning tasks. For workloads where cost matters more than squeezing out the last few points on a leaderboard, a model delivering frontier-adjacent reasoning at 13B activated parameters per token is a different category of tool than anything available at this price point twelve months ago.
The post-training pipeline explains some of why these models perform as they do. DeepSeek trained specialist models independently for mathematics, coding, agent behavior, and instruction following, then used on-policy distillation to merge their capabilities into a single unified model. Each specialist went through supervised fine-tuning on domain-specific data followed by reinforcement learning via Group Relative Policy Optimization. The unified model learns from all of them simultaneously.
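Here is a minimal sketch of what one on-policy distillation step could look like, assuming HuggingFace-style causal language models and a per-token KL objective computed on sequences the student samples itself. The routing function, loss weighting, and the GRPO machinery around it are described in the report only at the level of prose, so everything below is an illustrative reconstruction, not DeepSeek's training code.

```python
import torch
import torch.nn.functional as F

def on_policy_distill_step(student, specialists, route, prompt_ids, optimizer,
                           max_new_tokens=256):
    """One update: the student samples its own continuations, the matching
    specialist scores them, and we minimize per-token KL(student || specialist)."""
    optimizer.zero_grad()
    total_loss = 0.0
    for ids, domain in zip(prompt_ids, map(route, prompt_ids)):
        teacher = specialists[domain]  # e.g. "math", "code", "agent", "instruct"
        # 1. Generate on-policy samples from the current student.
        with torch.no_grad():
            seq = student.generate(ids, max_new_tokens=max_new_tokens,
                                   do_sample=True)
        # 2. Teacher provides the target distribution over the student's own tokens.
        with torch.no_grad():
            teacher_logp = F.log_softmax(teacher(seq).logits, dim=-1)
        student_logp = F.log_softmax(student(seq).logits, dim=-1)
        # 3. KL(student || teacher) on student-sampled text pulls the unified
        #    model toward each specialist exactly where it currently deviates.
        total_loss = total_loss + F.kl_div(teacher_logp, student_logp,
                                           log_target=True, reduction="batchmean")
    (total_loss / len(prompt_ids)).backward()
    optimizer.step()
    return total_loss.item() / len(prompt_ids)
```

Because the samples come from the student rather than a static dataset, the specialists correct the unified model on the trajectories it actually produces, which is the property that distinguishes on-policy distillation from ordinary supervised distillation.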
The Infrastructure Changes That Make Self-Hosting Viable
The V4 technical report is unusually candid about the infrastructure work required to make these models trainable and deployable. Several of the engineering decisions have direct implications for on-premises deployment.
DeepSeek built a fine-grained Expert Parallelism scheme that overlaps computation and communication inside a single fused kernel rather than sequencing them. The result is a 1.50x to 1.73x throughput improvement for general inference and up to 1.96x for latency-sensitive scenarios like reinforcement learning rollouts and agent serving. The system is open-sourced as MegaMoE under the DeepGEMM project.
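The fused kernel itself is CUDA-level work, but the scheduling idea, letting the token exchange for one chunk travel while the experts compute on the previous chunk, can be sketched at the framework level. The sketch below assumes torch.distributed with an initialized process group and asynchronous all_to_all collectives; MegaMoE performs this overlap inside a single kernel, which Python-level pipelining only approximates.

```python
import torch
import torch.distributed as dist

def overlapped_moe_forward(chunks, experts):
    """Pipeline expert compute against token exchange: while chunk i-1 runs
    through the local experts, chunk i's all_to_all is already in flight."""
    outputs, inflight = [], None
    recv = [torch.empty_like(c) for c in chunks]
    for i, chunk in enumerate(chunks):
        # Launch this chunk's token exchange before doing any local work on it.
        work = dist.all_to_all_single(recv[i], chunk, async_op=True)
        if inflight is not None:
            prev_work, prev_buf = inflight
            prev_work.wait()                   # previous exchange has landed
            outputs.append(experts(prev_buf))  # compute overlaps the next transfer
        inflight = (work, recv[i])
    last_work, last_buf = inflight
    last_work.wait()
    outputs.append(experts(last_buf))
    return torch.cat(outputs)
```

The speedups DeepSeek reports come from performing this overlap at kernel granularity rather than the coarse chunk granularity shown here.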
FP4 quantization-aware training was applied to the expert weights during post-training, so the deployed model runs its largest parameter block at 4-bit precision without the accuracy drop that post-hoc quantization usually introduces. For organizations where GPU memory is the binding constraint, this matters concretely: more model fits in a given hardware configuration.
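The core mechanism of quantization-aware training is small enough to sketch. The version below uses a symmetric integer 4-bit grid with a straight-through estimator as a stand-in; the actual FP4 format, scaling granularity, and training schedule are not spelled out in the report in enough detail to reproduce here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FakeQuant4Bit(torch.autograd.Function):
    """Forward: snap weights to a 4-bit grid. Backward: pass gradients straight through."""
    @staticmethod
    def forward(ctx, w):
        scale = w.abs().max().clamp(min=1e-8) / 7.0  # signed 4-bit levels: -7..7
        return (w / scale).round().clamp(-7, 7) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out  # straight-through estimator

class QATExpertLinear(nn.Linear):
    """An expert projection trained against its own quantized forward pass, so the
    weights that ship at 4 bits are the weights the loss was optimized for."""
    def forward(self, x):
        return F.linear(x, FakeQuant4Bit.apply(self.weight), self.bias)
```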
The KV cache design addresses what would otherwise be a management problem. Different layers in the hybrid attention architecture produce different-sized cache entries with different update rules. DeepSeek designed a two-component cache layout that separates a conventional compressed KV store from a state cache handling sliding window attention and the uncompressed tail tokens. On-disk KV caching with three storage strategies, including a periodic checkpointing option, lets deployments trade computation against storage depending on their prefix reuse patterns.
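As a data-structure sketch, the layout might look like the following, with field names, shapes, and the checkpointing policy all assumed for illustration rather than taken from DeepSeek's serving code.

```python
from dataclasses import dataclass
import torch

@dataclass
class CompressedKVStore:
    """Per-layer compressed entries (4:1 for CSA layers, 128:1 for HCA layers)."""
    keys: torch.Tensor    # (num_entries, head_dim)
    values: torch.Tensor  # (num_entries, head_dim)

@dataclass
class StateCache:
    """Sliding-window keys/values plus the recent tokens not yet compressed."""
    window_kv: torch.Tensor  # (window_len, 2, head_dim)
    tail_kv: torch.Tensor    # (tail_len, 2, head_dim), tail_len < compression ratio

@dataclass
class LayerKVCache:
    compressed: CompressedKVStore
    state: StateCache

    def maybe_checkpoint(self, path: str, step: int, every: int = 4096) -> None:
        """Periodic on-disk checkpointing: pay storage now to skip prefill later
        when the same prefix is reused."""
        if step % every == 0:
            torch.save({"compressed": self.compressed, "state": self.state}, path)
```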
The team also reports resolving training instability in the 1.6-trillion-parameter model through two techniques: Anticipatory Routing, which decouples the routing network update from the backbone network update to prevent feedback loops that cause loss spikes, and SwiGLU Clamping, which constrains gate and linear component values during forward passes to eliminate numerical outliers.
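SwiGLU Clamping in particular is simple to illustrate. Below is a minimal sketch, assuming a standard SwiGLU feed-forward block and a placeholder clamp bound; the report's actual threshold is not quoted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClampedSwiGLU(nn.Module):
    """SwiGLU feed-forward with value clamping on the gate and linear branches."""
    def __init__(self, d_model: int, d_ff: int, clamp: float = 50.0):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_ff, bias=False)
        self.up_proj = nn.Linear(d_model, d_ff, bias=False)
        self.down_proj = nn.Linear(d_ff, d_model, bias=False)
        self.clamp = clamp

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Clamp both branches before the elementwise product so a rare activation
        # outlier cannot blow up the product and spike the loss.
        gate = F.silu(self.gate_proj(x)).clamp(-self.clamp, self.clamp)
        up = self.up_proj(x).clamp(-self.clamp, self.clamp)
        return self.down_proj(gate * up)
```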
Where This Lands in the Open-Prem V3 Framework
The Open-Prem Inflection Point V3 paper documented a frontier that had already moved: at least nine distinct open-source model families operating at or near the performance level that proprietary APIs commanded a year ago. The central finding was that the supply side of open-source AI had reached sufficient density that organizations were no longer asking whether any open-source model could handle their workloads. They were choosing among nine or more frontier-class options.
DeepSeek V4 now sits at the top of that supply-side table, with a specific capability the other entries do not match: sustained million-token context at on-premises inference costs. Open-Prem V3 documented that self-hosted inference costs between $0.05 and $0.20 per million tokens versus $3 to $15 for proprietary cloud APIs. DeepSeek V4's architecture pushes the compute requirement for long-context inference down substantially, which improves that self-hosted cost figure further for the workloads where context length was previously the binding constraint.
Open-Prem V3 described the emergence of autonomous AI workforces running on local hardware: five-agent hierarchies organized as management structures, local compute supervised by cloud intelligence on ten-minute check-in cycles, Apple Silicon running full-size models at zero marginal inference cost after hardware purchase. The limiting factor in that picture has always been context. An agent that needs to process a 200,000-word contract corpus, a full email archive for a compliance review, or a year's worth of code commits cannot do useful work if the model handling it bottlenecks at 128,000 tokens or incurs prohibitive cost to extend further.
DeepSeek-V4-Pro's ability to handle one million tokens at 27% of the FLOPs that V3.2 requires changes what an autonomous agent running on local hardware can realistically accomplish in a work session. Combine that with the compliance pressure documented in Open-Prem V3: the EU AI Act reaches full enforcement on August 2, 2026, and the IBM Cost of a Data Breach Report found an additional $670,000 per incident when employees used unapproved AI tools. Against that backdrop, the ability to run a frontier-class long-context model on-premises without cloud egress is a meaningful governance advantage.
NemoClaw's security layer, documented in Open-Prem V3, is built for exactly this operational pattern: local inference, deterministic policy enforcement, data that never leaves the building. DeepSeek V4 gives that security-controlled local infrastructure a model that can handle the context lengths enterprise workflows actually require.
What the Release Signals
DeepSeek is releasing V4 as a preview, with model weights available on Hugging Face. The open-source release under a permissive MIT license means organizations can deploy, fine-tune, and run the model without API dependency. The model architecture documentation is detailed enough that organizations building self-hosted inference infrastructure can reason about hardware requirements directly from the published FLOPs and cache figures rather than reverse-engineering them from API costs.
The training stability techniques, the open-sourced MegaMoE kernel, and the TileLang-based fused kernels for efficient inference all suggest DeepSeek is sharing enough of its infrastructure work that the community can reproduce and extend it. For enterprises evaluating whether to build on open-source models for the long term, that track record matters as much as any single benchmark score.
The technical report compares V4-Pro-Max's trajectory to frontier proprietary models and places it three to six months behind the current state of the art in some reasoning categories. That framing is worth examining in context. Three to six months behind a frontier that costs an order of magnitude more per token to access, against a model that requires no cloud data transfer and runs on hardware the organization already owns, is not a comparison that favors the proprietary option for most enterprise workloads.
The V1 paper described a directional bet. The V3 update documented that the bet had paid off. DeepSeek V4 adds a dimension to that picture that V3 could not fully account for: not just a competitive open-source model, but a competitive open-source model purpose-built for the context lengths that autonomous agent workloads require at inference economics that on-premises hardware can support.
