The AI infrastructure landscape is undergoing a fundamental transformation. What were once traditional data centers focused on storage and processing have evolved into what NVIDIA's Shruti Koparkar calls "AI token factories" – facilities whose primary output is intelligence manufactured in the form of tokens.
This shift demands a corresponding change in how we evaluate AI infrastructure economics. Yet many enterprises still get trapped by vanity metrics that don't reflect real-world business outcomes.
The Metrics That Mislead
When evaluating AI infrastructure, most organizations focus on familiar metrics:
- Compute cost – What you pay for AI infrastructure
- FLOPS per dollar – Raw computing power per dollar spent
- Peak chip specifications – Theoretical maximum performance
Here's the problem: these are all input metrics. They tell you what you're putting in, but say nothing about what you're getting out. It's like judging a restaurant by the cost of ingredients instead of the quality and quantity of meals served.
The Metric That Matters: Cost Per Token
Cost per token represents your all-in cost to produce each delivered token – usually measured as cost per million tokens. This metric directly accounts for:
- Hardware performance
- Software optimization
- Ecosystem support
- Real-world utilization
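The arithmetic behind this metric is simple. Here is a minimal sketch of a cost-per-million-tokens calculator; the function name and the example inputs are hypothetical, not from the article:

```python
def cost_per_million_tokens(gpu_hourly_cost: float,
                            tokens_per_second_per_gpu: float) -> float:
    """All-in hourly GPU cost divided by tokens delivered per hour,
    scaled to one million tokens."""
    tokens_per_hour = tokens_per_second_per_gpu * 3600
    return gpu_hourly_cost / tokens_per_hour * 1_000_000

# Example: $2.00 per GPU-hour at 1,000 delivered tokens/s per GPU
print(round(cost_per_million_tokens(2.00, 1000), 4))  # → 0.5556
```

Note that the throughput input is *delivered* tokens per second under real load, not a peak-spec number; that is exactly where hardware, software, and utilization differences show up.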
Think of it as the "inference iceberg." Compute cost sits above the surface – visible and easy to compare. But the real value drivers lie beneath: the software optimization, utilization, and token-throughput factors that determine your actual business outcomes.
The Hidden Factors That Drive Token Output
Instead of asking surface-level questions like "What's the cost per GPU hour?", smart infrastructure evaluations dig deeper:
Performance Questions:
- What's the cost per million tokens for large-scale mixture-of-experts (MoE) models?
- What's the delivered token output per megawatt?
- Can the scale-up interconnect handle the "all-to-all" traffic of MoE models?
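The tokens-per-megawatt question above can be made concrete with a quick power-efficiency sketch. The values here are hypothetical placeholders, not figures from the article:

```python
def tokens_per_sec_per_megawatt(tokens_per_sec_per_gpu: float,
                                watts_per_gpu: float) -> float:
    """Delivered throughput per megawatt of facility power."""
    gpus_per_megawatt = 1_000_000 / watts_per_gpu
    return tokens_per_sec_per_gpu * gpus_per_megawatt

# 1,000 tokens/s per GPU at 1,200 W per GPU (including cooling overhead)
print(f"{tokens_per_sec_per_megawatt(1000, 1200):,.0f} tokens/s per MW")
```

Because power is often the binding constraint in an AI factory, two platforms with the same per-GPU price can differ sharply on this metric.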
Optimization Questions:
- Is FP4 precision supported while maintaining accuracy?
- Does the runtime support speculative decoding or multi-token prediction?
- Does the serving layer support disaggregated serving and KV-cache optimizations?
Real-World Impact: The Numbers Don't Lie
Consider this comparison between NVIDIA's Hopper and Blackwell platforms running the DeepSeek-R1 model:
| Metric | Hopper (HGX H200) | Blackwell (GB300 NVL72) | Improvement |
|---|---|---|---|
| Cost per GPU per Hour | $1.41 | $2.65 | ~2x higher cost |
| PFLOPS per Dollar | 2.8 | 5.6 | 2x better |
| Tokens per Second per GPU | 90 | 6,000 | ~65x better |
| Cost per Million Tokens | $4.20 | $0.12 | 35x lower |
The story these numbers tell is striking. While Blackwell appears 2x more expensive per GPU hour, it delivers 35x lower cost per token – the metric that actually impacts your bottom line.
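You can roughly reproduce the cost-per-token row from the table's own hourly-cost and throughput figures. Small rounding differences versus the published numbers are expected; the function name below is illustrative:

```python
def usd_per_million_tokens(hourly_cost: float, tokens_per_sec: float) -> float:
    """Hourly GPU cost spread over tokens delivered in that hour."""
    return hourly_cost / (tokens_per_sec * 3600) * 1_000_000

hopper = usd_per_million_tokens(1.41, 90)       # ≈ $4.35 (table lists $4.20)
blackwell = usd_per_million_tokens(2.65, 6000)  # ≈ $0.12
print(f"ratio: {hopper / blackwell:.0f}x")      # → ratio: 35x
```

The higher hourly price is swamped by the throughput gain, which is why the bottom-line metric moves 35x in the opposite direction from the top-line one.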
Why This Matters for Your AI Strategy
Optimizing for cost per token drives two critical business outcomes:
1. Minimize Operating Costs
Lower cost per token directly improves profit margins on every AI interaction your business serves.
2. Maximize Revenue Potential
More tokens per second means more intelligence generated from the same infrastructure investment, enabling more AI-powered products and services.
The Prompt Engineering Connection
For those of us in the prompt engineering community, this shift toward cost-per-token analysis has immediate implications. When building AI applications or choosing platforms for prompt-heavy workloads, understanding the true economics helps us:
- Select the most cost-effective platforms for our specific use cases
- Design prompts that balance quality with token efficiency
- Build sustainable AI products that can scale profitably
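For a prompt-heavy workload, these economics are easy to project. A minimal sketch, with entirely hypothetical traffic and pricing assumptions:

```python
def monthly_cost(requests_per_day: int,
                 tokens_per_request: int,
                 usd_per_million_tokens: float) -> float:
    """Projected monthly serving cost (30-day month) for a workload."""
    tokens_per_month = requests_per_day * tokens_per_request * 30
    return tokens_per_month / 1_000_000 * usd_per_million_tokens

# 100k requests/day, 2,000 tokens each (prompt + completion), $0.50/M tokens
print(f"${monthly_cost(100_000, 2000, 0.50):,.2f}")  # → $3,000.00
```

Trimming average tokens per request through tighter prompts scales this cost linearly, which is why token efficiency is a design concern and not just an infrastructure one.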
Making the Right Infrastructure Choice
The key takeaway isn't just about choosing NVIDIA (though their analysis makes a compelling case). It's about fundamentally changing how we evaluate AI infrastructure.
Move beyond the surface-level metrics that are easy to compare but don't reflect business reality. Instead, focus on the integrated optimizations across hardware, software, and ecosystem that drive real token output.
Because in the age of AI token factories, the enterprises that win will be those that produce intelligence most efficiently – not those with the flashiest specs on paper.
Source: Analysis based on "Rethinking AI TCO: Why Cost per Token Is the Only Metric That Matters" by Shruti Koparkar, NVIDIA Blog