"We should self-host to save money on API costs." I hear this regularly from technical leaders. Sometimes they're right. More often, the math doesn't work out the way they expect. LLM inference economics are counterintuitive, and the wrong choice can cost tens of thousands of dollars.
This article provides actual cost breakdowns, with specific numbers, for both cloud API and self-hosted inference. We'll show you when each approach makes sense, how to calculate your break-even point, and the hidden costs that derail infrastructure decisions.
The Cost Landscape in 2026
Before diving into specific numbers, let's establish the current landscape. LLM inference costs have dropped dramatically over the past two years:
- Cloud APIs: GPT-class models that cost $30-60 per million tokens in 2024 now cost $2-15 per million tokens. Claude Sonnet costs $3 per million input tokens, $15 per million output tokens.
- GPU hardware: NVIDIA H100 and H200 GPUs have become more available, though still expensive. A100 prices have stabilized. Consumer-grade GPUs (RTX 4090) offer surprising inference capability.
- Open models: Llama 3.3, Mistral Large, and DeepSeek V3 approach frontier API quality for many tasks. Smaller models (7B-70B parameters) handle routine tasks competently.
The key insight: API costs have fallen faster than most organizations' utilization has grown. Many projects that would have justified self-hosting two years ago no longer do.
Cloud API Costs: The Full Picture
Cloud API pricing is straightforward to understand but easy to miscalculate. Let's break it down.
Per-Token Pricing (February 2026)
Current rates for major providers (per million tokens):
| Model | Input Cost | Output Cost | Notes |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | General-purpose flagship |
| Claude 3.5 Sonnet | $3.00 | $15.00 | Best for coding tasks |
| Claude 3.5 Haiku | $0.80 | $4.00 | Fast, cost-effective |
| GPT-4o mini | $0.15 | $0.60 | Lightweight tasks |
| Gemini 1.5 Pro | $1.25 | $5.00 | 1M context window |
| Gemini 1.5 Flash | $0.075 | $0.30 | Fastest, cheapest |
Calculating Actual Usage
Token counts surprise people. A "simple" customer service interaction might involve:
- System prompt: 500 tokens
- Conversation history: 2,000 tokens
- Retrieved context (RAG): 3,000 tokens
- User message: 100 tokens
- Assistant response: 500 tokens
Total: 6,100 tokens per interaction (5,600 input + 500 output)
At Claude Sonnet pricing: $0.0168 (input) + $0.0075 (output) ≈ $0.024 per interaction
At 10,000 interactions per day: $240/day = $7,200/month
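If you want to sanity-check these figures yourself, here is a minimal Python sketch of the same arithmetic, using the Claude Sonnet rates from the table above; swap in your own model's prices and measured token counts.

```python
# Minimal sketch of the per-interaction arithmetic above.
# Prices are the Claude Sonnet rates from the pricing table (USD per 1M tokens).
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00

def interaction_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request at the rates above."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

per_request = interaction_cost(input_tokens=5_600, output_tokens=500)
print(f"per interaction: ${per_request:.4f}")                        # ~$0.0243
print(f"monthly at 10,000/day: ${per_request * 10_000 * 30:,.0f}")   # ~$7,290 before rounding
```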
What This Actually Costs
Let's model three usage scenarios, using the pricing table above and roughly the same input-heavy token mix as the worked example (a short script after the list reproduces the arithmetic):
Light usage (1,000 requests/day, ~5K tokens/request: 4,500 input + 500 output):
- GPT-4o: ~$490/month
- Claude Sonnet: ~$630/month
- Gemini Flash: ~$15/month
Medium usage (10,000 requests/day, ~6K tokens/request: 5,600 input + 500 output):
- GPT-4o: ~$5,700/month
- Claude Sonnet: ~$7,200/month
- Gemini Flash: ~$170/month
Heavy usage (100,000 requests/day, ~8K tokens/request: 7,000 input + 1,000 output):
- GPT-4o: ~$82,500/month
- Claude Sonnet: ~$108,000/month
- Gemini Flash: ~$2,500/month
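The figures above come from a straightforward sweep over the pricing table. The sketch below reproduces it; the per-request input/output splits are assumptions rather than measurements, so substitute your own traffic profile before drawing conclusions.

```python
# Sketch: monthly API cost for the three scenarios, using the table's prices.
# The per-request token splits are assumptions (roughly 10:1 input:output).

PRICES = {  # USD per 1M tokens: (input, output)
    "gpt-4o":        (2.50, 10.00),
    "claude-sonnet": (3.00, 15.00),
    "gemini-flash":  (0.075, 0.30),
}

SCENARIOS = {  # requests/day, input tokens/request, output tokens/request
    "light":  (1_000,   4_500,   500),
    "medium": (10_000,  5_600,   500),
    "heavy":  (100_000, 7_000, 1_000),
}

for scenario, (per_day, tok_in, tok_out) in SCENARIOS.items():
    for model, (p_in, p_out) in PRICES.items():
        per_request = (tok_in * p_in + tok_out * p_out) / 1_000_000
        monthly = per_request * per_day * 30
        print(f"{scenario:6s} {model:13s} ${monthly:>10,.0f}/month")
```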
Output tokens cost 3-5x more than input tokens, yet most cost calculations underestimate output length. A detailed response, code generation, or structured output easily reaches 1,000+ tokens. Always measure actual output lengths before projecting costs.
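The cheapest way to avoid that mistake is to count tokens from real traffic rather than guessing. Here is a minimal sketch using tiktoken, which matches OpenAI-style tokenizers; other providers tokenize differently, and most APIs return exact usage counts in their responses, which is the more reliable source.

```python
# Sketch: measure real token counts from logged prompts and responses.
# tiktoken covers OpenAI tokenizers; treat counts as approximate for other
# providers, or read the usage fields returned by the API itself.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # encoding used by GPT-4o-class models

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

sample_response = "Here is the refactored function, with error handling added ..."
print(count_tokens(sample_response))
```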
Self-Hosted Inference Costs: The Complete Accounting
Self-hosted inference appears cheaper when you only look at GPU time. The full cost picture includes many more factors.
Hardware Options
Cloud GPU instances (per hour):
- AWS p4d.24xlarge (8x A100 40GB): ~$32.77/hour
- AWS p5.48xlarge (8x H100 80GB): ~$98.32/hour
- GCP a2-highgpu-8g (8x A100 40GB): ~$29.38/hour
- Lambda Labs A100 (1x A100 80GB): ~$1.29/hour
Dedicated/owned hardware:
- NVIDIA H100 SXM: ~$30,000 (when available)
- NVIDIA A100 80GB: ~$15,000
- NVIDIA RTX 4090: ~$1,600
Model Requirements
GPU memory requirements for different model sizes (a rough sizing sketch follows the list):
- 7B-8B parameters (Llama 3.1 8B, Mistral 7B): ~16GB at FP16, ~8GB at INT8 → Fits on a single RTX 4090
- 13B parameters: 26GB at FP16, 13GB at INT8 → Fits on single A100 40GB
- 70B parameters (Llama 3.3 70B): 140GB at FP16, 70GB at INT8 → Requires 2x A100 80GB or 8x RTX 4090
- 405B parameters (Llama 3.1 405B): 810GB at FP16 → Requires 16x H100 80GB at FP16, or a single 8x H100 node with FP8 quantization
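These figures follow from a simple rule of thumb: weight memory is roughly parameter count times bytes per parameter, and KV cache, activations, and serving overhead come on top. A small sketch, treating the results as lower bounds rather than exact requirements:

```python
# Rough rule of thumb: weight memory only (parameters x bytes per parameter).
# KV cache, activations, and serving overhead add more (often 20-50%+),
# so treat these numbers as lower bounds.

def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param  # 1B params at 1 byte/param ~ 1 GB

for params in (8, 70, 405):
    fp16 = weight_memory_gb(params, 2.0)   # FP16/BF16
    int8 = weight_memory_gb(params, 1.0)   # INT8
    int4 = weight_memory_gb(params, 0.5)   # INT4
    print(f"{params}B: ~{fp16:.0f} GB FP16, ~{int8:.0f} GB INT8, ~{int4:.0f} GB INT4")
```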
The Hidden Costs
Hardware is just the beginning. Full self-hosted costs include:
1. Infrastructure overhead:
- Load balancing and request routing
- Monitoring and alerting
- Logging and debugging infrastructure
- Model versioning and rollback capability
- Typically adds 20-40% to base compute costs
2. Operations labor:
- GPU expertise isn't cheap: $150-250K/year for ML infrastructure engineers
- At minimum, part-time attention from an engineer: ~$50K/year equivalent
- On-call rotation for production systems
3. Reliability engineering:
- Redundancy (you need spare capacity for failover)
- Multi-region deployment for availability
- Health checks and automatic recovery
4. Model management:
- Downloading, converting, and optimizing models
- Keeping up with new model releases
- Tuning inference parameters (batch size, quantization, etc.)
5. Security and compliance:
- Network isolation
- Access control and audit logging
- Compliance certification (if applicable)
Realistic Self-Hosted Cost Calculation
Let's calculate the true cost of running Llama 70B for medium-scale inference:
Hardware: 2x A100 80GB cloud GPU instances
- Instance cost: ~$8/hour × 730 hours = $5,840/month
- Reserved instances (1 year): ~$3,500/month
Infrastructure (40% overhead): ~$1,400/month
Operations (part-time engineer): ~$4,000/month
Total: ~$8,900/month (using the reserved-instance rate)
Throughput at this cost: With optimized serving (vLLM, TensorRT-LLM), expect ~50-100 tokens/second sustained throughput. That's ~130-260 million tokens per month.
Effective cost: ~$0.03-0.07 per 1,000 tokens
Compare to Claude Sonnet at $0.003-0.015 per 1,000 tokens. Self-hosted Llama 70B costs 2-20x more per token than Claude Sonnet, even with conservative labor estimates.
This calculation assumes high utilization. Most self-hosted deployments run at 20-40% utilization; traffic is bursty, not constant. At 30% utilization, your effective per-token cost roughly triples. APIs only charge for what you use.
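To see how much utilization dominates the result, here is a small sketch of the effective per-token calculation; the $8,900/month and 50-100 tokens/second figures are this section's estimates, not benchmarks of your workload.

```python
# Sketch: effective cost per 1K tokens for the self-hosted setup above,
# as a function of utilization. All inputs are this section's estimates.

MONTHLY_COST = 8_900           # USD/month: reserved 2x A100 + overhead + labor
PEAK_TOKENS_PER_SEC = 75       # midpoint of the 50-100 tokens/second estimate
SECONDS_PER_MONTH = 730 * 3600

def cost_per_1k_tokens(utilization: float) -> float:
    tokens_served = PEAK_TOKENS_PER_SEC * SECONDS_PER_MONTH * utilization
    return MONTHLY_COST / tokens_served * 1_000

for u in (1.0, 0.5, 0.3):
    print(f"{u:.0%} utilization: ${cost_per_1k_tokens(u):.3f} per 1K tokens")
# 100% -> ~$0.045, 50% -> ~$0.090, 30% -> ~$0.151
```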
When Self-Hosting Makes Sense
Despite the economics often favoring APIs, self-hosting is the right choice in several scenarios:
1. Extreme Volume
At very high volumes, self-hosting eventually wins. The crossover point depends on model size and utilization, but roughly:
- 7B models: Self-hosting wins above ~$2,000/month in API costs
- 70B models: Self-hosting wins above ~$15,000/month in API costs
- Frontier-class models: Self-hosting rarely wins; APIs include ongoing model improvements
Critical caveat: These numbers assume high utilization. If your traffic is bursty, the crossover point is much higher.
2. Latency Requirements
API calls include network latency. For applications requiring sub-100ms responses, self-hosting with local inference may be necessary. Common scenarios:
- Real-time gaming or interactive applications
- High-frequency trading or automated decision systems
- Inline code completion with strict latency budgets
3. Data Sovereignty
When data cannot leave your infrastructure for regulatory or contractual reasons:
- Healthcare with PHI constraints
- Financial services with strict data handling requirements
- Government or defense applications
- Contracts that prohibit third-party data processing
Note: Many cloud APIs now offer data residency guarantees and HIPAA/SOC2 compliance. Check current offerings before assuming self-hosting is required.
4. Fine-Tuned Models
If you've fine-tuned a model on proprietary data and the fine-tuning provides significant quality improvement, self-hosting that model makes sense. API providers offer fine-tuning, but serving your own fine-tuned model gives more control.
5. Offline or Air-Gapped Environments
Environments without reliable internet access require local inference. Common in:
- Edge deployments
- Secure facilities
- Remote operations
When APIs Win
Cloud APIs are the better choice in most scenarios:
1. Variable or Unpredictable Traffic
APIs charge per request. Self-hosting charges for capacity whether you use it or not. If your traffic varies significantly (seasonal businesses, products in a growth phase, experimental applications), APIs provide better economics.
2. Need for Frontier Capabilities
Self-hosted open models lag frontier APIs by 6-18 months in capability. If you need the best available models for complex reasoning, coding, or analysis, APIs are the only option.
3. Limited ML Infrastructure Expertise
The operational burden of self-hosting is significant. Organizations without ML infrastructure expertise should strongly prefer APIs. The time spent debugging CUDA errors, optimizing batch sizes, and managing model versions is time not spent on your actual product.
4. Rapid Iteration
APIs let you switch models instantly. Self-hosting requires redeployment. During product development when you're iterating on prompts and comparing models, API flexibility is valuable.
5. Multi-Model Strategies
Modern AI applications often use multiple models: a fast model for simple tasks, a capable model for complex reasoning, specialized models for specific domains. Managing this diversity is trivial with APIs, complex with self-hosting.
The Break-Even Calculation
Here's a framework for calculating your break-even point:
Step 1: Calculate Current API Costs
Track actual usage for at least one month. Note:
- Total tokens (input and output separately)
- Peak vs. average utilization
- Request distribution across models
Step 2: Estimate Self-Hosting Costs
For each model you'd self-host:
- Hardware cost (cloud GPU instances or owned hardware amortized)
- Infrastructure overhead (40% multiplier)
- Operations labor (minimum $4K/month for part-time attention)
- Adjust for your realistic utilization rate
Step 3: Compare Total Cost of Ownership
Self-hosting makes sense when:
(API monthly cost) > (Self-hosted monthly cost) × 1.5
The 1.5x multiplier accounts for the hidden friction costs: debugging time, missed features, slower iteration. If the math is close, APIs usually win on total value.
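As a sketch, the whole framework reduces to a few lines. The defaults below (40% overhead, $4K/month labor, the 1.5x friction multiplier) are the estimates used earlier in this article; replace them with your own measurements.

```python
# Sketch of the break-even check. Inputs are this article's estimates;
# substitute measured API spend and your own hardware/labor numbers.

FRICTION_MULTIPLIER = 1.5  # debugging time, missed features, slower iteration

def self_hosted_monthly(hardware: float, overhead_pct: float = 0.40,
                        labor: float = 4_000) -> float:
    """Estimated monthly self-hosting cost: hardware + infra overhead + labor."""
    return hardware * (1 + overhead_pct) + labor

def self_hosting_wins(api_monthly: float, self_hosted: float) -> bool:
    return api_monthly > self_hosted * FRICTION_MULTIPLIER

estimate = self_hosted_monthly(hardware=3_500)   # the 70B example above: ~$8,900
print(self_hosting_wins(api_monthly=12_000, self_hosted=estimate))  # False: stay on APIs
print(self_hosting_wins(api_monthly=20_000, self_hosted=estimate))  # True: worth modeling further
```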
Step 4: Consider the Non-Financial Factors
- Do you have the expertise to operate GPU infrastructure?
- Is latency a critical constraint?
- Are there data residency requirements?
- How important is access to the latest models?
Hybrid Strategies
The best approach for many organizations is hybrid: APIs for most workloads, self-hosting for specific use cases.
Common Hybrid Patterns
Tiered by complexity: Self-host a small model (7B-13B) for simple tasks like classification, summarization, and extraction. Use APIs for complex reasoning and generation. (A routing sketch follows this list.)
Tiered by volume: Self-host for high-volume, predictable workloads. Use APIs for overflow and burst capacity.
Tiered by sensitivity: Self-host for sensitive data that can't leave your infrastructure. Use APIs for everything else.
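A complexity-tiered router can start as little more than a lookup on task type plus a size check. The sketch below is illustrative only: the backend names and the heuristic are placeholders, and production routers usually rely on richer signals (token budget, user tier, a lightweight classifier).

```python
# Illustrative router sketch; backend names and thresholds are placeholders.

SIMPLE_TASKS = {"classify", "extract", "summarize"}

def route(task_type: str, prompt: str) -> str:
    """Decide which backend handles a request."""
    if task_type in SIMPLE_TASKS and len(prompt) < 8_000:
        return "self-hosted-7b"   # e.g. a small model behind your own serving stack
    return "cloud-api"            # frontier model via a hosted API

print(route("classify", "Is this ticket a billing or a technical issue?"))    # self-hosted-7b
print(route("generate", "Draft a migration plan for our payments service."))  # cloud-api
```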
Edge Inference for Latency-Sensitive Tasks
Small models (7B) run surprisingly well on consumer hardware. For latency-sensitive applications, consider:
- Local inference on user devices (mobile, laptop)
- Edge servers for geographic latency reduction
- Hybrid: local for fast initial response, API for complex follow-up
Practical Recommendations
If you're under $5,000/month in API costs: Don't self-host. The operational overhead exceeds any possible savings. Focus on prompt optimization and model selection to reduce API costs.
If you're at $5,000-$20,000/month: Analyze your workload. If a significant portion is simple tasks (classification, extraction, summarization), self-hosting a small model might make sense. Keep APIs for complex tasks.
If you're above $20,000/month: Self-hosting probably makes sense for at least part of your workload. Invest in the infrastructure expertise to do it well. Consider a hybrid approach.
Regardless of spend: Always have API fallback capability. Self-hosted infrastructure fails. Having an API backup prevents outages from becoming incidents.
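A minimal sketch of that fallback path, assuming the self-hosted model sits behind an OpenAI-compatible HTTP endpoint (as servers like vLLM provide); the URL, model name, and cloud-API stub are placeholders for your actual clients.

```python
# Sketch: try the self-hosted endpoint first, fall back to a cloud API on failure.
# The endpoint URL and model name are hypothetical placeholders.
import requests

def call_self_hosted(prompt: str, timeout: float = 10.0) -> str:
    # Placeholder: an OpenAI-compatible completions endpoint on your own cluster.
    r = requests.post(
        "http://inference.internal/v1/completions",
        json={"model": "llama-3.3-70b", "prompt": prompt, "max_tokens": 500},
        timeout=timeout,
    )
    r.raise_for_status()
    return r.json()["choices"][0]["text"]

def call_cloud_api(prompt: str) -> str:
    # Placeholder: swap in your provider's SDK call here.
    raise NotImplementedError("wire up your API client")

def generate_with_fallback(prompt: str) -> str:
    """Primary: self-hosted. Backup: hosted API, so an outage degrades instead of failing."""
    try:
        return call_self_hosted(prompt)
    except Exception:
        return call_cloud_api(prompt)
```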
Key Takeaways
- API costs have dropped faster than most people realize; recalculate your assumptions
- Self-hosting's hidden costs (labor, infrastructure, low utilization) typically push the true total to 2-3x the raw GPU cost
- APIs win for most workloads under $10-15K/month
- Self-hosting wins for extreme volume, strict latency, or data sovereignty requirements
- Hybrid strategies (small models self-hosted, frontier models via API) often optimal
- Always maintain API fallback capability regardless of primary infrastructure