"We should self-host to save money on API costs." I hear this regularly from technical leaders. Sometimes they're right. More often, the math doesn't work out the way they expect. LLM inference economics are counterintuitive, and the wrong choice can cost tens of thousands of dollars.

This article provides actual cost breakdowns, with specific numbers, for both cloud API and self-hosted inference. We'll show you when each approach makes sense, how to calculate your break-even point, and the hidden costs that derail infrastructure decisions.

The Cost Landscape in 2026

Before diving into specific numbers, let's establish the current landscape. LLM inference costs have dropped dramatically over the past two years.

The key insight: API costs have fallen faster than most organizations' utilization has grown. Many projects that would have justified self-hosting two years ago no longer do.

Cloud API Costs: The Full Picture

Cloud API pricing is straightforward to understand but easy to miscalculate. Let's break it down.

Per-Token Pricing (February 2026)

Current rates for major providers (per million tokens):

| Model | Input Cost | Output Cost | Notes |
|-------|------------|-------------|-------|
| GPT-4o | $2.50 | $10.00 | General-purpose flagship |
| Claude 3.5 Sonnet | $3.00 | $15.00 | Best for coding tasks |
| Claude 3.5 Haiku | $0.80 | $4.00 | Fast, cost-effective |
| GPT-4o mini | $0.15 | $0.60 | Lightweight tasks |
| Gemini 1.5 Pro | $1.25 | $5.00 | 1M context window |
| Gemini 1.5 Flash | $0.075 | $0.30 | Fastest, cheapest |

Calculating Actual Usage

Token counts surprise people. A "simple" customer service interaction (system prompt, conversation history, retrieved context, user message, and the generated response) adds up quickly:

Total: 6,100 tokens per interaction (5,600 input + 500 output)

At Claude 3.5 Sonnet pricing: $0.0168 (input) + $0.0075 (output) ≈ $0.024 per interaction

At 10,000 interactions per day: $240/day = $7,200/month
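
To make the arithmetic easy to reproduce, here is a minimal sketch of the per-interaction calculation using the Claude 3.5 Sonnet rates from the table above; the token counts are the example figures, not measurements.

```python
# Per-interaction cost at Claude 3.5 Sonnet rates ($3/M input, $15/M output).
INPUT_RATE = 3.00 / 1_000_000    # dollars per input token
OUTPUT_RATE = 15.00 / 1_000_000  # dollars per output token

def interaction_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

cost = interaction_cost(5_600, 500)
print(f"${cost:.4f} per interaction")            # ~$0.0243
print(f"${cost * 10_000 * 30:,.0f} per month")   # ~$7,290 at 10,000 interactions/day
```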

What This Actually Costs

Let's model three usage scenarios (a rough cost sketch follows the list):

Light usage: 1,000 requests/day, 5K tokens/request

Medium usage: 10,000 requests/day, 6K tokens/request

Heavy usage: 100,000 requests/day, 8K tokens/request
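
Here is one way to script that projection. The 90/10 input/output token split is an assumption for illustration (the customer service example above is roughly 92/8); substitute your measured ratio and the provider rates you actually use.

```python
# Rough monthly-cost projection for the three scenarios above, across a few providers.
# Assumes ~90% of tokens are input and ~10% output; adjust to your measured split.
RATES = {  # dollars per million tokens (input, output), from the pricing table above
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Claude 3.5 Haiku": (0.80, 4.00),
    "GPT-4o mini": (0.15, 0.60),
}
SCENARIOS = {  # requests/day, tokens/request
    "Light": (1_000, 5_000),
    "Medium": (10_000, 6_000),
    "Heavy": (100_000, 8_000),
}

def monthly_cost(req_per_day, tokens_per_req, in_rate, out_rate, output_share=0.10, days=30):
    tokens = req_per_day * tokens_per_req * days
    input_cost = tokens * (1 - output_share) * in_rate / 1_000_000
    output_cost = tokens * output_share * out_rate / 1_000_000
    return input_cost + output_cost

for name, (reqs, toks) in SCENARIOS.items():
    costs = ", ".join(
        f"{model}: ${monthly_cost(reqs, toks, *rate):,.0f}" for model, rate in RATES.items()
    )
    print(f"{name}: {costs}")
```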

The Output Token Trap

Output tokens cost 4-5x more than input tokens at the rates above, yet most cost calculations underestimate output length. A detailed response, code generation, or structured output easily reaches 1,000+ tokens. Always measure actual output lengths before projecting costs; a measurement sketch follows.
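
A quick way to check is to compute token-length statistics over a sample of logged responses. The sketch below uses tiktoken's cl100k_base encoding, which is an OpenAI tokenizer, so treat counts for other providers as approximations.

```python
# Measure actual output lengths before projecting costs.
# tiktoken's cl100k_base is an OpenAI encoding; counts for other providers are approximate.
import statistics
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def output_token_stats(responses: list[str]) -> dict:
    lengths = sorted(len(enc.encode(text)) for text in responses)
    return {
        "mean": statistics.mean(lengths),
        "p95": lengths[int(0.95 * (len(lengths) - 1))],
        "max": lengths[-1],
    }

# responses = ...  # load a sample of logged model outputs, however you store them
# print(output_token_stats(responses))
```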

Self-Hosted Inference Costs: The Complete Accounting

Self-hosted inference appears cheaper when you only look at GPU time. The full cost picture includes many more factors.

Hardware Options

Cloud GPU instances (A100, H100 class) are billed per hour with no upfront commitment; rates vary widely by provider, region, and commitment term.

Dedicated or owned hardware trades a large upfront purchase for lower steady-state cost, but you carry depreciation, power, and hosting yourself.

Model Requirements

GPU memory requirements scale with model size: at FP16 precision, weights take roughly 2 bytes per parameter (about 14 GB for a 7B model, about 140 GB for 70B), plus headroom for the KV cache and batching. 8-bit and 4-bit quantization cut the weight footprint by roughly 2x and 4x.
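
As a back-of-the-envelope sketch of that rule of thumb (weights plus a fixed KV-cache allowance, which is an assumption; real requirements depend on quantization, context length, and batch size):

```python
# Rough GPU memory estimate: weights plus a flat KV-cache/overhead allowance.
# The 10 GB allowance is an assumption; actual needs depend on context length and batch size.
def gpu_memory_gb(params_billion: float, bytes_per_param: float = 2.0, overhead_gb: float = 10.0) -> float:
    weights_gb = params_billion * bytes_per_param  # FP16 ~2 bytes/param; INT8 ~1; INT4 ~0.5
    return weights_gb + overhead_gb

print(gpu_memory_gb(7))   # ~24 GB at FP16
print(gpu_memory_gb(70))  # ~150 GB at FP16, hence 2x A100 80GB in the example below
```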

The Hidden Costs

Hardware is just the beginning. Full self-hosted costs include:

1. Infrastructure overhead: networking, storage, monitoring, and orchestration around the GPUs

2. Operations labor: someone has to deploy, tune, and maintain the serving stack

3. Reliability engineering: redundancy, failover, and on-call coverage for when inference goes down

4. Model management: downloading, quantizing, versioning, and upgrading models

5. Security and compliance: access controls, audit logging, and keeping the stack certifiable

Realistic Self-Hosted Cost Calculation

Let's calculate the true cost of running Llama 70B for medium-scale inference:

Hardware: 2x A100 80GB on AWS: ~$3,500/month

Infrastructure (40% overhead): ~$1,400/month

Operations (part-time engineer): ~$4,000/month

Total: ~$8,900/month

Throughput at this cost: With optimized serving (vLLM, TensorRT-LLM), expect ~50-100 tokens/second sustained throughput. That's ~130-260 million tokens per month.

Effective cost: ~$0.03-0.07 per 1,000 tokens

Compare to Claude Sonnet at $0.003-0.015 per 1,000 tokens. Self-hosted Llama 70B costs 2-20x more per token than Claude Sonnet, even with conservative labor estimates.

The Utilization Trap

This calculation assumes high utilization. Most self-hosted deployments run at 20-40% utilization; traffic is bursty, not constant. At 30% utilization, your effective per-token cost roughly triples. APIs only charge for what you use.
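
To see how utilization drives the effective rate, here is a small sketch that reproduces the figures above from monthly cost and sustained throughput; the inputs are this article's estimates, not benchmarks.

```python
# Effective self-hosted cost per 1K tokens as a function of utilization.
# Inputs are the Llama 70B estimates from the example above, not measured benchmarks.
MONTHLY_COST = 8_900                 # dollars: hardware + infrastructure + operations
SECONDS_PER_MONTH = 30 * 24 * 3600

def cost_per_1k_tokens(tokens_per_second: float, utilization: float) -> float:
    tokens_per_month = tokens_per_second * SECONDS_PER_MONTH * utilization
    return MONTHLY_COST / tokens_per_month * 1_000

for util in (1.0, 0.3):
    best = cost_per_1k_tokens(100, util)   # optimistic throughput
    worst = cost_per_1k_tokens(50, util)   # conservative throughput
    print(f"{util:.0%} utilization: ${best:.3f}-${worst:.3f} per 1K tokens")
# 100%: ~$0.034-$0.069; 30%: ~$0.114-$0.229 (vs. $0.003-$0.015 for Claude Sonnet)
```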

When Self-Hosting Makes Sense

Despite the economics often favoring APIs, self-hosting is the right choice in several scenarios:

1. Extreme Volume

At very high volumes, self-hosting eventually wins. The crossover point depends heavily on model size and utilization.

Critical caveat: any crossover estimate assumes high utilization. If your traffic is bursty, the crossover point is much higher.

2. Latency Requirements

API calls include network latency. For applications requiring sub-100ms responses (real-time and interactive features where users notice every added millisecond), self-hosting with local inference may be necessary.

3. Data Sovereignty

When data cannot leave your infrastructure for regulatory or contractual reasons, self-hosting may be the only viable option.

Note: Many cloud APIs now offer data residency guarantees and HIPAA/SOC2 compliance. Check current offerings before assuming self-hosting is required.

4. Fine-Tuned Models

If you've fine-tuned a model on proprietary data and the fine-tuning provides significant quality improvement, self-hosting that model makes sense. API providers offer fine-tuning, but serving your own fine-tuned model gives more control.

5. Offline or Air-Gapped Environments

Environments without reliable internet access require local inference. This is common in air-gapped secure facilities, remote field deployments, and other settings where connectivity cannot be assumed.

When APIs Win

Cloud APIs are the better choice in most scenarios:

1. Variable or Unpredictable Traffic

APIs charge per request. Self-hosting charges for capacity whether you use it or not. If your traffic varies significantly (seasonal businesses, products in a growth phase, experimental applications), APIs provide better economics.

2. Need for Frontier Capabilities

Self-hosted open models lag frontier APIs by 6-18 months in capability. If you need the best available models for complex reasoning, coding, or analysis, APIs are the only option.

3. Limited ML Infrastructure Expertise

The operational burden of self-hosting is significant. Organizations without ML infrastructure expertise should strongly prefer APIs. The time spent debugging CUDA errors, optimizing batch sizes, and managing model versions is time not spent on your actual product.

4. Rapid Iteration

APIs let you switch models instantly. Self-hosting requires redeployment. During product development when you're iterating on prompts and comparing models, API flexibility is valuable.

5. Multi-Model Strategies

Modern AI applications often use multiple models: a fast model for simple tasks, a capable model for complex reasoning, specialized models for specific domains. Managing this diversity is trivial with APIs, complex with self-hosting.

The Break-Even Calculation

Here's a framework for calculating your break-even point:

Step 1: Calculate Current API Costs

Track actual usage for at least one month, including request volume, input and output token counts per request, and how traffic varies over the day and week.

Step 2: Estimate Self-Hosting Costs

For each model you'd self-host, estimate hardware, infrastructure overhead, operations labor, and realistic utilization, as in the Llama 70B example above.

Step 3: Compare Total Cost of Ownership

Self-hosting makes sense when:

(API monthly cost) > (Self-hosted monthly cost) × 1.5

The 1.5x multiplier accounts for the hidden friction costs: debugging time, missed features, slower iteration. If the math is close, APIs usually win on total value.
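
The decision rule from Step 3 is easy to encode; in the sketch below the inputs are whatever you measured in Steps 1 and 2, and the example figures are illustrative only.

```python
# Break-even check: self-host only when API spend clears the hidden-friction margin.
FRICTION_MULTIPLIER = 1.5  # debugging time, missed features, slower iteration

def should_self_host(api_monthly_cost: float, self_hosted_monthly_cost: float) -> bool:
    return api_monthly_cost > self_hosted_monthly_cost * FRICTION_MULTIPLIER

# Illustrative figures from earlier in the article:
print(should_self_host(api_monthly_cost=7_200, self_hosted_monthly_cost=8_900))   # False
print(should_self_host(api_monthly_cost=25_000, self_hosted_monthly_cost=8_900))  # True
```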

Step 4: Consider the Non-Financial Factors

Latency requirements, data sovereignty, access to frontier capabilities, and iteration speed (all discussed above) can outweigh a modest cost difference in either direction.

Hybrid Strategies

The best approach for many organizations is hybrid: APIs for most workloads, self-hosting for specific use cases.

Common Hybrid Patterns

Tiered by complexity: Self-host a small model (7B-13B) for simple tasks like classification, summarization, and extraction. Use APIs for complex reasoning and generation.
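
A minimal routing sketch for this tier might look like the following; the task labels, endpoint, and helper functions are placeholders, and production routers usually add confidence thresholds and fallbacks.

```python
# Hypothetical complexity-tier router: simple tasks hit a self-hosted small model,
# everything else goes to a cloud API. All names below are placeholders.
SIMPLE_TASKS = {"classification", "summarization", "extraction"}

def call_self_hosted(prompt: str) -> str:
    # Placeholder: POST to your internal inference server (e.g., a 7B-13B model behind vLLM).
    return f"[self-hosted response to: {prompt[:40]}]"

def call_cloud_api(prompt: str) -> str:
    # Placeholder: call your API provider's SDK for complex reasoning and generation.
    return f"[API response to: {prompt[:40]}]"

def route(task_type: str, prompt: str) -> str:
    return call_self_hosted(prompt) if task_type in SIMPLE_TASKS else call_cloud_api(prompt)

print(route("classification", "Is this ticket about billing or shipping?"))
```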

Tiered by volume: Self-host for high-volume, predictable workloads. Use APIs for overflow and burst capacity.

Tiered by sensitivity: Self-host for sensitive data that can't leave your infrastructure. Use APIs for everything else.

Edge Inference for Latency-Sensitive Tasks

Small models (7B) run surprisingly well on consumer hardware, which makes local or edge inference a practical option for latency-sensitive applications.

Practical Recommendations

If you're under $5,000/month in API costs: Don't self-host. The operational overhead exceeds any possible savings. Focus on prompt optimization and model selection to reduce API costs.

If you're at $5,000-$20,000/month: Analyze your workload. If a significant portion is simple tasks (classification, extraction, summarization), self-hosting a small model might make sense. Keep APIs for complex tasks.

If you're above $20,000/month: Self-hosting probably makes sense for at least part of your workload. Invest in the infrastructure expertise to do it well. Consider a hybrid approach.

Regardless of spend: Always have API fallback capability. Self-hosted infrastructure fails. Having an API backup prevents outages from becoming incidents.
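
One simple way to keep that fallback in place is to wrap the self-hosted call and fail over on errors or timeouts. In the sketch below the endpoint URL, response shape, and fallback function are placeholders; production code would add retries, alerting, and circuit breaking.

```python
# Hypothetical failover wrapper: try the self-hosted endpoint, fall back to a cloud API.
# The URL, response shape, and fallback are placeholders for illustration.
import requests

SELF_HOSTED_URL = "http://inference.internal:8000/v1/completions"  # placeholder

def generate_with_fallback(prompt: str, timeout: float = 5.0) -> str:
    try:
        resp = requests.post(SELF_HOSTED_URL, json={"prompt": prompt}, timeout=timeout)
        resp.raise_for_status()
        return resp.json()["text"]       # response shape depends on your serving stack
    except (requests.RequestException, KeyError):
        return call_cloud_api(prompt)    # placeholder for your provider SDK call

def call_cloud_api(prompt: str) -> str:
    return f"[API fallback response to: {prompt[:40]}]"
```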

Key Takeaways
  • API costs have dropped faster than most people realize; recalculate your assumptions
  • Self-hosting's hidden costs (labor, infrastructure, low utilization) typically bring the true cost to 2-3x raw GPU costs
  • APIs win for most workloads under $10-15K/month
  • Self-hosting wins for extreme volume, strict latency, or data sovereignty requirements
  • Hybrid strategies (small models self-hosted, frontier models via API) often optimal
  • Always maintain API fallback capability regardless of primary infrastructure