"We should self-host to save money on API costs." I hear this regularly from technical leaders. Sometimes they're right. More often, the math doesn't work out the way they expect. LLM inference economics are counterintuitive, and the wrong choice can cost tens of thousands of dollars.
This article provides actual cost breakdowns, with specific numbers, for both cloud API and self-hosted inference. We'll show you when each approach makes sense, how to calculate your break-even point, and the hidden costs that derail infrastructure decisions.
The Cost Landscape in 2026
Before diving into specific numbers, let's establish the current landscape. LLM inference costs have dropped dramatically over the past two years:
- Cloud APIs: GPT-class models that cost $30-60 per million tokens in 2024 now cost $2-15 per million tokens. Claude Sonnet costs $3 per million input tokens, $15 per million output tokens.
- GPU hardware: NVIDIA H100 and H200 GPUs have become more available, though still expensive. A100 prices have stabilized. Consumer-grade GPUs (RTX 4090) offer surprising inference capability.
- Open models: Llama 3.3, Mistral Large, and DeepSeek V3 approach frontier API quality for many tasks. Smaller models (7B-70B parameters) handle routine tasks competently.
The key insight: API costs have fallen faster than most organizations' utilization has grown. Many projects that would have justified self-hosting two years ago no longer do.
Cloud API Costs: The Full Picture
Cloud API pricing is straightforward to understand but easy to miscalculate. Let's break it down.
Per-Token Pricing (February 2026)
Current rates for major providers (per million tokens):
| Model | Input Cost | Output Cost | Notes |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | General-purpose flagship |
| Claude 3.5 Sonnet | $3.00 | $15.00 | Best for coding tasks |
| Claude 3.5 Haiku | $0.80 | $4.00 | Fast, cost-effective |
| GPT-4o mini | $0.15 | $0.60 | Lightweight tasks |
| Gemini 1.5 Pro | $1.25 | $5.00 | 1M context window |
| Gemini 1.5 Flash | $0.075 | $0.30 | Fastest, cheapest |
Calculating Actual Usage
Token counts surprise people. A "simple" customer service interaction might involve:
- System prompt: 500 tokens
- Conversation history: 2,000 tokens
- Retrieved context (RAG): 3,000 tokens
- User message: 100 tokens
- Assistant response: 500 tokens
Total: 6,100 tokens per interaction (5,600 input + 500 output)
At Claude Sonnet pricing: $0.0168 (input) + $0.0075 (output) ≈ $0.024 per interaction
At 10,000 interactions per day: $240/day = $7,200/month
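If you want to sanity-check these figures yourself, here is a minimal Python sketch of the same arithmetic, using the Claude Sonnet rates from the table above; swap in your own model's prices and measured token counts.

```python
# Minimal sketch of the per-interaction arithmetic above.
# Prices are the Claude Sonnet rates from the pricing table (USD per 1M tokens).
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00

def interaction_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request at the rates above."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

per_request = interaction_cost(input_tokens=5_600, output_tokens=500)
print(f"per interaction: ${per_request:.4f}")                        # ~$0.0243
print(f"monthly at 10,000/day: ${per_request * 10_000 * 30:,.0f}")   # ~$7,290 before rounding
```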
What This Actually Costs
Let's model three usage scenarios, using the pricing table above and roughly the same input-heavy token mix as the worked example (a short script after the list reproduces the arithmetic):
Light usage (1,000 requests/day, ~5K tokens/request: 4,500 input + 500 output):
- GPT-4o: ~$490/month
- Claude Sonnet: ~$630/month
- Gemini Flash: ~$15/month
Medium usage (10,000 requests/day, ~6K tokens/request: 5,600 input + 500 output):
- GPT-4o: ~$5,700/month
- Claude Sonnet: ~$7,200/month
- Gemini Flash: ~$170/month
Heavy usage (100,000 requests/day, ~8K tokens/request: 7,000 input + 1,000 output):
- GPT-4o: ~$82,500/month
- Claude Sonnet: ~$108,000/month
- Gemini Flash: ~$2,500/month
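The figures above come from a straightforward sweep over the pricing table. The sketch below reproduces it; the per-request input/output splits are assumptions rather than measurements, so substitute your own traffic profile before drawing conclusions.

```python
# Sketch: monthly API cost for the three scenarios, using the table's prices.
# The per-request token splits are assumptions (roughly 10:1 input:output).

PRICES = {  # USD per 1M tokens: (input, output)
    "gpt-4o":        (2.50, 10.00),
    "claude-sonnet": (3.00, 15.00),
    "gemini-flash":  (0.075, 0.30),
}

SCENARIOS = {  # requests/day, input tokens/request, output tokens/request
    "light":  (1_000,   4_500,   500),
    "medium": (10_000,  5_600,   500),
    "heavy":  (100_000, 7_000, 1_000),
}

for scenario, (per_day, tok_in, tok_out) in SCENARIOS.items():
    for model, (p_in, p_out) in PRICES.items():
        per_request = (tok_in * p_in + tok_out * p_out) / 1_000_000
        monthly = per_request * per_day * 30
        print(f"{scenario:6s} {model:13s} ${monthly:>10,.0f}/month")
```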
Output tokens cost 3-5x more than input tokens, yet most cost calculations underestimate output length. A detailed response, code generation, or structured output easily reaches 1,000+ tokens. Always measure actual output lengths before projecting costs.
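The cheapest way to avoid that mistake is to count tokens from real traffic rather than guessing. Here is a minimal sketch using tiktoken, which matches OpenAI-style tokenizers; other providers tokenize differently, and most APIs return exact usage counts in their responses, which is the more reliable source.

```python
# Sketch: measure real token counts from logged prompts and responses.
# tiktoken covers OpenAI tokenizers; treat counts as approximate for other
# providers, or read the usage fields returned by the API itself.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # encoding used by GPT-4o-class models

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

sample_response = "Here is the refactored function, with error handling added ..."
print(count_tokens(sample_response))
```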
Self-Hosted Inference Costs: The Complete Accounting
Self-hosted inference appears cheaper when you only look at GPU time. The full cost picture includes many more factors.
Hardware Options
Cloud GPU instances (per hour):
- AWS p4d.24xlarge (8x A100 40GB): ~$32.77/hour
- AWS p5.48xlarge (8x H100 80GB): ~$98.32/hour
- GCP a2-highgpu-8g (8x A100 40GB): ~$29.38/hour
- Lambda Labs A100 (1x A100 80GB): ~$1.29/hour
Dedicated/owned hardware:
- NVIDIA H100 SXM: ~$30,000 (when available)
- NVIDIA A100 80GB: ~$15,000
- NVIDIA RTX 4090: ~$1,600
Model Requirements
GPU memory requirements for different model sizes (a rough sizing sketch follows the list):
- 7B-8B parameters (Llama 3.1 8B, Mistral 7B): ~16GB at FP16, ~8GB at INT8 → Fits on a single RTX 4090
- 13B parameters: 26GB at FP16, 13GB at INT8 → Fits on single A100 40GB
- 70B parameters (Llama 3.3 70B): 140GB at FP16, 70GB at INT8 → Requires 2x A100 80GB or 8x RTX 4090
- 405B parameters (Llama 3.1 405B): 810GB at FP16 → Requires 16x H100 80GB at FP16, or a single 8x H100 node with FP8 quantization
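These figures follow from a simple rule of thumb: weight memory is roughly parameter count times bytes per parameter, and KV cache, activations, and serving overhead come on top. A small sketch, treating the results as lower bounds rather than exact requirements:

```python
# Rough rule of thumb: weight memory only (parameters x bytes per parameter).
# KV cache, activations, and serving overhead add more (often 20-50%+),
# so treat these numbers as lower bounds.

def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param  # 1B params at 1 byte/param ~ 1 GB

for params in (8, 70, 405):
    fp16 = weight_memory_gb(params, 2.0)   # FP16/BF16
    int8 = weight_memory_gb(params, 1.0)   # INT8
    int4 = weight_memory_gb(params, 0.5)   # INT4
    print(f"{params}B: ~{fp16:.0f} GB FP16, ~{int8:.0f} GB INT8, ~{int4:.0f} GB INT4")
```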
The Hidden Costs
Hardware is just the beginning. Full self-hosted costs include:
1. Infrastructure overhead:
- Load balancing and request routing
- Monitoring and alerting
- Logging and debugging infrastructure
- Model versioning and rollback capability
- Typically adds 20-40% to base compute costs
2. Operations labor:
- GPU expertise isn't cheap: $150-250K/year for ML infrastructure engineers
- At minimum, part-time attention from an engineer: ~$50K/year equivalent
- On-call rotation for production systems
3. Reliability engineering:
- Redundancy (you need spare capacity for failover)
- Multi-region deployment for availability
- Health checks and automatic recovery
4. Model management:
- Downloading, converting, and optimizing models
- Keeping up with new model releases
- Tuning inference parameters (batch size, quantization, etc.)
5. Security and compliance:
- Network isolation
- Access control and audit logging
- Compliance certification (if applicable)
Realistic Self-Hosted Cost Calculation
Let's calculate the true cost of running Llama 70B for medium-scale inference:
Hardware: 2x A100 80GB cloud GPU instances
- Instance cost: ~$8/hour × 730 hours = $5,840/month
- Reserved instances (1 year): ~$3,500/month
Infrastructure (40% overhead): ~$1,400/month
Operations (part-time engineer): ~$4,000/month
Total: ~$8,900/month (using the reserved-instance rate)
Throughput at this cost: With optimized serving (vLLM, TensorRT-LLM), expect ~50-100 tokens/second sustained throughput. That's ~130-260 million tokens per month.
Effective cost: ~$0.03-0.07 per 1,000 tokens
Compare to Claude Sonnet at $0.003-0.015 per 1,000 tokens. Self-hosted Llama 70B costs 2-20x more per token than Claude Sonnet, even with conservative labor estimates.
This calculation assumes high utilization. Most self-hosted deployments run at 20-40% utilization; traffic is bursty, not constant. At 30% utilization, your effective per-token cost roughly triples. APIs only charge for what you use.
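To see how much utilization dominates the result, here is a small sketch of the effective per-token calculation; the $8,900/month and 50-100 tokens/second figures are this section's estimates, not benchmarks of your workload.

```python
# Sketch: effective cost per 1K tokens for the self-hosted setup above,
# as a function of utilization. All inputs are this section's estimates.

MONTHLY_COST = 8_900           # USD/month: reserved 2x A100 + overhead + labor
PEAK_TOKENS_PER_SEC = 75       # midpoint of the 50-100 tokens/second estimate
SECONDS_PER_MONTH = 730 * 3600

def cost_per_1k_tokens(utilization: float) -> float:
    tokens_served = PEAK_TOKENS_PER_SEC * SECONDS_PER_MONTH * utilization
    return MONTHLY_COST / tokens_served * 1_000

for u in (1.0, 0.5, 0.3):
    print(f"{u:.0%} utilization: ${cost_per_1k_tokens(u):.3f} per 1K tokens")
# 100% -> ~$0.045, 50% -> ~$0.090, 30% -> ~$0.151
```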
When Self-Hosting Makes Sense
Despite the economics often favoring APIs, self-hosting is the right choice in several scenarios:
1. Extreme Volume
At very high volumes, self-hosting eventually wins. The crossover point depends on model size and utilization, but roughly:
- 7B models: Self-hosting wins above ~$2,000/month in API costs
- 70B models: Self-hosting wins above ~$15,000/month in API costs
- Frontier-class models: Self-hosting rarely wins; APIs include ongoing model improvements
Critical caveat: These numbers assume high utilization. If your traffic is bursty, the crossover point is much higher.
2. Latency Requirements
API calls include network latency. For applications requiring sub-100ms responses, self-hosting with local inference may be necessary. Common scenarios:
- Real-time gaming or interactive applications
- High-frequency trading or automated decision systems
- Inline code completion with strict latency budgets
3. Data Sovereignty
When data cannot leave your infrastructure for regulatory or contractual reasons:
- Healthcare with PHI constraints
- Financial services with strict data handling requirements
- Government or defense applications
- Contracts that prohibit third-party data processing
Note: Many cloud APIs now offer data residency guarantees and HIPAA/SOC2 compliance. Check current offerings before assuming self-hosting is required.
4. Fine-Tuned Models
If you've fine-tuned a model on proprietary data and the fine-tuning provides significant quality improvement, self-hosting that model makes sense. API providers offer fine-tuning, but serving your own fine-tuned model gives more control.
5. Offline or Air-Gapped Environments
Environments without reliable internet access require local inference. Common in:
- Edge deployments
- Secure facilities
- Remote operations
When APIs Win
Cloud APIs are the better choice in most scenarios:
1. Variable or Unpredictable Traffic
APIs charge per request. Self-hosting charges for capacity whether you use it or not. If your traffic varies significantly (seasonal businesses, products in a growth phase, experimental applications), APIs provide better economics.
2. Need for Frontier Capabilities
Self-hosted open models lag frontier APIs by 6-18 months in capability. If you need the best available models for complex reasoning, coding, or analysis, APIs are the only option.
3. Limited ML Infrastructure Expertise
The operational burden of self-hosting is significant. Organizations without ML infrastructure expertise should strongly prefer APIs. The time spent debugging CUDA errors, optimizing batch sizes, and managing model versions is time not spent on your actual product.
4. Rapid Iteration
APIs let you switch models instantly. Self-hosting requires redeployment. During product development when you're iterating on prompts and comparing models, API flexibility is valuable.
5. Multi-Model Strategies
Modern AI applications often use multiple models: a fast model for simple tasks, a capable model for complex reasoning, specialized models for specific domains. Managing this diversity is trivial with APIs, complex with self-hosting.
The Break-Even Calculation
Here's a framework for calculating your break-even point:
Step 1: Calculate Current API Costs
Track actual usage for at least one month. Note:
- Total tokens (input and output separately)
- Peak vs. average utilization
- Request distribution across models
Step 2: Estimate Self-Hosting Costs
For each model you'd self-host:
- Hardware cost (cloud GPU instances or owned hardware amortized)
- Infrastructure overhead (40% multiplier)
- Operations labor (minimum $4K/month for part-time attention)
- Adjust for your realistic utilization rate
Step 3: Compare Total Cost of Ownership
Self-hosting makes sense when:
(API monthly cost) > (Self-hosted monthly cost) × 1.5
The 1.5x multiplier accounts for the hidden friction costs: debugging time, missed features, slower iteration. If the math is close, APIs usually win on total value.
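As a sketch, the whole framework reduces to a few lines. The defaults below (40% overhead, $4K/month labor, the 1.5x friction multiplier) are the estimates used earlier in this article; replace them with your own measurements.

```python
# Sketch of the break-even check. Inputs are this article's estimates;
# substitute measured API spend and your own hardware/labor numbers.

FRICTION_MULTIPLIER = 1.5  # debugging time, missed features, slower iteration

def self_hosted_monthly(hardware: float, overhead_pct: float = 0.40,
                        labor: float = 4_000) -> float:
    """Estimated monthly self-hosting cost: hardware + infra overhead + labor."""
    return hardware * (1 + overhead_pct) + labor

def self_hosting_wins(api_monthly: float, self_hosted: float) -> bool:
    return api_monthly > self_hosted * FRICTION_MULTIPLIER

estimate = self_hosted_monthly(hardware=3_500)   # the 70B example above: ~$8,900
print(self_hosting_wins(api_monthly=12_000, self_hosted=estimate))  # False: stay on APIs
print(self_hosting_wins(api_monthly=20_000, self_hosted=estimate))  # True: worth modeling further
```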
Step 4: Consider the Non-Financial Factors
- Do you have the expertise to operate GPU infrastructure?
- Is latency a critical constraint?
- Are there data residency requirements?
- How important is access to the latest models?
Hybrid Strategies
The best approach for many organizations is hybrid: APIs for most workloads, self-hosting for specific use cases.
Common Hybrid Patterns
Tiered by complexity: Self-host a small model (7B-13B) for simple tasks like classification, summarization, and extraction. Use APIs for complex reasoning and generation. (A routing sketch follows this list.)
Tiered by volume: Self-host for high-volume, predictable workloads. Use APIs for overflow and burst capacity.
Tiered by sensitivity: Self-host for sensitive data that can't leave your infrastructure. Use APIs for everything else.
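A complexity-tiered router can start as little more than a lookup on task type plus a size check. The sketch below is illustrative only: the backend names and the heuristic are placeholders, and production routers usually rely on richer signals (token budget, user tier, a lightweight classifier).

```python
# Illustrative router sketch; backend names and thresholds are placeholders.

SIMPLE_TASKS = {"classify", "extract", "summarize"}

def route(task_type: str, prompt: str) -> str:
    """Decide which backend handles a request."""
    if task_type in SIMPLE_TASKS and len(prompt) < 8_000:
        return "self-hosted-7b"   # e.g. a small model behind your own serving stack
    return "cloud-api"            # frontier model via a hosted API

print(route("classify", "Is this ticket a billing or a technical issue?"))    # self-hosted-7b
print(route("generate", "Draft a migration plan for our payments service."))  # cloud-api
```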
Edge Inference for Latency-Sensitive Tasks
Small models (7B) run surprisingly well on consumer hardware. For latency-sensitive applications, consider:
- Local inference on user devices (mobile, laptop)
- Edge servers for geographic latency reduction
- Hybrid: local for fast initial response, API for complex follow-up
Practical Recommendations
If you're under $5,000/month in API costs: Don't self-host. The operational overhead exceeds any possible savings. Focus on prompt optimization and model selection to reduce API costs.
If you're at $5,000-$20,000/month: Analyze your workload. If a significant portion is simple tasks (classification, extraction, summarization), self-hosting a small model might make sense. Keep APIs for complex tasks.
If you're above $20,000/month: Self-hosting probably makes sense for at least part of your workload. Invest in the infrastructure expertise to do it well. Consider a hybrid approach.
Regardless of spend: Always have API fallback capability. Self-hosted infrastructure fails. Having an API backup prevents outages from becoming incidents.
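A minimal sketch of that fallback path, assuming the self-hosted model sits behind an OpenAI-compatible HTTP endpoint (as servers like vLLM provide); the URL, model name, and cloud-API stub are placeholders for your actual clients.

```python
# Sketch: try the self-hosted endpoint first, fall back to a cloud API on failure.
# The endpoint URL and model name are hypothetical placeholders.
import requests

def call_self_hosted(prompt: str, timeout: float = 10.0) -> str:
    # Placeholder: an OpenAI-compatible completions endpoint on your own cluster.
    r = requests.post(
        "http://inference.internal/v1/completions",
        json={"model": "llama-3.3-70b", "prompt": prompt, "max_tokens": 500},
        timeout=timeout,
    )
    r.raise_for_status()
    return r.json()["choices"][0]["text"]

def call_cloud_api(prompt: str) -> str:
    # Placeholder: swap in your provider's SDK call here.
    raise NotImplementedError("wire up your API client")

def generate_with_fallback(prompt: str) -> str:
    """Primary: self-hosted. Backup: hosted API, so an outage degrades instead of failing."""
    try:
        return call_self_hosted(prompt)
    except Exception:
        return call_cloud_api(prompt)
```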
Key Takeaways
- API costs have dropped faster than most people realize; recalculate your assumptions
- Self-hosting's hidden costs (labor, infrastructure, low utilization) typically push the true total to 2-3x the raw GPU cost
- APIs win for most workloads under $10-15K/month
- Self-hosting wins for extreme volume, strict latency, or data sovereignty requirements
- Hybrid strategies (small models self-hosted, frontier models via API) often optimal
- Always maintain API fallback capability regardless of primary infrastructure