January 11, 2026

Unbounded Consumption — Protecting Against Resource Exhaustion and AI-DoS

As AI becomes core infrastructure in 2026, attackers have learned to weaponize model compute costs. LLM10: Unbounded Consumption highlights how adversaries can trigger “Denial‑of‑Wallet” attacks by forcing models into expensive reasoning loops, overloading context windows, or causing agents to spam tool calls. This post outlines essential safeguards including hard token caps, token‑based rate limiting, execution timeouts, recursion limits, and semantic caching. The emerging best practice—model cascading—routes each request to the cheapest model capable of handling it, dramatically reducing risk and spend. The goal: keep your AI powerful, affordable, and resilient.

We have reached the final pillar of our AI Security Framework. While previous risks focused on what the AI says or how it is trained, LLM10: Unbounded Consumption is about the raw survival of your application.

In the world of 2026, tokens are the new currency. Unlike traditional software where a "Denial of Service" (DoS) attack might just crash a server, an attack on an LLM can result in a "Denial of Wallet" (DoW)—where an attacker doesn't just knock you offline, they bankrupt you in the process.

The New Attack Surface: "Denial of Wallet"

Traditional DoS attacks flood your network with traffic. AI-specific consumption attacks are much more surgical. They exploit the fact that processing a single complex prompt can cost 100× more than a simple one.

1. The Reasoning Loop Exploitation

Attackers craft "recursive" or "impossible" prompts that force models optimized for multi-step reasoning into an infinite loop.

  • The Attack: "Plan a trip from Anchorage to Paris, but only fly on airlines that exclusively fly East, and each connecting flight must land in a city further West than the previous one."
  • The Result: The AI's "long-term reasoning" agent may burn thousands of tokens trying to solve an unsolvable logic puzzle.

2. Context Window Padding

Modern models (like Gemini 1.5 Pro) have context windows of 1M+ tokens.

  • The Attack: An attacker sends a prompt that includes a massive, 800,000-token block of "nonsense" text followed by a simple question.
  • The Result: You are billed for nearly a million tokens of input for a single query. Repeat this 100 times, and your monthly budget is gone in minutes.

3. Tool-Call Amplification

If your AI is an agent that can call external APIs (like a weather or stock tool), an attacker can trick it into calling that tool 1,000 times in a single session, incurring costs on both your AI provider and your tool providers.

Defending the Bottom Line: 2026 Resource Controls

In 2026, treating "Availability" as a security boundary means treating tokens, compute, and execution time as protected resources.

1. Hard Token Caps (The "Circuit Breaker")

Never allow an "unlimited" response.

  • Input Limits: Set a maximum character or token count for user prompts. If a user tries to paste a 500-page book into your chat, reject it at the gateway.
  • Output Limits: Use the max_tokens parameter in every API call. This prevents the model from "rambling" indefinitely during a hallucination or attack.
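As a concrete illustration, here is a minimal gateway-side sketch in Python. The `call_llm` helper and the specific limits are assumptions, not any particular provider's API; swap in your real SDK call and tune the numbers to your traffic.

```python
# Minimal sketch of a gateway-side "circuit breaker", assuming a generic
# `call_llm` helper. Adapt to whatever SDK you actually use.

MAX_INPUT_CHARS = 4_000      # hard cap on user prompt length
MAX_OUTPUT_TOKENS = 1_024    # hard cap on model output

def guarded_completion(call_llm, user_prompt: str) -> str:
    # Reject oversized prompts before they ever reach the model.
    if len(user_prompt) > MAX_INPUT_CHARS:
        raise ValueError("Prompt exceeds input limit; rejected at the gateway.")

    # Always pass an explicit output ceiling so the model cannot "ramble" indefinitely.
    return call_llm(prompt=user_prompt, max_tokens=MAX_OUTPUT_TOKENS)
```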

2. Multi-Tiered Rate Limiting

Standard rate limiting (requests per minute) isn't enough for AI. You need Token-Based Rate Limiting.

  • The Strategy: Limit users to X tokens per day. Once they hit their quota, downgrade them to a smaller, cheaper model (like Gemini Flash or an SLM) or stop their service until the next window.
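A hedged sketch of what per-user, token-based quota tracking might look like. The daily quota, model names, and the rough `count_tokens` helper are illustrative assumptions; in production you would use a real tokenizer and persistent storage.

```python
# Per-user token quota with a model downgrade once the quota is exhausted.
from collections import defaultdict

DAILY_TOKEN_QUOTA = 200_000
usage = defaultdict(int)  # user_id -> tokens consumed today (reset by a daily job)

def count_tokens(text: str) -> int:
    # Crude stand-in: roughly 4 characters per token. Use a real tokenizer in production.
    return max(1, len(text) // 4)

def pick_model(user_id: str, prompt: str) -> str:
    cost = count_tokens(prompt)
    if usage[user_id] + cost > DAILY_TOKEN_QUOTA:
        return "small-cheap-model"   # downgrade (or block) once the quota is spent
    usage[user_id] += cost
    return "standard-model"
```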

3. Reasoning Timeouts and Depth Counters

For agentic workflows, you must put a "leash" on the AI's autonomy.

  • Execution Timeouts: If an AI agent hasn't reached a conclusion in 30 seconds, terminate the process.
  • Recursion Depth: Limit how many times an agent can "call itself" or "search again" to resolve a single query.
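One way to wire both controls into an agent loop is sketched below. The `run_step` function is a hypothetical placeholder for a single reasoning or tool-call step; the limits are examples, not recommendations for every workload.

```python
# Illustrative "leash" for an agent loop: a wall-clock timeout plus a hard cap
# on reasoning iterations.
import time

MAX_SECONDS = 30
MAX_ITERATIONS = 5

def run_agent(run_step, task: str) -> str:
    deadline = time.monotonic() + MAX_SECONDS
    state = task
    for _ in range(MAX_ITERATIONS):
        if time.monotonic() > deadline:
            return "Aborted: execution timeout reached."
        state, done = run_step(state)   # one reasoning / tool-call step
        if done:
            return state
    return "Aborted: maximum reasoning depth reached."
```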

4. Semantic Caching

Why pay for the same answer twice?

  • The Solution: Use a Semantic Cache (like GPTCache or a Redis-based vector cache). Before sending a prompt to the expensive LLM, check if a similar question has been answered recently. If the "semantic distance" between the new question and a cached one is near zero, serve the cached answer for free.
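A toy version of that lookup, assuming a generic `embed` function (any embedding model will do) and an arbitrary similarity threshold you would tune for your domain:

```python
# Toy semantic cache: embed the incoming prompt, compare against cached
# prompts, and reuse an answer when similarity is high enough.
import numpy as np

SIMILARITY_THRESHOLD = 0.95
cache: list[tuple[np.ndarray, str]] = []   # (embedding, answer) pairs

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cached_answer(embed, prompt: str) -> str | None:
    vec = embed(prompt)
    for stored_vec, answer in cache:
        if cosine(vec, stored_vec) >= SIMILARITY_THRESHOLD:
            return answer              # serve the cached answer "for free"
    return None

def store_answer(embed, prompt: str, answer: str) -> None:
    cache.append((embed(prompt), answer))
```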

Summary of Resource Protection Layers

Defense Layer     Technical Control                     Primary Goal
Input Layer       Length Validation & PII Scrubbing     Prevents "Context Padding"
Orchestration     Semantic Caching                      Reduces redundant API spend
Model Layer       Max Token Caps / Model Cascading      Prevents "Rambling" and DoW
Agent Layer       Depth Limits & Timeouts               Stops "Reasoning Loops"

The "Model Cascading" Strategy (2026 Best Practice)

Leading organizations no longer use their "smartest" (and most expensive) model for everything. They use a Cascade:

  1. Tier 1 (Classifier): A tiny, local model (like Gemma 2B) analyzes the prompt. Is it a simple question?
  2. Tier 2 (Worker): If simple, a mid-range model (Gemini Flash) answers.
  3. Tier 3 (Expert): Only if the task requires deep reasoning is it passed to the "Premium" model.

This cascade reduces average costs by up to 70% and serves as a natural buffer against consumption attacks.
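A simplified router might look like the sketch below. The tier names and the complexity heuristic are placeholders for whatever classifier and models you actually deploy.

```python
# Simplified cascade router: a cheap check decides whether the request
# needs the expensive model. Model names are illustrative, not real endpoints.

def classify_complexity(prompt: str) -> str:
    # Stand-in for a tiny local classifier (a small model or a heuristic).
    return "complex" if len(prompt.split()) > 200 else "simple"

def route(prompt: str) -> str:
    tier = classify_complexity(prompt)
    if tier == "simple":
        return "mid-range-model"    # Tier 2: fast, cheap worker
    return "premium-model"          # Tier 3: reserved for deep reasoning
```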

Technical Checklist: Is Your AI "Wallet-Safe"?

  • [ ] Have we set max_tokens on every single LLM API call?
  • [ ] Is there a hard limit on user input length (e.g., 4,000 characters)?
  • [ ] Are we monitoring Cost-per-User and not just Total API Spend?
  • [ ] Does our agent architecture have a "Max Iterations" count (e.g., limit of 5 tool calls)?
  • [ ] Have we implemented a semantic cache for common queries?

Conclusion: Completing the Framework

Congratulations! You’ve navigated the full 2026 AI Security Framework. From preventing Prompt Injection to stopping Unbounded Consumption, you now have the blueprint to build AI that is not just powerful, but resilient.

Security in the AI era isn't a one-time setup—it's a continuous process of Monitoring, Red-Teaming, and Refining.
