ShuttleAI uses multipliers to fairly price access to different models and context sizes. There are two types: request multipliers and context scaling multipliers.

Request multipliers

Every model has a request multiplier — a number that determines how many of your per-minute requests each API call consumes. Most models have a 1x multiplier (one API call = one request). Larger or more expensive models have a higher multiplier to reflect their compute cost.

How they work

Your plan gives you a set number of requests per minute (RPM):
Plan    | RPM
Free    | 2
Basic   | 10
Premium | 30
Scale   | 80
When you call a model, the multiplier is deducted from your available RPM:
1. You make an API call: you send a request to a model with a 2x multiplier.
2. The multiplier is applied: the system deducts 2 requests from your per-minute quota instead of 1.
3. Remaining capacity: if you're on the Premium plan (30 RPM), you now have 28 RPM remaining for that minute.
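
A minimal sketch of this accounting in Python, assuming a simple fixed-window counter (PLAN_RPM and consume_request are illustrative names, not part of the API):

PLAN_RPM = {"free": 2, "basic": 10, "premium": 30, "scale": 80}

def consume_request(remaining_rpm: float, request_multiplier: float) -> float:
    """Deduct one call's multiplier from the quota left in the current minute."""
    if remaining_rpm < request_multiplier:
        raise RuntimeError("Rate limit exceeded for this minute (429)")
    return remaining_rpm - request_multiplier

remaining = PLAN_RPM["premium"]               # 30 RPM at the start of the minute
remaining = consume_request(remaining, 2.0)   # a 2x model leaves 28 RPM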

Examples

Basic plan (10 RPM)

Model            | Multiplier | Effective calls/min
ShuttleAI Auto   | 1x         | 10
GPT-5.2          | 1.5x       | ~6
Claude Haiku 4.5 | 1x         | 10

Premium plan (30 RPM)

Model             | Multiplier | Effective calls/min
ShuttleAI Auto    | 1x         | 30
GPT-5.2           | 1.5x       | 20
Claude Opus 4.6   | 2x         | 15
Claude Sonnet 4.6 | 2x         | 15
Claude Haiku 4.5  | 1x         | 30

Scale plan (80 RPM)

Model             | Multiplier | Effective calls/min
ShuttleAI Auto    | 1x         | 80
GPT-5.2           | 1.5x       | ~53
Claude Opus 4.6   | 2x         | 40
Claude Sonnet 4.6 | 2x         | 40
Claude Haiku 4.5  | 1x         | 80
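
The effective calls per minute in the tables above are simply your plan's RPM divided by the model's multiplier, rounded down. A quick sketch (the helper name is illustrative):

import math

def effective_calls_per_minute(plan_rpm: int, request_multiplier: float) -> int:
    """How many calls to a given model fit into one minute of quota."""
    return math.floor(plan_rpm / request_multiplier)

print(effective_calls_per_minute(30, 1.5))  # Premium + GPT-5.2 -> 20
print(effective_calls_per_minute(80, 1.5))  # Scale + GPT-5.2 -> 53
print(effective_calls_per_minute(30, 2.0))  # Premium + Claude Opus 4.6 -> 15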

Checking multipliers

You can check any model’s multiplier using the verbose models endpoint:
curl https://api.shuttleai.com/v1/models/verbose \
  -H "Authorization: Bearer shuttle-xxx"
Each model includes a request_multiplier field:
{
  "id": "claude-opus-4.6",
  "plan": "premium",
  "request_multiplier": 2.0
}
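If you want to look multipliers up programmatically, here is a minimal Python sketch using the requests library (reading the API key from an environment variable and treating the response as a list under a "data" key are assumptions; adjust to the actual response shape):

import os
import requests

resp = requests.get(
    "https://api.shuttleai.com/v1/models/verbose",
    headers={"Authorization": f"Bearer {os.environ['SHUTTLEAI_API_KEY']}"},
    timeout=30,
)
resp.raise_for_status()

# request_multiplier is the field documented above; the "data" envelope is assumed.
for model in resp.json().get("data", []):
    print(model["id"], model.get("request_multiplier"))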
You can also see multipliers displayed on each model card at shuttleai.com/models.

Tips for managing multipliers

  • ShuttleAI Auto has a 1x multiplier and intelligently routes to the best model for each task. It's the most cost-efficient way to use the API.
  • Use lighter models (1x) for simple tasks and save your budget for frontier model calls (2x) when you need maximum quality.
  • The Dashboard shows your request usage in real time, including breakdowns by model. Use this to understand how multipliers affect your usage patterns.
  • If you're consistently hitting rate limits, consider upgrading your plan. The jump from Basic (10 RPM) to Premium (30 RPM) gives you 3x the capacity.

Rate limit errors

If you exceed your RPM (accounting for multipliers), the API returns a 429 Too Many Requests error:
{
  "error": {
    "message": "Rate limit exceeded. Please try again later.",
    "type": "rate_limit_error",
    "code": 429
  }
}
Wait until the next minute window to retry, or upgrade your plan for higher limits.
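
Since the quota resets each minute, a simple client-side pattern is to catch the 429 and retry once the next window starts. A hedged sketch in Python (the URL, headers, and payload you pass in depend on the endpoint you actually call):

import time
import requests

def post_with_retry(url: str, headers: dict, payload: dict, max_retries: int = 3) -> dict:
    """POST to the API, waiting out the minute window whenever a 429 comes back."""
    for attempt in range(max_retries + 1):
        resp = requests.post(url, headers=headers, json=payload, timeout=60)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        if attempt == max_retries:
            resp.raise_for_status()
        # Sleep until the current minute window rolls over, then retry.
        time.sleep(60 - (time.time() % 60))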

Context window limits & scaling

Your plan determines the maximum number of tokens you can send in a single request. This is a hard limit — requests that exceed it are rejected.
Plan    | Max context
Free    | 8K tokens
Basic   | 16K tokens
Premium | 36K tokens
Scale   | 128K tokens
If your request exceeds your plan’s context limit, the API will return an error. You cannot go over your plan’s max — upgrade your plan for a higher limit.
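
There is no official pre-flight check for this, but you can guard against obviously oversized prompts before sending. A rough sketch, assuming the common heuristic of roughly 4 characters per token (a real tokenizer gives more accurate counts):

PLAN_MAX_CONTEXT = {"free": 8_000, "basic": 16_000, "premium": 36_000, "scale": 128_000}

def fits_in_context(prompt: str, plan: str) -> bool:
    """Estimate tokens as ~4 characters each and compare against the plan limit."""
    estimated_tokens = len(prompt) / 4
    return estimated_tokens <= PLAN_MAX_CONTEXT[plan]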

Context scaling (paid plans only)

For paid plans (Basic, Premium, Scale), a context scaling multiplier is applied on top of the model’s request multiplier. This is calculated per 16K token block:
scaleFactor = ceil(promptTokens / 16,000)
totalCost = model_request_multiplier × scaleFactor
This means larger prompts cost more requests from your per-minute quota:
Prompt size    | Scale factor | With 1x model | With 2x model
1–16K tokens   | 1x           | 1 request     | 2 requests
16K–32K tokens | 2x           | 2 requests    | 4 requests
32K–48K tokens | 3x           | 3 requests    | 6 requests
48K–64K tokens | 4x           | 4 requests    | 8 requests
The Free plan does not have context scaling — you pay only the model’s base request multiplier, but you’re limited to 8K tokens max.

Examples

  • A 10K prompt on the Basic plan with a 1x model = 1 request (10K is within the first 16K block)
  • A 20K prompt on the Premium plan with a 1x model = 2 requests (ceil(20K / 16K) = 2)
  • A 100K prompt on the Scale plan with a 2x model = 14 requests (ceil(100K / 16K) = 7, × 2x model = 14)
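
Putting the formula and the examples together, a small sketch that reproduces the numbers above (the function name is illustrative; the Free plan branch follows the note that it has no context scaling):

import math

def request_cost(prompt_tokens: int, request_multiplier: float, plan: str) -> float:
    """Requests deducted from the per-minute quota for a single call."""
    if plan == "free":
        return request_multiplier  # no context scaling on the Free plan
    scale_factor = math.ceil(prompt_tokens / 16_000)
    return request_multiplier * scale_factor

print(request_cost(10_000, 1.0, "basic"))    # 1
print(request_cost(20_000, 1.0, "premium"))  # 2
print(request_cost(100_000, 2.0, "scale"))   # 14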

Why context scaling exists

Processing more tokens requires more compute. Context scaling ensures that heavy token usage is priced fairly while keeping costs low for everyday requests.

Tips for managing context costs

  • Only include the context the model actually needs. Trim unnecessary history, instructions, or documents.
  • If you regularly need large context windows, upgrade to a plan with a higher limit. The Scale plan (128K) gives you the most room.
  • Instead of sending the full chat history every time, periodically summarize older messages to reduce token count.
  • The Dashboard shows your request usage. Watch for requests that consume more than expected — it may be context scaling.