Learn how prompt caching cuts text-to-SQL LLM costs by 80%. Real math, implementation patterns, and cost-reduction strategies for analytics platforms.
If you're running text-to-SQL workloads at scale—whether embedded in dashboards, powering self-serve BI, or automating query generation—your LLM costs are probably higher than they need to be. Every time a user asks a question about their data, you're sending the same database schema, table definitions, and system instructions to Claude, GPT-4, or another model. You're paying full price for that context. Every. Single. Time.
Prompt caching flips that equation. By caching the static portions of your prompts—the schema, the instructions, the few-shot examples—you pay 90% less for cached tokens on reads and cut latency by 75%. For text-to-SQL workloads specifically, the math is brutal in your favor: an 80% cost reduction is not aspirational. It's achievable with straightforward engineering.
This article walks through the mechanics, the math, and the patterns that make it work.
Prompt caching is a feature offered by major LLM providers—Anthropic's Claude, OpenAI, and Amazon Bedrock—that stores portions of your prompt in the model's KV cache on the provider's infrastructure. Once cached, subsequent requests that use the same cached tokens are charged at a fraction of the normal rate.
Here's the practical implication: if your prompt contains a 5,000-token database schema and 2,000 tokens of system instructions, and 100 users query that same database in a day, you're paying full price for those 7,000 tokens 100 times. With caching, you pay full price once, then 10% of that price for the next 99 reads.
For text-to-SQL specifically, this is a game-changer because:
The math: if your average text-to-SQL prompt is 8,000 tokens (schema + instructions) plus 500 tokens (user query), and you run 1,000 queries per day against the same database:
That's not theoretical. That's what the providers deliver, and it's what you see in production systems.
The 80% figure isn't magic—it's a direct consequence of how LLM providers price cached tokens. Let's break down the actual numbers.
Anthropic's prompt caching infrastructure offers a 90% discount on cached token reads. Here's what that means in practice:
For a 10,000-token schema cached and read 100 times:
Without caching, the same 100 queries would cost: 100 × 10,000 × $3/1M = $3.00. The cached version costs $0.0597—a 98% reduction for that specific prompt segment.
OpenAI's prompt caching works differently but delivers similar savings. OpenAI caches prompts over 1024 tokens and charges 50% of the input token price for cached reads:
The discount is lower than Anthropic's, but the principle is identical: repeated context is cheaper on the second and subsequent uses.
Amazon Bedrock's prompt caching for Claude models mirrors Anthropic's 90% discount structure, since Bedrock runs Anthropic's models. Cache reads cost 10% of normal input token rates, making it attractive for workloads with high repetition.
The 80% cost reduction in the title is actually conservative. Here's why:
In real systems running embedded analytics or self-serve BI on top of D23's managed Apache Superset, teams report 80–85% cost reductions once caching is properly implemented. Some hit 90% with highly repetitive workloads.
To implement caching effectively, you need to understand what parts of your prompt are static (cacheable) and what parts change with every request.
Database schema: Your entire schema definition—tables, columns, data types, primary/foreign keys, indexes. This is typically 2,000–8,000 tokens depending on database size.
TABLE: customers
- id (INTEGER, PRIMARY KEY)
- name (VARCHAR)
- email (VARCHAR)
- signup_date (DATE)
- country (VARCHAR)
TABLE: orders
- id (INTEGER, PRIMARY KEY)
- customer_id (INTEGER, FOREIGN KEY -> customers.id)
- amount (DECIMAL)
- order_date (DATE)
- status (VARCHAR: 'pending', 'completed', 'cancelled')
System instructions: How the model should approach query generation, handle ambiguity, optimize for performance, handle NULL values, etc. Typically 1,000–2,000 tokens.
Few-shot examples: Sample natural language questions paired with correct SQL queries. These teach the model your style and constraints. Typically 1,500–3,000 tokens.
Data dictionary or documentation: Descriptions of what each table and column represents, common joins, business logic. Typically 500–2,000 tokens.
Constraints and rules: "Never use SELECT *", "Always filter by created_date for performance", "Use this specific schema for PII data." Typically 300–800 tokens.
Total static context: 5,300–15,800 tokens depending on database complexity.
User query: The natural language question. "What's my revenue by product category this month?" Typically 50–300 tokens.
Runtime context: Current date, user permissions, filters applied by the application, session state. Typically 100–500 tokens.
Recent data or context: If you include recent rows for context, those change. Typically 0–1,000 tokens depending on your approach.
Total dynamic context: 150–1,800 tokens per request.
A well-structured text-to-SQL prompt for caching looks like this:
[CACHE_CONTROL: static_schema]
You are a SQL expert. Generate SQL queries based on natural language questions.
[Database Schema]
[Full schema definition - 5,000+ tokens]
[Instructions]
[How to handle joins, NULL values, performance - 1,500 tokens]
[Examples]
Q: What's our top 10 customers by revenue?
A: SELECT customer_id, SUM(amount) as total_revenue FROM orders WHERE status = 'completed' GROUP BY customer_id ORDER BY total_revenue DESC LIMIT 10;
[More examples - 2,000 tokens]
[END_STATIC_CACHE]
[DYNAMIC_CONTEXT]
Current date: 2025-01-15
User timezone: UTC
User permissions: can access customers, orders, products tables
[USER_QUESTION]
What was our average order value in January?
[/USER_QUESTION]
The cache control markers tell the LLM provider what to cache. Everything between the markers is cached on the first request; on subsequent requests, only the dynamic context and user question are sent fresh.
Caching isn't automatic—you need to structure your requests correctly. Here's how it works across providers.
Anthropics's official documentation on prompt caching uses a cache_control parameter in the API request. Here's a Python example:
import anthropic
client = anthropic.Anthropic(api_key="your-api-key")
schema = """TABLE: customers...[5000 tokens]..."""
instructions = """You are a SQL expert...[1500 tokens]..."""
examples = """Q: What's revenue?...[2000 tokens]..."""
user_query = "What's our average order value this month?"
response = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1024,
system=[
{
"type": "text",
"text": schema + instructions + examples,
"cache_control": {"type": "ephemeral"}
}
],
messages=[
{
"role": "user",
"content": user_query
}
]
)
print(response.content[0].text)
print(f"Cache creation tokens: {response.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
print(f"Input tokens: {response.usage.input_tokens}")On the first request, cache_creation_input_tokens will be high (the schema, instructions, and examples). On subsequent requests with the same system prompt, cache_read_input_tokens will be high, and the cost will drop to 10% of normal.
OpenAI's caching works similarly but requires the cached content to be over 1024 tokens:
import openai
client = openai.OpenAI(api_key="your-api-key")
schema = """TABLE: customers...[5000 tokens]..."""
instructions = """You are a SQL expert...[1500 tokens]..."""
examples = """Q: What's revenue?...[2000 tokens]..."""
user_query = "What's our average order value this month?"
response = client.chat.completions.create(
model="gpt-4o",
max_tokens=1024,
system=[
{
"type": "text",
"text": schema + instructions + examples,
"cache_control": {"type": "ephemeral"}
}
],
messages=[
{
"role": "user",
"content": user_query
}
]
)
print(response.choices[0].message.content)
print(f"Cache creation tokens: {response.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")Amazon Bedrock's prompt caching is configured via the request body:
import boto3
client = boto3.client('bedrock-runtime')
schema = """TABLE: customers...[5000 tokens]..."""
instructions = """You are a SQL expert...[1500 tokens]..."""
examples = """Q: What's revenue?...[2000 tokens]..."""
user_query = "What's our average order value this month?"
response = client.invoke_model(
modelId='anthropic.claude-3-5-sonnet-20241022-v2:0',
body={
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": schema + instructions + examples,
"cache_control": {"type": "ephemeral"}
},
{
"type": "text",
"text": user_query
}
]
}
],
"max_tokens": 1024
}
)
output = json.loads(response['body'].read())
print(output['content'][0]['text'])
print(f"Cache creation tokens: {output['usage'].get('cache_creation_input_tokens', 0)}")
print(f"Cache read tokens: {output['usage'].get('cache_read_input_tokens', 0)}")Let's walk through three realistic scenarios and calculate actual cost savings.
Setup:
Without caching:
With caching (90% cache hit rate):
Savings: $2,826 - $316.20 = $2,509.80 per month (89% reduction)
Setup:
Without caching:
With caching (92% cache hit rate):
Savings: $113,850 - $12,610.20 = $101,239.80 per month (89% reduction)
Setup:
Without caching:
With caching (95% cache hit rate, OpenAI 50% discount):
Savings: $8,062.50 - $4,125.60 = $3,936.90 per month (51% reduction with OpenAI)
Note: OpenAI's 50% discount is lower than Anthropic's 90%, so savings are more modest. For PE firms, switching to Anthropic or Bedrock could push savings to 80%+.
Once you understand the fundamentals, there are sophisticated patterns that amplify savings.
Prompt caching infrastructure for LLM cost and latency reduction describes a multi-tier approach: cache the schema once, then cache schema + instructions + examples as a second tier, then cache schema + instructions + examples + recent context as a third tier.
This is useful when:
Implementation: Use separate cache_control markers for each tier, and structure your prompts to reuse lower tiers.
Your database schema will change. When it does, your cache becomes stale. Handle this by:
Example:
schema_version = "2025-01-15-v2"
schema_prompt = f"""SCHEMA VERSION: {schema_version}
TABLE: customers...[schema definition]
"""
# On schema change, increment version and new cache is created
if detect_schema_change():
schema_version = "2025-01-20-v3"Instead of caching all examples, cache a base set of high-value examples, then add user-specific or context-specific examples dynamically. This keeps the cached portion large (high savings) while allowing personalization.
base_examples = """[cached examples - 2000 tokens]"""
dynamic_examples = f"""[user-specific examples - 300 tokens]"""
# Cache the base, send dynamic fresh
response = client.messages.create(
system=[
{"type": "text", "text": schema + instructions + base_examples, "cache_control": {"type": "ephemeral"}},
{"type": "text", "text": dynamic_examples}
],
messages=[{"role": "user", "content": user_query}]
)If you run queries across multiple databases with similar schemas, you can cache a generic schema definition and then add database-specific details dynamically. This amortizes the cache across all databases.
generic_schema = """[Common schema patterns - 5000 tokens, cached]"""
db_specific = """[Database X specifics - 1000 tokens, dynamic]"""
response = client.messages.create(
system=[
{"type": "text", "text": generic_schema, "cache_control": {"type": "ephemeral"}},
{"type": "text", "text": db_specific}
],
messages=[{"role": "user", "content": user_query}]
)If you're building text-to-SQL into a self-serve BI or embedded analytics product, prompt caching integrates seamlessly with platforms like D23's managed Apache Superset. Here's why it matters:
D23 provides managed Superset hosting with API-first architecture, which means:
When you integrate caching with D23's platform, you're not just saving on LLM costs—you're improving dashboard load times (cached reads are faster) and reducing infrastructure load on your Superset instance.
Once caching is live, you need visibility into what's working.
Cache hit rate: Percentage of requests that hit the cache.
cache_read_input_tokens / (cache_read_input_tokens + input_tokens)Cost per query: Total LLM cost divided by number of queries.
Cache efficiency ratio: Tokens saved via caching divided by total tokens.
cache_read_input_tokens / total_input_tokensLatency improvement: Time to first token (TTFT) and total latency with vs. without caching.
Log every request with:
import logging
import json
logger = logging.getLogger(__name__)
def log_query_metrics(response, user_id, query_text):
metrics = {
"timestamp": datetime.now().isoformat(),
"user_id": user_id,
"query_text": query_text,
"cache_creation_tokens": response.usage.cache_creation_input_tokens,
"cache_read_tokens": response.usage.cache_read_input_tokens,
"input_tokens": response.usage.input_tokens,
"output_tokens": response.usage.output_tokens,
"total_cost": calculate_cost(response.usage),
"cache_hit": response.usage.cache_read_input_tokens > 0
}
logger.info(json.dumps(metrics))
def calculate_cost(usage):
# Anthropic pricing
cache_write_cost = usage.cache_creation_input_tokens * (3 / 1_000_000)
cache_read_cost = usage.cache_read_input_tokens * (0.30 / 1_000_000)
input_cost = usage.input_tokens * (3 / 1_000_000)
output_cost = usage.output_tokens * (15 / 1_000_000)
return cache_write_cost + cache_read_cost + input_cost + output_costBuild a dashboard (ideally in D23's Superset interface) that shows:
Problem: If you cache user-specific filters, timestamps, or session tokens, the cache becomes nearly useless because every request is different.
Solution: Separate static (schema, instructions) from dynamic (user query, filters, context). Cache only the static portion.
Problem: If your first request of the day caches the schema, and then the cache expires (some providers have TTLs), you lose the benefit.
Solution: Implement cache warm-up—pre-populate caches during off-peak hours or when the application starts.
def warm_up_cache():
for database in databases:
schema = fetch_schema(database)
instructions = fetch_instructions()
examples = fetch_examples()
# Make a dummy request to populate the cache
response = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=10,
system=[{"type": "text", "text": schema + instructions + examples, "cache_control": {"type": "ephemeral"}}],
messages=[{"role": "user", "content": "SELECT 1"}]
)
logger.info(f"Cache warmed for {database}")Problem: If you cache too much dynamic content in an attempt to maximize cache size, you reduce cache hit rates because the cached content changes frequently.
Solution: Cache only content that's truly static or changes infrequently. The sweet spot is typically schema + instructions + base examples (5,000–10,000 tokens).
Problem: Different providers have different cache lifetimes. Anthropic caches for 5 minutes by default; OpenAI for up to 24 hours. If you don't understand the TTL, you might think the cache is working when it's actually expiring.
Solution: Check provider documentation and monitor cache hit rates. If hit rates drop unexpectedly, investigate TTL.
Problem: Prompt caching reduces input token costs, but output tokens (the generated SQL) are still full price. If your queries are generating very long outputs, the savings are less dramatic.
Solution: Optimize query generation to produce concise SQL. Use output constraints in your prompt: "Generate the most efficient SQL query in under 500 tokens."
Prompt caching isn't the only way to reduce LLM costs for text-to-SQL. Here's how it compares:
Fine-tuning involves training a smaller model on your specific domain (SQL generation). It reduces inference costs by using a cheaper model.
Prompt caching:
Verdict: Caching is faster to implement and more cost-effective. Fine-tuning is complementary and worth exploring after caching is live.
Query deduplication means caching actual SQL query results and returning cached results for identical queries.
Prompt caching:
Verdict: These are complementary. Use both: prompt caching for LLM costs, query result caching for database load.
Cheaper models like Claude 3 Haiku or GPT-4o Mini cost 80–90% less than flagship models.
Prompt caching:
Verdict: You can combine these strategies. Use a flagship model with caching for the best accuracy and cost. Or use a cheaper model with caching for extreme cost reduction (95%+).
Here's a step-by-step implementation roadmap:
The 80% cost reduction in prompt caching for text-to-SQL isn't aspirational marketing—it's physics. When 80–90% of your prompt is static schema and instructions, and you're caching at 90% discount rates, the math is straightforward.
For data and analytics leaders, engineering teams, and platform builders embedding text-to-SQL into dashboards, self-serve BI, or analytics products, prompt caching is the single highest-ROI optimization available. It requires minimal engineering effort, delivers immediate cost savings, and improves user experience through faster queries.
The path is clear: structure your prompts to separate static from dynamic content, implement caching with your LLM provider, monitor the results, and iterate. Within a month, you'll see 80%+ cost reductions on LLM inference. Within three months, you'll have optimized the implementation to the point where text-to-SQL is no longer a significant cost driver.
For teams using D23's managed Apache Superset platform, prompt caching integrates naturally with embedded analytics and self-serve BI workflows, amplifying the cost and latency benefits. Whether you're a startup scaling query generation, an enterprise standardizing analytics across teams, or a PE firm tracking portfolio KPIs, caching is the lever that makes text-to-SQL economically viable at scale.
Start with the basics. Measure your savings. Then optimize. The 80% is waiting.