Build reliable eval suites for analytics LLMs. Learn text-to-SQL, summarization testing, and production-grade evaluation frameworks for data teams.
You've trained a text-to-SQL model. It converts natural language into queries. You've integrated it into your analytics platform. Now what? How do you know it actually works in production? How do you measure drift, catch hallucinations, and prove to stakeholders that the model is safe to ship?
This is the problem most data and engineering teams face when building LLM-powered analytics features. The stakes are high—a bad SQL query doesn't just waste compute; it can serve incorrect metrics to decision-makers, erode trust in your platform, and create compliance headaches.
This guide walks you through building a practical evaluation suite for analytics-specific LLM use cases. We'll cover the frameworks, metrics, and testing strategies that work at scale. Whether you're embedding AI-powered analytics into your product or building internal dashboards with text-to-SQL capabilities, this framework applies.
When you search for LLM evaluation best practices, you'll find plenty of guidance on general-purpose models. But analytics use cases are fundamentally different from chatbots or summarization tasks.
In a general LLM application, "good" output is often subjective. A summary can be helpful in multiple ways. A translation can be accurate yet creative. But in analytics, there's a ground truth. A SQL query either returns the correct result set or it doesn't. A metric calculation is either accurate or it's misleading your entire organization.
This binary nature of correctness creates unique evaluation challenges. You can't rely on human raters scoring outputs on a 1-5 scale. You need deterministic, repeatable tests that validate both the query structure and the results it produces.
Additionally, analytics LLM use cases often involve multi-step processes. Text-to-SQL is just the first step. The model might need to:
Evaluating each step independently—and then evaluating the entire pipeline—requires a different framework than general LLM evaluation.
When building evaluation suites for analytics LLMs, you're working with two complementary evaluation approaches: property evals and correctness evals. Understanding when to use each is critical.
Property Evals test whether your LLM output has certain characteristics, independent of a ground-truth answer. These include:
Property evals are fast, deterministic, and don't require a ground-truth dataset. You can run them on every model output in production. They catch obvious mistakes—malformed SQL, schema violations, dangerous operations—before queries ever execute.
Correctness Evals compare the model's output against a known correct answer. These include:
Correctness evals require building a test dataset with known answers. This is more work upfront, but it's the only way to truly validate that your model produces accurate results.
According to practical frameworks for LLM system evaluations, the key question when planning your eval suite is: "Which properties matter for my use case, and which require ground truth?"
For analytics, the answer is: you need both. Property evals catch systemic failures; correctness evals catch subtle semantic errors.
Your evaluation dataset is the foundation of your eval suite. It needs to be representative, comprehensive, and maintainable.
Start by categorizing the types of questions your users will ask:
For each category, you need test cases that cover:
For each test case, you need a reference answer. This is the ground truth against which you'll measure your model.
For SQL evaluation, the reference answer is a manually-written SQL query that you've verified by running it against your database and inspecting the results. Have a domain expert (ideally someone who knows your data well) write the reference query. Have a second person review it. Document assumptions and edge cases.
Store these reference answers in version control alongside your test cases. As your schema evolves, update both the test cases and reference answers.
How many test cases do you need? There's no magic number, but consider:
Start small (50 cases), run your eval suite, identify gaps, and expand. You'll likely find that 80% of your test cases cover 20% of the failure modes, and you'll keep adding edge cases as you discover them in production.
Property evals are your first line of defense. They're fast, deterministic, and catch obvious mistakes.
The simplest property eval: does the generated SQL parse?
import sqlparse
def eval_syntax(generated_sql: str) -> dict:
try:
parsed = sqlparse.parse(generated_sql)
if not parsed or not parsed[0].tokens:
return {"passed": False, "reason": "Empty or invalid parse"}
return {"passed": True, "reason": "Valid SQL syntax"}
except Exception as e:
return {"passed": False, "reason": f"Parse error: {str(e)}"}This catches typos, incomplete queries, and malformed SQL before it ever reaches your database.
Does the query reference tables and columns that exist in your schema?
def eval_schema_adherence(generated_sql: str, schema: dict) -> dict:
"""Validate that query references only existing tables/columns."""
parsed = sqlparse.parse(generated_sql)[0]
# Extract table and column references
referenced_tables = extract_tables(parsed)
referenced_columns = extract_columns(parsed)
valid_tables = set(schema.keys())
valid_columns = {col for table in schema.values() for col in table['columns']}
invalid_tables = referenced_tables - valid_tables
invalid_columns = referenced_columns - valid_columns
if invalid_tables or invalid_columns:
return {
"passed": False,
"invalid_tables": list(invalid_tables),
"invalid_columns": list(invalid_columns)
}
return {"passed": True}This prevents queries that reference columns you've renamed or tables that don't exist.
Does the query contain dangerous operations?
DANGEROUS_KEYWORDS = {
'DROP', 'DELETE', 'TRUNCATE', 'ALTER', 'CREATE',
'INSERT', 'UPDATE', 'GRANT', 'REVOKE'
}
def eval_query_safety(generated_sql: str) -> dict:
"""Ensure query doesn't contain write operations or dangerous commands."""
tokens = sqlparse.parse(generated_sql)[0].tokens
dangerous_found = []
for token in tokens:
if token.ttype is sqlparse.tokens.Keyword and token.value.upper() in DANGEROUS_KEYWORDS:
dangerous_found.append(token.value)
if dangerous_found:
return {"passed": False, "dangerous_keywords": dangerous_found}
return {"passed": True}This is a hard boundary—no write operations, ever, in a read-only analytics context.
Will the query be efficient?
def eval_query_efficiency(generated_sql: str, schema: dict) -> dict:
"""Check for common performance anti-patterns."""
warnings = []
parsed = sqlparse.parse(generated_sql)[0]
# Check for full table scans (no WHERE clause)
if 'WHERE' not in generated_sql.upper():
largest_table = max(schema.values(), key=lambda t: t.get('row_count', 0))
if largest_table.get('row_count', 0) > 1_000_000:
warnings.append(f"Large table scan without WHERE clause")
# Check for expensive joins
join_count = generated_sql.upper().count('JOIN')
if join_count > 3:
warnings.append(f"Multiple joins ({join_count}) may be inefficient")
# Check for functions on indexed columns
if 'LOWER(' in generated_sql.upper() or 'UPPER(' in generated_sql.upper():
warnings.append("Functions on columns may prevent index usage")
return {
"passed": len(warnings) == 0,
"warnings": warnings
}These heuristics aren't foolproof—sometimes a full table scan is correct—but they catch obvious inefficiencies.
Property evals catch structural problems. Correctness evals catch semantic errors—queries that parse and execute but return wrong answers.
The most direct correctness eval: does the model's query return the same results as the reference query?
def eval_result_correctness(
generated_sql: str,
reference_sql: str,
db_connection
) -> dict:
"""Execute both queries and compare result sets."""
try:
generated_results = pd.read_sql(generated_sql, db_connection)
reference_results = pd.read_sql(reference_sql, db_connection)
except Exception as e:
return {"passed": False, "reason": f"Execution error: {str(e)}"}
# Normalize column order and names
generated_results.columns = [c.lower() for c in generated_results.columns]
reference_results.columns = [c.lower() for c in reference_results.columns]
# Sort by all columns for deterministic comparison
generated_results = generated_results.sort_values(by=list(generated_results.columns)).reset_index(drop=True)
reference_results = reference_results.sort_values(by=list(reference_results.columns)).reset_index(drop=True)
# Compare
if generated_results.equals(reference_results):
return {"passed": True}
else:
return {
"passed": False,
"reason": "Result sets differ",
"generated_shape": generated_results.shape,
"reference_shape": reference_results.shape,
"sample_diff": identify_differences(generated_results, reference_results)
}This is binary: either the results match or they don't. It's the gold standard for correctness.
For KPI queries, you might accept small numerical differences (due to rounding, floating-point precision, etc.).
def eval_metric_accuracy(
generated_sql: str,
reference_sql: str,
db_connection,
tolerance: float = 0.01 # 1% tolerance
) -> dict:
"""Compare metric values with tolerance for numerical differences."""
try:
generated_result = pd.read_sql(generated_sql, db_connection)
reference_result = pd.read_sql(reference_sql, db_connection)
except Exception as e:
return {"passed": False, "reason": f"Execution error: {str(e)}"}
# Extract metric values (assume single row, single column)
if len(generated_result) != 1 or len(reference_result) != 1:
return {"passed": False, "reason": "Expected single-row results"}
gen_value = float(generated_result.iloc[0, 0])
ref_value = float(reference_result.iloc[0, 0])
# Handle zero values
if ref_value == 0:
if gen_value == 0:
return {"passed": True}
else:
return {"passed": False, "reason": f"Expected 0, got {gen_value}"}
# Calculate percentage difference
pct_diff = abs(gen_value - ref_value) / abs(ref_value)
if pct_diff <= tolerance:
return {"passed": True, "pct_diff": pct_diff}
else:
return {
"passed": False,
"generated_value": gen_value,
"reference_value": ref_value,
"pct_diff": pct_diff,
"tolerance": tolerance
}This allows for acceptable numerical variance while catching major errors.
Sometimes two different queries can both be correct answers to the same question. For example:
SELECT AVG(order_total) FROM ordersSELECT SUM(order_total) / COUNT(*) FROM ordersBoth queries return the same result, but the SQL is different. Evaluating semantic correctness requires either:
For production systems, execution equivalence is most reliable. If two queries return the same results, they're semantically equivalent (for that dataset).
As your analytics platform grows, manual evaluation becomes unsustainable. You need automated frameworks.
Several open-source and commercial evaluation frameworks exist. According to comprehensive comparisons of LLM evaluation frameworks, the most relevant for analytics use cases include:
RAGAS (Retrieval-Augmented Generation Assessment) was originally designed for RAG pipelines, but its metrics translate well to analytics. It evaluates:
While RAGAS focuses on text output, you can adapt its evaluation approach to SQL generation.
TruLens provides a framework for evaluating LLM applications end-to-end. For analytics, you'd use TruLens to:
DeepEval offers a lightweight evaluation framework that integrates with your CI/CD pipeline. You can define custom metrics for your analytics use case and run them automatically on every model update.
Beyond open-source tools, platforms like Humanloop and Arize provide enterprise-grade evaluation infrastructure with monitoring, alerting, and version control for your evaluation datasets.
Let's walk through a concrete example: evaluating a text-to-SQL model for a SaaS analytics platform.
Your platform has:
You create 250 test cases:
For each test case, you have:
Your CI/CD pipeline runs:
Property Evals (instant, before execution):
Correctness Evals (on a test database):
Regression Testing:
Monitoring:
Your evaluation shows:
You use these insights to:
Text-to-SQL is just one analytics use case. Many platforms also use LLMs to summarize query results, generate insights, or explain trends.
Evaluating summarization is harder than evaluating SQL—there's more subjectivity in what makes a "good" summary.
However, you can still implement deterministic evaluations:
Does the summary accurately reflect the data?
def eval_summary_accuracy(
summary: str,
query_results: pd.DataFrame,
reference_summary: str
) -> dict:
"""Check if summary contains factually correct claims about the data."""
# Extract numeric claims from summary
claims = extract_numeric_claims(summary)
# Verify each claim against the data
verified_claims = []
for claim in claims:
metric, expected_value = parse_claim(claim)
actual_value = extract_metric_from_results(metric, query_results)
if is_approximately_equal(actual_value, expected_value, tolerance=0.05):
verified_claims.append({"claim": claim, "verified": True})
else:
verified_claims.append({"claim": claim, "verified": False, "actual": actual_value})
accuracy = sum(1 for c in verified_claims if c["verified"]) / len(verified_claims)
return {
"passed": accuracy > 0.9, # 90% of claims must be accurate
"accuracy": accuracy,
"failed_claims": [c for c in verified_claims if not c["verified"]]
}Does the summary cover the key insights in the data?
def eval_summary_completeness(
summary: str,
query_results: pd.DataFrame,
key_metrics: list
) -> dict:
"""Check if summary mentions important metrics and trends."""
mentioned_metrics = []
missing_metrics = []
for metric in key_metrics:
if metric_mentioned_in_summary(metric, summary):
mentioned_metrics.append(metric)
else:
missing_metrics.append(metric)
coverage = len(mentioned_metrics) / len(key_metrics)
return {
"passed": coverage > 0.8, # 80% of key metrics mentioned
"coverage": coverage,
"mentioned": mentioned_metrics,
"missing": missing_metrics
}For subjective qualities like tone and clarity, you might use another LLM as a judge:
async def eval_summary_quality_with_llm(
summary: str,
query_results: pd.DataFrame,
llm_client
) -> dict:
"""Use an LLM to evaluate summary quality on subjective dimensions."""
prompt = f"""
Evaluate this summary of analytics data on the following dimensions:
1. Clarity: Is the summary easy to understand?
2. Actionability: Does it suggest what to do with the insights?
3. Tone: Is the tone appropriate for a business audience?
Summary: {summary}
Data: {query_results.to_string()}
Respond in JSON format with scores 1-5 for each dimension.
"""
response = await llm_client.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}]
)
scores = json.loads(response.choices[0].message.content)
return {
"passed": all(score >= 4 for score in scores.values()),
"scores": scores
}This approach is less deterministic than SQL evaluation, but it captures important quality dimensions.
Once your eval suite is in place, you need to monitor it continuously. Model performance degrades over time due to:
Track these metrics over time:
def track_eval_metrics(eval_results: list, timestamp: datetime) -> dict:
"""Aggregate evaluation results for monitoring."""
total = len(eval_results)
passed = sum(1 for r in eval_results if r['passed'])
failures_by_category = {}
for result in eval_results:
if not result['passed']:
category = result.get('category', 'unknown')
failures_by_category[category] = failures_by_category.get(category, 0) + 1
return {
"timestamp": timestamp,
"total_tests": total,
"passed": passed,
"pass_rate": passed / total,
"failures_by_category": failures_by_category,
"needs_investigation": passed / total < 0.95
}Set up alerts for:
When an alert fires, your process should be:
As you discover failures in production, add them to your eval suite:
This creates a virtuous cycle: each production failure becomes a permanent test that prevents future regressions.
If you're building analytics features on managed Apache Superset, evaluation frameworks integrate naturally:
For teams building self-serve BI platforms with text-to-SQL capabilities, the evaluation framework becomes part of your platform's quality gates. Every model update is gated on eval suite pass rate before deployment.
Based on practical frameworks for LLM system evaluations and industry experience:
Property evals are fast and deterministic. Implement syntax, schema, and safety checks first. They catch 80% of failures with 5% of the effort.
Don't try to create 500 test cases upfront. Start with 50, run evaluation, identify gaps, and expand. You'll learn what matters as you go.
Treat your eval dataset like code. Version control it. Document changes. When your schema evolves, update your test cases and reference answers together.
Your model shouldn't see your eval test cases during training. Otherwise, you're not measuring generalization.
Evaluation shouldn't stop at deployment. Track eval results for every query in production. Use this data to identify drift and trigger retraining.
A feature isn't done until it passes your eval suite. Eval suite pass rate is a key metric for model quality.
Evaluation at scale requires tooling. Use frameworks like TruLens, RAGAS, or DeepEval instead of building from scratch. These tools handle versioning, monitoring, and reporting.
Evaluating LLMs for analytics is fundamentally different from evaluating general-purpose models. The stakes are higher—incorrect metrics erode trust in your platform. The correctness requirements are stricter—there's often a ground truth. And the use cases are more complex—multi-step pipelines with multiple failure modes.
But this also means evaluation is more tractable. You can define deterministic tests. You can build datasets with known answers. You can measure progress objectively.
The framework outlined here—property evals for structural correctness, correctness evals for semantic accuracy, and continuous monitoring for drift—is battle-tested in production analytics platforms. It catches failures before they reach users. It gives you confidence to ship new models. And it creates a feedback loop that continuously improves your system.
Start with a small eval suite (50-100 test cases). Implement property evals first, then add correctness evals. Integrate evaluation into your CI/CD pipeline. Monitor results in production. Expand your test dataset as you discover edge cases. Over time, you'll build a robust quality gate that keeps your analytics platform reliable and trustworthy.
For teams building on managed platforms like D23, evaluation frameworks integrate seamlessly with your analytics infrastructure. Whether you're evaluating text-to-SQL models, building embedded analytics, or implementing self-serve BI with AI assistance, the principles in this guide apply. Evaluation is how you ensure that AI-powered analytics deliver accurate insights, not just plausible-sounding ones.
The investment in building a comprehensive evaluation suite pays dividends in confidence, reliability, and user trust. It's the difference between shipping an LLM feature and shipping one you're proud to stand behind.