Benchmarks
The Benchmarks component is a testing and evaluation framework for AI agents operating on Timbr’s ontology-based semantic layer.
It allows users to create datasets of test questions with known correct answers and evaluate how effectively agents generate SQL and responses using the virtual knowledge graph. Benchmarks measure accuracy, reasoning, and consistency, providing detailed insights into AI performance across NL2SQL and agent-based workflows.
Benchmarks can be accessed from the main navigation.
Interface Layout
The Benchmarks interface is organised into three tabs:
| Tab | Purpose |
|---|---|
| Benchmarks | Create, edit, delete, and run benchmark definitions |
| Running Benchmarks | Monitor benchmarks that are currently executing |
| Benchmark History | View completed runs, inspect results, and rerun benchmarks |
All three tabs share a common filter bar for narrowing results by user and agent, as well as a Group by Agent toggle that reorganises each table into a hierarchical, per-agent view.
Benchmarks Tab
The Benchmarks tab lists all saved benchmark definitions. Each row shows the agent it is assigned to, its name, description, question count, last updated timestamp, and the user who last modified it.
Filters
- User - Filter benchmark definitions by the user who created or last modified them.
- Agent - Filter to benchmarks assigned to a specific agent.
- Group by Agent - Reorganises the table into agent groups.
Actions per Benchmark
Each benchmark row has three action buttons:
- Edit - Opens the benchmark editor to modify its details or questions.
- Run - Opens the run configuration dialog to execute the benchmark.
- Delete - Permanently removes the benchmark definition.
Creating a Benchmark
Click New Benchmark in the top-right corner to open the multi-step creation wizard.
Step 1: Details
| Field | Required | Description |
|---|---|---|
| Benchmark Name | ✓ | A unique identifier for the benchmark. Cannot be changed after creation. |
| Agent | ✓ | The AI agent this benchmark will test. Cannot be changed after creation. |
| Description | - | Optional description of what the benchmark is testing. |
Step 2: Questions
The Questions step lists all test questions in the benchmark. Questions can be added individually, and each existing question can be edited or deleted. Use the Import button to bulk-import questions from a JSON or CSV file.
Step 3: Question Form
Each individual question has the following fields:
| Field | Required | Description |
|---|---|---|
| Key | ✓ | A unique identifier for the question (e.g., Q1, Q2). Auto-incremented if left blank. |
| Question | ✓ | The natural language test question. |
| Correct SQL | - | The expected SQL query that the agent should generate. |
| Expected Answer | - | The known correct answer to the question. |
| Correct Concept | - | The ontology concept or view the agent should select when answering. |
| Correct Ontology | - | The ontology the agent should target. The available options are filtered to the ontologies accessible to the benchmark's selected agent. |
Click Save to store the benchmark.
Importing Questions
The Import button on the Questions step opens the import dialog. Questions can be imported from a JSON or CSV file by uploading a file or pasting content directly.
JSON format - An object where each key is a question ID and the value contains question fields:
```json
{
  "Q1": {
    "question": "How many customers are there?",
    "correct_sql": "SELECT COUNT(*) FROM customer",
    "expected_answer": "42"
  }
}
```
CSV format - A header row followed by one question per row. The question column is required; all other columns are optional:
```csv
key,question,correct_sql,expected_answer
Q1,How many customers?,SELECT COUNT(*) FROM customer,42
```
Import modes:
| Mode | Behaviour |
|---|---|
| Replace All | Removes all existing questions and replaces them with the imported set |
| Merge | Adds imported questions alongside existing ones; duplicate keys are auto-suffixed |
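Both representations carry the same fields, so question sets can be prepared in either one. As a minimal sketch (assuming the column names shown above and a hypothetical `questions.csv` file), the following Python snippet converts a CSV file into the JSON import structure, which can then be pasted into the import dialog:

```python
import csv
import json

def csv_to_benchmark_json(csv_path: str) -> dict:
    """Convert a benchmark question CSV into the JSON import structure.

    Assumes the columns shown above (key, question, correct_sql,
    expected_answer); rows with no key are numbered Q1, Q2, ... by position.
    """
    questions = {}
    with open(csv_path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f), start=1):
            key = (row.get("key") or f"Q{i}").strip()
            questions[key] = {
                "question": row["question"],                        # required
                "correct_sql": row.get("correct_sql", ""),          # optional
                "expected_answer": row.get("expected_answer", ""),  # optional
            }
    return questions

if __name__ == "__main__":
    # "questions.csv" is a hypothetical input file; paste the printed JSON
    # into the import dialog, or save it to a file and upload it.
    print(json.dumps(csv_to_benchmark_json("questions.csv"), indent=2))
```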
Running a Benchmark
Click Run on any benchmark to open the run configuration dialog.
Run Configuration
| Setting | Options | Default | Description |
|---|---|---|---|
| Benchmark Mechanism | Deterministic / LLM Judge / Full | Full | Determines how answers are scored. Deterministic uses algorithmic comparison of SQL, answers, ontology, and concepts. LLM Judge uses an AI model to evaluate correctness. Full applies both methods. |
| Benchmark Execution | Generate SQL Only / Full Execution | Full Execution | Generate SQL Only produces SQL without running it. Full Execution generates and runs the SQL query. |
| Number of Iterations | 1–10 | 1 | How many times each question is executed. Running multiple iterations helps identify inconsistent results. |
Click Run to start the benchmark. The dialog closes immediately and execution begins in the background. Progress can be followed in the Running Benchmarks tab.
Benchmarks run as detached background processes, and the platform enforces a configurable maximum number of concurrent benchmark runs. The user who started the run does not need to remain on the page for it to complete.
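The exact scoring logic is internal to the platform, but as a rough, hypothetical illustration of the kind of algorithmic comparison the Deterministic mechanism performs, the sketch below normalises whitespace and case before comparing the generated SQL and answer against the stored expectations. The function and field names are illustrative, not Timbr APIs.

```python
import re

def normalise(text: str) -> str:
    """Collapse whitespace and lowercase so cosmetic differences don't break a match."""
    return re.sub(r"\s+", " ", text.strip()).lower()

def deterministic_score(generated_sql: str, generated_answer: str,
                        correct_sql: str, expected_answer: str) -> dict:
    """Illustrative deterministic check: exact match after normalisation.

    The platform's actual comparison also covers ontology and concept
    selection and is more sophisticated than a plain string match.
    """
    return {
        "sql_match": normalise(generated_sql) == normalise(correct_sql),
        "answer_match": normalise(generated_answer) == normalise(expected_answer),
    }

# Casing and extra whitespace are ignored, so this still scores as a match.
print(deterministic_score(
    "select count(*)  from customer", "42",
    "SELECT COUNT(*) FROM customer", "42",
))  # {'sql_match': True, 'answer_match': True}
```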
Running Benchmarks Tab
The Running Benchmarks tab shows all benchmarks that are currently in progress.
Each row displays the benchmark name, assigned agent, the user who initiated the run, the start time, the total question count, and how many questions have completed so far.
Use the Refresh button to poll for the latest status.
Benchmark History Tab
The Benchmark History tab shows all completed benchmark runs.
Each row displays the benchmark name, agent, the user who ran it, start and end times, duration, question count, the number of correct and incorrect answers, the overall correct-rate percentage, and the LLM type and model used.
Filters
- User - Filter by the user who ran the benchmark.
- Agent - Filter to runs for a specific agent.
- Limit - Set the maximum number of history records to display (100 / 200 / 500 / 1000).
Actions per Run
Each history row has two action buttons:
- View - Opens the detailed results modal for that run.
- Rerun - Opens the run configuration dialog pre-filled with the original settings, with an additional option to filter which questions to rerun.
Rerun Question Filter
When rerunning a benchmark, an additional Questions to Rerun setting is available:
| Option | Description |
|---|---|
| All Questions | Reruns the entire benchmark |
| Failed Only | Reruns only questions that were scored as incorrect |
| Inconsistent Only | Reruns only questions that produced inconsistent results across iterations |
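To make the distinction between Failed Only and Inconsistent Only concrete, the hypothetical sketch below classifies question keys from their per-iteration outcomes. The boolean-per-iteration input is a simplification of the real result structure, and "failed" is taken here to mean incorrect in every iteration.

```python
def classify_questions(outcomes: dict[str, list[bool]]) -> dict[str, list[str]]:
    """Classify question keys by their per-iteration correctness.

    `outcomes` maps each question key to one boolean per iteration
    (True = scored correct); a simplified stand-in for the real results.
    """
    failed = [k for k, runs in outcomes.items() if not any(runs)]
    inconsistent = [k for k, runs in outcomes.items() if any(runs) and not all(runs)]
    return {"failed": failed, "inconsistent": inconsistent}

# Q1 always passed, Q2 always failed, Q3 flipped between iterations.
print(classify_questions({
    "Q1": [True, True, True],
    "Q2": [False, False, False],
    "Q3": [True, False, True],
}))
# {'failed': ['Q2'], 'inconsistent': ['Q3']}
```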
Viewing Results
Click View on any history row to open the detailed results modal.
View Mode
Results can be inspected in two modes, toggled at the top of the modal:
- Table - A structured view with one row per question and expandable sections for reasoning and scoring details.
- JSON - A raw hierarchical view of the full result object, useful for technical inspection.
Table View Columns
| Column | Contents |
|---|---|
| Test Input | Question key and question text. Long questions can be expanded with "Read Full". |
| Execution Result | Status badge (correct / incorrect), schema match indicator, and the generated answer text. |
| Selected Schema & SQL | The ontology and concept/entity the agent selected, with a warning icon if the selection was incorrect. A View SQL button opens the generated SQL in a read-only editor. |
| AI Reasoning | The agent's reasoning status badge, and expandable sections for Concept Reason (how the concept was identified) and SQL Reason (how the SQL was generated). |
| Scoring Audit | The scoring method used, tokens consumed, and expandable sections for Score Reasoning and Score Breakdown (key–value detail of how the score was calculated). |
Exporting Results
Click Export in the results modal to download the run results in one of two formats:
- JSON - The full result object including metadata and all question outcomes.
- CSV - A tabular export with one row per question and a summary appended at the end.
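Exported files can also be post-processed outside the platform. As a minimal sketch, the snippet below recomputes an overall correct rate from an exported JSON file; it assumes a simplified export shape with a top-level results list of per-question correct flags, and the real export's field names may differ.

```python
import json

def summarise_export(path: str) -> dict:
    """Recompute a summary from an exported benchmark run.

    Assumes a simplified export shape:
    {"results": [{"key": "Q1", "correct": true}, ...]}
    The real export's field names may differ.
    """
    with open(path, encoding="utf-8") as f:
        run = json.load(f)
    results = run.get("results", [])
    correct = sum(1 for r in results if r.get("correct"))
    return {
        "questions": len(results),
        "correct": correct,
        "correct_rate_pct": round(100 * correct / len(results), 1) if results else 0.0,
    }

# "benchmark_run.json" is a hypothetical exported file name.
print(summarise_export("benchmark_run.json"))
```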