Benchmarks
The Benchmarks component is a testing and evaluation framework for AI agents operating on Timbr’s ontology-based semantic layer.
It allows users to create datasets of test questions with known correct answers and evaluate how effectively agents generate SQL and responses using the virtual knowledge graph. Benchmarks measure accuracy, reasoning, and consistency, providing detailed insights into AI performance across NL2SQL and agent-based workflows.
Benchmarks can be accessed from the main navigation.
Interface Layout
The Benchmarks interface is organised into three tabs:
| Tab | Purpose |
|---|---|
| Benchmarks | Create, edit, delete, and run benchmark definitions |
| Running Benchmarks | Monitor benchmarks that are currently executing |
| Benchmark History | View completed runs, inspect results, and rerun benchmarks |
All three tabs share a common filter bar for narrowing results by user and agent, as well as a Group by Agent toggle that reorganises each table into a hierarchical, per-agent view.
Benchmarks Tab
The Benchmarks tab lists all saved benchmark definitions. Each row shows the agent it is assigned to, its name, description, question count, last updated timestamp, and the user who last modified it.
Filters
- User - Filter benchmark definitions by the user who created or last modified them.
- Agent - Filter to benchmarks assigned to a specific agent.
- Group by Agent - Reorganises the table into agent groups.
Actions per Benchmark
Each benchmark row has three action buttons:
- Edit - Opens the benchmark editor to modify its details or questions.
- Run - Opens the run configuration dialog to execute the benchmark.
- Delete - Permanently removes the benchmark definition.
Creating a Benchmark
Click New Benchmark in the top-right corner to open the multi-step creation wizard.
Step 1: Details
| Field | Required | Description |
|---|---|---|
| Benchmark Name | ✓ | A unique identifier for the benchmark. Cannot be changed after creation. |
| Agent | ✓ | The AI agent this benchmark will test. Cannot be changed after creation. |
| Description | - | Optional description of what the benchmark is testing. |
Step 2: Questions
The Questions step lists all test questions in the benchmark. Questions can be added individually, and each existing question can be edited or deleted. Use the Import button to bulk-import questions from a JSON or CSV file.
Step 3: Question Form
Each individual question has the following fields:
| Field | Required | Description |
|---|---|---|
| Key | ✓ | A unique identifier for the question (e.g., Q1, Q2). Auto-incremented if left blank. |
| Question | ✓ | The natural language test question. |
| Correct SQL | - | The expected SQL query that the agent should generate. |
| Expected Answer | - | The known correct answer to the question. |
| Correct Concept | - | The ontology concept or view the agent should select when answering. |
| Correct Ontology | - | The ontology the agent should target. The available options are filtered to the ontologies accessible to the benchmark's selected agent. |
Click Save to store the benchmark.
Importing Questions
The Import button on the Questions step opens the import dialog. Questions can be imported from a JSON or CSV file by uploading a file or pasting content directly.
JSON format - An object where each key is a question ID and the value contains question fields:
```json
{
  "Q1": {
    "question": "How many customers are there?",
    "correct_sql": "SELECT COUNT(*) FROM customer",
    "expected_answer": "42"
  }
}
```
CSV format - A header row followed by one question per row. The question column is required; all other columns are optional:
```csv
key,question,correct_sql,expected_answer
Q1,How many customers?,SELECT COUNT(*) FROM customer,42
```
Import modes:
| Mode | Behaviour |
|---|---|
| Replace All | Removes all existing questions and replaces them with the imported set |
| Merge | Adds imported questions alongside existing ones; duplicate keys are auto-suffixed |
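Both representations carry the same fields, so question sets can be prepared in either one. As a minimal sketch (assuming the column names shown above and a hypothetical `questions.csv` file), the following Python snippet converts a CSV file into the JSON import structure, which can then be pasted into the import dialog:

```python
import csv
import json

def csv_to_benchmark_json(csv_path: str) -> dict:
    """Convert a benchmark question CSV into the JSON import structure.

    Assumes the columns shown above (key, question, correct_sql,
    expected_answer); rows with no key are numbered Q1, Q2, ... by position.
    """
    questions = {}
    with open(csv_path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f), start=1):
            key = (row.get("key") or f"Q{i}").strip()
            questions[key] = {
                "question": row["question"],                        # required
                "correct_sql": row.get("correct_sql", ""),          # optional
                "expected_answer": row.get("expected_answer", ""),  # optional
            }
    return questions

if __name__ == "__main__":
    # "questions.csv" is a hypothetical input file; paste the printed JSON
    # into the import dialog, or save it to a file and upload it.
    print(json.dumps(csv_to_benchmark_json("questions.csv"), indent=2))
```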
Running a Benchmark
Click Run on any benchmark to open the run configuration dialog.
Run Configuration
| Setting | Options | Default | Description |
|---|---|---|---|
| Benchmark Mechanism | Deterministic / LLM Judge / Full | Full | Determines how answers are scored. Deterministic uses algorithmic comparison of SQL, answers, ontology, and concepts. LLM Judge uses an AI model to evaluate correctness. Full applies both methods. |
| Benchmark Execution | Generate SQL Only / Full Execution | Full Execution | Generate SQL Only produces SQL without running it. Full Execution generates and runs the SQL query. |
| Number of Iterations | 1–10 | 1 | How many times each question is executed. Running multiple iterations helps identify inconsistent results. |
Click Run to start the benchmark. The dialog closes immediately and execution begins in the background. Progress can be followed in the Running Benchmarks tab.
Benchmarks run as detached background processes, and the platform enforces a configurable maximum number of concurrent benchmark runs. The user who started the run does not need to remain on the page for it to complete.
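The exact scoring logic is internal to the platform, but as a rough, hypothetical illustration of the kind of algorithmic comparison the Deterministic mechanism performs, the sketch below normalises whitespace and case before comparing the generated SQL and answer against the stored expectations. The function and field names are illustrative, not Timbr APIs.

```python
import re

def normalise(text: str) -> str:
    """Collapse whitespace and lowercase so cosmetic differences don't break a match."""
    return re.sub(r"\s+", " ", text.strip()).lower()

def deterministic_score(generated_sql: str, generated_answer: str,
                        correct_sql: str, expected_answer: str) -> dict:
    """Illustrative deterministic check: exact match after normalisation.

    The platform's actual comparison also covers ontology and concept
    selection and is more sophisticated than a plain string match.
    """
    return {
        "sql_match": normalise(generated_sql) == normalise(correct_sql),
        "answer_match": normalise(generated_answer) == normalise(expected_answer),
    }

# Casing and extra whitespace are ignored, so this still scores as a match.
print(deterministic_score(
    "select count(*)  from customer", "42",
    "SELECT COUNT(*) FROM customer", "42",
))  # {'sql_match': True, 'answer_match': True}
```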
Running Benchmarks Tab
The Running Benchmarks tab shows all benchmarks that are currently in progress.
Each row displays the benchmark name, assigned agent, the user who initiated the run, the start time, the total question count, and how many questions have completed so far.
Use the Refresh button to poll for the latest status.
Benchmark History Tab
The Benchmark History tab shows all completed benchmark runs.
Each row displays the benchmark name, agent, the user who ran it, start and end times, duration, question count, the number of correct and incorrect answers, the overall correct-rate percentage, and the LLM type and model used.
Filters
- User - Filter by the user who ran the benchmark.
- Agent - Filter to runs for a specific agent.
- Limit - Set the maximum number of history records to display (100 / 200 / 500 / 1000).
Actions per Run
Each history row has two action buttons:
- View - Opens the detailed results modal for that run.
- Rerun - Opens the run configuration dialog pre-filled with the original settings, with an additional option to filter which questions to rerun.
Rerun Question Filter
When rerunning a benchmark, an additional Questions to Rerun setting is available:
| Option | Description |
|---|---|
| All Questions | Reruns the entire benchmark |
| Failed Only | Reruns only questions that were scored as incorrect |
| Inconsistent Only | Reruns only questions that produced inconsistent results across iterations |
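To make the distinction between Failed Only and Inconsistent Only concrete, the hypothetical sketch below classifies question keys from their per-iteration outcomes. The boolean-per-iteration input is a simplification of the real result structure, and "failed" is taken here to mean incorrect in every iteration.

```python
def classify_questions(outcomes: dict[str, list[bool]]) -> dict[str, list[str]]:
    """Classify question keys by their per-iteration correctness.

    `outcomes` maps each question key to one boolean per iteration
    (True = scored correct); a simplified stand-in for the real results.
    """
    failed = [k for k, runs in outcomes.items() if not any(runs)]
    inconsistent = [k for k, runs in outcomes.items() if any(runs) and not all(runs)]
    return {"failed": failed, "inconsistent": inconsistent}

# Q1 always passed, Q2 always failed, Q3 flipped between iterations.
print(classify_questions({
    "Q1": [True, True, True],
    "Q2": [False, False, False],
    "Q3": [True, False, True],
}))
# {'failed': ['Q2'], 'inconsistent': ['Q3']}
```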
Viewing Results
Click View on any history row to open the detailed results modal.
View Mode
Results can be inspected in two modes, toggled at the top of the modal:
- Table - A structured view with one row per question and expandable sections for reasoning and scoring details.
- JSON - A raw hierarchical view of the full result object, useful for technical inspection.
Table View Columns
| Column | Contents |
|---|---|
| Test Input | Question key and question text. Long questions can be expanded with "Read Full". |
| Execution Result | Status badge (correct / incorrect), schema match indicator, and the generated answer text. |
| Selected Schema & SQL | The ontology and concept/entity the agent selected, with a warning icon if the selection was incorrect. A View SQL button opens the generated SQL in a read-only editor. |
| AI Reasoning | The agent's reasoning status badge, and expandable sections for Concept Reason (how the concept was identified) and SQL Reason (how the SQL was generated). |
| Scoring Audit | The scoring method used, tokens consumed, and expandable sections for Score Reasoning and Score Breakdown (key–value detail of how the score was calculated). |
Exporting Results
Click Export in the results modal to download the run results in one of two formats:
- JSON - The full result object including metadata and all question outcomes.
- CSV - A tabular export with one row per question and a summary appended at the end.
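Exported files can also be post-processed outside the platform. As a minimal sketch, the snippet below recomputes an overall correct rate from an exported JSON file; it assumes a simplified export shape with a top-level results list of per-question correct flags, and the real export's field names may differ.

```python
import json

def summarise_export(path: str) -> dict:
    """Recompute a summary from an exported benchmark run.

    Assumes a simplified export shape:
    {"results": [{"key": "Q1", "correct": true}, ...]}
    The real export's field names may differ.
    """
    with open(path, encoding="utf-8") as f:
        run = json.load(f)
    results = run.get("results", [])
    correct = sum(1 for r in results if r.get("correct"))
    return {
        "questions": len(results),
        "correct": correct,
        "correct_rate_pct": round(100 * correct / len(results), 1) if results else 0.0,
    }

# "benchmark_run.json" is a hypothetical exported file name.
print(summarise_export("benchmark_run.json"))
```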