Benchmarks

The Benchmarks component is a testing and evaluation framework for AI agents operating on Timbr’s ontology-based semantic layer.

It allows users to create datasets of test questions with known correct answers and evaluate how effectively agents generate SQL and responses using the virtual knowledge graph. Benchmarks measure accuracy, reasoning, and consistency, providing detailed insights into AI performance across NL2SQL and agent-based workflows.

Benchmarks can be accessed from the main navigation.


Interface Layout

The Benchmarks interface is organised into three tabs:

  • Benchmarks - Create, edit, delete, and run benchmark definitions.
  • Running Benchmarks - Monitor benchmarks that are currently executing.
  • Benchmark History - View completed runs, inspect results, and rerun benchmarks.

All three tabs share a filter bar for narrowing results by user and agent, and a Group by Agent toggle that reorganises the table views hierarchically by agent.


Benchmarks Tab

The Benchmarks tab lists all saved benchmark definitions. Each row shows the assigned agent, the benchmark's name and description, its question count, the last-updated timestamp, and the user who last modified it.

Filters

  • User - Filter benchmark definitions by the user who created or last modified them.
  • Agent - Filter to benchmarks assigned to a specific agent.
  • Group by Agent - Reorganises the table into agent groups.

Actions per Benchmark

Each benchmark row has three action buttons:

  • Edit - Opens the benchmark editor to modify its details or questions.
  • Run - Opens the run configuration dialog to execute the benchmark.
  • Delete - Permanently removes the benchmark definition.

Creating a Benchmark

Click New Benchmark in the top-right corner to open the multi-step creation wizard.

Step 1: Details

  • Benchmark Name (required) - A unique identifier for the benchmark. Cannot be changed after creation.
  • Agent (required) - The AI agent this benchmark will test. Cannot be changed after creation.
  • Description (optional) - A description of what the benchmark is testing.

Step 2: Questions

The Questions step lists all test questions in the benchmark. Questions can be added, edited, or deleted individually. Use the Import button to bulk-import questions from a JSON or CSV file.

Step 3: Question Form

Each individual question has the following fields:

  • Key (required) - A unique identifier for the question (e.g., Q1, Q2). Auto-incremented if left blank.
  • Question (required) - The natural language test question.
  • Correct SQL (optional) - The expected SQL query that the agent should generate.
  • Expected Answer (optional) - The known correct answer to the question.
  • Correct Concept (optional) - The ontology concept or view the agent should select when answering.
  • Correct Ontology (optional) - The ontology the agent should target. Filtered by the benchmark's selected agent.

Click Save to store the benchmark.

Importing Questions

The Import button on the Questions step opens the import dialog. Questions can be imported from a JSON or CSV file by uploading a file or pasting content directly.

JSON format - An object where each key is a question ID and the value contains question fields:

{
  "Q1": {
    "question": "How many customers are there?",
    "correct_sql": "SELECT COUNT(*) FROM customer",
    "expected_answer": "42"
  }
}
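
The remaining question-form fields can be supplied in the same way. The sketch below assumes the keys correct_concept and correct_ontology mirror the Correct Concept and Correct Ontology form fields, and the concept and ontology values are purely illustrative; verify the exact key names against your deployment:

{
  "Q1": {
    "question": "How many customers are there?",
    "correct_sql": "SELECT COUNT(*) FROM customer",
    "expected_answer": "42",
    "correct_concept": "customer",
    "correct_ontology": "sales_ontology"
  }
}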

CSV format - A header row followed by one question per row. The question column is required; all other columns are optional:

key,question,correct_sql,expected_answer
Q1,How many customers?,SELECT COUNT(*) FROM customer,42
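
A CSV file can carry the same optional columns. As in the JSON sketch above, the correct_concept and correct_ontology column names are assumed to mirror the form fields, and the values are illustrative:

key,question,correct_sql,expected_answer,correct_concept,correct_ontology
Q1,How many customers?,SELECT COUNT(*) FROM customer,42,customer,sales_ontology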

Import modes:

  • Replace All - Removes all existing questions and replaces them with the imported set.
  • Merge - Adds imported questions alongside existing ones; duplicate keys are auto-suffixed.

Running a Benchmark

Click Run on any benchmark to open the run configuration dialog.

Run Configuration

  • Benchmark Mechanism (Deterministic / LLM Judge / Full; default: Full) - Determines how answers are scored. Deterministic uses algorithmic comparison of SQL, answers, ontology, and concepts. LLM Judge uses an AI model to evaluate correctness. Full applies both methods.
  • Benchmark Execution (Generate SQL Only / Full Execution; default: Full Execution) - Generate SQL Only produces SQL without running it. Full Execution generates and runs the SQL query.
  • Number of Iterations (1–10; default: 1) - How many times each question is executed. Running multiple iterations helps identify inconsistent results.

Click Run to start the benchmark. The dialog closes immediately and execution begins in the background. Progress can be followed in the Running Benchmarks tab.

Background execution

Benchmarks run as detached background processes. The number of benchmark runs that can execute in parallel is capped at a configurable maximum. The user who started a run does not need to remain on the page for it to complete.


Running Benchmarks Tab

The Running Benchmarks tab shows all benchmarks that are currently in progress.

Each row displays the benchmark name, assigned agent, the user who initiated the run, the start time, the total question count, and how many questions have completed so far.

Use the Refresh button to fetch the latest status.


Benchmark History Tab

The Benchmark History tab shows all completed benchmark runs.

Each row displays the benchmark name, agent, the user who ran it, start and end times, duration, question count, the number of correct and incorrect answers, the overall correct-rate percentage, and the LLM type and model used.

Filters

  • User - Filter by the user who ran the benchmark.
  • Agent - Filter to runs for a specific agent.
  • Limit - Set the maximum number of history records to display (100 / 200 / 500 / 1000).

Actions per Run

Each history row has two action buttons:

  • View - Opens the detailed results modal for that run.
  • Rerun - Opens the run configuration dialog pre-filled with the original settings, with an additional option to filter which questions to rerun.

Rerun Question Filter

When rerunning a benchmark, an additional Questions to Rerun setting is available:

  • All Questions - Reruns the entire benchmark.
  • Failed Only - Reruns only questions that were scored as incorrect.
  • Inconsistent Only - Reruns only questions that produced inconsistent results across iterations.

Viewing Results

Click View on any history row to open the detailed results modal.

View Mode

Results can be inspected in two modes, toggled at the top of the modal:

  • Table - A structured view with one row per question and expandable sections for reasoning and scoring details.
  • JSON - A raw hierarchical view of the full result object, useful for technical inspection.

Table View Columns

  • Test Input - Question key and question text. Long questions can be expanded with "Read Full".
  • Execution Result - Status badge (correct / incorrect), schema match indicator, and the generated answer text.
  • Selected Schema & SQL - The ontology and concept/entity the agent selected, with a warning icon if the selection was incorrect. A View SQL button opens the generated SQL in a read-only editor.
  • AI Reasoning - The agent's reasoning status badge, and expandable sections for Concept Reason (how the concept was identified) and SQL Reason (how the SQL was generated).
  • Scoring Audit - The scoring method used, tokens consumed, and expandable sections for Score Reasoning and Score Breakdown (key–value detail of how the score was calculated).
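
When inspecting a run in the JSON view (or a JSON export), each question's entry carries roughly the same information as the columns above. The shape below is only an illustrative sketch; the field names, nesting, and values of the actual result object may differ:

{
  "key": "Q1",
  "question": "How many customers are there?",
  "status": "correct",
  "generated_answer": "There are 42 customers.",
  "selected_ontology": "sales_ontology",
  "selected_concept": "customer",
  "generated_sql": "SELECT COUNT(*) FROM customer",
  "reasoning": {
    "concept_reason": "...",
    "sql_reason": "..."
  },
  "scoring_audit": {
    "method": "deterministic",
    "tokens": 1350,
    "score_reasoning": "...",
    "score_breakdown": {
      "sql_match": true,
      "answer_match": true
    }
  }
}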

Exporting Results

Click Export in the results modal to download the run results in one of two formats:

  • JSON - The full result object including metadata and all question outcomes.
  • CSV - A tabular export with one row per question and a summary appended at the end.