Metrics and Mathematical Foundations
This section describes the mathematical logic behind the framework’s core scoring methods.
Semantic search in the strict semantic-match bot
The strict semantic-match bot uses a lightweight lexical proxy for semantic retrieval. For each question, the system constructs a token set after lower-casing and stripping light punctuation.
Let \(T(q)\) denote the token set produced this way. For an input question \(q\) and a candidate FAQ question \(q_i\), the system builds binary vectors over the union vocabulary

\[
V = T(q) \cup T(q_i) = \{v_1, \dots, v_n\},
\]

and then assigns

\[
x_j = \mathbb{1}\big[v_j \in T(q)\big], \qquad y_j = \mathbb{1}\big[v_j \in T(q_i)\big], \qquad j = 1, \dots, n.
\]
The similarity score is cosine similarity:

\[
\mathrm{sim}(q, q_i) = \frac{x \cdot y}{\lVert x \rVert_2 \, \lVert y \rVert_2},
\]
where \(x \cdot y\) is the dot product and \(\lVert x \rVert_2\) is the Euclidean norm. The bot selects the FAQ row with the maximum cosine similarity and returns its stored answer.
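For binary vectors over a union vocabulary, the dot product reduces to the size of the token-set intersection, so the whole scorer fits in a few lines. A minimal sketch of this lexical proxy follows; the helper names and the punctuation-stripping regex are illustrative, not the framework's actual API:

```python
import math
import re

def tokens(text):
    """Lower-case, strip light punctuation, and split into a token set."""
    return set(re.sub(r"[^\w\s]", " ", text.lower()).split())

def cosine_sim(q, q_i):
    """Cosine similarity of binary token vectors over the union vocabulary.

    For binary vectors, x . y = |T(q) & T(q_i)| and ||x||_2 = sqrt(|T(q)|).
    """
    tq, ti = tokens(q), tokens(q_i)
    if not tq or not ti:
        return 0.0
    return len(tq & ti) / (math.sqrt(len(tq)) * math.sqrt(len(ti)))

def best_match(question, faq):
    """Return the stored answer of the FAQ row with maximal similarity."""
    return max(faq, key=lambda row: cosine_sim(question, row[0]))[1]
```

For example, `best_match("password reset help", faq)` selects the FAQ row whose question shares the most vocabulary with the input, relative to both token-set sizes.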
This method is computationally simple and highly interpretable. It does not learn a semantic embedding space, but it approximates semantic relatedness when paraphrases share important vocabulary.
Full-context prompting
The full-context bot does not retrieve a single FAQ row. Instead it concatenates the entire domain-knowledge document into one prompt template and asks a generative model to answer.
Formally, the model receives a prompt

\[
p = f(k, q),
\]
where \(k\) is the domain-knowledge text, \(q\) is the user question, and \(f\) is the prompt template function. The model then samples or decodes an answer

\[
\hat{a} = M(p),
\]
where \(M\) is the configured LLM backend. This approach trades retrieval precision for contextual completeness.
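A template function \(f(k, q)\) can be as simple as string interpolation. The sketch below is a hypothetical template, not the framework's actual prompt wording:

```python
def build_prompt(knowledge, question):
    """Hypothetical prompt template f(k, q): inline the full domain
    document ahead of the user question."""
    return (
        "Use only the following domain knowledge to answer.\n\n"
        f"--- DOMAIN KNOWLEDGE ---\n{knowledge}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The resulting string is passed unchanged to the configured LLM backend, so prompt length grows linearly with the size of the domain document.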
Deterministic metrics
Exact match
Exact match is a binary indicator after normalization. If \(\hat{a}\) is the generated answer and \(a^*\) is the expected answer, then

\[
\mathrm{EM}(\hat{a}, a^*) = \mathbb{1}\big[N(\hat{a}) = N(a^*)\big],
\]
where \(N(\cdot)\) lower-cases text, trims leading and trailing spaces, and collapses repeated whitespace.
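The normalization \(N(\cdot)\) and the indicator compose into two one-line functions. A minimal sketch, with illustrative names:

```python
import re

def normalize(text):
    """N(.): lower-case, trim, and collapse repeated whitespace."""
    return re.sub(r"\s+", " ", text.lower().strip())

def exact_match(generated, expected):
    """Binary indicator: 1 if normalized answers are identical, else 0."""
    return int(normalize(generated) == normalize(expected))
```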
Keyword recall
The framework computes a token-set recall over the expected answer. Let \(E\) be the token set of the expected answer and \(G\) be the token set of the generated answer. Then

\[
\mathrm{recall}(\hat{a}, a^*) = \frac{|E \cap G|}{|E|}
\]
when \(|E| > 0\), and \(0\) otherwise.
This score rewards answers that preserve expected factual content, even when the wording is not identical.
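The recall formula maps directly onto Python set operations. A minimal sketch, assuming simple whitespace tokenization for brevity:

```python
def keyword_recall(generated, expected):
    """Token-set recall |E & G| / |E|; 0 when the expected set is empty."""
    E = set(expected.lower().split())
    G = set(generated.lower().split())
    return len(E & G) / len(E) if E else 0.0
```

Note the asymmetry: extra tokens in the generated answer are not penalized, which is why a verbose but factually complete answer can still score 1.0.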
Answer length
Answer length is reported as a simple scalar

\[
\mathrm{len}(\hat{a}) = |\hat{a}|,
\]

the character count of the generated answer.
This metric is not a quality score by itself. Instead it acts as a communication descriptor that can reveal truncation, verbosity, or excessively terse answers.
Politeness heuristic
The politeness metric is a lightweight communication heuristic. Let \(m_1, \dots, m_k\) be a small set of politeness markers such as “please” or “happy to help”. The metric counts how many markers appear in the answer, then scales and clips the result:

\[
\mathrm{polite}(\hat{a}) = \min\!\left(1,\; \alpha \sum_{j=1}^{k} \mathbb{1}\big[m_j \in \hat{a}\big]\right),
\]

where \(\alpha > 0\) is a fixed scaling factor.
This metric is intentionally simple. It provides a rough communication-quality signal, not a linguistic model of tone.
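A sketch of the count-scale-clip pattern follows; the marker list and the scaling factor are illustrative assumptions, not the framework's configured values:

```python
def politeness(answer,
               markers=("please", "thank you", "happy to help"),
               scale=0.5):
    """Count politeness markers in the answer, scale, and clip to [0, 1].

    The marker list and scale factor here are illustrative defaults.
    """
    count = sum(1 for m in markers if m in answer.lower())
    return min(1.0, count * scale)
```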
Latency
The evaluator records end-to-end latency for one bot call:

\[
\mathrm{latency} = t_{\text{end}} - t_{\text{start}}.
\]
The result is reported in milliseconds. This is an operational-performance metric rather than a semantic-quality metric.
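Wall-clock timing of a single call can be sketched with a monotonic clock; the wrapper name is illustrative:

```python
import time

def timed_call(bot_fn, question):
    """Measure end-to-end latency of one bot call in milliseconds,
    using a monotonic clock so system clock adjustments do not skew it."""
    start = time.perf_counter()
    answer = bot_fn(question)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return answer, latency_ms
```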
LLM-as-a-judge metrics
Some evaluation properties are difficult to encode as a deterministic formula. Relevance, faithfulness, safety, and robustness often require holistic judgment over the relation between:
the input question,
the expected answer, and
the generated answer.
For those cases, the framework constructs a judge prompt

\[
p_{\text{judge}} = g\big(q,\; a^*,\; \hat{a}\big),
\]

where \(g\) is the judge prompt template,
and asks a judge model to return structured JSON containing a score and a rationale. If the judge returns
{"score": s, "reason": r}
then the metric score is simply \(s\), while \(r\) is stored as supporting detail.
This turns qualitative review into a repeatable programmatic step. The output is still model-based and therefore not purely objective, but it is auditable because the framework can persist the judge output and, when available, the judge reasoning trace.
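Making the judge output programmatic hinges on strict parsing of the returned JSON. A minimal sketch, assuming scores are normalized to \([0, 1]\) (an assumption; the framework's actual scale may differ):

```python
import json

def parse_judge(raw):
    """Parse the judge model's JSON reply into (score, reason).

    Assumes a {"score": s, "reason": r} payload with s in [0, 1];
    both the range check and the helper name are illustrative.
    """
    obj = json.loads(raw)
    score = float(obj["score"])
    reason = str(obj.get("reason", ""))
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"judge score out of range: {score}")
    return score, reason
```

Persisting both the parsed score and the raw `reason` string per row is what makes the judge step auditable after the fact.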
Interpreting metric families together
No single metric is sufficient. A business-facing evaluation should inspect several axes at once:
Correctness: exact match and keyword recall.
Communication: answer length and politeness.
Operations: latency.
Semantic review: judge scores for relevance, faithfulness, safety, and robustness.
The framework therefore saves each metric at row level so that teams can diagnose disagreements between metric families. For example, a response may score low on exact match but high on judge-based relevance if it uses a paraphrase that preserves meaning.