Metrics and Mathematical Foundations
====================================

This section describes the mathematical logic behind the framework's core scoring methods.

Semantic search in the strict semantic-match bot
------------------------------------------------

The strict semantic-match bot uses a lightweight lexical proxy for semantic retrieval. For each question, the system constructs a token set after lower-casing and stripping light punctuation.

Let

.. math::

   T(q) = 	\text{token set of question } q

For an input question :math:`q` and a candidate FAQ question :math:`q_i`, the system builds binary vectors over the union vocabulary

.. math::

   V = T(q) \cup T(q_i)

and then assigns

.. math::

   x_j = \begin{cases}
   1 & \text{if token } v_j \in T(q) \\
   0 & \text{otherwise}
   \end{cases}
   \qquad
   y_j = \begin{cases}
   1 & \text{if token } v_j \in T(q_i) \\
   0 & \text{otherwise}
   \end{cases}

The similarity score is cosine similarity:

.. math::

   \cos(\theta) = \frac{x \cdot y}{\lVert x \rVert_2 \lVert y \rVert_2}

where :math:`x \cdot y` is the dot product and :math:`\lVert x \rVert_2` is the Euclidean norm. The bot selects the FAQ row with the maximum cosine similarity and returns its stored answer.

This method is computationally simple and highly interpretable. It does **not** learn a semantic embedding space, but it approximates semantic relatedness when paraphrases share important vocabulary.

Full-context prompting
----------------------

The full-context bot does not retrieve a single FAQ row. Instead it concatenates the entire domain-knowledge document into one prompt template and asks a generative model to answer.

Formally, the model receives a prompt

.. math::

   p = f(k, q)

where :math:`k` is the domain-knowledge text, :math:`q` is the user question, and :math:`f` is the prompt template function. The model then samples or decodes an answer

.. math::

   a = M(p)

where :math:`M` is the configured LLM backend. This approach trades retrieval precision for contextual completeness.

Deterministic metrics
---------------------

Exact match
^^^^^^^^^^^

Exact match is a binary indicator after normalization. If :math:`\hat{a}` is the generated answer and :math:`a^*` is the expected answer, then

.. math::

   \operatorname{EM}(\hat{a}, a^*) =
   \begin{cases}
   1 & \text{if } N(\hat{a}) = N(a^*) \\
   0 & \text{otherwise}
   \end{cases}

where :math:`N(\cdot)` lower-cases text, trims leading and trailing spaces, and collapses repeated whitespace.

Keyword recall
^^^^^^^^^^^^^^

The framework computes a token-set recall over the expected answer. Let :math:`E` be the token set of the expected answer and :math:`G` be the token set of the generated answer. Then

.. math::

   \operatorname{Recall}(G, E) = \frac{|G \cap E|}{|E|}

when :math:`|E| > 0`, and :math:`0` otherwise.

This score rewards answers that preserve expected factual content, even when the wording is not identical.

Answer length
^^^^^^^^^^^^^

Answer length is reported as a simple scalar

.. math::

   L(\hat{a}) = \text{number of characters in } \hat{a}

This metric is not a quality score by itself. Instead it acts as a communication descriptor that can reveal truncation, verbosity, or excessively terse answers.

Politeness heuristic
^^^^^^^^^^^^^^^^^^^^

The politeness metric is a lightweight communication heuristic. Let :math:`m_1, \dots, m_k` be a small set of politeness markers such as “please” or “happy to help”. The metric counts how many markers appear in the answer, then scales and clips the result:

.. math::

   P(\hat{a}) = \min\left(\frac{1}{2} \sum_{i=1}^{k} \mathbf{1}[m_i \subseteq \hat{a}], 1\right)

This metric is intentionally simple. It provides a rough communication-quality signal, not a linguistic model of tone.

Latency
^^^^^^^

The evaluator records end-to-end latency for one bot call:

.. math::

   T = t_{\text{end}} - t_{\text{start}}

The result is reported in milliseconds. This is an operational-performance metric rather than a semantic-quality metric.

LLM-as-a-judge metrics
----------------------

Some evaluation properties are difficult to encode as a deterministic formula. Relevance, faithfulness, safety, and robustness often require holistic judgment over the relation between:

* the input question,
* the expected answer, and
* the generated answer.

For those cases, the framework constructs a judge prompt:

.. math::

   j = g(q, a^*, \hat{a})

and asks a judge model to return structured JSON containing a score and a rationale. If the judge returns

.. code-block:: json

   {"score": s, "reason": r}

then the metric score is simply :math:`s`, while :math:`r` is stored as supporting detail.

This turns qualitative review into a repeatable programmatic step. The output is still model-based and therefore not purely objective, but it is auditable because the framework can persist the judge output and, when available, the judge reasoning trace.

Interpreting metric families together
-------------------------------------

No single metric is sufficient. A business-facing evaluation should inspect several axes at once:

* **Correctness:** exact match and keyword recall.
* **Communication:** answer length and politeness.
* **Operations:** latency.
* **Semantic review:** judge scores for relevance, faithfulness, safety, and robustness.

The framework therefore saves each metric at row level so that teams can diagnose disagreements between metric families. For example, a response may score low on exact match but high on judge-based relevance if it uses a paraphrase that preserves meaning.