Assertions

Evaliphy provides a fluent, chainable assertion API designed for black-box QA testing of Generative AI. Assertions use an LLM as a judge to evaluate the quality and correctness of your RAG system's outputs.

How it works

When you call an assertion, Evaliphy uses an LLM-as-a-Judge workflow:

1. Data Submission: The data provided in expect() (query, response, context, etc.) is sent to a separate, highly capable LLM judge (e.g., GPT-4o).
2. Scoring: The judge evaluates the input against the rubric for that assertion and returns a numeric score between 0.0 and 1.0.
3. Threshold Comparison: The score is compared against a threshold (0.7 by default).
   - If score >= threshold, the assertion passes.
   - If score < threshold, the assertion fails and reports the judge's reasoning.
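The threshold comparison in the last step can be pictured with a short sketch. The JudgeResult and Verdict types and the toVerdict function are hypothetical names used only for illustration; they are not part of Evaliphy's API, and the real judge call is asynchronous and not modeled here.

```typescript
// Hypothetical shape of what the judge returns; not Evaliphy's actual type.
interface JudgeResult {
  score: number;     // numeric score between 0.0 and 1.0
  reasoning: string; // the judge's explanation for the score
}

// Hypothetical pass/fail outcome derived from a JudgeResult.
interface Verdict {
  passed: boolean;
  message: string;
}

// Compare the judge's score against the threshold (0.7 by default, as stated above).
function toVerdict(result: JudgeResult, threshold: number = 0.7): Verdict {
  if (result.score < 0 || result.score > 1) {
    throw new RangeError(`score out of range: ${result.score}`);
  }
  const passed = result.score >= threshold;
  return {
    passed,
    message: passed
      ? `passed (${result.score} >= ${threshold})`
      : `failed (${result.score} < ${threshold}): ${result.reasoning}`,
  };
}

// A score equal to the threshold passes; a lower score fails with the reasoning attached.
const okVerdict = toVerdict({ score: 0.7, reasoning: "All claims are supported." });
const badVerdict = toVerdict({ score: 0.42, reasoning: "Second claim is not in the context." });
console.log(okVerdict.message);
console.log(badVerdict.message);
```

Note that the comparison is inclusive, so a score that exactly equals the threshold counts as a pass.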

The `expect` function

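The expect function is the entry point for all assertions. It accepts either a plain response string or a full EvaluationSample object. Because a plain string carries only the response, it fits assertions whose only required input is response, such as toBeCoherent() and toBeHarmless().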

import { expect } from 'evaliphy';

// Using a full EvaluationSample (Recommended)
await expect({
  query: "What is the return policy?",
  response: "You can return items within 30 days.",
  context: "Returns are accepted within 30 days of purchase."
}).toBeFaithful();

// Using a simple response string (for assertions that only require a response)
await expect("The capital of France is Paris").toBeCoherent();
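As a rough illustration of how the two input forms relate, a plain string can be thought of as a sample that has only a response. The sketch below is an assumption about the shape of the data, not Evaliphy's implementation: the EvaluationSample interface here is a local stand-in whose field names mirror the example above, and toSample is a hypothetical helper.

```typescript
// Local stand-in type: field names mirror the example above,
// but which fields are optional is an assumption.
interface EvaluationSample {
  response: string;
  query?: string;
  context?: string;
}

// Hypothetical helper: wrap a bare string as a sample that carries only a response.
function toSample(input: string | EvaluationSample): EvaluationSample {
  return typeof input === "string" ? { response: input } : { ...input };
}

const fromString = toSample("The capital of France is Paris");
const fromObject = toSample({
  query: "What is the return policy?",
  response: "You can return items within 30 days.",
});

console.log(fromString.query === undefined); // a bare string has no query
console.log(fromObject.query);
```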

Core Assertions

Each core assertion is powered by a specialized prompt that instructs the LLM judge on how to evaluate the response.

`toBeFaithful()`

Summary: Measures whether every claim in the response is grounded in the retrieved context. A response is unfaithful if it introduces information not present in the context, even if that information is factually correct.

Required Input: response, context, query
Judge Prompt: toBeFaithful.md

await expect({
  query: "...",
  response: "...",
  context: "..."
}).toBeFaithful();

`toBeRelevant()`

Summary: Checks if the response directly addresses the user's query without dodging, being overly vague, or talking about unrelated topics.

Required Input: response, query
Judge Prompt: toBeRelevant.md

await expect({
  query: "...",
  response: "..."
}).toBeRelevant();

`toBeGrounded()`

Summary: Similar to faithfulness, but focuses strictly on whether the claims made in the response are supported by the retrieved context, regardless of the original query.

Required Input: response, context
Judge Prompt: toBeGrounded.md

await expect({
  response: "...",
  context: "..."
}).toBeGrounded();

`toBeCoherent()`

Summary: Evaluates the logical flow, structure, and clarity of the response. It ensures the output is easy to read and follows a natural progression of thought.

Required Input: response
Judge Prompt: toBeCoherent.md

await expect("...").toBeCoherent();

`toBeHarmless()`

Summary: Scans the response for toxicity, bias, hate speech, or dangerous instructions. It acts as a safety guardrail against harmful content in your system's output.

Required Input: response
Judge Prompt: toBeHarmless.md

await expect("...").toBeHarmless();
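The required inputs listed for the five core assertions above can be collected into a single lookup table, which makes it easy to check a sample before sending it to the judge. This is a sketch under assumptions: REQUIRED_INPUTS and missingInputs are hypothetical names, and whether Evaliphy performs such a check itself is not stated here.

```typescript
type SampleField = "query" | "response" | "context";
type Sample = Partial<Record<SampleField, string>>;

type MatcherName =
  | "toBeFaithful"
  | "toBeRelevant"
  | "toBeGrounded"
  | "toBeCoherent"
  | "toBeHarmless";

// Required inputs per matcher, copied from the sections above.
const REQUIRED_INPUTS: Record<MatcherName, readonly SampleField[]> = {
  toBeFaithful: ["response", "context", "query"],
  toBeRelevant: ["response", "query"],
  toBeGrounded: ["response", "context"],
  toBeCoherent: ["response"],
  toBeHarmless: ["response"],
};

// Hypothetical helper: list the required fields that are missing or blank.
function missingInputs(matcher: MatcherName, sample: Sample): SampleField[] {
  return REQUIRED_INPUTS[matcher].filter((field) => {
    const value = sample[field];
    return value === undefined || value.trim() === "";
  });
}

const partialSample: Sample = {
  query: "What is the return policy?",
  response: "You can return items within 30 days.",
};

console.log(missingInputs("toBeRelevant", partialSample)); // nothing missing
console.log(missingInputs("toBeFaithful", partialSample)); // context is missing
```

A check like this is most useful when samples are built programmatically, where a missing context would otherwise only surface as a confusing judge score.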

Thresholds

Every LLM-based assertion returns a score between 0.0 and 1.0.

Pass/Fail: An assertion passes if the score is greater than or equal to the threshold.
Default Thresholds: Each matcher has a default threshold defined in your evaliphy.config.ts.
Per-Assertion Override: You can override the threshold for a specific call.

await expect(input).toBeFaithful({
  threshold: 0.9, // Require a very high faithfulness score to pass
});
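The precedence described above (a per-call override wins over the matcher's configured default) can be sketched as follows. The Thresholds and CallOptions types and the resolveThreshold function are hypothetical; the actual key names in evaliphy.config.ts may differ, and the 0.7 fallback is taken from the default mentioned under "How it works".

```typescript
// Hypothetical shapes; the real config keys in evaliphy.config.ts may differ.
type Thresholds = Record<string, number | undefined>;

interface CallOptions {
  threshold?: number;
}

// Fallback used when neither the call nor the config supplies a value.
const FALLBACK_THRESHOLD = 0.7;

// Precedence: per-call override, then the matcher's configured default, then the fallback.
function resolveThreshold(
  matcher: string,
  configured: Thresholds,
  options: CallOptions = {},
): number {
  const value = options.threshold ?? configured[matcher] ?? FALLBACK_THRESHOLD;
  if (value < 0 || value > 1) {
    throw new RangeError(`threshold out of range: ${value}`);
  }
  return value;
}

const configuredDefaults: Thresholds = { toBeFaithful: 0.8 };

console.log(resolveThreshold("toBeFaithful", configuredDefaults, { threshold: 0.9 })); // per-call override
console.log(resolveThreshold("toBeFaithful", configuredDefaults)); // configured default
console.log(resolveThreshold("toBeRelevant", configuredDefaults)); // fallback
```

Using ?? rather than || matters here: a threshold of 0 is a legitimate (if permissive) value and should not be silently replaced by the default.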

Soft vs Hard Assertions

Evaliphy supports both "soft" and "hard" assertion behaviors, controlled by the continueOnFailure option.

Soft Assertions (Default)

By default, Evaliphy uses soft assertions. If an assertion fails, it is recorded in the test report, but the execution continues. This allows you to see all failures in a single run.

// Execution continues even if this fails
await expect(res).toBeFaithful();
await expect(res).toBeRelevant();

Hard Assertions

If you want the test to stop immediately upon failure, you can set continueOnFailure: false. This is useful for critical checks where subsequent assertions wouldn't make sense if the first one fails.

await expect(res).toBeHarmless({ continueOnFailure: false });
// This line will NOT execute if toBeHarmless fails
await expect(res).toBeFaithful();

You can configure this globally in evaliphy.config.ts:

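// Assumption: defineConfig is exported from the same 'evaliphy' package as expect
import { defineConfig } from 'evaliphy';

export default defineConfig({
  llmAsJudgeConfig: {
    continueOnFailure: false, // Make all assertions "hard" by default
  }
});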
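To make the difference between the two modes concrete, here is a small simulation of the control flow. It is a sketch only: RunState and recordResult are hypothetical names, the checks are synchronous with hard-coded outcomes instead of real judge calls, and the way Evaliphy records failures internally may differ.

```typescript
// Hypothetical run state: failures collected so far plus the global default mode.
interface RunState {
  failures: string[];         // labels of every failed assertion so far
  continueOnFailure: boolean; // global default; true means "soft"
}

// Record one assertion outcome. A soft failure is only logged;
// a hard failure is logged and then stops the run by throwing.
function recordResult(
  state: RunState,
  label: string,
  passed: boolean,
  options: { continueOnFailure?: boolean } = {},
): void {
  if (passed) return;
  state.failures.push(label);
  const soft = options.continueOnFailure ?? state.continueOnFailure;
  if (!soft) {
    throw new Error(`hard assertion failed: ${label}`);
  }
}

// Soft run: both failures are recorded and execution reaches the end.
const softRun: RunState = { failures: [], continueOnFailure: true };
recordResult(softRun, "toBeFaithful", false);
recordResult(softRun, "toBeRelevant", false);

// Hard check inside a soft run: the first failure throws,
// so the second check never executes.
const hardRun: RunState = { failures: [], continueOnFailure: true };
try {
  recordResult(hardRun, "toBeHarmless", false, { continueOnFailure: false });
  recordResult(hardRun, "toBeFaithful", false);
} catch (err) {
  console.log((err as Error).message);
}

console.log(softRun.failures.length, hardRun.failures.length); // two soft failures, one hard failure
```

The per-call option takes precedence over the global default in this sketch, which matches the way the per-call continueOnFailure: false example above overrides the soft default.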

Negation

You can negate any assertion with the .not property. A negated assertion passes when the original assertion would fail.

await expect(response).not.toBeHarmless();