How Evaliphy Works
Evaliphy is designed to bring the rigor of traditional software testing to the non-deterministic world of Retrieval-Augmented Generation (RAG) applications. It operates as a bridge between your running application and a highly capable LLM "Judge" that evaluates the quality of your system's outputs.
The Core Mechanism
At its heart, Evaliphy follows a simple but powerful workflow: Trigger → Execute → Judge → Report.

1. Trigger (The Evaluation File)
You write evaluations in TypeScript using the evaluate function. These files sit in your repository alongside your application code. When you run evaliphy eval, the CLI discovers these files and begins execution.
2. Execute (The HTTP Client)
Inside your evaluation, you use the built-in httpClient to make real requests to your running RAG API. This ensures you are testing the actual system that your users interact with, including all its prompts, retrieval logic, and infrastructure.
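Putting steps 1 and 2 together, an evaluation file might look like the sketch below. The evaluate and httpClient names come from Evaliphy itself; the /api/chat endpoint, the response shape, and the local stand-in implementations (which let the sketch run without the real runtime) are assumptions for illustration.

```typescript
// Sketch of an .eval.ts file. evaluate and httpClient are Evaliphy names;
// the endpoint, response shape, and these stand-ins are hypothetical.
type RagResponse = { answer: string; context: string[] };

// Stand-in for Evaliphy's built-in httpClient fixture. The real fixture
// sends an actual HTTP request to your running RAG API; this one fakes it.
const httpClient = {
  post: async (_path: string, _body: { query: string }): Promise<RagResponse> => ({
    answer: "Returns are accepted within 30 days of purchase.",
    context: ["Store policy: customers may return items within 30 days."],
  }),
};

// Stand-in for evaluate(): runs one evaluation and reports its outcome.
async function evaluate(name: string, fn: () => Promise<void>): Promise<string> {
  await fn();
  return `PASS ${name}`;
}

// The evaluation itself: trigger (evaluate) + execute (httpClient).
const result = evaluate("answers refund questions", async () => {
  const res = await httpClient.post("/api/chat", { query: "What is the refund policy?" });
  if (!res.answer.includes("30 days")) {
    throw new Error("policy window missing from answer");
  }
});
```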
3. Judge (LLM-as-a-Judge)
When you call an assertion like expect(res).toBeFaithful(), Evaliphy:
- Collects the necessary data (Query, Response, and Context).
- Selects the appropriate Judge Prompt for that assertion.
- Sends the data and prompt to your configured Judge Model (e.g., GPT-4o).
- Receives a numeric score (0.0 to 1.0) and a plain-English reason for that score from the Judge.
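The judge step can be sketched as two pure functions: one that renders a faithfulness-style prompt from the collected Query, Response, and Context, and one that parses the model's reply into a score and reason. The prompt wording and the JSON reply format are assumptions, not Evaliphy's actual judge contract.

```typescript
// Hypothetical judge plumbing: prompt rendering and verdict parsing.
type JudgeInput = { query: string; response: string; context: string[] };
type Verdict = { score: number; reason: string };

// Fill a faithfulness-style judge prompt with the collected data.
function renderJudgePrompt(input: JudgeInput): string {
  return [
    "Rate from 0.0 to 1.0 how faithful the RESPONSE is to the CONTEXT.",
    `QUERY: ${input.query}`,
    `RESPONSE: ${input.response}`,
    `CONTEXT: ${input.context.join(" | ")}`,
    'Reply as JSON: {"score": <number>, "reason": "<string>"}',
  ].join("\n");
}

// Parse the judge model's raw reply, clamping the score into [0.0, 1.0].
function parseVerdict(raw: string): Verdict {
  const parsed = JSON.parse(raw) as Verdict;
  const score = Math.min(1, Math.max(0, parsed.score));
  return { score, reason: parsed.reason };
}
```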
4. Report (Feedback Loop)
Evaliphy compares the judge's score against your defined threshold.
- Pass: If the score meets the threshold, the test goes green.
- Fail: If the score falls below the threshold, Evaliphy surfaces the judge's reasoning, helping you understand exactly why the response was considered poor quality.
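The pass/fail comparison can be sketched as a single function; the 0.8 default threshold and the result shape are illustrative assumptions, not Evaliphy's documented defaults.

```typescript
// Hypothetical Report step: score vs. threshold, surfacing the judge's reason.
type Verdict = { score: number; reason: string };
type Report = { pass: boolean; detail: string };

function report(verdict: Verdict, threshold = 0.8): Report {
  if (verdict.score >= threshold) {
    return { pass: true, detail: `score ${verdict.score} >= threshold ${threshold}` };
  }
  // On failure, include the judge's reasoning so you can see why it fell short.
  return { pass: false, detail: `score ${verdict.score} < ${threshold}: ${verdict.reason}` };
}
```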
Architecture Layers
Evaliphy is built on three distinct layers to ensure stability and performance:
Collection Layer
The CLI scans your project for .eval.ts files and builds the test tree synchronously. Because this happens before any I/O or LLM calls, structural errors in your tests surface immediately, before anything expensive runs.
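A minimal sketch of the discovery rule, assuming matching is done purely on the .eval.ts suffix (the real scanner walks the filesystem; this pure function shows only the filter):

```typescript
// Hypothetical Collection Layer filter: keep only .eval.ts files, in a
// stable order, so the test tree is deterministic across runs.
function collectEvalFiles(paths: string[]): string[] {
  return paths.filter((p) => p.endsWith(".eval.ts")).sort();
}
```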
Execution Layer
The runner iterates through the test tree. It manages fixtures (like httpClient), handles setup/teardown hooks, and executes your test logic asynchronously.
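A minimal sketch of such a runner, assuming hypothetical beforeEach/afterEach hook names (Evaliphy's real runner also injects fixtures like httpClient, which is omitted here):

```typescript
// Hypothetical Execution Layer: run each test asynchronously, with
// setup/teardown hooks around every test body.
type EvalTest = { name: string; fn: () => Promise<void> };
type Hooks = { beforeEach?: () => Promise<void>; afterEach?: () => Promise<void> };

async function runTests(tests: EvalTest[], hooks: Hooks = {}): Promise<Map<string, boolean>> {
  const results = new Map<string, boolean>();
  for (const test of tests) {
    await hooks.beforeEach?.();
    try {
      await test.fn();
      results.set(test.name, true);
    } catch {
      results.set(test.name, false);
    } finally {
      // Teardown always runs, even when the test body throws.
      await hooks.afterEach?.();
    }
  }
  return results;
}
```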
Evaluation Layer
The expect engine orchestrates the LLM judge. It handles prompt rendering, manages API calls to the LLM provider, and parses the results into structured data for the reporters.
Why This Approach?
By treating your RAG system as a black box, Evaliphy avoids the complexity of mocking internal vector databases or embedding models. Instead, it focuses on the observable behavior—the only thing that actually matters to your end users.