Evaliphy is currently in beta. It is not recommended for production use yet. Please try it out and share your feedback.

Evaliphy makes it simple for QA engineers to test RAG pipelines.

Evaliphy is an evaluation SDK built for QA engineers — not ML researchers, not AI specialists. If you've written end-to-end tests before, picking this up will feel familiar. Built-in judges, real API testing, and CI-ready reports. No ML background needed, no prompt wrangling, no glue code.

Read the Docs

If you can write a test, you can evaluate AI. It's as simple as writing Playwright tests for your UI.

Stop fighting with Python notebooks, complex ML metrics, and brittle API calls. Evaliphy gives you a fluent, type-safe API to test RAG pipelines as black boxes.

return-policy.eval.ts
import { evaluate, expect } from 'evaliphy';

const sample = {
  query: "What is the return policy?",
  expectedContext: "Items can be returned within 30 days."
};

evaluate("Return Policy Chat", async ({ httpClient }) => {
  // 1. Hit your RAG endpoint
  const res = await httpClient.post('/api/chat', { message: sample.query });
  const data = await res.json();

  // 2. Assert in plain English
  await expect({
    query: sample.query,
    response: data.answer,
    context: data.retrieved_chunks
  }).toBeFaithful();
  await expect(data.answer).toBeRelevant();
});
Evaluation Report
[Screenshot: Evaliphy evaluation report]

Built for Quality Engineers, not Data Scientists.

Understandable Metrics

Forget "Contextual Precision" and "Cosine Similarity." Assert against what actually matters: toBeFaithful(), toBeRelevant(), and toBeGrounded().
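To build intuition for what a matcher like toBeGrounded() is asking, here is a deliberately naive, self-contained sketch. Evaliphy's real matchers use an LLM judge, not string matching; the EvalSample shape and the naiveIsGrounded helper are inventions for this illustration only.

```typescript
// Toy illustration only: Evaliphy's real matchers use an LLM judge,
// not string matching. This sketch just shows the *question* a
// grounding check asks about your data.
interface EvalSample {
  query: string;
  response: string;
  context: string[];
}

// "Grounded": does every claim in the response come from the context?
// (Naive stand-in: check that each response sentence appears in the context.)
function naiveIsGrounded(sample: EvalSample): boolean {
  const contextText = sample.context.join(" ").toLowerCase();
  return sample.response
    .toLowerCase()
    .split(/[.!?]/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0)
    .every((sentence) => contextText.includes(sentence));
}

const grounded: EvalSample = {
  query: "What is the return policy?",
  response: "items can be returned within 30 days",
  context: ["Items can be returned within 30 days of purchase."],
};

console.log(naiveIsGrounded(grounded)); // true: the response appears in the context
```

An LLM judge replaces the substring check with a semantic one, so paraphrased but faithful answers still pass.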

Explicit, Traceable Data Flow

No magic background context. Pass your golden data, CSV rows, or database records directly into the assertions so you always know exactly what is being tested.
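For example, golden data from a CSV export can be parsed into plain objects and fed into assertions one row at a time. The column names ("query", "expected_context") and the parseGoldenCsv helper below are made up for this sketch; use whatever your golden dataset actually contains.

```typescript
// Minimal sketch: turning golden CSV rows into eval samples.
// Column names are illustrative, not an Evaliphy convention.
interface GoldenSample {
  query: string;
  expectedContext: string;
}

// Tiny parser for well-behaved, comma-free fields; reach for a real
// CSV library (e.g. csv-parse) for production data.
function parseGoldenCsv(csv: string): GoldenSample[] {
  const [header, ...rows] = csv.trim().split("\n");
  const cols = header.split(",");
  return rows.map((row) => {
    const values = row.split(",");
    const record = Object.fromEntries(cols.map((c, i) => [c, values[i]]));
    return { query: record["query"], expectedContext: record["expected_context"] };
  });
}

const samples = parseGoldenCsv(
  `query,expected_context
What is the return policy?,Items can be returned within 30 days.
Do you ship internationally?,We ship to over 40 countries.`
);

console.log(samples.length); // 2
```

Each parsed sample can then drive an evaluate block exactly like the one shown above, so the data under test is always visible in the test file.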

Battle-Tested Prompts

We spent hundreds of hours benchmarking LLM-as-a-judge prompts so you don't have to. Just provide your API key, and Evaliphy handles the prompting, parsing, and retry logic.

Runs Where You Run

It’s just Node.js. Run your RAG evaluations in GitHub Actions, GitLab CI, or Jenkins using the standard npx evaliphy run command.

Two-Phase Architecture. Infinite Reliability.

01

Configure Once

Set your LLM judge models (e.g., gpt-4o-mini) and confidence thresholds globally in evaliphy.config.ts.
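A config file along these lines is what step 01 describes. This is an illustrative sketch only; the actual option names may differ, so consult the docs before copying it.

```typescript
// evaliphy.config.ts — illustrative sketch; option names are assumptions,
// not the documented Evaliphy config schema.
export default {
  judge: {
    model: "gpt-4o-mini",        // LLM used as the judge
    apiKeyEnv: "OPENAI_API_KEY", // assumed: key read from this env var
  },
  thresholds: {
    faithfulness: 0.8, // assumed: minimum judge confidence to pass
    relevance: 0.7,
  },
};
```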

02

Collect & Execute

Evaliphy builds a deterministic test tree, then executes your HTTP calls and RAG pipelines in parallel.

03

Evaluate & Report

The built-in LLM judge evaluates the responses against your assertions and returns human-readable failure reasons.
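The collect-then-execute pattern behind steps 02 and 03 can be sketched in a few lines of plain TypeScript. This is the general pattern, not Evaliphy's internals; the evaluate and runAll functions here are stand-ins.

```typescript
// Generic collect-then-execute pattern (not Evaliphy's internals):
// phase 1 registers tests into a deterministic list, phase 2 runs
// them in parallel and gathers human-readable results.
type TestFn = () => Promise<string>;

const collected: { name: string; fn: TestFn }[] = [];

// Phase 1: collecting. Calling evaluate() only records the test.
function evaluate(name: string, fn: TestFn): void {
  collected.push({ name, fn });
}

// Phase 2: execution. All recorded tests run concurrently.
async function runAll(): Promise<{ name: string; result: string }[]> {
  return Promise.all(
    collected.map(async ({ name, fn }) => ({ name, result: await fn() }))
  );
}

evaluate("Return Policy Chat", async () => "pass");
evaluate("Shipping Chat", async () => "fail: answer not grounded in context");

runAll().then((results) => console.log(results));
```

Because collection is separate from execution, the test tree is identical on every run even though the HTTP calls happen in parallel.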

Join the Beta Program

We are currently in open beta. We’re looking for QA teams and software engineers building RAG applications to help us refine the API and expand our matcher library.

  • Free for commercial use during Beta
  • Direct access to the core engineering team
  • Influence the v1.0 roadmap
Star on GitHub

Ready to test your RAG pipeline?

Start evaluating your AI in under 5 minutes.

$ npm install -g evaliphy
$ npx evaliphy init