Evaliphy is an evaluation SDK built for QA engineers — not ML researchers, not AI specialists. If you've written end-to-end tests before, picking this up will feel familiar. Built-in judges, real API testing, and CI-ready reports. No ML background needed, no prompt wrangling, no glue code.
Stop fighting with Python notebooks, complex ML metrics, and brittle API calls. Evaliphy gives you a fluent, type-safe API to test RAG pipelines as black boxes.
```typescript
import { evaluate, expect } from 'evaliphy';

const sample = {
  query: "What is the return policy?",
  expectedContext: "Items can be returned within 30 days."
};

evaluate("Return Policy Chat", async ({ httpClient }) => {
  // 1. Hit your RAG endpoint
  const res = await httpClient.post('/api/chat', { message: sample.query });
  const data = await res.json();

  // 2. Assert in plain English
  await expect({
    query: sample.query,
    response: data.answer,
    context: data.retrieved_chunks
  }).toBeFaithful();

  await expect(data.answer).toBeRelevant();
});
```
Forget "Contextual Precision" and "Cosine Similarity." Assert against what actually matters: `toBeFaithful()`, `toBeRelevant()`, and `toBeGrounded()`.
No magic background context. Pass your golden data, CSV rows, or database records directly into the assertions so you always know exactly what is being tested.
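As a minimal sketch of that idea — the CSV column names and the `parseGoldenCsv` helper below are illustrative assumptions, not part of Evaliphy's API — golden data can be loaded from a CSV and mapped into the same sample shape used in the example above:

```typescript
// Illustrative only: a tiny CSV-to-sample loader. The column names
// (query, expectedContext) and this helper are assumptions, not Evaliphy API.
interface GoldenSample {
  query: string;
  expectedContext: string;
}

function parseGoldenCsv(csv: string): GoldenSample[] {
  const [header, ...rows] = csv.trim().split('\n');
  const cols = header.split(',').map((c) => c.trim());
  return rows.map((row) => {
    const cells = row.split(',');
    // Zip header columns with row cells into a plain record
    const record = Object.fromEntries(
      cols.map((c, i) => [c, (cells[i] ?? '').trim()])
    );
    return { query: record['query'], expectedContext: record['expectedContext'] };
  });
}

const samples = parseGoldenCsv(
  `query,expectedContext
What is the return policy?,Items can be returned within 30 days.`
);

// Each sample can then be passed straight into an expect(...) assertion,
// so the data under test is always visible in the test file itself.
console.log(samples[0].query); // "What is the return policy?"
```

Because the samples are plain objects, the same loader pattern works for database rows or inline fixtures.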
We spent hundreds of hours benchmarking LLM-as-a-judge prompts so you don't have to. Just provide your API key, and Evaliphy handles the prompting, parsing, and retry logic.
It’s just Node.js. Run your RAG evaluations in GitHub Actions, GitLab CI, or Jenkins using the standard `npx evaliphy run` command.
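A GitHub Actions workflow along these lines is one way to wire that up — only `npx evaliphy run` comes from Evaliphy itself; the workflow name, Node version, and secret name are assumptions for illustration:

```yaml
# .github/workflows/eval.yml — illustrative sketch; the Node version and
# secret name are assumptions. Only `npx evaliphy run` is Evaliphy's CLI.
name: RAG evaluations
on: [pull_request]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx evaliphy run
        env:
          # The judge needs an LLM API key at runtime
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

The same two commands (`npm ci` then `npx evaliphy run`) translate directly to GitLab CI or Jenkins pipeline steps.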
Set your LLM judge models (e.g., gpt-4o-mini) and confidence thresholds globally in evaliphy.config.ts.
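A config along these lines is one possible shape — the option names (`judgeModel`, `thresholds`) are assumptions for illustration, not documented Evaliphy options:

```typescript
// evaliphy.config.ts — illustrative sketch; option names are assumptions.
export default {
  judgeModel: 'gpt-4o-mini',  // LLM used by the built-in judges
  thresholds: {
    faithfulness: 0.8,        // minimum judge confidence for toBeFaithful()
    relevance: 0.7,           // minimum judge confidence for toBeRelevant()
  },
};
```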
Evaliphy builds a deterministic test tree, then executes your HTTP calls and RAG pipelines in parallel.
The built-in LLM judge evaluates the responses against your assertions and returns human-readable failure reasons.
We are currently in open beta. We’re looking for QA teams and software engineers building RAG applications to help us refine the API and expand our matcher library.
Start evaluating your AI in under 5 minutes.