Introducing Evaliphy: Testing RAG without the ML headache
If you're a QA engineer right now, you've probably been handed a shiny new "AI Chatbot" or "RAG pipeline" to test. And if you're like me, you immediately realized that all of your usual tools are suddenly useless.
In a normal web app, clicking a button always returns the same result. In a RAG (Retrieval-Augmented Generation) app, asking the exact same question twice might give you two completely different sentences. How are you supposed to write an `expect(response).toEqual(...)` for that?
When I looked around for tools to solve this, everything I found was built for Data Scientists. They were all Python notebooks throwing around terms like "Cosine Similarity," "Contextual Precision," and "Recall." I don't want to calculate the mathematical distance between two vectors. I just want to hit our API, read the JSON response, and write a test that fails if the bot hallucinates.
That’s why I built Evaliphy.
What is Evaliphy?
Evaliphy is a testing SDK built specifically for QA engineers to evaluate RAG applications (it's not limited to RAG; support for other AI systems is coming soon). It treats your AI completely as a black box. You don't need to instrument your vector database or understand how the embedding model works.
If you know how to write an API test in Playwright or Jest, you already know how to use Evaliphy.
Why build it for QAs?
The industry keeps trying to force QAs to learn machine learning just to test a chatbot. But QA engineers are actually the best people to test AI because we care about the observable behavior—the exact thing the user actually sees.
Evaliphy translates complex ML metrics into plain-English assertions. Instead of calculating "Answer Relevancy," you just write:
```ts
await expect(res).toBeRelevant();
await expect(res).toBeFaithful(context);
```
Under the hood, Evaliphy uses LLM-as-a-Judge (sending the data to a model like GPT-4o) to evaluate the meaning of the text, rather than doing a brittle string comparison. It scores the response and gives you a human-readable reason if it fails.
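To make the pattern concrete, here is a minimal sketch of how LLM-as-a-Judge works in general: build a rubric prompt, send it to a judge model, and parse a structured verdict with a score and a reason. This is the generic technique, not Evaliphy's internals; the function names and the JSON reply format are illustrative assumptions, and the actual model call is stubbed out.

```typescript
// Sketch of the LLM-as-a-Judge pattern (names are illustrative,
// not Evaliphy internals). In practice, the prompt below would be
// sent to a judge model such as GPT-4o and `raw` would be its reply.

interface Verdict {
  pass: boolean;
  score: number;  // 0..1, as returned by the judge
  reason: string; // human-readable explanation for the report
}

function buildJudgePrompt(question: string, answer: string, context: string): string {
  return [
    "You are an impartial judge. Score whether the ANSWER is faithful",
    "to the CONTEXT (no claims that the context does not support).",
    'Reply with JSON: {"score": <0..1>, "reason": "<one sentence>"}',
    `QUESTION: ${question}`,
    `CONTEXT: ${context}`,
    `ANSWER: ${answer}`,
  ].join("\n");
}

// Turn the judge's raw JSON reply into a pass/fail verdict.
function parseVerdict(raw: string, threshold = 0.7): Verdict {
  const { score, reason } = JSON.parse(raw);
  return { pass: score >= threshold, score, reason };
}

// In a real run, `prompt` would go to the judge model:
const prompt = buildJudgePrompt(
  "What is your refund window?",
  "We offer a 90-day refund window.",
  "Refunds are available within 30 days of purchase.",
);

// Example judge reply for a hallucinated answer:
const verdict = parseVerdict(
  '{"score": 0.2, "reason": "The answer claims 90 days; the context says 30."}',
);
// pass is false here, because 0.2 is below the 0.7 threshold,
// and `reason` is what surfaces in the failure report.
console.log(verdict.pass, verdict.reason);
```

The key point is that a failure carries a `reason`, not just a boolean, which is what makes the reports readable.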
Key Features
Here is what we focused on for the initial release:
- TypeScript Native: No Python required. Write your evals in the exact same language and repo as your frontend/backend tests.
- Black-Box First: You use a built-in `httpClient` to hit your real, running API endpoints. You test the actual system in motion.
- Human-Readable Reports: When a test fails, you don't just get a `false`. You get a plain-English explanation from the judge of exactly why the response was bad.
- CI/CD Ready: It runs in your terminal, exits with standard status codes, and plugs straight into GitHub Actions or GitLab CI.
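Because it exits with standard status codes, wiring it into CI is just another job step. As an illustration, a minimal GitHub Actions workflow might look like this (the workflow layout, job names, and Node setup are assumptions; only the `evaliphy eval` command comes from this post):

```yaml
# Hypothetical GitHub Actions workflow; names and structure are illustrative.
name: rag-evals
on: [pull_request]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      # A non-zero exit code fails the job, so a failing eval blocks the PR.
      - run: npx evaliphy eval
```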
Getting Started
You can pull it down and try it against your own API in a few minutes.
Install the CLI globally (or save it to your devDependencies):
```shell
npm install -g evaliphy
```
Initialize the config in your test directory:
```shell
evaliphy init
```
Write your first `.eval.ts` file, run `evaliphy eval`, and watch the assertions do the heavy lifting.
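For orientation, here is a sketch of what a first `.eval.ts` might look like, pieced together from the assertions and `httpClient` mentioned above. Treat it as pseudocode: the import path, the `httpClient` signature, the `evalTest` wrapper, and the response shape are all illustrative assumptions, not documented Evaliphy API.

```typescript
// Illustrative only: exact imports and signatures are guesses.
import { evalTest, expect, httpClient } from "evaliphy";

evalTest("support bot answers refund questions from the docs", async () => {
  // Hit the real, running API as a black box.
  const res = await httpClient.post("http://localhost:3000/chat", {
    question: "What is your refund window?",
  });

  // The retrieved passages the pipeline claims it answered from
  // (field name is a placeholder for whatever your API returns).
  const context = res.body.retrievedChunks;

  // LLM-as-a-Judge assertions from the post: relevance to the
  // question, and faithfulness to the retrieved context.
  await expect(res).toBeRelevant();
  await expect(res).toBeFaithful(context);
});
```

The point of the shape is familiarity: it reads like any Playwright or Jest API test, with the judge doing the semantic comparison instead of a string match.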
I'm going to be posting more deep dives here soon about how to handle things like testing conversational memory, writing custom rubrics, and setting up your golden datasets.
If you're tired of flaky string-matching tests for your AI features, give it a shot.