Introduction
Evaliphy simplifies end-to-end AI testing by treating your AI system as a black box.
If you already test APIs with assertions, Evaliphy will feel familiar. You write assertions, run tests in CI, and get human-readable reports.
No Python notebooks. No research stack. No vendor lock-in.
Works with any AI system: RAG, agents, chatbots, summarizers, and content generation flows.
Why Evaliphy Exists
Most AI evaluation tools are excellent for research and model optimization. But production teams need a simpler workflow they can run as part of shipping.
The Real Problem
You have:
- ✅ AI features in production
- ✅ A CI/CD pipeline
- ✅ Existing test practices for APIs and apps
- ❌ No consistent way to test AI behavior before release
Existing tools ask you to:
- Leave your normal engineering workflow
- Learn ML-heavy setup patterns
- Interpret metrics-heavy outputs that are hard to act on quickly
The Evaliphy Approach
Evaliphy lets you:
- Test your real API over HTTP
- Write assertions with a familiar mental model
- Run tests in CI/CD like your existing suites
- Review clear, human-readable reports
Think of it like this:
- Playwright tests UI behavior with assertions
- Evaliphy tests AI behavior with assertions
Same philosophy. Different target.
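To make the analogy concrete, here is a minimal TypeScript sketch of a black-box assertion. It is not Evaliphy's API: `AskFn`, `expectAnswer`, and the stubbed `ask` function are hypothetical names invented for illustration. The stub stands in for an HTTP call to your real endpoint, and a simple phrase check stands in for whatever assertion logic the tool actually applies.

```typescript
// Illustrative sketch only: these names are hypothetical, not Evaliphy's API.

// A black-box AI system: a question goes in, an answer comes out.
// In a real suite this would wrap an HTTP call to your deployed endpoint.
type AskFn = (question: string) => Promise<string>;

interface AssertionResult {
  passed: boolean;
  reason: string; // human-readable explanation of the outcome
}

// Checks that the answer mentions every required phrase (case-insensitive).
function expectAnswer(answer: string, requiredPhrases: string[]): AssertionResult {
  const lower = answer.toLowerCase();
  const missing = requiredPhrases.filter((p) => !lower.includes(p.toLowerCase()));
  return missing.length === 0
    ? { passed: true, reason: "Answer mentions all required phrases." }
    : { passed: false, reason: `Answer is missing: ${missing.join(", ")}` };
}

// Stub standing in for the system under test (assumption: no network here).
const ask: AskFn = async (question) =>
  question.includes("refund")
    ? "You can request a refund within 30 days of purchase."
    : "Sorry, I do not know.";

async function main(): Promise<void> {
  const answer = await ask("What is your refund policy?");
  const result = expectAnswer(answer, ["refund", "30 days"]);
  console.log(result.passed ? "PASS" : "FAIL", "-", result.reason);
}

main();
```

The test never looks inside the system: it only sends an input and asserts on the output, which is the same shape as an API or UI test.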
The Four Pillars
- Familiar mental model: assertions that feel like your API tests.
- No vendor lock-in: open source, your data, your results.
- No ML overhead: no notebooks, no fine-tuning setup, no research infrastructure.
- Human-readable reports: pass/fail with reasoning teams can act on quickly.
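As a rough illustration of the last pillar, the sketch below shows one way pass/fail results with reasoning could be rendered as text and mapped to a CI exit code. The `TestOutcome` type, `formatReport`, `exitCode`, and the sample data are all assumptions made for this example; they do not describe Evaliphy's actual report format.

```typescript
// Illustrative sketch only: hypothetical types, not Evaliphy's report format.
interface TestOutcome {
  name: string;
  passed: boolean;
  reason: string; // why the test passed or failed, in plain language
}

// Renders one line per test, followed by a summary line.
function formatReport(outcomes: TestOutcome[]): string {
  const lines = outcomes.map(
    (o) => `${o.passed ? "PASS" : "FAIL"}  ${o.name}: ${o.reason}`
  );
  const failed = outcomes.filter((o) => !o.passed).length;
  lines.push(`${outcomes.length - failed} passed, ${failed} failed`);
  return lines.join("\n");
}

// CI convention: exit code 0 means success, non-zero fails the pipeline.
function exitCode(outcomes: TestOutcome[]): number {
  return outcomes.every((o) => o.passed) ? 0 : 1;
}

// Made-up sample data for the demonstration.
const outcomes: TestOutcome[] = [
  { name: "refund policy", passed: true, reason: "Answer cites the 30-day window." },
  { name: "pricing question", passed: false, reason: "Answer quotes a discontinued plan." },
];

console.log(formatReport(outcomes));
console.log("exit code:", exitCode(outcomes));
```

Pairing each failure with a reason is what makes the result actionable: a reviewer can see what went wrong without interpreting a raw score.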
Who It's For
Use Evaliphy if you:
- Ship AI systems in production
- Need reliable quality checks in CI/CD
- Want a workflow that matches existing engineering practices
- Want to catch regressions before users do
Use something else if you:
- Are researching new evaluation metrics
- Need to fine-tune models
- Prefer notebook-first experimentation workflows
Quick Comparison
| Aspect | Evaliphy | Research Tools | Prompt Testing |
|---|---|---|---|
| Mental Model | Assertions (like API tests) | Research and optimization | Prompt iteration |
| Workflow | CI/CD pipeline | Notebooks and experiments | CLI/Web testing |
| Setup Time | Minutes | Hours | Minutes |
| ML Knowledge Required | None | Significant | Minimal |
| Vendor Lock-In | None (open source) | Possible | Possible |
| Best For | Production AI testing | Benchmarking and fine-tuning | Prompt engineering |
When to Use What
Evaliphy → You want production-ready AI testing in CI
- Example: "We shipped an AI support assistant and need quality checks on every PR."
- Your team: Product and engineering teams shipping user-facing AI features
- Your workflow: GitHub → CI → Deploy
- Best for: Any black-box AI API (RAG, agents, chatbots, summarizers, generation)
Research tools → You are optimizing models and metrics
- Example: "We're experimenting with different retrieval strategies and benchmarking them."
- Your team: ML engineers, data scientists, researchers
- Your workflow: Jupyter notebooks → experiments → optimization
Prompt testing tools → You are iterating on prompt templates quickly
- Example: "Testing different prompt templates for our chatbot."
- Your team: Anyone experimenting with prompts
- Your workflow: Prompt engineering → testing → iteration
After Beta
The beta exists to validate one thing: does this workflow actually work for QA engineers in practice? Based on what we learn, the roadmap includes:
- Broader assertion library covering more AI quality dimensions
- CI integrations for GitHub Actions, GitLab, and Jenkins out of the box
- Multi-turn conversation evaluation for chat-based systems
- Custom assertion support with your own prompts and scoring logic
- Comparison runs to diff two versions of your AI system against each other
Nothing on this list is promised. It reflects where we think Evaliphy should go based on the problems we set out to solve.
Next Steps
Evaliphy is in early beta. The API will change. There will be rough edges.
If you are an engineering team with AI features in production or in development, we would genuinely love to hear whether this solves a real problem for you.
To try it, install the CLI and create a project:
npm install -g evaliphy
evaliphy init my-project
Then tell us what you find:
- Something does not work: open an issue
- An assertion you need does not exist: tell us
- Something feels wrong about the workflow: we want to know
The beta is the conversation. The 1.0 comes after.