Introduction

Evaliphy simplifies end-to-end AI testing by treating your AI system as a black box: it exercises your real API over HTTP and asserts on the responses.

If you already test APIs with assertions, Evaliphy will feel familiar. You write assertions, run tests in CI, and get human-readable reports.

No Python notebooks. No research stack. No vendor lock-in.

Evaliphy works with any AI system you can call over HTTP: RAG, agents, chatbots, summarizers, and content generation flows.

Why Evaliphy Exists

Most AI evaluation tools are excellent for research and model optimization. But production teams need a simpler workflow they can run as part of shipping.

The Real Problem

You have:

✅ AI features in production
✅ A CI/CD pipeline
✅ Existing test practices for APIs and apps
❌ No consistent way to test AI behavior before release

Existing tools ask you to:

Leave your normal engineering workflow
Learn ML-heavy setup patterns
Interpret metrics-heavy outputs that are hard to act on quickly

The Evaliphy Approach

Evaliphy lets you:

Test your real API over HTTP
Write assertions with a familiar mental model
Run tests in CI/CD like your existing suites
Review clear, human-readable reports

Think of it like this:

Playwright tests UI behavior with assertions
Evaliphy tests AI behavior with assertions

Same philosophy. Different target.
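
The black-box, assertion-first idea described above can be sketched in plain TypeScript. This is an illustration of the mental model only, not Evaliphy's API: every name here (`AskFn`, `CheckResult`, `checkAnswerMentions`, `testRefundQuestion`, `stubAssistant`) is hypothetical, and the simple phrase check stands in for whatever assertions a real suite would use. The system under test is passed in as a function, so in practice it would wrap an HTTP call to your deployed endpoint.

```typescript
// Illustrative sketch only: none of these names come from Evaliphy.

// The system under test, seen as a black box: a question goes in, text comes out.
// In a real suite this would wrap an HTTP call to your deployed endpoint.
type AskFn = (question: string) => Promise<string>;

interface CheckResult {
  passed: boolean;
  reason: string; // human-readable explanation for the report
}

// Pure check: does the answer mention every required phrase (case-insensitive)?
function checkAnswerMentions(answer: string, requiredPhrases: string[]): CheckResult {
  const haystack = answer.toLowerCase();
  const missing = requiredPhrases.filter(
    (phrase) => !haystack.includes(phrase.toLowerCase()),
  );
  if (missing.length === 0) {
    return { passed: true, reason: "answer mentions all required phrases" };
  }
  return { passed: false, reason: `answer is missing: ${missing.join(", ")}` };
}

// Black-box test: call the system, then assert on what came back.
async function testRefundQuestion(ask: AskFn): Promise<CheckResult> {
  const answer = await ask("What is your refund policy?");
  return checkAnswerMentions(answer, ["refund", "30 days"]);
}

// Stub standing in for the real endpoint so the sketch runs offline.
const stubAssistant: AskFn = async () =>
  "You can request a refund within 30 days of purchase.";

testRefundQuestion(stubAssistant).then((result) => {
  console.log(result.passed ? "PASS" : "FAIL", "-", result.reason);
});
```

Keeping the check separate from the call means the same assertion can run against a stub locally and against the real API in CI, and the `reason` string gives a reviewer something to act on when a check fails.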

The Four Pillars

Familiar mental model: assertions that feel like your API tests.
No vendor lock-in: open source, your data, your results.
No ML overhead: no notebooks, no fine-tuning setup, no research infrastructure.
Human-readable reports: pass/fail with reasoning teams can act on quickly.

Who It's For

Use Evaliphy if you:

Ship AI systems in production
Need reliable quality checks in CI/CD
Want a workflow that matches existing engineering practices
Want to catch regressions before users do

Use something else if you:

Are researching new evaluation metrics
Need to fine-tune models
Prefer notebook-first experimentation workflows

Quick Comparison

Aspect	Evaliphy	Research Tools	Prompt Testing
Mental Model	Assertions (like API tests)	Research and optimization	Prompt iteration
Workflow	CI/CD pipeline	Notebook and experiments	CLI/Web testing
Setup Time	Minutes	Hours	Minutes
ML Knowledge Required	None	Significant	Minimal
Vendor Lock-In	None (open source)	Possible	Possible
Best For	Production AI testing	Benchmarking and fine-tuning	Prompt engineering

When to Use What

Evaliphy → You want production-ready AI testing in CI

Example: "We shipped an AI support assistant and need quality checks on every PR."
Your team: Product and engineering teams shipping user-facing AI features
Your workflow: GitHub → CI → Deploy
Best for: Any black-box AI API (RAG, agents, chatbots, summarizers, generation)

Research tools → You are optimizing models and metrics

Example: "We're experimenting with different retrieval strategies and benchmarking them."
Your team: ML engineers, data scientists, researchers
Your workflow: Jupyter notebooks → experiments → optimization

Prompt testing tools → You are iterating prompt templates quickly

Example: "Testing different prompt templates for our chatbot."
Your team: Anyone experimenting with prompts
Your workflow: Prompt engineering → testing → iteration

After Beta

The beta exists to validate one thing: does this workflow actually work for QA engineers in practice? Based on what we learn, the roadmap includes:

Broader assertion library covering more AI quality dimensions
CI integrations for GitHub Actions, GitLab, and Jenkins out of the box
Multi-turn conversation evaluation for chat-based systems
Custom assertion support with your own prompts and scoring logic
Comparison runs to diff two versions of your AI system against each other

Nothing on this list is promised. It reflects where we think Evaliphy should go based on the problems we set out to solve.

Next Steps

Evaliphy is in early beta. The API will change. There will be rough edges.

If you are an engineering team with AI features in production or in development, we would genuinely love to hear whether this solves a real problem for you.

To try it, install the CLI and create a project:

npm install -g evaliphy
evaliphy init my-project

Something does not work — open an issue
An assertion you need does not exist — tell us
Something feels wrong about the workflow — we want to know

The beta is the conversation. The 1.0 comes after.