Evaliphy is currently in beta. It is not recommended for production use yet. Please try it out and share your feedback.

Introduction

Evaliphy simplifies end-to-end AI testing by treating your AI system like a black box.

If you already test APIs with assertions, Evaliphy will feel familiar. You write assertions, run tests in CI, and get human-readable reports.

No Python notebooks. No research stack. No vendor lock-in.

Works with any AI system: RAG, agents, chatbots, summarizers, and content generation flows.

Why Evaliphy Exists

Most AI evaluation tools are excellent for research and model optimization. But production teams need a simpler workflow they can run as part of shipping.

The Real Problem

You have:

  • ✅ AI features in production
  • ✅ A CI/CD pipeline
  • ✅ Existing test practices for APIs and apps
  • ❌ No consistent way to test AI behavior before release

Existing tools ask you to:

  • Leave your normal engineering workflow
  • Learn ML-heavy setup patterns
  • Interpret metrics-heavy outputs that are hard to act on quickly

The Evaliphy Approach

Evaliphy lets you:

  • Test your real API over HTTP
  • Write assertions with a familiar mental model
  • Run tests in CI/CD like your existing suites
  • Review clear, human-readable reports

Think of it like this:

  • Playwright tests UI behavior with assertions
  • Evaliphy tests AI behavior with assertions

Same philosophy. Different target.
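
To make the black-box idea concrete, here is a minimal sketch of an assertion-style check against an AI endpoint. It is not Evaliphy's API: it uses only Node's built-in assert module, and the `answer` response field, the request body shape, and the helper names `askAssistant`, `checkMentions`, and `expectPassed` are assumptions made for illustration.

```typescript
import { strict as assert } from "node:assert";

// Hypothetical response shape for an AI endpoint; an assumption, not an Evaliphy type.
interface ChatResponse {
  answer: string;
}

// Minimal subset of the fetch API this sketch relies on, so a stub can be injected.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string }
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Result of one check: a pass/fail flag plus a reason a human can act on.
interface CheckResult {
  passed: boolean;
  reason: string;
}

// Send a question to the system under test over HTTP and return its parsed reply.
async function askAssistant(
  fetchFn: FetchLike,
  url: string,
  question: string
): Promise<ChatResponse> {
  const res = await fetchFn(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ question }),
  });
  if (!res.ok) {
    throw new Error(`AI endpoint returned HTTP ${res.status}`);
  }
  const body = (await res.json()) as Partial<ChatResponse> | null;
  if (typeof body?.answer !== "string") {
    throw new Error("response has no string 'answer' field");
  }
  return { answer: body.answer };
}

// Check that the answer mentions every required term (case-insensitive).
function checkMentions(answer: string, requiredTerms: string[]): CheckResult {
  const lower = answer.toLowerCase();
  const missing = requiredTerms.filter(
    (term) => !lower.includes(term.toLowerCase())
  );
  return missing.length === 0
    ? { passed: true, reason: "answer mentions all required terms" }
    : { passed: false, reason: `answer is missing: ${missing.join(", ")}` };
}

// Fail the test run with the human-readable reason when the check does not pass.
function expectPassed(result: CheckResult): void {
  assert.ok(result.passed, result.reason);
}
```

In a real test you would pass the global `fetch` (Node 18 or later) and your own endpoint URL to `askAssistant`, then feed the returned `answer` to `checkMentions` and `expectPassed`. Keyword matching is only a stand-in here; the point is the shape of the workflow: call the system over HTTP, assert on the output, and report a reason alongside pass/fail.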

The Four Pillars

  1. Familiar mental model: assertions that feel like your API tests.
  2. No vendor lock-in: open source, your data, your results.
  3. No ML overhead: no notebooks, no fine-tuning setup, no research infrastructure.
  4. Human-readable reports: pass/fail with reasoning teams can act on quickly.

Who It's For

Use Evaliphy if you:

  • Ship AI systems in production
  • Need reliable quality checks in CI/CD
  • Want a workflow that matches existing engineering practices
  • Want to catch regressions before users do

Use something else if you:

  • Are researching new evaluation metrics
  • Need to fine-tune models
  • Prefer notebook-first experimentation workflows

Quick Comparison

Aspect                | Evaliphy                    | Research Tools               | Prompt Testing
----------------------+-----------------------------+------------------------------+-------------------
Mental Model          | Assertions (like API tests) | Research and optimization    | Prompt iteration
Workflow              | CI/CD pipeline              | Notebook and experiments     | CLI/Web testing
Setup Time            | Minutes                     | Hours                        | Minutes
ML Knowledge Required | None                        | Significant                  | Minimal
Vendor Lock-In        | None (open source)          | Possible                     | Possible
Best For              | Production AI testing       | Benchmarking and fine-tuning | Prompt engineering

When to Use What

Evaliphy -> You want production-ready AI testing in CI

  • Example: "We shipped an AI support assistant and need quality checks on every PR."
  • Your team: Product and engineering teams shipping user-facing AI features
  • Your workflow: GitHub → CI → Deploy
  • Best for: Any black-box AI API (RAG, agents, chatbots, summarizers, generation)

Research tools -> You are optimizing models and metrics

  • Example: "We're experimenting with different retrieval strategies and benchmarking them."
  • Your team: ML engineers, data scientists, researchers
  • Your workflow: Jupyter notebooks → experiments → optimization

Prompt testing tools -> You are iterating prompt templates quickly

  • Example: "Testing different prompt templates for our chatbot."
  • Your team: Anyone experimenting with prompts
  • Your workflow: Prompt engineering → testing → iteration

After Beta

The beta exists to validate one thing — does this workflow actually work for QA engineers in practice? Based on what we learn, the roadmap includes:

  • Broader assertion library covering more AI quality dimensions
  • CI integrations for GitHub Actions, GitLab, and Jenkins out of the box
  • Multi-turn conversation evaluation for chat-based systems
  • Custom assertion support with your own prompts and scoring logic
  • Comparison runs to diff two versions of your AI system against each other

Nothing on this list is promised. It reflects where we think Evaliphy should go based on the problems we set out to solve.

Next Steps

Evaliphy is in early beta. The API will change. There will be rough edges.

If you are an engineering team with AI features in production or in development, we would genuinely love to hear whether this solves a real problem for you.

    npm install -g evaliphy
    evaliphy init my-project

  • Something does not work — open an issue
  • An assertion you need does not exist — tell us
  • Something feels wrong about the workflow — we want to know

The beta is the conversation. The 1.0 comes after.