Evaliphy is currently in beta. It is not recommended for production use yet. Please try it out and share your feedback.

Quick Start

Get up and running with Evaliphy in minutes.

Prerequisites

Before you start, make sure you have:

  • Node.js 24 or higher
  • An OpenAI API key (or any OpenAI-compatible provider)
  • A running RAG application with an HTTP endpoint

1. Initialise your project

The easiest way to start is using the Evaliphy CLI. It creates a recommended project structure with everything you need.

npm install -g evaliphy
npx evaliphy init my-eval-project
cd my-eval-project
npm install

This creates the following structure:

my-eval-project/
  evals/
    example.eval.ts       — a sample evaluation to get you started
  evaliphy.config.ts      — main configuration file
  package.json            — project dependencies and scripts
  tsconfig.json           — TypeScript configuration

2. Set your API key

Evaliphy uses an LLM to judge your RAG responses. Add your API key to your environment before running evaluations.

export OPENAI_API_KEY=your-api-key-here

Or add it to a .env file at the root of your project:

OPENAI_API_KEY=your-api-key-here
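
A missing key tends to surface only once the judge is first called, so it can help to check for it up front. The sketch below is a hypothetical helper (`requireEnv` is not part of Evaliphy) that reads a variable from an environment map and fails with a clear message when it is absent or blank.

```typescript
// Hypothetical helper, not part of Evaliphy: fail fast when a required
// environment variable is missing or blank.
type Env = Record<string, string | undefined>;

function requireEnv(name: string, env: Env): string {
  const value = env[name]?.trim();
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}
```

In a Node.js project you could call it as `requireEnv('OPENAI_API_KEY', process.env)` and pass the result wherever the key is needed.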

3. Configure Evaliphy

Open evaliphy.config.ts and point it at your RAG application:

import { defineConfig } from '@evaliphy/sdk';

export default defineConfig({
  http: {
    baseUrl: 'https://api.your-rag-app.com',  // your RAG API base URL
    timeout: 10000,
  },
  llmAsJudgeConfig: {
    model: 'gpt-4o-mini',
    provider: {
      type: 'openai',
      apiKey: process.env.OPENAI_API_KEY,
    },
  },
});

Evaliphy uses gpt-4o-mini by default. You can use any OpenAI-compatible provider including OpenRouter, Azure OpenAI, or a local model.
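
"OpenAI-compatible" generally means the provider accepts the same chat-completions request shape at a different base URL. The sketch below is not Evaliphy's internal code. It is an illustration, using the hypothetical names `JudgeTarget` and `buildChatRequest`, of how one request can be aimed at different providers by changing only the base URL, key, and model.

```typescript
// Illustration only; these names are hypothetical and not part of Evaliphy.
interface JudgeTarget {
  baseUrl: string; // e.g. 'https://api.openai.com/v1' or 'https://openrouter.ai/api/v1'
  apiKey: string;
  model: string;
}

interface ChatRequest {
  url: string;
  method: 'POST';
  headers: Record<string, string>;
  body: string;
}

function buildChatRequest(target: JudgeTarget, prompt: string): ChatRequest {
  return {
    // Strip trailing slashes so the path is joined exactly once.
    url: `${target.baseUrl.replace(/\/+$/, '')}/chat/completions`,
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${target.apiKey}`,
    },
    body: JSON.stringify({
      model: target.model,
      messages: [{ role: 'user', content: prompt }],
    }),
  };
}
```

Some providers, Azure OpenAI among them, may expect a different URL layout or authentication header, so check how your provider wants to be addressed before pointing the judge at it.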

4. Write your first evaluation

Open evals/example.eval.ts and replace the contents with a real evaluation against your RAG API:

import { evaluate, expect } from 'evaliphy';

evaluate('answer quality', async ({ httpClient }) => {
  const query = 'What is your refund policy?';
  const context = 'A detailed text explaining the return policy'; // stand-in for the context your retriever returned

  // 1. call your RAG application
  const data = await httpClient.post('/chat', { message: query });
  const llmReply = await data.json();

  // 2. assert the response is relevant to the query
  await expect({
    query,
    context,
    response: llmReply.answer,
  }).toBeRelevant(); // default threshold is 0.7

  // 3. assert the response is faithful to the retrieved context
  await expect({
    query,
    context,
    response: llmReply.answer,
  }).toBeFaithful({ threshold: 0.8 });
});
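
The evaluation above assumes the /chat endpoint returns JSON with an `answer` string. If your API uses a different shape, `llmReply.answer` would be `undefined` and the assertions would have nothing meaningful to judge. A small guard, sketched below under the hypothetical name `extractAnswer` (not an Evaliphy function), makes that failure explicit; adjust the field name to match your API.

```typescript
// Hypothetical helper, not part of Evaliphy: pull the answer text out of a
// parsed JSON payload, or throw if the payload has an unexpected shape.
function extractAnswer(payload: unknown, field: string = 'answer'): string {
  if (typeof payload !== 'object' || payload === null) {
    throw new TypeError('RAG response is not a JSON object');
  }
  const value = (payload as Record<string, unknown>)[field];
  if (typeof value !== 'string' || value.trim() === '') {
    throw new TypeError(`RAG response has no non-empty string field "${field}"`);
  }
  return value;
}
```

Inside the evaluation, this could replace the direct property access: `const answer = extractAnswer(await data.json());`.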

What each assertion checks:

  • toBeRelevant(): does the response actually address the query?
  • toBeFaithful(): does the response stay grounded in the retrieved context without hallucinating?
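
Each assertion compares the judge's score with a threshold. The sketch below shows that comparison under two assumptions that this guide does not confirm: scores lie between 0 and 1, and a score equal to the threshold passes. `meetsThreshold` is a hypothetical name, not an Evaliphy function.

```typescript
const DEFAULT_THRESHOLD = 0.7; // default stated in this guide

// Assumption: scores are in [0, 1] and the comparison is inclusive.
function meetsThreshold(score: number, threshold: number = DEFAULT_THRESHOLD): boolean {
  if (!(score >= 0 && score <= 1)) {
    throw new RangeError(`score must be between 0 and 1, got ${score}`);
  }
  return score >= threshold;
}
```

Under these assumptions, a relevance score of 0.91 passes the default threshold, while a faithfulness score of 0.52 fails a threshold of 0.8.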

5. Run your evaluations

npm test

Or directly via the CLI:

npx evaliphy eval

Evaliphy will:

  • Discover all .eval.ts files in your evals directory
  • Call your RAG API for each evaluation
  • Score each response using the built-in LLM judge
  • Print results to the console
  • Write a full report to the report/ directory
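
The discovery step can be pictured as a filter over file paths. The sketch below assumes only the convention described in this guide (files ending in .eval.ts under evals/). `selectEvalFiles` is a hypothetical helper, and Evaliphy's actual matching rules may differ.

```typescript
// Hypothetical helper, not part of Evaliphy: keep only paths that live under
// the evals directory and end in ".eval.ts", returned in sorted order.
function selectEvalFiles(paths: string[], evalsDir: string = 'evals'): string[] {
  const prefix = evalsDir.replace(/\/+$/, '') + '/';
  return paths
    .map((p) => p.replace(/\\/g, '/')) // normalise Windows separators
    .filter((p) => p.startsWith(prefix) && p.endsWith('.eval.ts'))
    .sort();
}
```

A helper file such as evals/helpers.ts would be skipped by this filter because it lacks the .eval.ts suffix.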

What a passing run looks like:

  ✓  answer quality — toBeRelevant    (score: 0.91)
  ✓  answer quality — toBeFaithful    (score: 0.87)

  2 passed, 0 failed
  Report written to report/report-[runId].html

What a failing run looks like:

  ✓  answer quality — toBeRelevant    (score: 0.91)
  ✗  answer quality — toBeFaithful    (score: 0.52)
       The response introduces information not found in the retrieved context.

  1 passed, 1 failed
  Report written to report/report-[runId].html

When an assertion fails, Evaliphy tells you the score, the threshold it was measured against, and the judge's reasoning in plain English — so you know exactly what to fix.

Next steps

  • Add more evaluations covering different queries and edge cases
  • Explore the full assertion library: toBeGrounded, toBeCoherent, and more
  • Set up CI integration to catch regressions automatically on every deploy
  • Customise thresholds and models per assertion if your use case needs it