Quick Start

Get up and running with Evaliphy in minutes.

Prerequisites

Before you start, make sure you have:

Node.js 24 or higher
An OpenAI API key (or any OpenAI-compatible provider)
A running RAG application with an HTTP endpoint

1. Initialise your project

The easiest way to start is using the Evaliphy CLI. It creates a recommended project structure with everything you need.

npm install -g evaliphy
npx evaliphy init my-eval-project
cd my-eval-project
npm install

This creates the following structure:

my-eval-project/
  evals/
    example.eval.ts       — a sample evaluation to get you started
  evaliphy.config.ts      — main configuration file
  package.json            — project dependencies and scripts
  tsconfig.json           — TypeScript configuration

2. Set your API key

Evaliphy uses an LLM to judge your RAG responses. Add your API key to your environment before running evaluations.

export OPENAI_API_KEY=your-api-key-here

Or add it to a .env file at the root of your project:

OPENAI_API_KEY=your-api-key-here
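
A missing or blank key tends to surface later as an opaque authentication error from the provider, so it can help to check for it up front. The sketch below is a hypothetical helper, not part of Evaliphy: `requireEnv` is a name chosen for illustration, and it takes the environment as a parameter so it can be used with `process.env` or with a plain object.

```typescript
// Hypothetical helper, not part of Evaliphy: fail fast when a required
// environment variable is missing or blank.
function requireEnv(
  name: string,
  env: Record<string, string | undefined>,
): string {
  const value = env[name]?.trim();
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Possible usage, for example near the top of a script that runs your evals:
// const apiKey = requireEnv("OPENAI_API_KEY", process.env);
```

This assumes the variable is already present in the environment when the check runs. If you rely on a .env file, the check only works after that file has been loaded.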

3. Configure Evaliphy

Open evaliphy.config.ts and point it at your RAG application:

import { defineConfig } from '@evaliphy/sdk';

export default defineConfig({
  http: {
    baseUrl: 'https://api.your-rag-app.com',  // your RAG API base URL
    timeout: 10000,
  },
  llmAsJudgeConfig: {
    model: 'gpt-4o-mini',
    provider: {
      type: 'openai',
      apiKey: process.env.OPENAI_API_KEY,
    },
  },
});

Evaliphy uses gpt-4o-mini by default. You can use any OpenAI-compatible provider including OpenRouter, Azure OpenAI, or a local model.

4. Write your first evaluation

Open evals/example.eval.ts and replace the contents with a real evaluation against your RAG API:

import { evaluate, expect } from 'evaliphy';

evaluate('answer quality', async ({ httpClient }) => {
  const query = 'What is your refund policy?';
  // placeholder: in a real evaluation, use the context your RAG pipeline retrieved
  const context = 'A detailed text explaining the return policy';

  // 1. call your RAG application
  const data = await httpClient.post('/chat', { message: query });
  const llmReply = await data.json();

  // 2. assert the response is relevant to the query
  await expect({
    query,
    context,
    response: llmReply.answer,
  }).toBeRelevant(); // default threshold is 0.7

  // 3. assert the response is faithful to the retrieved context
  await expect({
    query,
    context,
    response: llmReply.answer,
  }).toBeFaithful({ threshold: 0.8 });
});

What each assertion checks:

toBeRelevant(): does the response actually address the query?
toBeFaithful(): does the response stay grounded in the retrieved context without hallucinating?
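
Each assertion yields a judge score that is compared against a threshold (0.7 by default in the example above, 0.8 where it is set explicitly). The following is a minimal sketch of that pass/fail decision, not Evaliphy's implementation. `meetsThreshold` is a hypothetical name, and the sketch assumes scores lie between 0 and 1 and that a score equal to the threshold passes.

```typescript
// Illustrative only: models the pass/fail decision for a judge score.
// Assumes scores lie in [0, 1] and that a score equal to the threshold passes;
// Evaliphy's actual comparison may differ.
const DEFAULT_THRESHOLD = 0.7; // default mentioned in the example evaluation

function meetsThreshold(
  score: number,
  threshold: number = DEFAULT_THRESHOLD,
): boolean {
  if (score < 0 || score > 1 || threshold < 0 || threshold > 1) {
    throw new RangeError("score and threshold must be between 0 and 1");
  }
  return score >= threshold;
}

console.log(meetsThreshold(0.91));      // true: 0.91 >= 0.7
console.log(meetsThreshold(0.52, 0.8)); // false: 0.52 < 0.8
```

A higher threshold makes an assertion stricter, so raising it is a reasonable choice for checks such as faithfulness, where a borderline answer is still a problem.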

5. Run your evaluations

npm test

Or directly via the CLI:

npx evaliphy eval

Evaliphy will:

Discover all .eval.ts files in your evals directory
Call your RAG API for each evaluation
Score each response using the built-in LLM judge
Print results to the console
Write a full report to the report/ directory
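
To make the discovery step concrete, here is a rough sketch of how files ending in `.eval.ts` could be collected from a directory tree using Node's standard library. It is an illustration only: `findEvalFiles` is a hypothetical name, and Evaliphy's own discovery logic (for example, whether it recurses into subdirectories) may differ.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Recursively collect files whose names end in ".eval.ts".
// Illustration only: Evaliphy's own discovery logic may differ.
function findEvalFiles(dir: string): string[] {
  const found: string[] = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      // Descend into subdirectories such as evals/billing/
      found.push(...findEvalFiles(fullPath));
    } else if (entry.isFile() && entry.name.endsWith(".eval.ts")) {
      found.push(fullPath);
    }
  }
  // Sort so the order is stable from run to run
  return found.sort();
}

// Possible usage from the project root:
// console.log(findEvalFiles("evals"));
```

Under this naming rule, a helper file such as `evals/helpers.ts` would be skipped, which is one reason to keep shared utilities out of files that end in `.eval.ts`.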

What a passing run looks like:

  ✓  answer quality — toBeRelevant    (score: 0.91)
  ✓  answer quality — toBeFaithful    (score: 0.87)

  2 passed, 0 failed
  Report written to report/report-[runId].html

What a failing run looks like:

  ✓  answer quality — toBeRelevant    (score: 0.91)
  ✗  answer quality — toBeFaithful    (score: 0.52)
       The response introduces information not found in the retrieved context.

  1 passed, 1 failed
  Report written to report/report-[runId].html

When an assertion fails, Evaliphy tells you the score, the threshold it was measured against, and the judge's reasoning in plain English — so you know exactly what to fix.

Next steps

Add more evaluations covering different queries and edge cases
Explore the full assertion library: toBeGrounded, toBeCoherent, and more
Set up CI integration to catch regressions automatically on every deploy
Customise thresholds and models per assertion if your use case needs it