Quick Start
Get up and running with Evaliphy in minutes.
Prerequisites
Before you start, make sure you have:
- Node.js 24 or higher
- An OpenAI API key (or any OpenAI-compatible provider)
- A running RAG application with an HTTP endpoint
1. Initialise your project
The easiest way to start is using the Evaliphy CLI. It creates a recommended project structure with everything you need.
npm install -g evaliphy
npx evaliphy init my-eval-project
cd my-eval-project
npm install
This creates the following structure:
my-eval-project/
  evals/
    example.eval.ts       # a sample evaluation to get you started
  evaliphy.config.ts      # main configuration file
  package.json            # project dependencies and scripts
  tsconfig.json           # TypeScript configuration
2. Set your API key
Evaliphy uses an LLM to judge your RAG responses. Add your API key to your environment before running evaluations.
export OPENAI_API_KEY=your-api-key-here
Or add it to a .env file at the root of your project:
OPENAI_API_KEY=your-api-key-here
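A missing key tends to surface later as an opaque authentication error from the provider, so it can help to fail early. The sketch below is a hypothetical helper (`requireEnv` is not part of Evaliphy). It takes the environment as a plain object, so in Node.js it could be called with `process.env`.

```typescript
// Hypothetical helper, not part of Evaliphy: fail fast when a required
// environment variable is missing or blank.
type Env = Record<string, string | undefined>;

function requireEnv(env: Env, name: string): string {
  // Treat whitespace-only values the same as a missing variable.
  const value = env[name]?.trim();
  if (!value) {
    throw new Error(`Missing environment variable: ${name}`);
  }
  return value;
}

// Assumed usage in Node.js:
//   const apiKey = requireEnv(process.env, 'OPENAI_API_KEY');
```

Passing the environment in as an argument keeps the check easy to exercise without touching real credentials.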
3. Configure Evaliphy
Open evaliphy.config.ts and point it at your RAG application:
import { defineConfig } from '@evaliphy/sdk';

export default defineConfig({
  http: {
    baseUrl: 'https://api.your-rag-app.com', // your RAG API base URL
    timeout: 10000,
  },
  llmAsJudgeConfig: {
    model: 'gpt-4o-mini',
    provider: {
      type: 'openai',
      apiKey: process.env.OPENAI_API_KEY,
    },
  },
});
Evaliphy uses gpt-4o-mini by default. You can use any OpenAI-compatible provider including OpenRouter, Azure OpenAI, or a local model.
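Before running evaluations, it can be worth confirming that `baseUrl` and your endpoint path resolve to the URL you expect. The sketch below uses only the standard `fetch` and `AbortSignal.timeout` APIs and does not call into Evaliphy. The `/chat` path and the `{ message }` body shape are assumptions borrowed from the example evaluation in step 4 and may differ in your application; `buildChatRequest` and `pingRag` are hypothetical names.

```typescript
// Sketch only: builds the request that a POST to `${baseUrl}/chat` would send.
// The path and body shape are assumptions about your RAG API.
interface PreparedRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

function buildChatRequest(baseUrl: string, path: string, message: string): PreparedRequest {
  // Strip trailing slashes from the base and leading slashes from the path
  // so that joining them yields exactly one slash.
  const url = `${baseUrl.replace(/\/+$/, '')}/${path.replace(/^\/+/, '')}`;
  return {
    url,
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ message }),
    },
  };
}

// Sends one request, aborting after `timeoutMs` (10000 mirrors `http.timeout`
// in the config above), and returns the HTTP status code.
async function pingRag(baseUrl: string, timeoutMs = 10000): Promise<number> {
  const { url, init } = buildChatRequest(baseUrl, '/chat', 'ping');
  const response = await fetch(url, { ...init, signal: AbortSignal.timeout(timeoutMs) });
  return response.status;
}
```

A 2xx status from `pingRag('https://api.your-rag-app.com')` suggests the endpoint is reachable; anything else is worth fixing before spending judge tokens on an evaluation run.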
4. Write your first evaluation
Open evals/example.eval.ts and replace the contents with a real evaluation against your RAG API:
import { evaluate, expect } from 'evaliphy';

evaluate('answer quality', async ({ httpClient }) => {
  const query = 'What is your refund policy?';
  const context = 'A detailed text explaining return policy';

  // 1. call your RAG application
  const data = await httpClient.post('/chat', { message: query });
  const llmReply = await data.json();

  // 2. assert the response is relevant to the query
  await expect(query, context, llmReply.answer).toBeRelevant(); // default threshold is 0.7

  // 3. assert the response is faithful to the retrieved context
  await expect({
    query,
    context,
    response: llmReply.answer,
  }).toBeFaithful({ threshold: 0.8 });
});
What each assertion checks:
- toBeRelevant(): does the response actually address the query?
- toBeFaithful(): does the response stay grounded in the retrieved context without hallucinating?
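Both assertions come down to comparing a judge score with a threshold. The sketch below shows that comparison in isolation. It assumes scores fall between 0 and 1 and that a score equal to the threshold passes; neither is confirmed by this guide. `meetsThreshold` is a hypothetical name, not an Evaliphy export.

```typescript
// Default taken from the comment in the example evaluation above.
const DEFAULT_THRESHOLD = 0.7;

// Hypothetical illustration of the pass/fail decision behind an assertion.
function meetsThreshold(score: number, threshold: number = DEFAULT_THRESHOLD): boolean {
  if (score < 0 || score > 1 || threshold < 0 || threshold > 1) {
    throw new RangeError('score and threshold are expected to be between 0 and 1');
  }
  // Assumption: the comparison is inclusive.
  return score >= threshold;
}

meetsThreshold(0.91);      // true: above the default threshold of 0.7
meetsThreshold(0.52, 0.8); // false: below the explicit threshold of 0.8
```

Raising a threshold makes an assertion stricter, so a response that passes at 0.7 can fail at 0.8 without the response itself changing.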
5. Run your evaluations
npm test
Or directly via the CLI:
npx evaliphy eval
Evaliphy will:
- Discover all .eval.ts files in your evals directory
- Call your RAG API for each evaluation
- Score each response using the built-in LLM judge
- Print results to the console
- Write a full report to the report/ directory
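To make the discovery step concrete, here is a minimal sketch of suffix-based matching over a list of paths. It is an illustration only: whether Evaliphy searches subdirectories, and the order in which it runs files, are assumptions, and `discoverEvalFiles` is a hypothetical name.

```typescript
// Hypothetical sketch of suffix-based discovery; Evaliphy's actual matching
// rules may differ.
function discoverEvalFiles(paths: string[], dir = 'evals'): string[] {
  return paths
    // Keep only files under the evals directory that end in .eval.ts.
    .filter((p) => p.startsWith(`${dir}/`) && p.endsWith('.eval.ts'))
    // Sort for a stable, predictable order.
    .sort();
}

const discovered = discoverEvalFiles([
  'evals/example.eval.ts',
  'evals/helpers.ts',
  'evals/refunds/policy.eval.ts',
  'src/index.ts',
]);
// discovered: ['evals/example.eval.ts', 'evals/refunds/policy.eval.ts']
```

The practical takeaway is the naming convention: a helper such as `evals/helpers.ts` is not picked up, so shared utilities can sit alongside evaluations without being run as one.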
What a passing run looks like:
✓ answer quality — toBeRelevant (score: 0.91)
✓ answer quality — toBeFaithful (score: 0.87)
2 passed, 0 failed
Report written to report/report-[runId].html
What a failing run looks like:
✓ answer quality — toBeRelevant (score: 0.91)
✗ answer quality — toBeFaithful (score: 0.52)
The response introduces information not found in the retrieved context.
1 passed, 1 failed
Report written to report/report-[runId].html
When an assertion fails, Evaliphy tells you the score, the threshold it was measured against, and the judge's reasoning in plain English — so you know exactly what to fix.
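The pieces of a failure described above (score, threshold, reasoning) can be pictured as a small record per assertion, from which the pass and fail counts follow. The sketch below uses a hypothetical result shape whose field names are illustrative and not Evaliphy's report schema. The `exitCode` field is an assumption about how a run might be turned into a CI gate, not documented CLI behaviour.

```typescript
// Hypothetical result shape; field names are illustrative only.
interface AssertionResult {
  evaluation: string;
  assertion: string;
  score: number;
  threshold: number;
  reasoning?: string; // judge's explanation, typically present on failures
}

interface RunSummary {
  passed: number;
  failed: number;
  exitCode: number;
}

function summarize(results: AssertionResult[]): RunSummary {
  // An assertion fails when its score falls below its threshold.
  const failed = results.filter((r) => r.score < r.threshold).length;
  return {
    passed: results.length - failed,
    failed,
    // A non-zero exit code is the usual way to make a CI job fail.
    exitCode: failed > 0 ? 1 : 0,
  };
}

const summary = summarize([
  { evaluation: 'answer quality', assertion: 'toBeRelevant', score: 0.91, threshold: 0.7 },
  {
    evaluation: 'answer quality',
    assertion: 'toBeFaithful',
    score: 0.52,
    threshold: 0.8,
    reasoning: 'The response introduces information not found in the retrieved context.',
  },
]);
// summary: { passed: 1, failed: 1, exitCode: 1 }
```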
Next steps
- Add more evaluations covering different queries and edge cases
- Explore the full assertion library: toBeGrounded, toBeCoherent, and more
- Set up CI integration to catch regressions automatically on every deploy
- Customise thresholds and models per assertion if your use case needs it