# LLM As Judge in Evaliphy
Evaliphy uses an LLM as a judge to score your RAG system's responses. Before you rely on these scores to make decisions, it is worth understanding how they are produced, what influences them, and how to improve their accuracy for your specific use case.
## How scoring works
When you call an assertion like `toBeFaithful()`, Evaliphy:

- Takes your `query`, `response`, and `context` as inputs
- Loads a prompt template for that assertion
- Fills the template with your inputs
- Sends the rendered prompt to your configured judge model
- Parses the model's response into a numeric score between 0.0 and 1.0
- Compares the score against the threshold to determine pass or fail
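The last two steps can be sketched roughly as follows. This is a simplified illustration, not Evaliphy's actual internals; `parseScore` and `passes` are hypothetical names:

```typescript
// Sketch: parse the judge's reply into a score in [0, 1] and compare it
// against the assertion's threshold.

function parseScore(raw: string): number {
  // The judge is instructed to reply with a number; parse it and clamp to [0, 1].
  const value = Number.parseFloat(raw.trim());
  if (Number.isNaN(value)) {
    throw new Error(`Judge returned a non-numeric score: "${raw}"`);
  }
  return Math.min(1, Math.max(0, value));
}

function passes(score: number, threshold: number): boolean {
  return score >= threshold;
}

// Example: a judge reply of "0.85" against the default threshold of 0.7
console.log(passes(parseScore("0.85"), 0.7)); // true
```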
The quality of the score depends on two things — the judge model you use, and the prompt that instructs it.
## Built-in prompts
Evaliphy ships with default prompts for every built-in assertion:
| Assertion | What it measures |
|---|---|
| `toBeFaithful()` | Whether the response is grounded in the retrieved context |
| `toBeRelevant()` | Whether the response addresses the query |
| `toBeGrounded()` | Whether claims in the response are supported by the context |
| `toBeCoherent()` | Whether the response is logically consistent |
| `toBeHarmless()` | Whether the response contains harmful or toxic content |
These prompts are written to work reasonably well across a broad range of RAG applications. They are a good starting point.
However, they are general by design. A prompt written for a customer support RAG system will score differently on a legal document retrieval system, a medical knowledge base, or a code assistant. The built-in prompts do not know your domain, your users, or what "good" looks like in your specific context.
## When built-in prompts may not be enough
You may notice scoring feels off in situations like these:
- Your domain has specialised terminology — the judge may not recognise domain-specific language as correct or faithful
- Your context is structured data — tables, JSON, or code snippets behave differently than prose paragraphs
- Your responses are intentionally brief — a one-word answer to a yes/no question may score poorly on coherence even though it is correct
- Your use case has strict faithfulness requirements — the default threshold may be too lenient or too strict for your standards
- Your language is not English — built-in prompts are written in English and perform best with English inputs
In these cases, custom prompts will give you significantly more accurate and meaningful scores.
## Using custom prompts
### 1. Create a prompts directory
Create a folder in your project to store your custom prompt files:
```
my-eval-project/
  evals/
  prompts/            ← add this
  evaliphy.config.ts
```
### 2. Point Evaliphy to your prompts directory
Add `promptsDir` to your config file:
```ts
import { defineConfig } from '@evaliphy/sdk';

export default defineConfig({
  http: {
    baseUrl: 'http://localhost:8080',
  },
  llmAsJudgeConfig: {
    model: 'gpt-4o-mini',
    provider: {
      type: 'openai',
      apiKey: process.env.OPENAI_API_KEY,
    },
    promptsDir: './prompts', // ← add this
  },
});
```
### 3. Name your prompt files correctly
Evaliphy resolves prompts by filename. The filename must match the assertion's underlying metric name exactly:
| Assertion | Expected filename |
|---|---|
| `toBeFaithful()` | `prompts/faithfulness.md` |
| `toBeRelevant()` | `prompts/relevance.md` |
| `toBeGrounded()` | `prompts/groundedness.md` |
| `toBeCoherent()` | `prompts/coherence.md` |
| `toBeHarmless()` | `prompts/harmlessness.md` |
If Evaliphy finds a matching file in your `promptsDir`, it uses that. If not, it falls back to the built-in prompt. This means you only need to create files for the assertions you want to customise — everything else continues to use the defaults.
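The fallback behaviour can be pictured like this. A sketch only: the `builtinPrompts` map and `resolvePrompt` function are hypothetical stand-ins, and Evaliphy's real internals may differ:

```typescript
// Sketch: prefer a custom prompt file if one exists, otherwise fall back
// to the built-in prompt for that metric.
import * as fs from "node:fs";
import * as path from "node:path";

const builtinPrompts: Record<string, string> = {
  faithfulness: "<built-in faithfulness prompt>",
  relevance: "<built-in relevance prompt>",
  // ...one entry per built-in assertion
};

function resolvePrompt(promptsDir: string, name: string): string {
  const customPath = path.join(promptsDir, `${name}.md`);
  if (fs.existsSync(customPath)) {
    return fs.readFileSync(customPath, "utf8"); // custom file wins
  }
  return builtinPrompts[name]; // otherwise use the default
}
```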
### 4. Write your prompt file
Each prompt file is a Markdown file with a YAML frontmatter block at the top.
```md
---
name: faithfulness
input_variables:
  - question
  - context
  - response
---

You are evaluating responses from a customer support RAG system
for a UK-based e-commerce company.

Faithfulness means the response only uses information from the
retrieved help articles. Responses that add information from
general knowledge — even if correct — should be penalised.

## Question
{{question}}

## Retrieved Help Articles
{{context}}

## Agent Response
{{response}}

Evaluate how faithful this response is to the retrieved articles.
```
Rules for custom prompts:

- The `input_variables` in frontmatter must include `question`, `context`, and `response` — Evaliphy injects these automatically
- Use `{{question}}`, `{{context}}`, and `{{response}}` as placeholders in the template body
- Do not add an output format section — Evaliphy appends this automatically to ensure consistent scoring
- The file must be valid Markdown with valid YAML frontmatter
Evaliphy validates your prompt file at startup. If a required variable is missing or the frontmatter is malformed, the run will fail immediately with a clear error message — before any API calls are made.
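As an illustration, a startup check like the one described might look like this. The required variable names come from the rules above; the `validateFrontmatter` function itself is hypothetical, and a real implementation would use a proper YAML parser:

```typescript
// Sketch: verify a prompt file declares all required input variables
// before any API calls are made.
const REQUIRED_VARIABLES = ["question", "context", "response"];

function validateFrontmatter(fileContents: string): string[] {
  const match = fileContents.match(/^---\n([\s\S]*?)\n---/);
  if (!match) {
    throw new Error("Prompt file is missing a YAML frontmatter block");
  }
  // Naive extraction of "- item" lines (a real implementation would
  // parse the YAML properly).
  const declared = [...match[1].matchAll(/^\s*-\s*(\w+)\s*$/gm)].map((m) => m[1]);
  const missing = REQUIRED_VARIABLES.filter((v) => !declared.includes(v));
  if (missing.length > 0) {
    throw new Error(`Prompt frontmatter is missing required variables: ${missing.join(", ")}`);
  }
  return declared;
}
```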
## Choosing the right judge model
The judge model has a significant impact on scoring quality. As a general guide:
| Model | Suitable for |
|---|---|
| `gpt-4o` | Highest accuracy, best for production evaluation |
| `gpt-4o-mini` | Good balance of accuracy and cost, suitable for most use cases |
| `gpt-3.5-turbo` | Fast and cheap, less reliable for nuanced scoring |
Smaller or less capable models tend to:
- Return scores that cluster around 0.5 regardless of actual quality
- Misinterpret domain-specific terminology
- Produce inconsistent scores across runs for the same input
If your scores feel unreliable, switching to a more capable model is often the fastest fix.
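Using the config shape from earlier in this guide, switching judges is a one-line change:

```ts
export default defineConfig({
  llmAsJudgeConfig: {
    model: 'gpt-4o', // upgraded from 'gpt-4o-mini' for more reliable scoring
    provider: {
      type: 'openai',
      apiKey: process.env.OPENAI_API_KEY,
    },
  },
});
```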
## Calibrating your thresholds
The default threshold for all assertions is 0.7. This is a reasonable starting point but not a universal standard.
What a score of 0.7 means in practice depends on:
- The capability of your judge model
- The complexity of your domain
- The strictness of your custom prompt
How to calibrate:
- Run Evaliphy against a set of samples where you already know the expected quality
- Look at the scores and the judge's reasoning in the report
- Adjust the threshold until passing scores align with what you consider acceptable
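One simple way to do the adjustment step: for each candidate threshold, count how often its pass/fail decision agrees with your own judgement of the sample. This sketch is illustrative only; the `Sample` shape and the data are hypothetical, not part of Evaliphy:

```typescript
// Sketch: pick the threshold whose pass/fail decisions best agree with
// human labels on a small calibration set.
interface Sample {
  score: number;       // judge score from an Evaliphy run
  acceptable: boolean; // your own judgement of the response
}

function bestThreshold(samples: Sample[], candidates: number[]): number {
  let best = candidates[0];
  let bestAgreement = -1;
  for (const t of candidates) {
    // A sample "agrees" when passing the threshold matches your label.
    const agreement = samples.filter((s) => (s.score >= t) === s.acceptable).length;
    if (agreement > bestAgreement) {
      bestAgreement = agreement;
      best = t;
    }
  }
  return best;
}

const samples: Sample[] = [
  { score: 0.92, acceptable: true },
  { score: 0.81, acceptable: true },
  { score: 0.74, acceptable: false },
  { score: 0.55, acceptable: false },
];
console.log(bestThreshold(samples, [0.6, 0.7, 0.8])); // 0.8 agrees with all four labels
```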
You can set thresholds globally in config or per assertion:
```ts
// global threshold for all assertions
export default defineConfig({
  llmAsJudgeConfig: {
    threshold: 0.8,
  },
});
```

```ts
// per assertion override
await expect({ query, response, context })
  .toBeFaithful({ threshold: 0.9 });
```
## What Evaliphy cannot guarantee
Being transparent about limitations is important:
- LLM judges are not deterministic — the same input may produce slightly different scores across runs. For stability, set `temperature: 0` in your judge config.
- Scores are opinions, not facts — the judge is an LLM making a judgement call. It can be wrong. Use scores as a signal, not a ground truth.
- Built-in prompts are general — they will not capture domain-specific quality standards without customisation.
- Thresholds are arbitrary — a score of 0.8 does not mean "80% correct." It means the judge rated this response at 0.8 on the scale defined by the prompt.
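For the determinism point, pinning the judge's temperature looks like this (assuming `temperature` sits alongside `model` in the judge config, as this guide's other examples suggest):

```ts
export default defineConfig({
  llmAsJudgeConfig: {
    model: 'gpt-4o-mini',
    temperature: 0, // pin the judge for more stable scores across runs
    provider: {
      type: 'openai',
      apiKey: process.env.OPENAI_API_KEY,
    },
  },
});
```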
The goal of Evaliphy is to give you a consistent, repeatable signal about your RAG system's quality over time — not to produce a scientifically precise measurement on any single run.
## Summary
| Situation | Recommendation |
|---|---|
| Getting started | Use built-in prompts and default thresholds |
| Scores feel too lenient or too strict | Adjust thresholds in config |
| Domain-specific terminology causing issues | Write custom prompts |
| Scores vary significantly between runs | Set `temperature: 0` in judge config |
| Scores are consistently low or high | Switch to a more capable judge model |
Start with the defaults, observe the results, and customise from there. The built-in prompts exist to get you scoring immediately — custom prompts exist to make those scores meaningful for your specific system.