# LLM As Judge in Evaliphy
Evaliphy uses an LLM as a judge to score your RAG system's responses. Before you rely on these scores to make decisions, it is worth understanding how they are produced, what influences them, and how to improve their accuracy for your specific use case.
## How scoring works
When you call an assertion like `toBeFaithful()`, Evaliphy:

- Takes your `query`, `response`, and `context` as inputs
- Loads a prompt template for that assertion
- Fills the template with your inputs
- Sends the rendered prompt to your configured judge model
- Parses the model's response into a numeric score between 0.0 and 1.0
- Compares the score against the threshold to determine pass or fail
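The last two steps can be sketched roughly as follows. This is a simplified illustration, not Evaliphy's actual internals; `parseScore` and `passes` are hypothetical names:

```typescript
// Sketch: parse the judge's reply into a score in [0, 1] and compare it
// against the assertion's threshold.

function parseScore(raw: string): number {
  // The judge is instructed to reply with a number; parse it and clamp to [0, 1].
  const value = Number.parseFloat(raw.trim());
  if (Number.isNaN(value)) {
    throw new Error(`Judge returned a non-numeric score: "${raw}"`);
  }
  return Math.min(1, Math.max(0, value));
}

function passes(score: number, threshold: number): boolean {
  return score >= threshold;
}

// Example: a judge reply of "0.85" against the default threshold of 0.7
console.log(passes(parseScore("0.85"), 0.7)); // true
```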
The quality of the score depends on two things — the judge model you use, and the prompt that instructs it.
## Built-in prompts
Evaliphy ships with default prompts for every built-in assertion:
| Assertion | What it measures |
|---|---|
| `toBeFaithful()` | Whether the response is grounded in the retrieved context |
| `toBeRelevant()` | Whether the response addresses the query |
| `toBeGrounded()` | Whether claims in the response are supported by the context |
| `toBeCoherent()` | Whether the response is logically consistent |
| `toBeHarmless()` | Whether the response contains harmful or toxic content |
These prompts are written to work reasonably well across a broad range of RAG applications. They are a good starting point.
However, they are general by design. A prompt written for a customer support RAG system will score differently on a legal document retrieval system, a medical knowledge base, or a code assistant. The built-in prompts do not know your domain, your users, or what "good" looks like in your specific context.
## When built-in prompts may not be enough
You may notice scoring feels off in situations like these:
- Your domain has specialised terminology — the judge may not recognise domain-specific language as correct or faithful
- Your context is structured data — tables, JSON, or code snippets behave differently than prose paragraphs
- Your responses are intentionally brief — a one-word answer to a yes/no question may score poorly on coherence even though it is correct
- Your use case has strict faithfulness requirements — the default threshold may be too lenient or too strict for your standards
- Your language is not English — built-in prompts are written in English and perform best with English inputs
In these cases, custom prompts will give you significantly more accurate and meaningful scores.
## Using custom prompts
### 1. Create a prompts directory
Create a folder in your project to store your custom prompt files:
```
my-eval-project/
  evals/
  prompts/            ← add this
  evaliphy.config.ts
```
### 2. Point Evaliphy to your prompts directory
Add `promptsDir` to your config file:
```ts
import { defineConfig } from '@evaliphy/sdk';

export default defineConfig({
  http: {
    baseUrl: 'http://localhost:8080',
  },
  llmAsJudgeConfig: {
    model: 'gpt-4o-mini',
    provider: {
      type: 'openai',
      apiKey: process.env.OPENAI_API_KEY,
    },
    promptsDir: './prompts', // ← add this
  },
});
```
### 3. Name your prompt files correctly
Evaliphy resolves prompts by filename. The filename must match the assertion's underlying metric name exactly:
| Assertion | Expected filename |
|---|---|
| `toBeFaithful()` | `prompts/faithfulness.md` |
| `toBeRelevant()` | `prompts/relevance.md` |
| `toBeGrounded()` | `prompts/groundedness.md` |
| `toBeCoherent()` | `prompts/coherence.md` |
| `toBeHarmless()` | `prompts/harmlessness.md` |
If Evaliphy finds a matching file in your `promptsDir`, it uses that. If not, it falls back to the built-in prompt. This means you only need to create files for the assertions you want to customise — everything else continues to use the defaults.
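The fallback behaviour can be pictured like this. A sketch only: the `builtinPrompts` map and `resolvePrompt` function are hypothetical stand-ins, and Evaliphy's real internals may differ:

```typescript
// Sketch: prefer a custom prompt file if one exists, otherwise fall back
// to the built-in prompt for that metric.
import * as fs from "node:fs";
import * as path from "node:path";

const builtinPrompts: Record<string, string> = {
  faithfulness: "<built-in faithfulness prompt>",
  relevance: "<built-in relevance prompt>",
  // ...one entry per built-in assertion
};

function resolvePrompt(promptsDir: string, name: string): string {
  const customPath = path.join(promptsDir, `${name}.md`);
  if (fs.existsSync(customPath)) {
    return fs.readFileSync(customPath, "utf8"); // custom file wins
  }
  return builtinPrompts[name]; // otherwise use the default
}
```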
### 4. Write your prompt file
Each prompt file is a Markdown file with a YAML frontmatter block at the top.
```md
---
name: faithfulness
input_variables:
  - question
  - context
  - response
---

You are evaluating responses from a customer support RAG system
for a UK-based e-commerce company.

Faithfulness means the response only uses information from the
retrieved help articles. Responses that add information from
general knowledge — even if correct — should be penalised.

## Question
{{question}}

## Retrieved Help Articles
{{context}}

## Agent Response
{{response}}

Evaluate how faithful this response is to the retrieved articles.
```
Rules for custom prompts:

- The `input_variables` in frontmatter must include `question`, `context`, and `response` — Evaliphy injects these automatically
- Use `{{question}}`, `{{context}}`, and `{{response}}` as placeholders in the template body
- Do not add an output format section — Evaliphy appends this automatically to ensure consistent scoring
- The file must be valid Markdown with valid YAML frontmatter
Evaliphy validates your prompt file at startup. If a required variable is missing or the frontmatter is malformed, the run will fail immediately with a clear error message — before any API calls are made.
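As an illustration, a startup check like the one described might look like this. The required variable names come from the rules above; the `validateFrontmatter` function itself is hypothetical, and a real implementation would use a proper YAML parser:

```typescript
// Sketch: verify a prompt file declares all required input variables
// before any API calls are made.
const REQUIRED_VARIABLES = ["question", "context", "response"];

function validateFrontmatter(fileContents: string): string[] {
  const match = fileContents.match(/^---\n([\s\S]*?)\n---/);
  if (!match) {
    throw new Error("Prompt file is missing a YAML frontmatter block");
  }
  // Naive extraction of "- item" lines (a real implementation would
  // parse the YAML properly).
  const declared = [...match[1].matchAll(/^\s*-\s*(\w+)\s*$/gm)].map((m) => m[1]);
  const missing = REQUIRED_VARIABLES.filter((v) => !declared.includes(v));
  if (missing.length > 0) {
    throw new Error(`Prompt frontmatter is missing required variables: ${missing.join(", ")}`);
  }
  return declared;
}
```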
## Choosing the right judge model
The judge model has a significant impact on scoring quality. As a general guide:
| Model | Suitable for |
|---|---|
| `gpt-4o` | Highest accuracy, best for production evaluation |
| `gpt-4o-mini` | Good balance of accuracy and cost, suitable for most use cases |
| `gpt-3.5-turbo` | Fast and cheap, less reliable for nuanced scoring |
Smaller or less capable models tend to:
- Return scores that cluster around 0.5 regardless of actual quality
- Misinterpret domain-specific terminology
- Produce inconsistent scores across runs for the same input
If your scores feel unreliable, switching to a more capable model is often the fastest fix.
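Using the config shape from earlier in this guide, switching judges is a one-line change:

```ts
export default defineConfig({
  llmAsJudgeConfig: {
    model: 'gpt-4o', // upgraded from 'gpt-4o-mini' for more reliable scoring
    provider: {
      type: 'openai',
      apiKey: process.env.OPENAI_API_KEY,
    },
  },
});
```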
## Calibrating your thresholds
The default threshold for all assertions is 0.7. This is a reasonable starting point but not a universal standard.
What a score of 0.7 means in practice depends on:
- The capability of your judge model
- The complexity of your domain
- The strictness of your custom prompt
How to calibrate:
- Run Evaliphy against a set of samples where you already know the expected quality
- Look at the scores and the judge's reasoning in the report
- Adjust the threshold until passing scores align with what you consider acceptable
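One simple way to do the adjustment step: for each candidate threshold, count how often its pass/fail decision agrees with your own judgement of the sample. This sketch is illustrative only; the `Sample` shape and the data are hypothetical, not part of Evaliphy:

```typescript
// Sketch: pick the threshold whose pass/fail decisions best agree with
// human labels on a small calibration set.
interface Sample {
  score: number;       // judge score from an Evaliphy run
  acceptable: boolean; // your own judgement of the response
}

function bestThreshold(samples: Sample[], candidates: number[]): number {
  let best = candidates[0];
  let bestAgreement = -1;
  for (const t of candidates) {
    // A sample "agrees" when passing the threshold matches your label.
    const agreement = samples.filter((s) => (s.score >= t) === s.acceptable).length;
    if (agreement > bestAgreement) {
      bestAgreement = agreement;
      best = t;
    }
  }
  return best;
}

const samples: Sample[] = [
  { score: 0.92, acceptable: true },
  { score: 0.81, acceptable: true },
  { score: 0.74, acceptable: false },
  { score: 0.55, acceptable: false },
];
console.log(bestThreshold(samples, [0.6, 0.7, 0.8])); // 0.8 agrees with all four labels
```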
You can set thresholds globally in config or per assertion:
```ts
// global threshold for all assertions
export default defineConfig({
  llmAsJudgeConfig: {
    threshold: 0.8,
  },
});
```

```ts
// per assertion override
await expect({ query, response, context })
  .toBeFaithful({ threshold: 0.9 });
```
## What Evaliphy cannot guarantee
Being transparent about limitations is important:
- LLM judges are not deterministic — the same input may produce slightly different scores across runs. For stability, set `temperature: 0` in your judge config.
- Scores are opinions, not facts — the judge is an LLM making a judgement call. It can be wrong. Use scores as a signal, not a ground truth.
- Built-in prompts are general — they will not capture domain-specific quality standards without customisation.
- Thresholds are arbitrary — a score of 0.8 does not mean "80% correct." It means the judge rated this response at 0.8 on the scale defined by the prompt.
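For the determinism point, pinning the judge's temperature looks like this (assuming `temperature` sits alongside `model` in the judge config, as this guide's other examples suggest):

```ts
export default defineConfig({
  llmAsJudgeConfig: {
    model: 'gpt-4o-mini',
    temperature: 0, // pin the judge for more stable scores across runs
    provider: {
      type: 'openai',
      apiKey: process.env.OPENAI_API_KEY,
    },
  },
});
```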
The goal of Evaliphy is to give you a consistent, repeatable signal about your RAG system's quality over time — not to produce a scientifically precise measurement on any single run.
## Summary
| Situation | Recommendation |
|---|---|
| Getting started | Use built-in prompts and default thresholds |
| Scores feel too lenient or too strict | Adjust thresholds in config |
| Domain-specific terminology causing issues | Write custom prompts |
| Scores vary significantly between runs | Set `temperature: 0` in judge config |
| Scores are consistently low or high | Switch to a more capable judge model |
Start with the defaults, observe the results, and customise from there. The built-in prompts exist to get you scoring immediately — custom prompts exist to make those scores meaningful for your specific system.