Evaliphy is currently in beta. It is not recommended for production use yet. Please try it out and share your feedback.

LLM-as-Judge in Evaliphy

Evaliphy uses an LLM as a judge to score your RAG system's responses. Before you rely on these scores to make decisions, it is worth understanding how they are produced, what influences them, and how to improve their accuracy for your specific use case.


How scoring works

When you call an assertion like toBeFaithful(), Evaliphy:

  1. Takes your query, response, and context as inputs
  2. Loads a prompt template for that assertion
  3. Fills the template with your inputs
  4. Sends the rendered prompt to your configured judge model
  5. Parses the model's response into a numeric score between 0.0 and 1.0
  6. Compares the score against the threshold to determine pass or fail

The quality of the score depends on two things — the judge model you use, and the prompt that instructs it.


Built-in prompts

Evaliphy ships with default prompts for every built-in assertion:

Assertion        What it measures
toBeFaithful()   Whether the response is grounded in the retrieved context
toBeRelevant()   Whether the response addresses the query
toBeGrounded()   Whether claims in the response are supported by the context
toBeCoherent()   Whether the response is logically consistent
toBeHarmless()   Whether the response is free of harmful or toxic content

These prompts are written to work reasonably well across a broad range of RAG applications. They are a good starting point.

However, they are general by design. A prompt written for a customer support RAG system will score differently on a legal document retrieval system, a medical knowledge base, or a code assistant. The built-in prompts do not know your domain, your users, or what "good" looks like in your specific context.


When built-in prompts may not be enough

You may notice scoring feels off in situations like these:

  • Your domain has specialised terminology — the judge may not recognise domain-specific language as correct or faithful
  • Your context is structured data — tables, JSON, or code snippets behave differently than prose paragraphs
  • Your responses are intentionally brief — a one-word answer to a yes/no question may score poorly on coherence even though it is correct
  • Your use case has strict faithfulness requirements — the default threshold may be too lenient or too strict for your standards
  • Your language is not English — built-in prompts are written in English and perform best with English inputs

In these cases, custom prompts will give you significantly more accurate and meaningful scores.


Using custom prompts

1. Create a prompts directory

Create a folder in your project to store your custom prompt files:

my-eval-project/
  evals/
  prompts/           ← add this
  evaliphy.config.ts

2. Point Evaliphy to your prompts directory

Add promptsDir to your config file:

import { defineConfig } from '@evaliphy/sdk';

export default defineConfig({
  http: {
    baseUrl: 'http://localhost:8080',
  },
  llmAsJudgeConfig: {
    model: 'gpt-4o-mini',
    provider: {
      type: 'openai',
      apiKey: process.env.OPENAI_API_KEY,
    },
    promptsDir: './prompts',    // ← add this
  },
});

3. Name your prompt files correctly

Evaliphy resolves prompts by filename. The filename must match the assertion name exactly:

Assertion        Expected filename
toBeFaithful()   prompts/faithfulness.md
toBeRelevant()   prompts/relevance.md
toBeGrounded()   prompts/groundedness.md
toBeCoherent()   prompts/coherence.md
toBeHarmless()   prompts/harmlessness.md

If Evaliphy finds a matching file in your promptsDir, it uses that. If not, it falls back to the built-in prompt. This means you only need to create files for the assertions you want to customise — everything else continues to use the defaults.
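The fallback behaviour can be pictured with a short sketch. The filename mapping mirrors the table above, but `resolvePrompt` and `PROMPT_FILES` are hypothetical names, not Evaliphy internals:

```typescript
import * as fs from 'node:fs';
import * as path from 'node:path';

// Assertion name → expected filename, as in the table above
const PROMPT_FILES: Record<string, string> = {
  toBeFaithful: 'faithfulness.md',
  toBeRelevant: 'relevance.md',
  toBeGrounded: 'groundedness.md',
  toBeCoherent: 'coherence.md',
  toBeHarmless: 'harmlessness.md',
};

// Use the custom prompt file if it exists, otherwise fall back to the built-in
function resolvePrompt(assertion: string, promptsDir: string, builtIn: string): string {
  const file = path.join(promptsDir, PROMPT_FILES[assertion]);
  return fs.existsSync(file) ? fs.readFileSync(file, 'utf8') : builtIn;
}
```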

4. Write your prompt file

Each prompt file is a Markdown file with a YAML frontmatter block at the top.

---
name: faithfulness
input_variables:
  - question
  - context
  - response
---

You are evaluating responses from a customer support RAG system
for a UK-based e-commerce company.

Faithfulness means the response only uses information from the
retrieved help articles. Responses that add information from
general knowledge — even if correct — should be penalised.

## Question
{{question}}

## Retrieved Help Articles
{{context}}

## Agent Response
{{response}}

Evaluate how faithful this response is to the retrieved articles.

Rules for custom prompts:

  • The input_variables in frontmatter must include question, context, and response — Evaliphy injects these automatically
  • Use {{question}}, {{context}}, and {{response}} as placeholders in the template body
  • Do not add an output format section — Evaliphy appends this automatically to ensure consistent scoring
  • The file must be valid Markdown with valid YAML frontmatter

Evaliphy validates your prompt file at startup. If a required variable is missing or the frontmatter is malformed, the run will fail immediately with a clear error message — before any API calls are made.
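That startup check can be approximated with a sketch like the following. This is assumed behaviour: `validateFrontmatter` and the regex-based parsing are illustrative stand-ins, not how Evaliphy actually parses YAML.

```typescript
// Required variables that Evaliphy injects into every prompt
const REQUIRED = ['question', 'context', 'response'];

// Throw before any API call if the frontmatter is malformed or incomplete
function validateFrontmatter(fileText: string): void {
  const match = fileText.match(/^---\n([\s\S]*?)\n---/);
  if (!match) throw new Error('missing or malformed frontmatter block');
  // Collect "- name" list items from the frontmatter
  const vars = [...match[1].matchAll(/^\s*-\s*(\w+)\s*$/gm)].map(m => m[1]);
  for (const v of REQUIRED) {
    if (!vars.includes(v)) {
      throw new Error(`input_variables is missing required variable "${v}"`);
    }
  }
}
```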


Choosing the right judge model

The judge model has a significant impact on scoring quality. As a general guide:

Model           Suitable for
gpt-4o          Highest accuracy; best for production evaluation
gpt-4o-mini     Good balance of accuracy and cost; suitable for most use cases
gpt-3.5-turbo   Fast and cheap, but less reliable for nuanced scoring

Smaller or less capable models tend to:

  • Return scores that cluster around 0.5 regardless of actual quality
  • Misinterpret domain-specific terminology
  • Produce inconsistent scores across runs for the same input

If your scores feel unreliable, switching to a more capable model is often the fastest fix.
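For example, moving from gpt-4o-mini to gpt-4o is a one-line change in the config shown earlier:

```typescript
import { defineConfig } from '@evaliphy/sdk';

export default defineConfig({
  llmAsJudgeConfig: {
    model: 'gpt-4o',   // was 'gpt-4o-mini'
    provider: {
      type: 'openai',
      apiKey: process.env.OPENAI_API_KEY,
    },
  },
});
```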


Calibrating your thresholds

The default threshold for all assertions is 0.7. This is a reasonable starting point but not a universal standard.

What a score of 0.7 means in practice depends on:

  • The capability of your judge model
  • The complexity of your domain
  • The strictness of your custom prompt

How to calibrate:

  1. Run Evaliphy against a set of samples where you already know the expected quality
  2. Look at the scores and the judge's reasoning in the report
  3. Adjust the threshold until passing scores align with what you consider acceptable
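Step 3 can be made more systematic. The sketch below is a hypothetical helper, not part of Evaliphy: it searches for the threshold that best separates samples you have already labelled as acceptable or not.

```typescript
type Labelled = { score: number; good: boolean };

// Try thresholds 0.00, 0.05, ..., 1.00 and keep the one that classifies
// the most labelled samples correctly (score >= threshold means pass)
function bestThreshold(samples: Labelled[]): number {
  let best = 0;
  let bestCorrect = -1;
  for (let i = 0; i <= 20; i++) {
    const t = i / 20;
    const correct = samples.filter(s => (s.score >= t) === s.good).length;
    if (correct > bestCorrect) {
      best = t;
      bestCorrect = correct;
    }
  }
  return best;
}
```

Feed it judge scores from the report alongside your own pass/fail labels, then set the result as your threshold in config.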

You can set thresholds globally in config or per assertion:

// global threshold for all assertions
export default defineConfig({
  llmAsJudgeConfig: {
    threshold: 0.8,
  },
});

// per-assertion override
await expect({ query, response, context })
  .toBeFaithful({ threshold: 0.9 });

What Evaliphy cannot guarantee

Being transparent about limitations is important:

  • LLM judges are not deterministic — the same input may produce slightly different scores across runs. For stability, set temperature: 0 in your judge config.
  • Scores are opinions, not facts — the judge is an LLM making a judgement call. It can be wrong. Use scores as a signal, not a ground truth.
  • Built-in prompts are general — they will not capture domain-specific quality standards without customisation.
  • Thresholds are arbitrary — a score of 0.8 does not mean "80% correct." It means the judge rated this response at 0.8 on the scale defined by the prompt.
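The determinism point above translates into config like this (field placement assumed from the note; exact option support depends on your provider):

```typescript
import { defineConfig } from '@evaliphy/sdk';

export default defineConfig({
  llmAsJudgeConfig: {
    model: 'gpt-4o-mini',
    temperature: 0,   // minimise run-to-run score variation
    provider: {
      type: 'openai',
      apiKey: process.env.OPENAI_API_KEY,
    },
  },
});
```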

The goal of Evaliphy is to give you a consistent, repeatable signal about your RAG system's quality over time — not to produce a scientifically precise measurement on any single run.


Summary

Situation                                    Recommendation
Getting started                              Use built-in prompts and default thresholds
Scores feel too lenient or too strict        Adjust thresholds in config
Domain-specific terminology causing issues   Write custom prompts
Scores vary significantly between runs       Set temperature: 0 in judge config
Scores are consistently low or high          Switch to a more capable judge model

Start with the defaults, observe the results, and customise from there. The built-in prompts exist to get you scoring immediately — custom prompts exist to make those scores meaningful for your specific system.