When the Judge is Wrong: Tuning LLM-as-a-Judge for Your Domain
You’ve set up Evaliphy, connected your RAG API, and started running assertions. But then you see it: a test failed with a low score, but when you read the response, it actually looks... fine? Or worse, a test passed when the bot clearly hallucinated something specific to your industry.
This is the "Judge Gap." Even the best models like GPT-4o or Claude 3.5 Sonnet don't know the specific nuances of your company's internal jargon, legal constraints, or product edge cases out of the box.
If the LLM-as-a-Judge doesn't seem to be "getting" your domain, don't throw the tool away. Here's how to close the gap.
## 1. Analyze the 'Reason'
Before changing anything, look at the `reason` field in your Evaliphy report. Evaliphy doesn't just give you a score; it asks the judge to explain its work.
Often, you'll find the judge is actually right, but your "Golden Context" was missing a key detail. If the judge says "The response mentions a 30-day refund policy but the context only discusses 14 days," and you know the policy just changed to 30 days, your test data is stale, not the judge.
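To make this triage systematic, you can script it. Here's a minimal sketch that surfaces every failure's reason for review — note that the `AssertionResult` shape below is a hypothetical stand-in, not Evaliphy's actual report schema:

```typescript
// Hypothetical report shape — adjust to match your real Evaliphy output.
interface AssertionResult {
  assertion: string;
  score: number;
  passed: boolean;
  reason: string;
}

// Collect the judge's explanations for every failed assertion so you can
// decide whether the judge is wrong or your golden context is stale.
function failureReasons(results: AssertionResult[]): string[] {
  return results
    .filter((r) => !r.passed)
    .map((r) => `${r.assertion} (score ${r.score}): ${r.reason}`);
}

const report: AssertionResult[] = [
  {
    assertion: "toBeFaithful",
    score: 0.4,
    passed: false,
    reason:
      "Response mentions a 30-day refund policy but the context only discusses 14 days.",
  },
  {
    assertion: "toBeRelevant",
    score: 0.9,
    passed: true,
    reason: "Directly answers the query.",
  },
];

console.log(failureReasons(report).join("\n"));
```

Reading these reasons in bulk is often enough to spot patterns like stale golden context before you touch the judge at all.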
## 2. Adjust the Threshold
By default, Evaliphy assertions usually look for a score of 0.7 or higher to pass. If your domain is highly creative or subjective, a 0.6 might actually be a "pass" for you. Conversely, for medical or legal RAG, you might want to bump that to 0.9.
You can adjust this per assertion:
```typescript
await expect(res).toBeFaithful({ threshold: 0.9 });
```
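Under the hood, a threshold check is just a comparison. A sketch of the pass/fail logic, assuming the 0.7 default described above and a pass when the score meets or exceeds the threshold:

```typescript
// Sketch of threshold logic: an assertion passes when the judge's
// score meets or exceeds the threshold (0.7 by default, per the docs).
function passesThreshold(score: number, threshold: number = 0.7): boolean {
  return score >= threshold;
}

console.log(passesThreshold(0.75)); // default 0.7 -> true
console.log(passesThreshold(0.65, 0.6)); // lenient creative domain -> true
console.log(passesThreshold(0.85, 0.9)); // strict medical domain -> false
```

The same judge score of 0.85 passes in a creative domain and fails in a medical one — the threshold is where you encode how much risk your domain tolerates.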
## 3. The Power Move: Custom Prompts
If the default logic for `toBeRelevant` or `toBeFaithful` is too generic for your needs, Evaliphy allows you to override the judge's instructions entirely.
Every LLM-based assertion in Evaliphy is powered by a Markdown prompt file. You can provide your own version of these prompts to teach the judge about your domain.
### How to use `promptsDir`
In your `evaliphy.config.ts`, you can specify a directory where your custom prompts live:
```typescript
import { defineConfig } from 'evaliphy';

export default defineConfig({
  llmAsJudgeConfig: {
    model: 'gpt-4o',
    promptsDir: './my-custom-prompts' // Path relative to config file
  }
});
```
### Overriding an Assertion
If you want to customize how `toBeRelevant` works, create a file named `toBeRelevant.md` inside your `promptsDir`.
Evaliphy uses a simple Markdown format with YAML frontmatter. Here’s what a custom prompt might look like for a medical RAG system:
```md
---
name: toBeRelevant
input_variables: [response, query, context]
---

You are a medical audit expert. Your task is to judge if a chatbot's response
accurately and safely answers a patient's query.

Query: {{query}}
Response: {{response}}
Context: {{context}}

Special Domain Rules:
1. If the response gives specific dosage advice without a disclaimer, it is NOT relevant/safe (score 0).
2. Use professional medical terminology.

Return a JSON object with:
- score: 0.0 to 1.0
- reason: A brief explanation of the score
- passed: true/false (threshold is 0.7)
```
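The `{{query}}`-style placeholders suggest simple variable substitution against the names listed in `input_variables`. Here's a sketch of how such a template might be filled in — this illustrates the mechanism, not Evaliphy's actual rendering engine:

```typescript
// Naive {{variable}} interpolation, illustrating how input_variables
// from the frontmatter could be substituted into the prompt body.
// Evaliphy's real templating may differ.
function renderPrompt(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (match, name) =>
    name in vars ? vars[name] : match
  );
}

const filled = renderPrompt("Query: {{query}}\nResponse: {{response}}", {
  query: "Can I double my dose?",
  response: "Please consult your doctor before changing any dosage.",
});

console.log(filled);
```

Unknown placeholders are left untouched here, which makes missing `input_variables` easy to spot in the rendered prompt.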
When you run `evaliphy eval`, the engine will look in your `promptsDir` first. If it finds `toBeRelevant.md`, it uses your medical-specific instructions instead of the built-in ones.
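This lookup order can be pictured as a simple fallback chain. The sketch below uses in-memory maps in place of directories and assumes custom prompts shadow built-ins by assertion name, as described above:

```typescript
// Fallback resolution: prefer a custom prompt if one exists for the
// assertion name, otherwise use the built-in version.
// Maps stand in for directories here; the real engine reads .md files.
function resolvePrompt(
  name: string,
  custom: Map<string, string>,
  builtin: Map<string, string>
): string | undefined {
  return custom.get(name) ?? builtin.get(name);
}

const builtinPrompts = new Map<string, string>([
  ["toBeRelevant", "<built-in relevance prompt>"],
]);
const customPrompts = new Map<string, string>([
  ["toBeRelevant", "<medical-audit prompt>"],
]);

console.log(resolvePrompt("toBeRelevant", customPrompts, builtinPrompts)); // custom wins
console.log(resolvePrompt("toBeRelevant", new Map<string, string>(), builtinPrompts)); // falls back to built-in
```

The practical upshot: you only override the assertions whose default behavior fails your domain, and everything else keeps working unchanged.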
## Summary
LLM-as-a-Judge is a starting point, not a black box you're stuck with. By using `promptsDir`, you can evolve Evaliphy from a general-purpose tool into a domain-expert auditor that understands your business as well as your QA team does.
Stop fighting the judge—start training it.