LLM Evaluation

LLM evaluation is the systematic process of measuring language model output quality across dimensions like accuracy, faithfulness, relevance, and safety.

It answers a question that traditional software testing cannot: "is this non-deterministic output good enough?" The field builds on earlier work in NLP evaluation but has evolved significantly for the open-ended generation tasks that LLMs handle.

Why LLM evaluation exists

Traditional software produces deterministic outputs. Given the same input, you get the same output, and you can assert on exact matches. LLM applications break this model. The same prompt can produce different responses across runs, and "correct" is not binary -- a response can be partially accurate, mostly relevant, or somewhat helpful. Evaluation provides the scoring framework that replaces pass/fail assertions.
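The shift from binary assertions to graded scoring can be sketched in a few lines. Everything here is illustrative: `graded_assert` is a hypothetical helper, and the 0.95 score stands in for what a semantic scorer might return.

```python
def graded_assert(score: float, threshold: float = 0.8) -> bool:
    """Replace a binary pass/fail assertion with a threshold on a quality score."""
    return score >= threshold

# Exact match fails on a harmless paraphrase; a graded check does not.
reference = "Paris is the capital of France."
response = "The capital of France is Paris."
exact_match = response == reference  # False: same meaning, different phrasing

# A semantic scorer (hypothetical) would rate these near 1.0:
graded_assert(0.95)  # passes despite the phrasing difference
```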

Without evaluation, teams resort to manual spot-checking, which does not scale and catches problems only after they reach users. Evaluation makes quality measurable, trackable, and gateable in CI/CD pipelines.

Evaluation approaches

Deterministic checks. Format validation, schema compliance, length bounds, regex patterns, PII detection. These are fast, free, and reliable. They catch a surprising share of production failures -- malformed JSON, safety filter bypasses, and output schema violations.
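A minimal sketch of a deterministic check suite, using only the standard library. The function name, check names, and PII patterns are illustrative assumptions; a real suite would define its own set.

```python
import json
import re

# Illustrative PII patterns; a production suite would use a vetted detector.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US SSN-like number
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email address
]

def run_deterministic_checks(output: str, max_chars: int = 2000) -> dict:
    """Return pass/fail results for cheap, reliable checks on one output."""
    results = {"length_ok": len(output) <= max_chars}
    try:
        json.loads(output)
        results["valid_json"] = True
    except json.JSONDecodeError:
        results["valid_json"] = False
    results["no_pii"] = not any(p.search(output) for p in PII_PATTERNS)
    return results
```

Because these checks are deterministic and free, they typically run on every output, before any model-based scoring.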

Embedding similarity. Compares the model's output to a reference answer using vector embeddings. A cosine similarity score measures semantic closeness. This works well for tasks with known-good reference answers and tolerates phrasing variation.
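The scoring step reduces to cosine similarity over two vectors. A sketch, where `embed` is assumed to be any text-to-vector function (e.g. a sentence-embedding model):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Semantic closeness of two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def similarity_score(output: str, reference: str, embed) -> float:
    """Score an output against a known-good reference answer."""
    return cosine_similarity(embed(output), embed(reference))
```

Identical vectors score 1.0 and orthogonal vectors score 0.0; teams calibrate a pass threshold somewhere in between against human judgments.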

LLM-as-a-Judge. A separate (usually stronger) model scores the output against a rubric, an approach studied in the "Judging LLM-as-a-Judge" paper. This handles subjective quality dimensions like helpfulness, tone, and completeness that cannot be checked deterministically. The tradeoff is cost, latency, and the judge's own variance.
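A judge call is a prompt containing the rubric plus the output under test, with the verdict parsed from the reply. A sketch under assumptions: `call_model` is any prompt-to-text function backed by the judge model, and the rubric wording is illustrative.

```python
import json

RUBRIC = (
    "Rate the RESPONSE to the QUESTION on a 1-5 scale for helpfulness, "
    'completeness, and tone. Reply with JSON: {"score": <int>, "reason": "<text>"}.'
)

def judge(question: str, response: str, call_model) -> dict:
    """Score one response with a judge model; `call_model` maps prompt -> text."""
    prompt = f"{RUBRIC}\n\nQUESTION: {question}\n\nRESPONSE: {response}"
    verdict = json.loads(call_model(prompt))
    # Guard against malformed or out-of-range judge output.
    if not 1 <= verdict["score"] <= 5:
        raise ValueError("judge returned an out-of-range score")
    return verdict
```

Note that the judge's output is itself LLM output, so the deterministic checks above (valid JSON, range bounds) apply to it too.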

Human evaluation. Domain experts rate outputs. This is the gold standard for calibrating automated metrics but does not scale for continuous evaluation. Most teams use human evaluation to validate their automated eval setup, then run automated evals in production.
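One simple way to check that an automated metric tracks expert judgment is to correlate the two ratings over a shared sample. A sketch using Pearson correlation (the choice of agreement statistic is an assumption; rank correlation or kappa are common alternatives):

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Correlation between human ratings and automated scores on the same outputs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A high correlation on the calibration set gives confidence that the automated metric can stand in for human review at scale.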

Key evaluation dimensions

The dimensions you evaluate depend on your application. Common ones include:

  • Factual accuracy -- are the claims in the output correct?
  • Faithfulness -- is the output grounded in the provided context (critical for RAG systems)?
  • Relevance -- does the output actually answer the question?
  • Completeness -- does the output cover all required points?
  • Safety -- does the output avoid harmful, biased, or inappropriate content?
  • Format compliance -- does the output follow the requested structure?
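Once each dimension has a score, a per-dimension threshold turns the scores into a single gate decision. A sketch with hypothetical threshold values; real values come from calibration against human review:

```python
# Hypothetical per-dimension thresholds; tune against human-labeled data.
THRESHOLDS = {"faithfulness": 0.8, "relevance": 0.7, "safety": 1.0}

def passes(scores: dict[str, float]) -> bool:
    """An output passes only if every evaluated dimension clears its threshold."""
    return all(scores.get(dim, 0.0) >= bar for dim, bar in THRESHOLDS.items())
```

Note that safety-style dimensions are often gated at the maximum, while graded dimensions like relevance tolerate partial scores.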

Evaluation in the LLMOps lifecycle

Evaluation appears at three points in the LLMOps lifecycle:

  1. Development-time evaluation. Engineers test prompt changes and model swaps against eval datasets during development. Fast feedback, small datasets.
  2. Pre-deploy evaluation. The CI/CD pipeline runs the eval suite and gates deployment on quality thresholds. This is where regression testing catches quality drops before they reach production.
  3. Production evaluation. Online evaluation samples live traffic and scores it in near-real-time. This catches drift caused by model updates, data changes, and usage pattern shifts that pre-deploy testing cannot cover.
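The pre-deploy gate in step 2 reduces to a CI step that aggregates the suite's scores and returns an exit code. A minimal sketch, assuming a mean-score aggregate and a 0.85 bar (both are illustrative choices):

```python
def gate(scores: list[float], threshold: float = 0.85) -> int:
    """Exit code for a CI step: 0 if the mean eval score clears the bar, else 1."""
    mean = sum(scores) / len(scores)
    return 0 if mean >= threshold else 1

# In CI, the eval suite's scores would feed this, and a nonzero code
# would fail the pipeline and block the deploy.
exit_code = gate([0.91, 0.84, 0.88])
```

More sophisticated gates compare against the previous build's scores rather than a fixed bar, which is what makes the regression-testing behavior possible.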

The LLM evaluation guide covers how to build and operate an evaluation practice across all three stages, including metric selection, threshold calibration, and tooling choices. For RAG-specific evaluation patterns, see RAG evaluation. Teams evaluating tooling can compare options in the DeepEval alternative and Promptfoo alternative pages.