Ian Cunningham, AI systems builder


Testing Isn’t Enough: Evaluating LangGraph Workflows That Actually Work

A practical evaluation pattern for LangGraph using pytest, small datasets, deterministic scorers, and LangSmith-backed experiment tracking.


Key Takeaways

  1. A system can be correct and still produce poor results

    Passing tests only proves your workflow behaves correctly, not that it produces useful or high-quality outputs.

  2. Evaluation turns subjective quality into something measurable

    By running datasets through your workflow and scoring the results, you can track improvements, compare changes, and catch regressions.

  3. Start simple, then evolve your evaluation stack

    Deterministic evaluators provide fast regression checks, while LangSmith, LLM-as-Judge evaluators, and human review add richer quality assessment as systems mature.

This article is part of the 7-part Testing LangGraph Applications series. The examples come from the langgraph-testing-demo repository.

Testing LangGraph Applications Series

  1. Stop Testing AI Outputs. Start Testing State
  2. How to Structure LangGraph Tests That Actually Scale
  3. Testing Isn’t Enough: Evaluating LangGraph Workflows That Actually Work ← You are here
  4. Testing Parallel LangGraph Workflows Without Losing Control
  5. Understanding LangGraph Workflows with LangSmith Traces and pytest
  6. Command vs Send in LangGraph: Choosing the Right Primitive
  7. What It Takes to Build Production-Ready LangGraph Systems

All examples in this article are backed by a pytest-based evaluation suite that combines deterministic tests with dataset-driven scoring.

[Figure: pytest results]

In the previous post, we built a proper test suite for a LangGraph workflow.

At this point, the workflow behaves correctly.

  • Routing works
  • Retries work
  • Failures are handled

But correctness only tells you whether the system behaves properly. What it doesn’t tell you is whether the outputs are actually useful.


The Gap Testing Doesn’t Cover

Your tests prove that the system behaves correctly.

For example:

  • The reviewer triggers a retry when needed
  • The graph stops on failure
  • State transitions happen as expected

But they don’t answer questions like:

  • Is the research actually useful?
  • Is the final output complete?
  • Does the answer reflect the user’s intent?

You can pass every test and still ship a system that produces mediocre results, and that’s where evaluation comes in.


From Ad-Hoc Prompts to Repeatable Evaluation

A common workflow looks like this:

Try a prompt → Read the output → Adjust → Repeat

This doesn’t scale. So instead, we move to:

Dataset → Run graph → Score outputs

This gives you a repeatable way to compare outputs over time and see whether changes are actually improving the system.

In this project, the evaluation setup lives in:

demo_graph/evals/dataset.py
demo_graph/evals/evaluators.py
tests/evals/test_langsmith_evals.py

Defining a Dataset

The dataset is intentionally simple and deterministic.

Each example contains:

  • An input
  • A set of expected terms

For example:

{
    "input": "retry path example",
    "expected_terms": ["attempt 2", "Final output"],
}

This might look basic, but at a minimum, it captures what a useful answer should contain. Not perfectly, but enough to measure progress.
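
As a rough sketch, a dataset module along these lines can be as small as a typed list. The names EvaluationExample and get_dataset appear in the tests later in this article, but the TypedDict shape shown here is an assumption for illustration and may not match the repository's actual code:

```python
from typing import TypedDict


class EvaluationExample(TypedDict):
    """One dataset row: a graph input plus the terms a useful answer should contain."""

    input: str
    expected_terms: list[str]


def get_dataset() -> list[EvaluationExample]:
    # A single row taken from the example above; a real dataset would hold more.
    return [
        {
            "input": "retry path example",
            "expected_terms": ["attempt 2", "Final output"],
        },
    ]


# Rows with no expected terms cannot be scored, so it is worth checking up front.
assert all(example["expected_terms"] for example in get_dataset())
```

Keeping the dataset as plain Python data means it needs no network access and can be imported directly by both the local and the LangSmith-backed tests.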


Scoring Outputs

Evaluation starts with a small set of deterministic heuristic functions.

These are intentionally simple:

  • Fast
  • Reproducible
  • Cheap to run
  • CI-friendly

For production AI systems, richer evaluation layers can be added, such as:

  • LLM-as-Judge evaluators
  • Pairwise preference scoring
  • Rubric-based grading
  • Human review workflows

But deterministic evaluators still provide an extremely useful regression safety net.

For example, a simple completeness score:

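def completeness_score(output: str, expected_terms: list[str]) -> float:
    # An empty term list has nothing to match; return 0.0 instead of dividing by zero.
    if not expected_terms:
        return 0.0
    # Compare case-insensitively so "Final output" matches "final output".
    normalized_output = output.lower()
    matches = sum(
        1 for term in expected_terms if term.lower() in normalized_output
    )
    return matches / len(expected_terms)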

This produces a score between 0.0 and 1.0:

  • 0.0 → nothing matched
  • 1.0 → everything matched
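
To make the scoring concrete, here is a small worked example. The scorer is repeated (with a guard for an empty term list) so the snippet runs on its own, and the three output strings are made up for illustration rather than taken from real graph runs:

```python
def completeness_score(output: str, expected_terms: list[str]) -> float:
    if not expected_terms:
        return 0.0
    normalized_output = output.lower()
    matches = sum(
        1 for term in expected_terms if term.lower() in normalized_output
    )
    return matches / len(expected_terms)


expected = ["attempt 2", "Final output"]

# Both terms appear (matching is case-insensitive), so the score is 2 / 2.
full = completeness_score("Final output produced on attempt 2", expected)

# Only "final output" appears, so the score is 1 / 2.
partial = completeness_score("final output with no retry details", expected)

# Neither term appears, so the score is 0 / 2.
empty = completeness_score("something unrelated", expected)

print(full, partial, empty)  # 1.0 0.5 0.0
```

Note that this is plain substring matching: it rewards the presence of a term, not whether the term is used correctly.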

Then we wrap it in an evaluator:

def evaluate_example(output: str, example):
    return {
        "completeness": completeness_score(
            output, example["expected_terms"]
        )
    }

This kind of evaluator is intentionally lightweight.

It doesn’t try to deeply understand semantic quality. Instead, it acts as a fast and deterministic regression guard that can run locally, inside CI, or in contributor environments without external dependencies.

A heuristic evaluator might check:

  • Whether required concepts appear
  • Whether citations exist
  • Whether a response follows the expected structure
  • Whether tool calls succeeded
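
Two of the checks above can be sketched in a few lines. Both functions are hypothetical and not part of the demo repository; the bracketed-number citation style and the "heading at the start of a line" convention are assumptions you would adapt to your own output format:

```python
import re

# Assumed citation style: bracketed numbers such as [1] or [12].
CITATION_PATTERN = re.compile(r"\[\d+\]")


def has_citation_score(output: str) -> float:
    """Return 1.0 if at least one bracketed citation is present, else 0.0."""
    return 1.0 if CITATION_PATTERN.search(output) else 0.0


def structure_score(output: str, required_headings: list[str]) -> float:
    """Fraction of required headings that start some line of the output."""
    if not required_headings:
        return 0.0
    lines = [line.strip().lower() for line in output.splitlines()]
    found = sum(
        1
        for heading in required_headings
        if any(line.startswith(heading.lower()) for line in lines)
    )
    return found / len(required_headings)


sample = "Summary: retries succeeded [1]\nSources: internal notes"
print(has_citation_score(sample))                         # 1.0
print(structure_score(sample, ["Summary:", "Sources:"]))  # 1.0
print(structure_score(sample, ["Summary:", "Risks:"]))    # 0.5
```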

An LLM-as-Judge evaluator, on the other hand, can assess things like:

  • Relevance
  • Helpfulness
  • Coherence
  • Faithfulness
  • Overall response quality
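
For contrast, a minimal LLM-as-Judge sketch might look like the following. Everything here is an assumption for illustration: the prompt wording, the 1 to 5 scale, and the call_model parameter, which stands in for whatever model client you use. A fake model is passed in so the snippet runs offline:

```python
import re
from typing import Callable

JUDGE_PROMPT = """Rate how well the answer addresses the question.
Reply with a single integer from 1 (poor) to 5 (excellent).

Question: {question}
Answer: {answer}
"""


def llm_judge_score(
    question: str,
    answer: str,
    call_model: Callable[[str], str],
) -> float:
    """Ask a model for a 1 to 5 rating and normalize it to the range 0.0 to 1.0."""
    reply = call_model(JUDGE_PROMPT.format(question=question, answer=answer))
    match = re.search(r"[1-5]", reply)
    if match is None:
        # Treat an unparseable reply as the lowest score rather than crashing.
        return 0.0
    return (int(match.group()) - 1) / 4


def fake_model(prompt: str) -> str:
    # Stand-in for a real model call so the sketch runs without credentials.
    return "4"


score = llm_judge_score("What happened on retry?", "Attempt 2 succeeded.", fake_model)
print(score)  # 0.75
```

Unlike the deterministic scorers, a real judge model can return different ratings for the same input, which is one reason to keep it out of the pass/fail regression path.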

In practice, mature AI systems often combine multiple evaluation layers:

  • Deterministic evaluators for stability and regression detection
  • LLM-as-Judge evaluators for semantic quality assessment
  • Human review for high-confidence validation

Running the Evaluation Loop

The core evaluation loop is simple:

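async def run_evaluation(graph) -> list[dict[str, float]]:
    all_scores = []
    for example in get_dataset():
        result = await graph.ainvoke({"user_input": example["input"]})
        all_scores.append(evaluate_example(result["final_output"], example))
    return all_scores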

This gives you a repeatable way to compare outputs and see how changes affect the system over time.
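
Here is a self-contained sketch of that loop that also averages the scores into a single number. StubGraph is a hypothetical stand-in for the real compiled graph, and get_dataset and evaluate_example are simplified local copies, so the result says nothing about the actual workflow:

```python
import asyncio
from statistics import mean


class StubGraph:
    """Hypothetical stand-in for the compiled graph returned by build_graph()."""

    async def ainvoke(self, state: dict) -> dict:
        return {"final_output": f"Final output for: {state['user_input']}"}


def get_dataset() -> list[dict]:
    return [
        {
            "input": "retry path example",
            "expected_terms": ["attempt 2", "Final output"],
        },
    ]


def evaluate_example(output: str, example: dict) -> dict[str, float]:
    terms = example["expected_terms"]
    matches = sum(1 for term in terms if term.lower() in output.lower())
    return {"completeness": matches / len(terms)}


async def mean_completeness(graph) -> float:
    """Run every dataset example through the graph and average the scores."""
    scores = []
    for example in get_dataset():
        result = await graph.ainvoke({"user_input": example["input"]})
        scores.append(evaluate_example(result["final_output"], example)["completeness"])
    return mean(scores)


# The stub output contains "Final output" but not "attempt 2", so one of two terms matches.
print(asyncio.run(mean_completeness(StubGraph())))  # 0.5
```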


Using pytest as the Evaluation Runner

One of the most useful patterns here is:

Evaluation runs inside pytest

That means you don’t need a separate system. You simply run:

pytest

And both:

  • Tests (correctness)
  • Evaluations (quality)

run together.

The local test, which doubles as the no-credentials fallback described later, is named for exactly what it guarantees:

@pytest.mark.asyncio
async def test_local_eval_dataset_meets_minimum_completeness() -> None:
    graph = build_graph()

    for example in get_dataset():
        result = await graph.ainvoke({"user_input": example["input"]})
        scores = evaluate_example(result["final_output"], example)
        assert scores["completeness"] >= 0.5

That final assertion is the important part:

assert scores["completeness"] >= 0.5

This gives you a simple no-credentials safety net:

  • Outputs meet a minimum standard
  • Regressions are caught automatically
  • Contributors can run the checks without external credentials

LangSmith-Backed Evaluation

The local test isn’t the main evaluation mechanism.

It’s the fallback path: useful in CI, forks, and local development environments where LangSmith credentials are not available. You can still run:

pytest

and get a deterministic completeness check.

The primary evaluation path is the LangSmith-backed pytest test. It is skipped unless LANGSMITH_API_KEY is set, and it uses the LangSmith pytest integration directly:

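import os

import pytest
from langsmith import testing as t

# Project imports (build_graph, get_dataset, evaluate_example, EvaluationExample) omitted.

@pytest.mark.skipif(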
    not os.getenv("LANGSMITH_API_KEY"),
    reason="LANGSMITH_API_KEY is not set",
)
@pytest.mark.langsmith
@pytest.mark.asyncio
@pytest.mark.parametrize(
    "example",
    get_dataset(),
    ids=lambda example: example["input"],
)
async def test_langsmith_evaluation_logs_dataset_scores(
    example: EvaluationExample,
) -> None:
    graph = build_graph()

    t.log_inputs({"user_input": example["input"]})
    t.log_reference_outputs({"expected_terms": example["expected_terms"]})

    result = await graph.ainvoke({"user_input": example["input"]})
    final_output = result["final_output"]
    t.log_outputs({"final_output": final_output})

    scores = evaluate_example(final_output, example)
    for key, score in scores.items():
        t.log_feedback(key=key, score=score)

    assert scores["completeness"] >= 0.5

At that point, the test is doing more than just tracing the graph.

Each dataset row becomes its own parametrized pytest case. For each case, LangSmith receives:

  • The input: {"user_input": ...}
  • The reference data: {"expected_terms": ...}
  • The graph output: {"final_output": ...}
  • The deterministic evaluator feedback: completeness
  • The pytest pass/fail result

After running the test with LangSmith configured, those examples appear in the LangSmith Datasets & Experiments view.

[Figure: LangSmith Datasets & Experiments output]

That gives the project a primary evaluation path and a fallback:

  • LangSmith-backed pytest evaluation for tracked examples, scores, and experiment comparison
  • Local pytest fallback for fast, portable regression checks when credentials are unavailable

This is also where tracing and evaluation differ:

  • LangSmith Tracing focuses on understanding what happened during a graph run.
  • LangSmith Evaluation focuses on judging the quality of the resulting output for a dataset example.

The LangSmith-backed test logs inputs, reference outputs, generated outputs, and feedback. That’s what turns the pytest case into a real LangSmith evaluation.

With LANGSMITH_API_KEY set:

  • Results are logged externally
  • You can track runs over time
  • You can compare experiments

Without LANGSMITH_API_KEY:

  • The deterministic checks still run locally
  • No friction for contributors

This keeps the system:

  • Portable
  • Easy to run
  • Not tied to a specific tool

Why Evaluation Helps

Without evaluation:

  • Improvements are guesswork
  • Regressions go unnoticed
  • You rely on manual inspection

With evaluation:

  • You can measure change
  • You can compare versions
  • You can iterate with confidence

You move from:

I think this is better

to:

This version improved completeness from 0.5 to 0.8
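
A comparison like that can be automated with a few lines. This is a sketch only: compare_runs and its tolerance parameter are hypothetical, and the score lists are illustrative values rather than real measurements:

```python
from statistics import mean


def compare_runs(
    baseline: list[float],
    candidate: list[float],
    tolerance: float = 0.05,
) -> str:
    """Label a candidate run as improved, regressed, or unchanged."""
    delta = mean(candidate) - mean(baseline)
    if delta > tolerance:
        return f"improved by {delta:.2f}"
    if delta < -tolerance:
        return f"regressed by {-delta:.2f}"
    return "unchanged"


# Illustrative per-example completeness scores, not real measurements.
baseline_scores = [0.5, 0.5, 0.5]
candidate_scores = [1.0, 0.5, 1.0]

print(compare_runs(baseline_scores, candidate_scores))  # improved by 0.33
```

The tolerance keeps tiny differences from being reported as wins or losses, which matters more once non-deterministic evaluators enter the mix.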


Testing vs Evaluation (Clear Separation)

It’s important to keep this distinction clear:

Testing

  • Validates correctness
  • Deterministic
  • Focused on behavior

Evaluation

  • Measures quality
  • Often heuristic
  • Focused on usefulness

In this project:

  • pytest handles both
  • But the responsibilities are separate

If your tests depend on output quality, they will fail for the wrong reasons. If you skip evaluation, you won’t know if your system is improving.


A Practical Way to Evolve This

This example intentionally starts with deterministic heuristics because they’re:

  • Easy to understand
  • Easy to debug
  • Cheap to run
  • Stable in CI

In a real system, you would likely evolve this further with:

  • Larger and more representative datasets
  • More nuanced scoring strategies
  • LLM-as-Judge evaluators for semantic quality assessment
  • Pairwise preference comparisons
  • Human review workflows

But the structure stays the same:

Dataset → Run graph → Score outputs → Track results

The evaluation layer becomes more sophisticated over time, but the workflow itself remains stable.
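
In this project LangSmith handles the "Track results" step. If you want a local equivalent, one possible approach is to append each run to a JSON Lines file. The function names, file name, labels, and scores below are all hypothetical:

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def record_run(path: Path, label: str, scores: dict[str, float]) -> None:
    """Append one evaluation run to a JSON Lines history file."""
    entry = {
        "label": label,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "scores": scores,
    }
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")


def load_history(path: Path) -> list[dict]:
    """Read every recorded run back, oldest first."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Hypothetical labels and scores, written to a temporary directory.
history_path = Path(tempfile.mkdtemp()) / "eval_history.jsonl"
record_run(history_path, "baseline", {"completeness": 0.5})
record_run(history_path, "new-prompt", {"completeness": 0.8})

for run in load_history(history_path):
    print(run["label"], run["scores"]["completeness"])
```

An append-only file is easy to diff and commit, but it gives you none of the comparison views a hosted tool provides.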


Why You Need Both

  • Testing helps ensure the workflow behaves correctly.
  • Evaluation helps determine whether the outputs are actually useful.

In practice, production systems usually need both.


What’s Next

In the next post, we’ll push this further by introducing:

  • Parallel execution with multiple workers
  • Aggregation patterns
  • New failure modes
  • And how to test all of it

Spoiler alert: Parallel execution introduces a whole new set of testing problems.

Final Thought

AI systems tend to improve once their outputs can be measured and compared consistently over time.

That’s when iteration becomes engineering-driven rather than guided by intuition alone.
