Ian Cunningham, AI systems builder

How to Structure LangGraph Tests That Actually Scale

How to structure LangGraph tests into unit, graph, and failure layers so the suite stays useful as the workflow grows.

Key Takeaways

  1. Unstructured test suites don’t scale

    When everything becomes an end-to-end test, the suite slows down, failures become harder to debug, and confidence drops as complexity grows.

  2. Separate tests by responsibility, not convenience

    Splitting tests into unit, graph, and failure layers keeps them fast, focused, and easier to reason about as the workflow evolves.

  3. Better structure leads to faster iteration and fewer regressions

    Clear test boundaries make it easier to diagnose issues, evolve the system safely, and maintain reliability over time.

This article is part of the 7-part Testing LangGraph Applications series. The examples come from the langgraph-testing-demo repository.

Testing LangGraph Applications Series

  1. Stop Testing AI Outputs. Start Testing State
  2. How to Structure LangGraph Tests That Actually Scale ← You are here
  3. Testing Isn’t Enough: Evaluating LangGraph Workflows That Actually Work
  4. Testing Parallel LangGraph Workflows Without Losing Control
  5. Understanding LangGraph Workflows with LangSmith Traces and pytest
  6. Command vs Send in LangGraph: Choosing the Right Primitive
  7. What It Takes to Build Production-Ready LangGraph Systems

All examples in this article are backed by a structured pytest suite covering unit, graph, and failure testing layers.

[Image: pytest results]

In the previous post, we treated LangGraph workflows as state machines and focused on what to test.

Now the question is:

How do you structure your tests so they stay useful as the system grows?

This is usually where things start getting messy.


The Common Mistake

It’s easy to end up with a LangGraph test suite looking like this:

  • Run the full graph
  • Check the final output
  • Repeat for a few inputs

(Ask me how I know!)

Before long, almost everything ends up behaving like an integration test.

At first, it might feel fine, but very quickly:

  • Tests become slow
  • Failures are hard to debug
  • You don’t know which node broke
  • Small changes ripple through everything

At that point, the tests start behaving more like slow demo scripts than a maintainable test suite.


Structuring Tests Into Layers

Instead, I like to structure my tests into three distinct layers:

  1. Unit tests (nodes) → logic in isolation
  2. Graph tests (behavior) → routing and state transitions
  3. Failure tests (robustness) → what happens when things go wrong

Each layer answers a different question.


Layer 1 — Unit Tests (Nodes)

Location:

tests/unit/

These tests focus on individual nodes.

They should be:

  • Fast
  • Deterministic
  • Easy to reason about

Each node behaves like a small function:

  • Input → state
  • Output → partial state update
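
To make that contract concrete, here is a minimal sketch of what such a node might look like. The ReviewState type, the length threshold, and the keyword check are assumptions made up for illustration; the reviewer in the demo repo may decide differently.

```python
import asyncio
from typing import TypedDict


class ReviewState(TypedDict, total=False):
    # Hypothetical state shape: only the keys this node reads or writes.
    research: str
    review_status: str


async def reviewer(state: ReviewState) -> dict:
    """Read the current state and return only the keys this node updates."""
    research = state.get("research", "")
    # Assumed rule for this sketch: very short notes, or notes that describe
    # themselves as insufficient, are rejected.
    too_thin = len(research) < 40 or "insufficient" in research.lower()
    return {"review_status": "rejected" if too_thin else "approved"}
```

Because the node only returns the keys it changes, a test can call it with a small hand-built dict and compare the returned update directly, without compiling a graph.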

For example, in tests/unit/test_reviewer_node.py:

result = await reviewer({"research": "Insufficient notes."})

assert result["review_status"] == "rejected"

This test tells you:

  • The reviewer logic is working
  • The node correctly identifies incomplete research
  • The output contract is respected

You can also test edge cases:

  • Missing input
  • Retry limits
  • Error conditions
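
Edge cases like these are often cheapest to cover with a parametrized test. The sketch below applies that idea to a hypothetical routing helper; route_after_review, MAX_RESEARCH_ATTEMPTS, and the returned step names are assumptions for illustration, not names taken from the demo repo.

```python
import pytest

MAX_RESEARCH_ATTEMPTS = 3  # hypothetical retry limit for this sketch


def route_after_review(state: dict) -> str:
    """Pick the next step from state alone (assumed routing rule)."""
    if state.get("errors"):
        return "end"  # error condition: stop instead of continuing
    if state.get("review_status") == "approved":
        return "writer"
    if state.get("research_attempts", 0) >= MAX_RESEARCH_ATTEMPTS:
        return "end"  # retry limit reached
    return "researcher"  # rejected or missing status: try again


@pytest.mark.parametrize(
    ("state", "expected"),
    [
        ({}, "researcher"),  # missing input
        ({"review_status": "rejected", "research_attempts": 1}, "researcher"),
        ({"review_status": "rejected", "research_attempts": 3}, "end"),
        ({"review_status": "approved", "research_attempts": 2}, "writer"),
        ({"review_status": "error", "errors": ["reviewer failed"]}, "end"),
    ],
)
def test_route_after_review(state: dict, expected: str) -> None:
    assert route_after_review(state) == expected
```

Each row is one edge case, so adding a new one is a single line rather than a new test function.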

Because these tests don’t run the full graph, they stay:

  • Fast
  • Stable
  • Focused

Layer 2 — Graph Tests (Behavior)

Location:

tests/graph/test_graph_routing.py

This is where you test the system as a whole: not the output quality, but the behavior.

For example:

result = await graph.ainvoke({"user_input": "retry path example"})

assert result["review_status"] == "approved"
assert result["research_attempts"] == 2

This verifies:

  • The reviewer rejected the first attempt
  • The graph routed back to the researcher
  • A second attempt was made
  • The workflow eventually succeeded

That gives you a much clearer picture of how the workflow behaves over time, and you’re not guessing based on output text. You’re verifying:

  • State transitions
  • Routing decisions
  • System behavior over time

In my opinion, that’s where state-machine-based workflows become genuinely useful.


Layer 3 — Failure Tests (Robustness)

Location:

tests/graph/test_error_paths.py

This layer is often skipped, but it matters a lot in production because the health of third-party APIs, MCP servers, and external services is outside your control.

Here, you simulate failures using fake models.

For example:

  • Researcher throws an exception
  • Reviewer fails unexpectedly

And then you assert:

assert result["review_status"] == "error"
assert result["errors"] == ["reviewer failed: review service unavailable"]

What this verifies:

  • Failures are captured in state
  • The graph stops safely
  • Downstream nodes (like writer) aren’t executed

Without this layer, your system might:

  • Silently fail
  • Return partial or misleading results
  • Be impossible to debug in production

This is usually the point where the test suite starts feeling production-ready rather than purely functional.


A Note on Graph Construction

The official LangGraph testing docs recommend a useful pattern for stateful agents: create the graph inside each test, then compile it with a fresh checkpointer for that test.

That matters most when you are testing checkpoint persistence, interrupts, update_state, time travel, or resumable execution.

The demo repo keeps the default examples simpler because these graphs don’t use checkpoint persistence. The tests still build fresh graph instances inside each test, and the graph modules expose uncompiled create_* helpers so you can compile with a fresh checkpointer when a test needs that behavior.


A Quick Note on Async Testing

LangGraph supports both sync and async workflows. These examples use async graph execution because most real applications eventually call async model, database, or service APIs, and it keeps the tests close to production usage.

In pytest, that’s handled cleanly with:

@pytest.mark.asyncio

This allows you to:

  • Call await graph.ainvoke(...) directly
  • Keep tests readable
  • Avoid complex setup
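
A complete async test might look like the sketch below. The marker is provided by the pytest-asyncio plugin, which needs to be installed alongside pytest; the reviewer node here is a hypothetical stand-in, and the same shape applies when the awaited call is graph.ainvoke(...).

```python
import asyncio

import pytest


async def reviewer(state: dict) -> dict:
    # Hypothetical async node; the sleep stands in for an awaited model call.
    await asyncio.sleep(0)
    ok = "insufficient" not in state.get("research", "").lower()
    return {"review_status": "approved" if ok else "rejected"}


@pytest.mark.asyncio
async def test_reviewer_rejects_thin_research() -> None:
    # The test body awaits the node directly, just as production code would.
    result = await reviewer({"research": "Insufficient notes."})
    assert result["review_status"] == "rejected"
```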

For most use cases, async testing adds very little overhead.


Why This Structure Works

Each layer has a clear responsibility:

Layer         | Purpose                  | Speed  | Debuggability
--------------|--------------------------|--------|--------------
Unit tests    | Validate node logic      | Fast   | High
Graph tests   | Validate system behavior | Medium | Medium
Failure tests | Validate robustness      | Fast   | High

Because concerns are separated:

  • Failures are easier to diagnose
  • Tests remain stable as the system grows
  • You avoid brittle, over-coupled tests

What About LLMs?

In this project, all tests use deterministic fake models, and that’s intentional.

If your tests depend on real LLM outputs:

  • They become flaky
  • They slow down
  • They fail for the wrong reasons

Instead:

  • Use pytest for correctness
  • Use evaluation (LangSmith, etc.) for quality
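
One way to get that determinism is a scripted fake that returns canned replies in order. The sketch below is an assumption about how such a fake could be written and injected; ScriptedModel and make_researcher are hypothetical names, not the demo repo's classes.

```python
import asyncio


class ScriptedModel:
    """Hypothetical fake model: returns canned replies in order, no network calls."""

    def __init__(self, replies: list[str]) -> None:
        self._replies = list(replies)
        self.prompts: list[str] = []  # recorded so tests can assert on inputs

    async def ainvoke(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return self._replies.pop(0)


def make_researcher(model):
    """Build a node around whichever model is passed in (real or fake)."""

    async def researcher(state: dict) -> dict:
        notes = await model.ainvoke(state["user_input"])
        attempts = state.get("research_attempts", 0) + 1
        return {"research": notes, "research_attempts": attempts}

    return researcher


async def demo() -> dict:
    model = ScriptedModel(["Insufficient notes.", "Detailed notes with sources."])
    researcher = make_researcher(model)
    first = await researcher({"user_input": "topic"})
    # Feed the first update back in, the way a retry loop would.
    return await researcher({"user_input": "topic", **first})
```

Passing the model into a factory keeps the node itself unchanged between tests and production; only the injected dependency differs.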

We’ll cover that in the next post.

Why the Structure Matters

If all your tests run the full graph, the test suite eventually becomes difficult to maintain and slow to iterate on.

I believe a well-structured LangGraph test suite:

  • Tests nodes in isolation
  • Tests behavior at the graph level
  • Tests failure paths explicitly

That gives me a lot more confidence when evolving a system.


What’s Next

In the next post, we’ll look at:

  • How to evaluate LangGraph workflows properly
  • Using datasets instead of ad-hoc prompts
  • Scoring outputs with LangSmith

Once correctness is reasonably covered, the harder question becomes output quality and evaluation.


Final Thought

A lot of AI system problems don’t really show up until you start testing failure handling properly.

This isn’t because the model is bad, but because there were never any tests for how the system as a whole behaves when things go wrong.

If I’m designing a system, I want to catch as many failures as possible before the users do, and properly structured tests make that significantly more likely.
