Stop Testing AI Outputs. Start Testing State

A better way to test LangGraph workflows: treat the graph as a series of state transitions instead of judging the text of the final answer.

This article is part of the 7-part Testing LangGraph Applications series. The examples come from the langgraph-testing-demo repository.

Testing LangGraph Applications Series

Stop Testing AI Outputs. Start Testing State ← You are here
How to Structure LangGraph Tests That Actually Scale
Testing Isn’t Enough: Evaluating LangGraph Workflows That Actually Work
Testing Parallel LangGraph Workflows Without Losing Control
Understanding LangGraph Workflows with LangSmith Traces and pytest
Command vs Send in LangGraph: Choosing the Right Primitive
What It Takes to Build Production-Ready LangGraph Systems

All examples in this article are backed by a full pytest suite covering unit tests, graph behavior, and failure scenarios.

[Figure: pytest results]

Many LangGraph examples I’ve seen focus on whether the final answer “looks good”, but I soon learned that approach hides problems once workflows become more complex.

If you’re only checking the output, it’s much harder to understand what the system is actually doing.

To build reliable AI workflows, I believe it’s much better to stop thinking in terms of prompts and start thinking in terms of state and transitions.

The Problem with “Output-Based” Testing

A common approach to testing AI systems looks like this:

Run the workflow
Inspect the final output
Decide if it’s “good enough”

This approach has a few obvious problems:

Outputs are non-deterministic
Tests become brittle
Failures are hard to debug
You have no visibility into why something went wrong

This is especially problematic in systems like LangGraph, where the real complexity isn’t the output but the workflow itself.

The Shift: LangGraph as a State Machine

I think a better mental model is this:

A LangGraph workflow is a state machine with explicit transitions

Each node:

Receives state
Produces a partial update
Influences what happens next

Instead of testing outputs, you test:

State transitions
Routing decisions
Error propagation
Retry behavior

Once you make this shift, the system becomes far more testable.
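To make the mental model concrete, here is a minimal sketch in plain Python. It is not taken from the demo repository: the `planner` body and the `apply_update` helper are hypothetical. LangGraph merges the partial update a node returns into the shared state for you; `apply_update` only imitates that step with a simple per-key overwrite.

```python
from typing import Any


def planner(state: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical node: reads state and returns only the keys it changes."""
    return {"plan": f"Plan for: {state['user_input']}"}


def apply_update(state: dict[str, Any], update: dict[str, Any]) -> dict[str, Any]:
    """Stand-in for the merge step: build a new state, leave the old one intact."""
    return {**state, **update}


before = {"user_input": "explain node testing"}
after = apply_update(before, planner(before))
```

Because the node returns a partial update instead of mutating its input, a test can compare `before` and `after` directly.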

A Simple (but Real) Example

This article is based on a small demo repository, langgraph-testing-demo. Once it is set up, one command runs the full suite:

pytest

That’s all you need to run everything shown below.

The workflow itself looks like this:

Planner → Researcher → Reviewer → Writer
                      ↑
                    (retry)

The interesting part isn’t the happy path. It’s the reviewer.

The reviewer can:

Approve → continue to writer
Reject → send the graph back to researcher
Error → terminate the workflow

At that point, the workflow behaves more like a graph with branching and loops than a simple pipeline.
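The three reviewer outcomes can be sketched as a single routing function. This is an illustration under assumptions, not the repository's code: the function name `route_after_review`, the lowercase node names, and the `"end"` sentinel are all hypothetical. Only the status values `approved`, `rejected`, and `error` come from the tests shown in this article.

```python
def route_after_review(state: dict) -> str:
    """Map the reviewer's verdict to the name of the next step."""
    status = state.get("review_status")
    if status == "approved":
        return "writer"      # continue to the writer
    if status == "rejected":
        return "researcher"  # loop back for another research attempt
    # "error", or any status this sketch does not recognise, stops the run.
    return "end"
```

Treating unknown statuses as terminal is a design choice in this sketch: a missing or misspelled status stops the workflow instead of silently continuing.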

State Is the Contract

At the center of this system is a shared state object (state.py).

It contains both:

Data

plan
research
final_output

Control signals

review_status
errors
research_attempts
review_attempts

In practice, keeping control signals next to the data is extremely useful, because every routing decision can be read straight from state.

State isn’t just data. It’s the contract that defines system behavior

Once your state is explicit and typed, everything else becomes easier:

Nodes become predictable
Routing becomes transparent
Tests become meaningful
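One way to make such a state explicit and typed is a `TypedDict`. The field names below are the ones listed above, plus `user_input` from the test examples. The class name `WorkflowState`, the field types, and the set of allowed status values are my assumptions; the real state.py may differ.

```python
from typing import Literal, TypedDict


class WorkflowState(TypedDict, total=False):
    # Data
    user_input: str
    plan: str
    research: str
    final_output: str
    # Control signals (types are assumptions for this sketch)
    review_status: Literal["approved", "rejected", "error"]
    errors: list[str]
    research_attempts: int
    review_attempts: int


# total=False lets a run start from a partially filled state.
state: WorkflowState = {"user_input": "retry path example", "research_attempts": 0}
```

A type checker can then flag a node that writes a misspelled key or an unknown status before any test runs.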

What to Test Instead

Instead of asking:

“Did we get a good answer?”

Ask:

Did the graph take the correct path?
Did retries happen when expected?
Did failures stop execution?
Was state updated correctly?

For example, in tests/graph/test_graph_routing.py:

result = await graph.ainvoke({"user_input": "retry path example"})

assert result["review_status"] == "approved"
assert result["research_attempts"] == 2

This tells you:

The reviewer rejected the first attempt
The graph correctly routed back to the researcher
The second attempt succeeded

This lets you test the workflow behavior itself instead of relying on whether the final output happens to look reasonable.
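To show why those two assertions pin down the retry path, here is a self-contained sketch that does not use LangGraph at all. The `run` function is a tiny hand-written stand-in for a compiled graph, and both node bodies are hypothetical: the researcher is rigged to produce thin notes on its first attempt only.

```python
import asyncio


async def researcher(state: dict) -> dict:
    attempts = state.get("research_attempts", 0) + 1
    # Hypothetical behaviour: the first draft is thin, later drafts are better.
    notes = "Insufficient notes." if attempts == 1 else "Detailed notes."
    return {"research": notes, "research_attempts": attempts}


async def reviewer(state: dict) -> dict:
    ok = not state["research"].startswith("Insufficient")
    return {"review_status": "approved" if ok else "rejected"}


async def run(state: dict, max_loops: int = 5) -> tuple[dict, list[str]]:
    """Stand-in for a compiled graph: loop researcher -> reviewer until approved."""
    path: list[str] = []
    for _ in range(max_loops):
        for name, node in (("researcher", researcher), ("reviewer", reviewer)):
            state = {**state, **await node(state)}  # merge the partial update
            path.append(name)
        if state["review_status"] == "approved":
            return state, path
    raise RuntimeError("reviewer never approved")


result, path = asyncio.run(run({"user_input": "retry path example"}))
```

Here `research_attempts == 2` is only reachable if the reviewer rejected once and the loop sent control back to the researcher, which is exactly what the recorded `path` shows.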

Unit Testing Nodes

Each node is designed to behave like a small, focused function:

Input: state
Output: partial state update
No hidden mutations

That makes unit testing straightforward.

For example, in tests/unit/test_reviewer_node.py:

result = await reviewer({"research": "Insufficient notes."})

assert result["review_status"] == "rejected"

You can also test edge cases:

Missing input
Retry limits
Error handling

Because the node logic is deterministic, these tests are stable and meaningful.
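The edge cases listed above might be covered like this. The reviewer below is a hypothetical stand-in written for this sketch; how the real node handles missing input or a retry limit is not shown in this article, so `MAX_REVIEW_ATTEMPTS` and the error messages are assumptions.

```python
import asyncio

MAX_REVIEW_ATTEMPTS = 3  # hypothetical limit, not taken from the repository


async def reviewer(state: dict) -> dict:
    attempts = state.get("review_attempts", 0) + 1
    research = state.get("research")
    if not research:
        return {"review_status": "error", "review_attempts": attempts,
                "errors": ["reviewer failed: no research to review"]}
    if research.startswith("Insufficient"):
        if attempts >= MAX_REVIEW_ATTEMPTS:
            return {"review_status": "error", "review_attempts": attempts,
                    "errors": ["reviewer failed: retry limit reached"]}
        return {"review_status": "rejected", "review_attempts": attempts}
    return {"review_status": "approved", "review_attempts": attempts}


def test_missing_input_is_an_error():
    result = asyncio.run(reviewer({}))
    assert result["review_status"] == "error"


def test_thin_research_is_rejected():
    result = asyncio.run(reviewer({"research": "Insufficient notes."}))
    assert result["review_status"] == "rejected"


def test_retry_limit_stops_the_loop():
    state = {"research": "Insufficient notes.", "review_attempts": 2}
    result = asyncio.run(reviewer(state))
    assert result["review_status"] == "error"
    assert result["review_attempts"] == MAX_REVIEW_ATTEMPTS
```

Each test passes in a hand-built state and inspects the returned update, so no graph needs to be compiled.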

Testing Failure Paths (Where Most Systems Break)

A lot of demos understandably skip failure paths, but that is usually where production systems get difficult.

In this project, failure scenarios are covered in:

tests/graph/test_error_paths.py

These tests simulate failures like:

Researcher crashing
Reviewer throwing an exception

Example:

assert result["review_status"] == "error"
assert result["errors"] == ["reviewer failed: review service unavailable"]

This gives you:

Clear failure signals
Predictable behavior
Confidence that the system won’t silently degrade
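One way to get a failure signal like the one asserted above is to catch exceptions at the node boundary and record them in state. The wrapper below is a sketch of that idea, not the repository's implementation: `capture_errors` and `flaky_reviewer` are hypothetical names, and reusing `review_status` as the error flag for every node is an assumption.

```python
import asyncio
from typing import Awaitable, Callable

Node = Callable[[dict], Awaitable[dict]]


def capture_errors(name: str, node: Node) -> Node:
    """Wrap a node so an exception becomes a state update instead of a crash."""
    async def wrapped(state: dict) -> dict:
        try:
            return await node(state)
        except Exception as exc:
            return {
                "review_status": "error",
                # Keep earlier errors and append the new one.
                "errors": [*state.get("errors", []), f"{name} failed: {exc}"],
            }
    return wrapped


async def flaky_reviewer(state: dict) -> dict:
    # Simulated outage for the test.
    raise RuntimeError("review service unavailable")


safe_reviewer = capture_errors("reviewer", flaky_reviewer)
result = asyncio.run(safe_reviewer({"research": "notes"}))
```

With the failure written into state, a router can send the run to a terminal step and the test can assert on an exact message.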

Testing the Graph, Not Just the Nodes

Unit tests are only part of the picture. You also need graph-level tests.

These live in:

tests/graph/test_graph_routing.py

They verify:

Routing logic
State progression across nodes
End-to-end behavior

For example:

result = await graph.ainvoke({"user_input": "explain node testing"})

assert result["review_status"] == "approved"
assert result["final_output"].startswith("Final output for:")

This confirms that:

The reviewer approved the research
The workflow reached the writer
The final output was written to state

Testing vs Evaluation (They’re Subtly Different)

It also helps to separate testing from evaluation because they solve different problems.

Testing is about:

Determinism
Correct behavior
System reliability

Evaluation is about:

Output quality
Usefulness
LLM performance

In this project:

pytest handles testing
Evaluation lives in tests/evals/test_langsmith_evals.py

You can run it the same way:

pytest

(Some tests are skipped unless LANGSMITH_API_KEY is set.)
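A common way to get this skip behavior in pytest is a `skipif` marker keyed on the environment variable. This is a sketch of the pattern, not necessarily how the repository does it; the marker name `requires_langsmith` and the test below are hypothetical.

```python
import os

import pytest

# Reusable marker: skip when the key is missing or empty.
requires_langsmith = pytest.mark.skipif(
    not os.environ.get("LANGSMITH_API_KEY"),
    reason="LANGSMITH_API_KEY is not set",
)


@requires_langsmith
def test_output_quality_eval():
    # Hypothetical placeholder: a real eval would call LangSmith here.
    assert os.environ["LANGSMITH_API_KEY"]
```

The deterministic tests carry no such marker, so they always run, while the evals only run where credentials are available.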

If your tests depend on LLM quality, they will eventually fail for the wrong reasons.

Why This Approach Works Better

Once you start treating LangGraph workflows as state machines instead of prompt chains, the testing story becomes much clearer. The resulting system is:

Easier to reason about
More predictable under failure
Simpler to debug
Easier to maintain long term

That is often the point where graph-based systems become much easier to operate with confidence.

What’s Next

In the next post, I’ll go deeper into:

Structuring pytest suites for LangGraph
Testing routing and error paths in detail
Building confidence in more complex workflows

And later, we’ll extend this into:

Parallel execution
More advanced orchestration
Evaluation with LangSmith

Final Thought

In practice, many AI workflow failures come from the surrounding system rather than the model output alone.

LangGraph gives you the tools to fix that, but you still need to test it like real software.