May 07, 2026 • Version: 2026.5.4

ReplyRunAlreadyActiveError Fires on 2026.5.4 for Discrete Sequential chat.send via WebSocket

Sequential discrete chat.send calls through the gateway WebSocket path trigger ReplyRunAlreadyActiveError at a 50% rate on 2026.5.4, despite the #77485 fix shipped in that release. This points to a coverage gap: the active-run guard cleanup covers the agent-runner's queued follow-up path but not the WS dispatcher path.

🔍 Symptoms

Primary Manifestation

ReplyRunAlreadyActiveError reproduces deterministically on 2026.5.4 when sequential chat.send requests are sent through the gateway WebSocket path. Calls alternate between fail and pass, giving a 50% failure rate.

CLI Reproduction Sequence

Execute the following probe against a gateway running 2026.5.4 in embedded mode:

for i in 1 2 3 4 5 6 7 8 9 10; do
  START=$(date +%s%3N)
  RESPONSE=$(curl -s -w "\nHTTP_CODE:%{http_code}" -X POST http://127.0.0.1:18789/chat.send \
    -H "Content-Type: application/json" \
    -d "{\"sessionKey\":\"agent:test:main\",\"message\":\"Reply containing literal: ok-$i-$(date +%s)\"}")
  END=$(date +%s%3N)
  ELAPSED=$((END - START))
  echo "call $i: ${ELAPSED}ms"
  echo "$RESPONSE"
  sleep 1
done

Observed Output Pattern

Call #   Elapsed   Status   Behavior
1        317ms     FAIL     Empty/canned reply returned
2        1689ms    PASS     Real LLM reply
3        302ms     FAIL     Empty/canned reply returned
4        1876ms    PASS     Real LLM reply
5        299ms     FAIL     Empty/canned reply returned
6        1592ms    PASS     Real LLM reply
7        303ms     FAIL     Empty/canned reply returned
8        1778ms    PASS     Real LLM reply
9        315ms     FAIL     Empty/canned reply returned
10       1604ms    PASS     Real LLM reply
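
The pass/fail split can be tallied from the probe's timing lines rather than read by eye. The following is a minimal sketch, assuming the `call N: Xms` lines printed by the probe loop and a hypothetical 1000 ms cutoff between fast failures and real replies. The function name `summarize_probe` and the cutoff are illustrative assumptions, and elapsed time is only a proxy for failure; the reply body remains the authoritative signal.

```sh
# Tally probe timings read from stdin. Lines that are not "call N: Xms"
# (for example the echoed response bodies) are ignored.
# The default 1000 ms cutoff is an assumption placed between the ~300 ms
# failures and the 1.5 s+ successes seen above; pass another value as $1.
summarize_probe() {
  awk -v cutoff="${1:-1000}" '
    /^call [0-9]+: [0-9]+ms$/ {
      ms = $3
      sub(/ms$/, "", ms)            # strip the unit, leaving the number
      if (ms + 0 < cutoff) fail++; else pass++
    }
    END { printf "pass=%d fail=%d\n", pass, fail }
  '
}
```

Piping the probe's output into `summarize_probe` prints a single `pass=… fail=…` line, which makes it easy to compare runs across gateway versions.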

Gateway Error Log Evidence

The gateway error log (pm2 logs openclaw-gateway) shows 16 occurrences of:

followup queue drain failed for agent:test:main: ReplyRunAlreadyActiveError: Reply run already active for agent:test:main
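
To count these entries per session instead of scanning the log manually, something like the following may help. It is a sketch that assumes the gateway log has been saved to a file and that each error line ends with the session key, as in the line above. The function name `count_guard_errors` is hypothetical.

```sh
# Count ReplyRunAlreadyActiveError lines per session key.
# $1: path to a saved copy of the gateway error log (assumed location).
# Assumes the session key is the last whitespace-separated field of the line.
count_guard_errors() {
  awk '
    /ReplyRunAlreadyActiveError/ { n[$NF]++ }
    END { for (k in n) print n[k], k }
  ' "$1"
}
```

Each output line has the form `<count> <sessionKey>`; no output means no matching entries were found.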

Technical Characteristics of Failures

  • Fast-fail timing: Failed calls return in ~300ms, which is below typical provider RTT. The error is thrown before any LLM dispatch occurs.
  • 1-second gap is insufficient: Despite the 1s pause between calls (well past the prior call's wall-clock completion), the guard remains active.
  • Canned fallback returned: Failed calls return the agent-runner's fallback message ("I had a brief hiccup processing that. Could you try again?") rather than a legitimate LLM response.
  • Binary verification passes: The installed binary is definitively 2026.5.4:
    • dist/run-state-Bg5KVIP6.js sha256: 3cdea3a69fe7be00ccf0a77279c51fbe9e977cfc13868063f09259f6305538dd
    • dist/agent-runner.runtime-BwDd4yvB.js (updated from 5.3)
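
The binary check can be repeated on any host with a short helper. This is a sketch assuming GNU coreutils `sha256sum` is available (on macOS, `shasum -a 256` produces the same digest format). The function name `verify_sha256` is hypothetical, and the path to `dist/` depends on where the gateway is installed.

```sh
# Succeeds (exit 0) when the file's SHA-256 digest equals the expected value.
# $1: file path, $2: expected lowercase hex digest.
verify_sha256() {
  actual=$(sha256sum "$1" | awk '{ print $1 }')
  [ "$actual" = "$2" ]
}
```

For example, run from the install directory, `verify_sha256 dist/run-state-Bg5KVIP6.js 3cdea3a69fe7be00ccf0a77279c51fbe9e977cfc13868063f09259f6305538dd && echo match` prints `match` only when the file is the 2026.5.4 build listed above.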

Baseline Comparison

Against 2026.4.26 (last known good), the same 10-call probe produces:

  • All 10 calls succeed with real replies
  • Warm latency: 1.2–1.7s per call
  • Zero ReplyRunAlreadyActiveError events in gateway log

🧠 Root Cause

Architectural Overview

The OpenClaw gateway maintains an activeRunsByKey guard (a Map or Set keyed by sessionKey) to prevent concurrent reply runs for the same session. The guard is checked at request entry and cleared on run completion.
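
The check-at-entry, clear-on-completion behaviour can be modelled in a few lines. This is a toy sketch in shell, not the gateway's code: the real guard lives in the JavaScript runtime, and the names `ACTIVE_RUNS`, `begin_run`, and `end_run` are hypothetical. It also assumes session keys contain no whitespace.

```sh
# Toy model of a per-session active-run guard (illustrative only).
ACTIVE_RUNS=""   # space-separated set of session keys with a run in flight

# Called at request entry: rejects the request if the key is already held.
begin_run() {
  case " $ACTIVE_RUNS " in
    *" $1 "*)
      echo "ReplyRunAlreadyActiveError: Reply run already active for $1" >&2
      return 1
      ;;
  esac
  ACTIVE_RUNS="$ACTIVE_RUNS $1"
}

# Called on run completion: removes the key so the next request can start.
end_run() {
  new=""
  for k in $ACTIVE_RUNS; do
    [ "$k" = "$1" ] || new="$new $k"
  done
  ACTIVE_RUNS=$new
}
```

In this model, a second `begin_run agent:test:main` fails until `end_run agent:test:main` has run. The symptom described above is what a caller would see if a completion path skipped the `end_run` step or ran it late.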

The Regression Introduction (2026.5.3 → 2026.5.4)

The fix for #77485 (commit a9817a5, shipped in 2026.5.4) addressed the queued auto-follow-up path. The release notes state:

“clear the active reply-run guard before draining queued same-session follow-up turns, so sequential chat.send calls no longer trip ReplyRunAlreadyActiveError”

However, this fix introduced or exposed a coverage gap for the discrete sequential chat.send path through the gateway WebSocket dispatcher.
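
The ordering described in the release note can be illustrated with a deliberately simplified model. The sketch below assumes a single-slot guard and hypothetical function names (`start_run`, `clear_guard`, `drain_then_clear`, `clear_then_drain`). It shows only why clearing before draining matters, and presumes the pre-fix code drained first. It makes no claim about how the WS dispatcher path actually behaves.

```sh
# Simplified model: one guard slot, one queued follow-up (illustrative only).
ACTIVE=""   # session key of the in-flight run, empty when idle

start_run() {
  [ -z "$ACTIVE" ] || return 1   # guard still held: reject the new run
  ACTIVE=$1
}

clear_guard() { ACTIVE=""; }

# Presumed pre-fix ordering: the follow-up is started while the finishing
# run still holds the guard, so it is rejected.
drain_then_clear() {
  rc=0
  start_run "$1" || rc=$?
  clear_guard
  return "$rc"
}

# Ordering stated in the release note: clear the guard first, then drain.
clear_then_drain() {
  clear_guard
  start_run "$1"
}
```

With `ACTIVE` set to a session key, `drain_then_clear` returns non-zero for that key while `clear_then_drain` succeeds. Any path that reaches the guard without going through the corrected ordering would keep the old failure mode.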

Two Distinct Paths with Shared Guard

The activeRunsByKey guard is shared between two code paths:

Path A: Agent-Runner Queued Follow-Up (Fixed in 2026.5.4)

Evidence & Sources

This troubleshooting guide was automatically synthesized by the FixClaw Intelligence Pipeline from community discussions.