BlueBubbles Channel Hangs in start-account Phase After Plugin Config Hot-Reload
Hot-reloading plugins.entries.* config causes the BlueBubbles channel to deadlock in the start-account phase, leaving webhooks silently responding HTTP 200 while discarding all inbound messages.
Symptoms
Externally Appearing Healthy
To external monitoring, the gateway process appears fully operational:
# Check service status
$ systemctl is-active openclaw-gateway
active
# Verify TCP listener binding
$ ss -tlnp | grep -E '(8080|8443)'
LISTEN 0 511 0.0.0.0:8080 0.0.0.0:* users:(("node",pid=1337,fd=20))
# Test webhook endpoint - returns 200 immediately
$ curl -X POST https://gateway.example.com/bluebubbles-webhook \
-H "Content-Type: application/json" \
-d '{"test": true}' \
-w "\nHTTP_CODE: %{http_code}\nTIME: %{time_total}s\n"
HTTP/1.1 200 OK
HTTP_CODE: 200
TIME: 0.002s
The suspiciously fast 2-5ms response time is a key indicator: legitimate BlueBubbles webhook processing typically exhibits 50-200ms latency due to signature validation and dispatch overhead.
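As a rough illustration of that heuristic, a probe could flag responses that are both successful and implausibly fast. This is only a sketch: the 10 ms threshold and the function name are assumptions chosen for the example, not values taken from the gateway.

```javascript
// Hypothetical helper: flags a webhook response that returned 200 faster
// than real processing plausibly could. The threshold is an assumption.
const SUSPICIOUS_LATENCY_MS = 10;

function looksLikeBlackhole(statusCode, latencyMs, thresholdMs = SUSPICIOUS_LATENCY_MS) {
  return statusCode === 200 && latencyMs < thresholdMs;
}

console.log(looksLikeBlackhole(200, 2));   // fast 200: matches the stuck signature
console.log(looksLikeBlackhole(200, 120)); // latency within the normal 50-200 ms range
```

A fast 200 is only a hint, so a check like this is best paired with the log and liveness checks described in the rest of this section.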
Internally Stuck State
Gateway logs stop completely once the startup sequence finishes:
[default] starting provider (webhook=/bluebubbles-webhook)
[default] BlueBubbles server macOS 26.3.1
[default] BlueBubbles Private API enabled
[default] BlueBubbles webhook listening on /bluebubbles-webhook
[default] BlueBubbles catchup: replayed=0 fetched=0 window_ms=5000
[cron] started
# ... silence for hours ...
Liveness Diagnostic Signature
When polling /debug/liveness or reviewing metrics, the following pattern is diagnostic:
liveness warning: reasons=event_loop_delay interval=30s
eventLoopDelayP99Ms=21.3 eventLoopDelayMaxMs=5964.3
cpuCoreRatio=0.094 active=1 waiting=0 queued=1
phase=channels.bluebubbles.start-account
recentPhases=sidecars.restart-sentinel:0ms,
sidecars.subagent-recovery:13ms,
sidecars.main-session-recovery:8ms,
post-attach.update-sentinel:0ms,
sidecars.session-locks:61235ms,
post-ready.maintenance:759ms
Critical indicators:
- phase=channels.bluebubbles.start-account: stuck, never transitions to running or ready
- sidecars.session-locks:61235ms: 60+ seconds spent acquiring session locks during hot-reload
- eventLoopDelayMaxMs=5964.3: event loop blocking, confirming lock contention
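The indicators above can be checked mechanically. The following is a minimal sketch, assuming the liveness output is available as plain key=value text in the form shown; the function names are hypothetical, and the numeric thresholds are borrowed from the alert thresholds listed under Related Metric Warnings.

```javascript
// Sketch: parse key=value liveness text and flag the indicators discussed above.
function parseLiveness(text) {
  // Rejoin wrapped recentPhases entries, then collapse all whitespace.
  const flat = text.replace(/,\s*\n\s*/g, ',').replace(/\s+/g, ' ');
  const fields = {};
  for (const token of flat.split(' ')) {
    const eq = token.indexOf('=');
    if (eq > 0) fields[token.slice(0, eq)] = token.slice(eq + 1);
  }
  const recentPhases = {};
  for (const entry of (fields.recentPhases || '').split(',')) {
    const m = entry.match(/^(.+):(\d+)ms$/);
    if (m) recentPhases[m[1]] = Number(m[2]);
  }
  return {
    phase: fields.phase,
    maxDelayMs: Number(fields.eventLoopDelayMaxMs),
    recentPhases,
  };
}

function stuckIndicators(liveness) {
  const reasons = [];
  if (liveness.phase === 'channels.bluebubbles.start-account') {
    reasons.push('phase is still start-account');
  }
  if ((liveness.recentPhases['sidecars.session-locks'] ?? 0) > 5000) {
    reasons.push('slow session-lock phase');
  }
  if (liveness.maxDelayMs > 3000) reasons.push('event loop blocked');
  return reasons;
}

const sample = `liveness warning: reasons=event_loop_delay interval=30s
eventLoopDelayP99Ms=21.3 eventLoopDelayMaxMs=5964.3
phase=channels.bluebubbles.start-account
recentPhases=sidecars.restart-sentinel:0ms,
sidecars.session-locks:61235ms,
post-ready.maintenance:759ms`;

console.log(stuckIndicators(parseLiveness(sample)));
```

A single sample in start-account is not proof of a hang, since every healthy startup passes through that phase; the signal is the phase persisting across polls together with the other two indicators.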
Webhook Processing Blackhole
Debug-level logging (logging.level: debug) reveals the handler never fires:
# Expected but missing:
$ grep "webhook received" /var/log/openclaw/gateway.log
# No entries appear despite mux-side logs confirming forwards land successfully
# Expected but missing:
$ grep "webhook accepted" /var/log/openclaw/gateway.log
# Never logged
The HTTP listener accepts requests and returns 200, but no request is ever dispatched to the message-handling pipeline.
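One way to turn this observation into a check is to compare the number of webhooks the upstream mux reports forwarding with the number of handler log entries. The sketch below assumes both are available to the caller; the function names and inputs are hypothetical, and only the "webhook received" phrase comes from the grep commands above.

```javascript
// Sketch: detect the "200 but never handled" condition from log lines.
function countHandled(logLines) {
  return logLines.filter((line) => line.includes('webhook received')).length;
}

function isBlackholed(forwardedCount, logLines) {
  // Webhooks were forwarded (and answered with 200) but none reached the handler.
  return forwardedCount > 0 && countHandled(logLines) === 0;
}

const stuckLog = [
  '[default] BlueBubbles webhook listening on /bluebubbles-webhook',
  '[cron] started',
];
const healthyLog = [
  ...stuckLog,
  '[bluebubbles] webhook received path=/bluebubbles-webhook id=abc123',
];

console.log(isBlackholed(12, stuckLog));
console.log(isBlackholed(12, healthyLog));
```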
Root Cause
Architectural Failure: Session-Lock Contention During Hot-Reload
The root cause is a deadlock condition induced by the plugin hot-reload path attempting in-place re-initialization of the BlueBubbles channel while holding session locks.
Failure Sequence
1. Normal Operation
+-----------------+     +------------------+     +-----------------+
| BlueBubbles     |---->| Session Lock     |---->| Message Handler |
| Webhook Handler |     | (acquired)       |     | Pipeline        |
+-----------------+     +------------------+     +-----------------+
2. Hot-Reload Trigger Event
Config Write: plugins.entries.google.config.webSearch.model
        |
        v
Plugin Subsystem detects change
        |
        v
Attempts: channels.bluebubbles.reinitialize()
        |
        v
Blocks on: acquire session-locks (ALREADY HELD)
3. Deadlock State
+-------------------------+     +-------------------------+
| Hot-Reload Thread       |     | Webhook Handler Thread  |
| (reinitialize)          |     | (inbound request)       |
+-------------------------+     +-------------------------+
| Waiting to acquire      |<----| Holds session-lock      |
| session-lock...         |     | (blocked releasing)     |
| (INDEFINITELY)          |     | (waiting for handler    |
|                         |     | pipeline to drain)      |
+-------------------------+     +-------------------------+
            ^                               ^
            +-------------------------------+
                 CIRCULAR WAIT (DEADLOCK)
Technical Deep-Dive
The start-account Phase Problem:
The channels.bluebubbles.start-account phase is designed to perform:
- Session establishment with BlueBubbles server
- Lock acquisition for account-scoped resources
- Webhook registration confirmation
During hot-reload, the re-initialization path enters start-account while the existing handler thread holds the session lock. The new initialization thread blocks indefinitely waiting for locks it cannot obtain because the old thread is waiting for the handler pipeline to drain, which requires processing webhooks that never reach the pipeline.
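The circular wait described above can be made explicit by modelling "who waits for whom" as a directed graph and searching it for a cycle. This is an illustrative sketch only: the node labels are names invented for the example, and the final edge encodes the reading that the pipeline cannot drain until the channel finishes starting.

```javascript
// Sketch: depth-first search for a cycle in a wait-for graph.
function findCycle(waitsFor) {
  const visiting = new Set();
  const done = new Set();
  const path = [];

  function visit(node) {
    if (done.has(node)) return null;
    if (visiting.has(node)) return path.slice(path.indexOf(node)).concat(node);
    visiting.add(node);
    path.push(node);
    for (const next of waitsFor[node] || []) {
      const cycle = visit(next);
      if (cycle) return cycle;
    }
    path.pop();
    visiting.delete(node);
    done.add(node);
    return null;
  }

  for (const node of Object.keys(waitsFor)) {
    const cycle = visit(node);
    if (cycle) return cycle;
  }
  return null;
}

const waitsFor = {
  'reinitialize': ['session-lock'],   // new init waits to acquire the lock
  'session-lock': ['old-handler'],    // the lock is held by the old handler
  'old-handler': ['pipeline-drain'],  // old handler waits for the pipeline to drain
  'pipeline-drain': ['reinitialize'], // draining needs the channel to finish starting
};

console.log(findCycle(waitsFor));
```

Removing any one edge breaks the cycle, which is what the fixes later in this guide do: a timeout removes the unbounded wait on the lock, and a clean teardown removes the old handler's hold on it.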
Affected Code Path:
// Approximate representation of the stuck code path
async function reinitializeChannel() {
  // 1. Hot-reload trigger received
  await pluginManager.reload(pluginId);
  // 2. Channel re-initialization attempts start-account
  const channelState = await blueBubbles.startAccount({
    phoneNumber: config.phoneNumber,
    webhookPath: '/bluebubbles-webhook'
  });
  // 3. startAccount acquires session lock - BLOCKS HERE
  await sessionLockManager.acquire({
    scope: 'account',
    identifier: config.phoneNumber,
    timeout: null // No timeout = indefinite wait
  });
  // Never reaches the point where channelState becomes 'ready'
}
Lock Acquisition Without Timeout:
The session-lock acquisition call uses timeout: null, meaning it will block indefinitely rather than failing or retrying. This converts a recoverable race condition into a permanent deadlock.
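To show what a bounded acquisition looks like in practice, here is a minimal sketch of a promise-based lock whose acquire call gives up after a deadline. SimpleLock is a toy in-memory class written for this example; it is not the gateway's session-lock manager, and its method signatures are assumptions.

```javascript
// Toy lock: acquire() resolves to true on success, false after timeoutMs.
class SimpleLock {
  constructor() {
    this.held = false;
    this.waiters = [];
  }

  acquire(timeoutMs) {
    if (!this.held) {
      this.held = true;
      return Promise.resolve(true);
    }
    return new Promise((resolve) => {
      const waiter = { resolve, timer: null };
      waiter.timer = setTimeout(() => {
        // Give up: leave the queue and report failure instead of waiting forever.
        this.waiters = this.waiters.filter((w) => w !== waiter);
        resolve(false);
      }, timeoutMs);
      this.waiters.push(waiter);
    });
  }

  release() {
    const next = this.waiters.shift();
    if (next) {
      clearTimeout(next.timer);
      next.resolve(true); // hand the lock directly to the next waiter
    } else {
      this.held = false;
    }
  }
}

async function demo() {
  const lock = new SimpleLock();
  const first = await lock.acquire(50);  // lock is free, acquired immediately
  const second = await lock.acquire(50); // lock is held, times out after 50 ms
  lock.release();
  const third = await lock.acquire(50);  // lock is free again
  return [first, second, third];
}

demo().then((results) => console.log(results));
```

With a bounded wait, the caller gets a failure it can log, retry, or escalate, so the race becomes a recoverable error rather than a permanent hang.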
Why Only a Fraction of Tenants Are Affected:
The race condition timing depends on:
- Whether an active webhook request is mid-processing at hot-reload moment
- The specific order of thread scheduling
- Whether the session-lock release path has a yield point
Approximately 56% (15 of 27) of tenants hit this because a webhook request happened to be in flight when the hot-reload fired.
Environment Factors
| Factor | Impact |
|---|---|
| Plugin hot-reload trigger | Any plugins.entries.* config change |
| Concurrent webhook traffic | Increases likelihood of race condition |
| groupPolicy: open or allowlist | Increases traffic volume |
| Event loop saturation | Exacerbates timing sensitivity |
Step-by-Step Fix
Fix 1: Implement Session-Lock Acquisition Timeout (Recommended)
This fix prevents indefinite blocking by adding a bounded timeout to session-lock acquisition during re-initialization.
Before (Stuck Code)
// In lib/channels/bluebubbles/channel-manager.js
async function startAccount(config) {
  // ...
  await sessionLock.acquire({
    scope: 'account',
    identifier: config.phoneNumber
    // timeout missing = infinite wait
  });
  // Deadlock occurs here
}
After (Fixed Code)
// In lib/channels/bluebubbles/channel-manager.js
async function startAccount(config, options = {}) {
  const timeout = options.timeout ?? 30000; // 30-second default
  const lockAcquired = await sessionLock.acquire({
    scope: 'account',
    identifier: config.phoneNumber,
    timeout: timeout,
    onTimeout: 'fail-fast' // Return error instead of blocking
  });
  if (!lockAcquired) {
    const error = new Error(
      `startAccount: session-lock acquisition timeout after ${timeout}ms`
    );
    error.code = 'LOCK_ACQUISITION_TIMEOUT';
    error.context = { phoneNumber: config.phoneNumber };
    throw error;
  }
  // Proceed with account startup
  return await completeAccountStartup(config);
}
Fix 2: Hot-Reload Path Should Force Clean Tear-Down
Rather than attempting in-place re-initialization, the hot-reload path should trigger a clean shutdown followed by cold start.
Implementation
// In lib/plugin-manager/reloader.js
async function handlePluginConfigHotReload(pluginId, newConfig) {
  const channel = channels.find(c => c.pluginId === pluginId);
  if (channel && channel.type === 'bluebubbles') {
    // Instead of reinitialize(), do full teardown + cold start
    logger.info('BlueBubbles: initiating clean restart due to config hot-reload');
    // 1. Signal graceful shutdown
    await channel.shutdown({ timeout: 5000, force: true });
    // 2. Release all session locks held by channel
    await sessionLockManager.releaseAll({
      scope: 'account',
      channelId: channel.id
    });
    // 3. Clear any pending webhook handlers
    await webhookHandlerManager.clear(channel.id);
    // 4. Cold start (not re-initialize)
    await channel.start({
      fresh: true,
      config: newConfig
    });
  } else {
    // Standard reload for non-channel plugins
    await pluginManager.reload(pluginId);
  }
}
Fix 3: Apply via Configuration (Immediate Mitigation)
If modifying the source code is not immediately possible, the following operational steps mitigate the issue:
Step 1: Identify Affected Tenants
# Query liveness endpoint for stuck tenants
curl -s https://gateway.example.com/debug/liveness | \
jq '.tenants[] | select(.phase == "channels.bluebubbles.start-account")'
# Expected output:
{
  "tenantId": "tenant-1234",
  "phase": "channels.bluebubbles.start-account",
  "stuckDuration": "4h23m15s"
}
Step 2: Isolate Stuck Channels
# For each stuck tenant, disable the channel temporarily
# This prevents further webhook traffic from being silently dropped
curl -X PATCH https://gateway.example.com/api/v1/tenants/{tenantId}/channels/bluebubbles \
-H "Authorization: Bearer {admin_token}" \
-H "Content-Type: application/json" \
-d '{"enabled": false}'
# Response: 200 OK
Step 3: Force Cold Restart
# Option A: Docker tenants (full recreation)
docker compose down && docker compose up -d
# Option B: Systemd tenants (native install)
sudo systemctl restart openclaw-gateway
# Verify restart clears the stuck state
sleep 10
curl -s https://gateway.example.com/debug/liveness | \
jq '.tenants[] | select(.tenantId == "tenant-1234") | .phase'
# Expected: "running" or "ready"
Verification
Verification 1: Confirm Start-Account Completes
After applying the fix, verify the BlueBubbles channel transitions to a ready state:
# Check channel phase for all tenants
$ curl -s https://gateway.example.com/debug/liveness | \
  jq '[.tenants[] | select(.channelType == "bluebubbles") | {tenantId, phase, phaseAge}]'
# Expected output (after fix):
[
  {
    "tenantId": "tenant-1234",
    "phase": "running",
    "phaseAge": "2m15s"
  },
  {
    "tenantId": "tenant-5678",
    "phase": "running",
    "phaseAge": "45s"
  }
]
# Verify NO tenants remain in 'start-account' phase
$ curl -s https://gateway.example.com/debug/liveness | \
jq '[.tenants[] | select(.phase == "channels.bluebubbles.start-account")] | length'
0
Verification 2: Webhook Processing Active
Confirm webhooks are being received and processed (not just responded to):
# Enable debug logging temporarily
curl -X PATCH https://gateway.example.com/api/v1/config \
-H "Authorization: Bearer {admin_token}" \
-d '{"logging": {"level": "debug"}}'
# Send test webhook
curl -X POST https://gateway.example.com/bluebubbles-webhook \
-H "Content-Type: application/json" \
-H "X-BlueBubbles-Signature: test-signature" \
-d '{"message": {"text": "VERIFICATION_TEST"}, "from": "+12345551234"}'
# Check logs for webhook processing
$ ssh openclaw-gateway "tail -f /var/log/openclaw/gateway.log" | grep -E "(webhook received|webhook accepted|message processed)"
# Expected output within 5 seconds:
[bluebubbles] webhook received path=/bluebubbles-webhook id=abc123
[bluebubbles] webhook accepted tenant=tenant-1234
[bluebubbles] message processed from=+12345551234
Verification 3: Session-Lock Metrics Normal
Verify session-lock acquisition completes within acceptable bounds:
# Check metrics endpoint
$ curl -s https://gateway.example.com/metrics | \
grep -E "(session_lock|bluebubbles)" | head -20
# Key metrics to verify:
# bluebubbles_start_account_duration_seconds_bucket{le="30"} should show non-zero
# session_lock_acquisition_duration_seconds should be < 5s (not 60s+)
Verification 4: Hot-Reload Resilience Test
Test that the fix prevents deadlock on subsequent hot-reloads:
# Trigger a hot-reload of plugin config
curl -X PUT https://gateway.example.com/api/v1/config/plugins/entries/google \
-H "Authorization: Bearer {admin_token}" \
-H "Content-Type: application/json" \
-d '{"config": {"webSearch": {"model": "claude-sonnet-4-20250514"}}}'
# Immediately monitor phase
for i in {1..10}; do
  phase=$(curl -s https://gateway.example.com/debug/liveness | \
    jq -r '.phase')
  echo "[${i}] Phase: ${phase}"
  if [[ "$phase" == "running" ]]; then
    echo "SUCCESS: Phase transitioned to running"
    exit 0
  fi
  sleep 2
done
echo "FAILURE: Phase did not transition to running within 20 seconds"
exit 1
Verification 5: End-to-End Message Flow
Complete verification of customer message processing:
# Send simulated customer message via BlueBubbles API mock
# (unquoted EOF so that $(date -Iseconds) is expanded inside the payload)
cat << EOF | curl -X POST https://gateway.example.com/bluebubbles-webhook \
-H "Content-Type: application/json" \
-H "X-BlueBubbles-Signature: $(echo -n 'test' | openssl dgst -sha256 -hmac 'secret' | cut -d' ' -f2)" \
-d @-
{
"message": {
"text": "Customer test message",
"handle": "+12345551234",
"date": "$(date -Iseconds)"
},
"attachment": null,
"method": "private-api"
}
EOF
# Verify response logged and processed
$ grep -E "(inbound|outbound|reply)" /var/log/openclaw/messages.log | tail -5
# Expected: Message logged as inbound with proper handle mapping
Common Pitfalls
Pitfall 1: Partial Restart Insufficient
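Many administrators attempt systemctl restart expecting it to clear the stuck state. However, this can fail to release session locks when stale lock files persist on disk. If a plain restart does not clear the state, use the full stop-and-clean sequence below.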
# INEFFECTIVE - Partial restart may retain lock state
$ sudo systemctl restart openclaw-gateway
# EFFECTIVE - Full process termination required
$ sudo systemctl stop openclaw-gateway
$ sudo killall -9 node # Ensure all node processes are terminated (this kills every node process on the host)
$ sudo rm -f /var/run/openclaw/session-locks/* # Clear stale lock files
$ sudo systemctl start openclaw-gateway
Docker-specific variant:
# INEFFECTIVE for Docker
$ docker compose restart openclaw
# EFFECTIVE for Docker
$ docker compose down
$ docker volume rm $(docker volume ls -qf name=openclaw-locks) 2>/dev/null || true
$ docker compose up -d
Pitfall 2: Health Check False Positives
Standard health checks return 200 OK because they only verify the HTTP listener is bound, not that the message pipeline is functional.
# This check passes but the channel is stuck:
$ curl -s https://gateway.example.com/healthz
{"status":"ok"}
# Use this instead to check actual channel state:
$ curl -s https://gateway.example.com/debug/channels | \
  jq '.bluebubbles.status'
"stuck-in-start-account"
Recommended monitoring query:
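# Alert on phase duration exceeding threshold (bash)
phase_age=$(curl -s https://gateway.example.com/debug/liveness | \
  jq -r '.phaseAge')
# Convert a duration such as "4h23m15s" to seconds (assumes h/m/s components only)
h=$(echo "$phase_age" | grep -oE '[0-9]+h' | tr -d 'h')
m=$(echo "$phase_age" | grep -oE '[0-9]+m' | tr -d 'm')
s=$(echo "$phase_age" | grep -oE '[0-9]+s' | tr -d 's')
age_seconds=$(( 10#${h:-0} * 3600 + 10#${m:-0} * 60 + 10#${s:-0} ))
if [ "$age_seconds" -gt 300 ]; then
  echo "ALERT: BlueBubbles channel stuck for ${age_seconds}s"
  # Trigger incident response
fi
Pitfall 3: Hot-Reload Timing Window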
The race condition has a narrow timing window. Some administrators report that re-triggering hot-reload sometimes “unsticks” the channel. This is a timing artifact, not a reliable fix.
# UNRELIABLE: Triggering second hot-reload may coincidentally succeed
curl -X PUT ...config... # May unstick due to thread scheduling luck
# This is NOT a fix; implement the proper solution above
Pitfall 4: BlueBubbles Server Version Compatibility
Certain BlueBubbles server versions exhibit different webhook delivery behavior that can mask or exacerbate this issue.
| BB Server Version | Behavior |
|---|---|
| < 1.9.0 | No retry logic, drops silently on 2xx |
| 1.9.0-1.9.8 | Retries 3x with 30s backoff |
| >= 1.9.9 | Retries with exponential backoff, alerts on persistent failure |
Ensure bluebubbles.server.version is >= 1.9.0 for proper retry behavior.
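A version gate like this is easy to get wrong with plain string comparison, since "1.10.0" sorts before "1.9.0" lexicographically. The sketch below compares components numerically; it assumes plain "major.minor.patch" strings with no pre-release suffix, and the function names are hypothetical.

```javascript
// Sketch: numeric comparison of dotted version strings.
function compareVersions(a, b) {
  const pa = a.split('.').map(Number);
  const pb = b.split('.').map(Number);
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const diff = (pa[i] || 0) - (pb[i] || 0);
    if (diff !== 0) return Math.sign(diff);
  }
  return 0;
}

// True when the server version meets the 1.9.0 minimum named above.
function hasRetrySupport(version) {
  return compareVersions(version, '1.9.0') >= 0;
}

console.log(hasRetrySupport('1.8.12'));
console.log(hasRetrySupport('1.9.0'));
console.log(hasRetrySupport('1.10.1'));
```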
Pitfall 5: Session Lock Persistence
Session locks may persist across tenant migrations or gateway failures, causing new instances to start in a stuck state immediately.
# Check for stale lock files before starting
$ ls -la /var/run/openclaw/session-locks/
# If locks exist for removed tenants:
# tenant-abcd1234 -> locked since 2026-05-10
# tenant-efgh5678 -> locked since 2026-05-11
# Clear orphaned locks
$ sudo rm -rf /var/run/openclaw/session-locks/*
# Then restart gateway
$ sudo systemctl restart openclaw-gateway
Pitfall 6: Concurrent Hot-Reload Storms
Fleet-wide automation (Ansible, Terraform, etc.) may trigger simultaneous hot-reloads across many tenants, amplifying the race condition probability.
# RISKY: Concurrent updates across all tenants
ansible-playbook -i inventory fleet-wide-plugin-update.yml
# SAFER: Serialized updates with verification between each
for tenant in $(tenant list --format=json | jq -r '.[].id'); do
  echo "Updating tenant: $tenant"
  curl -X PUT .../tenants/$tenant/config/plugins/entries/google \
    -d '{"config": {...}}'
  # Wait and verify channel is running before next tenant
  sleep 10
  phase=$(curl -s .../tenants/$tenant/debug/liveness | jq -r '.phase')
  if [ "$phase" != "running" ]; then
    echo "ERROR: Tenant $tenant not running, aborting fleet update"
    exit 1
  fi
done
Related Errors
Issue #78165: WhatsApp Channel Stuck After Plugin Hot-Reload
Symptom: WhatsApp channel enters channels.whatsapp.auth-flow phase indefinitely after plugins.entries.* config hot-reload.
Shared Root Cause: Session-lock contention in the channel re-initialization path.
Resolution: Fixed in 2026.5.8 via session-lock timeout implementation.
Issue #78690: WhatsApp Webhook 404 Despite HTTP 200 (Follow-up)
Symptom: Secondary report confirming webhook acceptance but message handler non-responsiveness.
Key Finding: Identified the HTTP listener accepts requests but discards the body before handler dispatch.
Resolution: Related to #78165 fix; confirmed by implementing clean teardown path.
Issue #78435: Slack Channel Start-Account Deadlock
Symptom: channels.slack.start-account blocks for 60+ seconds then fails with ETIMEDOUT.
Distinguisher: Slack version exhibits timeout (not infinite block) due to different lock implementation.
Workaround: Same as in this guide; a full gateway restart clears the state.
Issue #78352: Telegram Channel Reconnection Loop Post-Hot-Reload
Symptom: Telegram channel repeatedly reconnects without entering running phase after hot-reload.
Distinguisher: Exhibits reconnection loop instead of permanent deadlock due to different session-lock release timing.
Related Finding: Confirmed that a post-ready.maintenance duration spike (from 759ms to 5000ms+) is a precursor indicator.
Related Metric Warnings
| Metric | Healthy Range | Alert Threshold | Indicator |
|---|---|---|---|
| sidecars.session-locks | 0-500ms | > 5000ms | Lock contention |
| eventLoopDelayMaxMs | < 1000ms | > 3000ms | Event loop blocking |
| phaseAge.channels.*.start-account | < 30s | > 60s | Stuck startup |
| webhook.received.count | > 0/min | = 0 for 5+ min | Processing stopped |
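The alert thresholds in this table can be expressed as a small rule set. The sketch below is illustrative: the thresholds are copied from the table, while the shape of the metric snapshot, the key names phaseAgeStartAccountSec and minutesSinceLastWebhook, and the function name are assumptions made for the example.

```javascript
// Alert rules taken from the table above; snapshot keys are hypothetical.
const ALERT_RULES = [
  { metric: 'sidecars.session-locks', indicator: 'Lock contention', breached: (v) => v > 5000 },
  { metric: 'eventLoopDelayMaxMs', indicator: 'Event loop blocking', breached: (v) => v > 3000 },
  { metric: 'phaseAgeStartAccountSec', indicator: 'Stuck startup', breached: (v) => v > 60 },
  { metric: 'minutesSinceLastWebhook', indicator: 'Processing stopped', breached: (v) => v >= 5 },
];

// Returns the indicator names for every rule the snapshot breaches.
function evaluate(snapshot) {
  return ALERT_RULES
    .filter((rule) => rule.metric in snapshot && rule.breached(snapshot[rule.metric]))
    .map((rule) => rule.indicator);
}

const stuckSnapshot = {
  'sidecars.session-locks': 61235,
  eventLoopDelayMaxMs: 5964.3,
  phaseAgeStartAccountSec: 15795,
  minutesSinceLastWebhook: 263,
};

console.log(evaluate(stuckSnapshot));
```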
Cross-Channel Pattern Summary
This issue represents a class of bugs where channel startup phases are not hot-reload safe due to:
- Session-lock acquisition without timeout
- In-place re-initialization attempting to re-acquire held locks
- Circular wait between new init thread and existing handler thread
Prevention Checklist:
- All channel start-* phases implement bounded timeouts (≤60s)
- Hot-reload path implements clean teardown before cold start
- Session-lock metrics exposed and monitored
- Health check includes channel phase verification
- Webhook processing metrics compared against listener metrics
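As one possible shape for the health-check item, the sketch below combines listener state with the channel phase and its age, so a bound listener alone no longer counts as healthy. The input fields, function names, and the 60-second default are assumptions for illustration, and the duration parser only understands h/m/s components such as "4h23m15s".

```javascript
// Sketch: convert "4h23m15s"-style durations to seconds.
function durationToSeconds(text) {
  const units = { h: 3600, m: 60, s: 1 };
  let total = 0;
  for (const [, value, unit] of text.matchAll(/(\d+)([hms])/g)) {
    total += Number(value) * units[unit];
  }
  return total;
}

// Sketch: a health check that also verifies the channel phase.
function deepHealth({ listenerBound, phase, phaseAge }, maxStartupSeconds = 60) {
  if (!listenerBound) return { ok: false, reason: 'listener not bound' };
  const starting = phase.endsWith('.start-account');
  if (starting && durationToSeconds(phaseAge) > maxStartupSeconds) {
    return { ok: false, reason: `stuck in ${phase} for ${phaseAge}` };
  }
  return { ok: true, reason: starting ? 'starting' : 'phase healthy' };
}

console.log(deepHealth({
  listenerBound: true,
  phase: 'channels.bluebubbles.start-account',
  phaseAge: '4h23m15s',
}));
console.log(deepHealth({ listenerBound: true, phase: 'running', phaseAge: '2m15s' }));
```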