Telegram Typing Keepalive Loop Causes Gateway Crash on Network Failure
When the Telegram API becomes unreachable, the typing indicator keepalive loop keeps firing and every tick retries with exponential backoff, saturating the event loop and causing gateway unresponsiveness and crash loops.
Symptoms
Network Connectivity Failure Indicators
When Telegram API connectivity is interrupted, the gateway exhibits escalating failure patterns:
# Log excerpt from gateway.err.log during network outage
[2026-03-11T08:42:17.331Z] sendChatAction failed: Network request failed!
[2026-03-11T08:42:20.892Z] sendChatAction failed: Network request failed!
[2026-03-11T08:42:24.451Z] sendChatAction failed: Network request failed!
[2026-03-11T08:42:29.103Z] sendChatAction failed: Network request failed! # Retry attempt 1
[2026-03-11T08:42:34.218Z] sendChatAction failed: Network request failed! # Retry attempt 2
[2026-03-11T08:42:41.003Z] sendChatAction failed: Network request failed! # Retry attempt 3
[2026-03-11T08:43:02.445Z] sendChatAction failed: Network request failed! # Next cycle
[2026-03-11T08:43:25.891Z] sendChatAction failed: Network request failed!
[2026-03-11T08:43:51.332Z] sendChatAction failed: Network request failed!
Gateway Unresponsiveness
The event loop becomes saturated, causing operations to block indefinitely:
# Systemd/launchd log showing SIGTERM due to unresponsiveness
Mar 11 08:47:02 hostname openclaw-gateway[12345]: lane wait exceeded: waitedMs=317017
Mar 11 08:47:02 hostname launchd[1]: openclaw-gateway[12345] exceeded 120s with no activity, sending SIGTERM
Mar 11 08:47:08 hostname openclaw-gateway[12345]: Received SIGTERM, initiating graceful shutdown...
Mar 11 08:47:08 hostname openclaw-gateway[12345]: Failed to stop typing loop: unhandled rejection
Crash Loop Pattern
The gateway enters a restart cycle due to lingering background processes:
# Observed crash frequency during network outage
[2026-03-11T08:45:12] Gateway started
[2026-03-11T08:46:45] Gateway unresponsive (lane wait exceeded)
[2026-03-11T08:47:02] Gateway terminated by launchd (SIGTERM)
[2026-03-11T08:48:15] Gateway started
[2026-03-11T08:49:33] Gateway unresponsive
[2026-03-11T08:49:50] Gateway terminated
[2026-03-11T08:51:02] Gateway started
[2026-03-11T08:52:18] Gateway unresponsive
# ... cycle repeats 3+ times in one morning
Diagnostic Metrics
When monitoring is enabled, observe these patterns:
# Event loop lag indicator (if metrics exposed)
gateway_events_lag_seconds{method="typing"} 45.2
gateway_open_typing_loops 3
gateway_consecutive_sendchat_errors 12
# Memory pressure
gateway_heap_used_mb 847
gateway_external_memory_mb 234 # Rising due to buffered requests
Root Cause
Architectural Overview
The issue stems from a layered failure mode where three components interact without a circuit-breaking mechanism:
- Keepalive Loop (createTypingKeepaliveLoop): fires onTick() every 6 seconds indefinitely
- Retry Layer (createTelegramRequestWithDiag): retries each failed request up to 3 times with exponential backoff
- Error Propagation (withTelegramApiErrorLogging): logs errors and re-throws, preventing silent failure
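To make the interaction between the 6-second interval and the retry layer concrete, the sketch below estimates how long a single tick can stay in flight during an outage. It is an illustration under stated assumptions, not the gateway's code: the 1-second base delay, the doubling factor, and the 30-second per-request timeout are assumed values, and `backoffDelays` and `worstCaseTickSeconds` are hypothetical helper names.

```typescript
// Hypothetical helpers for reasoning about retry timing; not part of the gateway.

/** Delays (seconds) before each retry: base, 2x, 4x, ... capped at maxDelaySeconds. */
function backoffDelays(retries: number, baseSeconds: number, maxDelaySeconds: number): number[] {
  const delays: number[] = [];
  for (let i = 0; i < retries; i++) {
    delays.push(Math.min(baseSeconds * 2 ** i, maxDelaySeconds));
  }
  return delays;
}

/**
 * Upper bound on how long one tick stays in flight when every attempt
 * runs into the per-request timeout (an assumed value, not read from the gateway).
 */
function worstCaseTickSeconds(
  attempts: number,
  timeoutSeconds: number,
  baseSeconds: number,
  maxDelaySeconds: number
): number {
  // One backoff wait sits between consecutive attempts, so there are attempts - 1 waits.
  const waits = backoffDelays(Math.max(attempts - 1, 0), baseSeconds, maxDelaySeconds);
  const totalWait = waits.reduce((sum, d) => sum + d, 0);
  return attempts * timeoutSeconds + totalWait;
}

// 3 attempts, 30s timeout each, waits of 1s and 2s in between.
const inFlightSeconds = worstCaseTickSeconds(3, 30, 1, 30);
// Interval callbacks that fire while the tick is still in flight and are skipped.
const skippedTicks = Math.floor(inFlightSeconds / 6);
console.log({ inFlightSeconds, skippedTicks });
```

Under these assumptions a single tick can hold the in-flight slot for about 93 seconds, during which 15 interval callbacks fire and return immediately.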
Failure Sequence Analysis
The typing lifecycle in src/channels/typing-lifecycle.ts uses a setInterval-based loop:
// Current implementation (simplified)
const createTypingKeepaliveLoop = (params) => {
  let timer = null;
  let tickInFlight = false;
  const tick = async () => {
    if (tickInFlight) return; // Prevents concurrent ticks
    tickInFlight = true;
    try {
      await params.onTick(); // ← Calls sendChatAction
    } finally {
      tickInFlight = false;
    }
  };
  const start = () => {
    if (params.intervalMs <= 0 || timer) return;
    timer = setInterval(() => { tick(); }, params.intervalMs);
  };
  // No circuit breaker, no error tracking
  return { start, stop };
};
The Retry Amplification Problem
When params.onTick() calls sendChatAction, the request travels through:
// src/telegram/send.ts
withTelegramApiErrorLogging(async () => {
  return createTelegramRequestWithDiag(params); // ← Retries happen here
});
// The retry configuration (typically in openclaw.json or defaults)
{
  "channels": {
    "telegram": {
      "retry": {
        "attempts": 3, // Up to 3 retries per tick
        "maxDelaySeconds": 30 // Exponential backoff cap
      }
    }
  }
}
This creates a blocking window per tick:
Tick Interval: 6 seconds
Retries per tick: 3 attempts
Max backoff: 30 seconds per retry
Worst case per tick: 6s + (1s + 2s + 4s + 8s + 16s + 30s) ≈ 67 seconds of blocking
# If network is completely dead (immediate timeout):
Tick 1 blocks for ~90s (3 retries × 30s timeout each)
Tick 2 fires at 6s, but Tick 1 still blocking → tickInFlight=true, skipped
Tick 3 fires at 12s → skipped
# Meanwhile, event loop is saturated with retry backoff timers
Why typingTtlMs Doesn’t Protect
The typingTtlMs (default: 2 minutes) is designed to stop the loop after inactivity:
// TTL timer logic (separate from keepalive interval)
const ttlTimer = setTimeout(() => {
  stop(); // Should stop the keepalive loop
}, typingTtlMs);
// Problem: When event loop is saturated:
// 1. TTL timer callback sits in the callback queue
// 2. Backoff timers execute, but failed requests re-queue themselves
// 3. The TTL timer may not execute within expected timeframe
// 4. Even if it does, the stop() call races against pending retries
Compound Effect with Multiple Typing Contexts
In group chats with multiple topics, concurrent typing indicators multiply the problem:
# Scenario: 3 group topics active, user typing in each
Topic A typing loop: → blocks event loop
Topic B typing loop: → blocks event loop
Topic C typing loop: → blocks event loop
        ↓
3× latency amplification
Event loop saturation threshold crossed
Memory Leak Potential
Each failed retry cycle may accumulate:
# Observed memory growth during outage
Initial heap: ~120 MB
After 5 min: ~340 MB
After 10 min: ~680 MB (and climbing)
# Likely sources:
- Buffered retry state
- Pending promise chains that never resolve
- Callback closures retained by setInterval
Step-by-Step Fix
Option A: Add Circuit Breaker to Keepalive Loop (Recommended)
Modify src/channels/typing-lifecycle.ts to track consecutive errors and halt on threshold:
// BEFORE: No error tracking
const createTypingKeepaliveLoop = (params) => {
  let timer = null;
  let tickInFlight = false;
  const tick = async () => {
    if (tickInFlight) return;
    tickInFlight = true;
    try {
      await params.onTick();
    } finally {
      tickInFlight = false;
    }
  };
  // ...
};
// AFTER: Circuit breaker implementation
const createTypingKeepaliveLoop = (params) => {
  let timer = null;
  let tickInFlight = false;
  let consecutiveErrors = 0;
  const MAX_CONSECUTIVE_ERRORS = params.maxConsecutiveErrors ?? 3;
  const tick = async () => {
    if (tickInFlight) return;
    tickInFlight = true;
    try {
      await params.onTick();
      consecutiveErrors = 0; // Reset on success
    } catch (error) {
      consecutiveErrors++;
      if (consecutiveErrors >= MAX_CONSECUTIVE_ERRORS) {
        console.error(
          `[typing-lifecycle] Circuit breaker triggered after ${consecutiveErrors} consecutive errors`
        );
        stop();
        params.onCircuitBreak?.(error);
      }
    } finally {
      tickInFlight = false;
    }
  };
  const start = () => {
    if (params.intervalMs <= 0 || timer) return;
    timer = setInterval(() => { tick(); }, params.intervalMs);
  };
  const stop = () => {
    if (timer) {
      clearInterval(timer);
      timer = null;
    }
    consecutiveErrors = 0;
  };
  return { start, stop, getStatus: () => ({ consecutiveErrors, running: !!timer }) };
};
Option B: Reduce Retry Aggressiveness for Typing Indicators
Since typing indicators are purely cosmetic, use single-attempt, short-timeout requests:
// In src/channels/typing-lifecycle.ts, update the onTick callback
// BEFORE: Uses default retry configuration
const keepaliveLoop = createTypingKeepaliveLoop({
  intervalMs: 6000,
  onTick: async () => {
    await telegramClient.sendChatAction(chatId, action);
  }
});
// AFTER: Isolated retry configuration
// Path is relative to src/channels/, so it resolves to src/telegram/request-utils
import { withShortTimeout } from '../telegram/request-utils';
const keepaliveLoop = createTypingKeepaliveLoop({
  intervalMs: 6000,
  maxConsecutiveErrors: 3,
  onTick: async () => {
    // Send typing action with minimal retry overhead
    await telegramClient.sendChatAction(chatId, action, {
      retry: { attempts: 1, maxDelaySeconds: 5 },
      timeout: 5000 // 5 second hard timeout
    });
  },
  onCircuitBreak: (error) => {
    console.warn('[typing-lifecycle] Typing indicator circuit broken, will retry on next message');
    metrics.increment('gateway.typing.circuit_break');
  }
});
Option C: Workaround via Configuration (No Code Changes)
If immediate deployment isn’t possible, modify openclaw.json:
{
  "agents": {
    "defaults": {
      "typingMode": "never" // Disables typing indicators entirely
    }
  },
  "channels": {
    "telegram": {
      "retry": {
        "attempts": 1, // Reduce from 3 to 1
        "maxDelaySeconds": 5 // Reduce from 30 to 5
      },
      "timeoutSeconds": 5 // Add explicit timeout
    }
  }
}
Implementation Sequence
- Immediate (Option C, workaround): Deploy the configuration change to reduce blast radius
- Short-term (Option B): Modify typing indicator calls to use minimal retry
- Medium-term (Option A): Implement the circuit breaker in the keepalive loop
- Long-term: Add metrics instrumentation for early warning
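For the long-term instrumentation step, a minimal sketch of the bookkeeping might look like the following. `TypingHealthTracker` is a hypothetical class, not an existing gateway module; the two gauge names are taken from the Diagnostic Metrics excerpt, and the warning threshold of 3 is an assumption chosen to match the circuit breaker default in Option A.

```typescript
// Hypothetical in-memory tracker; not an existing gateway module.
// Gauge names mirror the Diagnostic Metrics excerpt earlier in this document.
interface TypingGauges {
  gateway_open_typing_loops: number;
  gateway_consecutive_sendchat_errors: number;
}

class TypingHealthTracker {
  private openLoops = 0;
  private consecutiveErrors = 0;
  private readonly warnAfterErrors: number;

  constructor(warnAfterErrors: number) {
    this.warnAfterErrors = warnAfterErrors;
  }

  loopStarted(): void {
    this.openLoops += 1;
  }

  loopStopped(): void {
    // Never report a negative gauge if stop is called twice.
    this.openLoops = Math.max(0, this.openLoops - 1);
  }

  sendSucceeded(): void {
    this.consecutiveErrors = 0;
  }

  sendFailed(): void {
    this.consecutiveErrors += 1;
  }

  /** True once failures reach the threshold without an intervening success. */
  shouldWarn(): boolean {
    return this.consecutiveErrors >= this.warnAfterErrors;
  }

  snapshot(): TypingGauges {
    return {
      gateway_open_typing_loops: this.openLoops,
      gateway_consecutive_sendchat_errors: this.consecutiveErrors,
    };
  }
}

// Usage: one loop open, three failed sends in a row.
const tracker = new TypingHealthTracker(3);
tracker.loopStarted();
tracker.sendFailed();
tracker.sendFailed();
tracker.sendFailed();
console.log(tracker.snapshot(), tracker.shouldWarn());
```

Keeping the counters in a plain object like this leaves the choice of metrics backend open; the snapshot can be exported by whatever exporter the gateway already uses.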
Required Dependencies
Ensure these utilities exist or create them:
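// src/telegram/request-utils.ts
// Rejects if `promise` has not settled within `timeoutMs`; the timer is always cleared.
export const withShortTimeout = <T>(
  promise: Promise<T>,
  timeoutMs: number
): Promise<T> => {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('Request timeout')), timeoutMs);
  });
  // Clear the timer on either outcome so it cannot pile up during an outage
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
};
Verification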
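The grep checks used in the tests below can also be expressed as a small script that evaluates captured log lines. This is a sketch under assumptions: `checkBreakerLog` is a hypothetical helper, the two patterns are copied from the log messages shown in this document, and the sample lines are illustrative input rather than captured output.

```typescript
// Hypothetical log checker; patterns are copied from messages shown in this document.
const FAILURE_PATTERN = /sendChatAction failed/;
const BREAKER_PATTERN = /Circuit breaker triggered after (\d+) consecutive errors/;

interface BreakerCheck {
  failuresBeforeBreak: number;
  breakerFired: boolean;
  failuresAfterBreak: number;
}

/** Counts sendChatAction failures before and after the first circuit breaker message. */
function checkBreakerLog(lines: string[]): BreakerCheck {
  let failuresBeforeBreak = 0;
  let failuresAfterBreak = 0;
  let breakerFired = false;
  for (const line of lines) {
    if (BREAKER_PATTERN.test(line)) {
      breakerFired = true;
    } else if (FAILURE_PATTERN.test(line)) {
      if (breakerFired) {
        failuresAfterBreak += 1;
      } else {
        failuresBeforeBreak += 1;
      }
    }
  }
  return { failuresBeforeBreak, breakerFired, failuresAfterBreak };
}

// Illustrative input, not captured output.
const sampleLog = [
  "[2026-03-11T08:42:17.331Z] sendChatAction failed: Network request failed!",
  "[2026-03-11T08:42:23.331Z] sendChatAction failed: Network request failed!",
  "[2026-03-11T08:42:29.331Z] sendChatAction failed: Network request failed!",
  "[typing-lifecycle] Circuit breaker triggered after 3 consecutive errors",
];
const breakerResult = checkBreakerLog(sampleLog);
console.log(breakerResult);
```

The useful signal is `failuresAfterBreak` staying at 0 until a new message restarts the loop. If the retry layer logs every attempt, `failuresBeforeBreak` can legitimately exceed the tick threshold.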
Test 1: Circuit Breaker Activation
Simulate network failure and verify circuit breaker triggers:
# Terminal 1: Start gateway with debug logging
GATEWAY_LOG_LEVEL=debug ./openclaw-gateway --config openclaw.json
# Terminal 2: Block Telegram API
sudo iptables -A OUTPUT -d api.telegram.org -j DROP
# Terminal 3: Trigger typing indicator
curl -X POST http://localhost:3000/api/chats/send \
-H "Content-Type: application/json" \
-d '{"chatId": "-100123456789", "text": "test", "action": "typing"}'
# Expected output in Terminal 1:
[typing-lifecycle] Circuit breaker triggered after 3 consecutive errors
gateway.typing.circuit_break 1
# Verify keepalive loop stopped
grep "typing.*loop.*stopped\|circuit.*breaker" gateway.out.log
# Should show circuit breaker activation within ~18-30 seconds (3 ticks)
Test 2: Recovery After Network Restoration
Verify typing indicators resume automatically after network returns:
# Terminal 2: Restore network
sudo iptables -D OUTPUT -d api.telegram.org -j DROP
# Wait 30 seconds for network stabilization
# Terminal 3: Send new message (should restart typing loop)
curl -X POST http://localhost:3000/api/chats/send \
-H "Content-Type: application/json" \
-d '{"chatId": "-100123456789", "text": "recovery test", "action": "typing"}'
# Expected: No circuit breaker errors, typing action succeeds
grep "sendChatAction.*success\|chat_action_sent" gateway.out.log
Test 3: Event Loop Health Verification
Confirm event loop remains responsive during network failure:
# Start monitoring before network block
watch -n 1 'curl -s http://localhost:3000/metrics | grep -E "event_loop_lag|open_typing"'
# Block network
sudo iptables -A OUTPUT -d api.telegram.org -j DROP
# Expected behavior with circuit breaker:
# event_loop_lag should stay < 1s
# open_typing should remain at 0 (circuit broken)
# Without circuit breaker (baseline):
# event_loop_lag would spike to 30s+
# open_typing would show stuck loops
Test 4: Load Test with Multiple Typing Contexts
Simulate group chat scenario with concurrent typing indicators:
# Create test script to spawn multiple typing contexts (run from the repository root)
cat > /tmp/typing-load-test.js << 'EOF'
const path = require('path');
const { exec } = require('child_process');
// The script lives in /tmp, so resolve project modules against the working directory
const TelegramClient = require(path.resolve('src/channels/telegram/client'));
const { createTypingKeepaliveLoop } = require(path.resolve('src/channels/typing-lifecycle'));
const client = new TelegramClient({ token: process.env.TELEGRAM_BOT_TOKEN });
// Simulate 5 concurrent typing contexts
const loops = Array.from({ length: 5 }, (_, i) => {
  return createTypingKeepaliveLoop({
    intervalMs: 6000,
    chatId: `-100${1000000000 + i}`,
    onTick: async () => {
      await client.sendChatAction(`-100${1000000000 + i}`, 'typing');
    }
  });
});
// Start all loops
loops.forEach(loop => loop.start());
// Block network after 5 seconds
setTimeout(() => {
  exec('sudo iptables -A OUTPUT -d api.telegram.org -j DROP');
}, 5000);
// Verify all loops have circuit breakers
setTimeout(() => {
  const statuses = loops.map(loop => loop.getStatus());
  console.log('Loop statuses:', JSON.stringify(statuses));
  // All should show consecutiveErrors >= 3 or running: false
  const allBroken = statuses.every(s => s.consecutiveErrors >= 3 || !s.running);
  process.exit(allBroken ? 0 : 1);
}, 30000);
EOF
node /tmp/typing-load-test.js
echo "Exit code: $?" # Should be 0 if circuit breakers worked
Test 5: Configuration Workaround Verification
Verify typingMode: “never” eliminates the crash vector:
# Check gateway configuration is loaded correctly
curl -s http://localhost:3000/api/config | jq '.agents.defaults.typingMode'
# Should return: "never"
# Verify no typing loops are created
grep "createTypingKeepaliveLoop\|typing.*loop" gateway.out.log
# Should show no typing loop activity
# Block network and verify gateway remains responsive
sudo iptables -A OUTPUT -d api.telegram.org -j DROP
sleep 60
curl -s http://localhost:3000/api/health
# Should return healthy status with no lane wait errors
Common Pitfalls
Related Errors
The following errors and issues are contextually related to this failure mode:
- "lane wait exceeded: waitedMs=317017": event loop saturation symptom; gateway operations queue behind blocked typing retries
- "sendChatAction failed: Network request failed!": precursor error; repeated without a circuit breaker, it causes a cascade
- Gateway crash loop: consequence of prolonged unresponsiveness; launchd/systemd terminates the gateway after a threshold
- ERR_CONNECTION_REFUSED in Telegram client: network-layer error that triggers the retry amplification cycle
- ETIMEDOUT / ENETUNREACH: operating system-level network errors that expose the lack of a circuit breaker
- Exponential backoff saturation: when multiple loops compound backoff timers, the event loop receives thousands of pending callbacks
- Memory growth during outages: unresolved promise chains and retained closures accumulate during extended network failures
- Launchd/Systemd SIGTERM: the process supervisor terminates the gateway when health checks fail due to event loop blocking
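Because several of these errors signal the same underlying outage, a circuit breaker may want to treat them as one class. The sketch below shows one way to do that; `isOutageError` is a hypothetical helper, and it assumes errors expose a string `code` or `message` property. `ECONNREFUSED` is added here as the Node.js system error code for a refused connection; the other codes and the message text come from the list above.

```typescript
// Hypothetical classifier; assumes errors carry a string `code` and/or `message`.
const OUTAGE_CODES = new Set<string>([
  "ETIMEDOUT",
  "ENETUNREACH",
  "ERR_CONNECTION_REFUSED",
  "ECONNREFUSED", // Node.js system error code, added for illustration
]);

/** Returns true when the error looks like a network outage rather than an API-level failure. */
function isOutageError(error: unknown): boolean {
  if (typeof error !== "object" || error === null) return false;
  const { code, message } = error as { code?: unknown; message?: unknown };
  if (typeof code === "string" && OUTAGE_CODES.has(code)) return true;
  return typeof message === "string" && message.includes("Network request failed");
}

// Usage: only outage-type errors would count toward a breaker threshold.
const timeoutError = Object.assign(new Error("connect ETIMEDOUT"), { code: "ETIMEDOUT" });
const apiError = new Error("Bad Request: chat not found");
console.log(isOutageError(timeoutError), isOutageError(apiError));
```

Counting only outage-type errors keeps a single malformed request from tripping the breaker, at the cost of maintaining the code list.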
Historical Context
This issue pattern has been observed in:
- Gateway v2026.1.x through v2026.3.x: initial typing lifecycle implementation without fault tolerance
- Similar patterns in the createTypingKeepaliveLoop predecessor, TypingIndicatorManager
- Comparable circuit breaker gaps in sendMessageRetry and fileUploadLoop
Related GitHub Issues
- #1847: "Gateway unresponsiveness during Telegram API outages" (initial report of lane wait exceeded)
- #2103: "Memory leak in typing lifecycle during network interruptions" (memory growth pattern)
- #2156: "Crash loop in launchd-managed gateway" (SIGTERM termination cascade)