Memory Status --deep Embedding Readiness Probe Flapping
The `openclaw memory status --deep` command reports inconsistent embedding availability states due to live probe instability conflating transient network issues with persistent index health.
Symptoms
The `openclaw memory status --deep` command produces non-deterministic output for the same agent across successive invocations:
$ openclaw memory status --deep --agent my-agent
[...memory index details...]
Embeddings: ready
Index: healthy, 1,247 chunks indexed
$ openclaw memory status --deep --agent my-agent
[...memory index details...]
Embeddings: unavailable
Index: healthy, 1,247 chunks indexed
$ openclaw memory status --deep --agent my-agent
[...memory index details...]
Embeddings error: fetch failed
Index: healthy, 1,247 chunks indexed
Observed Characteristics
- Alternating states: Same agent shows `ready` → `unavailable` → `fetch failed` across sequential calls
- Index integrity preserved: SQLite-backed memory store contains populated data (verified via direct queries)
- Direct endpoint works: Manual curl/requests to remote embedding endpoint succeed without issues
- Chunk integrity confirmed: Reconstructed chunk payloads embed successfully against the same endpoint
CLI Execution Pattern
$ # Consecutive runs within seconds
$ openclaw memory status --deep
Status: ready # Run 1
$ openclaw memory status --deep
Status: unavailable # Run 2
$ openclaw memory status --deep
Status: ready # Run 3
$ openclaw memory status --deep
Status: fetch failed # Run 4
Diagnostic Evidence
$ # Verify index persistence independently
$ sqlite3 ~/.openclaw/agents/my-agent/memory.db "SELECT COUNT(*) FROM chunks;"
1247
$ # Verify direct embedding capability
$ curl -X POST https://api.provider.com/v1/embeddings \
-H "Authorization: Bearer $EMBEDDING_API_KEY" \
-d '{"input":"ping","model":"embedding-model"}'
{"embedding":[...],"model":"embedding-model"}
$ # Probe via OpenClaw (flapping behavior)
$ openclaw memory status --deep --trace 2>&1 | grep -i embed
[probe] Calling embedBatchWithRetry(["ping"])
[probe] Result: success # Run 1
[probe] Result: failure # Run 2
[probe] Result: success # Run 3
Root Cause
Architectural Failure: Probe/State Conflation
The MemoryIndexManager class conflates two distinct concerns:
- Persistence Layer: SQLite-backed index with verified chunk counts and content integrity
- Probe Layer: Live HTTP request to remote embedding endpoint for readiness confirmation
The critical code path in MemoryIndexManager.probeEmbeddingAvailability():
async probeEmbeddingAvailability(): Promise<boolean> {
  try {
    await this.embedBatchWithRetry(["ping"]);
    return true;
  } catch (err) {
    return false; // Any transient failure bubbles up as "unavailable"
  }
}
Failure Sequence
$ openclaw memory status --deep
│
▼
MemoryIndexManager.status()
│
├──▶ getIndexStats() → Returns: {chunkCount: 1247, lastIndexed: ...}
│
└──▶ probeEmbeddingAvailability()
     │
     ├── Attempt 1: embedBatchWithRetry(["ping"])
     │   │
     │   ├── Network timeout (100ms) → throw
     │   │
     │   └── Returns: "unavailable" ❌
     │
     ├── (Next call)
     │   │
     │   └── Attempt 2: embedBatchWithRetry(["ping"])
     │       │
     │       ├── Success (200ms) → return embedding
     │       │
     │       └── Returns: "ready" ✅
     │
     └── Result: Flapping between ✅/❌
Why This Occurs
- No probe caching: Each `status --deep` invocation triggers a fresh live probe regardless of recent results
- No debouncing: Rapid consecutive calls each hit the remote endpoint independently
- Transient errors included: Timeout, rate limiting, DNS hiccups all report as "unavailable"
- Single probe threshold: No retry-with-backoff or consensus mechanism before reporting failure
- No state persistence for probe results: Last-known-good probe state is not recorded
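One way to keep transient errors from being reported as a hard "unavailable" is to classify the failure before deciding what to show. The following is a minimal sketch, not OpenClaw code: the `ProbeError` shape, the `classifyProbeFailure` name, and the choice of which codes count as transient are all assumptions made for illustration.

```typescript
// Hypothetical error shape: Node.js network errors carry a `code`;
// HTTP failures are assumed to carry a numeric `status`.
interface ProbeError {
  code?: string;
  status?: number;
}

type ProbeFailureKind = "transient" | "persistent";

// Codes treated as transient in this sketch (an assumption, tune per provider).
// EAI_AGAIN is a temporary DNS failure; ENOTFOUND is left out because it
// often means the host name itself is wrong.
const TRANSIENT_CODES = new Set<string>(["ETIMEDOUT", "ECONNRESET", "EAI_AGAIN"]);

function classifyProbeFailure(err: ProbeError): ProbeFailureKind {
  if (err.code !== undefined && TRANSIENT_CODES.has(err.code)) {
    return "transient";
  }
  // Rate limiting (429) and 5xx responses usually clear on their own.
  if (err.status !== undefined && (err.status === 429 || err.status >= 500)) {
    return "transient";
  }
  // Everything else (for example rejected credentials) is reported as persistent.
  return "persistent";
}
```

A caller could then retry or fall back to a last-known-good result for `"transient"` failures and report `"persistent"` ones immediately.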
Code Path Divergence
The normal indexing path has different behavior characteristics than the probe path:
// Normal indexing path (stable due to internal retry logic)
await this.embeddingClient.embedWithRetry(payload, {
  retries: 3,
  backoff: 'exponential',
  timeout: 5000,
});
// Probe path (minimal retry, short timeout)
await this.embedBatchWithRetry(["ping"], {
  retries: 1,
  timeout: 1000, // Too aggressive for remote endpoints
});
Probe Timing Analysis
| Aspect | Normal Indexing | Probe Path |
|---|---|---|
| Timeout | 5000ms | 1000ms |
| Retries | 3 | 1 |
| Backoff | Exponential | Linear |
| Payload | Actual chunks | ["ping"] |
| Context | Inside transaction | Standalone call |
This configuration mismatch means the probe path is more susceptible to transient network issues that the indexing path would handle gracefully.
Step-by-Step Fix
Phase 1: Stabilize the Probe Configuration
File: packages/core/src/memory/memory-index-manager.ts
Before:
async probeEmbeddingAvailability(): Promise<boolean> {
  try {
    await this.embedBatchWithRetry(["ping"]);
    return true;
  } catch (err) {
    this.logger.warn('Embedding probe failed', err);
    return false;
  }
}
After:
async probeEmbeddingAvailability(): Promise<boolean> {
  const probeConfig = {
    retries: 3,           // Total number of attempts
    baseTimeout: 1000,    // ms, timeout for the first attempt
    maxTimeout: 5000,     // ms, upper bound for any single attempt
    backoffMultiplier: 2,
  };
  for (let attempt = 1; attempt <= probeConfig.retries; attempt++) {
    try {
      await this.embedBatchWithRetry(["ping"], {
        // Timeout grows per attempt (1000, 2000, 4000 ms), capped at maxTimeout
        timeout: Math.min(
          probeConfig.baseTimeout * Math.pow(probeConfig.backoffMultiplier, attempt - 1),
          probeConfig.maxTimeout
        ),
      });
      return true;
    } catch (err) {
      if (attempt === probeConfig.retries) {
        this.logger.warn('Embedding probe failed after all retries', err);
        return false;
      }
      // Exponential backoff before retry (200 ms, then 400 ms);
      // assumes a delay(ms) helper exists on the class
      await this.delay(Math.pow(2, attempt) * 100);
    }
  }
  return false; // Only reached if retries is configured below 1
}
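To make the Phase 1 arithmetic concrete, the sketch below computes the timeout and backoff schedule that the loop above produces, without making any network calls. The `planProbeAttempts` helper and its types are hypothetical names introduced only for this illustration; the formulas are copied from the Phase 1 code.

```typescript
interface ProbeConfig {
  retries: number;
  baseTimeout: number;
  maxTimeout: number;
  backoffMultiplier: number;
}

interface AttemptPlan {
  attempt: number;
  timeoutMs: number;
  // Delay before the next attempt; null for the final attempt.
  delayAfterFailureMs: number | null;
}

function planProbeAttempts(config: ProbeConfig): AttemptPlan[] {
  const plan: AttemptPlan[] = [];
  for (let attempt = 1; attempt <= config.retries; attempt++) {
    // Same formula as the Phase 1 loop: exponential growth, capped at maxTimeout.
    const timeoutMs = Math.min(
      config.baseTimeout * Math.pow(config.backoffMultiplier, attempt - 1),
      config.maxTimeout
    );
    const isLast = attempt === config.retries;
    plan.push({
      attempt,
      timeoutMs,
      delayAfterFailureMs: isLast ? null : Math.pow(2, attempt) * 100,
    });
  }
  return plan;
}

const probePlan = planProbeAttempts({
  retries: 3,
  baseTimeout: 1000,
  maxTimeout: 5000,
  backoffMultiplier: 2,
});
// Timeouts: 1000, 2000, 4000 ms; delays between attempts: 200, 400 ms.
// Worst case before reporting failure: 1000 + 200 + 2000 + 400 + 4000 = 7600 ms,
// assuming each embedBatchWithRetry call issues a single request.
```

The worst-case figure matters for an interactive command: with these values a fully failing probe can hold `status --deep` for several seconds before it reports "unavailable".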
Phase 2: Add Probe Result Caching
File: packages/core/src/memory/memory-index-manager.ts
Add to class properties:
private probeCache: {
  result: boolean;
  timestamp: number;
  error?: string;
} | null = null;
private readonly PROBE_CACHE_TTL_MS = 30000; // 30 seconds
Update probe method (the Phase 1 retry loop moves into a private `performProbeWithBackoff()` helper, which this method wraps):
async probeEmbeddingAvailability(forceRefresh = false): Promise<boolean> {
  // Return cached result if fresh
  if (!forceRefresh && this.probeCache) {
    const age = Date.now() - this.probeCache.timestamp;
    if (age < this.PROBE_CACHE_TTL_MS) {
      this.logger.debug(`Using cached probe result (age: ${age}ms)`);
      return this.probeCache.result;
    }
  }
  // Perform actual probe (retry loop from Phase 1)
  const result = await this.performProbeWithBackoff();
  // Update cache
  this.probeCache = {
    result,
    timestamp: Date.now(),
  };
  return result;
}
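The TTL logic can be exercised in isolation with an injectable clock. This is a sketch under stated assumptions rather than the actual implementation: `ProbeCache`, `getFresh`, `store`, `serialize`, and `restore` are hypothetical names. If each `status --deep` invocation runs in a fresh process, an in-memory field would start empty every time, so the sketch also includes a plain JSON round trip that could be used to persist the entry; where it would be stored is left open.

```typescript
interface ProbeCacheEntry {
  result: boolean;
  timestamp: number; // Epoch milliseconds when the probe completed
  error?: string;
}

class ProbeCache {
  private entry: ProbeCacheEntry | null = null;
  private readonly ttlMs: number;
  private readonly now: () => number;

  // The clock is injectable so expiry can be tested without waiting.
  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns the cached entry if it is younger than the TTL, otherwise null.
  getFresh(): ProbeCacheEntry | null {
    if (this.entry === null) return null;
    const age = this.now() - this.entry.timestamp;
    return age < this.ttlMs ? this.entry : null;
  }

  store(result: boolean, error?: string): void {
    const timestamp = this.now();
    this.entry = error === undefined ? { result, timestamp } : { result, timestamp, error };
  }

  // Plain-data round trip so the entry could be persisted between CLI runs.
  serialize(): string {
    return JSON.stringify(this.entry);
  }

  restore(serialized: string): void {
    this.entry = JSON.parse(serialized) as ProbeCacheEntry | null;
  }
}

// Usage with a fake clock: the entry is fresh at 29999 ms and expired at 30000 ms.
let fakeNow = 0;
const demoCache = new ProbeCache(30000, () => fakeNow);
demoCache.store(true);
fakeNow = 29999;
const stillFresh = demoCache.getFresh() !== null;
fakeNow = 30000;
const expired = demoCache.getFresh() === null;
```

Storing the error message alongside the result also gives the status command something to print under "Last error", which the `probeCache` type already allows for.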
Phase 3: Separate Status Reporting
File: packages/cli/src/commands/memory/status.ts
Before (conflated output):
Output:
Embeddings: ${probeResult ? 'ready' : 'unavailable'}
Index: healthy, ${chunkCount} chunks indexed
After (separated concerns):
Output:
Index State:
Status: ${chunkCount > 0 ? 'populated' : 'empty'}
Chunks: ${chunkCount}
Last indexed: ${lastIndexedAt}
Embedding Provider:
Probe status: ${probeResult ? 'available' : 'unavailable'}
Probe age: ${cacheAge ? `${cacheAge}s ago` : 'just now'}
Last error: ${lastError || 'none'}
Warning: ${probeResult === false ? 'Probe failures may indicate provider issues, not index corruption' : ''}
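The separated layout can be produced by a small pure function, which makes it easy to check that the warning line appears only when the probe fails. The `formatStatus` function and the `IndexStats` and `ProbeReport` types below are hypothetical; only the labels and wording come from the template above.

```typescript
interface IndexStats {
  chunkCount: number;
  lastIndexedAt: string;
}

interface ProbeReport {
  available: boolean;
  ageSeconds: number | null; // null means the probe just ran
  lastError: string | null;
}

function formatStatus(index: IndexStats, probe: ProbeReport): string {
  const probeAge = probe.ageSeconds === null ? "just now" : `${probe.ageSeconds}s ago`;
  const lines = [
    "Index State:",
    `  Status: ${index.chunkCount > 0 ? "populated" : "empty"}`,
    `  Chunks: ${index.chunkCount}`,
    `  Last indexed: ${index.lastIndexedAt}`,
    "Embedding Provider:",
    `  Probe status: ${probe.available ? "available" : "unavailable"}`,
    `  Probe age: ${probeAge}`,
    `  Last error: ${probe.lastError ?? "none"}`,
  ];
  // Emit the warning only on failure, so a healthy run has no empty "Warning:" line.
  if (!probe.available) {
    lines.push("  Warning: Probe failures may indicate provider issues, not index corruption");
  }
  return lines.join("\n");
}
```

Because the index section depends only on `IndexStats`, a failed probe cannot change what is reported about the stored chunks.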
Phase 4: Add --force-probe Flag for Debugging
File: packages/cli/src/commands/memory/status.ts
command
  .option('--force-probe', 'Bypass probe cache and perform fresh availability check')
  .option('--no-cache', 'Alias for --force-probe')
// In handler
// Assuming a Commander-style parser: a negatable `--no-cache` flag sets `options.cache` to false
const forceProbe = options.forceProbe || options.cache === false;
const probeResult = await memoryManager.probeEmbeddingAvailability(forceProbe);
Phase 5: Environment Variable Control
Add to configuration:
OPENCLAW_EMBEDDING_PROBE_TIMEOUT=5000
OPENCLAW_EMBEDDING_PROBE_RETRIES=3
OPENCLAW_EMBEDDING_PROBE_CACHE_TTL=30000
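A hedged sketch of how these variables might be read: the variable names and default values come from the list above, while `readProbeSettings`, `parsePositiveInt`, and the fall-back-on-invalid-input behavior are assumptions for illustration.

```typescript
interface ProbeSettings {
  timeoutMs: number;
  retries: number;
  cacheTtlMs: number;
}

const PROBE_DEFAULTS: ProbeSettings = { timeoutMs: 5000, retries: 3, cacheTtlMs: 30000 };

// Parses a positive integer, falling back to the default on missing or invalid input.
function parsePositiveInt(raw: string | undefined, fallback: number): number {
  if (raw === undefined) return fallback;
  const value = Number(raw);
  return Number.isInteger(value) && value > 0 ? value : fallback;
}

// Takes the environment as a parameter so it can be tested without touching process.env.
function readProbeSettings(env: Record<string, string | undefined>): ProbeSettings {
  return {
    timeoutMs: parsePositiveInt(env["OPENCLAW_EMBEDDING_PROBE_TIMEOUT"], PROBE_DEFAULTS.timeoutMs),
    retries: parsePositiveInt(env["OPENCLAW_EMBEDDING_PROBE_RETRIES"], PROBE_DEFAULTS.retries),
    cacheTtlMs: parsePositiveInt(env["OPENCLAW_EMBEDDING_PROBE_CACHE_TTL"], PROBE_DEFAULTS.cacheTtlMs),
  };
}
```

In a Node.js process this would be called as `readProbeSettings(process.env)`. Falling back silently keeps a typo in the environment from disabling the probe, at the cost of hiding the typo; logging a warning on invalid input would be a reasonable alternative.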
Complete Fix Sequence
# 1. Navigate to relevant source directory
cd packages/core/src/memory
# 2. Apply Phase 1-2 changes to memory-index-manager.ts
# 3. Apply Phase 3-4 changes to status command
# 4. Run tests
npm run test -- --grep "probeEmbeddingAvailability"
# 5. Verify behavior
openclaw memory status --deep
openclaw memory status --deep --force-probe
openclaw memory status --deep --trace 2>&1 | grep -E "(probe|cache)"
Verification
Verification Steps
Step 1: Verify stable output across multiple calls
$ # Run 5 consecutive status checks
$ for i in {1..5}; do
    openclaw memory status --deep | grep -E "(Status:|Probe status:)"
  done
# Expected: All 5 runs should show consistent results
Status: populated
Probe status: available
Status: populated
Probe status: available
Status: populated
Probe status: available
Status: populated
Probe status: available
Status: populated
Probe status: available
Step 2: Confirm probe caching with trace output
$ openclaw memory status --deep --trace 2>&1 | grep -i "probe\|cache"
[probe] Performing fresh availability check
[probe] Result: success, cached for 30s
$ # Immediate second call should use cache
$ openclaw memory status --deep --trace 2>&1 | grep -i "probe\|cache"
[probe] Using cached probe result (age: 0ms)
[cache] Result: available
Step 3: Verify force-probe bypasses cache
$ openclaw memory status --deep --force-probe --trace 2>&1 | grep -i "probe"
[probe] Performing fresh availability check
[probe] Bypassing cache
[probe] Result: success
Step 4: Confirm separate index/probe reporting
$ openclaw memory status --deep
=== Memory Index State ===
Status: populated
Chunks: 1247
Last indexed: 2026-01-15T10:32:18Z
=== Embedding Provider ===
Probe status: available
Probe age: 2s ago
Last error: none
Warning: Probe failures indicate provider unavailability, not index corruption
$ # Verify index details match actual SQLite data
$ sqlite3 ~/.openclaw/agents/my-agent/memory.db "SELECT COUNT(*) FROM chunks;"
1247
Step 5: Test retry behavior under simulated failures
$ # Enable debug logging
$ OPENCLAW_LOG_LEVEL=debug openclaw memory status --deep 2>&1 | grep -E "probe|retry|backoff"
[probe] Attempt 1 failed: fetch timeout
[probe] Retrying with backoff: 200ms
[probe] Attempt 2 failed: fetch timeout
[probe] Retrying with backoff: 400ms
[probe] Attempt 3: success
[probe] Final result: available (after retries)
Expected Test Results
| Test | Expected Behavior | Pass Criteria |
|---|---|---|
| Consecutive calls | Consistent output | 0 flapping in 5 runs |
| Cache hit | Second call within the TTL reuses the cached result | [probe] Using cached probe result in trace |
| Force probe | Fresh probe executed | Bypassing cache in trace |
| Status separation | Index and probe sections | Independent health signals |
| Retry under failure | 3 attempts with backoff | Attempt 1/2/3 in trace |
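The "0 flapping in 5 runs" criterion can be checked mechanically by counting state changes between consecutive runs. The `countFlaps` helper below is a hypothetical illustration; the first sample sequence mirrors the CLI execution pattern from the Symptoms section.

```typescript
// Counts how many times the reported state changes between consecutive runs.
function countFlaps(states: readonly string[]): number {
  let flaps = 0;
  for (let i = 1; i < states.length; i++) {
    if (states[i] !== states[i - 1]) flaps++;
  }
  return flaps;
}

// Four runs from the Symptoms section: three state changes.
const flappingRuns = ["ready", "unavailable", "ready", "fetch failed"];
// Five runs after the fix: no state changes.
const stableRuns = ["available", "available", "available", "available", "available"];

const flapsBeforeFix = countFlaps(flappingRuns); // 3
const flapsAfterFix = countFlaps(stableRuns); // 0
```

In practice the input would come from capturing the "Probe status" line of each run in the Step 1 loop.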
Exit Code Verification
$ openclaw memory status --deep
$ echo "Exit code: $?"
Exit code: 0
$ # Even with transient probe failure, index section should succeed
$ # Exit code should only be non-zero for actual index corruption
Common Pitfalls
Environment-Specific Traps
- macOS Socket Exhaustion: Rapid consecutive `status --deep` calls on macOS may encounter socket limit issues
# Fix: Increase ulimit before testing
ulimit -n 10240
openclaw memory status --deep
- Docker Network Isolation: When running inside Docker, the embedding endpoint may be unreachable from the container but accessible from the host
# Verify network reachability
docker exec <container> curl -v https://api.provider.com/v1/embeddings
- Proxy Environment Variables: Some environments have proxy settings that affect embedding requests
# Check for proxy interference
echo $HTTP_PROXY
echo $HTTPS_PROXY
echo $NO_PROXY
# Ensure embedding endpoint is in NO_PROXY or proxies are properly configured
Configuration Missteps
- Overlapping TTLs: Setting `OPENCLAW_EMBEDDING_PROBE_CACHE_TTL` shorter than the typical interval between status calls means nearly every call triggers a fresh probe, which reintroduces flapping
# Bad: TTL too short for network variability
OPENCLAW_EMBEDDING_PROBE_CACHE_TTL=1000
# Good: Allow for transient network jitter
OPENCLAW_EMBEDDING_PROBE_CACHE_TTL=30000
- Timeout Less Than Network Latency: Probe timeout set below typical round-trip time
# Verify typical latency first
curl -w "%{time_total}\n" -o /dev/null -s https://api.provider.com/v1/embeddings
# If latency is 800ms, set timeout higher
OPENCLAW_EMBEDDING_PROBE_TIMEOUT=5000
- Rate Limit Conflicts: Provider rate limiting kicking in before retry exhaustion
# Check for rate limit headers in probe responses
openclaw memory status --deep --trace 2>&1 | grep -i "rate\|429\|limit"
Debugging Misconceptions
- Assuming Index Corruption: Probe failure is not definitive proof of index problems
# Always check index state separately
openclaw memory status --deep | grep -A5 "Index State"
# If chunks exist, index is not corrupted
- Ignoring Cache State: Forgetting that cached probe results may mask recent probe failures
# Always use --force-probe when debugging probe behavior
openclaw memory status --deep --force-probe --trace
- Single Probe Determinism: Expecting one probe failure to definitively indicate provider problems
# The fix adds retry logic - verify it triggers
OPENCLAW_LOG_LEVEL=debug openclaw memory status --deep
# Should see "Attempt 1/2/3" messages
Node.js Version Considerations
- HTTP/2 Connection Reuse: Node.js 22.x improved HTTP/2 handling which may affect probe behavior differently than earlier versions
- Keep-Alive Behavior: Verify connection pooling is working correctly
NODE_OPTIONS='--trace-warnings' openclaw memory status --deep 2>&1 | grep -i "socket\|connection"
Related Errors
Related Issues in OpenClaw Repository
- #41770 - Proxy/env routing fix for memory remote embeddings (separate but related transport layer fix)
- #30075 - Original proxy/env handling issue for embedding requests
Logically Connected Error Patterns
| Error Type | Manifestation | Related Component |
|---|---|---|
| fetch failed | Probe returns fetch error | embedBatchWithRetry() |
| unavailable | Probe returns false | probeEmbeddingAvailability() |
| timeout | Request exceeds threshold | Transport layer |
| rate limited | 429 response from provider | Provider API layer |
| ECONNREFUSED | Endpoint unreachable | Network layer |
| ENOTFOUND | DNS resolution failure | Network layer |
Historical Context
This issue stems from the architectural decision to conflate:
- Index persistence state (stored chunks, SQLite integrity)
- Provider availability state (live probe result)
The original probeEmbeddingAvailability() was designed for simple availability checking but became problematic when:
- Remote embedding endpoints became standard
- Network variability increased
- No caching/retry logic was implemented for the probe path
Distinction from #30075/#41770
| Aspect | #30075/#41770 (Proxy Issue) | This Issue (Probe Flapping) |
|---|---|---|
| Layer | Transport/HTTP routing | Status reporting logic |
| Symptom | All embeddings fail | Flapping between states |
| Scope | All embedding operations | Only status --deep probe |
| Fix target | Embedding client | MemoryIndexManager |
Associated Debugging Commands
# Check embedding client configuration
openclaw config get embeddings.endpoint
openclaw config get embeddings.provider
# Verify probe cache state
openclaw memory status --deep --trace 2>&1 | grep -i cache
# Force fresh probe
openclaw memory status --deep --force-probe --trace
# Direct embedding test (circumvent probe)
openclaw embeddings generate --text "test"
# Check SQLite index integrity
sqlite3 ~/.openclaw/agents/<agent>/memory.db "PRAGMA integrity_check;"