April 24, 2026 • Version: 2026.x

Memory Status --deep Embedding Readiness Probe Flapping

The `openclaw memory status --deep` command reports inconsistent embedding availability because every invocation runs a live probe against the remote endpoint. Transient network failures are therefore reported as if they reflected the persistent health of the index.

🔍 Symptoms

The openclaw memory status --deep command exhibits non-deterministic output for the same agent over successive invocations:

$ openclaw memory status --deep --agent my-agent
[...memory index details...]
Embeddings: ready
Index: healthy, 1,247 chunks indexed

$ openclaw memory status --deep --agent my-agent
[...memory index details...]
Embeddings: unavailable
Index: healthy, 1,247 chunks indexed

$ openclaw memory status --deep --agent my-agent
[...memory index details...]
Embeddings error: fetch failed
Index: healthy, 1,247 chunks indexed

Observed Characteristics

  • Alternating states: Same agent shows `ready` → `unavailable` → `fetch failed` across sequential calls
  • Index integrity preserved: SQLite-backed memory store contains populated data (verified via direct queries)
  • Direct endpoint works: Manual curl/requests to remote embedding endpoint succeed without issues
  • Chunk integrity confirmed: Reconstructed chunk payloads embed successfully against the same endpoint

CLI Execution Pattern

$ # Consecutive runs within seconds
$ openclaw memory status --deep
Status: ready        # Run 1
$ openclaw memory status --deep
Status: unavailable # Run 2
$ openclaw memory status --deep
Status: ready        # Run 3
$ openclaw memory status --deep
Status: fetch failed # Run 4

Diagnostic Evidence

$ # Verify index persistence independently
$ sqlite3 ~/.openclaw/agents/my-agent/memory.db "SELECT COUNT(*) FROM chunks;"
1247

$ # Verify direct embedding capability
$ curl -X POST https://api.provider.com/v1/embeddings \
  -H "Authorization: Bearer $EMBEDDING_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input":"ping","model":"embedding-model"}'
{"embedding":[...],"model":"embedding-model"}

$ # Probe via OpenClaw (flapping behavior)
$ openclaw memory status --deep --trace 2>&1 | grep -i embed
[probe] Calling embedBatchWithRetry(["ping"])
[probe] Result: success  # Run 1
[probe] Result: failure  # Run 2
[probe] Result: success  # Run 3

🧠 Root Cause

Architectural Failure: Probe/State Conflation

The MemoryIndexManager class conflates two distinct concerns:

  1. Persistence Layer: SQLite-backed index with verified chunk counts and content integrity
  2. Probe Layer: Live HTTP request to remote embedding endpoint for readiness confirmation

The critical code path in MemoryIndexManager.probeEmbeddingAvailability():

async probeEmbeddingAvailability(): Promise<boolean> {
  try {
    await this.embedBatchWithRetry(["ping"]);
    return true;
  } catch (err) {
    return false; // Any transient failure bubbles up as "unavailable"
  }
}

Failure Sequence

openclaw memory status --deep
           │
           ▼
MemoryIndexManager.status()
           │
           ├──▶ getIndexStats()     → Returns: {chunkCount: 1247, lastIndexed: ...}
           │
           └──▶ probeEmbeddingAvailability()
                        │
                        ├── Attempt 1: embedBatchWithRetry(["ping"])
                        │              │
                        │              ├── Network timeout (100ms) → throw
                        │              │
                        │              └── Returns: "unavailable" ❌
                        │
                        └── (Next call)
                                     │
                                     ├── Attempt 2: embedBatchWithRetry(["ping"])
                                     │              │
                                     │              ├── Success (200ms) → return embedding
                                     │              │
                                     │              └── Returns: "ready" ✓
                                     │
                                     └── Result: Flapping between ❌/✓

Why This Occurs

  • No probe caching: Each `status --deep` invocation triggers a fresh live probe regardless of recent results
  • No debouncing: Rapid consecutive calls each hit the remote endpoint independently
  • Transient errors included: Timeout, rate limiting, DNS hiccups all report as "unavailable"
  • Single probe threshold: The probe path allows at most one retry with a short timeout, and no consensus mechanism is applied before reporting failure
  • No state persistence for probe results: Last-known-good probe state is not recorded
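One way to stop transient errors from being reported as plain "unavailable" is to classify the failure before deciding what to report. The following is a minimal sketch, not the project's implementation: the `ProbeOutcome` type, the `ProbeError` shape, and `classifyProbeError` are hypothetical names introduced for illustration. The error codes are standard Node.js system error codes; which of them count as transient is a judgment call.

```typescript
// Hypothetical three-way probe outcome; the original method returns only a boolean.
type ProbeOutcome = "ready" | "transient-failure" | "persistent-failure";

// Fields this sketch inspects (an assumption, not the real embedding client's error type).
interface ProbeError {
  code?: string;   // Node.js system error code, e.g. "ETIMEDOUT"
  status?: number; // HTTP status, if the provider responded at all
}

// Node.js error codes that usually indicate a temporary network condition.
const TRANSIENT_CODES = new Set(["ETIMEDOUT", "ECONNRESET", "EAI_AGAIN"]);

function classifyProbeError(err: ProbeError | null): ProbeOutcome {
  if (err === null) return "ready";
  if (err.code !== undefined && TRANSIENT_CODES.has(err.code)) {
    return "transient-failure";
  }
  // 429 (rate limited) and 5xx responses are typically worth retrying.
  if (err.status === 429 || (err.status !== undefined && err.status >= 500)) {
    return "transient-failure";
  }
  // Everything else (for example 401, ECONNREFUSED, ENOTFOUND) is treated as persistent here.
  return "persistent-failure";
}
```

With a classification like this, the status command could keep reporting the last known good state on a transient failure and reserve "unavailable" for persistent ones.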

Code Path Divergence

The normal indexing path has different behavior characteristics than the probe path:

// Normal indexing path (stable due to internal retry logic)
await this.embeddingClient.embedWithRetry(payload, {
  retries: 3,
  backoff: 'exponential',
  timeout: 5000,
});

// Probe path (minimal retry, short timeout)
await this.embedBatchWithRetry(["ping"], {
  retries: 1,
  timeout: 1000,  // Too aggressive for remote endpoints
});

Probe Timing Analysis

| Aspect | Normal Indexing | Probe Path |
|---|---|---|
| Timeout | 5000ms | 1000ms |
| Retries | 3 | 1 |
| Backoff | Exponential | Linear |
| Payload | Actual chunks | ["ping"] |
| Context | Inside transaction | Standalone call |

This configuration mismatch means the probe path is more susceptible to transient network issues that the indexing path would handle gracefully.

🛠️ Step-by-Step Fix

Phase 1: Stabilize the Probe Configuration

File: packages/core/src/memory/memory-index-manager.ts

Before:

async probeEmbeddingAvailability(): Promise<boolean> {
  try {
    await this.embedBatchWithRetry(["ping"]);
    return true;
  } catch (err) {
    this.logger.warn('Embedding probe failed', err);
    return false;
  }
}

After:

async probeEmbeddingAvailability(): Promise<boolean> {
  const probeConfig = {
    retries: 3,
    baseTimeout: 1000,
    maxTimeout: 5000,
    backoffMultiplier: 2,
  };

  for (let attempt = 1; attempt <= probeConfig.retries; attempt++) {
    try {
      await this.embedBatchWithRetry(["ping"], {
        timeout: Math.min(
          probeConfig.baseTimeout * Math.pow(probeConfig.backoffMultiplier, attempt - 1),
          probeConfig.maxTimeout
        ),
      });
      return true;
    } catch (err) {
      if (attempt === probeConfig.retries) {
        this.logger.warn('Embedding probe failed after all retries', err);
        return false;
      }
      // Exponential backoff before retry
      await this.delay(Math.pow(2, attempt) * 100);
    }
  }
  return false;
}
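To see what the Phase 1 settings mean in practice, the per-attempt timeouts and backoff delays can be computed ahead of time. The sketch below reuses the formulas from the loop above; `planProbeAttempts` and `worstCaseMs` are hypothetical helpers, and the worst case assumes `embedBatchWithRetry` does not add retries of its own.

```typescript
interface ProbeConfig {
  retries: number;
  baseTimeout: number;
  maxTimeout: number;
  backoffMultiplier: number;
}

interface AttemptPlan {
  attempt: number;
  timeoutMs: number;
  delayAfterMs: number; // 0 for the final attempt, which is not followed by a wait
}

// Mirror the timeout and delay formulas used in the Phase 1 retry loop.
function planProbeAttempts(cfg: ProbeConfig): AttemptPlan[] {
  const plan: AttemptPlan[] = [];
  for (let attempt = 1; attempt <= cfg.retries; attempt++) {
    const timeoutMs = Math.min(
      cfg.baseTimeout * Math.pow(cfg.backoffMultiplier, attempt - 1),
      cfg.maxTimeout
    );
    const delayAfterMs = attempt === cfg.retries ? 0 : Math.pow(2, attempt) * 100;
    plan.push({ attempt, timeoutMs, delayAfterMs });
  }
  return plan;
}

// Total time spent if every attempt runs until its timeout expires.
function worstCaseMs(plan: AttemptPlan[]): number {
  return plan.reduce((sum, p) => sum + p.timeoutMs + p.delayAfterMs, 0);
}

const phase1Plan = planProbeAttempts({
  retries: 3,
  baseTimeout: 1000,
  maxTimeout: 5000,
  backoffMultiplier: 2,
});
console.log(phase1Plan.map((p) => p.timeoutMs).join(", ")); // 1000, 2000, 4000
console.log(worstCaseMs(phase1Plan)); // 7600
```

Under these assumptions a fully failing probe takes about 7.6 seconds (1000 + 200 + 2000 + 400 + 4000 ms), which is worth keeping in mind when choosing a cache TTL.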

Phase 2: Add Probe Result Caching

File: packages/core/src/memory/memory-index-manager.ts

Add to class properties:

private probeCache: {
  result: boolean;
  timestamp: number;
  error?: string;
} | null = null;

private readonly PROBE_CACHE_TTL_MS = 30000; // 30 seconds

Update probe method:

async probeEmbeddingAvailability(forceRefresh = false): Promise<boolean> {
  // Return cached result if fresh
  if (!forceRefresh && this.probeCache) {
    const age = Date.now() - this.probeCache.timestamp;
    if (age < this.PROBE_CACHE_TTL_MS) {
      this.logger.debug(`Using cached probe result (age: ${age}ms)`);
      return this.probeCache.result;
    }
  }

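  // Perform the actual probe. performProbeWithBackoff() is assumed here to be
  // the Phase 1 retry loop extracted into its own private method.
  const result = await this.performProbeWithBackoff();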

  // Update cache
  this.probeCache = {
    result,
    timestamp: Date.now(),
  };

  return result;
}
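One caveat: a cache held in an instance property lives only as long as the process. If each `openclaw memory status --deep` invocation starts a new process, the cache is empty on every run unless the result is also persisted somewhere. The sketch below shows only the serialization and freshness check as pure functions, under the assumption that the caller reads and writes the string to a location of its choosing; `serializeProbeCache` and `loadFreshProbeCache` are hypothetical names.

```typescript
interface ProbeCacheEntry {
  result: boolean;
  timestamp: number; // epoch milliseconds
  error?: string;
}

// Serialize an entry for storage; where it is stored is left to the caller.
function serializeProbeCache(entry: ProbeCacheEntry): string {
  return JSON.stringify(entry);
}

// Parse a stored entry and return it only if it is well-formed and still fresh.
function loadFreshProbeCache(
  raw: string,
  nowMs: number,
  ttlMs: number
): ProbeCacheEntry | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null; // corrupt cache content: fall back to a live probe
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const candidate = parsed as Partial<ProbeCacheEntry>;
  if (typeof candidate.result !== "boolean" || typeof candidate.timestamp !== "number") {
    return null;
  }
  const age = nowMs - candidate.timestamp;
  if (age < 0 || age >= ttlMs) return null; // clock skew or expired entry
  return {
    result: candidate.result,
    timestamp: candidate.timestamp,
    ...(typeof candidate.error === "string" ? { error: candidate.error } : {}),
  };
}
```

Passing the current time in as a parameter keeps the freshness check deterministic and easy to test.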

Phase 3: Separate Status Reporting

File: packages/cli/src/commands/memory/status.ts

Before (conflated output):

Output:
  Embeddings: ${probeResult ? 'ready' : 'unavailable'}
  Index: healthy, ${chunkCount} chunks indexed

After (separated concerns):

Output:
  Index State:
    Status: ${chunkCount > 0 ? 'populated' : 'empty'}
    Chunks: ${chunkCount}
    Last indexed: ${lastIndexedAt}

  Embedding Provider:
    Probe status: ${probeResult ? 'available' : 'unavailable'}
    Probe age: ${cacheAge ? `${cacheAge}s ago` : 'just now'}
    Last error: ${lastError || 'none'}
    Warning: ${probeResult === false ? 'Probe failures may indicate provider issues, not index corruption' : ''}
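The separated layout can be produced by a small rendering function. This is a sketch under assumptions: `StatusView` and `renderStatus` are hypothetical names, and unlike the template above it omits the Warning line entirely when the probe succeeds, rather than printing an empty label.

```typescript
interface StatusView {
  chunkCount: number;
  lastIndexedAt: string;
  probeResult: boolean;
  cacheAgeSeconds: number | null; // null when the probe ran during this call
  lastError: string | null;
}

function renderStatus(v: StatusView): string {
  const probeAge =
    v.cacheAgeSeconds !== null ? `${v.cacheAgeSeconds}s ago` : "just now";
  const lines = [
    "Index State:",
    `  Status: ${v.chunkCount > 0 ? "populated" : "empty"}`,
    `  Chunks: ${v.chunkCount}`,
    `  Last indexed: ${v.lastIndexedAt}`,
    "",
    "Embedding Provider:",
    `  Probe status: ${v.probeResult ? "available" : "unavailable"}`,
    `  Probe age: ${probeAge}`,
    `  Last error: ${v.lastError ?? "none"}`,
  ];
  // Only show the warning when the probe actually failed.
  if (!v.probeResult) {
    lines.push(
      "  Warning: Probe failures may indicate provider issues, not index corruption"
    );
  }
  return lines.join("\n");
}
```

Keeping the index section independent of `probeResult` means a failed probe can never change what is reported about stored chunks.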

Phase 4: Add --force-probe Flag for Debugging

File: packages/cli/src/commands/memory/status.ts

command
  .option('--force-probe', 'Bypass probe cache and perform fresh availability check')
  .option('--no-cache', 'Alias for --force-probe')

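// In handler
// Note: if the CLI is built on Commander, a '--no-cache' option is exposed as
// options.cache === false (default true), not as options.noCache.
const forceProbe = Boolean(options.forceProbe) || options.cache === false;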
const probeResult = await memoryManager.probeEmbeddingAvailability(forceProbe);

Phase 5: Environment Variable Control

Add to configuration:

OPENCLAW_EMBEDDING_PROBE_TIMEOUT=5000
OPENCLAW_EMBEDDING_PROBE_RETRIES=3
OPENCLAW_EMBEDDING_PROBE_CACHE_TTL=30000
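These variables still have to be read and validated somewhere. A minimal sketch of that step follows, assuming the three variable names above and the defaults used in Phases 1 and 2; `ProbeSettings`, `positiveInt`, and `readProbeSettings` are hypothetical names, and how the real configuration loader handles invalid values is not specified by the source.

```typescript
interface ProbeSettings {
  timeoutMs: number;
  retries: number;
  cacheTtlMs: number;
}

// Defaults taken from the values shown in this guide.
const DEFAULT_PROBE_SETTINGS: ProbeSettings = {
  timeoutMs: 5000,
  retries: 3,
  cacheTtlMs: 30000,
};

// Parse a positive integer, falling back to the default for missing or invalid values.
function positiveInt(raw: string | undefined, fallback: number): number {
  if (raw === undefined) return fallback;
  const n = Number(raw);
  return Number.isInteger(n) && n > 0 ? n : fallback;
}

function readProbeSettings(env: Record<string, string | undefined>): ProbeSettings {
  return {
    timeoutMs: positiveInt(
      env["OPENCLAW_EMBEDDING_PROBE_TIMEOUT"],
      DEFAULT_PROBE_SETTINGS.timeoutMs
    ),
    retries: positiveInt(
      env["OPENCLAW_EMBEDDING_PROBE_RETRIES"],
      DEFAULT_PROBE_SETTINGS.retries
    ),
    cacheTtlMs: positiveInt(
      env["OPENCLAW_EMBEDDING_PROBE_CACHE_TTL"],
      DEFAULT_PROBE_SETTINGS.cacheTtlMs
    ),
  };
}
```

In a Node.js process this would be called as `readProbeSettings(process.env)`. Taking the environment as a parameter keeps the function testable without touching global state.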

Complete Fix Sequence

# 1. Navigate to relevant source directory
cd packages/core/src/memory

# 2. Apply Phase 1-2 changes to memory-index-manager.ts
# 3. Apply Phase 3-4 changes to status command
# 4. Run tests
npm run test -- --grep "probeEmbeddingAvailability"

# 5. Verify behavior
openclaw memory status --deep
openclaw memory status --deep --force-probe
openclaw memory status --deep --trace 2>&1 | grep -E "(probe|cache)"

🧪 Verification

Verification Steps

Step 1: Verify stable output across multiple calls

$ # Run 5 consecutive status checks
$ for i in {1..5}; do
    openclaw memory status --deep | grep -E "(Status:|Probe status:)"
  done

# Expected: All 5 runs should show consistent results
Status: populated
Probe status: available

Status: populated
Probe status: available

Status: populated
Probe status: available

Status: populated
Probe status: available

Status: populated
Probe status: available

Step 2: Confirm probe caching with trace output

$ openclaw memory status --deep --trace 2>&1 | grep -i "probe\|cache"
[probe] Performing fresh availability check
[probe] Result: success, cached for 30s

$ # Immediate second call should use cache
$ openclaw memory status --deep --trace 2>&1 | grep -i "probe\|cache"
[probe] Using cached probe result (age: 0ms)
[cache] Result: available

Step 3: Verify force-probe bypasses cache

$ openclaw memory status --deep --force-probe --trace 2>&1 | grep -i "probe"
[probe] Performing fresh availability check
[probe] Bypassing cache
[probe] Result: success

Step 4: Confirm separate index/probe reporting

$ openclaw memory status --deep

=== Memory Index State ===
Status: populated
Chunks: 1247
Last indexed: 2026-01-15T10:32:18Z

=== Embedding Provider ===
Probe status: available
Probe age: 2s ago
Last error: none
Warning: Probe failures indicate provider unavailability, not index corruption

$ # Verify index details match actual SQLite data
$ sqlite3 ~/.openclaw/agents/my-agent/memory.db "SELECT COUNT(*) FROM chunks;"
1247

Step 5: Test retry behavior under simulated failures

$ # Enable debug logging
$ OPENCLAW_LOG_LEVEL=debug openclaw memory status --deep 2>&1 | grep -E "probe|retry|backoff"

[probe] Attempt 1 failed: fetch timeout
[probe] Retrying with backoff: 200ms
[probe] Attempt 2 failed: fetch timeout
[probe] Retrying with backoff: 400ms
[probe] Attempt 3: success
[probe] Final result: available (after retries)

Expected Test Results

| Test | Expected Behavior | Pass Criteria |
|---|---|---|
| Consecutive calls | Consistent output | 0 flapping in 5 runs |
| Cache hit | 0ms probe age in trace | [probe] Using cached result |
| Force probe | Fresh probe executed | Bypassing cache in trace |
| Status separation | Index and probe sections | Independent health signals |
| Retry under failure | 3 attempts with backoff | Attempt 1/2/3 in trace |

Exit Code Verification

$ openclaw memory status --deep
$ echo "Exit code: $?"
Exit code: 0

$ # Even with transient probe failure, index section should succeed
$ # Exit code should only be non-zero for actual index corruption
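The exit code rule in the comments above can be written down as a one-line mapping. This is a sketch only: `IndexHealth`, `StatusResult`, and `exitCodeFor` are hypothetical names, and the specific non-zero value is an assumption.

```typescript
type IndexHealth = "healthy" | "empty" | "corrupt";

interface StatusResult {
  indexHealth: IndexHealth;
  probeAvailable: boolean;
}

// The probe result is deliberately not consulted: a transient provider
// failure should not make scripts treat the index as broken.
function exitCodeFor(status: StatusResult): number {
  return status.indexHealth === "corrupt" ? 1 : 0;
}
```

Scripts that need to gate on provider availability can parse the probe section of the output instead of relying on the exit code.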

⚠️ Common Pitfalls

Environment-Specific Traps

  • macOS Socket Exhaustion: Rapid consecutive `status --deep` calls on macOS may encounter socket limit issues
    # Fix: Increase ulimit before testing
    ulimit -n 10240
    openclaw memory status --deep
    
  • Docker Network Isolation: When running inside Docker, the embedding endpoint may be unreachable from container but accessible from host
    # Verify network reachability
    docker exec <container> curl -v https://api.provider.com/v1/embeddings
    
  • Proxy Environment Variables: Some environments have proxy settings that affect embedding requests
    # Check for proxy interference
    echo $HTTP_PROXY
    echo $HTTPS_PROXY
    echo $NO_PROXY
    # Ensure embedding endpoint is in NO_PROXY or proxies are properly configured
    

Configuration Missteps

  • Overlapping TTLs: Setting `OPENCLAW_EMBEDDING_PROBE_CACHE_TTL` so low that nearly every call triggers a fresh probe reintroduces the flapping the cache is meant to prevent
    # Bad: TTL too short for network variability
    OPENCLAW_EMBEDDING_PROBE_CACHE_TTL=1000
    # Good: Allow for transient network jitter
    OPENCLAW_EMBEDDING_PROBE_CACHE_TTL=30000
    

  • Timeout Less Than Network Latency: Probe timeout set below typical round-trip time
    # Verify typical latency first
    curl -w "%{time_total}\n" -o /dev/null -s https://api.provider.com/v1/embeddings
    # If latency is 800ms, set timeout higher
    OPENCLAW_EMBEDDING_PROBE_TIMEOUT=5000
    
  • Rate Limit Conflicts: Provider rate limiting kicking in before retry exhaustion
    # Check for rate limit headers in probe responses
    openclaw memory status --deep --trace 2>&1 | grep -i "rate\|429\|limit"
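When rate limiting is the cause, a fixed backoff may retry before the provider is ready to accept requests again. Many HTTP APIs send the standard `Retry-After` header with a 429 response, either as a number of seconds or as an HTTP date. Whether the embedding provider in use does so is an assumption to verify. The sketch below converts the header into a wait time; `retryAfterMs` is a hypothetical helper.

```typescript
// Convert a Retry-After header value into a wait time in milliseconds.
// Returns null when the header is absent or cannot be parsed.
function retryAfterMs(header: string | null, nowMs: number): number | null {
  if (header === null) return null;
  const trimmed = header.trim();
  // delay-seconds form, e.g. "Retry-After: 2"
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  // HTTP-date form, e.g. "Retry-After: Wed, 21 Oct 2015 07:28:00 GMT"
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return null;
  // A date in the past means the caller may retry immediately.
  return Math.max(0, dateMs - nowMs);
}
```

A retry loop could use this value in place of its computed backoff when it is present, and fall back to the exponential delay when the function returns null.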
    

Debugging Misconceptions

  • Assuming Index Corruption: Probe failure is not definitive proof of index problems
    # Always check index state separately
    openclaw memory status --deep | grep -A5 "Index State"
    # If chunks exist, index is not corrupted
    
  • Ignoring Cache State: Forgetting that cached probe results may mask recent probe failures
    # Always use --force-probe when debugging probe behavior
    openclaw memory status --deep --force-probe --trace
    
  • Single Probe Determinism: Expecting one probe failure to definitively indicate provider problems
    # The fix adds retry logic - verify it triggers
    OPENCLAW_LOG_LEVEL=debug openclaw memory status --deep
    # Should see "Attempt 1/2/3" messages
    

Node.js Version Considerations

  • HTTP/2 Connection Reuse: Node.js 22.x improved HTTP/2 handling which may affect probe behavior differently than earlier versions
  • Keep-Alive Behavior: Verify connection pooling is working correctly
    NODE_OPTIONS='--trace-warnings' openclaw memory status --deep 2>&1 | grep -i "socket\|connection"
    
Related Issues

  • #41770 - Proxy/env routing fix for memory remote embeddings (separate but related transport layer fix)
  • #30075 - Original proxy/env handling issue for embedding requests

Logically Connected Error Patterns

| Error Type | Manifestation | Related Component |
|---|---|---|
| fetch failed | Probe returns fetch error | embedBatchWithRetry() |
| unavailable | Probe returns false | probeEmbeddingAvailability() |
| timeout | Request exceeds threshold | Transport layer |
| rate limited | 429 response from provider | Provider API layer |
| ECONNREFUSED | Endpoint unreachable | Network layer |
| ENOTFOUND | DNS resolution failure | Network layer |

Historical Context

This issue stems from the architectural decision to conflate:

  1. Index persistence state (stored chunks, SQLite integrity)
  2. Provider availability state (live probe result)

The original probeEmbeddingAvailability() was designed for simple availability checking but became problematic when:

  • Remote embedding endpoints became standard
  • Network variability increased
  • No caching/retry logic was implemented for the probe path

Distinction from #30075/#41770

| Aspect | #30075/#41770 (Proxy Issue) | This Issue (Probe Flapping) |
|---|---|---|
| Layer | Transport/HTTP routing | Status reporting logic |
| Symptom | All embeddings fail | Flapping between states |
| Scope | All embedding operations | Only status --deep probe |
| Fix target | Embedding client | MemoryIndexManager |

Associated Debugging Commands

# Check embedding client configuration
openclaw config get embeddings.endpoint
openclaw config get embeddings.provider

# Verify probe cache state
openclaw memory status --deep --trace 2>&1 | grep -i cache

# Force fresh probe
openclaw memory status --deep --force-probe --trace

# Direct embedding test (bypasses the probe)
openclaw embeddings generate --text "test"

# Check SQLite index integrity
sqlite3 ~/.openclaw/agents/<agent>/memory.db "PRAGMA integrity_check;"

Evidence & Sources

This troubleshooting guide was automatically synthesized by the FixClaw Intelligence Pipeline from community discussions.