April 13, 2026

Gateway crashes or becomes unresponsive after in-process self-upgrade due to live runtime file replacement

Self-upgrading the OpenClaw gateway by replacing package files under the active running process creates a self-corruption window that manifests as missing hashed chunks, stale dynamic import paths, and lost environment-backed secrets.

🔍 Symptom

After initiating an OpenClaw self-upgrade, the gateway exhibits one or more of the following failure modes:

  • Missing hashed chunks: The gateway crashes or returns errors after the dist/ directory is replaced, reporting missing or unresolvable hashed asset chunks.
  • Stale dynamic import paths: Runtime path resolution fails post-upgrade because the process retains cached import graph references to the old package tree.
  • Lost LaunchAgent environment secrets: After plist regeneration, required EnvironmentVariables keys are absent, causing the gateway to fail startup or operate in a degraded state.
  • Half-restarted runtime probing: Post-upgrade health checks or doctor commands run against a process that is still restarting, so their results are unreliable and recovery attempts based on them can make the situation worse.
  • 502 responses on HTTP endpoints: Gateway HTTP endpoints become unreachable or return 502 Bad Gateway immediately following an upgrade.
  • ACP launcher path resolution failure: Bundled ACP extensions are resolved from the wrong path after the update completes, breaking plugin functionality.

These symptoms often appear after the package installation phase reports success, leading users to believe the upgrade completed when the system is actually unhealthy.

🧠 Principle

The root cause is architectural: the serving process mutates the files it has imported. OpenClaw's current update flow performs in-process package replacement, which violates a fundamental invariant: a long-running process cannot safely replace files in its own import tree.

Technical failure sequence:

  1. The live gateway process holds open file descriptors and cached module references to the current dist/ and package tree.
  2. The upgrade process replaces these files while the gateway continues serving requests.
  3. The gateway's import graph becomes inconsistent with the on-disk state, leading to missing chunks and broken dynamic imports.
  4. LaunchAgent/plist regeneration during upgrade does not preserve the EnvironmentVariables key set, causing env-backed secrets to vanish.
  5. Post-upgrade health probes run against a half-restarted runtime, masking the actual state of the system.

Why traditional in-process upgrade patterns fail here:

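  • Node.js/V8 module caching: Once a module is loaded, Node.js keeps it in its in-memory module cache and V8 retains the compiled code, so replacing .js files on disk does not change what is already running. Chunks that have not been loaded yet are resolved from disk on first use, where the old hashed filenames may no longer exist.
  • LaunchAgent atomicity gap: Writing a new .plist file can be made atomic, but restoring environment variables is not part of that write. It requires explicit snapshot and replay logic.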
  • Health check race conditions: When the upgrade process spawns a restart and immediately queries status, it may be talking to the old process, the new process, or neither.
  • Staged vs. in-place installation: Installing directly over the active package means there is no fallback state if the new package is broken.

The fix is not to improve error recovery within the current model. It is to replace the current model with a two-process supervised handoff architecture that guarantees the serving process never mutates its own files.

🛠️ Fix

Implement a supervised two-process upgrade architecture as follows:

1. External updater process

Create a dedicated updater binary or supervisor script that runs outside the gateway process. The gateway must never replace files in its own import tree. The updater can be invoked via the CLI or through a launchd- or systemd-mediated handoff.

2. Pre-upgrade state snapshot

Before touching any files, the updater must capture:

  • Current installed package path and version
  • Full LaunchAgent plist content
  • Sanitized list of required EnvironmentVariables key names (values must not be persisted in plaintext)
  • Config and runtime metadata required for rollback
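
A minimal sketch of such a snapshot, written in Python on the assumption that the updater is a standalone script. The function name, the output file names, and the use of package.json for the version are illustrative assumptions, not part of OpenClaw. To keep secret values out of the snapshot, this sketch stores a redacted copy of the plist (every EnvironmentVariables value blanked) and assumes the real values are re-applied from secure storage in step 7.

```python
import json
import plistlib
from pathlib import Path


def snapshot_state(plist_path: Path, package_dir: Path, out_dir: Path) -> dict:
    """Capture rollback state before any file under package_dir is touched."""
    out_dir.mkdir(parents=True, exist_ok=True)
    plist = plistlib.loads(plist_path.read_bytes())
    env = plist.get("EnvironmentVariables", {})

    # Assumption: the package root carries an npm-style package.json.
    version = json.loads((package_dir / "package.json").read_text())["version"]

    # Redacted plist copy: structure and key names survive, secret values do not.
    redacted = dict(plist)
    if "EnvironmentVariables" in plist:
        redacted["EnvironmentVariables"] = {key: "" for key in env}
    plist_copy = out_dir / "launchagent.redacted.plist"
    plist_copy.write_bytes(plistlib.dumps(redacted))

    manifest = {
        "package_path": str(package_dir.resolve()),
        "version": version,
        "plist_path": str(plist_path),
        "plist_snapshot": str(plist_copy),
        # Key names only; values are deliberately never written here.
        "required_env_keys": sorted(env),
    }
    (out_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

The manifest gives later steps one place to look up what "previous known-good" means: where the old package lives, which version it was, and which environment keys the regenerated plist must still contain.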

3. Clean gateway shutdown

Drain active connections and stop the gateway cleanly before any file operations. Do not attempt file replacement while the gateway is still serving traffic.
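
One way the "stop, then confirm it is really gone" part could look, as a sketch for POSIX systems. It assumes the gateway drains connections on SIGTERM and that the updater knows the gateway PID; both are assumptions, and the function names are illustrative. If launchd or systemd is configured to restart the job automatically, the stop would need to go through the service manager instead, which is not shown here.

```python
import os
import signal
import time


def is_running(pid: int) -> bool:
    """Return True if a process with this PID exists (signal 0 only probes)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def wait_for_exit(pid: int, timeout_s: float = 30.0, poll_s: float = 0.1) -> bool:
    """Poll until the process is gone; return False if it outlives the timeout."""
    deadline = time.monotonic() + timeout_s
    while is_running(pid):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
    return True


def request_stop(pid: int, timeout_s: float = 30.0) -> bool:
    """Ask the gateway to shut down and wait for it to exit."""
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return True  # already gone
    return wait_for_exit(pid, timeout_s)
```

File operations should start only when request_stop returns True. A False result means the old process is still alive, and the upgrade should abort rather than proceed.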

4. Staged installation

Install the new package into a staging directory, not the live install path. This preserves the previous version as a fallback artifact until health verification passes.
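
A sketch of staging, assuming a hypothetical layout in which every version lives in its own directory under a releases/ folder and the new package has already been unpacked somewhere else. The layout and the function name are assumptions for illustration.

```python
import shutil
from pathlib import Path


def stage_release(unpacked_pkg: Path, releases_dir: Path, version: str) -> Path:
    """Copy a new package tree into releases_dir/<version>, leaving the live tree alone."""
    target = releases_dir / version
    if target.exists():
        raise FileExistsError(f"release {version} is already staged at {target}")

    releases_dir.mkdir(parents=True, exist_ok=True)
    partial = releases_dir / f".{version}.partial"
    if partial.exists():
        shutil.rmtree(partial)  # leftover from an interrupted earlier attempt

    shutil.copytree(unpacked_pkg, partial, symlinks=True)
    # The release gets its final name only once the copy is complete.
    partial.rename(target)
    return target
```

Because the copy goes to a temporary name first, a crash mid-copy never leaves a half-written directory that looks like a valid release.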

5. Pre-promotion validation

Before activating the staged package, verify:

  • Entrypoints exist and are executable
  • Required bundled assets and chunks exist
  • Expected plugin and runtime paths resolve correctly
  • Offline validation passes, or doctor checks pointed at the staging directory pass
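
The file-level checks above could be sketched as follows. The function and its parameters are hypothetical, and the actual entrypoint and chunk paths depend on the package, so callers supply them. Doctor checks are not covered.

```python
import os
from pathlib import Path


def validate_staged(
    root: Path,
    entrypoints: list[str],
    required_paths: list[str],
    required_globs: list[str],
) -> list[str]:
    """Return a list of problems; an empty list means the staged tree passed."""
    problems = []
    for rel in entrypoints:
        path = root / rel
        if not path.is_file():
            problems.append(f"missing entrypoint: {rel}")
        elif not os.access(path, os.X_OK):
            problems.append(f"entrypoint not executable: {rel}")
    for rel in required_paths:
        if not (root / rel).exists():
            problems.append(f"missing required path: {rel}")
    # Hashed chunk names change per build, so they are matched by pattern.
    for pattern in required_globs:
        if not any(root.glob(pattern)):
            problems.append(f"no file matches: {pattern}")
    return problems
```

Returning every problem at once, rather than failing on the first, gives the user a complete picture of why a staged package was rejected.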

6. Atomic promotion

Switch the active package pointer to the staged version atomically (e.g., symlink swap or atomic rename). The previous package tree remains intact for rollback until the new runtime proves healthy.
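
A sketch of the symlink-swap variant on POSIX, assuming the active package is reached through a symlink (here called current_link, a hypothetical name) rather than a real directory. The new link is created under a temporary name and then renamed over the old one, so there is no moment at which the pointer is missing.

```python
import os
from pathlib import Path
from typing import Optional


def promote(current_link: Path, staged_dir: Path) -> Optional[Path]:
    """Atomically repoint current_link at staged_dir; return the previous target."""
    previous = Path(os.readlink(current_link)) if current_link.is_symlink() else None

    tmp_link = current_link.with_name(current_link.name + ".next")
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()  # leftover from an interrupted earlier attempt

    os.symlink(staged_dir, tmp_link, target_is_directory=True)
    # Renaming over the existing link replaces it in a single step on POSIX.
    os.replace(tmp_link, current_link)
    return previous
```

The returned previous target is what a rollback would pass back into promote to restore the old version.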

7. LaunchAgent and env restoration

After promotion, restore the plist and rehydrate required environment-backed secrets:

  • Write the snapshot plist back to the correct location
  • Compare old and new plist for required EnvironmentVariables key presence
  • Re-apply env-backed secrets from the secure storage mechanism
  • Do not trust a freshly generated plist without this validation step
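
The key comparison and re-application could be sketched like this. Both function names are illustrative, and the secrets mapping stands in for whatever secure storage mechanism is in use; how values are fetched from it is outside this sketch.

```python
import plistlib
from typing import Mapping, Sequence


def missing_env_keys(required_keys: Sequence[str], plist_bytes: bytes) -> list[str]:
    """List required EnvironmentVariables keys that are absent or empty in a plist."""
    env = plistlib.loads(plist_bytes).get("EnvironmentVariables", {})
    return sorted(key for key in required_keys if not env.get(key))


def rehydrate_env(
    plist_bytes: bytes,
    required_keys: Sequence[str],
    secrets: Mapping[str, str],
) -> bytes:
    """Return plist bytes with every missing required key filled in from secrets."""
    plist = plistlib.loads(plist_bytes)
    env = dict(plist.get("EnvironmentVariables", {}))
    for key in required_keys:
        if env.get(key):
            continue  # already present in the regenerated plist
        if key not in secrets:
            raise KeyError(f"no stored value for required env key {key!r}")
        env[key] = secrets[key]
    plist["EnvironmentVariables"] = env
    return plistlib.dumps(plist)
```

Running missing_env_keys on the result before starting the LaunchAgent turns "secrets silently vanished" into an explicit, checkable failure.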

8. Post-start health gating

Only after the new LaunchAgent starts should health verification occur:

  • Config validation
  • Gateway deep status
  • Channel probe verification
  • Cron and model status checks
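
A sketch of gating on these checks with a shared deadline, so that a gateway that is still starting gets retried instead of being declared broken on the first probe. Each check is a caller-supplied function; how it talks to the gateway is an assumption left to the caller.

```python
import time
from typing import Callable


def wait_until_healthy(
    checks: dict[str, Callable[[], bool]],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
) -> dict[str, bool]:
    """Run checks in order, retrying each until it passes or the deadline expires."""
    deadline = time.monotonic() + timeout_s
    results = {name: False for name in checks}
    for name, check in checks.items():
        while True:
            try:
                passed = bool(check())
            except Exception:
                passed = False  # e.g. connection refused while still starting
            if passed:
                results[name] = True
                break
            if time.monotonic() >= deadline:
                return results  # later checks stay False: they were never reached
            time.sleep(interval_s)
    return results
```

Running the checks in a fixed order means a failed config validation is reported as such, rather than surfacing indirectly as a failed channel probe.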

9. Automatic rollback on failure

If any startup or probe step fails, the updater must automatically roll back:

  • Restore the previous package tree
  • Restore the previous plist and env snapshot
  • Restart the gateway with the prior known-good state
  • Surface clear rollback notification to the user
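
One way to structure this is to pair every upgrade step with its undo action and unwind completed steps in reverse order on the first failure. The sketch below shows only that control flow; the step names and the actions behind them are placeholders.

```python
from typing import Callable

# (name, do, undo): each step knows how to reverse itself.
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_with_rollback(steps: list[Step]) -> tuple[bool, list[str]]:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    log: list[str] = []
    completed: list[tuple[str, Callable[[], None]]] = []
    for name, do, undo in steps:
        try:
            do()
        except Exception as exc:
            log.append(f"FAILED {name}: {exc}")
            for done_name, done_undo in reversed(completed):
                try:
                    done_undo()
                    log.append(f"rolled back {done_name}")
                except Exception as undo_exc:
                    # Keep unwinding; a failed undo must be visible to the user.
                    log.append(f"ROLLBACK FAILED {done_name}: {undo_exc}")
            return False, log
        completed.append((name, undo))
        log.append(f"ok {name}")
    return True, log
```

The returned log doubles as the rollback notification: it states which step failed, why, and which earlier steps were reverted. In this sketch the failing step's own undo is not run, so each step is assumed to leave nothing behind when it raises.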

10. Success criteria

Declare upgrade success only when:

  1. Package is installed and promoted
  2. Gateway has restarted and is responsive
  3. All post-start health checks pass

Report these three stages to the user separately, not as a single binary success/failure.
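
A small sketch of what staged reporting might look like. The stage names mirror the three criteria above; the output format is an illustrative assumption.

```python
from enum import Enum


class Stage(Enum):
    INSTALLED = "package installed and promoted"
    RESTARTED = "gateway restarted and responsive"
    HEALTHY = "health checks passed"


def summarize(completed: set[Stage]) -> tuple[bool, list[str]]:
    """Return overall success plus one status line per stage, in stage order."""
    lines = [
        f"[{'ok' if stage in completed else '--'}] {stage.value}" for stage in Stage
    ]
    return all(stage in completed for stage in Stage), lines
```

With this shape, an upgrade whose package installed but whose gateway never came back is reported as one stage done and two not, instead of as a misleading overall success.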