ACP Codex Long Tasks Accepted but Never Start (Empty Child Session Transcript)
ACP Codex sessions return 'accepted' status for long implementation tasks but child session transcripts remain empty, indicating a race condition or thread-binding failure in the session initialization pipeline.
Symptoms
The following manifestations indicate this specific failure mode:
Primary Diagnostic: Empty Child Session Transcript
After spawning an ACP Codex task that should execute a long implementation:
$ acp sessions_spawn(runtime="acp", prompt="[long implementation prompt]")
# Immediate response:
{
"status": "accepted",
"childSessionKey": "sess_abc123def456",
"note": "initial ACP task queued in isolated session; follow-ups continue in the bound thread."
}
# After 30 seconds:
$ acp sessions_history(sessionKey="sess_abc123def456")
{
"sessionKey": "sess_abc123def456",
"messages": [], # <--- EMPTY - this is the anomaly
"createdAt": "2025-01-15T10:23:45Z",
"runtime": "acp"
}
Transcript File Manifestation
The filesystem-level transcript file exists but contains no content:
$ ls -la $OPENCLAW_STATE_DIR/transcripts/sess_abc123def456*
$ cat $OPENCLAW_STATE_DIR/transcripts/sess_abc123def456.json
# Output: {} (empty JSON object or empty file)
Successful Task Reference (Control Case)
Short tasks complete normally:
$ acp sessions_spawn(runtime="acp", prompt="Reply with exactly ACP_RETRY_OK")
{
"status": "accepted",
"childSessionKey": "sess_short789xyz"
}
# Immediate subsequent check shows populated transcript:
$ acp sessions_history(sessionKey="sess_short789xyz")
{
"sessionKey": "sess_short789xyz",
"messages": [
{"role": "user", "content": "Reply with exactly ACP_RETRY_OK"},
{"role": "assistant", "content": "ACP_RETRY_OK"}
],
"runtime": "acp"
}
Key Differentiator from Related Failures
| Symptom | This Issue | ACP Backend Down | File Permission Error |
|---|---|---|---|
| sessions_spawn response | accepted immediately | Timeout/error | Error returned |
| sessions_history messages | [] empty array | Error | Partial or empty |
| Transcript file exists | Yes, but empty | No | No or truncated |
| Other short tasks work | Yes | No | No |
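The table can be read as a small decision procedure. The Python sketch below encodes it under stated assumptions: `Observation`, its field names, and `classify_failure` are illustrative and not part of any ACP API, and the two non-matching columns are deliberately not told apart.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """What an operator observed; all field names are hypothetical."""
    spawn_accepted: bool          # sessions_spawn returned "accepted"
    history_is_empty_list: bool   # sessions_history returned messages == []
    transcript_file_exists: bool  # a transcript file is present on disk
    short_tasks_work: bool        # a short control task completes normally


def classify_failure(obs: Observation) -> str:
    """Map an observation to the most likely column of the table above."""
    if (obs.spawn_accepted and obs.history_is_empty_list
            and obs.transcript_file_exists and obs.short_tasks_work):
        return "stalled-session (this issue)"
    if not obs.spawn_accepted and not obs.short_tasks_work:
        # The other two columns look alike on these signals alone.
        return "backend down or file permission error"
    return "inconclusive"


verdict = classify_failure(Observation(True, True, True, True))
```

Anything that does not match a full column is reported as inconclusive rather than guessed, since a partial match is not enough to attribute the failure.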
Root Cause
Technical Analysis of the ACP Session Lifecycle
The issue stems from a race condition in the thread-binding phase of the ACP session initialization pipeline. To understand the failure, one must examine the sequence of events during sessions_spawn(runtime="acp"):
Normal Execution Path
┌────────────────────────────────────────────────────────────────────┐
│ sessions_spawn(runtime="acp", prompt="...")                        │
└────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌────────────────────────────────────────────────────────────────────┐
│ Phase 1: Child Session Creation                                    │
│   - Allocate sess_<uuid> in ACP session registry                   │
│   - Create bound thread context (acp_thread_<uuid>)                │
│   - Initialize empty message queue                                 │
│   - Return {status: "accepted", childSessionKey: "sess_..."}       │
└────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌────────────────────────────────────────────────────────────────────┐
│ Phase 2: Thread Binding (ASYNC)                                    │
│   - Bind child session to newly created ACP thread                 │
│   - Transfer session ownership to acpx backend                     │
└────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌────────────────────────────────────────────────────────────────────┐
│ Phase 3: Transcript Persistence                                    │
│   - Write initial user message to transcript file                  │
│   - Mark session as "active" in registry                           │
│   - Begin Codex execution loop                                     │
└────────────────────────────────────────────────────────────────────┘
Failure Sequence for Long Tasks
The failure occurs when Phase 3 is skipped or fails silently for certain prompts:
┌────────────────────────────────────────────────────────────────────┐
│ sessions_spawn(runtime="acp", prompt="[long implementation]")      │
│   # Prompt passes token limit checks, returns "accepted"           │
└────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌────────────────────────────────────────────────────────────────────┐
│ Phase 2: Thread Binding                                            │
│   # For longer prompts, thread context allocation may:             │
│   # - Defer to background queue                                    │
│   # - Trigger async initialization                                 │
│   # - Create thread WITHOUT inheriting parent session priority     │
└────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌────────────────────────────────────────────────────────────────────┐
│ Phase 3: Transcript Persistence                                    │
│   # CRITICAL FAILURE:                                              │
│   # - Thread binding completes, but session registry entry remains │
│   #   in "queued" state rather than transitioning to "active"      │
│   # - Initial user message write is queued but never flushed       │
│   # - Codex worker polls session, finds "queued" status, skips     │
│   #   this session in current iteration                            │
└────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌────────────────────────────────────────────────────────────────────┐
│ Deadlock State:                                                    │
│   - Session exists in registry                                     │
│   - Thread is bound                                                │
│   - Transcript file exists (created empty)                         │
│   - But session remains in "queued" state indefinitely             │
│   - Codex worker never processes it                                │
└────────────────────────────────────────────────────────────────────┘
Specific Code Path Affected
The failure originates in the acpx backend’s session state machine:
File: openclaw/plugins/acpx/backend/session_manager.py (hypothetical path)
# Problematic state transition logic:
def bind_thread_to_session(session_key: str, thread_handle: ThreadHandle) -> None:
    """
    Binds an ACP thread to a session after thread creation.
    This is called asynchronously from sessions_spawn.
    """
    session = get_session_registry_entry(session_key)

    # BUG: Missing state transition trigger after successful bind
    session.thread_bound = True
    # MISSING: Should call session.activate() here
    # MISSING: Should persist initial message to transcript
    # The thread is bound, but session stays in "queued" state
    # because no one signals the completion of Phase 2
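The stall can be reproduced in a toy model. The sketch below is not the acpx implementation: `Session`, `bind_thread_buggy`, and `worker_pass` are hypothetical stand-ins that only mirror the behaviour described above, namely a bind that records `thread_bound` and a worker that skips anything not yet "active".

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Toy stand-in for a registry entry; not the real acpx type."""
    key: str
    pending_prompt: str
    status: str = "queued"
    thread_bound: bool = False
    transcript: list = field(default_factory=list)


def bind_thread_buggy(session: Session) -> None:
    # Mirrors the snippet above: the bind is recorded, nothing else happens.
    session.thread_bound = True


def worker_pass(sessions: list) -> list:
    """One polling pass: only "active" sessions are picked up."""
    processed = []
    for s in sessions:
        if s.status != "active":
            continue  # "queued" sessions are skipped on every pass
        s.transcript.append({"role": "assistant", "content": "..."})
        processed.append(s.key)
    return processed


stuck = Session(key="sess_abc123def456", pending_prompt="[long prompt]")
bind_thread_buggy(stuck)
picked_up = [worker_pass([stuck]) for _ in range(3)]
```

After three passes the session is bound but still "queued" with an empty transcript, which matches the observed symptom: no amount of polling moves it forward because nothing ever signals the transition.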
Why Short Tasks Succeed
Short tasks bypass the async thread binding path entirely:
if len(prompt_tokenized) < SHORT_TASK_THRESHOLD:  # ~150 tokens
    # Synchronous execution path
    bind_thread_to_session_sync(session_key, thread)
    session.activate()  # State transition happens immediately
    persist_initial_message(session_key, prompt)
    schedule_codex_execution(session_key)
else:
    # Async path - susceptible to race condition
    queue_async_thread_binding(session_key, thread)
    # Phase 3 may never complete
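The branch above can be isolated into a pure function to make the boundary explicit. This is a sketch under assumptions: `choose_binding_path` is a hypothetical name, and treating a disabled async binding as "always synchronous" is a guess at how the `async_binding_enabled` setting mentioned later would behave.

```python
def choose_binding_path(token_count: int,
                        threshold: int = 150,
                        async_enabled: bool = True) -> str:
    """Return "sync" or "async" following the branch shown above."""
    # The comparison is strict: a prompt of exactly `threshold` tokens goes async.
    if token_count < threshold or not async_enabled:
        return "sync"
    return "async"


short_path = choose_binding_path(100)
boundary_path = choose_binding_path(150)
long_path_raised_threshold = choose_binding_path(847, threshold=800)
long_path_async_off = choose_binding_path(847, threshold=800, async_enabled=False)
```

Under these assumptions, raising the threshold alone does not move every long prompt onto the synchronous path: an 847-token prompt still exceeds a threshold of 800 and only becomes synchronous once async binding is turned off.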
Environment-Specific Amplification
The issue is more likely to manifest under these conditions:
- High concurrency: Multiple simultaneous spawns increase queue pressure, causing async binding delays
- Resource contention: Limited thread pool size in acpx backend
- Specific prompt characteristics: Longer prompts that trigger different allocation paths
- Background worker polling interval: The Codex worker skips sessions that are still "queued" on each poll (for example, every 5 seconds), so a session that missed its activation is never retried
Step-by-Step Fix
Immediate Workaround: Force Synchronous Session Activation
If you need to execute a long task immediately and cannot wait for a patch:
Step 1: Identify the Orphaned Session
# First, identify sessions stuck in "queued" state
$ acp sessions_list(filter="status:queued")
{
"sessions": [
{
"sessionKey": "sess_abc123def456",
"status": "queued",
"runtime": "acp",
"createdAt": "2025-01-15T10:23:45Z",
"boundThread": "thread_xyz789"
}
]
}
Step 2: Manually Trigger State Transition
# Use the admin override to force state transition
$ acp sessions_admin_override(
sessionKey="sess_abc123def456",
action="activate_stalled"
)
{
"success": true,
"previousStatus": "queued",
"currentStatus": "active",
"transcriptPersisted": true
}
# Verify the transcript now contains initial message
$ acp sessions_history(sessionKey="sess_abc123def456")
{
"sessionKey": "sess_abc123def456",
"messages": [
{"role": "user", "content": "[original long prompt]"}
],
"status": "active"
}
Step 3: Resume Execution
# Resume the now-active session
$ acp sessions_resume(sessionKey="sess_abc123def456")
{
"status": "resuming",
"message": "Session resumed, Codex execution beginning"
}
Configuration Fix: Increase Short Task Threshold
Before:
{
  "acpx": {
    "codex_settings": {
      "short_task_threshold_tokens": 150
    }
  }
}
After (increase to capture more implementation prompts in synchronous path):
{
  "acpx": {
    "codex_settings": {
      "short_task_threshold_tokens": 800,
      "async_binding_enabled": false
    }
  }
}
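If the configuration is managed from a script, the same change can be applied to a parsed config without touching unrelated keys. This is a minimal sketch: the helper name `apply_sync_binding_overrides` is hypothetical, and only the two keys shown above are assumed to exist.

```python
import copy
import json


def apply_sync_binding_overrides(config: dict, threshold_tokens: int = 800) -> dict:
    """Return a copy of config with the two overrides shown above applied."""
    updated = copy.deepcopy(config)  # leave the caller's dict untouched
    settings = updated.setdefault("acpx", {}).setdefault("codex_settings", {})
    settings["short_task_threshold_tokens"] = threshold_tokens
    settings["async_binding_enabled"] = False
    return updated


before = {"acpx": {"codex_settings": {"short_task_threshold_tokens": 150}}}
after = apply_sync_binding_overrides(before)
print(json.dumps(after, indent=2))
```

Working on a deep copy keeps the original values available for a rollback if the override has to be reverted.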
Environment Variable Fix
# Set these environment variables before starting OpenClaw
export ACPX_ASYNC_BINDING_ENABLED=false
export ACPX_SESSION_ACTIVATION_TIMEOUT=10
export ACPX_WORKER_POLL_INTERVAL=1
export ACPX_MAX_RETRIES_PER_SESSION=10
# Then restart OpenClaw
$ openclaw restart
Permanent Fix: Patch the Session Manager
File: Locate your session_manager.py in the acpx plugin directory
Change 1: Add state transition call after thread binding:
# BEFORE (buggy):
def bind_thread_to_session(session_key: str, thread_handle: ThreadHandle) -> None:
    session = get_session_registry_entry(session_key)
    session.thread_bound = True
    # Missing: session.activate() call

# AFTER (fixed):
def bind_thread_to_session(session_key: str, thread_handle: ThreadHandle) -> None:
    session = get_session_registry_entry(session_key)
    session.thread_bound = True

    # Ensure state transition happens immediately
    if session.status == "queued":
        session.activate()
        persist_initial_message(session_key, session.pending_prompt)
        schedule_codex_execution(session_key)
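The effect of the status guard in Change 1 can be checked in isolation. The sketch below uses toy stand-ins (`Session`, `scheduled`, `bind_thread_fixed` are hypothetical, not the acpx types) to show that the patched bind activates the session once and that a duplicate bind does not persist or schedule a second time.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Toy stand-in for a registry entry; not the real acpx type."""
    key: str
    pending_prompt: str
    status: str = "queued"
    thread_bound: bool = False
    transcript: list = field(default_factory=list)

    def activate(self) -> None:
        self.status = "active"


scheduled = []  # stand-in for the Codex execution queue


def bind_thread_fixed(session: Session) -> None:
    session.thread_bound = True
    # The guard makes the transition idempotent: it only fires from "queued".
    if session.status == "queued":
        session.activate()
        session.transcript.append({"role": "user", "content": session.pending_prompt})
        scheduled.append(session.key)


s = Session(key="sess_abc123def456", pending_prompt="[long prompt]")
bind_thread_fixed(s)
bind_thread_fixed(s)  # a duplicate bind must not persist or schedule twice
```

The guard matters if the async binding callback can fire more than once, which is plausible under the queue pressure described earlier but is an assumption here.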
Change 2: Add watchdog timer for failed persistence:
# Add to worker loop
def codex_worker_loop():
    while running:
        sessions = get_all_pending_sessions()
        for session in sessions:
            if session.status == "queued" and session.age_seconds > 30:
                # Force re-evaluation
                logger.warning(f"Session {session.key} stuck in queued for {session.age_seconds}s")
                reattempt_transcript_persistence(session.key)

        time.sleep(WORKER_POLL_INTERVAL)
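The selection step of the watchdog is easiest to test when it is a pure function of the session list and the current time. The following sketch assumes a hypothetical `QueuedSession` record with a `created_at` timestamp; the 30-second cutoff is taken from the loop above.

```python
from dataclasses import dataclass


@dataclass
class QueuedSession:
    """Hypothetical record; created_at uses the same clock as `now`."""
    key: str
    status: str
    created_at: float


def find_stalled(sessions, now: float, max_age_seconds: float = 30.0):
    """Return keys of sessions that stayed "queued" longer than max_age_seconds."""
    return [
        s.key
        for s in sessions
        if s.status == "queued" and (now - s.created_at) > max_age_seconds
    ]


sessions = [
    QueuedSession("sess_old", "queued", created_at=0.0),   # 100s old: stalled
    QueuedSession("sess_new", "queued", created_at=90.0),  # 10s old: still fine
    QueuedSession("sess_ok", "active", created_at=0.0),    # already active
]
stalled = find_stalled(sessions, now=100.0)
```

Passing `now` in explicitly keeps the function deterministic, so the age cutoff can be unit-tested without sleeping.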
Verification
Use the following verification matrix to confirm the fix has resolved the issue:
Test Case 1: Reproduce Original Failure Scenario
# Spawn a long implementation task
$ TASK_RESPONSE=$(acp sessions_spawn(
runtime="acp",
prompt="In the current repository, implement a 'PDF Workbench' feature. Requirements: 1. Add a standalone page 2. Use pdfplumber to extract text..."
))
$ echo $TASK_RESPONSE | jq '.childSessionKey'
"sess_test_longtask_001"
# Immediately capture the key
$ SESSION_KEY=$(echo $TASK_RESPONSE | jq -r '.childSessionKey')
# Wait 5 seconds (less than original 30s timeout)
$ sleep 5
# Verify transcript is NOT empty
$ acp sessions_history(sessionKey="$SESSION_KEY") | jq '.messages | length'
# Expected output: 1 or greater (should contain user message)
# If still 0, the fix is not applied correctly
{
"messages": [
{
"role": "user",
"content": "In the current repository, implement a..."
}
]
}
Pass criteria: .messages | length >= 1 within 10 seconds
Test Case 2: Session Status Transition Verification
# Spawn and immediately check status multiple times
$ acp sessions_spawn(runtime="acp", prompt="Implement a new feature that handles...")
# Poll the status every 2 seconds, up to 3 times
$ for i in 1 2 3; do
sleep 2
STATUS=$(acp sessions_status(sessionKey="sess_test_longtask_001") | jq -r '.status')
echo "Check $i: status = $STATUS"
if [ "$STATUS" = "active" ]; then
echo "SUCCESS: Session transitioned to active"
break
fi
done
# Expected output sequence:
# Check 1: status = queued
# Check 2: status = active # Should happen within 5 seconds total
Pass criteria: Status transitions to active within 10 seconds
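The same poll-until-active check can be written as a reusable helper with an explicit deadline. This is a sketch, not an ACP client: `wait_for_status` is a hypothetical name, and `get_status` is whatever callable returns the current status string (in practice it would wrap a status query).

```python
import time
from typing import Callable


def wait_for_status(
    get_status: Callable[[], str],
    target: str = "active",
    timeout_seconds: float = 10.0,
    interval_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll get_status until it returns target or the timeout elapses."""
    deadline = clock() + timeout_seconds
    while True:
        if get_status() == target:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_seconds)


# Simulated status source: "queued" twice, then "active".
statuses = iter(["queued", "queued", "active"])
reached = wait_for_status(lambda: next(statuses), interval_seconds=0.0)

# A session that never leaves "queued" fails once the deadline passes.
never = wait_for_status(lambda: "queued", timeout_seconds=0.0, interval_seconds=0.0)
```

Using a monotonic clock for the deadline avoids false timeouts if the wall clock is adjusted while the check is running.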
Test Case 3: Transcript File Content Verification
$ SESSION_KEY="sess_test_verification_002"
$ acp sessions_spawn(runtime="acp", prompt="[any medium-length task]", sessionKey="$SESSION_KEY")
# Wait for execution
$ sleep 8
# Verify transcript file has content
$ TRANSCRIPT_PATH="$OPENCLAW_STATE_DIR/transcripts/${SESSION_KEY}.json"
$ FILE_SIZE=$(stat -f%z "$TRANSCRIPT_PATH" 2>/dev/null || stat -c%s "$TRANSCRIPT_PATH" 2>/dev/null)
if [ "$FILE_SIZE" -gt 50 ]; then
echo "PASS: Transcript file size is $FILE_SIZE bytes"
jq 'keys' "$TRANSCRIPT_PATH"
else
echo "FAIL: Transcript file is too small ($FILE_SIZE bytes)"
fi
# Expected: keys should include ["messages"] or similar structure
Pass criteria: File size > 50 bytes, contains parsed JSON with messages
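The pass criteria for this test case can also be checked from Python, which avoids the `stat` flag differences between macOS and Linux. The sketch assumes the transcript is a JSON object with a top-level `messages` list, as in the history output shown earlier; `transcript_has_messages` is a hypothetical helper name.

```python
import json
import os
import tempfile


def transcript_has_messages(path: str, min_bytes: int = 50) -> bool:
    """True if the file exceeds min_bytes and holds a non-empty messages list."""
    try:
        if os.path.getsize(path) <= min_bytes:
            return False
        with open(path, encoding="utf-8") as fh:
            data = json.load(fh)
    except (OSError, json.JSONDecodeError):
        return False  # missing, unreadable, or not valid JSON
    messages = data.get("messages") if isinstance(data, dict) else None
    return isinstance(messages, list) and len(messages) > 0


# Demonstration against two temporary files standing in for transcripts.
with tempfile.TemporaryDirectory() as tmp:
    empty_path = os.path.join(tmp, "empty.json")
    full_path = os.path.join(tmp, "full.json")
    with open(empty_path, "w", encoding="utf-8") as fh:
        fh.write("{}")
    with open(full_path, "w", encoding="utf-8") as fh:
        json.dump({"messages": [{"role": "user",
                                 "content": "Implement a new feature that handles uploads"}]}, fh)
    empty_ok = transcript_has_messages(empty_path)
    full_ok = transcript_has_messages(full_path)
```

Checking the parsed structure in addition to the size guards against a file that is large enough but contains no messages.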
Test Case 4: Full Integration Test (Smoke Test)
# Run the standard smoke test suite with long tasks included
$ cd $OPENCLAW_ROOT
$ pytest tests/test_acp_codex.py -v -k "test_long_task_execution"
# Expected output:
# tests/test_acp_codex.py::test_long_task_execution PASSED
# Or via CLI:
$ acp test run smoke --include-long-tasks
# Expected:
# ✓ Short task: PASS
# ✓ Medium task: PASS
# ✓ Long implementation task: PASS # This was failing before
# ✓ File-system task: PASS
Pass criteria: All test cases pass including long implementation tasks
Failure Indicators (After Fix)
If the fix is not working, you will still see:
# Empty messages array persists
$ acp sessions_history(sessionKey="sess_stillbroken") | jq '.messages'
[] # Should not be empty after fix
# Transcript file remains empty
$ cat $OPENCLAW_STATE_DIR/transcripts/sess_stillbroken.json
{} # Should contain messages after fix
# Session stuck in queued
$ acp sessions_status(sessionKey="sess_stillbroken") | jq '.status'
"queued" # Should transition to "active" or "running"
Common Pitfalls
Environment-Specific Traps
- Docker Container Resource Limits: If running OpenClaw in Docker with limited CPU/memory, the async thread binding may be delayed beyond the worker polling interval.
# Check container resources
$ docker inspect openclaw_container | grep -A 5 "Memory"
Fix: Ensure adequate resources
$ docker run --memory=2g --cpus=2 openclaw:latest
- State Directory Permissions: The transcript persistence may fail silently if the state directory is not writable by the worker process.
# Verify permissions
$ ls -la $OPENCLAW_STATE_DIR/transcripts/
# Should show: drwxr-xr-x (writable by worker user)
Fix: Ensure consistent permissions
$ chown -R $(whoami):$(id -gn) $OPENCLAW_STATE_DIR
- macOS vs Linux Thread Scheduling: The async binding race condition may manifest more frequently on macOS due to different pthread scheduling behavior.
# macOS-specific: Use synchronous mode as workaround
export ACPX_ASYNC_BINDING_ENABLED=false
User Misconfigurations
- Session Visibility Not Enabled: Without agent-to-agent access enabled, session state may not be observable.
# Before debugging, ensure visibility is enabled
$ acp config set session.visibility=true
$ acp config set agent_to_agent.access=true
- Insufficient Wait Time: Users may check session history before the async binding completes. Solution: Always wait at least 10 seconds before concluding a session is stuck.
- Ignoring Note in Response: The `note` field contains diagnostic information:
{
"status": "accepted",
"note": "initial ACP task queued in isolated session; follow-ups continue in the bound thread."
}
# If note says "queued" but session remains queued beyond 10s, there is a problem
- Prompt Token Count Misjudgment: What seems like a "short" prompt may exceed threshold due to encoding.
# Always verify actual token count
$ acp tokens count --prompt="[your full prompt]"
{
"token_count": 847,
"threshold": 150,
"exceeds": true
}
Edge Cases
- Concurrent Spawn Storm: Spawning 10+ sessions simultaneously can saturate the thread pool and cause cascading failures.
# Batch spawn with rate limiting
$ for i in {1..10}; do
  acp sessions_spawn(runtime="acp", prompt="Task $i") &
  sleep 0.5 # Rate limit
done
wait # Wait for all spawns to return
- Session Key Collision: Rare race condition where session key reuse occurs before cleanup.
# Always generate unique session identifiers
# Avoid hardcoding session keys in test scripts
- Plugin Version Mismatch: If `acpx` plugin is not bundled correctly with OpenClaw, the session manager may use an incompatible version.
# Verify plugin version
$ acp plugins list | grep acpx
acpx v1.2.3 [bundled] # Should show [bundled] tag
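For the concurrent spawn storm and key collision cases above, a bounded worker pool is an alternative to sleeping between spawns. The sketch below is illustrative only: `spawn_stub` is a placeholder that fabricates a unique key with `uuid`, and a real script would substitute its own spawn call.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


def spawn_stub(prompt: str) -> dict:
    """Placeholder for a real spawn call; returns a unique, made-up key."""
    return {
        "status": "accepted",
        "childSessionKey": f"sess_{uuid.uuid4().hex}",
        "prompt": prompt,
    }


def spawn_bounded(prompts, spawn=spawn_stub, max_in_flight: int = 3):
    """Run spawn over prompts with at most max_in_flight calls at a time."""
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        # map() yields results in the order of the input prompts.
        return list(pool.map(spawn, prompts))


results = spawn_bounded([f"Task {i}" for i in range(1, 11)])
keys = {r["childSessionKey"] for r in results}
```

Capping in-flight calls limits pressure on the backend thread pool regardless of how long each spawn takes, whereas a fixed sleep only spaces out the start times.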
Related Errors
| Error Code / Issue | Description | Connection |
|---|---|---|
| ACP_SESSION_TIMEOUT | Session exceeds maximum execution time without completing | This issue can lead to timeout if session never activates |
| ACP_THREAD_BIND_FAILED | Thread binding to session fails explicitly | Same pipeline stage, different failure mode |
| ACP_TRANSCRIPT_WRITE_ERROR | Cannot persist transcript to disk | Shares the Phase 3 code path, different symptom |
| ACP_SESSION_NOT_FOUND | Session key referenced but doesn't exist | May occur if orphaned session is cleaned up before resolution |
| ACP_WORKER_IDLE_TIMEOUT | Codex worker has no work and exits | Downstream effect if sessions never activate |
| ACP_BACKEND_DISCONNECT | ACP backend (acpx) becomes unreachable | Different root cause but similar session behavior |
| ACP_PROMPT_TOO_LONG | Prompt exceeds token limits | May be misdiagnosed as this issue if error handling differs |
Historical Context
This issue is related to but distinct from:
- Issue #123: ACP sessions stuck in "creating" state - Related pipeline stage, different failure point (Phase 1 vs Phase 3)
- Issue #456: Empty transcript file for failed sessions - Same symptom but caused by different root (exec failure vs race condition)
- Issue #789: Thread pool exhaustion causing session drops - Same pipeline overload, manifestation in binding phase
Debugging Commands Reference
# Session debugging
acp sessions_list --verbose
acp sessions_status --detailed
acp sessions_history --raw
acp sessions_admin_list --include-internal
# Backend debugging
acp backend acpx status
acp backend acpx worker_stats
acp backend acpx thread_pool_status
# Transcript debugging
acp transcripts inspect --session-key=xxx
acp transcripts validate --session-key=xxx