# CLI Subagent Transcript Fix — Engineering Analysis

_Generated 2026-03-05. Evidence: claude probe `ba87eb8b`, gemini timeout probe `f13c6d2d`, gemini done probe `8af3c941`._

---

## 1. Root-Cause Hypotheses — Ranked by Probability

### H1 — `try {} catch {}` completely silences Patch B failures (CONFIRMED structure, high certainty)

**File**: `subagent-registry-CkqrXKq4.js:39821`
**Evidence**: The injected line is:
```js
if (cliText && params.sessionKey) { try { await callGatewayTool("chat.inject", {}, {sessionKey: params.sessionKey, message: cliText}); } catch {} }
```
There is zero error surfacing. Any HTTP error, auth rejection, "session not found", or thrown exception disappears silently. We cannot determine the actual failure mode without instrumenting the catch.

**Impact**: This is the reason we have no diagnostic signal. Without logging, every downstream hypothesis looks the same from the outside (no JSONL written, no error).

---

### H2 — `callGatewayTool("chat.inject", {}, ...)` sends request with `token: undefined` (HIGH probability root cause of auth failure)

**File**: `subagent-registry-CkqrXKq4.js:13500-13518`
`resolveGatewayOptions$1({})` with empty opts:
- No `opts.gatewayUrl` → `validatedOverride = undefined`
- No `opts.gatewayToken` → `explicitToken = undefined`
- `token = undefined` (no explicitToken, no override target token)

The HTTP call in `callGateway` then fires as:
```
POST http://localhost:18789
Authorization: Bearer undefined   <- or no header at all
```

The gateway's auth middleware would then reject the request with an auth error, which the empty `catch {}` swallows.

**Fix**: Call `callGatewayLeastPrivilege` instead of `callGatewayTool`, or pass opts that resolve to a valid operator token. `callGatewayLeastPrivilege` (also imported from `./call-DaJKh-6e.js`) resolves the operator token from the running config internally.
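
The resolution steps listed for H2 can be sketched as follows. This is only an illustration under assumptions: `resolveToken`, `buildAuthHeader`, and `buildAuthHeaderStrict` are hypothetical stand-ins, not the bundled `resolveGatewayOptions$1` or `callGateway`, and whether the real client sends `Bearer undefined` or omits the header is not confirmed.

```javascript
// Hypothetical stand-in for the resolution described above: with empty opts
// there is no explicit token and no override target, so the result is undefined.
function resolveToken(opts = {}) {
  const explicitToken = opts.gatewayToken;
  const overrideToken = opts.gatewayUrl ? opts.overrideToken : undefined; // illustrative field
  return explicitToken ?? overrideToken;
}

// Naive header construction: interpolation turns undefined into the text "undefined".
function buildAuthHeader(token) {
  return `Bearer ${token}`;
}

// Guarded variant: fail loudly instead of sending a meaningless credential.
function buildAuthHeaderStrict(token) {
  if (typeof token !== "string" || token.length === 0) {
    throw new Error("gateway token is missing; refusing to send request");
  }
  return `Bearer ${token}`;
}

console.log(buildAuthHeader(resolveToken({}))); // prints "Bearer undefined"
```

A guard like `buildAuthHeaderStrict` would turn the silent auth rejection into an explicit local error, provided the surrounding `catch` logs it.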

---

### H3 — `loadSessionEntry(sessionKey)` returns `null` for subagent sessions (MEDIUM probability, secondary)

**File**: `gateway-cli-CuFEx2ht.js:11449`
```js
const { cfg, storePath, entry } = loadSessionEntry(rawSessionKey);
const sessionId = entry?.sessionId;
if (!sessionId || !storePath) {
    respond(false, void 0, errorShape(ErrorCodes.INVALID_REQUEST, "session not found"));
    return;
}
```

For subagent sessions with key `agent:claude:subagent:ba87eb8b-...`, the session MUST be in `sessions.json` before `runCliAgent` completes. The spawning path writes it before the run starts, so this should not be a timing race in normal flow. However, if the subagent runs in an isolated context where `sessions.json` isn't flushed yet (file-lock contention, async I/O lag), this could fail.

**Confidence**: Medium — only becomes the cause if H2 is ruled out.
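
To rule H3 in or out, the handler's guard can be replayed against a parsed copy of `sessions.json`. The sketch below is hypothetical: it assumes the store parses to an object keyed by session key whose entries carry a `sessionId`, which matches the snippet above but has not been checked against the real store layout.

```javascript
// Hypothetical check mirroring the guard in the chat.inject handler.
// Assumes `store` is the parsed sessions.json, keyed by session key.
function diagnoseSessionEntry(store, sessionKey) {
  const entry = store?.[sessionKey];
  if (!entry) return "missing-entry";
  if (!entry.sessionId) return "missing-sessionId";
  return "ok";
}

// Made-up example data; real keys look like agent:claude:subagent:<id>.
const exampleStore = {
  "agent:claude:subagent:example": { sessionId: "s-1" },
  "agent:claude:subagent:half-written": {},
};

console.log(diagnoseSessionEntry(exampleStore, "agent:claude:subagent:example"));      // "ok"
console.log(diagnoseSessionEntry(exampleStore, "agent:claude:subagent:half-written")); // "missing-sessionId"
console.log(diagnoseSessionEntry(exampleStore, "agent:claude:subagent:absent"));       // "missing-entry"
```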

---

### H4 — The probe runs (ba87eb8b, 8af3c941) predate the patches (CONFIRMED, explains the evidence)

The two probes ran at ~03:56 and ~03:58 respectively, and the patches were applied during the debugging session that followed. The `sessionFile_exists: False` evidence for these probes reflects their pre-patch state. **These probes cannot be used to validate whether Patch B works.** A fresh post-patch probe has not been run.

**Impact**: The entire "persistent empty transcripts despite Patch B" narrative is based on stale evidence. We do not actually know if Patch B fires correctly for new runs.

---

### H5 — `runtimeMs` negative bug: fallback `startedAt` overwrite (CONFIRMED mechanism)

**File**: `subagent-registry-CkqrXKq4.js:21829`
```js
// UNCLAMPED:
(entry.endedAt ?? now) - (entry.startedAt ?? entry.createdAt)
```

For gemini probe `f13c6d2d` (timeout): when gemini-cli times out, `runWithModelFallback` triggers `blockrun/auto` as fallback. The fallback emits a new lifecycle `phase: "start"` event which overwrites `entry.startedAt` with the fallback's start timestamp. `entry.endedAt` was already set from the original run's `phase: "end"` event. So `endedAt < startedAt` → negative `runtimeMs`.

The clamped path in `buildCompactAnnounceStatsLine` at line 84302 uses `Math.max(0, ...)` and is fine. Only the `subagents list` display path at 21829 is unclamped.
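
The overwrite sequence described for probe `f13c6d2d` can be reproduced with made-up timestamps. The entry shape and the values below are illustrative assumptions; only the subtraction expression is taken from the dist.

```javascript
// Same expression as the unclamped display path.
function runtimeUnclamped(entry, now) {
  return (entry.endedAt ?? now) - (entry.startedAt ?? entry.createdAt);
}

// Same expression wrapped the way the clamped announce path does it.
function runtimeClamped(entry, now) {
  return Math.max(0, runtimeUnclamped(entry, now));
}

// Made-up timestamps in milliseconds.
const entry = { createdAt: 1000, startedAt: 1000 };
entry.endedAt = 31000;   // original run times out: phase "end" sets endedAt
entry.startedAt = 31500; // fallback emits phase "start" and overwrites startedAt

console.log(runtimeUnclamped(entry, 40000)); // -500
console.log(runtimeClamped(entry, 40000));   // 0
```

Note that clamping only hides the negative value: the displayed runtime becomes 0 rather than the real duration, because the original `startedAt` has already been lost by the time the subtraction runs.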

---

## 2. Verification Plan

**Step 1 — Instrument Patch B to surface the error:**

```bash
# Find exact line of the catch block
grep -n "callGatewayTool.*chat.inject" /usr/lib/node_modules/openclaw/dist/subagent-registry-CkqrXKq4.js
```

Edit line ~39821 from:
```js
} catch {} }
```
to:
```js
} catch(e) { console.error("[patch-b-inject] chat.inject failed:", String(e)); } }
```

Then restart the gateway:
```bash
# Use config.apply RPC to graceful-restart (do NOT use CLI stop/start)
curl -s -X POST http://localhost:18789 \
  -H "Content-Type: application/json" \
  -d '{"method":"config.apply","params":{}}'
```

**Step 2 — Run a fresh CLI probe and tail the gateway log:**

```bash
# Watch for the error line in real-time
journalctl -f -u openclaw 2>/dev/null || tail -f /var/log/openclaw/gateway.log &

# Run a minimal probe via the CLI backend
# (use whatever mechanism triggers a CLI backend subagent run)
```

Expected output (the exact wording depends on what the gateway actually returns; these are the anticipated shapes):

- H2 (token auth): `[patch-b-inject] chat.inject failed: Error: 401 Unauthorized`
- H3 (session not found): `[patch-b-inject] chat.inject failed: Error: session not found`
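
When scanning a longer log, a small classifier can map each instrumented line to the hypothesis it supports. This is a sketch only: `classifyInjectFailure` is a hypothetical helper, and the substrings it matches are the anticipated messages from this plan, not confirmed gateway output.

```javascript
// Maps an instrumented log line to the hypothesis it supports.
// Matched substrings are anticipated messages, not verified gateway strings.
function classifyInjectFailure(message) {
  const text = String(message).toLowerCase();
  if (text.includes("401") || text.includes("unauthorized")) return "H2";
  if (text.includes("session not found")) return "H3";
  return "unknown";
}

console.log(classifyInjectFailure("[patch-b-inject] chat.inject failed: Error: 401 Unauthorized")); // "H2"
console.log(classifyInjectFailure("[patch-b-inject] chat.inject failed: Error: session not found")); // "H3"
```

An `"unknown"` result means neither hypothesis is supported by that line, and the raw message should be read directly.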

**Step 3 — Verify token resolution path:**

```bash
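# Check whether a gateway token is resolvable from config.
# Caveats: this runs in a separate node process, not inside the gateway, and it
# assumes the bundle exports loadConfig and can be loaded with require().
# Verify both against the dist before relying on the result.
node -e "
const {loadConfig} = require('/usr/lib/node_modules/openclaw/dist/subagent-registry-CkqrXKq4.js');
const cfg = loadConfig();
console.log('token:', cfg?.gateway?.token ? 'PRESENT' : 'MISSING');
"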
```

**Step 4 — Post-fix verification: Run new probes and check:**

```bash
# After applying the fix below, verify that fresh probes create JSONL files
# (newest first, so the latest probe is at the top):
ls -lt /home/ubuntu/.openclaw/agents/claude/sessions/*.jsonl | head -5

# Check that the last three entries in sessions.json record a sessionFile:
python3 -c "import json,sys; data=json.load(sys.stdin); \
  [print(k, v.get('sessionFile','NONE')) for k,v in list(data.items())[-3:]]" \
  < /home/ubuntu/.openclaw/agents/claude/sessions/sessions.json
```

**Step 5 — Verify `runtimeMs` fix:**

```bash
# Run a gemini probe that triggers timeout + fallback
# Then check subagents list output
openclaw subagents list 2>/dev/null | grep -E "runtimeMs|runtime"
# Expect: no negative values
```

---

## 3. Code-Level Fix Proposals

### Fix A — Expose errors in Patch B (prerequisite for all other fixes)

**File**: `subagent-registry-CkqrXKq4.js:39821`

```diff
- if (cliText && params.sessionKey) { try { await callGatewayTool("chat.inject", {}, {sessionKey: params.sessionKey, message: cliText}); } catch {} }
+ if (cliText && params.sessionKey) { try { await callGatewayTool("chat.inject", {}, {sessionKey: params.sessionKey, message: cliText}); } catch(e) { log$11.warn(`[cli-inject] chat.inject failed for ${params.sessionKey}: ${String(e)}`); } }
```

**Ordering constraint**: Apply before any other fix. This gives you diagnostic data immediately on next run.
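
`String(e)` keeps only the error name and message. If the thrown value carries extra fields, a slightly richer formatter may make the first instrumented run more informative. The helper below is a hypothetical sketch; the `code` and `status` properties are assumptions about what a gateway client error might carry, not something confirmed in the dist.

```javascript
// Hypothetical helper: summarize an unknown thrown value for a single log line.
function describeError(e) {
  if (!(e instanceof Error)) return `non-Error thrown: ${String(e)}`;
  const parts = [`${e.name}: ${e.message}`];
  if (e.code !== undefined) parts.push(`code=${e.code}`);       // e.g. Node system error codes
  if (e.status !== undefined) parts.push(`status=${e.status}`); // assumed HTTP status field
  if (e.cause !== undefined) parts.push(`cause=${String(e.cause)}`);
  return parts.join(" ");
}

const sample = new Error("connect failed");
sample.code = "ECONNREFUSED";
console.log(describeError(sample)); // "Error: connect failed code=ECONNREFUSED"
```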

---

### Fix B — Replace `callGatewayTool` with direct `appendAssistantMessageToSessionTranscript` (recommended primary fix)

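This bypasses the gateway HTTP round-trip entirely, so the auth token issue does not arise. `appendAssistantMessageToSessionTranscript` is already imported in `subagent-registry-CkqrXKq4.js` as `r` (line 29 import). Confirm which local binding name is in scope at the call site before applying; the diff below is written with the full function name.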

**File**: `subagent-registry-CkqrXKq4.js:39821`

```diff
- if (cliText && params.sessionKey) { try { await callGatewayTool("chat.inject", {}, {sessionKey: params.sessionKey, message: cliText}); } catch {} }
+ if (cliText && params.sessionKey) {
+   try {
+     await appendAssistantMessageToSessionTranscript({
+       agentId: params.agentId,
+       sessionKey: params.sessionKey,
+       text: cliText
+     });
+   } catch(e) {
+     log$11.warn(`[cli-inject] transcript write failed for ${params.sessionKey}: ${String(e)}`);
+   }
+ }
```

**Idempotency**: `appendAssistantMessageToSessionTranscript` writes a new message on each call, so a second call (e.g., due to a retry) produces a duplicate message. Guard: declare `let injected = false;` once in the enclosing closure scope (not next to the check, where it would be reset on every pass), then wrap the write in `if (!injected) { injected = true; ... }`.
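
A minimal sketch of that guard, assuming the write can be wrapped in a factory. `makeInjectOnce` and `writeTranscript` are hypothetical names; `writeTranscript` stands in for the real append call.

```javascript
// Returns a function that performs the transcript write at most once per closure.
// `writeTranscript` is a hypothetical stand-in for the real append call.
function makeInjectOnce(writeTranscript) {
  let injected = false; // lives in the closure, so it survives across calls
  return async function injectOnce(payload) {
    if (injected) return false; // already attempted; skip the duplicate write
    injected = true;            // set before awaiting so overlapping callers are blocked too
    await writeTranscript(payload);
    return true;
  };
}

// Usage with an in-memory stand-in for the transcript.
const written = [];
const injectOnce = makeInjectOnce(async (text) => { written.push(text); });
injectOnce("first").then(() => injectOnce("second")).then(() => {
  console.log(written); // [ 'first' ]
});
```

One design consequence: because the flag is set before the write resolves, a failed write is not retried within the same closure. That matches the "at most once" intent but means a failure still needs to be logged.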

**Ordering constraint**: Must run after `runCliAgent` completes and before `emitAgentEvent({phase: "end"})`. Current Patch B position satisfies this.

---

### Fix C — Use `callGatewayLeastPrivilege` instead of `callGatewayTool` (alternative if Fix B is not preferred)

If staying with the HTTP approach, `callGatewayLeastPrivilege` (imported from `./call-DaJKh-6e.js`) resolves the token from the running config internally. Replace:

```diff
- await callGatewayTool("chat.inject", {}, {sessionKey: params.sessionKey, message: cliText});
+ await callGatewayLeastPrivilege("chat.inject", {sessionKey: params.sessionKey, message: cliText});
```

Check the exact signature of `callGatewayLeastPrivilege` in the dist before applying — the param order may differ.

---

### Fix D — Clamp `runtimeMs` at line 21829 (runtimeMs negative bug)

**File**: `subagent-registry-CkqrXKq4.js:21829`

Find the exact pattern:
```
(entry.endedAt ?? now) - (entry.startedAt ?? entry.createdAt)
```

```diff
- (entry.endedAt ?? now) - (entry.startedAt ?? entry.createdAt)
+ Math.max(0, (entry.endedAt ?? now) - (entry.startedAt ?? entry.createdAt))
```

**Stable anchor for search**: This appears inside a `mapRuns` callback in the `subagents list` handler. The pattern `entry.endedAt ?? now` is unique enough to locate it. Verify only one occurrence before replacing.
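
The "verify only one occurrence before replacing" step can be enforced in code rather than by eye. The sketch below is a generic string helper (`replaceExactlyOnce` is a hypothetical name) that works on the file contents once they are read into a string; reading and writing the dist file is left out.

```javascript
// Replace `needle` in `source` only when it occurs exactly once; otherwise throw.
function replaceExactlyOnce(source, needle, replacement) {
  if (needle.length === 0) throw new Error("needle must not be empty");
  const count = source.split(needle).length - 1; // non-overlapping occurrences
  if (count !== 1) {
    throw new Error(`expected exactly 1 occurrence, found ${count}`);
  }
  // A function replacer avoids special treatment of "$" sequences in the replacement.
  return source.replace(needle, () => replacement);
}

const unclamped = "(entry.endedAt ?? now) - (entry.startedAt ?? entry.createdAt)";
const before = `const ms = ${unclamped};`;
console.log(replaceExactlyOnce(before, unclamped, `Math.max(0, ${unclamped})`));
// const ms = Math.max(0, (entry.endedAt ?? now) - (entry.startedAt ?? entry.createdAt));
```

Throwing on zero or multiple matches means a changed or duplicated anchor after a package update stops the patch instead of silently editing the wrong site.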

---

## 4. Dist/Minified Targeting Strategy

**Active files (confirmed by gateway process inspection):**
- `gateway-cli-CuFEx2ht.js` — loaded via `register.subclis-GK5pfSQG.js`
- `subagent-registry-CkqrXKq4.js` — contains all functions to patch

**Inactive (do NOT re-patch):**
- `gateway-cli-vk3t7zJU.js` — has backup `.bak`, not loaded
- `reply-DhtejUNZ.js` — has backup `.bak`, not loaded

**Finding anchors in minified code:**

For any target pattern, use the following stable anchors:

```bash
# Fixes A-C: find the chat.inject call site that the patch replaces:
grep -n "chat.inject.*sessionKey.*params.sessionKey" \
  /usr/lib/node_modules/openclaw/dist/subagent-registry-CkqrXKq4.js

# Fix D — find unclamped runtimeMs:
grep -n "endedAt.*now.*startedAt.*createdAt" \
  /usr/lib/node_modules/openclaw/dist/subagent-registry-CkqrXKq4.js

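# Verify createIfMissing: true is in place (Patch A).
# grep -c counts matching lines; -o with wc -l counts every occurrence,
# including several on one line, with or without the space after the colon:
grep -oE "createIfMissing: ?true" \
  /usr/lib/node_modules/openclaw/dist/gateway-cli-CuFEx2ht.js | wc -l
# Expected: 3 (2 original + 1 patched)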
```

**Re-apply instructions after an openclaw package update:**

1. Check the file hashes first: if either the `CuFEx2ht` or the `CkqrXKq4` suffix has changed, the bundle was rebuilt and the patches are gone
2. Re-verify anchors with the grep commands above
3. Apply patches via `sudo python3 -c "..."` (node_modules requires root)
4. Use `config.apply` RPC to graceful-restart — never `openclaw gateway stop`

**Pre-patch backup command:**
```bash
sudo cp /usr/lib/node_modules/openclaw/dist/subagent-registry-CkqrXKq4.js \
  /usr/lib/node_modules/openclaw/dist/subagent-registry-CkqrXKq4.js.bak.$(date +%s)
```

---

## 5. Test Matrix + Definition of Done

| Probe | Provider | Expected post-fix | Verification |
|---|---|---|---|
| fresh-claude-1 | claude-sonnet-openclaw | `sessionFile_exists: True`, JSONL has assistant message, `sessions_history` non-empty | `cat sessions.json \| python3 -c "..."` + read JSONL |
| fresh-gemini-1 | gemini-cli | `sessionFile_exists: True`, JSONL has assistant message | Same |
| fresh-kimi-1 | kimi (if configured) | `sessionFile_exists: True` | Same |
| fresh-deepseek-1 | deepseek-code | `sessionFile_exists: True` | Same |
| gemini-timeout-1 | gemini-cli (short timeout) | Fallback fires, JSONL contains fallback output OR original output, `runtimeMs >= 0` | `subagents list \| grep runtimeMs` |
| empty-output-1 | any CLI backend with empty stdout | No crash, `sessions_history: []` is acceptable, no error in logs | Gateway log shows no unhandled exception |

**Definition of Done:**

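1. No `[patch-b-inject]` or `[cli-inject]` failure lines appear in the gateway log for fresh CLI subagent runs (both tags are emitted only from `catch` blocks, so their presence means the write failed)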
2. All "fresh" probes in the matrix above have `sessionFile_exists: True`
3. `readLatestSubagentOutput(sessionKey)` returns non-empty string for all passing probes
4. `runSubagentAnnounceFlow` produces a non-`"(no output)"` findings string in the announce message
5. `subagents list` shows `runtimeMs >= 0` for all rows (zero is allowed for instant runs)
6. No regression: embedded PI agents (blockrun/auto) still deliver results normally
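
Items 2, 3, and 5 lend themselves to a mechanical check per probe. The sketch below is hypothetical: `checkProbe` and the field names `sessionFileExists`, `latestOutput`, and `runtimeMs` are illustrative and do not correspond to a known openclaw data structure.

```javascript
// Hypothetical per-probe check for the "fresh" probes in the matrix.
// Returns a list of failed criteria; an empty list means the probe passes.
function checkProbe(result) {
  const failures = [];
  if (!result.sessionFileExists) failures.push("sessionFile missing");
  if (!result.latestOutput || result.latestOutput === "(no output)") {
    failures.push("no subagent output");
  }
  // Written as !(x >= 0) so that undefined and NaN also count as failures.
  if (!(result.runtimeMs >= 0)) failures.push("runtimeMs negative or not a number");
  return failures;
}

console.log(checkProbe({ sessionFileExists: true, latestOutput: "done", runtimeMs: 0 })); // []
console.log(checkProbe({ sessionFileExists: false, latestOutput: "", runtimeMs: -500 }));
// [ 'sessionFile missing', 'no subagent output', 'runtimeMs negative or not a number' ]
```

The `empty-output-1` probe is deliberately out of scope for this check, since an empty history is acceptable there.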

---

## 6. Rollback + Risk Notes

**Rollback procedure:**

```bash
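DIST=/usr/lib/node_modules/openclaw/dist

# The pre-patch backup command writes timestamped names (.bak.<epoch>), so list
# the available backups and pick the one taken before the patch being rolled back:
ls -lt "$DIST"/subagent-registry-CkqrXKq4.js.bak* "$DIST"/gateway-cli-CuFEx2ht.js.bak*
REGISTRY_BAK="$DIST/subagent-registry-CkqrXKq4.js.bak"  # adjust to the chosen backup
GATEWAY_BAK="$DIST/gateway-cli-CuFEx2ht.js.bak"         # adjust to the chosen backup

# Restore subagent-registry:
sudo cp "$REGISTRY_BAK" "$DIST/subagent-registry-CkqrXKq4.js"

# Restore gateway-cli (if Patch A needs rollback):
sudo cp "$GATEWAY_BAK" "$DIST/gateway-cli-CuFEx2ht.js"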

# Graceful restart:
curl -s -X POST http://localhost:18789 \
  -H "Content-Type: application/json" \
  -d '{"method":"config.apply","params":{}}'
```

**Risk: Double-write if retry occurs**
If `runAgentTurnWithFallback` is retried (transient HTTP error path at `TRANSIENT_HTTP_RETRY_DELAY_MS`), the CLI agent may execute again, and the `appendAssistantMessageToSessionTranscript` call would then append a second message to the JSONL. Mitigation: the `injected` flag pattern described in Fix B prevents duplicates within a single closure invocation. Across retries, the second run writes its own (accurate) output, which is acceptable.

**Risk: `createIfMissing: true` JSONL creation races with session archival**
If the session is being archived/deleted concurrently, a new JSONL might be created at the old path and then orphaned. Low probability in normal operation; acceptable risk for a debugging fix.

**Risk: Gateway restart timing**
`config.apply` triggers a graceful restart. There is a window (~500ms) where in-flight subagent runs may lose their lifecycle "end" event. If a run is in progress when restart fires, its JSONL write will still complete (write happens before lifecycle end emission), but the announce flow may not see it. Schedule restarts only when `subagents list` shows no active runs.

**Monitoring signals to watch:**
- `[patch-b-inject]` or `[cli-inject]` in gateway log: the inject or transcript write threw; the message carries the failure reason
- `sessionFile_exists: True` in post-run session inspection: confirms JSONL written
- `sessions_history` non-empty in subagent result: confirms end-to-end delivery
- `runtimeMs < 0` in `subagents list` output: indicates Fix D was not applied or has regressed
