Codex Desktop App

Apache-2.0

beta

Structural discipline for agentic development.

A Codex Desktop App plugin that orchestrates multi-stage agentic work through a manifest-driven pipeline. Human-approval gates where decisions actually live. Instrumented audit stages that catch the drift class which silently ships the wrong product.


Fig. 01 | Pipeline schematic v0.9.1 | feature.yaml

Eleven stages | four phases | three human gates | shared stop validator | optional lifecycle hooks

A run, plus the evidence-bound stop-decision and hook guardrail control plane.

Agent stage: Isolated subagent. Role brief plus manifest and prior artifacts. Writes one artifact, exits.

Automated: Deterministic check. Exit 0 advances; non-zero halts.

Human gate: Explicit APPROVE or describe what should change. No autonomous-mode-by-stealth.

New at v0.5: Drift-detector, critique, auto-promote. Single-AI hardening.

Control plane: Final-response gate, decision gate, decision ledger, and navigator. Stops require evidence.
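
The Automated rule in the legend (exit 0 advances, non-zero halts) can be sketched as a small runner. This is a minimal sketch under stated assumptions, not the plugin's implementation: `run_automated_stage`, `run_checks`, and the example commands are hypothetical names chosen for illustration.

```python
import subprocess
import sys


def run_automated_stage(command: list[str]) -> bool:
    """Run one deterministic check; True means the pipeline may advance."""
    result = subprocess.run(command, capture_output=True, text=True)
    # Exit 0 advances; any non-zero exit code halts the run.
    return result.returncode == 0


def run_checks(commands: list[list[str]]) -> int:
    """Run checks in order; return the index of the first failure, or -1."""
    for index, command in enumerate(commands):
        if not run_automated_stage(command):
            return index  # halt here: later checks are never started
    return -1


if __name__ == "__main__":
    # Hypothetical checks: each one is just a Python process with a fixed exit code.
    checks = [
        [sys.executable, "-c", "raise SystemExit(0)"],  # passes
        [sys.executable, "-c", "raise SystemExit(3)"],  # halts the run
        [sys.executable, "-c", "raise SystemExit(0)"],  # never reached
    ]
    print(run_checks(checks))  # prints 1
```

The point of the design is that the check has no opinion: only the exit code decides, so an agent cannot talk its way past a failing stage.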

Section 01 | Why this exists

Agentic work fails in predictable ways.

The agent improvises past the spec. It claims tests pass without running them on a fresh dependency set. It merges in-flight work while a scope question is still open. It picks architectural decisions silently rather than surfacing them for review.

agent-pipeline-codex enforces a structural pattern that catches every one of those - not by hoping the agent is well-behaved, but by making the well-behaved path the only path. Every stage produces a durable artifact under .agent-runs/<run-id>/; every gate produces a one-question prompt that cannot be bypassed.

i. Manifest gate - every run begins with a human-approved contract naming goal, allowed paths, forbidden paths, non-goals, expected outputs, and definition-of-done. Fuzzy manifests fail strict schema validation at the gate, before they cascade.
ii. Director-decisions - the researcher surfaces open questions; the human picks; choices are recorded as binding constraints before the planner runs. Architectural decisions never happen in chat.
iii. Standing invariants - doc-currency invariants, checked by the drift-detector on every run, close the cumulative-drift gap that lets feature-scoped manifests silently ship stale top-of-file content.
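
To make "fuzzy manifests fail strict schema validation" concrete, here is a minimal sketch of a strict check over the six contract fields named above. The key spellings (apart from `allowed_paths`), the field types, and the non-empty rule are assumptions for illustration; the plugin's real schema may differ.

```python
# Assumed key names and types for the six contract fields named in the text.
REQUIRED_FIELDS = {
    "goal": str,
    "allowed_paths": list,
    "forbidden_paths": list,
    "non_goals": list,
    "expected_outputs": list,
    "definition_of_done": list,
}

# Assumption: these fields must not be empty; the other two may be.
NON_EMPTY = {"goal", "allowed_paths", "expected_outputs", "definition_of_done"}


def validate_manifest(manifest: dict) -> list[str]:
    """Return a list of problems; an empty list means the manifest passes."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in manifest:
            problems.append(f"missing field: {name}")
        elif not isinstance(manifest[name], expected_type):
            problems.append(f"wrong type for field: {name}")
        elif name in NON_EMPTY and not manifest[name]:
            problems.append(f"empty field: {name}")
    # Strict mode: keys outside the contract are rejected, not ignored.
    for name in manifest:
        if name not in REQUIRED_FIELDS:
            problems.append(f"unknown field: {name}")
    return problems
```

A manifest with an empty goal and a stray `notes` key would come back with one problem per defect, which is the kind of one-question feedback a gate can show before any stage runs.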

Section 02 | Install in thirty seconds

One command. Then orient. Then run.

Install once across all projects. From any project root - empty directory, fresh clone, or a working repo - run the pipeline-init skill. It asks one question (PRD path, repo URL, or description paragraph), produces a project-orientation summary, and scaffolds .pipelines/, scripts/policy/, and a starter AGENTS.md if one is not present.

# Install once, across all Codex Desktop App projects
> git clone https://github.com/scottconverse/agent-pipeline-codex.git ~/agent-pipeline-codex-plugin
> python scripts/verify_plugin_release.py --live
# The live gate checks deterministic install state, then repeats fresh Codex probes.

# Then in any project root
> Use pipeline-init for this project.
> Use intake for <plain-English task>.
> Use new-run for feature <task-slug>.
>  ... fill in the manifest the plugin scaffolds ...
> Use validate-manifest for <run-id>.
> Use run-pipeline for feature <run-id>.
> Use show-run-status for <run-id>.

Codex skills

Eight Codex skills cover the full surface: agent-pipeline orients and routes; pipeline-init onboards a project; intake drafts starting artifacts from a plain-English task; new-run initializes a blank run manifest; validate-manifest preflights the manifest schema; run-pipeline orchestrates the eleven stages with resume-from-log; show-run-status gives read-only run orientation; and audit-init scaffolds the v0.3 dual-AI audit-handoff discipline.

Subagent isolation

Each agent stage runs as an isolated Codex subagent with no parent-conversation memory. The orchestrator passes a role file plus run context as the entire prompt. The judge layer (v0.4 opt-in) intercepts proposed executor tool calls and routes high-risk actions through a context-isolated judge subagent before they execute.

Section 03 | Release history

Each minor release adds a layer the previous version did not catch.

The layers stack. Every version's failure-mode coverage is preserved on upgrade.

v0.2 | Module-release

Six-phase release pipeline.

Phase 0 audits the release workflow before any product code is touched. Phase 2 rehearses the release sequence locally on fresh state. The CI workflow becomes the execution mechanism, not the discovery mechanism.

Catches: Execution-cascade failures - pre-existing CI bugs, tag-move dances, halt-and-ask loops.

v0.3 | Audit-handoff

Dual-AI discipline.

audit-init scaffolds the three-artifact discipline for projects where one AI implements and another audits. Implementer runs a hostile 5-lens self-audit before push; auditor runs a 10-section verification protocol; both share an in-repo drift-patterns catalog that grows over time.

Catches: Drift failures - wrong endpoint, stale CHANGELOG, "Closed" without evidence, status-word abuse.

v0.4 | Judge layer

Real-time action supervision.

Opt-in. Every executor tool call is classified by risk (read_only / reversible_write / external_facing / high_risk). Dangerous actions route to a context-isolated judge subagent with four verdicts - allow, block, revise, escalate.

Catches: Unauthorized actions in real time - destructive commands, external writes, force pushes, credential-touching operations.
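
The classify-then-judge flow of the v0.4 judge layer can be sketched as follows. The four risk classes and four verdicts come from the description above; everything else is an assumption for illustration: the keyword rules in `classify`, the choice of which classes are routed to the judge, and the `strict_judge` stub standing in for the context-isolated subagent.

```python
from enum import Enum


class Risk(Enum):
    READ_ONLY = "read_only"
    REVERSIBLE_WRITE = "reversible_write"
    EXTERNAL_FACING = "external_facing"
    HIGH_RISK = "high_risk"


class Verdict(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    REVISE = "revise"
    ESCALATE = "escalate"


# Assumption: which classes count as "dangerous" is a guess for illustration.
ROUTED_TO_JUDGE = {Risk.EXTERNAL_FACING, Risk.HIGH_RISK}


def classify(command: str) -> Risk:
    """Toy classifier; these keyword rules are hypothetical."""
    tokens = command.split()
    if "--force" in tokens or tokens[:2] == ["rm", "-rf"]:
        return Risk.HIGH_RISK
    if tokens[:2] == ["git", "push"]:
        return Risk.EXTERNAL_FACING
    if tokens[:1] in (["cat"], ["ls"]) or tokens[:2] == ["git", "status"]:
        return Risk.READ_ONLY
    return Risk.REVERSIBLE_WRITE


def strict_judge(command: str, risk: Risk) -> Verdict:
    """Stand-in for the judge subagent: block high risk, escalate the rest."""
    return Verdict.BLOCK if risk is Risk.HIGH_RISK else Verdict.ESCALATE


def supervise(command: str, judge=strict_judge) -> Verdict:
    """Allow low-risk calls directly; ask the judge about everything else."""
    risk = classify(command)
    if risk not in ROUTED_TO_JUDGE:
        return Verdict.ALLOW
    return judge(command, risk)
```

Keeping classification separate from judgment means the cheap path stays cheap: only calls in the routed classes pay for a judge round trip.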

v0.5 | Single-AI hardened

Three new stages, one structural backstop.

Critic reads every artifact adversarially in fresh context. Drift-detector compares manifest contract to assembled state and, at v0.5.1, enforces standing doc-currency invariants regardless of manifest scope. Auto-promote scores six structural conditions and collapses the manager gate when clean. Pre-edit fact-forcing in the executor catches blast-radius surprises before they hit the verifier.

Catches: The drift class without needing a second AI - durable doc drift, cross-file inconsistency, version strings out of sync, status-word abuse.

v0.5.8 | Status polish

The install surface has one truth.

Run status now reports skipped malformed log lines, decision ledgers are tested through the production writer and validator together, git classification rules have focused negative cases, and public CI badges make source-only verification visible.

Fixes: Helper drift, brittle parser confidence, stale standalone skill ambiguity, and incomplete restart prompts.

v0.5.9 | Canonical rung lock

The next rung is no longer inferred.

Every product run now carries a scope lock tied to the canonical release plan. Policy checks block missing locks, future-rung paths, contradictory docs, and commit subjects that do not belong to the locked rung. Prompt-plan conflicts stop before edits with SCOPE_CONFLICT.

Fixes: The agent working confidently on the wrong release rung because user wording and the canonical plan diverged.

v0.6.0 | Directive contracts

Directive-conformant gates can auto-fire without training reflexive approval.

A run-local directive binds exact manifest and scope-lock content, plan assertions, manager assertions, author provenance, and a SHA-256 hash into run.log; mismatch or tampering falls back to human review.

Fixes: Stale active-control-state files and plausible blocker text becoming escape hatches during authorized work.
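
The hash binding behind directive contracts can be illustrated with a short sketch: hash the exact manifest and scope-lock content, record the digest, and fall back to human review on any mismatch. The function names, the length-prefix framing, and the `"auto_fire"` / `"human_review"` labels are assumptions; the real directive also binds plan assertions, manager assertions, and author provenance, which this sketch omits.

```python
import hashlib
import hmac


def directive_digest(manifest_text: str, scope_lock_text: str) -> str:
    """SHA-256 over the exact content of both files; any byte change alters it."""
    hasher = hashlib.sha256()
    for part in (manifest_text, scope_lock_text):
        data = part.encode("utf-8")
        # Length-prefix each part so content cannot shift between the two files.
        hasher.update(len(data).to_bytes(8, "big"))
        hasher.update(data)
    return hasher.hexdigest()


def gate_mode(recorded_digest: str, manifest_text: str, scope_lock_text: str) -> str:
    """Auto-fire only when current content still matches the recorded digest."""
    current = directive_digest(manifest_text, scope_lock_text)
    if hmac.compare_digest(recorded_digest, current):
        return "auto_fire"
    # Mismatch or tampering: the gate goes back to a human.
    return "human_review"
```

For example, recording the digest at approval time and then editing one character of the manifest would flip the gate from `auto_fire` to `human_review` on the next check.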

v0.7.0 | Hooked autonomy

Pipeline discipline moves into Codex lifecycle events.

Optional plugin hooks add active-run context, warn on stale skill names, preflight risky tools, deny unsafe approval requests, add corrective context after failed tools, and continue past invalid stops when plugin_hooks is enabled.

Fixes: Long sessions where the model remembers the plan but forgets the runtime stop and safety gates.

v0.8.0 | Intake drafting

Plain-English work requests become draft run artifacts without starting execution.

The new intake skill writes intake.md, draft manifest.yaml, draft scope-lock.yaml, and missing-question notes, then stops before validation or agent work.

Fixes: The awkward gap between "I know what I want" and "I have a valid manifest" without weakening manifest approval.

v0.9.0 | Persistent memory

Trusted hooks write a local handoff that the next session can wake up with.

Active runs now keep memory/events.jsonl, turns.jsonl, decisions.jsonl, open_loops.jsonl, memory_probe.log, and handoff_current.md. SessionStart injects the compact handoff beside active-run context.

Fixes: Session restarts and compaction losing recent warnings, open loops, and resume intent.
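
One way to picture the append-only memory files and the compact handoff is the sketch below. The file names `events.jsonl` and `handoff_current.md` come from the release note; the event fields (`kind`, `text`), the helper names, and the handoff layout are assumptions, not the plugin's actual format.

```python
import json
import tempfile
from pathlib import Path


def append_event(memory_dir: Path, kind: str, text: str) -> None:
    """Append one event as a JSON line; earlier lines are never rewritten."""
    memory_dir.mkdir(parents=True, exist_ok=True)
    with (memory_dir / "events.jsonl").open("a", encoding="utf-8") as handle:
        handle.write(json.dumps({"kind": kind, "text": text}) + "\n")


def write_handoff(memory_dir: Path, limit: int = 5) -> str:
    """Summarise the most recent events into handoff_current.md."""
    lines = (memory_dir / "events.jsonl").read_text(encoding="utf-8").splitlines()
    recent = [json.loads(line) for line in lines[-limit:]]
    body = "\n".join(f"- [{event['kind']}] {event['text']}" for event in recent)
    (memory_dir / "handoff_current.md").write_text(body + "\n", encoding="utf-8")
    return body


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        memory = Path(tmp) / "memory"
        append_event(memory, "warning", "verifier diverged from report")
        append_event(memory, "open_loop", "confirm scope question 2")
        print(write_handoff(memory))
```

An append-only log plus a small regenerated summary keeps the full history on disk while the next session only has to read a few lines.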

v0.9.1 | Baseline isolation | Current

Release rehearsal failures get checked against baseline before deep debugging.

local-rehearsal.md Step 2.5 requires a branch-vs-merge-base rerun when verify-release.sh fails or hangs. verifier.md points to the same rule when verifier tests diverge from the implementation report.

Fixes: Pre-existing main-branch flakes being mistaken for branch regressions and sending agents into the wrong debugging path.
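
The branch-vs-merge-base rule reduces to a small decision: rerun the same check at the merge base, then compare outcomes. The sketch below assumes the base branch is `main` and uses hypothetical helper names and labels; it does not reproduce local-rehearsal.md or verify-release.sh.

```python
import subprocess


def merge_base(base_branch: str = "main") -> str:
    """Return the commit where the current branch forked from base_branch."""
    result = subprocess.run(
        ["git", "merge-base", base_branch, "HEAD"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def classify_failure(branch_passed: bool, baseline_passed: bool) -> str:
    """Decide where to look, given results on the branch and at the merge base."""
    if branch_passed:
        return "no_failure"
    if baseline_passed:
        # Baseline is green and the branch is red: the branch introduced it.
        return "branch_regression"
    # Both are red: the failure predates the branch, so debug the baseline first.
    return "pre_existing_failure"
```

A hang can be treated as a failure for this purpose, for instance by running the check with a timeout and counting a timeout as `False`.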

Section 04 | Negative space

What the plugin will not do - by construction.

The contract is enforced as much by what is forbidden as by what is required. The hard rules below are baked into the role files, the policy stage, and the orchestrator. They are not toggles.

Propose autonomous mode. Every gate is explicit.
Silently expand scope. The policy stage blocks every change outside allowed_paths.
Skip tests. "Never skip tests" is enforced as a project-level hard rule.
Promote a run when the verifier marked any criterion as not met.
Merge in-flight work while a halt is active - including cleanup PRs that "seem obviously safe."
Stop, defer, skip push, skip CI, or write a stopping handoff because of an unverified risk.
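
The scope rule in the list above (every change outside allowed_paths is blocked) can be sketched as a pure path check. This is an illustration under assumptions, not the policy stage itself: the function names are hypothetical, paths are treated as repo-relative POSIX paths, forbidden paths are assumed to win over allowed ones, and any path containing `..` is rejected outright.

```python
from pathlib import PurePosixPath


def is_within(path: str, roots: list[str]) -> bool:
    """True when path equals one of the roots or sits underneath one."""
    candidate = PurePosixPath(path)
    ancestors = (candidate, *candidate.parents)
    return any(PurePosixPath(root) in ancestors for root in roots)


def out_of_scope(
    changed: list[str], allowed_paths: list[str], forbidden_paths: list[str]
) -> list[str]:
    """Return the changed files that the policy check would block."""
    violations = []
    for path in changed:
        if ".." in PurePosixPath(path).parts:
            violations.append(path)  # refuse traversal rather than resolve it
        elif is_within(path, forbidden_paths) or not is_within(path, allowed_paths):
            violations.append(path)
    return violations
```

With `allowed_paths=["src"]` and `forbidden_paths=["src/secrets"]`, a change to `src/app.py` passes while `docs/readme.md` and `src/secrets/key.pem` are both reported.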

Section 05 | Lessons baked in

Defaults reflect failures, not preferences.

Every default in this plugin earned its place by being the fix for a failure that cost real time on a real project. The following are notable enough to call out.

Halts apply to everything

When the orchestrator stops, no other repo state changes happen - including in-flight cleanup PRs. Auto mode never overrides explicit stops.

Local pass is not evidence

Tests passing on the developer's machine are not proof. CI on a fresh dependency set, in a clean container, is.

Amendments are corrections

A manifest amendment fixes a wording defect. It does not expand scope. If the scope grew, that is a new run, not a manifest edit.

Manager cites, never summarises

The final decision quotes the verifier verbatim. Summarisation and encouragement are how bad runs get promoted, and both are forbidden in the role file.

Decisions are written

Director decisions belong in director-decisions.md, not in chat. The researcher surfaces them; the human picks in writing.

Standing invariants beat memory

v0.5.1's doc-currency invariants enforce inventory accuracy on every run because cumulative drift was the failure mode v0.2 through v0.5 silently accumulated.

Installs need runtime proof

A plugin release is not installed until verify_plugin_release.py --live proves a fresh Codex process sees the plugin and the namespaced agent-pipeline-codex:* skills.

Stability

Beta. The structural pattern has shipped across multiple projects and absorbed real failure receipts into its defaults. Semver applies once 0.1.x-beta drops.
