Design Specification

Sandbox. Memory. Self-Healing. Self-Learning.

Four coordinated capabilities extending the bees domain — giving Worker Bees a safe execution environment, human-readable memory, resilient failure recovery, and autonomous skill authoring.

4 Core Capabilities · 4 Feature Flags · 4 Sequenced PRs · ~$8.80 per 1k Tasks

Four Capabilities, One Integrated System

Each capability ships behind an independent per-hive feature flag. All four are designed to coexist without disrupting existing memory, circuit-breaker, or tool-retry systems.

01

Per-Hive Daytona Sandbox

A persistent execution environment mirroring the main agent's Daytona pattern. Each hive gets its own isolated container with a persistent volume at /workspace/hive/.

2 CPU / 4 GB RAM · 5 GB Disk · 300s Auto-stop
sandbox_enabled
02

Markdown File Memory Layer

A human-readable procedural knowledge layer using MEMORY.md, daily trajectory logs, and a skills library. Complements — never replaces — the existing MemoryService.

MEMORY.md · Daily Logs · Skills Library
file_memory_enabled
03

Three-Tier Self-Healing

A supervisor wrapping Worker execution: mechanical rules → single Reflexion pass → failure note. Hard cost cap of 1 Haiku call per task. Wraps, never replaces, existing resilience systems.

Tier 1: Rules · Tier 2: Reflexion · Tier 3: Note
self_healing_enabled
04

Voyager-Style Skill Library

Workers autonomously author markdown playbooks (plus optional sandboxed Python) on successful tasks and self-improve them on re-use. No RL or fine-tuning required.

Auto-Author · Self-Revise · 100-Skill Cap
skill_authoring_enabled

Architecture Overview

QueenOrchestrator
Haiku classify → Kimi K2.5 plan → dispatch
↓ SwarmTasks via Dramatiq
NEW
HealingSupervisor
Wraps swarm_jobs.run_task · Three-tier retry/reflect/note
↓ Orchestrates
NEW
BeesSandboxService
Daytona lifecycle · per-hive
EXTENDED
ContextManager
build_extended() · skills + memory_md
NEW
SkillLibraryService
search · author · revise · quarantine
↓ Reads/Writes
NEW
HiveWorkspace
MEMORY.md · memory/ · skills/
MemoryService
pgvector · 6 types · unchanged semantics
NEW
MemoryEventLog
bees.memory_events · append-only
Legend: New component · Extended · Unchanged

Explicit Non-Goals (v1)

No replacement of existing MemoryService, CircuitBreaker, or tool-retry
No RL fine-tuning, LoRA training, or prompt-gradient optimization
No multi-sandbox-provider abstraction (Daytona directly coupled)
No UI for skill browsing or editing (APIs only)
No plan-level replanning on failure (Queen's plan unchanged)
No dual-write of memory to files during v1

Per-Hive Daytona Sandbox

Each hive gets exactly one Daytona sandbox with a persistent volume. The sandbox mirrors the main agent's pattern with bees-specific configuration for longer idle times and conversational workloads.

Isolation Hierarchy

Hive
exactly ONE Daytona volume · never shared across hives
📦
Active Sandbox (at most one per hive)
2 CPU · 4 GB RAM · 5 GB disk · auto-stop 300s
📁
/workspace/hive/
MEMORY.md · skills/ · memory/ · .hive.toml

Sandbox Lifecycle

None (no sandbox_metadata) → [create + volume] → Running (active container) → [300s idle] → Stopped (~2-3s resume) → [1800s] → Archived (~10-15s unarchive); get_or_start() resumes or unarchives back to Running.
Volume retained 90 days after last unarchive. Redis lock (TTL 30s) serializes creation — at most one active sandbox per hive.
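The lock-serialized acquisition can be sketched as below. This is an illustrative sketch, not the real BeesSandboxService: the in-memory FakeRedis stands in for Redis SET NX EX so the logic runs anywhere, and the sandbox record shape is an assumption.

```python
# Hedged sketch of Redis-locked, idempotent get_or_start with the auth
# check described in the Security Model. FakeRedis mimics SET NX EX plus
# compare-and-delete; swap in a real Redis client in production.
import time
import uuid

LOCK_TTL_S = 30  # matches the spec's 30s creation lock


class FakeRedis:
    """Minimal in-process stand-in for the Redis lock primitives."""
    def __init__(self):
        self._store = {}

    def set_nx_ex(self, key, value, ttl_s):
        now = time.monotonic()
        cur = self._store.get(key)
        if cur and cur[1] > now:       # key held and not yet expired
            return False
        self._store[key] = (value, now + ttl_s)
        return True

    def delete_if_owner(self, key, value):
        cur = self._store.get(key)
        if cur and cur[0] == value:    # only the lock holder may release
            del self._store[key]


class BeesSandboxService:
    def __init__(self, redis):
        self._redis = redis
        self._sandboxes = {}  # hive_id -> sandbox record

    def get_or_start(self, hive_id, account_id):
        sb = self._sandboxes.get(hive_id)
        if sb:
            # Auth on every call: account_id must match the hive's labels.
            if sb["account_id"] != account_id:
                raise PermissionError("hive/account mismatch")
            return sb
        token = uuid.uuid4().hex
        key = f"bees:sandbox:lock:{hive_id}"
        if not self._redis.set_nx_ex(key, token, LOCK_TTL_S):
            raise RuntimeError("creation already in progress; retry")
        try:
            # Re-check under the lock: another worker may have won earlier.
            sb = self._sandboxes.get(hive_id)
            if sb is None:
                # Real code: create Daytona container + attach volume here.
                sb = {"hive_id": hive_id, "account_id": account_id,
                      "state": "running"}
                self._sandboxes[hive_id] = sb
            return sb
        finally:
            self._redis.delete_if_owner(key, token)
```

Re-checking the registry inside the lock is what makes the call idempotent: two concurrent Dramatiq jobs on the same hive converge on the same sandbox record.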

Sandbox Tools

$_
SandboxShellTool
tmux-backed long-running shell commands
sandbox_enabled
📄
SandboxFilesTool
read / write / list / delete under /workspace/
sandbox_enabled
🐍
SandboxPythonTool
python -c or python <path>; captures stdout/stderr/exit
sandbox_enabled
🌐
SandboxBrowserTool
Playwright: navigate, wait, screenshot, extract_text, click
sandbox_enabled + browser role
SkillExecutorTool
runs skills/<slug>.py with JSON args · 60s timeout · 1MB output cap
skill_authoring_enabled

Security Model

🔐
Auth on Every Call
Labels + account_id check on every get_or_start. Mismatch raises PermissionError.
🏖️
Sandboxed Python
Agent-authored skills/*.py run ONLY inside sandbox — never in the backend process.
🔑
Credential Safety
Credentials fetched per-tool-call from credential-profile lookup. Never written to sandbox filesystem.
🚫
Default-Deny Egress
Inherits main agent's default-deny egress policy. New destinations require Change Management RFC.
🔒
AES-256 Encryption
Sandbox volumes encrypted at rest using Daytona's AES-256 provider-managed keys.
🎭
Log Masking
HiveWorkspace installs structlog processor that redacts MEMORY.md and skills/ bodies from Sentry/Langfuse traces.
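A masking processor in structlog's processor shape ((logger, method_name, event_dict) -> event_dict) might look like the sketch below. The redacted key names and path prefixes are illustrative assumptions; the real list is owned by HiveWorkspace.

```python
# Hedged sketch of the log-masking processor. structlog processors are
# plain callables, so this runs without the library installed.
REDACTED_KEYS = {"memory_md", "skill_body"}
REDACTED_PATH_PREFIXES = ("/workspace/hive/skills/",)


def mask_hive_content(logger, method_name, event_dict):
    # Redact known sensitive fields outright.
    for key in list(event_dict):
        if key in REDACTED_KEYS:
            event_dict[key] = "[REDACTED]"
    # Drop file bodies logged alongside a skills/ path.
    path = event_dict.get("path", "")
    if any(path.startswith(p) for p in REDACTED_PATH_PREFIXES):
        event_dict.pop("content", None)
        event_dict["content_redacted"] = True
    return event_dict
```

Registered via structlog.configure(processors=[..., mask_hive_content, ...]) ahead of the Sentry/Langfuse exporters, so traces never see MEMORY.md or skill bodies.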

File Memory Layer

A markdown-based procedural knowledge layer that complements the existing pgvector MemoryService. Files are the source of truth for procedural knowledge; the database remains authoritative for declarative facts.

🗄️
MemoryService (DB)
Unchanged · pgvector
Single declarative facts
"The user prefers Y"
Embedding + importance score
Mutable row + event log
6 memory types (preference, pattern, fact, procedure*, feedback, episodic)
*procedure writes frozen in v1
coexist by function
📝
File Memory Layer
New · Markdown files
Multi-step procedures
"How to do X given preferences Y"
Reusable Python helpers
Revisable documents
Trajectory logs + skill playbooks
Skills cite DB memories by mem:<id>

Filesystem Layout

/workspace/hive/
📋
MEMORY.md
Rendered projection of DB memories + agent scratchpad. Regenerated at task start. Agent may only edit the AGENT_SCRATCHPAD block.
📁
memory/
Daily trajectory logs
YYYY-MM-DD.md — Today's log (append-only)
YYYY-MM-DD.N.md — Roll suffix on 5MB overflow
archive/YYYY-MM.tar.gz — Nightly compression (>30 days)
index.md — Auto-maintained 30-day index
📁
skills/
Procedural playbooks authored by Workers
<slug>.md — Playbook with YAML frontmatter
<slug>.py — Optional executable helper
.archive/ — Skills over the 100-skill soft cap
⚙️
.hive.toml
Metadata: hive_id, account_id, schema_version, last_projected_at

MEMORY.md — Rendered Projection

1. Read Scratchpad: preserve the existing AGENT_SCRATCHPAD block from the current MEMORY.md
2. Fetch Memories: isPermanent=true + top-k weighted memories via MemoryRepository.search_memories
3. Group & Sort: group by memoryType, sort by weighted_score, emit stable [mem:<id>] citations
4. Atomic Write: tmp + rename into the sandbox mount; cache in Redis with TTL 60s
MEMORY.md
---
hive_id: <uuid>
generated_at: 2026-04-21T14:03:12Z
generator: bees.MarkdownMemoryProjector/v1
---
# Hive Memory
## Who you're working with
- Beekeeper name: Alice
- Role: Head of Marketing
## Preferences (top 10 by weighted score)
- Prefers bullet points over prose [mem:pref-a1b2 · 0.9]
- Never mentions competitors by name [mem:pref-c3d4 · 0.8]
## Recent patterns
- Schedules LinkedIn posts 9-10am ET [mem:ptn-7g8h]
<!-- AGENT_SCRATCHPAD:START -->
<!-- Agent may edit ONLY this block -->
<!-- AGENT_SCRATCHPAD:END -->
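Steps 1 and 4 of the projection can be sketched as below. The marker strings follow the sample above; function names and the temp-file strategy are assumptions about the projector, not its actual implementation.

```python
# Hedged sketch of scratchpad preservation (step 1) and the atomic
# tmp + rename write (step 4) from the projection pipeline.
import os
import re
import tempfile

SCRATCH_RE = re.compile(
    r"<!-- AGENT_SCRATCHPAD:START -->.*?<!-- AGENT_SCRATCHPAD:END -->",
    re.DOTALL,
)
EMPTY_SCRATCHPAD = (
    "<!-- AGENT_SCRATCHPAD:START -->\n"
    "<!-- Agent may edit ONLY this block -->\n"
    "<!-- AGENT_SCRATCHPAD:END -->"
)


def read_scratchpad(path):
    """Step 1: carry forward the existing scratchpad block, if any."""
    try:
        with open(path, encoding="utf-8") as f:
            m = SCRATCH_RE.search(f.read())
            return m.group(0) if m else EMPTY_SCRATCHPAD
    except FileNotFoundError:
        return EMPTY_SCRATCHPAD


def write_projection(path, rendered_body, scratchpad):
    """Step 4: tmp + rename so readers never see a half-written file."""
    content = rendered_body.rstrip("\n") + "\n" + scratchpad + "\n"
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(content)
        os.replace(tmp, path)  # atomic on POSIX
    finally:
        if os.path.exists(tmp):
            os.remove(tmp)
    return content
```

os.replace gives the atomicity guarantee: a Worker reading MEMORY.md mid-projection sees either the old file or the new one, never a partial write.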

Memory Event Log — Migration Readiness

An append-only bees.memory_events table ships with v1 as migration-readiness infrastructure. Zero behavior change for readers — but enables future replay and file-based migration.

created: a new memory row is upserted
updated: an existing row receives a second write
superseded: an old memory is replaced by a new one
tombstoned: a memory is soft-deleted
PII Protection: before_content / after_content columns are application-layer encrypted via KMS envelope key. Log masking processor blocks them from Sentry/Langfuse. DSAR cascade purge on memory deletion.

File Ownership Invariants

Path · Reader · Writer
MEMORY.md (above scratchpad) · Worker (RO in prompt) · MarkdownMemoryProjector only
MEMORY.md scratchpad · Worker (RO in prompt) · Worker via SandboxFilesTool
memory/<today>.md · Worker (RO excerpt) · HiveWorkspace.append_trajectory only
memory/<past>.md · Worker (RO via files tool) · nobody (immutable)
skills/<slug>.md · Worker (RO via prompt) · SkillLibraryService only

Three-Tier HealingSupervisor

The HealingSupervisor wraps Worker execution with a three-tier escalation strategy. It sits above — and never replaces — the existing CircuitBreaker, tool-retry, and confidence-escalation systems.

Tier 1
Mechanical Rule Table
Zero LLM cost
Pure deterministic rules. First match wins. Table-driven and fully testable. No LLM calls.
tool_timeout (1st)
Double the tool's timeout
rate_limit
Backoff min(60, 2^attempts)s
schema_invalid
Append "Return ONLY valid JSON..."
empty_result + fallback
Substitute fallback tool
circuit_open + multi-provider
Switch provider
LLM transient 5xx
Single immediate retry
if Tier 1 fails + semantic/schema/repeated_tool_error
Tier 2
Single Reflexion Pass
~$0.001 (1 Haiku call)
One Haiku call per task. Reads the failure trajectory and produces a ≤120-word corrective note for the next attempt. Hard cap: 1 reflection per task.
Failure Trajectory
Haiku Self-Reflection
≤120-word corrective note
Prepended to next attempt as <reflection>
Every Tier-2 reflection is written to bees.memory_events for audit trail (Incident Response Policy §Documentation).
if Tier 1 + Tier 2 both fail
Tier 3
Failure Note
Feeds future tasks
Appends a structured failure note to today's trajectory log AND creates a fact-type memory in pgvector so future tasks can retrieve it.
📝
Append to memory/<today>.md with signature, tried approach, reflection, root cause, and guidance
🗄️
Emit memory_events row + insert into bees.memories as fact-type for future pgvector retrieval
⚠️
Task returns failure to user. Existing low-confidence escalation path remains intact.
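The three-tier escalation above can be sketched as a table-driven loop. Rule names mirror the decision table; the failure-dict shape, the reflect() hook (standing in for the Haiku call), and the failure-note hook are all assumptions for illustration.

```python
# Hedged sketch of the HealingSupervisor escalation. Tier 1 is a
# first-match-wins rule table; Tier 2 is hard-capped at one reflection
# per task; Tier 3 writes the failure note.
HEALING_RULES = [
    # (predicate on failure dict, action name)
    (lambda f: f["kind"] == "tool_timeout" and f["attempt"] == 1, "double_timeout"),
    (lambda f: f["kind"] == "rate_limit", "backoff"),
    (lambda f: f["kind"] == "schema_invalid" and f["attempt"] == 1, "stricten_prompt"),
]


def backoff_seconds(attempts):
    """Tier-1 rate_limit rule from the spec: min(60, 2^attempts) seconds."""
    return min(60, 2 ** attempts)


def tier1_action(failure):
    """First matching rule wins; None means escalate to Tier 2."""
    for predicate, action in HEALING_RULES:
        if predicate(failure):
            return action
    return None


def heal(failure, reflect, write_failure_note, reflections_used=0):
    action = tier1_action(failure)
    if action is not None:
        return ("tier1", action)
    if reflections_used < 1:  # hard cap: one Haiku reflection per task
        # reflect() stands in for the Haiku call producing the
        # <=120-word corrective note (cap enforced by the prompt).
        return ("tier2", reflect(failure))
    write_failure_note(failure)  # trajectory log + fact-type memory
    return ("tier3", None)
```

Keeping Tier 1 as pure data (predicate, action) is what makes the spec's "100% rule-table coverage" target cheap: each rule gets one table-driven unit test.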

Failure Decision Table

Failure Signal → Healing Path
Tool HTTP timeout, first time: Tier 1 fires (tool_timeout_bump)
Tool HTTP 429: Tier 1 fires (rate_limit_backoff)
Invalid JSON, 1st time: Tier 1 fires (schema_invalid_stricten)
Invalid JSON, 2nd time after stricten: Tier 2 fires; Tier 3 possible
Empty result, no fallback: Tier 2 fires (semantic); Tier 3 possible
Same tool fails 3x with different args: Tier 2 fires (repeated_tool_error); Tier 3 possible
Circuit open, no alternate: Tier 3 fires
Worker confidence < 0.3: Tier 2 fires (semantic); Tier 3 possible

Observability

bees.healing.tier1.count
Per-hive and global Tier-1 rule fires
bees.healing.tier2.count
Reflexion invocations + Haiku cost
bees.healing.tier3.count
Failure notes written to memory
heal_rate_by_class
Success rate per failure classification
Langfuse spans: healing.tier1.rule_fired · healing.tier2.reflection · healing.tier3.failure_note

Voyager-Style Skill Library

Workers autonomously author markdown playbooks on successful tasks and self-improve them on re-use. No RL, no fine-tuning — just structured knowledge accumulation through experience.

Skill File Format

skills/draft_linkedin_post.md
---
slug: draft_linkedin_post
title: Draft a LinkedIn post in Alice's voice
version: 3
triggers:
- "linkedin post"
- "draft a post for linkedin"
tool_dependencies: [linkedin.publish, sandbox.python]
memory_citations: [pref-c3d4, fact-e5f6]
success_criteria:
- post length 800-1200 chars
- no competitor mentions
network_egress: false
credential_access: []
---
# Draft LinkedIn Post
## When to use
When user requests a LinkedIn post...
## Steps
1. Fetch user's voice preferences...
2. Draft with sanitized rich-text...
## Known pitfalls
- Rich-text </> causes 400 errors
## Revision history
- v3: No hashtags (user feedback)
- v2: Added sanitization step
- v1: Initial authoring
🏷️
YAML Frontmatter
Slug, title, version, triggers, tool dependencies, memory citations, success criteria, security flags
📖
Markdown Body
When to use, step-by-step procedure, known pitfalls, revision history
🐍
Optional .py Helper
Pure computation only. AST-linted. Runs inside sandbox. No network, no filesystem writes outside /tmp/
🔢
Version Tracking
Every revision bumps version, updates file_hash, preserves history in ## Revision history section

Skill Authoring Flow

Task succeeds
Gate 1
Heuristics (free)
confidence ≥ 0.75
tool_call_count ≥ 4
duration ≥ 5s
no existing skill matched
skill count < 100
↓ passes
Gate 2
Haiku Y/N (~$0.0005)
"Is this a reusable pattern the agent should remember as a skill? Answer Y/N plus a one-line justification."
↓ Y
Author
Sonnet call (~$0.01)
Produces frontmatter YAML + body + optional .py. Atomic write to /workspace/hive/skills/. Runs as post-task Dramatiq job — zero user-visible latency impact.
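Gate 1's free heuristic check is simple enough to sketch directly. The thresholds come from the list above; the outcome-dict field names are assumptions about the task result shape.

```python
# Hedged sketch of the Gate-1 heuristics: all five conditions must hold
# before the ~$0.0005 Haiku Y/N call (Gate 2) is even attempted.
def passes_gate1(outcome, existing_skill_matched, skill_count):
    return (
        outcome["confidence"] >= 0.75
        and outcome["tool_call_count"] >= 4
        and outcome["duration_s"] >= 5
        and not existing_skill_matched
        and skill_count < 100          # library soft cap
    )
```

Because Gate 1 is pure arithmetic on data already in hand, it costs nothing and filters out trivial tasks before any LLM spend.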

Skill Revision Triggers

🔴
Force Revise
Tier-3 failure note mentioned this skill
🟡
Candidate Revise
Worker trajectory deviated from skill steps but succeeded
🟡
Candidate Revise
Skill's memory_citations contain superseded IDs since last revision
🟢
Eligibility Only
≥ 3 uses since last revision (not a trigger alone)
Rate limit: ≤ 1 revision per skill per 24h. Revision uses Sonnet with unified diff. Old version preserved in ## Revision history.
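The trigger priority and 24h rate limit can be sketched as one decision function. The signal names and the last_revised_s timestamp field are illustrative assumptions; times are epoch seconds.

```python
# Hedged sketch of the revision-trigger decision: the 24h rate limit is
# checked first, then force triggers beat candidate triggers.
DAY_S = 24 * 3600


def revision_decision(skill, signals, now_s):
    """Return 'force', 'candidate', or None."""
    if now_s - skill.get("last_revised_s", 0) < DAY_S:
        return None  # <= 1 revision per skill per 24h
    if signals.get("tier3_note_mentions_skill"):
        return "force"
    if signals.get("deviated_but_succeeded") or signals.get("stale_citations"):
        return "candidate"
    return None
```

Note that use count (the green "eligibility only" item) deliberately does not appear here: it gates revision but never triggers one on its own.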

Skill .py Safety — AST Linter

🚫 Reject List
import subprocess
import socket
import requests / urllib
eval / exec / os.system
__import__
random.* for token/OTP paths
Writes outside /tmp/ and /workspace/hive/skill_scratch/
✅ Allow List
json, re, datetime
typing, math, statistics
html, urllib.parse
secrets, hmac, hashlib
Linter reject → skill authored as markdown only; .py discarded with a note in the body.
Linter Governance: Allow/reject lists live in skill_linter_rules.py. Changes require (a) Change Management RFC, (b) security review by CODEOWNERS, (c) version bump of BEES_SKILL_LINTER_VERSION.
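The linter's core walk can be sketched with the standard-library ast module. The sets below are a subset of the lists above for illustration; the authoritative rules live in skill_linter_rules.py.

```python
# Hedged sketch of the skill .py AST linter: reject banned imports and
# calls, and treat anything outside the allow list as a violation.
import ast

REJECT_IMPORTS = {"subprocess", "socket", "requests", "urllib.request"}
REJECT_CALLS = {"eval", "exec", "__import__"}
ALLOW_IMPORTS = {"json", "re", "datetime", "typing", "math", "statistics",
                 "html", "urllib.parse", "secrets", "hmac", "hashlib"}


def lint_skill_source(source):
    """Return a list of violations; an empty list means the .py is accepted."""
    violations = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name in REJECT_IMPORTS or alias.name not in ALLOW_IMPORTS:
                    violations.append(f"import {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            mod = node.module or ""
            if mod in REJECT_IMPORTS or mod not in ALLOW_IMPORTS:
                violations.append(f"from {mod} import ...")
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name) and func.id in REJECT_CALLS:
                violations.append(f"call {func.id}")
            if (isinstance(func, ast.Attribute) and func.attr == "system"
                    and isinstance(func.value, ast.Name) and func.value.id == "os"):
                violations.append("call os.system")
    return violations
```

Default-deny is the design choice worth noting: an import that is neither rejected nor allowed still fails, so new stdlib modules must be explicitly added via the RFC process.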

Three-Run Illustration

Run 1
Cold Hive
Worker drafts LinkedIn post; tool returns 400 (rich-text)
Tier 1: no match. Tier 2 Haiku reflects: "sanitize </>"
Retry succeeds. MemoryService writes feedback memory
maybe_author Gate 1+2 pass → Sonnet authors skills/draft_linkedin_post.{md,py}
Run 2
Next Day
Sandbox archived → unarchive (12s) → volume re-attaches
search() returns skill with similarity 0.82
Skill body injected; skill_draft_linkedin_post runs in sandbox
~40% faster · ~30% fewer tokens · Zero Tier-2
Run 3
User Feedback
User: "too many hashtags" → feedback memory superseded
Worker deviates from skill step 3 (no hashtags)
maybe_revise triggers (deviation + success + stale citation)
Sonnet produces diff → v1 → v2 with "No hashtags" pitfall added

End-to-End Per-Task Sequence

The complete flow for a single task with all four feature flags enabled. Post-task jobs run asynchronously via Dramatiq — zero impact on user-visible latency.

📨
POST /api/bees/hives/{id}/prompt
User prompt arrives
👑
QueenOrchestrator.execute()
Haiku classify → Kimi K2.5 plan → dispatch SwarmTasks via Dramatiq
🛡️
HealingSupervisor.run(task) NEW
Wraps the entire Worker execution with three-tier healing
Parallel setup
📦
BeesSandboxService.get_or_start() NEW
Redis-locked, idempotent sandbox acquisition
🧠
ContextManager.build_extended() EXTENDED
DB memories + MEMORY.md render + today's log excerpt + candidate_skills
🐝
Worker Tool Loop (inside sandbox)
bees-native tools + sandbox tools + skill_<slug> tools · trajectory buffer captures all calls
On failure
Tier 1 → Tier 2 → Tier 3
↓ on success
Post-task Dramatiq jobs (async — NOT inline)
a MemoryService.extract_memories_from_task() (existing)
b SkillLibraryService.maybe_author(task, outcome) NEW
c SkillLibraryService.maybe_revise(...) if skill used NEW
↓ always
📋
HiveWorkspace.append_trajectory() NEW
Write task entry to memory/YYYY-MM-DD.md

Feature Flag Behavior Matrix

All 16 flag combinations produce a valid, non-crashing system. Each feature degrades gracefully when its flag is off.

Flag (off) · Effect
sandbox_enabled = false: get_or_start returns an InProcessSandbox stub. Sandbox-requiring tools raise SandboxDisabledError; non-sandbox tools keep working.
file_memory_enabled = false: Projector not called; daily logs not written; SkillLibraryService.search returns []. DB memory path unchanged. Event log still writes (always safe).
self_healing_enabled = false: HealingSupervisor degenerates to single attempt + logging. Circuit breaker, tool retry, confidence escalation still active.
skill_authoring_enabled = false: maybe_author returns early. maybe_revise returns early. Pre-authored skills still served by search (read-only mode).

Cost Model (per 1,000 tasks, all flags on)

Skill-author Sonnet
~$5.00
Sandbox CPU/RAM
~$2.00
Skill-revision Sonnet
~$1.00
Tier-2 Reflexion (Haiku)
~$0.30
Skill-author Gate 2 (Haiku)
~$0.20
Memory projector embeddings
~$0.20
Sandbox volume storage
~$0.10
Total delta ~$8.80 / 1k tasks

Rollout Plan & Testing

Four sequenced PRs, each behind its own feature flag. Dev → staging → prod with 48h dashboard watch between each environment.

4-PR Split Plan (mandatory — 1,000 LOC limit)

PR 1
Database Migrations
bees.memory_events, canonical_statement, preferred_filepath, skill_credential_grants, hive_skills, hive flags + sandbox_metadata. Zero behavior change; backfill + replay tests.
~450 LOC
PR 2
Sandbox Infrastructure
BeesSandboxService + BeesSandboxToolsBase + sandbox tools + HiveWorkspace. Behind sandbox_enabled flag only.
~550 LOC
PR 3
Memory + Healing
MarkdownMemoryProjector + ContextManager.build_extended + daily-log writer + HealingSupervisor + healing rule table + Tier-2 reflection audit.
~500 LOC
PR 4
Skill Library
SkillLibraryService + SkillExecutorTool + skill linter + authoring/revision jobs + quarantine watchdog + skill_credential_grants wiring + new API endpoints.
~550 LOC
Dependencies: PRs 2-4 depend on 1 · PR 3 depends on 2 · PR 4 depends on 2 and 3

Per-Flag Rollout Order

1
memory_events + canonical_statement
Unflagged (pure additive DB work). Watch one week.
2
sandbox_enabled = true
Internal NeuralArc testing hive. One week.
3
file_memory_enabled = true
Internal hive. Second week.
4
self_healing_enabled = true
Internal hive. Third week.
5
skill_authoring_enabled = true
Internal hive. Fourth week.
6
Alpha cohort (5 external opt-in hives)
Repeat steps 2-5 at 72h cadence.
7
Gradual rollout
10% → 50% → 100% via account_id-hash bucket.

Testing Strategy

pytest -m unit
Unit Tests
HealingSupervisor tier transitions
HEALING_RULES table-driven (one per rule)
MarkdownMemoryProjector.render stability
SkillLibraryService Gate 1 heuristics
Skill .py AST linter (20+ fixtures)
MemoryEventLog write coverage
pytest -m integration
Integration Tests
memory_events writes against real MemoryService
canonical_statement backfill (100-row dataset)
hive_skills reconciler drift detection
Feature-flag 16-combination smoke matrix
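Generating the 16-combination matrix is a one-liner with itertools.product; a sketch, with flag names taken from the behavior matrix above:

```python
# Hedged sketch of the feature-flag smoke matrix: every on/off
# combination of the four flags, as dicts ready for parametrization.
import itertools

FLAGS = ("sandbox_enabled", "file_memory_enabled",
         "self_healing_enabled", "skill_authoring_enabled")


def flag_matrix():
    return [dict(zip(FLAGS, combo))
            for combo in itertools.product((False, True), repeat=len(FLAGS))]
```

In a pytest suite this would typically feed @pytest.mark.parametrize("flags", flag_matrix()) so each combination gets its own test case.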
pytest -m sandbox
Sandbox Tests
get_or_start full lifecycle
2x concurrent Dramatiq jobs on same hive
SkillExecutorTool timeout + output-cap
Credential injection scoping
pytest -m e2e
End-to-End
Three-run story (§10.2) as single e2e test
pytest -m property
Property Tests (Hypothesis)
append_trajectory is idempotent
render is deterministic for fixed inputs
memory_events replay reconstructs state
Coverage
Coverage Targets
New code ≥ 80% line coverage
HealingSupervisor rule table: 100%

Success Metrics (v1 post-rollout)

🛡️
Healing Effectiveness
≥ 40% of Tier-1 eligible failures self-heal at Tier 1
≥ 20% of Tier-2 semantic failures heal at Tier 2
Total user-visible failure rate drops ≥ 15%
📚
Skill Uptake
≥ 30% of hives with ≥ 20 tasks have ≥ 1 skill within 14 days
Median success-after-use rate ≥ 85%
🧠
Memory Continuity
0 regressions on existing MemoryService retrieval quality
A/B against baseline on 100-prompt golden set
💰
Cost & Latency
Per-task incremental cost ≤ $0.015 p95
p95 user-visible latency change < +5%

Open Risks

R1
Daytona SPOF
All sandboxed features degrade together on outage
get_or_start health check auto-flips per-hive flag off after 60s of 5xx; graceful degrade to in-process tools
R2
Skill Authoring Noise
Low-value skills clutter library
Gate 2 Haiku + "existing skill matches?" check + quarantine watchdog + 100-skill soft cap + DELETE endpoint
R3
Event Log Growth
Unbounded growth of bees.memory_events
Monthly partitions + TTL archive to S3 for > 12 months
R4
PII in Daily Logs
PII leakage via daily logs surfaced back to agent
Hive-scoped isolation, 120-word reflection cap, compliance masking, listed in DSAR export