Design Specification

Sandbox. Memory. Self-Healing. Self-Learning.

Four coordinated capabilities extending the bees domain — giving Worker Bees a safe execution environment, human-readable memory, resilient failure recovery, and autonomous skill authoring.

4 Core Capabilities · 4 Feature Flags · 4 Sequenced PRs · ~$8.80 per 1k Tasks

Four Capabilities, One Integrated System

Each capability ships behind an independent per-hive feature flag. All four are designed to coexist without disrupting existing memory, circuit-breaker, or tool-retry systems.

01

Per-Hive Daytona Sandbox

A persistent execution environment mirroring the main agent's Daytona pattern. Each hive gets its own isolated container with a persistent volume at /workspace/hive/.

2 CPU / 4 GB RAM · 5 GB Disk · 300s Auto-stop
sandbox_enabled
02

Markdown File Memory Layer

A human-readable procedural knowledge layer using MEMORY.md, daily trajectory logs, and a skills library. Complements — never replaces — the existing MemoryService.

MEMORY.md · Daily Logs · Skills Library
file_memory_enabled
03

Three-Tier Self-Healing

A supervisor wrapping Worker execution: mechanical rules → single Reflexion pass → failure note. Hard cost cap of 1 Haiku call per task. Wraps, never replaces, existing resilience systems.

Tier 1: Rules · Tier 2: Reflexion · Tier 3: Note
self_healing_enabled
04

Voyager-Style Skill Library

Workers autonomously author markdown playbooks (plus optional sandboxed Python) on successful tasks and self-improve them on re-use. No RL or fine-tuning required.

Auto-Author · Self-Revise · 100-Skill Cap
skill_authoring_enabled

Architecture Overview

QueenOrchestrator
Haiku classify → Kimi K2.5 plan → dispatch
↓ SwarmTasks via Dramatiq
NEW
HealingSupervisor
Wraps swarm_jobs.run_task · Three-tier retry/reflect/note
↓ Orchestrates
NEW
BeesSandboxService
Daytona lifecycle · per-hive
EXTENDED
ContextManager
build_extended() · skills + memory_md
NEW
SkillLibraryService
search · author · revise · quarantine
↓ Reads/Writes
NEW
HiveWorkspace
MEMORY.md · memory/ · skills/
MemoryService
pgvector · 6 types · unchanged semantics
NEW
MemoryEventLog
bees.memory_events · append-only
Legend: New component · Extended · Unchanged

Explicit Non-Goals (v1)

No replacement of existing MemoryService, CircuitBreaker, or tool-retry
No RL fine-tuning, LoRA training, or prompt-gradient optimization
No multi-sandbox-provider abstraction (Daytona directly coupled)
No UI for skill browsing or editing (APIs only)
No plan-level replanning on failure (Queen's plan unchanged)
No dual-write of memory to files during v1

Per-Hive Daytona Sandbox

Each hive gets exactly one Daytona sandbox with a persistent volume. The sandbox mirrors the main agent's pattern with bees-specific configuration for longer idle times and conversational workloads.

Isolation Hierarchy

Hive
exactly ONE Daytona volume · never shared across hives
📦
Active Sandbox (at most one per hive)
2 CPU · 4 GB RAM · 5 GB disk · auto-stop 300s
📁
/workspace/hive/
MEMORY.md · skills/ · memory/ · .hive.toml

Sandbox Lifecycle

None (no sandbox_metadata) → [create + volume] → Running (active container) → [300s idle] → Stopped (~2-3s resume) → [1800s] → Archived (~10-15s unarchive); get_or_start() resumes or unarchives back to Running.
Volume retained 90 days after last unarchive. Redis lock (TTL 30s) serializes creation — at most one active sandbox per hive.
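The lock-serialized acquisition can be sketched as below. This is an illustrative sketch, not the real BeesSandboxService: the in-memory FakeRedis stands in for Redis SET NX EX so the logic runs anywhere, and the sandbox record shape is an assumption.

```python
# Hedged sketch of Redis-locked, idempotent get_or_start with the auth
# check described in the Security Model. FakeRedis mimics SET NX EX plus
# compare-and-delete; swap in a real Redis client in production.
import time
import uuid

LOCK_TTL_S = 30  # matches the spec's 30s creation lock


class FakeRedis:
    """Minimal in-process stand-in for the Redis lock primitives."""
    def __init__(self):
        self._store = {}

    def set_nx_ex(self, key, value, ttl_s):
        now = time.monotonic()
        cur = self._store.get(key)
        if cur and cur[1] > now:       # key held and not yet expired
            return False
        self._store[key] = (value, now + ttl_s)
        return True

    def delete_if_owner(self, key, value):
        cur = self._store.get(key)
        if cur and cur[0] == value:    # only the lock holder may release
            del self._store[key]


class BeesSandboxService:
    def __init__(self, redis):
        self._redis = redis
        self._sandboxes = {}  # hive_id -> sandbox record

    def get_or_start(self, hive_id, account_id):
        sb = self._sandboxes.get(hive_id)
        if sb:
            # Auth on every call: account_id must match the hive's labels.
            if sb["account_id"] != account_id:
                raise PermissionError("hive/account mismatch")
            return sb
        token = uuid.uuid4().hex
        key = f"bees:sandbox:lock:{hive_id}"
        if not self._redis.set_nx_ex(key, token, LOCK_TTL_S):
            raise RuntimeError("creation already in progress; retry")
        try:
            # Re-check under the lock: another worker may have won earlier.
            sb = self._sandboxes.get(hive_id)
            if sb is None:
                # Real code: create Daytona container + attach volume here.
                sb = {"hive_id": hive_id, "account_id": account_id,
                      "state": "running"}
                self._sandboxes[hive_id] = sb
            return sb
        finally:
            self._redis.delete_if_owner(key, token)
```

Re-checking the registry inside the lock is what makes the call idempotent: two concurrent Dramatiq jobs on the same hive converge on the same sandbox record.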

Sandbox Tools

$_
SandboxShellTool
tmux-backed long-running shell commands
sandbox_enabled
📄
SandboxFilesTool
read / write / list / delete under /workspace/
sandbox_enabled
🐍
SandboxPythonTool
python -c or python <path>; captures stdout/stderr/exit
sandbox_enabled
🌐
SandboxBrowserTool
Playwright: navigate, wait, screenshot, extract_text, click
sandbox_enabled + browser role
SkillExecutorTool
runs skills/<slug>.py with JSON args · 60s timeout · 1MB output cap
skill_authoring_enabled

Security Model

🔐
Auth on Every Call
Labels + account_id check on every get_or_start. Mismatch raises PermissionError.
🏖️
Sandboxed Python
Agent-authored skills/*.py run ONLY inside sandbox — never in the backend process.
🔑
Credential Safety
Credentials fetched per-tool-call from credential-profile lookup. Never written to sandbox filesystem.
🚫
Default-Deny Egress
Inherits main agent's default-deny egress policy. New destinations require Change Management RFC.
🔒
AES-256 Encryption
Sandbox volumes encrypted at rest using Daytona's AES-256 provider-managed keys.
🎭
Log Masking
HiveWorkspace installs structlog processor that redacts MEMORY.md and skills/ bodies from Sentry/Langfuse traces.
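A masking processor in structlog's processor shape ((logger, method_name, event_dict) -> event_dict) might look like the sketch below. The redacted key names and path prefixes are illustrative assumptions; the real list is owned by HiveWorkspace.

```python
# Hedged sketch of the log-masking processor. structlog processors are
# plain callables, so this runs without the library installed.
REDACTED_KEYS = {"memory_md", "skill_body"}
REDACTED_PATH_PREFIXES = ("/workspace/hive/skills/",)


def mask_hive_content(logger, method_name, event_dict):
    # Redact known sensitive fields outright.
    for key in list(event_dict):
        if key in REDACTED_KEYS:
            event_dict[key] = "[REDACTED]"
    # Drop file bodies logged alongside a skills/ path.
    path = event_dict.get("path", "")
    if any(path.startswith(p) for p in REDACTED_PATH_PREFIXES):
        event_dict.pop("content", None)
        event_dict["content_redacted"] = True
    return event_dict
```

Registered via structlog.configure(processors=[..., mask_hive_content, ...]) ahead of the Sentry/Langfuse exporters, so traces never see MEMORY.md or skill bodies.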

File Memory Layer

A markdown-based procedural knowledge layer that complements the existing pgvector MemoryService. Files are the source of truth for procedural knowledge; the database remains authoritative for declarative facts.

🗄️
MemoryService (DB)
Unchanged · pgvector
Single declarative facts
"The user prefers Y"
Embedding + importance score
Mutable row + event log
6 memory types (preference, pattern, fact, procedure*, feedback, episodic)
*procedure writes frozen in v1
coexist by function
📝
File Memory Layer
New · Markdown files
Multi-step procedures
"How to do X given preferences Y"
Reusable Python helpers
Revisable documents
Trajectory logs + skill playbooks
Skills cite DB memories by mem:<id>

Filesystem Layout

/workspace/hive/
📋
MEMORY.md
Rendered projection of DB memories + agent scratchpad. Regenerated at task start. Agent may only edit the AGENT_SCRATCHPAD block.
📁
memory/
Daily trajectory logs
YYYY-MM-DD.md — Today's log (append-only)
YYYY-MM-DD.N.md — Roll suffix on 5MB overflow
archive/YYYY-MM.tar.gz — Nightly compression (>30 days)
index.md — Auto-maintained 30-day index
📁
skills/
Procedural playbooks authored by Workers
<slug>.md — Playbook with YAML frontmatter
<slug>.py — Optional executable helper
.archive/ — Skills over the 100-skill soft cap
⚙️
.hive.toml
Metadata: hive_id, account_id, schema_version, last_projected_at

MEMORY.md — Rendered Projection

1. Read Scratchpad: preserve the existing AGENT_SCRATCHPAD block from the current MEMORY.md
2. Fetch Memories: isPermanent=true + top-k weighted memories via MemoryRepository.search_memories
3. Group & Sort: group by memoryType, sort by weighted_score, emit stable [mem:<id>] citations
4. Atomic Write: tmp + rename into the sandbox mount; cache in Redis with TTL 60s
MEMORY.md
---
hive_id: <uuid>
generated_at: 2026-04-21T14:03:12Z
generator: bees.MarkdownMemoryProjector/v1
---
# Hive Memory
## Who you're working with
- Beekeeper name: Alice
- Role: Head of Marketing
## Preferences (top 10 by weighted score)
- Prefers bullet points over prose [mem:pref-a1b2 · 0.9]
- Never mentions competitors by name [mem:pref-c3d4 · 0.8]
## Recent patterns
- Schedules LinkedIn posts 9-10am ET [mem:ptn-7g8h]
<!-- AGENT_SCRATCHPAD:START -->
<!-- Agent may edit ONLY this block -->
<!-- AGENT_SCRATCHPAD:END -->
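Steps 1 and 4 of the projection can be sketched as below. The marker strings follow the sample above; function names and the temp-file strategy are assumptions about the projector, not its actual implementation.

```python
# Hedged sketch of scratchpad preservation (step 1) and the atomic
# tmp + rename write (step 4) from the projection pipeline.
import os
import re
import tempfile

SCRATCH_RE = re.compile(
    r"<!-- AGENT_SCRATCHPAD:START -->.*?<!-- AGENT_SCRATCHPAD:END -->",
    re.DOTALL,
)
EMPTY_SCRATCHPAD = (
    "<!-- AGENT_SCRATCHPAD:START -->\n"
    "<!-- Agent may edit ONLY this block -->\n"
    "<!-- AGENT_SCRATCHPAD:END -->"
)


def read_scratchpad(path):
    """Step 1: carry forward the existing scratchpad block, if any."""
    try:
        with open(path, encoding="utf-8") as f:
            m = SCRATCH_RE.search(f.read())
            return m.group(0) if m else EMPTY_SCRATCHPAD
    except FileNotFoundError:
        return EMPTY_SCRATCHPAD


def write_projection(path, rendered_body, scratchpad):
    """Step 4: tmp + rename so readers never see a half-written file."""
    content = rendered_body.rstrip("\n") + "\n" + scratchpad + "\n"
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(content)
        os.replace(tmp, path)  # atomic on POSIX
    finally:
        if os.path.exists(tmp):
            os.remove(tmp)
    return content
```

os.replace gives the atomicity guarantee: a Worker reading MEMORY.md mid-projection sees either the old file or the new one, never a partial write.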

Memory Event Log — Migration Readiness

An append-only bees.memory_events table ships with v1 as migration-readiness infrastructure. Zero behavior change for readers — but enables future replay and file-based migration.

created: a new memory row is upserted
updated: an existing row receives a second write
superseded: an old memory is replaced by a new one
tombstoned: a memory is soft-deleted
PII Protection: before_content / after_content columns are application-layer encrypted via KMS envelope key. Log masking processor blocks them from Sentry/Langfuse. DSAR cascade purge on memory deletion.

File Ownership Invariants

Path · Reader · Writer
MEMORY.md (above scratchpad) · Worker (RO in prompt) · MarkdownMemoryProjector only
MEMORY.md scratchpad · Worker (RO in prompt) · Worker via SandboxFilesTool
memory/<today>.md · Worker (RO excerpt) · HiveWorkspace.append_trajectory only
memory/<past>.md · Worker (RO via files tool) · nobody (immutable)
skills/<slug>.md · Worker (RO via prompt) · SkillLibraryService only

Three-Tier HealingSupervisor

The HealingSupervisor wraps Worker execution with a three-tier escalation strategy. It sits above — and never replaces — the existing CircuitBreaker, tool-retry, and confidence-escalation systems.

Tier 1
Mechanical Rule Table
Zero LLM cost
Pure deterministic rules. First match wins. Table-driven and fully testable. No LLM calls.
tool_timeout (1st)
Double the tool's timeout
rate_limit
Backoff min(60, 2^attempts)s
schema_invalid
Append "Return ONLY valid JSON..."
empty_result + fallback
Substitute fallback tool
circuit_open + multi-provider
Switch provider
LLM transient 5xx
Single immediate retry
if Tier 1 fails + semantic/schema/repeated_tool_error
Tier 2
Single Reflexion Pass
~$0.001 (1 Haiku call)
One Haiku call per task. Reads the failure trajectory and produces a ≤120-word corrective note for the next attempt. Hard cap: 1 reflection per task.
Failure Trajectory
Haiku Self-Reflection
≤120-word corrective note
Prepended to next attempt as <reflection>
Every Tier-2 reflection is written to bees.memory_events for audit trail (Incident Response Policy §Documentation).
if Tier 1 + Tier 2 both fail
Tier 3
Failure Note
Feeds future tasks
Appends a structured failure note to today's trajectory log AND creates a fact-type memory in pgvector so future tasks can retrieve it.
📝
Append to memory/<today>.md with signature, tried approach, reflection, root cause, and guidance
🗄️
Emit memory_events row + insert into bees.memories as fact-type for future pgvector retrieval
⚠️
Task returns failure to user. Existing low-confidence escalation path remains intact.
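The three-tier escalation above can be sketched as a table-driven loop. Rule names mirror the decision table; the failure-dict shape, the reflect() hook (standing in for the Haiku call), and the failure-note hook are all assumptions for illustration.

```python
# Hedged sketch of the HealingSupervisor escalation. Tier 1 is a
# first-match-wins rule table; Tier 2 is hard-capped at one reflection
# per task; Tier 3 writes the failure note.
HEALING_RULES = [
    # (predicate on failure dict, action name)
    (lambda f: f["kind"] == "tool_timeout" and f["attempt"] == 1, "double_timeout"),
    (lambda f: f["kind"] == "rate_limit", "backoff"),
    (lambda f: f["kind"] == "schema_invalid" and f["attempt"] == 1, "stricten_prompt"),
]


def backoff_seconds(attempts):
    """Tier-1 rate_limit rule from the spec: min(60, 2^attempts) seconds."""
    return min(60, 2 ** attempts)


def tier1_action(failure):
    """First matching rule wins; None means escalate to Tier 2."""
    for predicate, action in HEALING_RULES:
        if predicate(failure):
            return action
    return None


def heal(failure, reflect, write_failure_note, reflections_used=0):
    action = tier1_action(failure)
    if action is not None:
        return ("tier1", action)
    if reflections_used < 1:  # hard cap: one Haiku reflection per task
        # reflect() stands in for the Haiku call producing the
        # <=120-word corrective note (cap enforced by the prompt).
        return ("tier2", reflect(failure))
    write_failure_note(failure)  # trajectory log + fact-type memory
    return ("tier3", None)
```

Keeping Tier 1 as pure data (predicate, action) is what makes the spec's "100% rule-table coverage" target cheap: each rule gets one table-driven unit test.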

Failure Decision Table

Failure Signal → Healing Path
Tool HTTP timeout, first time: Tier 1 fires (tool_timeout_bump)
Tool HTTP 429: Tier 1 fires (rate_limit_backoff)
Invalid JSON, 1st time: Tier 1 fires (schema_invalid_stricten)
Invalid JSON, 2nd time after stricten: Tier 2 fires; Tier 3 possible
Empty result, no fallback: Tier 2 fires (semantic); Tier 3 possible
Same tool fails 3x with different args: Tier 2 fires (repeated_tool_error); Tier 3 possible
Circuit open, no alternate: Tier 3 fires
Worker confidence < 0.3: Tier 2 fires (semantic); Tier 3 possible

Observability

bees.healing.tier1.count
Per-hive and global Tier-1 rule fires
bees.healing.tier2.count
Reflexion invocations + Haiku cost
bees.healing.tier3.count
Failure notes written to memory
heal_rate_by_class
Success rate per failure classification
Langfuse spans: healing.tier1.rule_fired · healing.tier2.reflection · healing.tier3.failure_note

Voyager-Style Skill Library

Workers autonomously author markdown playbooks on successful tasks and self-improve them on re-use. No RL, no fine-tuning — just structured knowledge accumulation through experience.

Skill File Format

skills/draft_linkedin_post.md
---
slug: draft_linkedin_post
title: Draft a LinkedIn post in Alice's voice
version: 3
triggers:
- "linkedin post"
- "draft a post for linkedin"
tool_dependencies: [linkedin.publish, sandbox.python]
memory_citations: [pref-c3d4, fact-e5f6]
success_criteria:
- post length 800-1200 chars
- no competitor mentions
network_egress: false
credential_access: []
---
# Draft LinkedIn Post
## When to use
When user requests a LinkedIn post...
## Steps
1. Fetch user's voice preferences...
2. Draft with sanitized rich-text...
## Known pitfalls
- Rich-text </> causes 400 errors
## Revision history
- v3: No hashtags (user feedback)
- v2: Added sanitization step
- v1: Initial authoring
🏷️
YAML Frontmatter
Slug, title, version, triggers, tool dependencies, memory citations, success criteria, security flags
📖
Markdown Body
When to use, step-by-step procedure, known pitfalls, revision history
🐍
Optional .py Helper
Pure computation only. AST-linted. Runs inside sandbox. No network, no filesystem writes outside /tmp/
🔢
Version Tracking
Every revision bumps version, updates file_hash, preserves history in ## Revision history section

Skill Authoring Flow

Task succeeds
Gate 1
Heuristics (free)
confidence ≥ 0.75
tool_call_count ≥ 4
duration ≥ 5s
no existing skill matched
skill count < 100
↓ passes
Gate 2
Haiku Y/N (~$0.0005)
"Is this a reusable pattern the agent should remember as a skill? Answer Y/N plus a one-line justification."
↓ Y
Author
Sonnet call (~$0.01)
Produces frontmatter YAML + body + optional .py. Atomic write to /workspace/hive/skills/. Runs as post-task Dramatiq job — zero user-visible latency impact.
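Gate 1's free heuristic check is simple enough to sketch directly. The thresholds come from the list above; the outcome-dict field names are assumptions about the task result shape.

```python
# Hedged sketch of the Gate-1 heuristics: all five conditions must hold
# before the ~$0.0005 Haiku Y/N call (Gate 2) is even attempted.
def passes_gate1(outcome, existing_skill_matched, skill_count):
    return (
        outcome["confidence"] >= 0.75
        and outcome["tool_call_count"] >= 4
        and outcome["duration_s"] >= 5
        and not existing_skill_matched
        and skill_count < 100          # library soft cap
    )
```

Because Gate 1 is pure arithmetic on data already in hand, it costs nothing and filters out trivial tasks before any LLM spend.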

Skill Revision Triggers

🔴
Force Revise
Tier-3 failure note mentioned this skill
🟡
Candidate Revise
Worker trajectory deviated from skill steps but succeeded
🟡
Candidate Revise
Skill's memory_citations contain superseded IDs since last revision
🟢
Eligibility Only
≥ 3 uses since last revision (not a trigger alone)
Rate limit: ≤ 1 revision per skill per 24h. Revision uses Sonnet with unified diff. Old version preserved in ## Revision history.
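The trigger priority and 24h rate limit can be sketched as one decision function. The signal names and the last_revised_s timestamp field are illustrative assumptions; times are epoch seconds.

```python
# Hedged sketch of the revision-trigger decision: the 24h rate limit is
# checked first, then force triggers beat candidate triggers.
DAY_S = 24 * 3600


def revision_decision(skill, signals, now_s):
    """Return 'force', 'candidate', or None."""
    if now_s - skill.get("last_revised_s", 0) < DAY_S:
        return None  # <= 1 revision per skill per 24h
    if signals.get("tier3_note_mentions_skill"):
        return "force"
    if signals.get("deviated_but_succeeded") or signals.get("stale_citations"):
        return "candidate"
    return None
```

Note that use count (the green "eligibility only" item) deliberately does not appear here: it gates revision but never triggers one on its own.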

Skill .py Safety — AST Linter

🚫 Reject List
import subprocess
import socket
import requests / urllib
eval / exec / os.system
__import__
random.* for token/OTP paths
Writes outside /tmp/ and /workspace/hive/skill_scratch/
✅ Allow List
json, re, datetime
typing, math, statistics
html, urllib.parse
secrets, hmac, hashlib
Linter reject → skill authored as markdown only; .py discarded with a note in the body.
Linter Governance: Allow/reject lists live in skill_linter_rules.py. Changes require (a) Change Management RFC, (b) security review by CODEOWNERS, (c) version bump of BEES_SKILL_LINTER_VERSION.
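The linter's core walk can be sketched with the standard-library ast module. The sets below are a subset of the lists above for illustration; the authoritative rules live in skill_linter_rules.py.

```python
# Hedged sketch of the skill .py AST linter: reject banned imports and
# calls, and treat anything outside the allow list as a violation.
import ast

REJECT_IMPORTS = {"subprocess", "socket", "requests", "urllib.request"}
REJECT_CALLS = {"eval", "exec", "__import__"}
ALLOW_IMPORTS = {"json", "re", "datetime", "typing", "math", "statistics",
                 "html", "urllib.parse", "secrets", "hmac", "hashlib"}


def lint_skill_source(source):
    """Return a list of violations; an empty list means the .py is accepted."""
    violations = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name in REJECT_IMPORTS or alias.name not in ALLOW_IMPORTS:
                    violations.append(f"import {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            mod = node.module or ""
            if mod in REJECT_IMPORTS or mod not in ALLOW_IMPORTS:
                violations.append(f"from {mod} import ...")
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name) and func.id in REJECT_CALLS:
                violations.append(f"call {func.id}")
            if (isinstance(func, ast.Attribute) and func.attr == "system"
                    and isinstance(func.value, ast.Name) and func.value.id == "os"):
                violations.append("call os.system")
    return violations
```

Default-deny is the design choice worth noting: an import that is neither rejected nor allowed still fails, so new stdlib modules must be explicitly added via the RFC process.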

Three-Run Illustration

Run 1
Cold Hive
Worker drafts LinkedIn post; tool returns 400 (rich-text)
Tier 1: no match. Tier 2 Haiku reflects: "sanitize </>"
Retry succeeds. MemoryService writes feedback memory
maybe_author Gate 1+2 pass → Sonnet authors skills/draft_linkedin_post.{md,py}
Run 2
Next Day
Sandbox archived → unarchive (12s) → volume re-attaches
search() returns skill with similarity 0.82
Skill body injected; skill_draft_linkedin_post runs in sandbox
~40% faster · ~30% fewer tokens · Zero Tier-2
Run 3
User Feedback
User: "too many hashtags" → feedback memory superseded
Worker deviates from skill step 3 (no hashtags)
maybe_revise triggers (deviation + success + stale citation)
Sonnet produces diff → v1 → v2 with "No hashtags" pitfall added

End-to-End Per-Task Sequence

The complete flow for a single task with all four feature flags enabled. Post-task jobs run asynchronously via Dramatiq — zero impact on user-visible latency.

📨
POST /api/bees/hives/{id}/prompt
User prompt arrives
👑
QueenOrchestrator.execute()
Haiku classify → Kimi K2.5 plan → dispatch SwarmTasks via Dramatiq
🛡️
HealingSupervisor.run(task) NEW
Wraps the entire Worker execution with three-tier healing
Parallel setup
📦
BeesSandboxService.get_or_start() NEW
Redis-locked, idempotent sandbox acquisition
🧠
ContextManager.build_extended() EXTENDED
DB memories + MEMORY.md render + today's log excerpt + candidate_skills
🐝
Worker Tool Loop (inside sandbox)
bees-native tools + sandbox tools + skill_<slug> tools · trajectory buffer captures all calls
On failure
Tier 1 → Tier 2 → Tier 3
↓ on success
Post-task Dramatiq jobs (async — NOT inline)
a MemoryService.extract_memories_from_task() (existing)
b SkillLibraryService.maybe_author(task, outcome) NEW
c SkillLibraryService.maybe_revise(...) if skill used NEW
↓ always
📋
HiveWorkspace.append_trajectory() NEW
Write task entry to memory/YYYY-MM-DD.md

Feature Flag Behavior Matrix

All 16 flag combinations produce a valid, non-crashing system. Each feature degrades gracefully when its flag is off.

Flag (off) · Effect
sandbox_enabled = false: get_or_start returns an InProcessSandbox stub. Sandbox-requiring tools raise SandboxDisabledError; non-sandbox tools keep working.
file_memory_enabled = false: Projector not called; daily logs not written; SkillLibraryService.search returns []. DB memory path unchanged. Event log still writes (always safe).
self_healing_enabled = false: HealingSupervisor degenerates to single attempt + logging. Circuit breaker, tool retry, confidence escalation still active.
skill_authoring_enabled = false: maybe_author returns early. maybe_revise returns early. Pre-authored skills still served by search (read-only mode).

Cost Model (per 1,000 tasks, all flags on)

Skill-author Sonnet
~$5.00
Sandbox CPU/RAM
~$2.00
Skill-revision Sonnet
~$1.00
Tier-2 Reflexion (Haiku)
~$0.30
Skill-author Gate 2 (Haiku)
~$0.20
Memory projector embeddings
~$0.20
Sandbox volume storage
~$0.10
Total delta ~$8.80 / 1k tasks

Rollout Plan & Testing

Four sequenced PRs, each behind its own feature flag. Dev → staging → prod with 48h dashboard watch between each environment.

4-PR Split Plan (mandatory — 1,000 LOC limit)

PR 1
Database Migrations
bees.memory_events, canonical_statement, preferred_filepath, skill_credential_grants, hive_skills, hive flags + sandbox_metadata. Zero behavior change; backfill + replay tests.
~450 LOC
PR 2
Sandbox Infrastructure
BeesSandboxService + BeesSandboxToolsBase + sandbox tools + HiveWorkspace. Behind sandbox_enabled flag only.
~550 LOC
PR 3
Memory + Healing
MarkdownMemoryProjector + ContextManager.build_extended + daily-log writer + HealingSupervisor + healing rule table + Tier-2 reflection audit.
~500 LOC
PR 4
Skill Library
SkillLibraryService + SkillExecutorTool + skill linter + authoring/revision jobs + quarantine watchdog + skill_credential_grants wiring + new API endpoints.
~550 LOC
Dependencies: PRs 2-4 depend on 1 · PR 3 depends on 2 · PR 4 depends on 2 and 3

Per-Flag Rollout Order

1
memory_events + canonical_statement
Unflagged (pure additive DB work). Watch one week.
2
sandbox_enabled = true
Internal NeuralArc testing hive. One week.
3
file_memory_enabled = true
Internal hive. Second week.
4
self_healing_enabled = true
Internal hive. Third week.
5
skill_authoring_enabled = true
Internal hive. Fourth week.
6
Alpha cohort (5 external opt-in hives)
Repeat steps 2-5 at 72h cadence.
7
Gradual rollout
10% → 50% → 100% via account_id-hash bucket.

Testing Strategy

pytest -m unit
Unit Tests
HealingSupervisor tier transitions
HEALING_RULES table-driven (one per rule)
MarkdownMemoryProjector.render stability
SkillLibraryService Gate 1 heuristics
Skill .py AST linter (20+ fixtures)
MemoryEventLog write coverage
pytest -m integration
Integration Tests
memory_events writes against real MemoryService
canonical_statement backfill (100-row dataset)
hive_skills reconciler drift detection
Feature-flag 16-combination smoke matrix
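Generating the 16-combination matrix is a one-liner with itertools.product; a sketch, with flag names taken from the behavior matrix above:

```python
# Hedged sketch of the feature-flag smoke matrix: every on/off
# combination of the four flags, as dicts ready for parametrization.
import itertools

FLAGS = ("sandbox_enabled", "file_memory_enabled",
         "self_healing_enabled", "skill_authoring_enabled")


def flag_matrix():
    return [dict(zip(FLAGS, combo))
            for combo in itertools.product((False, True), repeat=len(FLAGS))]
```

In a pytest suite this would typically feed @pytest.mark.parametrize("flags", flag_matrix()) so each combination gets its own test case.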
pytest -m sandbox
Sandbox Tests
get_or_start full lifecycle
2x concurrent Dramatiq jobs on same hive
SkillExecutorTool timeout + output-cap
Credential injection scoping
pytest -m e2e
End-to-End
Three-run story (§10.2) as single e2e test
pytest -m property
Property Tests (Hypothesis)
append_trajectory is idempotent
render is deterministic for fixed inputs
memory_events replay reconstructs state
Coverage
Coverage Targets
New code ≥ 80% line coverage
HealingSupervisor rule table: 100%

Success Metrics (v1 post-rollout)

🛡️
Healing Effectiveness
≥ 40% of Tier-1 eligible failures self-heal at Tier 1
≥ 20% of Tier-2 semantic failures heal at Tier 2
Total user-visible failure rate drops ≥ 15%
📚
Skill Uptake
≥ 30% of hives with ≥ 20 tasks have ≥ 1 skill within 14 days
Median success-after-use rate ≥ 85%
🧠
Memory Continuity
0 regressions on existing MemoryService retrieval quality
A/B against baseline on 100-prompt golden set
💰
Cost & Latency
Per-task incremental cost ≤ $0.015 p95
p95 user-visible latency change < +5%

Open Risks

R1
Daytona SPOF
All sandboxed features degrade together on outage
get_or_start health check auto-flips per-hive flag off after 60s of 5xx; graceful degrade to in-process tools
R2
Skill Authoring Noise
Low-value skills clutter library
Gate 2 Haiku + "existing skill matches?" check + quarantine watchdog + 100-skill soft cap + DELETE endpoint
R3
Event Log Growth
Unbounded growth of bees.memory_events
Monthly partitions + TTL archive to S3 for > 12 months
R4
PII in Daily Logs
PII leakage via daily logs surfaced back to agent
Hive-scoped isolation, 120-word reflection cap, compliance masking, listed in DSAR export