Waterson / AI & Engineering Blog

Claude Code Memory & Skill Architecture: A Three-Layer Framework for AI Agent Teams

How we organized memory, skills, and project instructions to reduce token waste by 35% and onboard new team members instantly.

By Waterson AI Team  ·  April 9, 2026  ·  8 min read

At a Glance

Problem: Memory files growing unbounded, context-window waste, no way to share AI setup with teammates
Solution: Three-layer architecture — CLAUDE.md + lean MEMORY.md + on-demand Skills
Key result: ~35% token reduction, zero-training team onboarding, 25+ reusable skills
Applies to: Claude Code, but the concepts work with any AI coding agent
Template: Download the .md framework file

The Problem with Unstructured AI Memory

After using Claude Code intensively for several months — managing 5 AIA CEU courses, 20+ blog articles, and a growing library of automation scripts — we hit a familiar wall.

Our memory files kept growing. Every time we wanted the AI to remember something, we added it. After a few months, a typical conversation was loading 5,000+ tokens of memory before anything useful was even said. Worse, most of that memory wasn't relevant to what we were currently doing.

The real cost of bloated memory: Every token loaded into the context window at conversation start is a token that can't be used for actual work. On a 200k-token window, 5,000 tokens of memory overhead doesn't sound like much — until you realize that 80% of it is being loaded in conversations where it's completely irrelevant.

We also had a team onboarding problem. Whenever a new team member started using Claude Code on our project, they had to manually teach the AI all our conventions. There was no way to share an AI "configuration" the way you share a .eslintrc or a tsconfig.json.

The third problem was discoverability. We had built some excellent workflows — multi-agent AIA course rewriting, security scanning before every push, automated article publishing. But to use them, you had to remember the exact slash command. New people didn't know what tools existed.

The Solution: Three-Layer Architecture

Layer 1: CLAUDE.md → AI reads this EVERY conversation (~800 tokens)
Layer 2: Memory (MEMORY.md) → Lean index, pointers only (~200 tokens)
Layer 3: Skills (SKILL.md) → Full manuals, loaded ONLY when triggered

The core insight: not all context needs to be loaded all the time. Separate the always-relevant from the occasionally-relevant, and let the AI load the latter on demand.

Layer 1: CLAUDE.md (Always Loaded)

This is the AI's "onboarding guide." Every time anyone opens Claude Code in your project directory, the AI reads this file automatically. It's the contract between your team and the AI.

What goes here: the rules the AI needs in every single conversation — behavior rules, one-line skill pointers, project conventions, and the intent detection table described below.

What does NOT go here: full procedures, reference data, or step-by-step workflows. Anything needed only occasionally belongs in a skill, loaded on demand.
The Intent Detection Table — The Key Innovation

Instead of requiring users to memorize slash commands, write a behavior-to-skill mapping table directly in CLAUDE.md:

## Behavior Rules — Proactive Skill Suggestion

Do NOT wait for slash commands. Detect user intent:
- Clear intent → auto-use the skill
- Ambiguous intent → suggest the skill

| When the user...                    | Auto-suggest or use...  |
|-------------------------------------|------------------------|
| Discusses writing an article        | /publish-article       |
| Mentions deploying or uploading     | /upload                |
| Says "done" or "finished editing"   | /upload                |
| Reviews content for accuracy        | /content-review        |
| Is about to git push                | /security-check        |

The AI reads this table and proactively suggests the right skill — no memorization needed. This single change eliminated about 90% of the "I didn't know that existed" onboarding friction.

Layer 2: Memory (Lean Index)

Target: under 4,000 tokens total across all memory files.

Memory should contain ONLY four types of information:

  1. Behavior rules — 3–5 lines each. How the AI should work with you.
  2. Skill pointers — One line each. "For X, use /skill-y."
  3. Design preferences — Brief. UI style, coding conventions.
  4. Project index — Links to detail files, not the details themselves.

# Memory Index

## Behavior Rules
- [feedback_workflow.md](feedback_workflow.md) — Plan before coding, never jump straight to implementation
- [feedback_delegate.md](feedback_delegate.md) — Use sub-agents for execution, don't code directly

## Skill Pointers
- Writing articles → /publish-article
- Deploying → /upload
- Security scan → /security-check (mandatory before every push)

## Preferences
- [feedback_ui_style.md](feedback_ui_style.md) — Large fonts (16-20px), spacious layout

## Active Projects
- [projects_active.md](projects_active.md) — Links to 3 currently active projects

The most common mistake: putting full API documentation, step-by-step procedures, or reference data directly in memory files. These should live in a skill's references/ directory and be loaded only when that skill is triggered.
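A quick way to catch this kind of bloat is to audit memory file sizes. The sketch below approximates tokens as characters divided by four — a common rule of thumb, not an exact tokenizer — and the 500-token per-file threshold is our assumption, not a hard rule:

```python
from pathlib import Path

TOKEN_BUDGET = 4000   # total memory target from this article
PER_FILE_FLAG = 500   # hypothetical threshold: above this, consider a skill

def estimate_tokens(text: str) -> int:
    # Rough rule of thumb: ~4 characters per token.
    return len(text) // 4

def audit_memory(memory_dir: str) -> list[tuple[str, int]]:
    """Return (filename, approx tokens) for each .md file, largest first."""
    sizes = [(f.name, estimate_tokens(f.read_text(encoding="utf-8")))
             for f in Path(memory_dir).glob("*.md")]
    return sorted(sizes, key=lambda pair: -pair[1])
```

Any file that lands above the flag threshold is a candidate for extraction into a skill; the sum tells you how far you are from the 4,000-token budget.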

Layer 3: Skills (On-Demand Loading)

Skills are the on-demand half of the system. They can be as detailed as needed — 500 lines, 50 files, full reference documentation — because they're loaded only when triggered.

.claude/skills/
├── CATALOG.md              ← Overview of all skills (human + AI readable)
├── publish-article/
│   ├── SKILL.md            ← Trigger conditions + operation manual
│   └── references/         ← Detailed docs, loaded when needed
├── security-check/
│   └── SKILL.md
└── aia-rewrite/
    ├── SKILL.md
    └── references/
        ├── writing-guide.md
        └── aia-standards.md

The frontmatter of each SKILL.md tells the AI when to load it:

---
name: publish-article
description: Generate blog articles for watersonusa.ai. Use when the user
  says "write article", "generate content", "新文章", or discusses
  creating content for the website.
---

The description field is what the AI reads to decide whether to trigger the skill. Write it comprehensively — include synonyms, alternative phrasings, and the languages your team uses.
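To make the frontmatter mechanism concrete, here is a toy reader for the block between the `---` fences. Claude Code parses this internally; the sketch below only illustrates the shape of the data, and its handling of indented continuation lines (like the multi-line description above) is deliberately minimal:

```python
def parse_frontmatter(text: str) -> dict[str, str]:
    """Tiny frontmatter reader: key/value pairs between the first two '---'
    fences, with indented continuation lines folded into the prior value."""
    body = text.split("---")[1]  # content between the first pair of fences
    fields: dict[str, str] = {}
    key = None
    for line in body.splitlines():
        if not line.strip():
            continue
        if line.startswith((" ", "\t")) and key:
            fields[key] += " " + line.strip()   # continuation line
        elif ":" in line:
            key, _, value = line.partition(":")
            key = key.strip()
            fields[key] = value.strip()
    return fields
```

The point of the exercise: everything the AI uses to decide whether to load the skill lives in that small description string, which is why it deserves careful writing.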

Setup Steps

  1. Audit your existing memory. List every memory file. For each one, ask: Is this a repeatable workflow? (→ move to Skill) Is this reference data? (→ move to Skill references/) Is this used in less than 20% of conversations? (→ delete or move to Skill) What remains is your new lean Memory.
  2. Build skills from extracted workflows. For each workflow you pulled out of Memory, create a skill directory. Write the SKILL.md with frontmatter trigger conditions, ordered steps, exact commands, and quality constraints.
  3. Write CLAUDE.md with intent detection. This is the highest-leverage step. Draft the intent detection table carefully — it determines what the AI does proactively versus what it waits to be told.
  4. Compress remaining memory. Go through each remaining memory file. Can this be said in 3 lines instead of 30? Compress it. Is the detail needed every conversation? No — move it to a skill reference. Target: MEMORY.md index under 30 lines, total memory under 4,000 tokens.
  5. Separate shared from private. Commit CLAUDE.md and the skills directory to git. Keep personal preferences and private project records in ~/.claude/ (never committed). Team members who clone the repo immediately get the full setup.

Shared vs private split: think of CLAUDE.md and skills as the team's .eslintrc — committed to git, the same for everyone. Personal memory is like your editor settings — it stays on your machine.
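The split in step 5 lends itself to a simple sanity check: assert the shared layer is present in the repo and no private material has leaked in. The private-path marker below is a hypothetical example; substitute whatever you keep under ~/.claude/:

```python
from pathlib import Path

SHARED = ["CLAUDE.md", ".claude/skills"]        # committed, team-wide
PRIVATE_MARKERS = [".claude/memory_private"]    # hypothetical private path

def check_repo(repo_root: str) -> list[str]:
    """Return a list of problems; an empty list means the split is clean."""
    root = Path(repo_root)
    problems = []
    for rel in SHARED:
        if not (root / rel).exists():
            problems.append(f"missing shared file: {rel}")
    for rel in PRIVATE_MARKERS:
        if (root / rel).exists():
            problems.append(f"private data committed: {rel}")
    return problems
```

Run something like this in CI or a pre-commit hook so a teammate's clone is guaranteed to contain the full shared setup.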

Results

| Metric | Before | After |
|---|---|---|
| Memory token consumption | 5,000+ per conversation | < 4,000 (−35%) |
| New member onboarding | Manual teaching session (hours) | Clone repo → open Claude Code → automatic |
| Skill triggering | Memorize slash commands | Intent detection (natural language) |
| Team config sharing | Manual file transfer | git clone |
| Context window waste | High — unused info loaded every time | Low — on-demand loading |

How This Compares to Industry Best Practices

Our three-layer architecture aligns with the latest AI agent memory management patterns from 2025–2026. Here's how what we built maps to the terminology researchers and practitioners are using:

What we already do

| Technique | Industry term | Our implementation |
|---|---|---|
| Three-layer separation | Memory Tiering | CLAUDE.md + MEMORY.md + Skills |
| Index pointers | Pointer Index System | MEMORY.md with one-line pointers |
| On-demand loading | Progressive Disclosure / Selective Re-injection | Skills loaded only when triggered |
| Sub-agent summarization | Sub-agent Distillation | 12-role agents report summaries to orchestrator |
| Team sharing via repo | Shared Project Config | CLAUDE.md + Skills in GitHub |
| Behavior-based triggering | Intent Detection | CLAUDE.md behavior-to-skill mapping table |

What you could add next

| Technique | What it does | How to implement |
|---|---|---|
| AutoCompact | Auto-compress conversation when context hits ~92% | Add to CLAUDE.md: "In long conversations, proactively summarize completed tasks" |
| AutoDream | Background agent consolidates memory (merge duplicates, prune stale) | Build a /memory-cleanup skill that runs periodically |
| Memory Decay | Old memories auto-expire | Add dates to memory files, flag anything > 90 days for review |
| MCP Integration | Pull context from GitHub Issues, Slack, Jira | Already using Supabase for storyboard sync — extend the pattern |
| Small Model Filter | Lightweight model pre-filters memory relevance | Use Haiku to score memory relevance before injecting into context |

Bottom line: if you've followed the three-layer setup, you're already implementing most of what the research community is writing about. The next frontier is automation — making the system self-maintaining rather than manually curated.

Advanced: Multi-Agent Skills

Once the three-layer structure is in place, skills can orchestrate complex multi-agent workflows without any added memory overhead. Our AIA course rewrite skill runs 11 agents in three waves — two parallel waves of five, followed by a single sequential integration agent:

Wave 1 (parallel, 5 agents):
  - ResearchAgent    → finds current standards and competitor products
  - DraftAgent       × 3 → drafts three content sections concurrently
  - FactCheckAgent   → validates all product claims

Wave 2 (parallel, 5 agents):
  - ADAReviewAgent   → checks accessibility compliance
  - SEOAgent         → meta tags, schema, keyword density
  - LegalAgent       → regulatory claims audit
  - StyleAgent       → tone and reading level
  - CitationAgent    → formats all source notes

Wave 3 (sequential, 1 agent):
  - IntegrationAgent → merges all outputs, resolves conflicts, deploys

This entire workflow lives in the skill SKILL.md file. It contributes zero tokens to memory. It's only loaded when someone says "rewrite the course" or triggers the /aia-rewrite command.
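The wave pattern itself is simple to express. In this hedged sketch the "agents" are plain functions standing in for sub-agent calls; within a wave they run concurrently, waves run in sequence, and only their summaries flow to the final integration step:

```python
from concurrent.futures import ThreadPoolExecutor

def run_waves(waves, integrate):
    """waves: list of lists of zero-arg callables (one list per wave).
    integrate: final sequential step, receiving all collected summaries."""
    summaries = []
    for wave in waves:
        # Agents within a wave run in parallel; waves run one after another.
        with ThreadPoolExecutor(max_workers=len(wave)) as pool:
            summaries.extend(pool.map(lambda agent: agent(), wave))
    return integrate(summaries)
```

This is also why memory stays small: the orchestrator only ever holds the summaries, never each agent's full working context.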

Advanced: External Tool Integration

Skills can define tool fallback chains, making them resilient to quota limits and API outages:

## Research Step
1. Try: gemini -m gemini-2.5-flash -p "{{query}}" --output-format text
2. Fallback: Claude Sonnet sub-agent with WebSearch tool
3. Fallback: Manual research prompt to user

Because this fallback logic lives in the skill file rather than in memory, it doesn't cost tokens during conversations where research isn't needed — which is most of them.
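The fallback chain reduces to a very small control structure. A sketch, where the provider functions are stand-ins for the real CLI calls listed above — try each in order, move on when one raises, and surface every failure if they all do:

```python
def with_fallback(query, providers):
    """providers: ordered list of (name, fn) pairs.
    Returns (name, result) of the first provider that succeeds."""
    errors = []
    for name, fn in providers:
        try:
            return name, fn(query)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Ordering the list cheapest-first is what encodes the "burn the free tokens first" principle discussed later in this article.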

Advanced: Multi-AI Collaboration

Don't let Claude do everything alone. Once your skill architecture is in place, you can distribute work across multiple AI providers — maximizing throughput while minimizing cost. Each AI has a different strength and a different price point.

The delegation hierarchy

| AI | Role | When to use | Cost |
|---|---|---|---|
| Claude Opus | Orchestrator | Complex decisions, quality control, final integration | Highest — reserve for core work |
| Claude Sonnet | Workers | Writing, reviewing, code generation, auditing | Medium — your main workforce |
| Gemini Flash/Pro | Researchers | Google Search grounding, fact verification, SEO analysis, proofreading | Free (1,000 req/day) — use aggressively |
| Codex (GPT) | Code reviewer | HTML/CSS/JS quality, accessibility audit, competitive analysis | Subscription — use until quota exhausted |
| Claude Haiku | Lightweight tasks | Memory filtering, simple formatting, quick lookups | Cheapest — use for high-volume, low-complexity work |

Configure the fallback chain in CLAUDE.md

## Multi-AI Collaboration

Gemini CLI: `echo "Y" | gemini -m gemini-2.5-flash -p "QUERY" --output-format text`
Codex CLI:  `codex exec --full-auto -C /path "TASK"`

Fallback chain:
1. Gemini Flash (free) → 2. Codex (subscription) → 3. Claude Sonnet (paid per token)

Always try free/cheaper options first for:
- Web research, fact checking
- Code review, linting
- SEO analysis, proofreading
- Bulk formatting, translation

Key principle: burn the free tokens first. Gemini gives you 1,000 requests/day for free. Codex subscriptions include a token budget. Use them aggressively before falling back to Claude. Your skill files are the right place to encode this fallback logic — it costs nothing to have it there, and it only activates when relevant.

Real example: 42 agents, one session

In a single working session, we dispatched 42 agents.

[Chart: what that session produced and how the agents were distributed — the output of one working day]

None of this required the orchestrator to hold the entire context in memory. Each agent received a focused brief, worked independently, and reported a summary back. The three-layer architecture made it possible to dispatch that many agents without losing coherence.

The Core Principle

The context window is finite. Every token loaded at conversation start that turns out to be irrelevant is waste. The three-layer architecture applies the same logic as lazy loading in software: load only what you need, when you need it. The difference is that here, the resource you're conserving is not load time but the AI's capacity to reason clearly about the task at hand.

FAQ

Does this only work with Claude Code?

The CLAUDE.md mechanism is specific to Claude Code. But the underlying principles — lean always-loaded context, on-demand detailed documentation, intent-based triggering — apply to any AI agent system. You can implement equivalent structures with Cursor rules files, GitHub Copilot workspace configuration, or custom system prompts.

How do I decide what goes in a skill versus what stays in memory?

Apply this filter: if you would include it in documentation handed to a new employee on their first day, it belongs in a skill. If you would say it verbally in a 10-second handoff, it belongs in memory. Procedures, reference data, multi-step workflows, and anything longer than five lines is almost always a skill.
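That filter can even be mechanized as a rough first pass over your memory notes. This toy function encodes only the length heuristic and a few illustrative procedural markers — the keyword list is our assumption, and human judgment still makes the final call:

```python
def belongs_in_skill(note: str) -> bool:
    """Rough triage: procedures or anything over five lines → skill."""
    lines = [ln for ln in note.splitlines() if ln.strip()]
    # Numbered steps or checklists suggest a procedure (markers illustrative).
    looks_procedural = any(ln.lstrip().startswith(("1.", "- [ ]", "Step"))
                           for ln in lines)
    return looks_procedural or len(lines) > 5
```

A one-line preference like "Large fonts, spacious layout" passes the filter and stays in memory; a numbered deployment checklist does not.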

What happens when a skill gets very large?

That's expected and fine. Skills can have a references/ subdirectory with arbitrarily many files. The skill's SKILL.md acts as a table of contents — it specifies which reference files to load for which sub-tasks. A skill for a complex publication workflow might have 10 reference files totaling thousands of lines, and none of that weight appears in your context unless you're actually publishing.

How do we handle skills that multiple projects share?

Claude Code supports two locations for skills: project-level (.claude/skills/ in the repo, shared via git) and user-level (~/.claude/skills/ on your machine, private). Generic utility skills like security scanning or git operations live at user level. Project-specific skills like course rewriting or article publishing live at project level.

Want the raw .md file to use in your own project?

The full framework template — including the audit checklist, MEMORY.md structure, and SKILL.md frontmatter format — is available as a standalone markdown file.

Download the template →

About this article: developed by the Waterson USA AI team. Based on real-world experience managing 5 AIA CEU courses, 20+ blog articles, and 25+ skills across multiple projects using Claude Code. The framework described here is actively used in production at watersonusa.ai.