
Skills as Contextual Memory: Reusable Procedural Knowledge for LLM Agents


A system prompt encodes how an agent should behave. Tool schemas encode what an agent can do. Both are procedural memory, in the cognitive sense covered in an earlier article: knowledge of how to do things, loaded at session start. For simple agents, procedural memory fits in a few hundred tokens. For general-purpose agents (the kind that need to know how to write a Jira ticket, how to deploy to a staging cluster, how to format a compliance report, how to review a pull request, and hundreds of other procedures), the naive approach of stuffing every procedure into the system prompt fails. The prompt exceeds reasonable size long before it covers the full skill set.

Skills, introduced in Anthropic's pattern guidance in late 2025, solve this by externalizing procedural memory into a filesystem-backed catalog that the agent discovers and loads dynamically (Anthropic, 2025). Each skill is a folder with an instruction file, optional reference documents, and optional scripts. The key primitive is progressive disclosure: only skill metadata loads at startup, the full skill content loads when the task calls for it, and supplementary files load only when specific scenarios demand them. Context capacity becomes, in Anthropic's phrasing, "effectively unbounded" when skills live on the filesystem rather than in the prompt.

What a skill looks like

A skill is a directory. At minimum it contains one required file: SKILL.md, an instruction document with YAML frontmatter describing the skill and its applicability. Optional companions include reference documents, scripts the agent can execute deterministically, and example files.

skills/
  deploy-staging/
    SKILL.md            # required: name, description, when to use
    reference.md        # optional: detailed policy or API reference
    scripts/
      deploy.sh         # optional: executed without loading into context
      rollback.sh
  write-jira-ticket/
    SKILL.md
    templates/
      bug.md
      feature.md
  compliance-report/
    SKILL.md
    reference.md
    scripts/
      generate.py

The SKILL.md frontmatter identifies the skill (name, description) and, crucially, the conditions under which the skill applies. Anthropic's convention is to include a short "when to use this skill" section that the agent reads when matching a task to available skills.
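For concreteness, a SKILL.md for the deploy-staging skill might look like the string below. The field values and the body text are illustrative assumptions, not taken from Anthropic's documentation, and `split_frontmatter` is a hypothetical helper that uses only the standard library and handles flat `key: value` lines rather than full YAML.

```python
# Hypothetical SKILL.md content for illustration; the wording is an assumption.
EXAMPLE_SKILL_MD = """---
name: deploy-staging
description: Deploy a service to the staging cluster. Use when the user asks to ship, release, or roll back a build on staging.
---

# Deploy to staging

## When to use this skill
Use for staging deploys and rollbacks only; production has its own skill.

## Steps
1. Run scripts/deploy.sh with the service name.
2. If the health check fails, run scripts/rollback.sh.
"""


def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a SKILL.md string into (frontmatter fields, markdown body).

    Simplified sketch: handles only flat `key: value` lines, not full YAML.
    """
    # The frontmatter sits between the first two '---' markers.
    _, raw_meta, body = text.split("---", 2)
    meta = {}
    for line in raw_meta.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta, body.strip()


meta, body = split_frontmatter(EXAMPLE_SKILL_MD)
print(meta["name"])         # the identifier shown in the catalog
print(meta["description"])  # the text the selector matches against
```

Only `meta` would be visible at startup; `body` is what loads once the skill is selected.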

Progressive disclosure in three levels

flowchart TD
    START([Agent startup]) --> L1[Load skill metadata only: name, description]
    L1 --> TASK([Task arrives])
    TASK --> MATCH{Does a skill match?}
    MATCH -->|no| PROCEED[Proceed without skill]
    MATCH -->|yes| L2[Load SKILL.md full content]
    L2 --> L3{Need more detail?}
    L3 -->|no| EXEC[Execute with skill guidance]
    L3 -->|yes| L4[Load supplementary files]
    L4 --> EXEC
    EXEC --> DONE([Task complete])

Level 1. At agent startup, only the skill name and description are loaded. For a catalog of fifty skills, this is perhaps a thousand tokens of metadata. The agent sees what it can do but not how.
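The thousand-token figure can be sanity-checked with a back-of-the-envelope sketch. It assumes roughly four characters per token, which is only a rule of thumb for English text, and uses placeholder skill names and a made-up description; a real tokenizer would give a different count.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token for English text (assumption)."""
    return len(text) // 4


def catalog_text(skills: dict[str, str]) -> str:
    """Render the Level 1 catalog: one line per skill, name and description only."""
    return "\n".join(f"- {name}: {desc}" for name, desc in skills.items())


# Fifty placeholder skills, each with a one-sentence description.
description = "Describe one procedure in a sentence and say when the agent should use it."
skills = {f"skill-{i:02d}": description for i in range(50)}

budget = estimate_tokens(catalog_text(skills))
print(f"~{budget} tokens for {len(skills)} skills")
```

With descriptions of about one sentence each, the estimate lands near a thousand tokens; longer descriptions scale the budget linearly.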

Level 2. When a task matches a skill description, the full SKILL.md loads into the context. The agent now has the procedure it needs for this specific task.

Level 3. If the task calls for detail beyond what SKILL.md contains (reference tables, examples, edge cases), the agent reads the supplementary files. Only the specific file relevant to the current scenario loads; the rest stays on disk.
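A Level 3 read can be as small as the sketch below. The function name `load_supplementary` and the path check are assumptions added for illustration; the check refuses requests that would escape the skill folder, since the file name typically comes from the model.

```python
from pathlib import Path


def load_supplementary(skill_dir: str, relative_path: str) -> str:
    """Return the text of one supplementary file inside a skill folder (Level 3)."""
    root = Path(skill_dir).resolve()
    target = (root / relative_path).resolve()
    # Refuse paths that escape the skill folder, e.g. "../other-skill/SKILL.md".
    if not target.is_relative_to(root):
        raise ValueError(f"{relative_path!r} is outside the skill folder")
    if not target.is_file():
        raise FileNotFoundError(f"{relative_path!r} not found in {root.name}")
    return target.read_text(encoding="utf-8")
```

A call such as `load_supplementary("skills/write-jira-ticket", "templates/bug.md")` would bring only the bug template into context and leave `feature.md` on disk.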

Scripts inside a skill are a special case. They execute deterministically in a code environment; their text is not loaded into the context. A skill that includes scripts/deploy.sh can run the script without ever spending tokens on its contents. This is the same pattern Anthropic calls out in programmatic tool calling, applied to procedural memory rather than tool orchestration.
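One way to honor that separation is to hand the model only the script's result. The sketch below is an assumption about how a harness might do it: `run_skill_script` is a hypothetical helper, and choosing the interpreter from the file extension is a simplification.

```python
import subprocess
import sys
from pathlib import Path


def run_skill_script(skill_dir: str, script: str, *args: str, timeout: float = 60.0) -> str:
    """Run a script from a skill's scripts/ folder and summarize the result."""
    script_path = Path(skill_dir) / "scripts" / script
    # Pick an interpreter from the extension; this mapping is an assumption.
    runner = [sys.executable] if script_path.suffix == ".py" else ["sh"]
    result = subprocess.run(
        [*runner, str(script_path), *args],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    # Only the exit code and output reach the context, never the script source.
    return (
        f"exit_code={result.returncode}\n"
        f"stdout:\n{result.stdout}\n"
        f"stderr:\n{result.stderr}"
    )
```

The returned string is what the agent reasons over; the script body itself is never read into the prompt.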

Two versions in code

The excerpt below shows the skill-loading shape without a framework. The agent scans a skills directory at startup for metadata and loads the full content on demand.

from pathlib import Path
from openai import OpenAI
import yaml

client = OpenAI()

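def load_skill_metadata(skills_dir: str) -> dict:
    """Level 1: read only the frontmatter of every SKILL.md under skills_dir."""
    metadata = {}
    for skill_md in Path(skills_dir).rglob("SKILL.md"):
        content = skill_md.read_text(encoding="utf-8")
        # Frontmatter sits between the first two '---' markers; skip files without it.
        parts = content.split("---", 2)
        if len(parts) < 3:
            continue
        meta = yaml.safe_load(parts[1])
        metadata[meta["name"]] = {
            "description": meta["description"],
            "path": str(skill_md),
            "dir": str(skill_md.parent),
        }
    return metadata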

def select_skill(query: str, skills: dict) -> str | None:
    catalog = "\n".join(f"- {n}: {m['description']}" for n, m in skills.items())
    r = client.responses.create(
        model="gpt-4o-mini",
        instructions=(f"Available skills:\n{catalog}\n\n"
                      "Return the exact skill name that applies, or 'none'."),
        input=query)
    chosen = r.output_text.strip()
    return chosen if chosen in skills else None

def answer_with_skill(query: str, skills_dir: str = "./skills") -> str:
    skills = load_skill_metadata(skills_dir)
    chosen = select_skill(query, skills)
    instructions = "You are a helpful assistant."
    if chosen:
        skill_body = Path(skills[chosen]["path"]).read_text()
        instructions = f"Apply this skill.\n\n{skill_body}"
    return client.responses.create(
        model="gpt-4o-mini", instructions=instructions, input=query,
    ).output_text

The LangGraph version wires skill discovery as a tool. The agent sees a list_skills tool that returns metadata and a load_skill tool that returns the full content; Level 3 files become their own tools when needed.

from pathlib import Path
from langchain_core.tools import tool
from langchain.chat_models import init_chat_model
from langgraph.prebuilt import create_react_agent
import yaml

SKILLS_DIR = Path("./skills")

@tool
def list_skills() -> str:
    """List available skills by name and description."""
    out = []
    for skill_md in SKILLS_DIR.rglob("SKILL.md"):
        meta = yaml.safe_load(skill_md.read_text().split("---")[1])
        out.append(f"{meta['name']}: {meta['description']}")
    return "\n".join(out)

@tool
def load_skill(name: str) -> str:
    """Load the full instructions for a named skill."""
    for skill_md in SKILLS_DIR.rglob("SKILL.md"):
        meta = yaml.safe_load(skill_md.read_text().split("---")[1])
        if meta.get("name") == name:
            return skill_md.read_text()
    return f"skill '{name}' not found"

agent = create_react_agent(
    model=init_chat_model("gpt-4o-mini"),
    tools=[list_skills, load_skill],
)

Full runnable versions will live at github.com/subodhjena/agentic-patterns alongside the persistence examples.

When skills help

Skills shine when the agent's domain is large, procedural, and evolving.

Broad domains with many procedures. IT operations, customer support, finance reporting, compliance workflows, software deployment. Each of these has dozens to hundreds of distinct procedures; encoding all of them in a system prompt is not feasible.

Procedures that evolve independently. A skill lives in a file; updating it does not touch the agent. Teams that own specific skill files can iterate on them without coordinating with the agent's owners.

Procedures that benefit from deterministic scripts. A deploy procedure that consists of running two shell scripts and checking their output fits cleanly as a skill with scripts. The agent does not need to generate the commands; it runs them.

Agents shared across teams. One agent serving finance, engineering, and operations can have finance skills, engineering skills, and operations skills without mixing the three. Each team owns its own skills directory.

Where skills break down

The pattern inherits most of the failure modes of tool design plus a few specific to filesystem-backed memory.

Skill discovery failures. The selector does not find the right skill for a task. Either skill descriptions are vague, or the catalog is so large that the selector's prompt overflows. Keep descriptions concrete (one to two sentences, naming specific triggers) and paginate discovery when the catalog exceeds a few dozen skills.
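Paginated discovery could look like the following sketch. The function name, the default page size of 20, and the footer format are all assumptions; the point is only that the selector sees one slice of the catalog at a time and knows whether more pages exist.

```python
def list_skills_page(skills: dict[str, str], page: int = 1, page_size: int = 20) -> str:
    """Return one page of the catalog as text, with a footer naming the page count."""
    names = sorted(skills)
    # Ceiling division; an empty catalog still has one (empty) page.
    total_pages = max(1, -(-len(names) // page_size))
    if not 1 <= page <= total_pages:
        raise ValueError(f"page must be between 1 and {total_pages}")
    start = (page - 1) * page_size
    lines = [f"- {name}: {skills[name]}" for name in names[start:start + page_size]]
    lines.append(f"(page {page} of {total_pages})")
    return "\n".join(lines)
```

Exposed as a tool, this lets the agent request further pages only when nothing on the current page matches.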

Stale skills. A skill that was accurate when written but no longer matches the current system. The agent executes the wrong procedure. Skills need owners and review cadence, like any piece of code.
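A review cadence can be enforced mechanically. The sketch below assumes each skill records a `last_reviewed` ISO date somewhere, for example in its frontmatter; that field is hypothetical and not part of the skill format described above, and the 90-day window is an arbitrary choice.

```python
from datetime import date, timedelta


def stale_skills(last_reviewed: dict[str, str], today: date, max_age_days: int = 90) -> list[str]:
    """Return names of skills whose last review is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        name
        for name, iso_date in last_reviewed.items()
        if date.fromisoformat(iso_date) < cutoff
    )
```

Run in CI or on a schedule, a check like this turns "skills need owners" into a list of names to chase.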

Overlapping skills. Two skills with similar descriptions confuse the selector. Merge overlapping skills or refine descriptions until a human could cleanly choose between them.
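Candidate overlaps can be surfaced before a human reviews them. The sketch below uses Jaccard similarity over the words in each description, which is a crude lexical stand-in chosen for illustration; the 0.6 threshold and the function name are assumptions, and an embedding-based comparison might catch paraphrases this misses.

```python
import re
from itertools import combinations


def _words(text: str) -> set[str]:
    """Lowercased alphanumeric words in a description."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlapping_pairs(skills: dict[str, str], threshold: float = 0.6) -> list[tuple[str, str, float]]:
    """Return (skill_a, skill_b, score) for description pairs at or above the threshold."""
    pairs = []
    for (name_a, desc_a), (name_b, desc_b) in combinations(sorted(skills.items()), 2):
        words_a, words_b = _words(desc_a), _words(desc_b)
        union = words_a | words_b
        # Jaccard similarity: shared words divided by all distinct words.
        score = len(words_a & words_b) / len(union) if union else 0.0
        if score >= threshold:
            pairs.append((name_a, name_b, round(score, 2)))
    return pairs
```

Each flagged pair is a prompt for the owners to merge the two skills or sharpen one description.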

Levels misaligned. A skill whose essential details live in a supplementary file forces every execution through Level 3. Anything the agent needs on every run belongs in SKILL.md. Promote details that always apply; keep Level 3 for conditional depth.

Catalog bloat. A thousand skills is a signal to reorganize. Group skills into domains, load only domain-specific catalogs per agent, and treat the skill system like any growing codebase with its own directory structure.

Leakage across tenants. A skill that contains tenant-specific credentials or policy leaks if the directory is shared across tenants. Skills that apply only to one tenant belong in a tenant-scoped directory; the loader must enforce the scoping.
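Loader-side scoping might be sketched as follows. The directory layout (`shared/` plus `tenants/<tenant_id>/` under one root), the function name, and the identifier pattern are assumptions for illustration; the tenant identifier is expected to come from the authenticated session, not from the model.

```python
import re
from pathlib import Path


def tenant_skill_files(skills_root: str, tenant_id: str) -> list[Path]:
    """Return the SKILL.md files one tenant may see: shared skills plus its own."""
    # Allow only plain identifiers so values like "../other" cannot change the path.
    if not re.fullmatch(r"[A-Za-z0-9_-]+", tenant_id):
        raise ValueError(f"invalid tenant id: {tenant_id!r}")
    root = Path(skills_root).resolve()
    found: list[Path] = []
    for base in (root / "shared", root / "tenants" / tenant_id):
        if base.is_dir():
            found.extend(sorted(base.rglob("SKILL.md")))
    return found
```

Because the catalog is built only from these two directories, another tenant's skills never reach Level 1, so they cannot be selected or loaded.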

Trade-offs against system prompts and tools

Skills are the third way of encoding procedural memory, alongside system prompts and tools. Each has a distinct cost profile.

| Axis | System prompt | Tools | Skills |
|---|---|---|---|
| Load time | Every session start | Every session start | On-demand |
| Token cost per session | Proportional to prompt size | Proportional to tool count | Proportional to used skills |
| Scalability | Hundreds of tokens | Dozens of tools | Thousands of skills |
| Update cadence | Requires prompt redeploy | Requires tool redeploy | File-level update |
| Ownership | Agent owners | Agent owners | Distributed across teams |
| Fit | Core behavior | Available actions | Procedures |

A system prompt tells the agent who it is. Tools tell the agent what it can do. Skills tell the agent how to perform specific tasks. All three can coexist in the same agent, and they usually do.

Neighbors in the series

Short-term and long-term memory, earlier in the Memory stage, names procedural memory as one of the four kinds; skills are the production pattern for procedural memory. Persistence and checkpointing, the previous article, covers how thread state is saved, which is complementary to how skills are loaded (skills are read-only; state is read-write). Context engineering, in the Foundations stage, describes the just-in-time retrieval pattern that progressive disclosure instantiates. The agent-computer interface article, in the Agents stage, covers tool design, which overlaps with skill scripts. Harness design, in the Production stage, describes how skills integrate into larger agent architectures.

References

  1. Anthropic. Agent skills: organized folders of procedural knowledge. December 2025.
  2. Anthropic. Building effective agents. December 2024.
  3. Anthropic. Model Context Protocol specification. 2024.
  4. LangChain. Dynamic tool loading in LangGraph. 2024.
  5. OpenAI. Assistants API: file search and uploaded files. 2024.