# Tabbit Skill Creator Playbook

A complete authoring guide for building Tabbit Agent V2 skills using third-party IDE agents — Claude Code, Codex, ChatGPT, Cursor, Aider, and similar tools — running outside Tabbit itself.

This playbook is self-contained. Read it once, keep it open while writing a skill, and your output should be runnable by any Tabbit Agent without further adaptation.

---

## 0. Audience and Pre-Requisites

You are writing code in an IDE with an AI assistant. You will hand the resulting `skill/` directory (or `.zip`) to Tabbit for distribution. Your assistant is the SKILL AUTHOR, not the SKILL RUNNER. It cannot call `navigate_page`, `evaluate_script`, or `e2b_bash`. It can only edit files, run local commands, and reason about how the skill will behave once a Tabbit Agent loads it.

Prerequisites before you start:

- Local Python 3.11+ for testing sandbox scripts in isolation
- Browser DevTools or `curl` + `grep` to inspect target pages
- A clear, single-sentence statement of what the skill does and when it should trigger

Write skills the way a senior engineer writes a runbook. Assume a less-informed Agent will execute it under stress, with partial context, against rate-limited services. Every decision you can pre-bake in code or documentation is a decision the runtime Agent cannot get wrong.

---

## 1. What a Tabbit Skill / Web Agent Is

A Tabbit skill is a bundle of:
- `SKILL.md` — the runtime contract that routes the Agent through the workflow
- Pre-written page scripts (browser-side JavaScript) — selectors and DOM extraction
- Pre-written sandbox scripts (Python or Node) — data processing, file generation, network calls
- Templates, references, examples

The skill is loaded into an Agent's context at runtime via `load_skill`. The Agent then chooses which bundled scripts to invoke, in what order, with what arguments. Your job as author is to make those choices easy and the wrong choices hard.

The skill is not a program. It is institutional memory: how to solve this class of problem reliably, encoded in scripts + a runbook.

---

## 2. The Five Capability Layers

A Tabbit Agent has five distinct capability layers. Each layer has a purpose. Mixing layers within a single workflow phase is a design smell.

| Layer | What it does | What it cannot do |
|---|---|---|
| **Browser GUI** (`navigate_page`, `take_snapshot`, `take_screenshot`, `click`, `scroll`, `wait`, `type_text`, `select_option`, `press_key`, `move_mouse`, `drag_and_release`, `get_pdf_data`) | Drive the user's real browser. Use logged-in state. Visually verify. Read PDFs rendered in the browser. | Bulk data processing, complex file generation, anything requiring more than a few hundred items. |
| **Page Script** (`evaluate_script`) | Run bundled JavaScript inside the current browser page. Extract structured JSON. Use the user's session and cookies. | Be authored at runtime. Bypass anti-bot measures. Make external network calls. |
| **Sandbox** (`e2b_bash`, `e2b_read`, `e2b_write`, `e2b_edit`, `e2b_glob`, `e2b_grep`) | Run scripts in a cloud Linux container. Process files. Install lightweight dependencies. Generate CSV / PDF / PPT / DOCX / Markdown. | Install software on the user's machine. Use the user's browser session. Run interactive commands. |
| **Public Web** (`web_fetch`, `web_search`) | Search public sources. Fetch public URLs without any user session. Supplement with public references. | Access logged-in pages. Read content gated by cookies. |
| **MCP / Connector** (`mcp__<connector>__<tool>`) | Call third-party SaaS APIs the user has configured (GitHub, Linear, Notion, Drive, databases, design tools). **Only HTTP/remote MCP is officially supported** — the platform initiates calls from the cloud. | Access pages that require browser cookies. Be assumed available — different users have different connectors. **Use local `stdio` MCP, local-process MCP, or any setup requiring `npx` / `uvx` / `python -m` to start an MCP server on the user's machine** — these do not become injected `mcp__...` tools and do not flow through the platform's encrypted-credential / OAuth path. |

---

## 3. Capability Routing — The Decision Tree

This is the most important section of the playbook. Use it for every data-acquisition step you design.

```
Q1: Does the content require a logged-in session?
    YES → Browser GUI + page script. Only option.
    NO  → Q2

Q2: Is the platform automation-sensitive (anti-bot, CAPTCHA, visible rate
     limits, e-commerce or social-platform back office)?
    YES → Browser GUI + page script, with low-frequency real-user-path
          traversal. Add stop-on-CAPTCHA logic. Do NOT use cloud HTTP fetch.
    NO  → Q3

Q3: Is the content available with a plain HTTP GET against the page URL?
    YES → web_fetch the page. If it parses cleanly, done.
    NO  → Q4 (the page is JS-rendered or content is split across requests)

Q4: Does the page load any data file via fetch()/XHR? Check the page's own
    JS files for fetch("data/...json") or "/api/..." patterns.
    YES → web_fetch THAT URL directly. Bypasses the rendering layer.
    NO  → Q5

Q5: Browser GUI + page script. The bundled script extracts from real DOM
    via take_snapshot for selector reference + evaluate_script for the
    actual extraction.
```

**The most consequential creator mistake is defaulting to browser for public, non-login pages.** Browser is heavy, slow, and burns context via `take_snapshot`. Use it when it is the only thing that works, not as a default.

For every phase of your workflow, document the answer to this decision tree in `SKILL.md`'s "Capability routing" section. The runtime Agent should not have to re-derive the answer under pressure.
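
The decision tree can also be written down as a small function, which is handy when you want to sanity-check the routing you documented for each phase. This is a minimal sketch: the `Step` fields and the returned labels are hypothetical names chosen for illustration, not Tabbit identifiers.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """Facts about one data-acquisition step (hypothetical field names)."""
    needs_login: bool = False
    automation_sensitive: bool = False
    plain_get_works: bool = False
    has_data_endpoint: bool = False


def route(step: Step) -> str:
    """Walk Q1..Q5 in order and return the layer to document in SKILL.md."""
    if step.needs_login:
        return "browser+page_script"                # Q1: only option
    if step.automation_sensitive:
        return "browser+page_script:low_frequency"  # Q2: real-user path
    if step.plain_get_works:
        return "web_fetch:page"                     # Q3: plain HTTP GET
    if step.has_data_endpoint:
        return "web_fetch:data_url"                 # Q4: fetch the JSON/API URL
    return "browser+page_script"                    # Q5: last resort
```

Order matters: a login requirement wins even when a plain GET appears to work, because the GET would return the logged-out view.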

---

## 4. Skill Package Layout

```
my-skill/
├── SKILL.md                      ← runtime contract (see §5)
├── scripts/
│   ├── browser/
│   │   └── extract_<thing>.js    ← page scripts, one per extraction task
│   └── sandbox/
│       ├── fetch_<thing>.py      ← network-bound data acquisition
│       ├── process_<thing>.py    ← deterministic transformations
│       ├── generate_<thing>.py   ← artifact production + validation
│       └── requirements.txt      ← pinned deps
├── templates/                    ← Jinja-style scaffolds for generated docs
├── references/                   ← taxonomy, rubrics, domain rules (Markdown)
└── examples/                     ← sample inputs/outputs, schema documentation
```

Hard rules:

- `SKILL.md` is loaded into Agent context on every invocation. Keep it under ~12k tokens. Push detail into `references/`.
- Every script must be runnable as-is. The Agent patches selectors and arguments, never control flow.
- Sandbox scripts accept `--in` and `--out` paths via argparse. No hard-coded paths.
- All paths derive from a `WORKDIR` environment variable with a sane default. The Tabbit sandbox conventionally mounts at a known path, but local testing must be able to override.
- No tokens, cookies, API keys, or service credentials anywhere in the bundle.

---

## 5. SKILL.md — Required Sections

Every `SKILL.md` contains these sections in this order:

```markdown
# <Skill Name>

## When to use
- 3-7 bullets of user intent patterns that trigger this skill.
- 1-2 bullets of intents that should NOT trigger it (anti-triggers).

## Runtime requirements
- Browser login required: yes/no (which sites)
- Sandbox required: yes/no
- MCP required: optional/required/no (which connectors; must be HTTP/remote — do NOT depend on local stdio MCP)
- Page script required: yes/no
- Public web access: yes/no
- User local folder: optional/required/no

## Capability routing
Phase-by-phase table mapping each step to ONE capability layer.
Include "why this layer for this step" — the Agent needs the reasoning
for edge cases.

## Workflow
Numbered steps. For each step:
- Tool call (which script, what args, what input file)
- Expected output (shape, approximate size)
- Failure-mode → fallback link

## Files in this skill
Table: path | layer | purpose. List EVERY file, not just headline scripts.

## Output contract
- What gets returned to the user (the final answer)
- File paths produced
- Required vs optional fields per output file
- Validation signals the Agent should check before declaring success

## Failure handling
Table: failure | detection | response. Cover at minimum:
HTTP 429, HTTP 403, HTTP 5xx, empty extraction, dependency install failure,
PDF text extraction failure, user interruption mid-run, validator failure,
MCP connector unavailable.
```

Three authorship rules learned from production failures:

**Rule 1 — Pair every "do X" with "do NOT do Y" when Y is a plausible shortcut.**

A runtime Agent under pressure will take the path that looks easiest. If parsing a snapshot text looks easier than calling `evaluate_script`, the Agent might do it. Pre-empt the shortcut explicitly:

> "Extract papers via `evaluate_script` with `scripts/browser/extract_program.js`. Do NOT parse `take_snapshot` text — the snapshot is a curated text representation, not real DOM, and section metadata is lost."

**Rule 2 — For multi-source acquisition, write priority order with the failure mode it prevents.**

Don't just list sources. Explain why the order matters:

> "Try arXiv first, OpenReview second, AAAI OJS last. The publisher's OJS returns HTTP 403 to most cloud IPs; do not retry on 403, mark `not_found`, and fall through to the next source."

**Rule 3 — For tasks that span more than ~30 items, mandate progress persistence.**

> "Write `progress.json` to `WORKDIR` after every batch of 10 items. If a later run finds the file, resume from the last completed batch."
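
One way a sandbox script could honor Rule 3 is sketched below. It assumes each item is a dict with a stable `id` key and that `handle` is the per-item worker; both names are hypothetical. It resumes by processed ID rather than by batch index, which is slightly stricter than the quoted rule.

```python
import json
import os
from pathlib import Path

BATCH = 10  # items per checkpoint, matching the rule above


def run_with_progress(items, handle, workdir):
    """Process items in batches and checkpoint processed IDs after each batch."""
    path = Path(workdir) / "progress.json"
    done = set()
    if path.exists():
        done = set(json.loads(path.read_text())["processed_ids"])
    todo = [it for it in items if it["id"] not in done]
    for start in range(0, len(todo), BATCH):
        for it in todo[start:start + BATCH]:
            handle(it)
            done.add(it["id"])
        # Write to a temp file, then rename, so a crash never leaves half a file.
        tmp = path.with_name("progress.json.tmp")
        tmp.write_text(json.dumps({"processed_ids": sorted(done)}))
        os.replace(tmp, path)
    return done
```

If the run dies mid-batch, the items of that unfinished batch are simply processed again on the next run, so `handle` should be safe to repeat.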

---

## 6. Page Script Pattern

Page scripts run inside the user's browser via `evaluate_script`. The bundle is the contract: the runtime Agent must use what you ship, never write a fresh page script at runtime.

Why this is non-negotiable: the Agent's view of the page is a curated text representation, not the real DOM. Selectors written from that view are unreliable. Page structure, async loading, framework wrappers, and event handlers don't appear in the snapshot. A script written at runtime against a snapshot view will silently lose data, misclassify rows, or extract empty fields.

Template:

```javascript
/**
 * extract_<thing>.js
 *
 * Runs inside the user's browser on <target URL pattern>.
 * Returns a structured JSON object the sandbox can process.
 *
 * Args (all optional):
 *   { limit?: number, sections?: string[], debug?: boolean }
 *
 * Output:
 *   { ok, page_url, page_title, count, items: Item[], warnings: string[] }
 */
async function run(args) {
  const opts = Object.assign({ limit: null, debug: false }, args || {});

  // ── Selector pack — Agent may patch ONLY this block. ────────────────────
  const selectors = {
    container: '.list, [role="list"], ul.results',
    row:       'li.item, div.row, article',
    title:     '.title, h2, h3, [data-title]',
    link:      'a[href]',
    meta:      '.meta, .byline, time',
  };
  // ────────────────────────────────────────────────────────────────────────

  const warnings = [];
  const items = [];

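  // Rows already visited; nested containers can match the same row twice.
  const seen = new Set();

  collect:
  for (const container of document.querySelectorAll(selectors.container)) {
    for (const row of container.querySelectorAll(selectors.row)) {
      if (seen.has(row)) continue;
      seen.add(row);
      const title = row.querySelector(selectors.title)?.textContent?.trim() || '';
      if (!title) continue;
      items.push({
        title,
        url:  row.querySelector(selectors.link)?.href || '',
        meta: row.querySelector(selectors.meta)?.textContent?.trim() || '',
      });
      // Leave both loops; a plain `break` would only exit the inner one.
      if (opts.limit && items.length >= opts.limit) break collect;
    }
  }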

  if (items.length === 0) {
    warnings.push('extracted 0 items — selectors may be stale; inspect snapshot.');
  }

  return {
    ok: items.length > 0,
    page_url: location.href,
    page_title: document.title,
    count: items.length,
    items,
    warnings,
  };
}
```

Required features:

- Single `selectors` object near the top. The Agent's only patch zone.
- Returns JSON, never HTML.
- Returns a `warnings[]` array so the sandbox can detect drift.
- Honors a `limit` argument for testing.
- Single `async function run(args)` entry point. Tabbit expects this signature.

Forbidden features:

- No XPath unless CSS cannot express the selector. XPath is harder to patch.
- No outbound network calls. The page may have CSP. Network goes in sandbox.
- No DOM mutation. Page scripts are read-only.
- No `eval`. No dynamic code generation.
- No response larger than ~50 KB. If the page is paginated, extract page N and let the Agent invoke again.
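
On the sandbox side, the output shape documented in the template header can be checked before any processing starts. The sketch below is one possible check, assuming the payload has already been parsed from JSON into a dict; `check_extraction` is a hypothetical helper name.

```python
REQUIRED_KEYS = {"ok", "page_url", "page_title", "count", "items", "warnings"}


def check_extraction(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload looks sane."""
    problems = []
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        # Without the full envelope the remaining checks are meaningless.
        return [f"missing keys: {sorted(missing)}"]
    items = payload["items"]
    if payload["count"] != len(items):
        problems.append("count does not match len(items)")
    if not items:
        problems.append("0 items extracted")
    elif not any(it.get("title") for it in items):
        problems.append("rows extracted but every title is empty")
    # Surface drift warnings emitted by the page script itself.
    problems.extend(f"script warning: {w}" for w in payload["warnings"])
    return problems
```

A non-empty result is the signal to patch the `selectors` block or fall back, never to rewrite the page script.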

---

## 7. Sandbox Script Patterns

Three patterns cover almost all sandbox work. Pick one per script — don't mix.

### 7.1 Network-bound acquisition (`fetch_*.py`)

```python
import argparse, hashlib, json, os, re, time
from collections import defaultdict
from pathlib import Path
from urllib.parse import urlparse

import requests
from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential

WORKDIR = Path(os.environ.get("WORKDIR", "/work/my-skill"))
HEADERS = {"User-Agent": "Tabbit-MySkill/1.0 (purpose; contact)"}
_LAST = defaultdict(float)
RATE = 1.0  # minimum seconds between requests to the same host

def _gate(url):
    """Sleep just long enough to keep each host at one request per RATE seconds."""
    host = urlparse(url).netloc
    gap = time.time() - _LAST[host]
    if gap < RATE:
        time.sleep(RATE - gap)
    _LAST[host] = time.time()

def _retryable(exc):
    """Retry 429, 5xx, and transport errors. Never retry 403 or other 4xx."""
    if isinstance(exc, requests.HTTPError):
        status = exc.response.status_code if exc.response is not None else None
        return status == 429 or (status is not None and status >= 500)
    return isinstance(exc, requests.RequestException)

@retry(stop=stop_after_attempt(3),
       wait=wait_exponential(multiplier=1.5, min=2, max=20),
       retry=retry_if_exception(_retryable),
       reraise=True)
def _get(url, **kwargs):
    _gate(url)
    r = requests.get(url, headers=HEADERS, timeout=30, **kwargs)
    r.raise_for_status()  # raises HTTPError on 4xx/5xx; _retryable decides what repeats
    return r

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--in",  dest="inp", required=True, type=Path)
    ap.add_argument("--out", required=True, type=Path)
    args = ap.parse_args()
    # ... per-item processing, one line per N items to stderr ...

if __name__ == "__main__":
    main()

Required practices:

- Per-host rate gate (1 req/sec default).
- Tenacity with exponential backoff, capped at 3 attempts.
- HTTP 429 → raise. Tenacity retries with backoff. If it still 429s, mark item `not_found` and continue. Never loop on 429.
- HTTP 403 → do NOT retry. 403 is a deliberate block. Mark `not_found` and continue.
- HTTP 5xx → let tenacity handle. After max retries, mark `not_found`.
- All paths from `WORKDIR` env var.
- Progress to stderr, one line per batch (not per item). The Agent will surface these via its monitor.
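
The status-code rules above reduce to two small decisions: whether to retry, and what to record once retrying stops. A minimal sketch follows; the `reason` labels and the record layout are hypothetical, chosen only to show the shape.

```python
from typing import Optional


def should_retry(status: Optional[int]) -> bool:
    """429 and 5xx are worth retrying; 403 and other 4xx are not.

    `status` is None when the request failed before any HTTP response
    (timeout, DNS, connection reset), which is also treated as retryable.
    """
    return status is None or status == 429 or status >= 500


def final_record(item_id: str, status: Optional[int]) -> dict:
    """Record written once retries are exhausted (or skipped, for 403)."""
    if status == 403:
        reason = "blocked_403"
    elif status == 429:
        reason = "rate_limited_429"
    elif status is not None and status >= 500:
        reason = "server_error"
    else:
        reason = "request_failed"
    # Never fabricate a value: data stays None so the report shows the gap.
    return {"id": item_id, "status": "not_found", "reason": reason, "data": None}
```

Keeping the reason next to `not_found` lets the final report distinguish a blocked source from a rate-limited one in its coverage stats.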

### 7.2 Deterministic transformation (`process_*.py`)

Zero network calls. Pure function of input files. Use for:

- Parsing PDFs / HTML to plain text
- Extracting structured fields from text
- Deduplication, normalization
- Computing per-item metrics
- Clustering, tagging, classification

These scripts should be deterministic and cacheable. Re-running with the same input produces the same output.
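
As an illustration of what deterministic means in practice, here is a sketch of title-based deduplication whose output does not depend on input order beyond the first-seen tie-break. It assumes items are dicts with a `title` key; the function names are hypothetical.

```python
import json
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Case-fold, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text.casefold())
    return " ".join(text.split())


def dedupe(items: list) -> list:
    """Keep the first item per normalized title, then sort for stable output."""
    seen = {}
    for item in items:
        seen.setdefault(normalize_title(item["title"]), item)
    return [seen[key] for key in sorted(seen)]


def dump(items: list) -> str:
    """Serialize with sorted keys so identical input yields identical bytes."""
    return json.dumps(items, sort_keys=True, ensure_ascii=False, indent=2)
```

Sorting by the normalized key and serializing with `sort_keys=True` is what makes the output safe to cache and to diff between runs.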

### 7.3 Artifact generation (`generate_*.py`)

Reads processed data, renders via templates, validates the output. Two modes:

```python
ap.add_argument("--validate", type=Path)  # validation mode
# ... else: full render mode
```

Use a minimal template engine — Python's `str.format` or a ~50-line custom one — over a heavy dependency. **If you build a custom template engine, do NOT support nested `{% for %}` blocks naively with non-greedy regex.** Nested loops with non-greedy regex match the wrong endfor pair. Either pre-render inner content as strings in the data layer, or use a real template library.
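
The nested-loop failure is easy to reproduce. The sketch below first shows a naive non-greedy pattern pairing the outer loop header with the inner `{% endfor %}`, then shows the pre-rendering workaround using plain `str.format`. The template syntax and the `render_flat` helper are illustrative assumptions, not part of any Tabbit tooling.

```python
import re

# Naive loop pattern: a non-greedy body that stops at the FIRST {% endfor %}.
NAIVE_FOR = re.compile(r"\{% for (\w+) in (\w+) %\}(.*?)\{% endfor %\}", re.S)

template = (
    "{% for sec in sections %}"
    "[{% for p in papers %}x{% endfor %}]"
    "{% endfor %}"
)

match = NAIVE_FOR.search(template)
# The OUTER loop header is paired with the INNER endfor, so the captured body
# is cut short and the real outer endfor is left dangling in the output.
outer_var, outer_body = match.group(1), match.group(3)


def render_flat(row_template: str, rows: list) -> str:
    """Single-level rendering: one str.format call per row, no loop syntax."""
    return "".join(row_template.format(**row) for row in rows)


sections = [
    {"name": "A", "papers": ["p1", "p2"]},
    {"name": "B", "papers": ["p3"]},
]
# Pre-render the inner list in the data layer, then render one flat level.
rows = [
    {"name": s["name"], "papers_md": "\n".join(f"- {p}" for p in s["papers"])}
    for s in sections
]
rendered = render_flat("## {name}\n{papers_md}\n", rows)
```

Here `outer_body` ends up as `[{% for p in papers %}x`, which is the truncated body the warning above describes.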

Validator should exit non-zero on failure and check at minimum:

- Word count above a minimum (catches truncated generation)
- Required sections present
- Sampled links return < 400 status (catches dead references)
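
A validator covering the first two checks might look like the sketch below. The threshold and the section headings are hypothetical placeholders; the link check is left out because it needs network access and belongs behind the same rate gate as the fetch scripts.

```python
import sys
from pathlib import Path

MIN_WORDS = 300                                    # hypothetical threshold
REQUIRED_SECTIONS = ("## Summary", "## Findings")  # hypothetical headings


def validate(text: str, min_words: int = MIN_WORDS,
             sections: tuple = REQUIRED_SECTIONS) -> list:
    """Return a list of validation errors; empty means the artifact passes."""
    errors = []
    words = len(text.split())
    if words < min_words:
        errors.append(f"word count {words} < {min_words}")
    lines = {line.strip() for line in text.splitlines()}
    for heading in sections:
        if heading not in lines:
            errors.append(f"missing section: {heading}")
    return errors


def main(path: str) -> int:
    """Print each failure to stderr and return a process exit code."""
    errors = validate(Path(path).read_text(encoding="utf-8"))
    for err in errors:
        print(f"VALIDATION FAILED: {err}", file=sys.stderr)
    return 1 if errors else 0
```

Wiring it up as `sys.exit(main(str(args.validate)))` gives the non-zero exit the Agent checks for.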

---

## 8. Failure Handling — Concrete Rules

Document these as a table in `SKILL.md`. The Agent will not improvise correct behavior under stress.

| Failure pattern | Detection | Mandated response |
|---|---|---|
| Page returns 0 items after scroll + snapshot | `count == 0` from extract script after two scroll passes | Switch to sandbox fallback. Do NOT rewrite the page script. |
| Selectors broken (rows extract but fields empty) | `items.length > 0 && first.title == ""` | Patch only the `selectors` block. Retry once. If still broken, fall back. |
| HTTP 429 from data source | Tenacity exhausts retries | Pause that source for the run. Continue with others. Surface in coverage stats. |
| HTTP 403 from data source | First call returns 403 | Mark `not_found`. Do NOT retry. Try next source in priority order. |
| HTTP 5xx from data source | Any | Tenacity handles with backoff. After max retries, mark `not_found`. |
| PDF text extraction < 500 chars | After `pypdf` extraction | Try alternative source (HTML rendering, abstract). Mark `analysis_depth: "abstract_only"`. |
| Dependency install fails | `pip install` returns non-zero | Report the package. Ask user before falling back to degraded mode. |
| User interrupts mid-run | Any | `plan_track` must have recorded `processed_ids`. Resume on next session. |
| Output validation fails | `generate_*.py --validate` returns non-zero | Surface validator output. Do NOT silently regenerate. Ask user before retry. |
| MCP connector unavailable | Connector error | Fall back to sandbox file path. Do NOT retry on the same connector. |

Three universal rules above any specific failure:

1. **Maximum 2 retries (3 attempts in total) on any single network call.** Past that, mark the item and move on.
2. **Never fabricate values when a source fails.** Emit `null`. Let the report show the gap honestly.
3. **Never use destructive shortcuts without explicit user confirmation.** No `rm -rf`, no `--no-verify`, no `--force` without an in-run user approval.

---

## 9. Long-Run Discipline on Shared-Context Runtimes

Tabbit runtime has token budgets and automatic context compression. Long tasks that don't manage context lose intermediate work when compression hits. Bake these practices into your skill design.

### 9.1 Persist everything to disk

Every script writes its output to `WORKDIR/<file>.json` before returning. The Agent's in-memory state is never the source of truth. If a context reset happens, the next session reads the JSON and resumes.

### 9.2 Progress log discipline

Print one line per N items to stderr, not one line per item. The Agent surfaces these via its monitor. A skill that emits 200 progress lines triggers 200 conversation events; a skill that emits 8 triggers only 8.

```python
import sys

for i, item in enumerate(items, 1):
    process(item)
    # One progress line per 25 items, plus a final line at the end.
    if i % 25 == 0 or i == len(items):
        print(f"[{i:>4}/{len(items)}] processed", file=sys.stderr)
```

### 9.3 Snapshot hygiene

`take_snapshot` returns a large text blob. Do NOT hold it across multiple Agent turns. Correct pattern:

```
1. take_snapshot         ← for visual verification only, or as a selector reference
2. evaluate_script        ← consumes the live DOM directly, NOT the snapshot text
3. snapshot text can be discarded once the structured JSON is in the sandbox
```

If your skill ever instructs the Agent to "parse the snapshot text," that is a design bug. The snapshot is for human visual verification or for the Agent to reference when patching selectors. It is never the data source. The data source is the live DOM via `evaluate_script`.

### 9.4 Plan track is your savefile

Use `plan_track` to record:

- Total items
- Processed items (by stable ID, not by index)
- Last successful checkpoint timestamp
- Known blockers (per-source 429s, etc.)

Treat it as a savefile. After a context reset, the Agent reads `plan_track` first, then `progress.json`, then resumes.

---

## 10. Pitfalls — Things That Look Smart but Aren't

Real shortcuts capable Agents have taken in real runs that produced bad outcomes. Forbid them explicitly in your `SKILL.md`.

### Pitfall 1 — Parsing snapshot text instead of using `evaluate_script`

**Why it looks smart:** the snapshot is already in context, and Python regex feels faster than another tool call.

**Why it is bad:** the snapshot is a curated text representation. It loses CSS classes, data attributes, and section boundaries. Custom regex parsers introduce subtle bugs — for example, classifying every row under one `<section>` heading as the same type, when the actual DOM had nested type information the snapshot dropped.

**Prevention:** in `SKILL.md`, say literally: "Do not parse `take_snapshot` text. Use `evaluate_script` with the bundled script."

### Pitfall 2 — Following metadata-source PDF links

**Why it looks smart:** Crossref / DOI services give you a direct PDF URL alongside the metadata.

**Why it is bad:** publisher PDF servers frequently return HTTP 403 to cloud IPs. The Crossref-supplied link is designed for citation resolution, not bulk download. Following it leads to a graveyard of 403s.

**Prevention:** explicit priority order in `SKILL.md`: "For each paper, find the arXiv ID via Semantic Scholar's `externalIds.ArXiv`, then download from arXiv. Do NOT download directly from publisher OJS URLs even when Crossref provides them — those return 403 to cloud IPs."

### Pitfall 3 — Treating 429 as fatal and switching sources

**Why it looks smart:** "429 means this source is broken, move on."

**Why it is bad:** 429 from a shared cloud IP is almost always transient. Switching sources compounds the problem by hitting the next service from the same blocked IP. The right answer is usually to sleep 60–90 seconds.

**Prevention:** failure-handling table entry: "429 → wait and retry with backoff. Do not switch source on the first 429; pause the source only after retries are exhausted."

### Pitfall 4 — Re-reading large files in every turn

**Why it looks smart:** "I need to remember what was in `papers.json`."

**Why it is bad:** burns context. The Agent should read the file ONCE, hold a small reference (paper count, ID list), and pass paths to scripts via `--in path/to/papers.json` instead of piping JSON through its own context.

**Prevention:** SKILL.md: "Pass file paths to scripts. Do not pipe JSON content through the Agent's context."

### Pitfall 5 — Mocking dependencies in tests

**Why it looks smart:** mocked tests are fast and deterministic.

**Why it is bad:** for skills that hit live APIs, mocked tests pass while the actual integration fails. The skill author should provide either (a) real integration tests against the actual services, gated behind an env var, or (b) explicit "this code has not been tested against live APIs" notes in `SKILL.md`.

**Prevention:** include a `tests/` directory with at least one live-API smoke test, gated by `RUN_LIVE_TESTS=1`. Document what is covered and what is not.
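
A gated smoke test can be as small as the sketch below, using only the standard library. `SMOKE_URL` is a hypothetical environment variable for this example; point it at one cheap, stable endpoint your skill actually depends on.

```python
import os
import unittest
import urllib.request

LIVE = os.environ.get("RUN_LIVE_TESTS") == "1"
# Hypothetical variable: one cheap, stable endpoint the skill depends on.
SMOKE_URL = os.environ.get("SMOKE_URL", "")


class LiveSmokeTest(unittest.TestCase):
    @unittest.skipUnless(LIVE, "set RUN_LIVE_TESTS=1 to hit live services")
    def test_source_responds(self):
        self.assertTrue(SMOKE_URL, "SMOKE_URL must be set for live runs")
        req = urllib.request.Request(
            SMOKE_URL, headers={"User-Agent": "Tabbit-MySkill/1.0"}
        )
        # A single request is enough to catch blocked IPs and moved endpoints.
        with urllib.request.urlopen(req, timeout=30) as resp:
            self.assertLess(resp.status, 400)
```

Run it with `RUN_LIVE_TESTS=1 python -m unittest` from the `tests/` directory; without the variable the test is reported as skipped rather than silently passing.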

### Pitfall 6 — Hard-coding the sandbox `WORKDIR`

**Why it looks smart:** the sandbox has a known mount path.

**Why it is bad:** local testing (which you should do before publishing) cannot override the path. CI cannot override the path. A future runtime change breaks every skill that hard-coded the path.

**Prevention:** all paths via `os.environ.get("WORKDIR", "/work/my-skill")`. Document the env var in `SKILL.md`.

### Pitfall 7 — "Optimizing" by skipping bundled scripts

**Why it looks smart:** "this bundled script does X+Y+Z, I only need X."

**Why it is bad:** the bundled script is tested. Cherry-picking parts means re-implementing untested logic at runtime under stress.

**Prevention:** if the bundled script genuinely does more than needed, expose its sub-functions as separate entry points (`--mode=quick`, `--no-cluster`, `--stage=fetch`). The Agent should never have to rewrite logic at runtime.
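
Separate entry points can be exposed with ordinary `argparse` flags. The sketch below assumes a four-stage pipeline; the stage names are hypothetical, and `plan` only computes which stages would run so the selection logic stays testable on its own.

```python
import argparse

STAGES = ("fetch", "process", "cluster", "generate")  # hypothetical stage names


def build_parser() -> argparse.ArgumentParser:
    ap = argparse.ArgumentParser()
    ap.add_argument("--stage", choices=STAGES, action="append",
                    help="run only this stage; repeat the flag to run several")
    ap.add_argument("--no-cluster", action="store_true",
                    help="skip the clustering stage")
    return ap


def plan(argv: list) -> list:
    """Return the ordered stages a given command line would run."""
    args = build_parser().parse_args(argv)
    wanted = args.stage or list(STAGES)
    # Always run in pipeline order, whatever order the flags were given in.
    return [
        s for s in STAGES
        if s in wanted and not (s == "cluster" and args.no_cluster)
    ]
```

With this in place the Agent selects behavior through arguments, which is the only kind of change it is supposed to make.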

### Pitfall 8 — Letting the Agent write JavaScript at runtime

**Why it looks smart:** "the bundled extractor failed, let me just write a quick one."

**Why it is bad:** runtime JS authoring fails on selectors, async loading, framework wrappers, and CSP. The bundled script exists because runtime authoring is unreliable. If the bundled script fails, the right answer is to fall back to sandbox-based discovery or surface the failure to the user — not to rewrite the script.

**Prevention:** SKILL.md failure-handling table: "If `evaluate_script` returns empty after one selector patch, go to sandbox fallback. Do NOT write a new page script at runtime."

### Pitfall 9 — Installing a stdio MCP server at runtime

**Why it looks smart:** "the MCP I want exists as a `npx @some/mcp-server` package — let me just spawn it in the sandbox and call it via the MCP SDK."

**Why it is bad:** local-process / stdio MCP servers are NOT supported by the platform. A sandbox-spawned stdio MCP server:
- does NOT become an injected `mcp__<connector>__<tool>` tool the Agent can call cleanly
- does NOT flow through the platform's encrypted-credential / OAuth path (so every secret you wire up is plaintext in the sandbox)
- breaks as soon as the sandbox snapshot rotates (the spawned process is gone)
- silently disagrees with the user's expectation that "MCP" means platform-managed auth

The only officially supported MCP transport is HTTP/remote MCP, which the platform initiates from the cloud and which the user authorizes once through the normal connector flow.

**Prevention:** In SKILL.md and the skill's Capability Routing, treat MCP as "HTTP/remote connector the user has pre-authorized" exclusively. If the only known MCP for a service is stdio, write the SKILL.md to:
1. Ask the user whether an HTTP/remote MCP equivalent exists for the same service.
2. If no, ask whether the user accepts a plain sandbox script (`e2b_bash` calling the underlying API directly with an env-provided key) as a non-MCP fallback.
3. Never silently bake `npx <some-mcp-server>` or `python -m <mcp-server>` into the skill's setup steps.

---

## 11. Pre-Publish Validation Checklist

Before handing the skill to Tabbit, verify every item:

### Documentation
- [ ] `SKILL.md` has all 7 required sections
- [ ] Every "do X" step has a "do NOT do Y" if a plausible shortcut exists
- [ ] Failure handling table covers 429, 403, 5xx, empty extraction, dep install failure, PDF text extraction failure, user interrupt, validator failure, MCP unavailable
- [ ] `Files in this skill` table lists every file in the bundle
- [ ] `Output contract` lists every produced file and its required fields

### Scripts
- [ ] Browser scripts have a single `selectors` block near the top
- [ ] Browser scripts return JSON (never HTML)
- [ ] Browser scripts emit a `warnings[]` array
- [ ] Sandbox scripts read paths from `--in`/`--out` and `WORKDIR` env var
- [ ] Network calls have per-host rate gate and tenacity retries with cap
- [ ] No hard-coded absolute paths anywhere
- [ ] No tokens, cookies, or API keys committed
- [ ] `requirements.txt` is pinned (no unbounded `>=`)

### Behavior
- [ ] Skill runs end-to-end against a real target page locally (use `node`/`python` directly)
- [ ] Skill handles "source X is unavailable" without crashing
- [ ] Skill produces all promised output files
- [ ] Validator script (if present) exits non-zero on a deliberately broken output
- [ ] Skill resumes correctly from a mid-run interruption

### Resilience
- [ ] Progress is written to `WORKDIR/progress.json` between batches
- [ ] If interrupted and re-run, the skill resumes from `progress.json`
- [ ] No script holds >50 KB of state in memory across batches
- [ ] No script prints >1 line per item by default (use `--verbose` for per-item logs)

### Tabbit-conventions
- [ ] No reference to "Claude," "Codex," "ChatGPT," "the IDE assistant" in user-facing output
- [ ] No reference to internal runtime paths (the Agent uses `load_skill`-returned paths)
- [ ] Skill metadata (`name`, `description`, `version`) follows your team's naming conventions
- [ ] `CHANGELOG.md` is present and current
- [ ] Sandbox tools are referenced as `e2b_*` (not the older `w2b_*` naming)
- [ ] If MCP is required, it is HTTP/remote — no `npx`/`uvx`/`python -m` setup commands in `SKILL.md` or scripts
- [ ] No attempt to spawn a stdio MCP server from `e2b_bash` and treat it as a real `mcp__...` tool
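
Some checklist items are mechanical enough to script. As one example, the sketch below flags requirement lines that are not pinned to an exact version. It assumes the strictest reading of "pinned" (`==` only) and does not understand environment markers or `-r` includes, so treat it as a starting point.

```python
import re

# name, optional [extras], then an exact == version and nothing else
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?==[^=\s]+$")


def unpinned(requirements_text: str) -> list:
    """Return requirement lines that are not pinned with an exact == version."""
    bad = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line and not PINNED.match(line):
            bad.append(line)
    return bad
```

An empty result for `scripts/sandbox/requirements.txt` satisfies the pinning item; anything returned is a line to fix before publishing.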

---

## 12. Prompt Templates for IDE Agents

### 12.1 Initial skill generation

Paste this into Claude Code, Codex, ChatGPT, Cursor, or any equivalent:

```
Build a Tabbit Agent V2 skill named "<NAME>" that does <ONE-LINE GOAL>.

The skill will be loaded into a Tabbit Agent at runtime via load_skill and
executed against the Tabbit runtime, which exposes these tool categories:
- Browser GUI: navigate_page, take_snapshot, take_screenshot, click,
  scroll, wait, type_text, select_option, press_key, move_mouse,
  drag_and_release, get_pdf_data
- Page script: evaluate_script (runs bundled JS in the user's browser)
- Sandbox: e2b_bash, e2b_read, e2b_write, e2b_edit, e2b_glob, e2b_grep
  (cloud Linux container)
- Public web: web_fetch, web_search
- MCP connectors: mcp__<connector>__<tool> (only those the user configured;
  MUST be HTTP/remote MCP — local stdio MCP / npx-spawned MCP is not supported)

Routing rules (apply in order to every data-acquisition step):
1. Login required? → Browser + page script.
2. Automation-sensitive platform? → Browser + page script, low frequency.
3. Public + plain HTTP works? → web_fetch the page.
4. Public + JS-rendered + page loads a data/*.json or /api/* file?
   → web_fetch THAT URL directly.
5. Public + JS-only with no JSON source? → Browser + page script.

Authoring constraints:
- Bundle every script. The runtime Agent must NEVER write complex
  JavaScript or Python at runtime. It patches selectors and arguments only.
- Page scripts have a single `selectors` object near the top as the only
  Agent-patchable section. They return JSON and include a warnings array.
  Single entry point: `async function run(args)`.
- Sandbox scripts accept --in/--out paths, read WORKDIR from env var,
  per-host rate gate at 1 req/sec, tenacity retries with cap of 3.
- For every "do X" step in SKILL.md, write the "do NOT do Y" if Y is a
  plausible shortcut. Especially: "do NOT parse snapshot text — use
  evaluate_script."
- Failure handling table must cover: HTTP 429 (wait, do not switch source),
  HTTP 403 (do not retry, fall back), empty extraction (sandbox fallback),
  validator failure (surface, do not silently regenerate), MCP unavailable
  (fall back to sandbox file), stdio-only MCP requested (ask for HTTP/remote
  equivalent OR offer non-MCP E2B script fallback — never silently install).
- All paths from $WORKDIR with a sane default. No absolute hard-coded paths.
- No tokens, cookies, or API keys anywhere.
- Persist intermediate state to WORKDIR/progress.json after each batch.

Produce:
- SKILL.md with the 7 standard sections: When to use, Runtime requirements,
  Capability routing, Workflow, Files in this skill, Output contract,
  Failure handling.
- scripts/browser/extract_*.js using the selectors-block pattern.
- scripts/sandbox/fetch_*.py, process_*.py, generate_*.py using argparse
  and WORKDIR.
- scripts/sandbox/requirements.txt pinned.
- templates/*.md (avoid nested {% for %} blocks unless you use a real
  template library — naive non-greedy regex engines break on nesting).
- references/*.md (taxonomies, rubrics, decision rules).
- examples/*.json showing input/output schema.
- CHANGELOG.md with initial version 0.1.0.

Do NOT include fallback code paths that fabricate output when a source
fails. Surface nulls and let the report show gaps honestly.
```

### 12.2 Skill review

Paste alongside an existing skill directory:

```
Review the Tabbit skill in this directory against the Tabbit Skill Creator
Playbook. Check:

1. SKILL.md has all 7 sections, and the "do not do Y" pattern is present
   wherever a plausible shortcut exists.
2. Every bundled JS page script has a single selectors block, returns JSON,
   emits a warnings array, and uses the `async function run(args)` signature.
3. Every sandbox script reads WORKDIR from env, uses --in/--out paths, has
   per-host rate gate, and bounded retries (max 3 attempts).
4. Failure handling table includes 429 (wait), 403 (do not retry, fall
   back), empty extraction (fallback), validator failure (surface).
5. No hard-coded absolute paths. No tokens or credentials.
6. Template engine, if custom, handles nested loops correctly OR templates
   avoid nesting.
7. Progress is persisted to WORKDIR/progress.json between batches.
8. No script holds large state across batches in memory.

Output a markdown report with:
- Pass/fail for each check, with the file/line that fails.
- Suggested patches as diffs.
- Anything that looks correct but might surprise a runtime Agent under
  stress (token budgets, context compression, shared IP rate limits).
```
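For reference while reviewing checks 3 and 4, a sandbox script that satisfies them could be shaped roughly as follows. This is a sketch under stated assumptions: `do_request` is a hypothetical callable returning `(status, body)`, the one-second spacing and the exponential wait are placeholder values, and retrying on 5xx is a choice of this sketch that the checklist does not specify.

```python
import time
from urllib.parse import urlparse

MAX_ATTEMPTS = 3            # bounded retries, matching check 3
MIN_INTERVAL_SECONDS = 1.0  # assumed per-host spacing; tune per source

_last_call: dict[str, float] = {}


def wait_for_host(url: str, sleep=time.sleep, clock=time.monotonic) -> None:
    """Sleep until MIN_INTERVAL_SECONDS have passed since the last call to this host."""
    host = urlparse(url).netloc
    last = _last_call.get(host)
    if last is not None:
        remaining = MIN_INTERVAL_SECONDS - (clock() - last)
        if remaining > 0:
            sleep(remaining)
    _last_call[host] = clock()


def fetch_with_policy(url: str, do_request, sleep=time.sleep):
    """Call do_request(url) under the retry policy.

    Returns (status, body). A body of None tells the caller to fall back
    or surface the gap; nothing is fabricated here.
    """
    status = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        wait_for_host(url, sleep=sleep)
        status, body = do_request(url)
        if status == 200:
            return status, body
        if status == 429 or status >= 500:
            if attempt < MAX_ATTEMPTS:
                sleep(2 ** attempt)  # wait before the next attempt
            continue
        return status, None  # 403 and other client errors: do not retry
    return status, None
```

Injecting `sleep` and `do_request` keeps the policy testable locally without network access or real waiting.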

### 12.3 Skill local-test runner

```bash
# Use a local WORKDIR, never the production sandbox path.
# Replace <thing> with your skill's script name before running.
export WORKDIR="$(pwd)/test-run"
mkdir -p "$WORKDIR"

# Stage 1: discover / fetch
python scripts/sandbox/fetch_<thing>.py --discover --year 2026 \
    --out "$WORKDIR/items.json"

# Quick sanity check
python -c "import json; d=json.load(open('$WORKDIR/items.json')); \
    print(f'discovered {len(d[\"items\"])} items')"

# Stage 2: resolve / enrich
python scripts/sandbox/fetch_<thing>.py --resolve "$WORKDIR/items.json" \
    --out "$WORKDIR/resolved.json"

# Stage 3: process
python scripts/sandbox/process_<thing>.py --in "$WORKDIR/resolved.json" \
    --out "$WORKDIR/processed.json"

# Stage 4: generate + validate
python scripts/sandbox/generate_<thing>.py --in "$WORKDIR/processed.json" \
    --template templates/main.md --out "$WORKDIR/output.md"
python scripts/sandbox/generate_<thing>.py --validate "$WORKDIR/output.md"

A skill that will not run locally will not run reliably in Tabbit. Local testing catches `WORKDIR` bugs, missing dependencies, broken templates, and validator regressions before they hit a real user.

---

## 13. Common Skill Archetypes

### 13.1 Public-data harvester

Academic conferences, government datasets, public registries, open APIs.

- Stage 1: `web_fetch` raw page HTML; grep `<script src>` for referenced JS files.
- Stage 2: `web_fetch` each JS file; grep for `fetch("data/...json")` patterns.
- Stage 3: If JSON data files found → `web_fetch` them directly → done.
- Stage 4: Only if no JSON source → browser + `evaluate_script`.
- Stage 5: Sandbox script enriches via supplementary sources (Semantic Scholar (S2), arXiv, OpenReview) by title match.
- Stage 6: Sandbox processes, generates report, validates.
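Stage 2's grep for `fetch("data/...json")` patterns can be approximated locally with a small regex pass over the downloaded JS. The sketch below is an assumption about how such a helper might look; the function name is hypothetical, and it only catches string-literal URLs, so template literals or URLs assembled at runtime would still need manual inspection.

```python
import re
from urllib.parse import urljoin

# Matches fetch("...json") or fetch('...json'), with an optional query string.
FETCH_JSON = re.compile(r"""fetch\(\s*["']([^"']+\.json(?:\?[^"']*)?)["']""")


def find_json_sources(js_source: str, base_url: str) -> list[str]:
    """Return absolute URLs of JSON files named in fetch() calls.

    Order of first appearance is preserved and duplicates are dropped.
    """
    found: list[str] = []
    for match in FETCH_JSON.finditer(js_source):
        url = urljoin(base_url, match.group(1))  # resolve relative to the JS file
        if url not in found:
            found.append(url)
    return found
```

An empty result is the signal to move on to Stage 4 (browser + `evaluate_script`), not a reason to guess at data URLs.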

### 13.2 Logged-in platform extractor

Social platforms, e-commerce backends, enterprise SaaS, content platforms.

- Stage 1: `take_snapshot` to verify login state.
- Stage 2: `navigate_page` to target list / search / feed.
- Stage 3: `scroll` to load lazy content; re-snapshot only when needed.
- Stage 4: `evaluate_script` with bundled JS to extract structured items.
- Stage 5: Optional `click` into each item + repeat extraction for detail pages.
- Stage 6: Sandbox script deduplicates, processes, generates artifact.
- Stage 7: Optional MCP delivery (Drive / Notion / Linear).
- Throughout: low frequency, real-user-path traversal, stop on CAPTCHA.
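Because scroll-loaded feeds tend to return overlapping items across extraction passes, the deduplication in Stage 6 could be as simple as the sketch below. The `url` key and the function name are assumptions for illustration; pick whatever stable identifier the platform actually exposes.

```python
def dedupe_items(items: list[dict], key: str = "url") -> tuple[list[dict], list[str]]:
    """Keep the first occurrence of each key value, preserving order.

    Items missing the key are kept and reported in warnings rather than
    silently dropped, so gaps stay visible downstream.
    """
    seen = set()
    unique: list[dict] = []
    warnings: list[str] = []
    for index, item in enumerate(items):
        value = item.get(key)
        if value is None:
            warnings.append(f"item {index} has no '{key}'; kept without deduplication")
            unique.append(item)
            continue
        if value in seen:
            continue
        seen.add(value)
        unique.append(item)
    return unique, warnings
```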

### 13.3 File generator

CSV / PDF / PPT / DOCX / HTML reports from sandbox data.

- Stage 1: Sandbox script reads input data.
- Stage 2: Sandbox script generates artifact via templates + libraries (openpyxl, python-pptx, python-docx, reportlab, weasyprint).
- Stage 3: Sandbox script validates artifact (opens it, checks key sections, verifies size > minimum).
- Stage 4: Write artifact to a user-authorized local folder if requested; otherwise surface the sandbox path.
- Stage 5: Optionally open the artifact via `navigate_page` for visual confirmation.
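The validation in Stage 3 might be sketched as follows for text-based artifacts (Markdown, HTML, CSV). The 200-byte minimum and the function name are placeholder assumptions; binary formats such as PDF or DOCX would need to be opened with their own library instead of read as text.

```python
from pathlib import Path


def validate_artifact(path, required_sections, min_bytes: int = 200) -> list[str]:
    """Return a list of problems; an empty list means the artifact passed."""
    artifact = Path(path)
    if not artifact.is_file():
        return [f"missing artifact: {artifact}"]
    problems: list[str] = []
    size = artifact.stat().st_size
    if size < min_bytes:
        problems.append(f"artifact is {size} bytes, below minimum {min_bytes}")
    text = artifact.read_text(encoding="utf-8", errors="replace")
    for heading in required_sections:
        if heading not in text:
            problems.append(f"missing section: {heading}")
    return problems
```

Returning the problems instead of raising lets the generate script surface every failed check at once, which fits the rule that validator failures are surfaced and not silently regenerated.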

### 13.4 MCP-driven workflow

GitHub, Linear, Notion, Drive, databases.

- Stage 0: **The MCP your skill depends on must be HTTP/remote.** If the only known MCP for the target service is local stdio (started via `npx`, `uvx`, `python -m`, etc.), do NOT bake the install into your skill. Either find an HTTP/remote equivalent or fall back to a plain sandbox script that calls the underlying API directly with an env-provided key.
- Stage 1: Probe MCP availability via the runtime. Degrade gracefully with an explicit user message if missing — never silently fail.
- Stage 2: Use MCP for primary read operations (list issues, list PRs, read documents).
- Stage 3: Sandbox script processes MCP output into analysis or report.
- Stage 4: Use MCP for write-back if applicable. **Require explicit user confirmation for write operations**, especially destructive ones.
- Throughout: never include tokens or credentials. Trust the platform's MCP layer for auth. Do NOT attempt to spawn a stdio MCP server from `e2b_bash` and call it via the MCP SDK — the result is not a real `mcp__...` tool and its credentials are plaintext in the sandbox.
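For the Stage 0 fallback (a plain sandbox script with an env-provided key), the credential handling could look like the sketch below. Everything named here is an assumption: `SERVICE_API_KEY` is a hypothetical variable name, and the `Bearer` scheme is only one common convention, so check the target API's own documentation before relying on it.

```python
import os


class MissingCredentialError(RuntimeError):
    """Raised when the expected environment variable is absent or empty."""


def auth_headers(env_var: str = "SERVICE_API_KEY", environ=os.environ) -> dict:
    """Build request headers from an env-provided key; never from a literal."""
    key = environ.get(env_var, "").strip()
    if not key:
        # Fail loudly so the Agent can tell the user what to configure.
        raise MissingCredentialError(
            f"{env_var} is not set; ask the user to configure it in the environment"
        )
    return {"Authorization": f"Bearer {key}"}
```

The key is read at call time and never written to disk or logged, which keeps the skill bundle itself free of credentials.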

---

## 14. Versioning and Maintenance

Use `vMAJOR.MINOR.PATCH` in `SKILL.md` frontmatter.

- Bump PATCH for selector fixes, template wording, small bug fixes.
- Bump MINOR for new workflow steps, new failure modes handled, new output fields.
- Bump MAJOR for breaking output-schema changes (downstream consumers care).
- Keep `CHANGELOG.md` adjacent to `SKILL.md`.
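The bump rules above follow the usual semantic-versioning resets, which a release helper might encode like this. The helper is a hypothetical sketch, not part of Tabbit; it assumes the `vMAJOR.MINOR.PATCH` form with the leading `v`.

```python
import re

VERSION = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def bump(version: str, part: str) -> str:
    """Return the next version; lower-order fields reset to zero on a bump."""
    match = VERSION.match(version)
    if not match:
        raise ValueError(f"expected vMAJOR.MINOR.PATCH, got {version!r}")
    major, minor, patch = (int(group) for group in match.groups())
    if part == "major":   # breaking output-schema change
        return f"v{major + 1}.0.0"
    if part == "minor":   # new workflow step, failure mode, or output field
        return f"v{major}.{minor + 1}.0"
    if part == "patch":   # selector fix, wording, small bug fix
        return f"v{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")
```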

When a target site changes:

1. Run the local test (§12.3) — see what breaks.
2. Patch the `selectors` block of the affected page script (and only that block).
3. Add a regression test case to `tests/`.
4. Bump PATCH, update `CHANGELOG.md`.

When a new failure mode surfaces in production:

1. Add it to the failure-handling table in `SKILL.md`.
2. Bump MINOR.
3. Note in `CHANGELOG.md` which runtime constraint or Agent confusion motivated the change.

---

## 15. The Authoring Principle

A skill is a runbook for an Agent. The reader of that runbook is another AI — running in a different environment, under time pressure, with imperfect memory, against rate-limited services. Make decisions for that Agent wherever you can. Where you cannot, make the cost of getting it wrong visible: explicit anti-patterns, explicit failure modes, explicit retry caps, explicit fallback paths.

The five layers do five different things well:

- **Browser GUI** does browser things — logged-in state, visual verification, real-user paths on sensitive platforms.
- **Page script** does in-page extraction — bundled JavaScript against live DOM, returning structured JSON.
- **Sandbox** does scripting and files — processing, transforming, generating CSVs and PDFs and reports.
- **Public web** does public-data things — fetching open URLs, searching open sources, supplementing with public references.
- **MCP** does proper-API things — structured operations on third-party SaaS the user has configured. Only HTTP/remote MCP counts; local stdio MCP is out of scope.

A skill encodes how to combine these into a reliable, verifiable workflow for one class of problem. Write each skill with the respect a senior engineer gives a runbook that someone else will be on call to execute.
