` heading as the same type, when the actual DOM had nested type information the snapshot dropped. **Prevention:** in `SKILL.md`, say literally: "Do not parse `take_snapshot` text. Use `evaluate_script` with the bundled script." ### Pitfall 2 — Following metadata-source PDF links **Why it looks smart:** Crossref / DOI services give you a direct PDF URL alongside the metadata. **Why it is bad:** publisher PDF servers frequently return HTTP 403 to cloud IPs. The Crossref-supplied link is designed for citation resolution, not bulk download. Following it leads to a graveyard of 403s. **Prevention:** explicit priority order in `SKILL.md`: "For each paper, find the arXiv ID via Semantic Scholar's `externalIds.ArXiv`, then download from arXiv. Do NOT download directly from publisher OJS URLs even when Crossref provides them — those return 403 to cloud IPs." ### Pitfall 3 — Treating 429 as fatal and switching sources **Why it looks smart:** "429 means this source is broken, move on." **Why it is bad:** 429 from a shared cloud IP is almost always transient. Switching sources compounds the problem by hitting the next service from the same blocked IP. The right answer is usually to sleep 60–90 seconds. **Prevention:** failure-handling table entry: "429 → wait, do not switch source." ### Pitfall 4 — Re-reading large files in every turn **Why it looks smart:** "I need to remember what was in `papers.json`." **Why it is bad:** burns context. The Agent should read the file ONCE, hold a small reference (paper count, ID list), and pass paths to scripts via `--in path/to/papers.json` instead of piping JSON through its own context. **Prevention:** SKILL.md: "Pass file paths to scripts. Do not pipe JSON content through the Agent's context." ### Pitfall 5 — Mocking dependencies in tests **Why it looks smart:** mocked tests are fast and deterministic. **Why it is bad:** for skills that hit live APIs, mocked tests pass while the actual integration fails. The skill author should provide either (a) real integration tests against the actual services, gated behind an env var, or (b) explicit "this code has not been tested against live APIs" notes in `SKILL.md`. **Prevention:** include a `tests/` directory with at least one live-API smoke test, gated by `RUN_LIVE_TESTS=1`. Document what is covered and what is not. ### Pitfall 6 — Hard-coding the sandbox `WORKDIR` **Why it looks smart:** the sandbox has a known mount path. **Why it is bad:** local testing (which you should do before publishing) cannot override the path. CI cannot override the path. A future runtime change breaks every skill that hard-coded the path. **Prevention:** all paths via `os.environ.get("WORKDIR", "/work/my-skill")`. Document the env var in `SKILL.md`. ### Pitfall 7 — "Optimizing" by skipping bundled scripts **Why it looks smart:** "this bundled script does X+Y+Z, I only need X." **Why it is bad:** the bundled script is tested. Cherry-picking parts means re-implementing untested logic at runtime under stress. **Prevention:** if the bundled script genuinely does more than needed, expose its sub-functions as separate entry points (`--mode=quick`, `--no-cluster`, `--stage=fetch`). The Agent should never have to rewrite logic at runtime. ### Pitfall 8 — Letting the Agent write JavaScript at runtime **Why it looks smart:** "the bundled extractor failed, let me just write a quick one." **Why it is bad:** runtime JS authoring fails on selectors, async loading, framework wrappers, and CSP. The bundled script exists because runtime authoring is unreliable. If the bundled script fails, the right answer is to fall back to sandbox-based discovery or surface the failure to the user — not to rewrite the script. **Prevention:** SKILL.md failure-handling table: "If `evaluate_script` returns empty after one selector patch, go to sandbox fallback. Do NOT write a new page script at runtime." ### Pitfall 9 — Installing a stdio MCP server at runtime **Why it looks smart:** "the MCP I want exists as a `npx @some/mcp-server` package — let me just spawn it in the sandbox and call it via the MCP SDK." **Why it is bad:** local-process / stdio MCP servers are NOT supported by the platform. A sandbox-spawned stdio MCP server: - does NOT become an injected `mcp____` tool the Agent can call cleanly - does NOT flow through the platform's encrypted-credential / OAuth path (so every secret you wire up is plaintext in the sandbox) - breaks as soon as the sandbox snapshot rotates (the spawned process is gone) - silently disagrees with the user's expectation that "MCP" means platform-managed auth The only officially supported MCP transport is HTTP/remote MCP, which the platform initiates from the cloud and which the user authorizes once through the normal connector flow. **Prevention:** In SKILL.md and the skill's Capability Routing, treat MCP as "HTTP/remote connector the user has pre-authorized" exclusively. If the only known MCP for a service is stdio, write the SKILL.md to: 1. Ask the user whether an HTTP/remote MCP equivalent exists for the same service. 2. If no, ask whether the user accepts a plain sandbox script (`e2b_bash` calling the underlying API directly with an env-provided key) as a non-MCP fallback. 3. Never silently bake `npx ` or `python -m ` into the skill's setup steps. --- ## 11. Pre-Publish Validation Checklist Before handing the skill to Tabbit, verify every item: ### Documentation - [ ] `SKILL.md` has all 7 required sections - [ ] Every "do X" step has a "do NOT do Y" if a plausible shortcut exists - [ ] Failure handling table covers 429, 403, empty extraction, dep failure, user interrupt, validator failure, MCP unavailable - [ ] `Files in this skill` table lists every file in the bundle - [ ] `Output contract` lists every produced file and its required fields ### Scripts - [ ] Browser scripts have a single `selectors` block near the top - [ ] Browser scripts return JSON (never HTML) - [ ] Browser scripts emit a `warnings[]` array - [ ] Sandbox scripts read paths from `--in`/`--out` and `WORKDIR` env var - [ ] Network calls have per-host rate gate and tenacity retries with cap - [ ] No hard-coded absolute paths anywhere - [ ] No tokens, cookies, or API keys committed - [ ] `requirements.txt` is pinned (no unbounded `>=`) ### Behavior - [ ] Skill runs end-to-end against a real target page locally (use `node`/`python` directly) - [ ] Skill handles "source X is unavailable" without crashing - [ ] Skill produces all promised output files - [ ] Validator script (if present) exits non-zero on a deliberately broken output - [ ] Skill resumes correctly from a mid-run interruption ### Resilience - [ ] Progress is written to `WORKDIR/progress.json` between batches - [ ] If interrupted and re-run, the skill resumes from `progress.json` - [ ] No script holds >50 KB of state in memory across batches - [ ] No script prints >1 line per item by default (use `--verbose` for per-item logs) ### Tabbit-conventions - [ ] No reference to "Claude," "Codex," "ChatGPT," "the IDE assistant" in user-facing output - [ ] No reference to internal runtime paths (the Agent uses `load_skill`-returned paths) - [ ] Skill metadata (`name`, `description`, `version`) follows your team's naming conventions - [ ] `CHANGELOG.md` is present and current - [ ] Sandbox tools are referenced as `e2b_*` (not the older `w2b_*` naming) - [ ] If MCP is required, it is HTTP/remote — no `npx`/`uvx`/`python -m` setup commands in `SKILL.md` or scripts - [ ] No attempt to spawn a stdio MCP server from `e2b_bash` and treat it as a real `mcp__...` tool --- ## 12. Prompt Templates for IDE Agents ### 12.1 Initial skill generation Paste this into Claude Code, Codex, ChatGPT, Cursor, or any equivalent: ``` Build a Tabbit Agent V2 skill named "" that does . The skill will be loaded into a Tabbit Agent at runtime via load_skill and executed against the Tabbit runtime, which exposes these tool categories: - Browser GUI: navigate_page, take_snapshot, take_screenshot, click, scroll, wait, type_text, select_option, press_key, get_pdf_data - Page script: evaluate_script (runs bundled JS in the user's browser) - Sandbox: e2b_bash, e2b_read, e2b_write, e2b_edit (cloud Linux container) - Public web: web_fetch, web_search - MCP connectors: mcp____ (only those the user configured; MUST be HTTP/remote MCP — local stdio MCP / npx-spawned MCP is not supported) Routing rules (apply in order to every data-acquisition step): 1. Login required? → Browser + page script. 2. Automation-sensitive platform? → Browser + page script, low frequency. 3. Public + plain HTTP works? → web_fetch the page. 4. Public + JS-rendered + page loads a data/*.json or /api/* file? → web_fetch THAT URL directly. 5. Public + JS-only with no JSON source? → Browser + page script. Authoring constraints: - Bundle every script. The runtime Agent must NEVER write complex JavaScript or Python at runtime. It patches selectors and arguments only. - Page scripts have a single `selectors` object near the top as the only Agent-patchable section. They return JSON and include a warnings array. Single entry point: `async function run(args)`. - Sandbox scripts accept --in/--out paths, read WORKDIR from env var, per-host rate gate at 1 req/sec, tenacity retries with cap of 3. - For every "do X" step in SKILL.md, write the "do NOT do Y" if Y is a plausible shortcut. Especially: "do NOT parse snapshot text — use evaluate_script." - Failure handling table must cover: HTTP 429 (wait, do not switch source), HTTP 403 (do not retry, fall back), empty extraction (sandbox fallback), validator failure (surface, do not silently regenerate), MCP unavailable (fall back to sandbox file), stdio-only MCP requested (ask for HTTP/remote equivalent OR offer non-MCP E2B script fallback — never silently install). - All paths from $WORKDIR with a sane default. No absolute hard-coded paths. - No tokens, cookies, or API keys anywhere. - Persist intermediate state to WORKDIR/progress.json after each batch. Produce: - SKILL.md with the 7 standard sections: When to use, Runtime requirements, Capability routing, Workflow, Files in this skill, Output contract, Failure handling. - scripts/browser/extract_*.js using the selectors-block pattern. - scripts/sandbox/fetch_*.py, process_*.py, generate_*.py using argparse and WORKDIR. - scripts/sandbox/requirements.txt pinned. - templates/*.md (avoid nested {% for %} blocks unless you use a real template library — naive non-greedy regex engines break on nesting). - references/*.md (taxonomies, rubrics, decision rules). - examples/*.json showing input/output schema. - CHANGELOG.md with initial version 0.1.0. Do NOT include fallback code paths that fabricate output when a source fails. Surface nulls and let the report show gaps honestly. ``` ### 12.2 Skill review Paste alongside an existing skill directory: ``` Review the Tabbit skill in this directory against the Tabbit Skill Creator Playbook. Check: 1. SKILL.md has all 7 sections, and the "do not do Y" pattern is present wherever a plausible shortcut exists. 2. Every bundled JS page script has a single selectors block, returns JSON, emits a warnings array, and uses the `async function run(args)` signature. 3. Every sandbox script reads WORKDIR from env, uses --in/--out paths, has per-host rate gate, and bounded retries (max 3 attempts). 4. Failure handling table includes 429 (wait), 403 (do not retry, fall back), empty extraction (fallback), validator failure (surface). 5. No hard-coded absolute paths. No tokens or credentials. 6. Template engine, if custom, handles nested loops correctly OR templates avoid nesting. 7. Progress is persisted to WORKDIR/progress.json between batches. 8. No script holds large state across batches in memory. Output a markdown report with: - Pass/fail for each check, with the file/line that fails. - Suggested patches as diffs. - Anything that looks correct but might surprise a runtime Agent under stress (token budgets, context compression, shared IP rate limits). ``` ### 12.3 Skill local-test runner ```bash # Use a local WORKDIR — never the production sandbox path export WORKDIR=$(pwd)/test-run mkdir -p $WORKDIR # Stage 1: discover / fetch python scripts/sandbox/fetch_.py --discover --year 2026 \ --out $WORKDIR/items.json # Quick sanity check python -c "import json; d=json.load(open('$WORKDIR/items.json')); \ print(f'discovered {len(d[\"items\"])} items')" # Stage 2: resolve / enrich python scripts/sandbox/fetch_.py --resolve $WORKDIR/items.json \ --out $WORKDIR/resolved.json # Stage 3: process python scripts/sandbox/process_.py --in $WORKDIR/resolved.json \ --out $WORKDIR/processed.json # Stage 4: generate + validate python scripts/sandbox/generate_.py --in $WORKDIR/processed.json \ --template templates/main.md --out $WORKDIR/output.md python scripts/sandbox/generate_.py --validate $WORKDIR/output.md ``` A skill that will not run locally will not run reliably in Tabbit. Local testing catches `WORKDIR` bugs, missing dependencies, broken templates, and validator regressions before they hit a real user. --- ## 13. Common Skill Archetypes ### 13.1 Public-data harvester Academic conferences, government datasets, public registries, open APIs. - Stage 1: `web_fetch` raw page HTML; grep `