I save 13,300 tokens every day on Claude Code

I run the Developer Relations team at Postman, and most of what my team does every day now runs through Claude Code skills. Writing blog posts. Staging content to WordPress. Auditing competitor sentiment on Reddit. Syncing meetup calendars, generating newsletters, tracking event registrations, hunting conference CFPs, managing the whole editorial pipeline end to end. Nineteen skills, bundled into one DevRel plugin.

That plugin had been running for months. The skills fired when they should. The output was good. And in all that time, not once did anyone check what it cost in tokens.

When I finally sat down and benchmarked it, the numbers were worse than I expected. None of the 19 skills declared allowed-tools. Not one. The worst single skill was burning 11,400 tokens every time it ran. Fixing everything cut 13,300 tokens off a full run of the plugin. That's the whole story below: how I measured it, every finding, the complete before-and-after table, and what the savings are actually worth once you convert them to dollars (spoiler: less than you'd hope, and it still matters).

If you want the prescriptive version of these lessons, I wrote up the rules I'd hand anyone building a skill in a separate post. This one is the case study those rules came from, run across a whole plugin instead of a single skill.

Where a Claude Code plugin actually spends tokens

Before the results, it helps to know where the money goes, because the three cost buckets are not priced the same.

The most expensive token in your entire plugin is the always-on one. Every skill's YAML description field gets loaded into the system prompt of every Claude Code session, whether or not the user ever invokes that skill. One bloated description is a tax every user pays, every session, forever. It's the most valuable real estate you own, and most people fill it with junk.

Next is the per-trigger cost. When Claude decides a skill is relevant, the entire SKILL.md body loads into context. A skill with a 19 KB body costs roughly 4,760 tokens every time it fires, even if the workflow only needed 20% of what's in there.

Third is runtime cost: tool output, polling loops, and the model narrating its way through a workflow while it works.

There's a fourth cost that lives outside the plugin, and it's the MCP server's tool schemas. The Postman MCP Server's full mode exposes over 100 tools. On clients that eagerly load schemas, that's 40,000 to 70,000 tokens spent before the user has typed a single word. Recent Claude Code versions defer MCP tool loading, which mostly neutralizes this, but it still shaped a few decisions in my optimization pass.

The benchmark: scoring skills across 6 dimensions

I ran the audit with skillit, a Claude Code plugin I built for exactly this. It scores each SKILL.md across six dimensions, 0 to 2 points each, for a maximum of 12.

Dimension	What it measures
Description quality	Is the always-on description tight and specific?
Body size and focus	Is the body lean, or padded with reference material?
Progressive disclosure	Does bulk content live in `references/` files, loaded on demand?
Tool scoping	Is `allowed-tools` declared with an explicit list?
Frontmatter validity	Is the frontmatter complete and well-formed?
Prompting craft	Are the instructions clear, imperative, and unambiguous?

The whole thing runs in parallel. One command spins up 19 scorer agents at once:

/skillit:skill-audit

Results came back in under two minutes, and they exposed three systemic problems across the entire plugin. Here's each one.

Finding 1: not one skill scoped allowed-tools

Every skill scored 0 out of 2 on tool scoping. Not most of them. All 19.

The allowed-tools field controls which tools Claude can call during a skill invocation. Leave it out and the model has access to every tool in the session. That's a correctness problem before it's ever a security one. A skill that should only touch WebSearch and Write can wander off and call Bash, Edit, or any MCP tool mid-workflow. It also means sub-agents and clients that resolve schemas from the allowlist have to load far more tool schemas than the skill will ever use.

Scoping every skill also flushed out two latent permission bugs I didn't know I had. The blog-copyeditor skill was told to write output files but had no Write permission declared. The blog-wordpress-stage skill ran Python through Bash with no Bash in scope. Those had been silent mismatches for months. Declaring allowed-tools turned them into explicit, visible errors I could fix.

The change itself is one line per skill. Working out the right tool list for each one is the part that took time, because it meant reading every file. Some skills need almost nothing:

# cfp-hunter: search and write, nothing else
allowed-tools: ["WebSearch", "Write"]
 
# blog-wordpress-stage: a full multi-step staging workflow
allowed-tools: ["Bash", "Read", "Write", "Edit"]
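If you want a quick check before running a full audit, something like the following would list the skills whose frontmatter never declares `allowed-tools`. It's a minimal sketch, not how skillit does it: the `<skills_root>/<name>/SKILL.md` layout, the function names, and the regex-based frontmatter parsing are all assumptions for illustration, and a real YAML parser would be more robust.

```python
import re
from pathlib import Path

# Assumption: frontmatter is a block delimited by '---' lines at the top of the file.
FRONTMATTER = re.compile(r"\A---\s*\n(.*?)\n---\s*(?:\n|\Z)", re.DOTALL)


def frontmatter_keys(skill_md_text: str) -> set[str]:
    """Return the top-level keys found in a SKILL.md frontmatter block."""
    match = FRONTMATTER.match(skill_md_text)
    if not match:
        return set()
    keys = set()
    for line in match.group(1).splitlines():
        # Top-level keys start at column 0 and are followed by a colon.
        key_match = re.match(r"([A-Za-z0-9_-]+)\s*:", line)
        if key_match:
            keys.add(key_match.group(1))
    return keys


def skills_missing_allowed_tools(skills_root: Path) -> list[str]:
    """List skill directory names whose SKILL.md does not declare allowed-tools."""
    missing = []
    for skill_md in sorted(skills_root.glob("*/SKILL.md")):
        text = skill_md.read_text(encoding="utf-8")
        if "allowed-tools" not in frontmatter_keys(text):
            missing.append(skill_md.parent.name)
    return missing
```

Calling `skills_missing_allowed_tools(Path("skills"))` on a plugin laid out this way would return the directory names that still need a tool list. It only tells you the field is absent; deciding what belongs in the list still means reading the skill.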

Finding 2: descriptions were bloated, and you pay for them every session

The combined always-on description footprint across all 19 skills was 5,090 characters before I touched anything. A few were badly oversized:

Skill	Before (chars)	Sentences	Problem
sentiment-apitools	637	7	Full PLG-skew caveat, every tool name, a usage recommendation, all of it body content
blog-wordpress-scheduler	342	3	Scheduling rules (Tue/Thu-first logic, 2-week windows, Mon/Wed fallback) in the description
meetup-calendar	314	4	Three operating modes enumerated as separate sentences
blog-wordpress-stage	245	4	A step summary that just duplicated the body

The sentiment-apitools description was a 637-character, seven-sentence block that loaded into every session whether or not anyone ever ran a sentiment analysis. After trimming every description down to one or two sentences covering what the skill does and what it produces, the combined footprint dropped to 4,071 characters. A 20% cut on a cost that every user pays in every session.

At the four-characters-per-token rule of thumb, the description trims alone saved about 255 tokens per session across the plugin. Those are the tokens every user pays even when they run nothing.
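The arithmetic behind those numbers is short enough to write down. This sketch uses the character counts from this section; the four-characters-per-token divisor is a rule-of-thumb assumption, not Claude's real tokenizer, and the names are just illustrative.

```python
CHARS_PER_TOKEN = 4  # rule-of-thumb assumption, not a real tokenizer


def estimate_tokens(char_count: int) -> float:
    """Rough token estimate from a character count."""
    return char_count / CHARS_PER_TOKEN


before_chars = 5090  # combined description footprint before trimming
after_chars = 4071   # combined description footprint after trimming

saved_chars = before_chars - after_chars
saved_tokens = estimate_tokens(saved_chars)
reduction_pct = 100 * saved_chars / before_chars

print(f"{saved_chars} chars saved, ~{saved_tokens:.0f} tokens, {reduction_pct:.0f}% smaller")
# prints: 1019 chars saved, ~255 tokens, 20% smaller
```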

Finding 3: inline Python scripts were the real tax

The per-trigger damage was worst in the skills that had pasted entire Python scripts straight into their SKILL.md bodies. Every invocation loaded every line of those scripts, whether or not the workflow ever reached the step that used them.

meetup-calendar was the worst offender by a wide margin: 1,177 lines, 45.6 KB, five full Python scripts inline. They handled Google Sheets JWT auth, grid parsing, Luma event fetching, fuzzy name matching, and stat syncing. Estimated cost per invocation: 11,400 tokens.

blog-wordpress-stage carried nine Python blocks across a 611-line body: Google Doc conversion, markdown-to-HTML rendering, image uploads to the WordPress Media API, tag lookup and creation, post staging.

blog-wordpress-scheduler had a shared auth setup, a complete US public-holiday calculation, a 140-line Tue/Thu-first slot-finding algorithm, and a dashboard calendar writer, all across 608 lines.

luma-stats was a single 200-line Python script making up most of its 346-line body.

Same pattern every time. Implementation detail the model only needs at one specific step, loaded upfront on every single call, whether that step ran or not.
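A rough way to spot this pattern in your own skills is to measure how long each inline code block is. The sketch below assumes the scripts sit inside triple-backtick fences in the SKILL.md body and flags any block over a line threshold; the function name and the 20-line default are my own choices for illustration, not part of skillit.

```python
def oversized_code_blocks(markdown_text: str, max_lines: int = 20) -> list[tuple[int, int]]:
    """Return (start_line, line_count) for each fenced block longer than max_lines."""
    oversized = []
    block_start = None  # 1-based line number of the opening fence while inside a block
    for line_no, line in enumerate(markdown_text.splitlines(), start=1):
        if line.lstrip().startswith("```"):
            if block_start is None:
                block_start = line_no
            else:
                # Count only the lines strictly between the two fences.
                line_count = line_no - block_start - 1
                if line_count > max_lines:
                    oversized.append((block_start, line_count))
                block_start = None
    return oversized
```

Each tuple points at the opening fence of a block that is probably a candidate for moving out of the body. It won't catch scripts pasted as indented code or plain text, so treat an empty result as a hint, not proof.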

The fix: progressive disclosure with a references/ directory

The repair is boring, which is the point. Implementation detail moves out of the SKILL.md body and into a references/ subdirectory. The body keeps only the orchestration: numbered steps, with a pointer to the reference file at the exact step that needs it.

Before, from luma-stats:

### Step 2: Write and run the data-fetch script
 
Write the following Python script to /tmp/luma-stats.py, then run it.
 
[200 lines of Python inline]

After:

# SKILL.md body:
# Read references/luma-stats.py, write it to /tmp/luma-stats.py, then run it:
python3 /tmp/luma-stats.py [filter-arg]

The script now lives in references/luma-stats.py. Claude reads it when it reaches that step, not a moment before. The always-loaded body dropped from 346 lines to 131, a 62% cut, and the workflow behaves exactly the same.

This is the same move I made on the Postman MCP plugin earlier, where two skills went from 19 KB to 7.7 KB and 10.6 KB to 7.2 KB. Here I applied it across ten skills and pulled out 23 reference files:

Skill	Files moved to references/	Lines removed from body
blog-wordpress-stage	gdoc-to-markdown.py, wp-md-to-html.py, wp-upload-images.py, wp-check-post.py, wp-manage-tags.py, wp-stage-post.py	358
meetup-calendar	get-google-token.py, meetup-parse.py, luma-fetch.py, meetup-match.py	351
blog-wordpress-scheduler	wp-shared-setup.py, wp-write-calendar.py, wp-find-slot.py	242
luma-stats	luma-stats.py	215
blog-write	final-checklist.md, post-types.md	109
newsletter-agentsandapis	example-newsletter.md, quality-checklist.md	87
cfp-hunter	personas.md, cfp-sites.md	69
event-sponsorships	classification.md	53
blog-dashboard-cleanup	state-file-ops.md	44
blog-wordpress-stats	fetch-wp-posts.py	38

The full benchmark data

All 19 skills, before and after. Token estimates use the four-characters-per-token rule of thumb on raw file size.

Skill	Grade	Before (~tokens)	After (~tokens)	Saved	Reduction
blog-wordpress-stage	6→11	6,050	3,400	2,650	44%
meetup-calendar	3→7	11,400	8,775	2,625	23%
blog-wordpress-scheduler	6→11	5,975	3,900	2,075	35%
luma-stats	7→11	2,810	1,165	1,645	59%
sentiment-apitools	5→9	6,200	4,930	1,270	20%
blog-write	8→10	8,175	7,305	870	11%
event-sponsorships	8→11	1,765	1,100	665	38%
cfp-hunter	6→10	1,210	650	560	46%
newsletter-agentsandapis	5→10	2,700	2,180	520	19%
blog-wordpress-stats	8→11	1,125	905	220	20%
blog-dashboard-cleanup	8→11	975	790	185	19%
blog-ideas	7→9	1,570	1,620	0	allowed-tools only
blog-copyeditor	7→9	2,940	3,020	0	allowed-tools only
blog-create-from-gdoc	7→9	2,035	2,045	0	allowed-tools only
blog-prod-updates	7→9	2,625	2,710	0	allowed-tools only
blog-header-image	8→10	6,765	6,775	0	allowed-tools + bug fix
influencer-autoagent	6→8	3,625	3,720	0	allowed-tools + rubric fix
social-media-manager	8→10	2,325	2,400	0	allowed-tools only
blog-pipeline	10→12	880	890	0	allowed-tools only

Total body savings for a combined run of all 19 skills: 13,300 tokens, or 23% overall. Always-on savings: 255 tokens on every session, paid back whether or not the user runs a single one of the optimized skills. Average grade went from about 6.8 out of 12 to about 9.9.

What this actually saves in dollars

I want to be straight about this, because it's the part everyone gets wrong when they write up an optimization. Token counts convert directly to API cost. At Claude Sonnet 5's standard input rate of $3 per million tokens, here's what the savings are worth per invocation:

Skill	Tokens saved	Saved per invocation (cents)
blog-wordpress-stage	2,650	0.80¢
meetup-calendar	2,625	0.79¢
blog-wordpress-scheduler	2,075	0.62¢
luma-stats	1,645	0.49¢
sentiment-apitools	1,270	0.38¢
blog-write	870	0.26¢

Fractions of a cent per call. These are input tokens shaved off a SKILL.md body, not a full agentic run. Run every skill once, the 13,300-token combined total, and the body savings come to about four cents. The always-on trim is worth roughly 0.08 cents a session.

At my team's usage, a handful of people running Claude Code through the workday, that's a dollar or two a month. Nobody's writing a press release about that. What scales is session count: run the full pipeline once a day for a month and the body savings are worth about $1.20, then the always-on tax climbs linearly as sessions pile up.
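If you want to rerun this conversion with your own numbers, it's a one-line formula. This is a minimal sketch using the $3 per million input rate quoted above; the function name is illustrative, and you should plug in whatever rate applies to your model and plan.

```python
INPUT_PRICE_PER_MILLION_USD = 3.00  # rate quoted in this post; check current pricing


def tokens_to_cents(tokens: int, price_per_million_usd: float = INPUT_PRICE_PER_MILLION_USD) -> float:
    """Convert a count of input tokens into US cents at a per-million-token price."""
    return tokens * price_per_million_usd / 1_000_000 * 100


full_run_cents = tokens_to_cents(13_300)        # body savings, every skill run once
always_on_cents = tokens_to_cents(255)          # description savings, per session
monthly_full_run_usd = full_run_cents * 30 / 100  # one full run a day for 30 days

print(f"full run: {full_run_cents:.2f} cents")                 # full run: 3.99 cents
print(f"always-on: {always_on_cents:.2f} cents per session")   # always-on: 0.08 cents per session
print(f"30 daily runs: ${monthly_full_run_usd:.2f}")           # 30 daily runs: $1.20
```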

So the dollar figure is modest and I'm not going to pretend otherwise. The 23% smaller context footprint on every call is the part that matters, and it matters regardless of scale. That's headroom the model spends on the actual task instead of on skill overhead. When a plugin's whole job is to feed Claude good instructions, the tokens you don't spend on overhead are the ones it gets to spend thinking.

(Pricing as of writing: Claude Sonnet 5 standard rate, $3 per million input tokens. An introductory rate of $2 per million runs through August 31, 2026, which knocks a third off every figure above.)

5 rules I'd apply from day one

Retrofitting 19 skills taught me what I'd do differently if I were starting clean.

Set allowed-tools when you write the skill, not later. Working out the right list for 19 existing skills meant reading every file. Decide it while the skill is fresh in your head and it's a 30-second call instead of an hour of archaeology.

Treat the description as the most expensive real estate you own. It loads every session, for every user, forever. Two tight sentences is the ceiling: what the skill does, what it produces. Nothing else earns a spot.

Any script longer than 20 lines belongs in references/. The body should say "read references/myscript.py and run it," not carry the whole implementation. Output templates, scoring rubrics, persona lists, and example documents all follow the same rule.

Thin descriptions need expanding, not just trimming. The cfp-hunter skill had a 75-character one-liner. The scorer flagged it because the output artifact, the search constraints, and the argument hint were all missing, so I grew it to 203 characters. The fix doesn't always run the direction you expect.

Make async polling cheap. Any workflow that returns a 202 and polls should specify backoff (2s, 4s, 8s) and be told to report only the final outcome. Leave it unspecified and the model cheerfully narrates every round trip, filling your context with status updates nobody asked for.
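One way to make the polling rule concrete is to push the loop into a script, so the model sees a single result instead of every round trip. The sketch below is an assumption about how you might do that, not code from my plugin: `poll_with_backoff` and its `check` callback are hypothetical names, and `check` stands in for whatever request tells you whether the job has finished.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def poll_with_backoff(
    check: Callable[[], Optional[T]],
    delays: Sequence[float] = (2, 4, 8),
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call check() until it returns a non-None result or the delays run out.

    Nothing is printed between attempts; the caller only sees the final outcome.
    """
    result = check()
    for delay in delays:
        if result is not None:
            break
        sleep(delay)  # wait 2s, then 4s, then 8s by default
        result = check()
    return result
```

With the default delays this makes at most four checks over 14 seconds and returns `None` if the job never finishes, which the skill can report as a timeout. The `sleep` parameter exists only so the function can be tested without actually waiting.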

Run the audit on your own plugin

If you've got a skill or a plugin lying around that nobody's ever measured, it takes about two minutes to find out what it's costing you.

# Score every skill in the project at once
/skillit:skill-audit
 
# Apply fixes to a specific skill
/skillit:skill-optimize skills/my-skill
 
# Apply fixes to everything at once
/skillit:skill-optimize --all
 
# Re-score after editing
/skillit:skill-validate skills/my-skill/SKILL.md

After optimizing, check /context and /cost in your session to see the real footprint change. My estimates here use the four-characters-per-token rule of thumb; actual savings depend on Claude's tokenizer, and code tokenizes differently from prose. The audit for all 19 skills ran in under two minutes. The optimization pass took longer, because that's the part where you read files and move content around. But the scoring handed me a list sorted worst-grade-first, so my time went to meetup-calendar at 3 out of 12 and 45.6 KB before it went anywhere near blog-pipeline at 10 out of 12 and 3.5 KB.

Most of us building on Claude Code have a plugin like this somewhere: working fine, quietly unmeasured. Mine was fine too. It's just that "fine" was costing 23% more context than it needed to on every call, and I never would have known without pointing a scorer at it.

If you run skillit on your own skills, I'd genuinely like to hear what grade you got on the first pass. You can also follow me on YouTube at @seeqcode, where I build this stuff in the open.