Methodology

How we pick, test, score, and write about AI agent skills. Transparent by default.

What we review

GearScope reviews AI agent skills only. These are the tools, rule packs, MCP servers, skill packs, and prompt libraries that make AI coding agents better at their job. We do not review general AI platforms, foundational models, or non-agent utilities.

Skills are selected from public sources: GitHub trending repos, MCP registries, Hermes Agent skill packs, Cursor rule collections, and community recommendations. We track them in a public pipeline before review.

Test depths

Not every skill gets the same level of scrutiny. We are honest about how deep we went. Each review carries one of four depth labels, ordered from the strongest evidence to the weakest.

Sandboxed gold standard

Hands-on, run in a clean, isolated Linux sandbox with the full session captured

Same effort as Hands-on, plus the install and tests are run inside a clean, throwaway Linux sandbox so nothing touches the reviewer's machine. The test script and the captured log are public, so anyone can reproduce the exact session that produced the verdict. This is our preferred tier whenever the skill can be sandboxed.

Hands-on

4+ hours of active use

Installed, configured, and used on a real project. We write code with it, hit edge cases, and evaluate output quality against our own baseline. Used when sandboxing isn't possible (desktop apps, GPU-heavy workloads, OS-specific extensions). The review explains why the sandbox tier wasn't used.

Smoke test

1-2 hours of testing

Installed and run through the documented quickstart. Verified core claims work. Checked docs, config options, and error handling. Did not use in production.

Desk review

30-60 min of reading

Read the source, docs, and community feedback. Did not install or run. Used when the skill requires hardware, paid API keys, or environments we cannot replicate.

The depth label is on every review. Sandboxed reviews additionally link to the exact test script and raw log so anyone can re-run the test on their own machine. If we only desk-reviewed something, the review says so. No hiding it in the fine print.

Functional verification

Test depth tells you how thoroughly, and in what environment, we ran the tests. Functional verification tells you what we tested.

A sandboxed install that doesn't crash is still just "the script started without errors." Functional verification means we ran the skill on representative input and asserted that the output matches the documented claim. Where possible, we test both a positive case (skill does what it says) and a negative case (skill correctly errors on bad input). Where not possible, we say so.
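As an illustration only, a positive and a negative case might be sketched like this. The skill here, format_json_skill, is a hypothetical stand-in with an assumed documented claim (sorted keys, 2-space indent, ValueError on invalid JSON); it is not taken from any real review or test script.

```python
import json


def format_json_skill(raw: str) -> str:
    """Hypothetical stand-in for a skill under review.

    Assumed documented claim: returns the input JSON with sorted keys and
    2-space indentation, and raises ValueError on invalid JSON.
    """
    return json.dumps(json.loads(raw), indent=2, sort_keys=True)


def verify() -> dict[str, bool]:
    """Run one positive and one negative case against the documented claim."""
    results: dict[str, bool] = {}

    # Positive case: representative input, output must match the claim exactly.
    output = format_json_skill('{"b": 1, "a": 2}')
    results["positive"] = output == '{\n  "a": 2,\n  "b": 1\n}'

    # Negative case: bad input must produce the documented error, not garbage.
    try:
        format_json_skill("{not json")
        results["negative"] = False
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        results["negative"] = True

    return results


print(verify())  # {'positive': True, 'negative': True}
```

The point of the sketch is that both checks assert on behavior, not on whether the process merely started. A run where either entry is False would be published as a fail.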

Each review carries one of four functional-verification states, shown on the badge next to the depth label:

  • functional ✓ — We ran the skill on representative input and verified the output. The Smoke Tests section shows what was asserted.
  • functional ~ (partial) — Some functional aspects were verified, others were not. The reason for the partial status is displayed under the badges so you can see exactly what was and wasn't tested.
  • functional ✗ (no) — The skill installs and starts but we could not verify its actual output. Common reasons: needs a GPU, paid API key, hardware, or external service we don't have. The reason is displayed on the page.
  • functional ✗ (not attempted) — We didn't try, usually because the install was broken enough that we couldn't get to the functional layer.

Transparency is the point. When a functional test fails, we publish it as a fail. When we couldn't run one, we say why, on the page, not buried in a footer. A review with functional ✓ is stronger evidence than one with functional ~ or functional ✗; we'd rather be honest about the gap than fake the verification.

Scoring dimensions

Each skill is scored across five dimensions on a 1-5 scale. The overall gear rating is a weighted average, not a simple mean. Quality and docs count more than install speed.

Dimension   What it measures                                                    Weight
quality     Code quality, output accuracy, reliability in real use              30%
docs        Documentation clarity, examples, getting-started guide              25%
ease        Install friction, config complexity, dependency count               15%
value       Time saved vs. doing it manually or with a simpler alternative      20%
fit         How well it solves the problem it claims to solve                   10%
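To make the weighting concrete, here is a minimal sketch of the weighted average using the weights from the table above. The function name gear_rating and the example scores are hypothetical. How the result is rounded to a whole number of gears is not specified here, so the sketch returns the unrounded value.

```python
# Weights in percent, copied from the table above. They sum to 100.
WEIGHTS = {"quality": 30, "docs": 25, "ease": 15, "value": 20, "fit": 10}


def gear_rating(scores: dict[str, int]) -> float:
    """Weighted average of the five 1-5 dimension scores (unrounded)."""
    if set(scores) != set(WEIGHTS):
        raise ValueError(f"expected exactly these dimensions: {sorted(WEIGHTS)}")
    for name, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {score}")
    weighted_sum = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return weighted_sum / sum(WEIGHTS.values())


# Hypothetical scores for an imaginary skill.
example = {"quality": 4, "docs": 5, "ease": 3, "value": 4, "fit": 3}

print(gear_rating(example))                  # 4.0 (weighted)
print(sum(example.values()) / len(example))  # 3.8 (simple mean, for contrast)
```

In this example the strong docs score pulls the weighted result above the simple mean, because docs carries 25% of the weight while ease and fit together carry the same 25%.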

Score calibration

  • 5 gears: Exceptional. Would recommend to every agent user without reservation.
  • 4 gears: Very good. Minor issues that do not undermine the core value.
  • 3 gears: Average. Does what it says, nothing more. Fine to use if you need it.
  • 2 gears: Below average. Significant issues that limit usefulness.
  • 1 gear: Not recommended. Broken, misleading, or superseded by better options.

A 3 is not bad. Most skills land there. We do not inflate scores to be nice.

Verdicts

  • KEEP IT: Install this. It solves a real problem well.
  • TRY IT: Worth exploring if the use case matches yours. Not universally recommended.
  • SKIP IT: Not worth your time. Better alternatives exist or the skill is broken.

Editorial policy

No paid reviews. No affiliate links. No sponsorships. GearScope does not accept money from skill authors in exchange for coverage or favorable scores. This is non-negotiable.

Corrections

If a skill author or reader finds an error in a review, email reviews@gearscope.xyz with the subject "correction: [skill-name]". If the error affects the verdict, we add an Editor's Note at the top of the review with the correction and the date. We do not silently edit published reviews.

Staleness

Skills move fast. A review written today may be stale in a month if the skill updates frequently. Each review has a "last verified" date. When readers flag a stale review via the "report stale" link, we prioritize re-verification.

Track record

When we change a verdict (KEEP IT to SKIP IT or vice versa), the original verdict stays visible in the review history. Owning mistakes is the fastest way to build trust. We plan to publish a public track record page as the review count grows.

Who writes these

Reviews are written by GearScope. Each review is signed with the test depth and date so you can judge credibility for yourself. We are a named project, not an anonymous blog, and we stand behind what we publish.

Found something wrong or want to suggest a skill for review? Reach us at reviews@gearscope.xyz.