Methodology

Plugin & theme detection

How I identify which WordPress plugins and themes are installed on any public site - from a single page visit or from billions of pages in a CommonCrawl archive, using the same fingerprint corpus.

What it is

Fingerprints, not guesses

Detected plugins

Plugin · Version · Confidence
Yoast SEO · 21.2 · high
WooCommerce · 8.1.0 · high
Elementor · 3.16.4 · medium
Jetpack · - · faint

Illustrative - real audits link each detection back to the exact signal (asset hash, REST route, generator tag) that triggered it.

Most plugins and themes leak their identity into the rendered HTML of a page - asset URLs, CSS class names, JS globals on window.*, REST namespaces under /wp-json/, even a generator meta tag carrying the version string verbatim. Detection is the exercise of turning those leaked signals into a list of installed software.
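As a rough sketch of what harvesting those leaked signals could look like, the Python below pulls plugin slugs out of asset URLs and reads generator meta tags using only the standard library. The names `SignalCollector` and `collect_signals` are hypothetical and the sample HTML is invented; this is not WP-Safety's actual code.

```python
import re
from html.parser import HTMLParser

# Matches /wp-content/plugins/<slug>/ anywhere inside a URL.
PLUGIN_PATH = re.compile(r"/wp-content/plugins/([a-z0-9_-]+)/", re.IGNORECASE)


class SignalCollector(HTMLParser):
    """Collects plugin slugs from asset URLs and generator meta values."""

    def __init__(self):
        super().__init__()
        self.plugin_slugs = set()
        self.generators = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # Asset URLs live in src (scripts, images) and href (stylesheets).
        for key in ("src", "href"):
            match = PLUGIN_PATH.search(attr.get(key) or "")
            if match:
                self.plugin_slugs.add(match.group(1).lower())
        if tag == "meta" and (attr.get("name") or "").lower() == "generator":
            self.generators.append(attr.get("content") or "")


def collect_signals(html):
    collector = SignalCollector()
    collector.feed(html)
    return collector.plugin_slugs, collector.generators


sample = """
<head>
<meta name="generator" content="WordPress 6.3.1">
<link rel="stylesheet" href="/wp-content/plugins/woocommerce/assets/css/woocommerce.css?ver=8.1.0">
<script src="https://example.com/wp-content/plugins/elementor/assets/js/frontend.min.js?ver=3.16.4"></script>
</head>
"""
slugs, generators = collect_signals(sample)
print(sorted(slugs), generators)
```

A slug found this way is only a candidate; it still has to resolve against the fingerprint corpus before it counts as a detection.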

I maintain a fingerprint corpus drawn directly from each plugin's own source code on WordPress.org's SVN mirror. A fingerprint entry looks like "this CSS class appears only in plugin X" or "this JS global is set only by plugin Y" - deterministic mappings, not keyword matches. When the same site exposes multiple independent signals for the same plugin, confidence climbs.
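To make "deterministic mappings" and "confidence climbs" concrete, here is a minimal sketch that treats the corpus as an exact-match lookup table and raises the tier when independent kinds of signal agree. The table entries, the `score` function, and the thresholds (one kind = medium, two or more = high) are assumptions made for illustration, not the real corpus or scoring rules.

```python
from collections import defaultdict

# Hypothetical corpus: (signal kind, signal value) -> plugin slug.
FINGERPRINTS = {
    ("css_class", "elementor-widget"): "elementor",
    ("js_global", "elementorFrontendConfig"): "elementor",
    ("rest_namespace", "wc/v3"): "woocommerce",
}


def score(signals):
    """Resolve signals to slugs and tier each slug by how many distinct kinds agree."""
    kinds_by_slug = defaultdict(set)
    for kind, value in signals:
        slug = FINGERPRINTS.get((kind, value))  # exact lookup, no keyword matching
        if slug is not None:  # unknown signals resolve to nothing
            kinds_by_slug[slug].add(kind)
    # Assumed thresholds: two or more independent kinds -> high, otherwise medium.
    return {
        slug: ("high" if len(kinds) >= 2 else "medium")
        for slug, kinds in kinds_by_slug.items()
    }


observed = [
    ("css_class", "elementor-widget"),
    ("js_global", "elementorFrontendConfig"),
    ("rest_namespace", "wc/v3"),
    ("css_class", "some-unrelated-class"),
]
print(score(observed))
```

Counting distinct kinds rather than raw hits means ten copies of the same CSS class do not count as ten confirmations.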

The same corpus serves two different inputs: a one-off audit of example.com and a batch scan across billions of pages in a CommonCrawl WAT archive both resolve through the same matcher. The live audit can see more (REST probes, full DOM), but the CommonCrawl pass covers orders of magnitude more sites, and it is the source of the hosting-mix and version-distribution data on plugin pages.
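For the batch side, a WAT record is a JSON document of metadata extracted from each crawled page, so the same slug extraction can run without the HTML itself. The sketch below assumes the usual WAT nesting (Envelope, Payload-Metadata, HTTP-Response-Metadata, HTML-Metadata) and field names such as `Head`, `Metas`, `Scripts`, `Link`, and `Links`; treat those names and the sample record as assumptions to check against a real archive. `signals_from_wat` is a hypothetical helper.

```python
import re

PLUGIN_PATH = re.compile(r"/wp-content/plugins/([a-z0-9_-]+)/", re.IGNORECASE)


def signals_from_wat(record):
    """Return (plugin slugs, generator strings) from one parsed WAT JSON record."""
    html_meta = (
        record.get("Envelope", {})
        .get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
    )
    head = html_meta.get("Head", {})
    # Script, stylesheet, and body link entries are assumed to carry a "url" field.
    entries = head.get("Scripts", []) + head.get("Link", []) + html_meta.get("Links", [])
    slugs = set()
    for entry in entries:
        match = PLUGIN_PATH.search(entry.get("url", ""))
        if match:
            slugs.add(match.group(1).lower())
    generators = [
        meta.get("content", "")
        for meta in head.get("Metas", [])
        if meta.get("name", "").lower() == "generator"
    ]
    return slugs, generators


# Invented record, trimmed to the fields this sketch reads.
record = {"Envelope": {"Payload-Metadata": {"HTTP-Response-Metadata": {"HTML-Metadata": {
    "Head": {
        "Metas": [{"name": "generator", "content": "WordPress 6.3.1"}],
        "Scripts": [{"path": "SCRIPT@/src",
                     "url": "/wp-content/plugins/jetpack/_inc/build/photon/photon.min.js"}],
    },
    "Links": [{"path": "IMG@/src", "url": "/wp-content/uploads/logo.png"}],
}}}}}
wat_slugs, wat_generators = signals_from_wat(record)
print(wat_slugs, wat_generators)
```

Because the output has the same shape as what a live audit would collect from parsed HTML, both inputs can feed one matcher.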

How it works

From raw HTML to a resolved plugin list

Three signal sources feed a single matcher. CommonCrawl supplies breadth; direct crawls supply depth; the fingerprint corpus supplies the ground truth every signal resolves against.

Signal sources

CommonCrawl WAT

Monthly WAT archives

~4.9 billion URLs per crawl

primary

Asset-path references

/wp-content/plugins/slug/* in any page

Generator meta tags

WordPress + plugin version stamps

Direct crawls

Full homepage DOM

Via residential proxy, real rendered HTML

REST route probes

GET /wp-json/ reveals registered namespaces

style.css headers

Theme version + author from the stylesheet's header comment

Fingerprint corpus

Asset hashes

Distinctive file signatures from plugin source

JS global names

window.* variables plugins leak on load

CSS class markers

Plugin-specific DOM hooks

Matcher

Signal lookup

Each signal resolves to zero or more plugin slugs

Confidence scorer

Multi-signal agreement boosts confidence

Version extractor

?ver= params, asset paths, readme headers

Per-site detection

Plugin list

Slug + version + confidence tier

Active theme

Slug + version from style.css

WordPress core version

From generator meta or asset path patterns
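The version extractor and the style.css step can be sketched in a few lines. The code below reads the `ver` query parameter from a plugin asset URL and parses the header comment at the top of a theme stylesheet. The function names and sample inputs are hypothetical; only the `?ver=` convention and the `Theme Name` / `Version` / `Author` header fields come from WordPress itself.

```python
import re
from urllib.parse import urlparse, parse_qs

PLUGIN_PATH = re.compile(r"/wp-content/plugins/([a-z0-9_-]+)/", re.IGNORECASE)
# A header field sits on its own line, optionally preceded by comment punctuation.
HEADER_LINE = re.compile(r"^[ \t/*#@]*(Theme Name|Version|Author):(.*)$", re.MULTILINE)


def plugin_version_from_url(url):
    """Return (slug, version or None) for a plugin asset URL, or None if no slug."""
    parsed = urlparse(url)
    match = PLUGIN_PATH.search(parsed.path)
    if not match:
        return None
    versions = parse_qs(parsed.query).get("ver")
    return match.group(1).lower(), (versions[0] if versions else None)


def theme_header(css_text):
    """Extract selected header fields from the top of a theme's style.css."""
    return {name: value.strip() for name, value in HEADER_LINE.findall(css_text)}


asset = "https://example.com/wp-content/plugins/woocommerce/assets/js/frontend/cart.min.js?ver=8.1.0"
stylesheet = """/*
Theme Name: Example Theme
Author: Example Author
Version: 1.4.2
*/
body { margin: 0; }
"""
print(plugin_version_from_url(asset))
print(theme_header(stylesheet))
```

One caution: when a plugin enqueues an asset without its own version, WordPress appends the core version as `ver`, so a `?ver=` value is a hint to cross-check rather than proof of the plugin's version.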

Coverage gaps

What detection can and can't see

Blind spots

partial

JS-only rendered sites

React / Vue frontends with empty server HTML

limited

Auth-walled pages

Content behind a paywall or member gate

defeats WAT

Renamed assets

Sites that rewrite /wp-content/* to anonymize paths

These gaps reduce the fidelity of the version-distribution and hosting-mix statistics I publish on plugin pages, but they don't affect a plugin's score - scoring uses the plugin's own source code, independent of any specific site.

JS-only frontends. Sites that render the entire page in the browser (the HTML arrives nearly empty and JavaScript populates it after load) are invisible to CommonCrawl's static crawler and only partially visible to my direct crawls. I still catch them by probing /wp-json/ for registered REST namespaces, but the fingerprint signal is thinner.
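A minimal sketch of the namespace step, assuming the /wp-json/ index response has already been fetched: the REST index lists registered namespaces in a `namespaces` array, and each namespace prefix can be looked up in a table. The prefix table and `plugins_from_rest_index` are hypothetical, and the sample body is invented.

```python
import json

# Hypothetical prefix -> slug table; a real corpus would be far larger.
NAMESPACE_PREFIXES = {
    "yoast": "wordpress-seo",
    "wc": "woocommerce",
    "jetpack": "jetpack",
    "elementor": "elementor",
}


def plugins_from_rest_index(body):
    """Resolve the `namespaces` array of a /wp-json/ index response to plugin slugs."""
    index = json.loads(body)
    found = set()
    for namespace in index.get("namespaces", []):
        prefix = namespace.split("/", 1)[0]  # "wc/store/v1" -> "wc"
        slug = NAMESPACE_PREFIXES.get(prefix)
        if slug:
            found.add(slug)
    return found


body = '{"name": "Example", "namespaces": ["oembed/1.0", "wp/v2", "yoast/v1", "wc/v3", "wc/store/v1"]}'
rest_plugins = plugins_from_rest_index(body)
print(sorted(rest_plugins))
```

This works even when the server-rendered HTML is empty, which is why it helps with headless frontends; it yields no asset paths or `?ver=` values, so version data is usually missing.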

Auth-walled content. Plugins that only render markup behind a login or paywall leave no trace in a public crawl. I don't authenticate into third-party sites, so those installs remain uncounted in the prevalence stats - a deliberate trade-off in favor of user privacy.

Rewritten asset paths. Some sites (especially those running WAFs or performance rewrites) strip or rename the /wp-content/plugins/ prefix. Path-based fingerprints from WAT data then fail; fingerprints that don't depend on paths (CSS classes, JS globals) usually still work, but the detection often rests on a single signal, so confidence drops.
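To illustrate the path-independent fallback, the sketch below ignores URLs entirely and matches CSS class tokens against a small marker table. The marker table and the names `ClassCollector` and `detect_by_class` are assumptions for illustration; the sample page deliberately has an anonymized asset path.

```python
from html.parser import HTMLParser

# Hypothetical class-marker table: exact class token -> plugin slug.
CLASS_MARKERS = {
    "elementor-widget": "elementor",
    "woocommerce-page": "woocommerce",
}


class ClassCollector(HTMLParser):
    """Gathers every class token that appears on any element."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def detect_by_class(html):
    collector = ClassCollector()
    collector.feed(html)
    return {slug for marker, slug in CLASS_MARKERS.items() if marker in collector.classes}


page = """
<body class="home woocommerce-page">
<link rel="stylesheet" href="/assets/a1b2c3.css">
<div class="elementor-widget elementor-widget-heading"></div>
</body>
"""
class_hits = detect_by_class(page)
print(sorted(class_hits))
```

Matching whole class tokens instead of substrings keeps a class like `my-elementor-widget-clone` from triggering a false positive.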

Go deeper

Behind the detection surface

Sub-pages below walk through individual subsystems in depth - how I actually built each piece, the non-obvious decisions, and what didn't work along the way.

For nerds only

Hi - Mika here. I built WP-Safety solo, so the methodology below is genuinely how it works, not a marketing sketch. The deep-dives are where I go long on the non-obvious details. Strictly optional - the plugin and CVE pages carry the full story without any of this.

Mika Sipilä · Founder, WP-Safety.org

CommonCrawl pipeline

How I narrow a petabyte-scale archive down to tens of GB actually transferred, via WAT-first fingerprinting, per-WARC-file byte-range indexing, and a co-located signing proxy for S3 egress.
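As a hedged illustration of why byte-range indexing cuts transfer so sharply: if an index supplies the offset and length of one record inside a large archive file, a single HTTP Range request can fetch just those bytes. The helper below only builds the request and sends nothing; `range_request` and the URL are hypothetical, and the real pipeline's details are in the deep-dive.

```python
from urllib.request import Request


def range_request(url, offset, length):
    """Build (but do not send) a GET for bytes [offset, offset + length) of a file."""
    last_byte = offset + length - 1  # HTTP byte ranges are inclusive on both ends
    return Request(url, headers={"Range": f"bytes={offset}-{last_byte}"})


req = range_request("https://example.com/archive.warc.gz", offset=1000, length=500)
print(req.get_header("Range"))
```

A server that honors the header answers with status 206 and only the requested slice, so the cost scales with the records wanted rather than with the size of the archive.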


Hosting provider detection

Signal-first hosting attribution from HTTP headers and HTML alone, at crawl scale - definitive vs supplementary rules, confidence tiers, and multi-signal resolution for edge-vs-origin ambiguity (Cloudflare in front of Kinsta and friends).
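The edge-vs-origin problem can be sketched as two separate rule tables, so that a CDN header never overwrites an origin header. The header names below are illustrative assumptions, not WP-Safety's actual rule set, and `attribute_hosting` is a hypothetical helper.

```python
# Illustrative rules only: lowercase response header name -> provider.
ORIGIN_RULES = {"x-kinsta-cache": "kinsta"}
EDGE_RULES = {"cf-ray": "cloudflare"}


def attribute_hosting(headers):
    """Separate the edge network from the origin host using response header names."""
    names = {name.lower() for name in headers}  # header names are case-insensitive
    edge = next((host for header, host in EDGE_RULES.items() if header in names), None)
    origin = next((host for header, host in ORIGIN_RULES.items() if header in names), None)
    return {"edge": edge, "origin": origin}


result = attribute_hosting({"CF-Ray": "0000000000000000-HEL", "X-Kinsta-Cache": "HIT"})
print(result)
```

Keeping the two answers apart avoids reporting a site as "hosted on Cloudflare" when Cloudflare is only the proxy in front of the actual host.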


Run detection on your own site.

Drop any URL into the public audit tool and see the exact fingerprint signals that resolved to each detected plugin.
