Methodology

Plugin & theme detection

How I figure out which WordPress plugins and themes are installed on any public site - from a single page visit or from billions of pages in a CommonCrawl archive, using the same fingerprint corpus.

What it is

Fingerprints, not guesses

example.com
WordPress 6.4.2
  • Yoast SEO 21.2 (high)
  • WooCommerce 8.1.0 (high)
  • Elementor 3.16.4 (medium)
  • Jetpack - (faint)
Illustrative - real audits link each detection back to the exact signal (asset hash, REST route, generator tag) that triggered it.

Most plugins and themes leak their identity into the rendered HTML of a page - asset URLs, CSS class names, JS globals on window.*, REST namespaces under /wp-json/, even a generator meta tag carrying the version string verbatim. Detection is the exercise of turning those leaked signals into a list of installed software.
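
To make the idea of "leaked signals" concrete, here is a minimal Python sketch that collects three of the signal types named above (asset URLs, CSS class names, generator meta tags) from a page's HTML. It is an illustration under stated assumptions, not the production extractor: the class name `SignalCollector` and the helper `collect_signals` are hypothetical, and only the standard library is used.

```python
from html.parser import HTMLParser


class SignalCollector(HTMLParser):
    """Collects asset URLs, CSS class names, and generator meta values."""

    def __init__(self):
        super().__init__()
        self.asset_urls = []
        self.css_classes = set()
        self.generators = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        # Script and stylesheet references carry plugin paths and ?ver= stamps.
        if tag == "script" and a.get("src"):
            self.asset_urls.append(a["src"])
        elif tag == "link" and a.get("href"):
            self.asset_urls.append(a["href"])
        elif tag == "meta" and (a.get("name") or "").lower() == "generator":
            self.generators.append(a.get("content") or "")
        # Any element can contribute class-name markers.
        for cls in (a.get("class") or "").split():
            self.css_classes.add(cls)


def collect_signals(html):
    """Hypothetical helper: parse one HTML document and return the collector."""
    parser = SignalCollector()
    parser.feed(html)
    parser.close()
    return parser
```

JS globals on window.* are not covered here, since seeing them reliably would need script execution or a scan of inline scripts; that part is left out of the sketch.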

I maintain a fingerprint corpus drawn directly from each plugin's own source code on WordPress.org's SVN mirror. A fingerprint entry looks like "this CSS class appears only in plugin X" or "this JS global is set only by plugin Y" - deterministic mappings, not keyword matches. When the same site exposes multiple independent signals for the same plugin, confidence climbs.
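
The lookup-then-agree idea can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the `FINGERPRINTS` entries are plausible examples rather than entries from the real corpus, and the tier thresholds (three or more independent signal kinds for high, two for medium, one for faint) are invented to mirror the tier names shown in the sample audit.

```python
from collections import defaultdict

# Hypothetical corpus: (signal kind, exact value) -> plugin slug.
FINGERPRINTS = {
    ("css_class", "woocommerce-no-js"): "woocommerce",
    ("rest_namespace", "wc/v3"): "woocommerce",
    ("asset_path", "/wp-content/plugins/woocommerce/"): "woocommerce",
    ("js_global", "elementorFrontendConfig"): "elementor",
}


def tier_for(independent_kinds):
    """Assumed thresholds; the real scorer may weigh signals differently."""
    if independent_kinds >= 3:
        return "high"
    if independent_kinds == 2:
        return "medium"
    return "faint"


def match(signals):
    """signals: iterable of (kind, value) pairs. Returns {slug: tier}."""
    kinds_by_slug = defaultdict(set)
    for kind, value in signals:
        slug = FINGERPRINTS.get((kind, value))  # exact lookup, no keyword search
        if slug is not None:
            kinds_by_slug[slug].add(kind)
    # Counting distinct kinds means two CSS classes count once, not twice.
    return {slug: tier_for(len(kinds)) for slug, kinds in kinds_by_slug.items()}
```

Counting distinct signal kinds rather than raw hits is one simple way to approximate "independent" evidence; it is a design choice of this sketch, not a documented rule.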

The same corpus serves two different inputs: a one-off audit of example.com and a batch scan across billions of pages in a CommonCrawl WAT archive both resolve through the same matcher. The live audit can see more (REST probes, full DOM), but the CommonCrawl pass covers orders of magnitude more sites, which is what the hosting-mix and version-distribution data on plugin pages is built from.
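
For the batch side, a WAT record is a JSON document of metadata extracted from each crawled page, so the same kinds of signals can be read without the full HTML. The Python sketch below assumes the commonly seen WAT layout (Envelope, Payload-Metadata, HTTP-Response-Metadata, HTML-Metadata, then Head and Links); treat those field names as an assumption to verify against a real record. The function name `signals_from_wat` is hypothetical.

```python
import json
import re

# Matches the plugin slug in a /wp-content/plugins/<slug>/ path.
PLUGIN_PATH = re.compile(r"/wp-content/plugins/([a-z0-9][a-z0-9_-]*)/", re.IGNORECASE)


def signals_from_wat(record_json):
    """Extract plugin slugs and generator tags from one WAT record (JSON string)."""
    record = json.loads(record_json)
    envelope = record.get("Envelope", {})
    page_url = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI")
    html_meta = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
    )
    head = html_meta.get("Head", {})

    # Gather every URL the record lists: head links, head scripts, body links.
    urls = [item.get("url", "") for key in ("Link", "Scripts") for item in head.get(key, [])]
    urls += [item.get("url", "") for item in html_meta.get("Links", [])]

    slugs = sorted({m.group(1).lower() for u in urls for m in PLUGIN_PATH.finditer(u)})
    generators = [
        meta.get("content", "")
        for meta in head.get("Metas", [])
        if (meta.get("name") or "").lower() == "generator"
    ]
    return {"url": page_url, "plugin_slugs": slugs, "generators": generators}
```

The returned slugs and generator strings could then be passed to the same lookup a live audit uses, which is the point of sharing one corpus between the two inputs.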

How it works

From raw HTML to a resolved plugin list

Three signal sources feed a single matcher. CommonCrawl supplies breadth; direct crawls supply depth; the fingerprint corpus supplies the ground truth every signal resolves against.

Signal sources

CommonCrawl (breadth)
  • Monthly WAT archives (primary): ~4.9 billion URLs per crawl
  • Asset-path references: /wp-content/plugins/slug/* in any page
  • Generator meta tags: WordPress + plugin version stamps

Direct crawls (depth)
  • Full homepage DOM: via residential proxy, real rendered HTML
  • REST route probes: GET /wp-json/ reveals registered namespaces
  • style.css headers: theme version + author from SVN-style headers

Fingerprint corpus (ground truth)
  • Asset hashes: distinctive file signatures from plugin source
  • JS global names: window.* variables plugins leak on load
  • CSS class markers: plugin-specific DOM hooks

Matcher
  • Signal lookup: each signal resolves to zero or more plugin slugs
  • Confidence scorer: multi-signal agreement boosts confidence
  • Version extractor: ?ver= params, asset paths, readme headers

Output
  • Plugin list: slug + version + confidence tier
  • Active theme: slug + version from style.css
  • WordPress core version: from generator meta or asset path patterns

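
As a rough illustration of version extraction, the Python sketch below pulls a version from the ?ver= query parameter of a plugin or theme asset URL and a core version from a generator string. The function names `version_from_asset` and `core_version` are hypothetical, and the regular expressions are simplifying assumptions rather than the rules the real extractor uses.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Captures "plugins" or "themes" and the slug directory that follows.
ASSET = re.compile(r"/wp-content/(plugins|themes)/([^/]+)/")
CORE_GENERATOR = re.compile(r"^WordPress (\d+(?:\.\d+)+)")


def version_from_asset(url):
    """Return (kind, slug, version or None), or None if the URL is not a plugin/theme asset."""
    parts = urlsplit(url)
    m = ASSET.search(parts.path)
    if not m:
        return None
    ver = parse_qs(parts.query).get("ver", [None])[0]
    return (m.group(1), m.group(2), ver)


def core_version(generator):
    """Return the core version from a generator string such as 'WordPress 6.4.2'."""
    m = CORE_GENERATOR.match(generator.strip())
    return m.group(1) if m else None
```

One caveat worth keeping in mind: when a plugin enqueues an asset without its own version, WordPress appends the core version as ?ver=, so a value equal to the detected core version is weak evidence for the plugin's version.
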
Coverage gaps

What detection can and can't see

  • JS-only rendered sites (partial): React / Vue frontends with empty server HTML
  • Auth-walled pages (limited): content behind a paywall or member gate
  • Renamed assets (defeats WAT): sites that rewrite /wp-content/* to anonymize paths

These gaps reduce the fidelity of the version-distribution and hosting-mix statistics I publish on plugin pages, but they don't affect a plugin's score - scoring uses the plugin's own source code, independent of any specific site.

JS-only frontends. Sites that render the entire page in the browser (the HTML arrives empty and JavaScript populates it after load) are invisible to CommonCrawl's static crawler and only partially visible to my direct crawls. I catch them via REST-route probes of /wp-json/, which list the registered namespaces, but the fingerprint signal is thinner.
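
A REST probe of this kind can be sketched in Python as follows. The WordPress REST index at /wp-json/ returns a JSON object with a "namespaces" list; the prefix-to-slug table below is a small illustrative assumption, not the real corpus, and the names `plugins_from_index` and `probe` are hypothetical.

```python
import json
from urllib.request import Request, urlopen

# Illustrative mapping from REST namespace prefix to plugin slug (assumed).
NAMESPACE_PREFIXES = {
    "wc": "woocommerce",
    "yoast": "wordpress-seo",
    "jetpack": "jetpack",
    "elementor": "elementor",
}


def plugins_from_index(index):
    """Map the 'namespaces' list of a parsed /wp-json/ index to plugin slugs."""
    found = set()
    for namespace in index.get("namespaces", []):
        prefix = namespace.split("/", 1)[0]  # "wc/v3" -> "wc"
        slug = NAMESPACE_PREFIXES.get(prefix)
        if slug:
            found.add(slug)
    return sorted(found)


def probe(site_url, timeout=10):
    """Fetch the REST index of a site and resolve its namespaces (network call)."""
    request = Request(
        site_url.rstrip("/") + "/wp-json/",
        headers={"User-Agent": "example-probe/0.1"},  # placeholder user agent
    )
    with urlopen(request, timeout=timeout) as response:
        return plugins_from_index(json.load(response))
```

Because the namespace list comes from the server rather than the rendered page, it stays available even when the HTML arrives empty, though a site can disable or restrict the REST index.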

Auth-walled content. Plugins that only render markup behind a login or paywall leave no trace in a public crawl. I don't authenticate into third-party sites, so those plugins remain uncounted in the prevalence stats - a deliberate trade-off in favor of user privacy.

Rewritten asset paths. Some sites (especially those running WAFs or performance rewrites) strip or rename the /wp-content/plugins/ prefix. WAT-path fingerprints fail; direct fingerprints (CSS classes, JS globals) usually still work, but with often only a single signal left, confidence drops.

Go deeper

Behind the detection surface

Sub-pages below walk through individual subsystems in depth - how I actually built each piece, the non-obvious decisions, and what didn't work along the way.

For nerds only

Hi - Mika here. I built WP-Safety solo, so the methodology below is genuinely how it works, not a marketing sketch. The deep-dives are where I go long on the non-obvious details. Strictly optional - the plugin and CVE pages carry the full story without any of this.

Mika Sipilä · Founder, WP-Safety.org

Run detection on your own site.

Drop any URL into the public audit tool and see the exact fingerprint signals that resolved to each detected plugin.