Plugin & theme detection
How I figure out which WordPress plugins and themes are installed on any public site - from a single page visit or from billions of pages in a CommonCrawl archive, using the same fingerprint corpus.
Fingerprints, not guesses
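- Yoast SEO: version 21.2, confidence high
- WooCommerce: version 8.1.0, confidence high
- Elementor: version 3.16.4, confidence medium
- Jetpack: version not detected, confidence faint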
Most plugins and themes leak their identity into the rendered HTML of a page - asset URLs, CSS class names, JS globals on window.*, REST namespaces under /wp-json/, even a generator meta tag carrying the version string verbatim. Detection is the exercise of turning those leaked signals into a list of installed software.
I maintain a fingerprint corpus drawn directly from each plugin's own source code on WordPress.org's SVN mirror. A fingerprint entry looks like "this CSS class appears only in plugin X" or "this JS global is set only by plugin Y" - deterministic mappings, not keyword matches. When the same site exposes multiple independent signals for the same plugin, confidence climbs.
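A minimal sketch of that idea, not the actual WP-Safety matcher. The `Fingerprint` type, the four corpus entries, and the rule mapping the number of independent signal kinds to the faint/medium/high tiers are all assumptions made for illustration; the real corpus and thresholds are not described here.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Fingerprint:
    plugin: str   # plugin slug the signal maps to
    kind: str     # signal type: "asset_path", "css_class" or "js_global"
    pattern: str  # regex searched in the raw HTML


# Hypothetical corpus entries, chosen only to illustrate the shape of the data.
CORPUS = [
    Fingerprint("woocommerce", "asset_path", r'/wp-content/plugins/woocommerce/'),
    Fingerprint("woocommerce", "css_class", r'class="[^"]*\bwoocommerce-no-js\b'),
    Fingerprint("woocommerce", "js_global", r'\bwc_add_to_cart_params\b'),
    Fingerprint("elementor", "css_class", r'class="[^"]*\belementor-kit-\d+'),
]


def detect(html: str) -> dict[str, str]:
    """Map each detected plugin to a confidence tier.

    Assumed rule: 1 signal kind -> faint, 2 -> medium, 3 or more -> high.
    """
    kinds_seen: dict[str, set[str]] = {}
    for fp in CORPUS:
        if re.search(fp.pattern, html):
            kinds_seen.setdefault(fp.plugin, set()).add(fp.kind)
    tiers = {1: "faint", 2: "medium"}
    return {plugin: tiers.get(len(kinds), "high")
            for plugin, kinds in kinds_seen.items()}


sample = (
    '<body class="theme woocommerce-no-js">'
    '<script src="/wp-content/plugins/woocommerce/assets/js/cart.js"></script>'
    '<div class="elementor-kit-7"></div>'
)
result = detect(sample)
```

Counting distinct signal kinds rather than raw matches keeps ten hits on the same asset path from inflating confidence the way two genuinely independent signals should.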
The same corpus powers two different inputs - a one-off audit of example.com and a batch scan across billions of pages in a CommonCrawl WAT archive both resolve through the same matcher. The live audit can see more (REST probes, full DOM), but the CC pass covers orders of magnitude more sites, which is what the hosting-mix and version-distribution data on plugin pages is built from.
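One way to picture the shared matcher is a pair of adapters that reduce each input to the same set of asset URLs. The sketch below is an assumption about how that could be arranged, not the real pipeline. The function names are hypothetical, and the WAT field names (`Envelope`, `Payload-Metadata`, `HTTP-Response-Metadata`, `HTML-Metadata`, `Head`, `Scripts`, `Link`) follow the WAT record layout as commonly documented and should be checked against a real record.

```python
from html.parser import HTMLParser


class _AssetCollector(HTMLParser):
    """Collects script src and link href values from live HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.urls: set[str] = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "script" and attr.get("src"):
            self.urls.add(attr["src"])
        elif tag == "link" and attr.get("href"):
            self.urls.add(attr["href"])


def asset_urls_from_html(html: str) -> set[str]:
    """Live-audit adapter: parse the fetched page directly."""
    collector = _AssetCollector()
    collector.feed(html)
    return collector.urls


def asset_urls_from_wat(record: dict) -> set[str]:
    """Batch adapter: read the pre-extracted head metadata of a WAT record."""
    head = (record.get("Envelope", {})
                  .get("Payload-Metadata", {})
                  .get("HTTP-Response-Metadata", {})
                  .get("HTML-Metadata", {})
                  .get("Head", {}))
    entries = head.get("Scripts", []) + head.get("Link", [])
    return {entry["url"] for entry in entries if "url" in entry}


live_html = (
    '<head><link rel="stylesheet" href="/wp-content/plugins/woocommerce/a.css">'
    '<script src="/wp-content/plugins/woocommerce/b.js"></script></head>'
)
wat_record = {"Envelope": {"Payload-Metadata": {"HTTP-Response-Metadata": {
    "HTML-Metadata": {"Head": {
        "Scripts": [{"path": "SCRIPT@/src",
                     "url": "/wp-content/plugins/woocommerce/b.js"}],
        "Link": [{"path": "LINK@/href", "rel": "stylesheet",
                  "url": "/wp-content/plugins/woocommerce/a.css"}],
    }}}}}}
```

Both adapters return the same set for equivalent input, so everything downstream of them can stay unaware of where a page came from.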
From raw HTML to a resolved plugin list
What detection can and can't see
The gaps listed below reduce the fidelity of the version-distribution and hosting-mix statistics I publish on plugin pages, but they don't affect a plugin's score - scoring uses the plugin's own source code, independent of any specific site.
JS-only frontends. Sites that render their entire page in the browser (the HTML arrives empty and is populated by JavaScript after load) are invisible to CommonCrawl's static crawler and partially invisible to my direct crawls. I catch them via REST-route probes and /wp-json/ namespace discovery, but the fingerprint signal is thinner.
Auth-walled content. Plugins that only render markup behind a login or paywall leave no trace in a public crawl. I don't authenticate into third-party sites, so those plugins remain uncounted in the prevalence stats - a deliberate trade-off in favor of user privacy.
Rewritten asset paths. Some sites (especially those running WAFs or performance rewrites) strip or rename the /wp-content/plugins/ prefix. WAT-path fingerprints fail; direct fingerprints (CSS classes, JS globals) usually still work, but confidence drops to a single signal.
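The rewritten-path case can be illustrated with a small sketch. The regex, the function name, and both sample documents are assumptions made up for this example; it only shows why a path-based signal disappears while a direct one (here a CSS class) can survive.

```python
import re

# Matches the stock WordPress plugin asset prefix and captures the slug.
PLUGIN_PATH = re.compile(r"/wp-content/plugins/([a-z0-9_-]+)/", re.IGNORECASE)


def slugs_from_paths(html: str) -> set[str]:
    """Return plugin slugs exposed through unmodified asset paths."""
    return {slug.lower() for slug in PLUGIN_PATH.findall(html)}


# Stock site: the plugin slug is visible in the stylesheet URL.
stock = (
    '<link rel="stylesheet" '
    'href="https://example.com/wp-content/plugins/woocommerce/assets/css/woocommerce.css">'
)

# Rewritten site: the path is opaque, but the body class is still present.
rewritten = (
    '<link rel="stylesheet" href="https://example.com/static/a1b2c3.css">'
    '<body class="woocommerce-no-js">'
)

path_hit_stock = slugs_from_paths(stock)
path_hit_rewritten = slugs_from_paths(rewritten)
class_hit_rewritten = re.search(
    r'class="[^"]*\bwoocommerce-no-js\b', rewritten) is not None
```

On the rewritten page the path lookup returns an empty set and only the class match remains, which is the single-signal situation described above.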
Behind the detection surface
Sub-pages below walk through individual subsystems in depth - how I actually built each piece, the non-obvious decisions, and what didn't work along the way.

Hi - Mika here. I built WP-Safety solo, so the methodology below is genuinely how it works, not a marketing sketch. The deep-dives are where I go long on the non-obvious details. Strictly optional - the plugin and CVE pages carry the full story without any of this.
CommonCrawl pipeline
How I narrow a petabyte-scale archive down to tens of GB actually transferred, via WAT-first fingerprinting, per-WARC-file byte-range indexing, and a co-located signing proxy for S3 egress.
Hosting provider detection
Signal-first hosting attribution from HTTP headers and HTML alone, at crawl scale - definitive vs supplementary rules, confidence tiers, and multi-signal resolution for edge-vs-origin ambiguity (Cloudflare in front of Kinsta and friends).
Run detection on your own site.
Drop any URL into the public audit tool and see the exact fingerprint signals that resolved to each detected plugin.