Deep dive

Hosting provider detection at scale

How the pipeline infers where a WordPress site is hosted from HTTP headers and HTML patterns alone, across tens of millions of domains per crawl - no WHOIS, no traceroutes, no active probing.

What it is

Signal-first, not lookup-first

Observed HTTP signals per domain
  • header x-kinsta-cache: HIT → kinsta.header
  • header server: nginx → noise
  • header cf-ray: 84a7b2c91f4e-LHR → cloudflare.edge
  • header content-type: text/html; charset=UTF-8 → noise
  • HTML <script src="/wp-content/cache/min/1/abc.js"> → noise
  • HTML <link href="//kinsta-cdn.com/assets/wp.css"> → kinsta.asset
Fingerprint matcher
  • Rules: ~240
  • Rule types: header · asset · CDN
  • Tiers: strong · edge · hint
Rules are curated by hand and versioned per crawl. Multi-signal agreement escalates confidence; a single edge-only match (Cloudflare, say) never overrides an origin signal.
Resolved providers for this domain
  • Kinsta (strong)
    Origin-side: header rule + asset-URL rule agree. Logged as primary host.
  • Cloudflare (edge)
    Edge-only signal (cf-ray). Recorded as CDN layer, not as origin host.
A worked example. Real crawls see thousands of sites per CDN / host combination; the same matcher runs unchanged against every one. Edge-only signals (Cloudflare, Sucuri) are recorded alongside the origin rather than replacing it, so sites hosted on Kinsta behind Cloudflare show up as both.

Every competing approach to hosting attribution I evaluated starts from the IP address. WHOIS, rDNS, ASN lookups, traceroutes, active banner probes - all of them treat the network layer as the ground truth and the application layer as corroboration. That ordering is backwards at CommonCrawl scale. A CC archive has already spent real compute crawling each site's surface, which means the response headers and HTML are free bytes sitting in the WAT records the pipeline already reads. Going back out to WHOIS for 100M+ domains to learn something the headers already disclose is the opposite of frugal.

So the detector flips the layers. The primary signal is whatever the hosting provider chose to tell me about itself in its own response - an x-kinsta-cache header, a cache key under wpenginepowered.com, an asset URL pointing at kinsta-cdn. Network-layer evidence can still confirm or disambiguate (I do use it for the live-audit path) but it is never the first thing I reach for when the WARC already contains the answer.

The code path is server/audit/hosting-detect.ts for the ~25 hard-coded provider signatures and server/audit/html-scanner.ts for the per-plugin and per-theme fingerprint matcher that runs against the same HTML. They share the input (WAT + a HEAD-style slice of HTML) and produce independent outputs (provider hit vs plugin/theme hit) that the aggregation step joins on the domain. That separation is deliberate: plugin and provider detection share zero rule state, so changing one can't quietly regress the other.
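A minimal sketch of what a byte-literal signature matcher of this shape can look like. Everything below is illustrative, not the actual contents of hosting-detect.ts: the `matchProviders` helper is hypothetical, and the two signatures are copied from the examples shown later on this page.

```typescript
// Hypothetical sketch of a signal-first provider matcher.
type Tier = "origin" | "edge";

interface HostSignature {
  id: string;
  label: string;
  tier: Tier;
  patterns: string[]; // byte-literal substrings, no regex
}

const SIGNATURES: HostSignature[] = [
  { id: "kinsta", label: "Kinsta", tier: "origin",
    patterns: ["x-kinsta-cache", "kinsta.cloud", "kinsta.com", "kin-cdn"] },
  { id: "cloudflare", label: "Cloudflare", tier: "edge",
    patterns: ["cf-ray", "cloudflare", "cdnjs.cloudflare.com"] },
];

// A signature matches if any of its patterns appears in the
// lower-cased headers or HTML slice.
function matchProviders(headers: string, html: string): HostSignature[] {
  const haystack = (headers + "\n" + html).toLowerCase();
  return SIGNATURES.filter(sig =>
    sig.patterns.some(p => haystack.includes(p)));
}
```

Plain `includes` over a shared haystack is the point: no regex compilation, no per-rule parsing, just substring scans that stay cheap at batch scale.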

Rule engine

Definitive vs supplementary signals

Signal taxonomy
Definitive signals (any one fires a detection on its own):
  • Asset path: URL regex against HTML (/wp-content/plugins/{slug}/, /wp-content/themes/{slug}/). One match is enough; the slug is effectively the answer, so no multi-signal check is needed.
  • Generator meta: <meta name='generator' content='…'>. WordPress core and many plugins stamp themselves here; used for core-version pinning. Also filters out WP's default ?ver= query params so they don't pollute plugin version guesses.

Supplementary signals (score confidence only; never publish a match alone):
  • CSS class: body class, wrapper div, plugin-specific utility
  • HTML comment: <!-- Generated by X -->
  • HTML attribute: data-* markers, script attributes
  • REST namespace: /wp-json/{ns}/ with WP-core namespaces (wp, oembed, batch) filtered out
  • JS global: window.* markers left behind by inline scripts
  • Shortcode output: DOM fingerprints of rendered shortcodes

Confidence tiers:
  • high: ≥2 detection methods agreed, OR asset_path fired on its own
  • medium: one definitive signal, no supplementary corroboration
  • (filtered): no definitive signal; detection is dropped, never published as a match

The split between definitive and supplementary signals is the single most important design choice in the rule engine, because it's what keeps false positives from exploding as the corpus grows. A CSS class named wp-block-group is consistent with many plugins but distinctive to none. Treating it as enough to publish a detection would let any inherited utility class quietly attribute the wrong plugin to millions of sites.

So supplementary signals only count when a definitive signal has already anchored the match. Asset-path and generator are the anchors because both are effectively ground truth: a URL path under /wp-content/plugins/acme/ means the plugin "acme" is installed, and the generator meta names itself explicitly. Everything else is scored only to decide whether a match is high or medium confidence, not whether the match exists at all.
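The scoring rule described above fits in a few lines. This is a sketch of the logic, not the scanner's actual code; the `scoreDetection` name and the signal-name union are mine.

```typescript
// Sketch of definitive-vs-supplementary scoring (illustrative names).
type Signal =
  | "asset_path" | "generator"                        // definitive anchors
  | "css_class" | "html_comment" | "html_attribute"   // supplementary
  | "rest_namespace" | "js_global" | "shortcode";     // supplementary

const DEFINITIVE = new Set<Signal>(["asset_path", "generator"]);

type Confidence = "high" | "medium" | null;

function scoreDetection(fired: Signal[]): Confidence {
  // No definitive anchor: filtered, never published as a match.
  if (!fired.some(s => DEFINITIVE.has(s))) return null;
  // Asset path is ground truth on its own.
  if (fired.includes("asset_path")) return "high";
  // Otherwise: two agreeing methods promote to high, one stays medium.
  return new Set(fired).size >= 2 ? "high" : "medium";
}
```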

Confidence tiers change what downstream consumers do, not whether the match is recorded. A high detection drives the plugin-is-installed badge on a site's audit page and counts toward the ecosystem-scale adoption statistics. A medium detection still gets recorded, but only flows into the "likely detected" badge and is excluded from the headline prevalence numbers so the statistics don't drift toward false positives at the long tail.

The WordPress-core REST namespaces - wp, oembed, batch - are filtered out of the REST-namespace rule family explicitly (server/audit/html-scanner.ts, lines 407-410) so core namespaces don't get mistaken for plugin-specific REST routes. Small detail, outsized impact on false-positive rate.
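The shape of that filter is straightforward; here is a hedged sketch (my own `pluginRestNamespaces` helper, not the html-scanner.ts code) of extracting /wp-json/{ns}/ namespaces and dropping the core ones:

```typescript
// Core namespaces present on every WP site; never plugin evidence.
const CORE_NAMESPACES = new Set(["wp", "oembed", "batch"]);

// Illustrative: pull candidate plugin namespaces out of an HTML slice.
function pluginRestNamespaces(html: string): string[] {
  const out: string[] = [];
  const re = /\/wp-json\/([a-z0-9_-]+)\//gi;
  let m: RegExpExecArray | null;
  while ((m = re.exec(html)) !== null) {
    const ns = m[1].toLowerCase();
    if (!CORE_NAMESPACES.has(ns) && !out.includes(ns)) out.push(ns);
  }
  return out;
}
```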

Rules in practice

What a real rule entry looks like

The following are abridged but representative entries from the running corpus. Host signatures are hand-curated TS arrays; plugin / theme fingerprints are LLM-generated JSON rows.

Host signature: Kinsta
{
  id: 'kinsta',
  label: 'Kinsta',
  tier: 'origin',
  patterns: [
    'x-kinsta-cache',    // header key
    'kinsta.cloud',      // CDN host
    'kinsta.com',        // asset host
    'kin-cdn',           // asset CDN prefix
  ],
}
Match if any pattern is found in the response headers or HTML. Tier origin means this attributes the site as hosted on Kinsta, not merely proxied through one of Kinsta's products.
Host signature: Cloudflare
{
  id: 'cloudflare',
  label: 'Cloudflare',
  tier: 'edge',         // not attributed as origin host
  patterns: [
    'cf-ray',           // canonical edge header
    'cloudflare',       // generic marker
    'cdnjs.cloudflare.com', // asset CDN
  ],
}
Tier edge means a Cloudflare hit is recorded as a CDN layer and never replaces an origin-tier match. The same site can have a Cloudflare edge hit and a Kinsta origin hit simultaneously.
Plugin fingerprint: woocommerce
{
  slug: 'woocommerce',
  asset_paths: [
    '/wp-content/plugins/woocommerce/assets/'
  ],
  generator_patterns: [
    '<meta name="generator" content="WooCommerce'
  ],
  css_classes: ['woocommerce', 'wc-block-grid'],
  rest_endpoints: ['/wp-json/wc/'],
  js_globals: ['wc_add_to_cart_params'],
}
asset_paths and generator_patterns are definitive - either alone triggers a detection. css_classes, rest_endpoints, and js_globals are supplementary and only promote the confidence tier to high once a definitive signal has fired.
Plugin fingerprint: yoast-seo
{
  slug: 'wordpress-seo',
  asset_paths: [
    '/wp-content/plugins/wordpress-seo/'
  ],
  generator_patterns: [
    'Yoast SEO v'
  ],
  html_comments: [
    '<!-- This site is optimized with the Yoast'
  ],
  html_attributes: ['data-yoast-schema-graph'],
}
HTML-comment and attribute signals catch installs where the asset path is rewritten but the plugin's own markup still fires. Two independent definitive signals here (asset_paths and generator_patterns) mean confidence lands at high even on heavily-firewalled sites.
Multi-signal resolution

Edge, origin, and the Cloudflare problem

The interesting ambiguity in provider detection is Cloudflare in front of a real origin. A site proxied through Cloudflare emits Cloudflare signals (cf-ray, cdnjs.cloudflare.com asset URLs, server: cloudflare) in the response headers, but the origin host is usually leaking alongside in other fields - a WP Engine wpe-heartbeat header, a Kinsta cache-status line, an asset-URL still pointing at the origin CDN.

A naïve detector picks whichever signal fires first and attributes the site to that one provider. That's wrong twice over: it undercounts CDNs (because sites where origin also leaks are attributed only to the origin) and it undercounts origin hosts (because sites where Cloudflare signals fire first are attributed only to Cloudflare). Either way the hosting-mix statistics are systematically biased.

So the detector records both, not one. Cloudflare is tagged as an edge-tier hit; the origin provider is tagged as the origin-tier hit. The per-site table keeps them as separate rows with an explicit tier column so downstream queries can aggregate by either. "Sites on Kinsta" and "sites behind Cloudflare" answer different questions and the storage reflects that.
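A sketch of that resolution step. The row shape and helper names below are hypothetical; the point is that edge hits and origin hits coexist as rows, and "primary host" is just a query over the tier column.

```typescript
interface ProviderHit {
  provider: string;
  tier: "origin" | "edge"; // explicit tier column, kept per row
}

// Keep one row per (provider, tier) layer; nothing displaces anything.
function resolveHits(hits: ProviderHit[]): ProviderHit[] {
  const seen = new Set<string>();
  return hits.filter(h => {
    const key = `${h.provider}:${h.tier}`;
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

// "Primary host" is the origin-tier hit, if any; edge-only sites have none.
function primaryHost(hits: ProviderHit[]): string | null {
  return hits.find(h => h.tier === "origin")?.provider ?? null;
}
```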

The 25 provider signatures in server/audit/hosting-detect.ts cover the hosts I see most often in practice - Cloudflare, WP Engine, Kinsta, Cloudways, SiteGround, Pressable, Flywheel, DreamHost, and the long tail behind them. Each signature is an array of byte-literal patterns matched against headers and HTML, with no regex compilation overhead; at CC-batch scale even a 1% per-site cost compounds into hours of wall-clock.

Corpus maintenance

Where the rules actually come from

Plugin and theme fingerprints are not written by hand. Writing them by hand for ~55,000 plugins is not a thing you do. Fingerprints are generated by an LLM pass over each plugin's source - a dedicated small-LLM research job in server/analysis/gemini/fingerprint-gen.ts that reads the plugin's PHP entry file, its readme, and any shipped CSS/JS, and emits a structured JSON object with the eight rule families (asset paths, generator patterns, css classes, html comments, html attributes, js globals, shortcode output, rest endpoints).

The output lands in two SQLite tables that the detector loads at request time: plugin_fingerprints_wat (signals visible in the WAT-only narrowing phase) and plugin_fingerprints_direct (richer signals only visible once the WARC HTML is available). Each row is one plugin or theme slug; the fields are JSON-in-TEXT arrays parsed into typed structures in server/audit/html-scanner.ts (lines 84-163).
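Parsing JSON-in-TEXT columns defensively matters at this scale: one malformed row shouldn't crash a batch. A sketch under assumed column names (the real schema and the html-scanner.ts parsing code may differ):

```typescript
// Hypothetical row shape, as it might come out of SQLite: JSON arrays
// stored as TEXT columns alongside the slug.
interface FingerprintRow {
  slug: string;
  asset_paths: string;
  css_classes: string;
}

interface Fingerprint {
  slug: string;
  assetPaths: string[];
  cssClasses: string[];
}

function parseRow(row: FingerprintRow): Fingerprint {
  const arr = (s: string): string[] => {
    try {
      const v = JSON.parse(s);
      // Keep only string entries; anything else is treated as absent.
      return Array.isArray(v) ? v.filter(x => typeof x === "string") : [];
    } catch {
      return []; // malformed TEXT → empty rule family, not a crash
    }
  };
  return {
    slug: row.slug,
    assetPaths: arr(row.asset_paths),
    cssClasses: arr(row.css_classes),
  };
}
```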

Why two tables instead of one. The WAT-first narrowing described on the CommonCrawl pipeline page only sees HTTP metadata and a slice of header-ish HTML, not the full response body. That phase can match on asset-path signals but can't usefully match on, say, a shortcode output pattern - those need the rendered DOM. Splitting the tables means the WAT pass never carries fingerprints it can't use, and the WARC pass never re-evaluates signals the WAT pass already ruled on.
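The split can be expressed as a partition over rule families. Which family lands on which side is my reading of the text above (asset paths and generator metas are visible in the WAT slice; DOM-dependent families need the WARC body); treat the assignment and the field names as illustrative.

```typescript
// Families matchable from WAT metadata plus the header-ish HTML slice.
const WAT_FAMILIES = ["asset_paths", "generator_patterns"] as const;

// Families that need the full rendered WARC HTML.
const WARC_FAMILIES = [
  "css_classes", "html_comments", "html_attributes",
  "rest_endpoints", "js_globals", "shortcode_markers",
] as const;

// Partition one fingerprint so each pass only carries rules it can use.
function splitFingerprint(fp: Record<string, string[]>) {
  const pick = (keys: readonly string[]) =>
    Object.fromEntries(keys.filter(k => fp[k]?.length).map(k => [k, fp[k]]));
  return { wat: pick(WAT_FAMILIES), warc: pick(WARC_FAMILIES) };
}
```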

Provider signatures are different - they're hand-curated because there are only 25 of them and the long tail of hosting providers is much smaller than the long tail of WordPress plugins. Each signature sits as a plain TS array of byte-pattern strings; adding a new provider is a 3-line patch plus a regression test against a known site.

Failure modes

Cases that shipped as bugs and then got rules

Each of these was a real misclassification I caught while reviewing per-provider pages. Each produced one rule change; each is worth writing down because the class of mistake generalises.

  • Overcount: generic server: nginx attributed as a provider
    An early signature included nginx as a weak provider marker. That's just a web server. Tens of thousands of sites got a spurious "nginx-hosted" tag. Fix: removed the rule; provider signatures now require a provider-specific token, not a web-server identifier.
  • Undercount: Cloudflare hits masking Kinsta origins
    Before the edge-vs-origin split, whichever rule matched first won. Cloudflare rules are common and fire early; actual origin hosts behind CF ended up undercounted by a material percentage. Fix: tiered the corpus (edge vs origin) so multiple hits on the same site coexist by layer.
  • False positive: WP-core REST namespaces picked up by plugin rules
    A plugin fingerprint's REST-namespace rule was generated with /wp-json/wp/ as a signal. That's the WordPress core namespace, present on every WP site. Fix: the scanner explicitly filters wp, oembed, and batch namespaces out of any REST-signal match (html-scanner.ts lines 407-410); fingerprint generator post-processor rejects them at write time.
  • Version drift: plugin that renamed its asset directory between major versions
    A fingerprint pinned to /assets/v1/ stopped matching after the plugin shipped a v2 that moved assets to /build/. Detection silently fell to zero on an otherwise healthy plugin. Fix: every fingerprint carries a generated-at timestamp and a plugin-version pin; a staleness query flags corpus rows whose last match is more than N weeks old, and those get regenerated against the latest source.
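The staleness check in the last item is a simple cutoff query. A sketch with hypothetical field names (the actual corpus rows carry more metadata than this):

```typescript
// Illustrative corpus row: slug plus timestamp of its last observed match.
interface CorpusRow {
  slug: string;
  lastMatchAt: number; // epoch ms
}

// Flag rows whose last match is older than `weeks` weeks; these get
// regenerated against the latest plugin source.
function staleRows(rows: CorpusRow[], now: number, weeks: number): string[] {
  const cutoff = now - weeks * 7 * 24 * 3600 * 1000;
  return rows.filter(r => r.lastMatchAt < cutoff).map(r => r.slug);
}
```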
Limits

What this approach misses

  • By design: private / intranet sites
    If the site isn't in CommonCrawl, this detector never sees it. The live-audit path covers individual domains on demand.
  • Structural: self-hosted with no signature
    Pure vanilla self-hosted WordPress with no caching layer, no provider-specific CDN, and a generic reverse proxy fingerprints as nothing. It shows up as 'unknown origin', and I'd rather have that than a guess.
  • Adversarial: header stripping
    Origins that actively strip provider-identifying headers at the edge (enterprise WAFs, aggressive nginx configs) are invisible to the detector.

Unknown origins are honest, not failures. A real fraction of sites genuinely have no provider-identifying signal in their response. Attributing those to "Other" or "Self-hosted" is the correct answer; guessing would pollute the stats. The aggregation step carries the unmatched percentage explicitly on every provider page and every ecosystem chart on this site.

Adversarial stripping is rare but real. A hosting provider that actively rewrites outbound responses to look "neutral" - removing vanity headers, rewriting cache keys, stripping asset URLs - becomes indistinguishable from any other nginx in front of a PHP origin. There's no in-band defence against this; it shows up as a systematic undercount for that provider specifically and a matching overcount of "unknown origin" in the same bucket.

Version of the corpus matters. Fingerprints generated for plugin version 3.4 may or may not match HTML emitted by plugin version 3.7. The pipeline retains the corpus version used for each observation so "this fingerprint stopped matching after plugin version X" is a query the dataset can answer, not a silent drift.

Further reading

Prior art and related work

  • Wappalyzer. The closest large-scale open fingerprint corpus: it indexes 2,500+ technologies across many categories; my corpus is narrower (WordPress plugins, themes, hosting providers) but orders of magnitude denser on each. The rule-family split (categories, implies, excludes, confidence) informed my tier design; the differences are mostly about scale and the multi-signal resolution story.

  • Commercial counterparts, methodology largely unpublished. Useful for directional comparison on coverage; not useful as a methodology reference because the rulesets and scoring are closed. This is one of the reasons I wrote this page in the first place - the ecosystem-scale WordPress hosting-mix dataset I produce should be methodologically transparent, even if competing commercial datasets aren't.

  • Data: CommonCrawl overview (for context)

    CC publishes host-level aggregates per crawl without any hosting-provider dimension. This pipeline joins those aggregates with the fingerprint matcher output to produce "N% of WordPress sites are on provider X" statistics that are otherwise not available publicly.

  • Planned arXiv preprint: "Passive hosting attribution at web-crawl scale: methodology and dataset"

    Work in progress. The methodology on this page is the first draft of the paper's Methods section; the dataset described here will be cited as the reproducibility artifact. Link will land here when the preprint goes up.

About the author

Browse the output dataset.

Every provider page on this site is generated from the detection dataset this pipeline produces.