Hosting provider detection at scale
How the pipeline infers where a WordPress site is hosted from HTTP headers and HTML patterns alone, across tens of millions of domains per crawl - no WHOIS, no traceroutes, no active probing.
Signal-first, not lookup-first
Example signals and how they classify:
- Header x-kinsta-cache: HIT → kinsta (header rule)
- Header server: nginx → noise
- Header cf-ray: 84a7b2c91f4e-LHR → cloudflare (edge rule)
- Header content-type: text/html; charset=UTF-8 → noise
- Asset <script src="/wp-content/cache/min/1/abc.js"> → noise
- Asset <link href="//kinsta-cdn.com/assets/wp.css"> → kinsta (asset rule)

Rules: ~240 · Rule types: header · asset · CDN · Tiers: strong · edge · hint

- Kinsta (strong): origin-side; header rule + asset-URL rule agree. Logged as primary host.
- Cloudflare (edge): edge-only signal (cf-ray). Recorded as CDN layer, not as origin host.
Every competing approach to hosting attribution I evaluated starts from the IP address. WHOIS, rDNS, ASN lookups, traceroutes, active banner probes - all of them treat the network layer as the ground truth and the application layer as corroboration. At CommonCrawl scale that priority is backwards. A CC archive has already spent real compute crawling each site's surface, which means the response headers and HTML are free bytes sitting in the WAT records the pipeline already reads. Going back out to WHOIS for 100M+ domains to learn something the headers already disclose is the opposite of frugal.
So the detector flips the layers. The primary signal is whatever the hosting provider chose to tell me about itself in its own response - an x-kinsta-cache header, a cache key under wpenginepowered.com, an asset URL pointing at kinsta-cdn. Network-layer evidence can still confirm or disambiguate (I do use it for the live-audit path) but it is never the first thing I reach for when the WARC already contains the answer.
The code path is server/audit/hosting-detect.ts for the ~25 hard-coded provider signatures and server/audit/html-scanner.ts for the per-plugin and per-theme fingerprint matcher that runs against the same HTML. They share the input (WAT + a HEAD-style slice of HTML) and produce independent outputs (provider hit vs plugin/theme hit) that the aggregation step joins on the domain. That separation is deliberate: plugin and provider detection share zero rule state, so changing one can't quietly regress the other.
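The join step can be sketched roughly as follows. The type and function names here (ProviderHit, PluginHit, joinByDomain) are illustrative, not the actual types in the codebase; the point is that the two detectors emit independent streams that only meet at the domain key.

```typescript
// Hypothetical shapes for the two independent outputs; the real types live in
// server/audit/hosting-detect.ts and server/audit/html-scanner.ts.
interface ProviderHit { domain: string; provider: string; tier: 'origin' | 'edge' }
interface PluginHit { domain: string; slug: string; confidence: 'high' | 'medium' }

// Group both streams by domain; neither detector reads the other's rule state,
// so the join is the only place their outputs interact.
function joinByDomain(providers: ProviderHit[], plugins: PluginHit[]) {
  const byDomain = new Map<string, { providers: ProviderHit[]; plugins: PluginHit[] }>();
  const bucket = (d: string) => {
    let b = byDomain.get(d);
    if (!b) { b = { providers: [], plugins: [] }; byDomain.set(d, b); }
    return b;
  };
  for (const p of providers) bucket(p.domain).providers.push(p);
  for (const p of plugins) bucket(p.domain).plugins.push(p);
  return byDomain;
}
```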
Definitive vs supplementary signals
The split between definitive and supplementary signals is the single most important design choice in the rule engine, because it's what keeps false positives from exploding as the corpus grows. A CSS class named wp-block-group is consistent with many plugins but distinctive to none. Treating it as enough to publish a detection would let any inherited utility class quietly attribute the wrong plugin to millions of sites.
So supplementary signals only count when a definitive signal has already anchored the match. Asset-path and generator are the anchors because both are effectively ground truth: a URL path under /wp-content/plugins/acme/ means the plugin "acme" is installed, and the generator meta names itself explicitly. Everything else is scored only to decide whether a match is high or medium confidence, not whether the match exists at all.
Confidence tiers change what downstream consumers do, not whether the match is recorded. A high detection drives the plugin-is-installed badge on a site's audit page and counts toward the ecosystem-scale adoption statistics. A medium detection still gets recorded, but only flows into the "likely detected" badge and is excluded from the headline prevalence numbers so the statistics don't drift toward false positives at the long tail.
The WordPress-core REST namespaces - wp, oembed, batch - are filtered out of the REST-namespace rule family explicitly (server/audit/html-scanner.ts, lines 407-410) so core namespaces don't get mistaken for plugin-specific REST routes. Small detail, outsized impact on false-positive rate.
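The anchor-then-promote logic can be sketched like this. It is a simplification: naive case-sensitive substring matching, an abridged set of rule families, and illustrative names throughout; the real matcher in server/audit/html-scanner.ts is richer.

```typescript
// Abridged fingerprint shape; fields mirror the JSON rows shown later on
// this page, but this is a sketch, not the production type.
interface Fingerprint {
  slug: string;
  asset_paths: string[];        // definitive
  generator_patterns: string[]; // definitive
  css_classes: string[];        // supplementary
  rest_endpoints: string[];     // supplementary
  js_globals: string[];         // supplementary
}

// WordPress-core namespaces that must never count as plugin evidence.
const CORE_REST = ['/wp-json/wp/', '/wp-json/oembed/', '/wp-json/batch/'];

function matchFingerprint(
  html: string,
  fp: Fingerprint,
): { slug: string; confidence: 'high' | 'medium' } | null {
  const hit = (patterns: string[]) => patterns.some((p) => html.includes(p));
  // Definitive signals anchor the match; with none, nothing is recorded.
  const anchors = [hit(fp.asset_paths), hit(fp.generator_patterns)].filter(Boolean).length;
  if (anchors === 0) return null;
  // Core REST namespaces are filtered before they can score.
  const restSignals = fp.rest_endpoints.filter(
    (e) => !CORE_REST.some((ns) => e.startsWith(ns)),
  );
  // Supplementary signals only promote the tier, never create a match.
  const supplementary = hit(fp.css_classes) || hit(restSignals) || hit(fp.js_globals);
  const confidence = anchors >= 2 || supplementary ? 'high' : 'medium';
  return { slug: fp.slug, confidence };
}
```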
What a real rule entry looks like
The following are abridged but representative entries from the running corpus. Host signatures are hand-curated TS arrays; plugin / theme fingerprints are LLM-generated JSON rows.
```ts
{
  id: 'kinsta',
  label: 'Kinsta',
  tier: 'origin',
  patterns: [
    'x-kinsta-cache', // header key
    'kinsta.cloud',   // CDN host
    'kinsta.com',     // asset host
    'kin-cdn',        // asset CDN prefix
  ],
}
```

origin means this attributes the site as hosted on Kinsta, not merely proxied through one of Kinsta's products.

```ts
{
  id: 'cloudflare',
  label: 'Cloudflare',
  tier: 'edge', // not attributed as origin host
  patterns: [
    'cf-ray',               // canonical edge header
    'cloudflare',           // generic marker
    'cdnjs.cloudflare.com', // asset CDN
  ],
}
```

edge means a Cloudflare hit is recorded as a CDN layer and never replaces an origin-tier match. The same site can have a Cloudflare edge hit and a Kinsta origin hit simultaneously.

```ts
{
  slug: 'woocommerce',
  asset_paths: [
    '/wp-content/plugins/woocommerce/assets/',
  ],
  generator_patterns: [
    '<meta name="generator" content="WooCommerce',
  ],
  css_classes: ['woocommerce', 'wc-block-grid'],
  rest_endpoints: ['/wp-json/wc/'],
  js_globals: ['wc_add_to_cart_params'],
}
```

asset_paths and generator_patterns are definitive - either alone triggers a detection. css_classes, rest_endpoints, and js_globals are supplementary and only promote the confidence tier to high once a definitive signal has fired.

```ts
{
  slug: 'wordpress-seo',
  asset_paths: [
    '/wp-content/plugins/wordpress-seo/',
  ],
  generator_patterns: [
    'Yoast SEO v',
  ],
  html_comments: [
    '<!-- This site is optimized with the Yoast',
  ],
  html_attributes: ['data-yoast-schema-graph'],
}
```

Having both definitive families (asset_paths and generator_patterns) here means confidence lands at high even on heavily-firewalled sites.

Edge, origin, and the Cloudflare problem
The interesting ambiguity in provider detection is Cloudflare in front of a real origin. A site proxied through Cloudflare emits Cloudflare signals (cf-ray, cdnjs.cloudflare.com asset URLs, server: cloudflare) in the response headers, but the origin host is usually leaking alongside in other fields - a WP Engine wpe-heartbeat header, a Kinsta cache-status line, an asset-URL still pointing at the origin CDN.
A naïve detector picks whichever signal fires first and attributes the site to that one provider. That's wrong twice over: it undercounts CDNs (because sites where origin also leaks are attributed only to the origin) and it undercounts origin hosts (because sites where Cloudflare signals fire first are attributed only to Cloudflare). Either way the hosting-mix statistics are systematically biased.
So the detector records both, not one. Cloudflare is tagged as an edge-tier hit; the origin provider is tagged as the origin-tier hit. The per-site table keeps them as separate rows with an explicit tier column so downstream queries can aggregate by either. "Sites on Kinsta" and "sites behind Cloudflare" answer different questions and the storage reflects that.
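A minimal sketch of the per-site resolution, assuming one provider per tier (a simplification) and illustrative names; the real table schema lives downstream of the detector.

```typescript
// Hypothetical row shape for the per-site hosting table; tier is an explicit
// column so edge and origin hits for the same domain coexist as separate rows.
type Tier = 'origin' | 'edge';
interface HostingRow { domain: string; provider: string; tier: Tier }

// Resolve all signature hits for one site into rows, keeping the first hit
// per tier instead of letting the globally-first match win.
function resolveHits(domain: string, hits: { provider: string; tier: Tier }[]): HostingRow[] {
  const byTier = new Map<Tier, string>();
  for (const h of hits) {
    if (!byTier.has(h.tier)) byTier.set(h.tier, h.provider);
  }
  return [...byTier.entries()].map(([tier, provider]) => ({ domain, provider, tier }));
}
```

Downstream queries can then aggregate by either tier: "sites on Kinsta" filters on origin rows, "sites behind Cloudflare" filters on edge rows, and the same domain can appear in both answers.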
The 25 provider signatures in server/audit/hosting-detect.ts cover the hosts I see most often in practice - Cloudflare, WP Engine, Kinsta, Cloudways, SiteGround, Pressable, Flywheel, DreamHost, and the long tail behind them. Each signature is an array of byte-literal patterns matched against headers and HTML, with no regex compilation overhead; at CC-batch scale even a 1% per-site cost compounds into hours of wall-clock.
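The hot loop reduces to substring scans over one lowercased haystack; this sketch assumes patterns are stored lowercase and uses illustrative names, not the actual shape of hosting-detect.ts.

```typescript
// Hypothetical signature shape mirroring the hand-curated pattern arrays.
interface ProviderSignature { id: string; tier: 'origin' | 'edge'; patterns: string[] }

// Plain String.includes scans: no regex compilation or backtracking, which
// matters when the same ~25 signatures run against every crawled response.
function detectProviders(
  headers: Record<string, string>,
  html: string,
  signatures: ProviderSignature[],
): { id: string; tier: 'origin' | 'edge' }[] {
  // One lowercased haystack: header names, header values, and the HTML slice.
  const haystack = (
    Object.entries(headers).map(([k, v]) => `${k}: ${v}`).join('\n') + '\n' + html
  ).toLowerCase();
  return signatures
    .filter((sig) => sig.patterns.some((p) => haystack.includes(p)))
    .map((sig) => ({ id: sig.id, tier: sig.tier }));
}
```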
Where the rules actually come from
Plugin and theme fingerprints are not written by hand. Writing them by hand for ~55,000 plugins is not a thing you do. Fingerprints are generated by an LLM pass over each plugin's source - a dedicated small-LLM research job in server/analysis/gemini/fingerprint-gen.ts that reads the plugin's PHP entry file, its readme, any shipped CSS/JS, and emits a structured JSON object with the eight rule families (asset paths, generator patterns, css classes, html comments, html attributes, js globals, shortcode output, rest endpoints).
The output lands in two SQLite tables that the detector loads at request time: plugin_fingerprints_wat (signals visible in the WAT-only narrowing phase) and plugin_fingerprints_direct (richer signals only visible once the WARC HTML is available). Each row is one plugin or theme slug; the fields are JSON-in-TEXT arrays parsed into typed structures in server/audit/html-scanner.ts (lines 84-163).
Why two tables instead of one? The WAT-first narrowing described on the CommonCrawl pipeline page only sees HTTP metadata and a slice of header-ish HTML, not the full response body. That phase can match on asset-path signals but can't usefully match on, say, a shortcode output pattern - those need the rendered DOM. Splitting the tables means the WAT pass never carries fingerprints it can't use, and the WARC pass never re-evaluates signals the WAT pass already ruled on.
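Turning those rows into typed structures is mechanical; this sketch assumes rows as returned by any SQLite driver, with abridged columns and hypothetical names (the real parsing is in server/audit/html-scanner.ts, lines 84-163).

```typescript
// A raw row from plugin_fingerprints_wat or plugin_fingerprints_direct:
// each rule family is a JSON array serialized into a TEXT column.
interface RawRow { slug: string; asset_paths: string | null; css_classes: string | null }
interface ParsedFingerprint { slug: string; assetPaths: string[]; cssClasses: string[] }

function parseFingerprintRows(rows: RawRow[]): ParsedFingerprint[] {
  return rows.map((r) => ({
    slug: r.slug,
    // NULL means the generation pass emitted no signals for that family,
    // so it parses to an empty array rather than failing.
    assetPaths: JSON.parse(r.asset_paths ?? '[]') as string[],
    cssClasses: JSON.parse(r.css_classes ?? '[]') as string[],
  }));
}
```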
Provider signatures are different - they're hand-curated because there are only 25 of them and the long tail of hosting providers is much smaller than the long tail of WordPress plugins. Each signature sits as a plain TS array of byte-pattern strings; adding a new provider is a 3-line patch plus a regression test against a known site.
Cases that shipped as bugs and then got rules
Each of these was a real misclassification I caught while reviewing per-provider pages. Each produced one rule change; each is worth writing down because the class of mistake generalises.
- Overcount: generic server: nginx attributed as a provider. An early signature included nginx as a weak provider marker. That's just a web server. Tens of thousands of sites got a spurious "nginx-hosted" tag. Fix: removed the rule; provider signatures now require a provider-specific token, not a web-server identifier.
- Undercount: Cloudflare hits masking Kinsta origins. Before the edge-vs-origin split, whichever rule matched first won. Cloudflare rules are common and fire early; actual origin hosts behind CF ended up undercounted by a material percentage. Fix: tiered the corpus (edge vs origin) so multiple hits on the same site coexist by layer.
- False positive: WP-core REST namespaces picked up by plugin rules. A plugin fingerprint's REST-namespace rule was generated with /wp-json/wp/ as a signal. That's the WordPress core namespace, present on every WP site. Fix: the scanner explicitly filters wp, oembed, and batch namespaces out of any REST-signal match (html-scanner.ts, lines 407-410); the fingerprint generator's post-processor rejects them at write time.
- Version drift: plugin that renamed its asset directory between major versions. A fingerprint pinned to /assets/v1/ stopped matching after the plugin shipped a v2 that moved assets to /build/. Detection silently fell to zero on an otherwise healthy plugin. Fix: every fingerprint carries a generated-at timestamp and a plugin-version pin; a staleness query flags corpus rows whose last match is more than N weeks old, and those get regenerated against the latest source.
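The staleness check from the last fix can be sketched as follows; the row shape and field names are hypothetical, and the window is the configurable N weeks from the text.

```typescript
// Hypothetical corpus row: when this fingerprint last matched anything.
interface CorpusRow { slug: string; lastMatchAt: number /* epoch ms */ }

const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

// Fingerprints whose last match predates the cutoff are queued for
// regeneration against the plugin's latest source.
function staleSlugs(rows: CorpusRow[], now: number, maxAgeWeeks: number): string[] {
  const cutoff = now - maxAgeWeeks * WEEK_MS;
  return rows.filter((r) => r.lastMatchAt < cutoff).map((r) => r.slug);
}
```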
What this approach misses
Unknown origins are honest, not failures. A real fraction of sites genuinely have no provider-identifying signal in their response. Attributing those to "Other" or "Self-hosted" is the correct answer; guessing would pollute the stats. The aggregation step carries the unmatched percentage explicitly on every provider page and every ecosystem chart on this site.
Adversarial stripping is rare but real. A hosting provider that actively rewrites outbound responses to look "neutral" - removing vanity headers, rewriting cache keys, stripping asset URLs - becomes indistinguishable from any other nginx in front of a PHP origin. There's no in-band defence against this; it shows up as a systematic undercount for that provider specifically and a matching overcount of "unknown origin" in the same bucket.
Version of the corpus matters. Fingerprints generated for plugin version 3.4 may or may not match HTML emitted by plugin version 3.7. The pipeline retains the corpus version used for each observation so "this fingerprint stopped matching after plugin version X" is a query the dataset can answer, not a silent drift.
Prior art and related work
- Prior art: Wappalyzer
The closest large-scale open fingerprint corpus. Wappalyzer indexes 2,500+ technologies across many categories; my corpus is narrower (WordPress plugins, themes, hosting providers) but orders of magnitude denser on each. The rule-family split (categories, implies, excludes, confidence) informed my tier design; the differences are mostly about scale and the multi-signal resolution story.
- Prior art: BuiltWith technology profiles
Commercial, methodology largely unpublished. Useful for directional comparison on coverage; not useful as a methodology reference because the ruleset and scoring are closed. One of the reasons I wrote this page in the first place - the ecosystem-scale WordPress hosting-mix dataset I produce should be methodologically transparent, even if competing commercial datasets aren't.
- Data: CommonCrawl overview (for context)
CC publishes host-level aggregates per crawl without any hosting-provider dimension. This pipeline joins those aggregates with the fingerprint matcher output to produce "N% of WordPress sites are on provider X" statistics that are otherwise not available publicly.
- Planned arXiv preprint: "Passive hosting attribution at web-crawl scale: methodology and dataset"
Work in progress. The methodology on this page is the first draft of the paper's Methods section; the dataset described here will be cited as the reproducibility artifact. Link will land here when the preprint goes up.
Adjacent deep-dives
- The upstream pipeline that feeds this detector: WAT-first narrowing, byte-range WARC fetches, per-worker SQLite shards merged into the DuckDB analytical store.
- The overall detection methodology page covering plugin and theme detection; this provider page focuses specifically on the hosting-layer rules and confidence resolution.
- Browse the output dataset: every provider page on this site is generated from the detection dataset this pipeline produces.
