CommonCrawl pipeline
How I turn a multi-petabyte public web archive into a per-domain map of installed WordPress plugins - without downloading the whole archive.
Narrowing a petabyte into a shopping list
- Historical CC corpus (all crawls): petabyte scale
- Single crawl, WARC: ~hundreds of TB
- Single crawl, WAT: ~tens of TB
- URLs per crawl: ~3+ billion
- Bytes we actually move: tens of GB
- Wall-clock per crawl: hours, not days
- Parallelism (WAT): N workers per VM
- Parallelism (WARC): 500 concurrent fetches
CommonCrawl publishes a free monthly snapshot of the public web. A single recent crawl is roughly 3+ billion URLs packaged into two parallel archive formats: WARC (the full HTTP response - headers + body, running to hundreds of TB compressed) and WAT (a compact JSON index extracted from each WARC - headers, links, metadata, but no body, still in the tens of TB). The full historical corpus across every crawl since 2008 is petabyte-scale.
Downloading a full crawl's WARC archive from any generic cloud network is economically infeasible - the egress bill and the time-to-completion both grow linearly with bytes moved. So the question isn't "how do I download the archive?" The question is "how do I download as few bytes as possible while still answering the detection question?"
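To make "grows linearly with bytes moved" concrete, here is a back-of-envelope sketch. The ~$0.09/GB cross-cloud egress rate and the two stage sizes are illustrative assumptions for the arithmetic, not quoted pricing:

```python
def egress_cost_usd(terabytes, price_per_gb=0.09):
    """Illustrative egress cost: bytes moved times an assumed per-GB rate.
    Real cloud pricing varies by provider, region, and volume tier."""
    return terabytes * 1024 * price_per_gb

# ~hundreds of TB for one crawl's full WARC vs ~tens of GB of range fetches
full_warc = egress_cost_usd(300)
range_only = egress_cost_usd(0.05)
print(f"full WARC: ${full_warc:,.0f}  vs  range fetches: ${range_only:,.2f}")
```

Both the dollar bill and the transfer time scale with the same byte count, which is why every later phase of the pipeline is organized around shrinking it.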
Our answer has three moves. First, scan the WAT archive end-to-end and identify only the domains that actually run WordPress - which is a small fraction of the crawl by URL count. Second, build a per-WARC-file record index: for each matched page, the exact WARC filename, byte offset, and length. Third, use HTTP Range requests through a co-located signing proxy to fetch only those byte ranges. The result is a pipeline that processes a full crawl in hours on a single VM instead of weeks across a fleet, and moves tens of GB of bytes instead of tens of TB.
Three phases, one VM
Why WAT-first saves terabytes
A WAT file is small. It holds the HTTP response headers and a JSON-encoded metadata summary for every page in the matching WARC segment, but not the response bodies. WordPress leaves distinctive fingerprints in the metadata - the X-Powered-By header, <meta name="generator"> tags, asset URLs under /wp-content/plugins/ - all of which appear in the WAT record without needing the body.
Scanning the full WAT archive end-to-end is therefore both cheap and tractable: streaming decompression, no full-body parsing, and most pages can be rejected in the first few hundred bytes. That single pass produces a list of domains that run WordPress along with the specific WARC records where their WordPress pages live - byte offset + length resolved directly out of the WAT metadata.
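The per-record check can be sketched as a small predicate over the WAT JSON. The field paths below follow the common WAT layout (Envelope → Payload-Metadata → HTTP-Response-Metadata), but exact keys can vary between crawls, so everything is accessed defensively; this is a sketch of the kind of test, not the pipeline's actual matcher:

```python
WP_ASSET_MARKER = "/wp-content/plugins/"

def looks_like_wordpress(wat_record):
    """Heuristic WordPress check against one parsed WAT record - no body needed."""
    env = wat_record.get("Envelope", {})
    meta = env.get("Payload-Metadata", {}).get("HTTP-Response-Metadata", {})

    # Signal 1: response headers, e.g. X-Powered-By mentioning WordPress
    headers = {k.lower(): v for k, v in meta.get("Headers", {}).items()}
    if "wordpress" in headers.get("x-powered-by", "").lower():
        return True

    html = meta.get("HTML-Metadata", {})

    # Signal 2: <meta name="generator" content="WordPress ...">
    for m in html.get("Head", {}).get("Metas", []):
        if (m.get("name", "").lower() == "generator"
                and "wordpress" in m.get("content", "").lower()):
            return True

    # Signal 3: any extracted link under /wp-content/plugins/
    for link in html.get("Links", []):
        if WP_ASSET_MARKER in link.get("url", ""):
            return True
    return False
```

Because the signals live in headers and extracted metadata, most records fail all three checks within the first few fields parsed, which is what makes the full-archive pass cheap.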
That list is the whole point. Instead of downloading gigabytes of WARCs to find the one page we care about, we now have a precise shopping list: "from WARC file X, we want bytes 84,019,234 through 84,058,001; from file Y, we want bytes 1,120,044,500 through 1,120,071,800; etc."
CommonCrawl's S3 bucket supports HTTP Range requests, so the WARC phase fetches each record by byte offset. We cluster the requests by WARC filename and sort by offset before firing, which turns scattered seeks into sequential reads at the origin - meaningfully faster than the random-access pattern you'd get from fetching in discovery order. A single crawl ends up moving tens of GB of actual bytes to reconstruct the full-body HTML of every WordPress page in the corpus, out of a WARC archive measured in hundreds of terabytes.
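The cluster-and-sort step is simple enough to show in full. A minimal sketch, assuming the WAT pass hands over (filename, offset, length) triples; the Range header format follows RFC 9110, where byte ranges are inclusive:

```python
from itertools import groupby

def plan_range_fetches(records):
    """records: iterable of (warc_filename, offset, length) from the WAT pass.
    Clusters by WARC file and sorts by offset so the origin sees sequential
    reads, then emits one (filename, Range header value) pair per record."""
    ordered = sorted(records, key=lambda r: (r[0], r[1]))
    plan = []
    for filename, group in groupby(ordered, key=lambda r: r[0]):
        for _, offset, length in group:
            # RFC 9110 ranges are inclusive: last byte is offset + length - 1
            plan.append((filename, f"bytes={offset}-{offset + length - 1}"))
    return plan
```

Each planned pair then becomes one GET with that `Range` header against the presigned URL for the corresponding WARC file.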
A 49-line signing proxy
The CommonCrawl archive lives in S3. Pulling from S3 at the volume and concurrency this pipeline needs is a three-way tradeoff between where the scanner VM runs, what it costs to compute, and what it costs to move bytes. Two naïve options fail on different axes:
Option A - scanner runs in the same cloud as the bucket. Zero egress cost, fast intra-region network, credentials are trivial to get. But that cloud's pricing for large multi-core CPU instances is the highest among the major providers. A VM big enough to process a full crawl in hours costs meaningfully more per hour there than on alternative hosts.
Option B - scanner runs on a cheaper cloud and fetches from S3 directly. Compute is cheap. But cross-network S3 traffic gets rate-limited and throttled, and egress is billed per GB. A 500-concurrent-fetch pattern against the bucket from an outside cloud triggers throttling long before the scanner's NIC is saturated; wall-clock balloons from hours to days; and the egress line-item alone overtakes the compute savings.
Option C - split the roles. A tiny purpose-built signing service (cc-s3-signer.js, 49 lines of Express) runs inside the bucket's cloud and provider. It holds the credentials, exposes a single POST /sign-batch endpoint, and returns a batch of presigned S3 URLs valid for one hour. That's the whole service.
The heavy scanner VM can then live on whichever cheaper cloud wins on compute price. It asks the signer for a batch of URLs, then fetches them directly from S3 over HTTPS - presigned URLs let the scanner pull without any credentials, SDK, or region-specific retry code.
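The actual signer is 49 lines of Express; the sketch below only shows the same request/response contract in Python, with the cloud SDK's presigner injected as a plain function (the `presign(key, ttl)` signature is a hypothetical stand-in, and the batch cap is an assumed guardrail):

```python
import json

def handle_sign_batch(body, presign, max_batch=1000, ttl_seconds=3600):
    """Sketch of the POST /sign-batch contract: accept a batch of S3 keys,
    return presigned URLs valid for one hour. `presign` stands in for the
    SDK call that holds the actual credentials."""
    keys = json.loads(body).get("keys", [])
    if len(keys) > max_batch:
        return 413, json.dumps({"error": "batch too large"})
    urls = [presign(key, ttl_seconds) for key in keys]
    return 200, json.dumps({"urls": urls, "expires_in": ttl_seconds})
```

Keeping the endpoint this thin is the point: the signer never proxies bytes, so its own bandwidth and CPU stay negligible while the scanner hammers S3 directly.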
The economic result: pennies of compute for a stateless signer running 24/7 in the expensive cloud, plus hours of cheap big compute where it's a fraction of the price, minus the egress trap that sinks Option B. The signer is tiny enough that even in the most expensive region, its bill is rounding error relative to the full-crawl compute cost on the scanner.
Per-worker SQLite, merged to one DuckDB
WAT scanning is embarrassingly parallel: each segment can be processed independently, and the output is append-only observations (plugin X detected on domain Y with signal Z). The naïve approach - one shared SQLite database, N writer processes - stalls on lock contention and slows to a crawl. Even with WAL mode, many concurrent writers fighting over the same file is the wrong pattern.
So each worker gets its own SQLite file. Zero write contention during the scan; each worker writes its local store at full speed. This turns the parallelism cost from "tune busy-timeouts and pray" to "scale workers linearly until you saturate the NIC or the signing proxy." The only coordination is at segment-claim time (which is cheap - a single row update to mark a segment as "taken").
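The segment-claim handshake can be done as a compare-and-swap style UPDATE, so a claim either succeeds atomically or the worker moves on to the next segment. A minimal sketch against an assumed coordination table (the real schema may differ):

```python
import sqlite3

def claim_segment(conn, worker_id):
    """Atomically claim one pending segment from a shared coordination DB.
    The conditional UPDATE is the compare-and-swap: it only succeeds if the
    row is still 'pending' when we get there."""
    while True:
        row = conn.execute(
            "SELECT id FROM segments WHERE status = 'pending' "
            "ORDER BY id LIMIT 1").fetchone()
        if row is None:
            return None  # nothing left to scan
        cur = conn.execute(
            "UPDATE segments SET status = 'taken', worker = ? "
            "WHERE id = ? AND status = 'pending'",
            (worker_id, row[0]))
        conn.commit()
        if cur.rowcount == 1:
            return row[0]  # we won the claim
        # another worker claimed it first - retry with the next segment
```

All observation writes then go to the worker's private SQLite file; this one-row update is the only write that ever crosses worker boundaries.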
When the scan completes, there are typically a few dozen per-worker SQLite files on disk. Each contains its own view of the crawl - overlapping in some places (the same plugin hit from different segments), disjoint in others. We want a single query-friendly dataset, ideally small enough to ship to the main server cheaply and query efficiently.
DuckDB is the merge target. The aggregation step reads every per-worker file, unions their observations, deduplicates by (domain, plugin_slug, crawl_id), and writes a single columnar DuckDB file. Column-store compression plus dedup collapse the footprint dramatically - a set of per-worker SQLites totaling many gigabytes will typically merge into hundreds of megabytes of DuckDB. Analytical queries against it (hosting mix per plugin, version distribution per site, top plugins by domain count) run orders of magnitude faster than they would against the raw SQLite union.
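The union-plus-dedup step can be sketched with the standard library's sqlite3 standing in for DuckDB - the production merge writes a columnar DuckDB file, but the SQL shape (attach each worker store, insert with dedup on the key) is analogous. The observation schema here is an assumption for illustration:

```python
import sqlite3

MERGED_DDL = """CREATE TABLE observations (
    domain TEXT, plugin_slug TEXT, crawl_id TEXT, signal TEXT,
    PRIMARY KEY (domain, plugin_slug, crawl_id))"""

def merge_worker_stores(worker_paths, merged_path=":memory:"):
    """Union every per-worker observation store into one database,
    deduplicating on (domain, plugin_slug, crawl_id) via the primary key."""
    merged = sqlite3.connect(merged_path)
    merged.execute(MERGED_DDL)
    for i, path in enumerate(worker_paths):
        merged.execute(f"ATTACH DATABASE ? AS w{i}", (path,))
        merged.execute(
            f"INSERT OR IGNORE INTO observations "
            f"SELECT domain, plugin_slug, crawl_id, signal "
            f"FROM w{i}.observations")
        merged.commit()
        merged.execute(f"DETACH DATABASE w{i}")
    return merged
```

In DuckDB the same pass additionally buys columnar compression, which is where most of the gigabytes-to-hundreds-of-megabytes collapse comes from.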
That merged DuckDB is the artifact that gets uploaded back to the main infrastructure. Everything the public plugin pages render about live-site distribution comes out of this file.
Hosting provider detection at scale
- Header x-kinsta-cache: HIT → kinsta.header
- Header server: nginx → noise
- Header cf-ray: 84a7b2c91f4e-LHR → cloudflare.edge
- Header content-type: text/html; charset=UTF-8 → noise
- Asset <script src="/wp-content/cache/min/1/abc.js"> → noise
- Asset <link href="//kinsta-cdn.com/assets/wp.css"> → kinsta.asset
- Rules: ~240
- Rule types: header · asset · CDN
- Tiers: strong · edge · hint
- Kinsta (strong): origin-side - header rule and asset-URL rule agree. Logged as primary host.
- Cloudflare (edge): edge-only signal (cf-ray). Recorded as CDN layer, not as origin host.
One of the harder parts of the pipeline is not about detecting plugins at all - it's figuring out where each WordPress site is actually hosted, from the thinnest possible HTTP signal. The WARC phase already gives me the headers, so if I can match a hosting provider directly off the headers and a handful of HTML patterns, I never have to touch anything heavier (no WHOIS lookups, no traceroutes, no active probing).
The fingerprint corpus for providers is curated by hand and grows with each crawl. A typical entry maps a distinctive header - X-Kinsta-Cache, X-Cache-Enabled on WPEngine, a CDN host pattern, a script URL under a provider's asset domain - to a provider name and a confidence tier. Each match is recorded per domain; the aggregation step produces "N% of sites running plugin X are on provider Y" statistics across the entire corpus. The provider detection deep-dive walks through the rule engine internals in full.
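One pass of that rule engine might look like the sketch below. The three rules are stand-ins mirroring the examples above, not entries from the real ~240-rule corpus, and the tier names are assumptions; the point is the shape: every rule fires independently, and edge-tier hits are kept separate from origin-tier hits instead of competing with them:

```python
# Illustrative rules only - provider names and tiers mirror the examples above.
RULES = [
    {"provider": "kinsta", "type": "header", "tier": "strong",
     "match": lambda headers, html: "x-kinsta-cache" in headers},
    {"provider": "kinsta", "type": "asset", "tier": "strong",
     "match": lambda headers, html: "kinsta-cdn.com" in html},
    {"provider": "cloudflare", "type": "header", "tier": "edge",
     "match": lambda headers, html: "cf-ray" in headers},
]

def detect_hosting(headers, html_head):
    """Run every rule against one page's headers and head HTML.
    Origin-tier matches become primary-host candidates; edge-tier matches
    are recorded separately as the CDN layer, never as the origin."""
    lowered = {k.lower(): v for k, v in headers.items()}
    hits = [r for r in RULES if r["match"](lowered, html_head)]
    origin = {r["provider"] for r in hits if r["tier"] != "edge"}
    edge = {r["provider"] for r in hits if r["tier"] == "edge"}
    return {"origin": sorted(origin), "edge": sorted(edge)}
```

A Kinsta site fronted by Cloudflare therefore records both facts, which is exactly the "record both rather than picking one at random" behavior the edge cases demand.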
The hard part isn't writing one such rule. It's making the whole system behave sensibly across tens of millions of heterogeneous sites during a single crawl. Ambiguous matches have to resolve in the right direction; providers that share CDNs can't cannibalize each other's detection counts; edge cases like "site obviously behind Cloudflare but also hosted somewhere specific" need to record both rather than picking one at random.
Getting this right across a full crawl took real work - iterating on rule precedence, fingerprint corpus coverage, and the way multi-signal matches compose. The output is the only ecosystem-scale WordPress hosting-mix dataset I know of that's public and methodologically transparent. Every provider page on this site is generated from it.
The compression math, written out
Numbers are order-of-magnitude, tuned to a recent CC archive. They compound multiplicatively, which is why the pipeline can do real work on a reasonable VM budget instead of needing cluster scale.
Each step multiplies the savings of the step before. WAT-first narrowing eliminates the long tail; byte-range fetches eliminate the within-WARC waste; the merge pass eliminates per-worker redundancy. The net effect is that what started as petabyte-scale input becomes a hundreds-of-MB analytical store an individual developer can reason about on a laptop.
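Plugging illustrative midpoints of those order-of-magnitude figures into the chain makes the compounding visible (the byte counts are assumed round numbers matching the text, not measurements):

```python
# Assumed midpoints of the order-of-magnitude stage sizes quoted above.
stage_bytes = [
    ("WARC archive, single crawl", 300e12),  # ~hundreds of TB
    ("WAT archive scanned",         30e12),  # ~tens of TB
    ("bytes moved via Range",       50e9),   # ~tens of GB
    ("merged DuckDB artifact",     300e6),   # ~hundreds of MB
]
sizes = [s for _, s in stage_bytes]
step_factors = [sizes[i] / sizes[i + 1] for i in range(len(sizes) - 1)]
overall = sizes[0] / sizes[-1]
# Per-step reductions multiply into the end-to-end factor.
assert abs(overall - step_factors[0] * step_factors[1] * step_factors[2]) < 1e-6 * overall
print(f"end-to-end reduction: ~{overall:,.0f}x")
```

With these inputs the product works out to roughly a millionfold reduction, which is the gap between "needs a cluster" and "fits on one VM plus a laptop".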
What the pipeline can't see
These are properties of CommonCrawl itself, not our pipeline on top of it. The pipeline inherits whatever coverage the underlying crawl has.
Monthly cadence. CommonCrawl publishes one major crawl per month. Our aggregate statistics on plugin hosting mix, version distribution, and prevalence therefore lag real-world adoption by up to ~30 days. Individual audits through the public form are live - the monthly cadence only affects the corpus-wide statistics.
Surface web only. CommonCrawl crawls what an unauthenticated visitor can see from the public web. Sites behind an authentication wall, a paywall, or intranet boundaries are invisible - both to CC and to this pipeline. WordPress installations inside corporate VPNs or private networks never appear in the statistics; in aggregate those are a small fraction but enterprise WordPress is probably underrepresented for this reason.
Snapshot drift. A site that enables a plugin the day after a CC crawl won't show up in the corpus until the next crawl picks it up. Over any single crawl, the view is a precise snapshot of a month-ago ecosystem - internally consistent, but a month behind the live web.
JS-only renders and rewritten asset paths are addressed on the main detection page - the CC pipeline inherits those same gaps.
Standards & prior art
- The CommonCrawl dataset & WARC formatCommonCrawl's own documentation is the canonical reference for both the corpus and the WARC/WAT/WET file formats derived from it. WARC itself is standardized as ISO 28500:2017.
- Parallel Crawlers - Cho & Garcia-Molina (WWW 2002)Classic paper on partitioning a crawl across independent workers without contention. The "per-worker independent store, merge later" pattern I use at the aggregation step maps cleanly onto the paper's page-partitioning scheme.
- HTTP/1.1 Range Requests - RFC 7233 / RFC 9110The exact byte-range retrieval mechanism underneath every WARC fetch in the pipeline. Modern rewrite in RFC 9110 section 14 supersedes 7233 but the semantics are the same.
- DuckDB: an embeddable analytical database - Raasveldt & Mühleisen (CIDR 2020)The paper describing the database we merge per-worker SQLites into. DuckDB's columnar storage + vectorized execution is what makes the cross-crawl aggregate queries cheap enough to run as part of the pipeline.
Adjacent deep-dives
The rule engine that runs against the WARC-derived headers and HTML this pipeline produces. Definitive vs supplementary signals, tiered confidence, edge-vs-origin resolution.
Consumer of the plugin-version inventory this pipeline produces: AST-level inter-procedural data-flow tracking to surface unsanitized sink flows per plugin version.
See the corpus in action.
Every plugin page shows a version-distribution chart and hosting mix - both produced by this pipeline.
