Deep dive

CommonCrawl pipeline

How I turn a multi-petabyte public web archive into a per-domain map of installed WordPress plugins - without downloading the whole archive.

What it is

Narrowing a petabyte into a shopping list

Per-crawl economics
  • Historical CC corpus (all crawls): petabyte scale
  • Single crawl, WARC: ~hundreds of TB
  • Single crawl, WAT: ~tens of TB
  • URLs per crawl: ~3+ billion
  • Bytes we actually move: tens of GB
  • Wall-clock per crawl: hours, not days
  • Parallelism (WAT): N workers per VM
  • Parallelism (WARC): 500 concurrent fetches
Exact per-crawl figures drift month to month; these are the order-of-magnitude numbers. The "petabyte scale" line is the full historical CC corpus across many years of crawls, not any single crawl. What matters for this pipeline is the ratio - a naïve "download everything" approach for one crawl would need ~1000× more bytes than the narrowed pipeline actually transfers.

CommonCrawl publishes a free monthly snapshot of the public web. A single recent crawl is roughly 3+ billion URLs packaged into two parallel archive formats: WARC (the full HTTP response - headers + body, running to hundreds of TB compressed) and WAT (a compact JSON index extracted from each WARC - headers, links, metadata, but no body, still in the tens of TB). The full historical corpus across every crawl since 2008 is petabyte-scale.

Downloading a full crawl's WARC archive from any generic cloud network is economically infeasible - the egress bill and the time-to-completion both grow linearly with bytes moved. So the question isn't "how do I download the archive?" The question is "how do I download as few bytes as possible while still answering the detection question?"

Our answer has three moves. First, scan the WAT archive end-to-end and identify only the domains that actually run WordPress - which is a small fraction of the crawl by URL count. Second, build a per-WARC-file record index: for each matched page, the exact WARC filename, byte offset, and length. Third, use HTTP Range requests through a co-located signing proxy to fetch only those byte ranges. The result is a pipeline that processes a full crawl in hours on a single VM instead of weeks across a fleet, and moves tens of GB of bytes instead of tens of TB.

How it works

Three phases, one VM

The pipeline runs on a single ephemeral scanner VM, which is provisioned from a snapshot at the start of a crawl and destroyed on completion. State is local to the VM until the final upload step.
Phase 1 - narrow (no contention)
  • Fetch WAT path list: ~80–90k segments, one file per URL batch
  • N parallel workers: each claims segments, streams + scans in memory
  • Fingerprint match: plugin asset paths, generator tags, host signals
Phase 2 - target (sequential reads)
  • Coordinate clustering: group matches by WARC filename; sort by offset
  • Signing proxy: co-located tiny service issues presigned S3 URLs
  • HTTP Range fetches: 500-way concurrent; only the bytes we need
Phase 3 - merge (small footprint)
  • Merge into DuckDB: columnar → dedup + compress across workers
  • Provider aggregation: hosting mix per plugin, version distribution per site
  • POST to main DB: batched observations to /api/internal/observations
The trick

Why WAT-first saves terabytes

CC-MAIN-…/segment-00037.warc.gz · 1 WARC file · ~1.2 GB
[Byte-range map of one WARC file: skipped (not WordPress) vs range-fetched (WordPress pages) · naïve pull 1.2 GB · narrowed pull ~18 MB]
…and the same pattern across every WARC file in the crawl
Each cyan mark is a WordPress-page byte range the pipeline range-fetches; the hatched background is every other byte in the archive, left alone. The same sparse pattern repeats across ~80-90k WARC files per crawl.

A WAT file is small. It holds the HTTP response headers and a JSON-encoded metadata summary for every page in the matching WARC segment, but not the response bodies. WordPress leaves distinctive fingerprints in the metadata - the X-Powered-By header, <meta name="generator"> tags, asset URLs under /wp-content/plugins/ - all of which appear in the WAT record without needing the body.

Scanning the full WAT archive end-to-end is therefore both cheap and tractable: streaming decompression, no full-body parsing, and most pages can be rejected in the first few hundred bytes. That single pass produces a list of domains that run WordPress along with the specific WARC records where their WordPress pages live - byte offset + length resolved directly out of the WAT metadata.

That list is the whole point. Instead of downloading gigabytes of WARCs to find the one page we care about, we now have a precise shopping list: "from WARC file X, we want bytes 84,019,234 through 84,058,001; from file Y, we want bytes 1,120,044,500 through 1,120,071,800; etc."

CommonCrawl's S3 bucket supports HTTP Range requests, so the WARC phase fetches each record by byte offset. We cluster the requests by WARC filename and sort by offset before firing, which turns scattered seeks into sequential reads at the origin - meaningfully faster than the random-access pattern you'd get from fetching in discovery order. A single crawl ends up moving tens of GB of actual bytes to reconstruct the full-body HTML of every WordPress page in the corpus, out of a per-crawl WARC archive measured in hundreds of terabytes.
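The clustering step and the Range header construction are small enough to sketch directly. Function names are illustrative; the byte figures in the test mirror the shopping-list example above, with the record length being an assumed value:

```python
from collections import defaultdict

def cluster_requests(matches: list) -> dict:
    """Group (warc_filename, offset, length) hits by file, sorted by offset,
    so the fetches against each WARC file become near-sequential reads."""
    by_file = defaultdict(list)
    for filename, offset, length in matches:
        by_file[filename].append((offset, length))
    return {f: sorted(spans) for f, spans in by_file.items()}

def range_header(offset: int, length: int) -> str:
    """HTTP Range requests (RFC 9110) use inclusive byte positions."""
    return f"bytes={offset}-{offset + length - 1}"
```

Sorting by offset inside each file is the whole trick: S3 serves the clustered requests as a forward scan through the object rather than random seeks.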

The egress problem

A 49-line signing proxy

Option A - scanner in the bucket's cloud
Run the big VM next to the data: Scanner VM (expensive compute; multi-core, hours of work) → intra-cloud · free → S3 bucket (CommonCrawl archive).
Compute: $$$ · Egress: free. Big-cloud CPU is the most expensive line item at this volume.
Option B - scanner elsewhere, direct pull
Cheaper compute, but reach across networks: Scanner VM (cheaper cloud) → throttled + billed egress · cross-network · $/GB → S3 bucket (CommonCrawl archive).
Compute: $ · Egress: $$$. 500-way concurrent fetches trigger throttling; wall-clock balloons from hours to days, and egress overtakes the compute savings.
Option C - split the roles: tiny signer + cheap scanner
Best of both worlds, via a 49-line signing service. Cheap compute + free egress:
  1. Scanner VM (cheaper cloud, plain HTTPS pulls from S3) asks the signer for signed URLs
  2. Signer (49 lines, holds creds) returns a presigned URL batch
  3. Scanner fetches from the S3 bucket (CommonCrawl archive): intra-cloud · free
Compute: $ · Egress: free · Signer: ¢. The scanner pulls directly from S3 via HTTPS - no SDK, no credentials, no region-specific retry code. The signer's bill is rounding error.

The CommonCrawl archive lives in S3. Pulling from S3 at the volume and concurrency this pipeline needs is a three-way tradeoff between where the scanner VM runs, what it costs to compute, and what it costs to move bytes. Two naïve options fail on different axes:

Option A - scanner runs in the same cloud as the bucket. Zero egress cost, fast intra-region network, credentials are trivial to get. But compute pricing for large multi-core CPU instances in that cloud is the most expensive of the major providers. A VM big enough to process a full crawl in hours costs meaningfully more per hour there than on alternative hosts.

Option B - scanner runs on a cheaper cloud and fetches from S3 directly. Compute is cheap. But cross-network S3 egress rate-limits, throttles, and bills per-GB. A 500-concurrent fetch pattern against the bucket from an outside cloud triggers throttling long before the scanner's NIC is saturated; wall-clock balloons from hours to days; the egress line-item alone overtakes the compute savings.

Option C - split the roles. A tiny purpose-built signing service (cc-s3-signer.js, 49 lines of Express) runs inside the bucket's cloud and provider. It holds the credentials, exposes a single POST /sign-batch endpoint, and returns a batch of presigned S3 URLs valid for one hour. That's the whole service.

The heavy scanner VM can then live on whichever cheaper cloud wins on compute price. It asks the signer for a batch of URLs, then fetches them directly from S3 over HTTPS - presigned URLs let the scanner pull without any credentials, SDK, or region-specific retry code.
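To make "presigned URL" concrete, here is the SigV4 query-string signing that produces one, written out in stdlib Python. This is a sketch of the signing math, not the actual cc-s3-signer.js (which is Express and presumably delegates this to the AWS SDK); bucket, region, and credential values are placeholders:

```python
import datetime
import hashlib
import hmac
from urllib.parse import quote

def presign_get(bucket, key, access_key, secret_key,
                region="us-east-1", expires=3600, now=None):
    """Build a SigV4 query-signed GET URL valid for `expires` seconds."""
    now = now or datetime.datetime.utcnow()
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    datestamp = now.strftime("%Y%m%d")
    host = f"{bucket}.s3.{region}.amazonaws.com"
    scope = f"{datestamp}/{region}/s3/aws4_request"
    params = {
        "X-Amz-Algorithm": "AWS4-HMAC-SHA256",
        "X-Amz-Credential": f"{access_key}/{scope}",
        "X-Amz-Date": amz_date,
        "X-Amz-Expires": str(expires),
        "X-Amz-SignedHeaders": "host",
    }
    # Query string must be sorted and fully URL-encoded before signing.
    qs = "&".join(f"{quote(k, safe='')}={quote(v, safe='')}"
                  for k, v in sorted(params.items()))
    canonical = "\n".join(["GET", "/" + quote(key, safe="/"), qs,
                           f"host:{host}\n", "host", "UNSIGNED-PAYLOAD"])
    to_sign = "\n".join(["AWS4-HMAC-SHA256", amz_date, scope,
                         hashlib.sha256(canonical.encode()).hexdigest()])
    # Derive the signing key: HMAC chain over date, region, service.
    sig_key = f"AWS4{secret_key}".encode()
    for part in (datestamp, region, "s3", "aws4_request"):
        sig_key = hmac.new(sig_key, part.encode(), hashlib.sha256).digest()
    signature = hmac.new(sig_key, to_sign.encode(), hashlib.sha256).hexdigest()
    return f"https://{host}/{quote(key, safe='/')}?{qs}&X-Amz-Signature={signature}"
```

A POST /sign-batch endpoint is then just this function mapped over a list of keys. The point of the shape is visible in the output: everything the scanner needs is in the URL itself, so the fetch side needs no SDK and no credentials.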

The economic result: pennies of compute for a stateless signer running 24/7 in the expensive cloud, plus hours of cheap big compute where it's a fraction of the price, minus the egress trap that sinks Option B. The signer is tiny enough that even in the most expensive region, its bill is rounding error relative to the full-crawl compute cost on the scanner.

Parallelism pattern

Per-worker SQLite, merged to one DuckDB

Per-worker files converge into one columnar store
  • worker-1.sqlite through worker-6.sqlite: ~8 GB each
  • Merge pass: union · dedup · re-encode
  • observations.duckdb: ~200 MB
On-disk footprint before and after the merge
  • Before: 6 × per-worker SQLite, ~50 GB
  • After: 1 × merged DuckDB, ~200 MB
  • ~250× smaller on-disk
Sizes and worker count are illustrative - actual figures vary per crawl and per VM size. The ratio holds: columnar compression plus cross-worker deduplication shrinks the on-disk footprint by roughly two orders of magnitude, and analytical queries against the merged store run meaningfully faster than they would against a union-view over the raw SQLite files.

WAT scanning is embarrassingly parallel: each segment can be processed independently, and the output is append-only observations (plugin X detected on domain Y with signal Z). The naïve approach - one shared SQLite database, N writer processes - deadlocks and slows to a crawl under contention. Even with WAL mode, many concurrent writers fighting for the same file is the wrong pattern.

So each worker gets its own SQLite file. Zero write contention during the scan; each worker writes its local store at full speed. This turns the parallelism cost from "tune busy-timeouts and pray" to "scale workers linearly until you saturate the NIC or the signing proxy." The only coordination is at segment-claim time (which is cheap - a single row update to mark a segment as "taken").
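The "single row update" claim step can be sketched with stdlib sqlite3. The segments table schema here is a hypothetical sketch, not the pipeline's actual schema:

```python
import sqlite3

def claim_next_segment(conn: sqlite3.Connection, worker_id: str):
    """Atomically claim one unclaimed WAT segment for this worker."""
    while True:
        row = conn.execute(
            "SELECT path FROM segments WHERE claimed_by IS NULL LIMIT 1"
        ).fetchone()
        if row is None:
            return None  # every segment is claimed; the scan is done
        # The guarded UPDATE is the claim: it only succeeds if the row
        # is still unclaimed when we get there.
        cur = conn.execute(
            "UPDATE segments SET claimed_by = ? "
            "WHERE path = ? AND claimed_by IS NULL",
            (worker_id, row[0]),
        )
        conn.commit()
        if cur.rowcount == 1:
            return row[0]
        # Another worker won the race for this row; try the next one.
```

Workers loop on this until it returns None; contention is confined to this one cheap statement, while all observation writes go to each worker's private file.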

When the scan completes, there are typically a few dozen per-worker SQLite files on disk. Each contains its own view of the crawl - overlapping in some places (the same plugin hit from different segments), disjoint in others. We want a single query-friendly dataset, ideally small enough to ship to the main server cheaply and query efficiently.

DuckDB is the merge target. The aggregation step reads every per-worker file, unions their observations, deduplicates by (domain, plugin_slug, crawl_id), and writes a single columnar DuckDB file. Column-store compression plus dedup collapse the footprint dramatically - a set of per-worker SQLites totaling many gigabytes will typically merge into hundreds of megabytes of DuckDB. Analytical queries against it (hosting mix per plugin, version distribution per site, top plugins by domain count) run orders of magnitude faster than they would against the raw SQLite union.
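The union-and-dedup core of that merge is easy to sketch. The real pipeline writes DuckDB; to stay stdlib-only this sketch unions into another SQLite file via ATTACH, which demonstrates the (domain, plugin_slug, crawl_id) dedup key but not the columnar compression. Table and column names are assumptions:

```python
import sqlite3

def merge_observations(worker_paths: list, out_path: str) -> sqlite3.Connection:
    """Union per-worker observation tables, deduplicating on the
    (domain, plugin_slug, crawl_id) key via the primary-key constraint."""
    out = sqlite3.connect(out_path)
    out.execute(
        "CREATE TABLE IF NOT EXISTS observations ("
        " domain TEXT, plugin_slug TEXT, crawl_id TEXT, signal TEXT,"
        " PRIMARY KEY (domain, plugin_slug, crawl_id))")
    for i, path in enumerate(worker_paths):
        out.execute(f"ATTACH DATABASE ? AS w{i}", (path,))
        # INSERT OR IGNORE drops rows whose dedup key already exists.
        out.execute(
            f"INSERT OR IGNORE INTO observations "
            f"SELECT domain, plugin_slug, crawl_id, signal "
            f"FROM w{i}.observations")
        out.commit()
        out.execute(f"DETACH DATABASE w{i}")
    return out
```

In the real merge the same SELECT-union feeds a DuckDB table instead, which is where the columnar re-encoding and the ~250× size collapse happen.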

That merged DuckDB is the artifact that gets uploaded back to the main infrastructure. Everything the public plugin pages render about live-site distribution comes out of this file.
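The upload itself is batched POSTs of observation rows to /api/internal/observations (the endpoint named in the phase list). A sketch of the batching helper, with the batch size as an assumed tuning knob:

```python
import json

def batch_payloads(observations: list, batch_size: int = 1000):
    """Chunk observation rows into JSON request bodies for
    POST /api/internal/observations. batch_size is illustrative."""
    for i in range(0, len(observations), batch_size):
        yield json.dumps({"observations": observations[i:i + batch_size]})
```

Batching keeps any single request body small and lets a failed POST be retried without resending the whole crawl's worth of observations.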

Ecosystem stats

Hosting provider detection at scale

Observed HTTP signals per domain
  • header · x-kinsta-cache: HIT → kinsta.header
  • header · server: nginx → noise
  • header · cf-ray: 84a7b2c91f4e-LHR → cloudflare.edge
  • header · content-type: text/html; charset=UTF-8 → noise
  • asset · <script src="/wp-content/cache/min/1/abc.js"> → noise
  • asset · <link href="//kinsta-cdn.com/assets/wp.css"> → kinsta.asset
Fingerprint matcher
  • Rules: ~240
  • Rule types: header · asset · CDN
  • Tiers: strong · edge · hint
Rules are curated by hand and versioned per crawl. Multi-signal agreement escalates confidence; a single edge-only match (Cloudflare, say) never overrides an origin signal.
Resolved providers for this domain
  • Kinsta · Strong
    Origin-side: header rule + asset-URL rule agree. Logged as primary host.
  • Cloudflare · Edge
    Edge-only signal (cf-ray). Recorded as CDN layer, not as origin host.
A worked example. Real crawls see thousands of sites per CDN / host combination; the same matcher runs unchanged against every one. Edge-only signals (Cloudflare, Sucuri) are recorded alongside the origin rather than replacing it, so sites hosted on Kinsta behind Cloudflare show up as both.

One of the harder parts of the pipeline is not about detecting plugins at all - it's figuring out where each WordPress site is actually hosted, from the thinnest possible HTTP signal. The WARC phase already gives me the headers, so if I can match a hosting provider directly off the headers and a handful of HTML patterns, I never have to touch anything heavier (no WHOIS lookups, no traceroutes, no active probing).

The fingerprint corpus for providers is curated by hand and grows with each crawl. A typical entry maps a distinctive header - X-Kinsta-Cache, X-Cache-Enabled on WPEngine, a CDN host pattern, a script URL under a provider's asset domain - to a provider name and a confidence tier. Each match is recorded per domain; the aggregation step produces "N% of sites running plugin X are on provider Y" statistics across the entire corpus. The provider detection deep-dive walks through the rule engine internals in full.

The hard part isn't writing one such rule. It's making the whole system behave sensibly across tens of millions of heterogeneous sites during a single crawl. Ambiguous matches have to resolve in the right direction; providers that share CDNs can't cannibalize each other's detection counts; edge cases like "site obviously behind Cloudflare but also hosted somewhere specific" need to record both rather than picking one at random.
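The tiered resolution described above can be sketched in miniature. These three rules are illustrative stand-ins for the ~240-rule corpus, and the rule shape is an assumption; the point is the origin-vs-edge split, where an edge-only hit never displaces an origin signal:

```python
# Illustrative rules; the real corpus is hand-curated and versioned per crawl.
RULES = [
    {"provider": "kinsta",     "tier": "strong", "kind": "header", "needle": "x-kinsta-cache"},
    {"provider": "kinsta",     "tier": "strong", "kind": "asset",  "needle": "kinsta-cdn.com"},
    {"provider": "cloudflare", "tier": "edge",   "kind": "header", "needle": "cf-ray"},
]

def resolve_providers(headers: dict, asset_urls: list) -> dict:
    """Collect per-provider tier hits, then split origin hosts from
    edge/CDN layers. Headers are assumed lowercased."""
    hits = {}
    for rule in RULES:
        matched = (rule["needle"] in headers if rule["kind"] == "header"
                   else any(rule["needle"] in u for u in asset_urls))
        if matched:
            hits.setdefault(rule["provider"], set()).add(rule["tier"])
    # A provider with any non-edge tier is an origin candidate;
    # edge-only matches are recorded as a CDN layer, never as the host.
    origin = [p for p, tiers in hits.items() if tiers - {"edge"}]
    edge = [p for p, tiers in hits.items() if tiers == {"edge"}]
    return {"origin": origin, "edge": edge}
```

Run against the worked example's signals, this yields Kinsta as origin and Cloudflare as an edge layer - both recorded, neither cannibalizing the other.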

Getting this right across a full crawl took real work - iterating on rule precedence, fingerprint corpus coverage, and the way multi-signal matches compose. The output is the only ecosystem-scale WordPress hosting-mix dataset I know of that's public and methodologically transparent. Every provider page on this site is generated from it.

Byte budget

The compression math, written out

Numbers are order-of-magnitude, tuned to a recent CC archive. They compound multiplicatively, which is why the pipeline can do real work on a reasonable VM budget instead of needing cluster scale.

1. Raw WARC corpus per crawl (if naïvely pulled in full): ~hundreds of TB. Out of scope financially and operationally.
2. WAT only, metadata (URL index + response-header slice): ~tens of TB. 10–20% of raw; ingestible but still heavy.
3. WP-relevant URLs, WAT-filtered (asset-path signals in headers/HTML slice): ~hundreds of GB. A small % of WAT; narrows the WARC targets.
4. WARC byte-range fetches (only the records that matter, pinpoint reads): ~tens of GB. HTTP Range on the subset; actual bytes moved.
5. Merged DuckDB observation store (columnar + dedup after the merge pass): ~hundreds of MB. Analytical queries run against this directly.

Each step multiplies the savings of the step before. WAT-first narrowing eliminates the long tail; byte-range fetches eliminate the within-WARC waste; the merge pass eliminates per-worker redundancy. The net effect is that what started as petabyte-scale input becomes a hundreds-of-MB analytical store an individual developer can reason about on a laptop.
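The compounding can be written out numerically. The concrete values below are illustrative midpoints for each order-of-magnitude band (300 TB standing in for "hundreds of TB", and so on), not measured figures:

```python
# Illustrative midpoints of the per-stage order-of-magnitude bands, in bytes.
TB, GB, MB = 10**12, 10**9, 10**6
stages = [
    ("raw WARC corpus",         300 * TB),
    ("WAT metadata only",        30 * TB),
    ("WP-relevant WAT slice",   300 * GB),
    ("WARC byte-range fetches",  30 * GB),
    ("merged DuckDB store",     300 * MB),
]
# Reduction factor contributed by each step, relative to the step before.
factors = [stages[i][1] / stages[i + 1][1] for i in range(len(stages) - 1)]
# The factors multiply: end-to-end, the pipeline handles ~a millionth
# of the bytes a naïve full pull would move.
overall = stages[0][1] / stages[-1][1]
```

With these midpoints the per-step factors are 10×, 100×, 10×, 100×, compounding to a millionfold reduction - which is the whole argument for why one VM suffices.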

Limits

What the pipeline can't see

  • Monthly cadence (by design): CC publishes one crawl per month; our stats lag accordingly.
  • Surface-web only (out of scope): auth-walled and paywalled sites are invisible to CC.
  • Snapshot fidelity (structural): a plugin enabled the day after a crawl is invisible until next month.

These are properties of CommonCrawl itself, not our pipeline on top of it. The pipeline inherits whatever coverage the underlying crawl has.

Monthly cadence. CommonCrawl publishes one major crawl per month. Our aggregate statistics on plugin hosting mix, version distribution, and prevalence therefore lag real-world adoption by up to ~30 days. Individual audits through the public form are live - the monthly cadence only affects the corpus-wide statistics.

Surface web only. CommonCrawl crawls what an unauthenticated visitor can see from the public web. Sites behind an authentication wall, a paywall, or intranet boundaries are invisible - both to CC and to this pipeline. WordPress installations inside corporate VPNs or private networks never appear in the statistics; in aggregate those are a small fraction but enterprise WordPress is probably underrepresented for this reason.

Snapshot drift. A site that enables a plugin the day after a CC crawl won't show up in the corpus until the next crawl picks it up. Over any single crawl, the view is a precise snapshot of a month-ago ecosystem - internally consistent, but a month behind the live web.

JS-only renders and rewritten asset paths are addressed on the main detection page - the CC pipeline inherits those same gaps.

Further reading

Standards & prior art

  • The CommonCrawl dataset & WARC format
    CommonCrawl's own documentation is the canonical reference for both the corpus and the WARC/WAT/WET file formats derived from it. WARC itself is standardized as ISO 28500:2017.
  • Parallel Crawlers - Cho & Garcia-Molina (WWW 2002)
    Classic paper on partitioning a crawl across independent workers without contention. The "per-worker independent store, merge later" pattern I use at the aggregation step maps cleanly onto the paper's page-partitioning scheme.
  • HTTP/1.1 Range Requests - RFC 7233 / RFC 9110
    The exact byte-range retrieval mechanism underneath every WARC fetch in the pipeline. Modern rewrite in RFC 9110 section 14 supersedes 7233 but the semantics are the same.
  • DuckDB: an embeddable analytical database - Raasveldt & Mühleisen (SIGMOD 2019)
    The paper describing the database we merge per-worker SQLites into. DuckDB's columnar storage + vectorized execution is what makes the cross-crawl aggregate queries cheap enough to run as part of the pipeline.

See the corpus in action.

Every plugin page shows a version-distribution chart and hosting mix - both produced by this pipeline.