Architecture

System architecture

The ingestion, analysis, detection, and delivery layers that make up WP-Safety, end-to-end. One diagram per layer, with the trade-offs that shaped each one called out explicitly.

What it is

Four layers, one pipeline

System at a glance
  • 55k+
    Plugins tracked
  • 8k+
    Themes tracked
  • 34k+
    CVEs indexed
  • 4.9B
    URLs per CC crawl
Core stack
  • Nuxt 3 · App + SSR
  • SQLite · WAL · Main DB
  • DuckDB · CC observations
  • LLM cascade · Research + PoC

The system is organized as four layers that run concurrently but are loosely coupled - each layer produces artifacts that the next one consumes, and each can be restarted independently when its inputs change. A failed LLM pass doesn't corrupt the CVE feed; a broken audit worker doesn't stop the CommonCrawl scanner.

Ingestion syncs from WordPress.org, the Wordfence Intelligence webhook, NVD, and the plugin SVN mirror. Analysis turns plugin source into AST taint flows, code signals, and LLM-annotated risk assessments. Detection scans the CommonCrawl corpus (and the occasional live site audit) to identify installed software. Delivery is the Nuxt app, the public API, and the Safety Radar WordPress plugin that ships the results to humans.

Data lives in three places that don't try to replicate each other: SQLite for the main product database (fast, simple, zero-ops), DuckDB for analytical queries over the CommonCrawl-derived observation dataset, and the filesystem for plugin source mirrors and generated artifacts. Each store owns the queries it's good at; nothing is kept in sync by hand.
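The "fast, simple, zero-ops" claim for the main store hinges on WAL mode: readers see a consistent snapshot while the single writer commits, which is what lets each ingestor write on its own schedule against a live read path. A minimal sketch of that setup, using Python's stdlib sqlite3 for illustration (the table shape here is an assumption, not the real schema):

```python
import sqlite3

# Open the main product DB in WAL mode: readers never block the
# single writer, so ingestors can commit while the app serves reads
# from the same file.
conn = sqlite3.connect("wp-safety.db")
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA synchronous=NORMAL")  # common pairing with WAL

conn.execute("""CREATE TABLE IF NOT EXISTS plugins (
    slug            TEXT PRIMARY KEY,
    active_installs INTEGER,
    fetched_at      TEXT
)""")

# An ingestor upserts a row; a concurrent reader sees a consistent
# snapshot without waiting for this write to land.
conn.execute(
    "INSERT INTO plugins VALUES (?, ?, datetime('now')) "
    "ON CONFLICT(slug) DO UPDATE SET active_installs=excluded.active_installs",
    ("akismet", 5_000_000),
)
conn.commit()
```

The upsert-per-ingestor pattern is what makes "writes directly into the main SQLite database on its own schedule" safe: last writer wins per row, and no cross-store synchronization is needed.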

Layer 1

Ingestion

Sources of truth for plugin metadata, CVEs, and source code. Each ingestor writes directly into the main SQLite database on its own schedule.

Ingestion layer
External sources
Plugin API
api.wordpress.org/plugins/info/1.2/
Theme API
api.wordpress.org/themes/info/1.2/
source-of-truth
SVN mirror
Every version of every plugin, ever released
Wordfence webhook
Push-based; new CVE arrives within minutes
NVD sync
Nightly pull for official CVE records
PatchStack Core
Community disclosure feed
plugins
~55,000 rows · slug + metadata
themes
~8,000 rows
vulnerabilities
~34,000 rows · CVE + severity + patch diff
Layer 2

Analysis

Deterministic static analysis feeds LLM synthesis feeds verified reproduction. Each stage's output is the next stage's ground truth - the LLM only ever sees signals, never raw plugin source it might hallucinate about.
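To make the "LLM only ever sees signals" boundary concrete, here is a toy version of the deterministic code-signals pass: it reduces plugin source to counts the LLM can reason about without ever receiving the source itself. The signal names and regexes are illustrative simplifications, not the real analyzer:

```python
import re

def code_signals(php: str) -> dict:
    """Reduce PHP source to a handful of deterministic counts.
    Only this dict - never the raw source - would reach the LLM."""
    return {
        "superglobal_reads": len(re.findall(r"\$_(GET|POST|REQUEST)\b", php)),
        "raw_sql_calls":     len(re.findall(r"\$wpdb->(query|get_results)\s*\(", php)),
        "escaping_calls":    len(re.findall(r"\besc_(html|attr|url)\s*\(", php)),
        "nonce_checks":      len(re.findall(r"\b(check_admin_referer|wp_verify_nonce)\s*\(", php)),
    }

snippet = '''<?php
$id = $_GET["id"];
$wpdb->query("SELECT * FROM posts WHERE id = $id");
echo esc_html($id);
'''
print(code_signals(snippet))
```

A real taint tracker works on the AST inter-procedurally rather than with regexes, but the contract is the same: stage one emits facts, stage two writes prose about facts it cannot invent.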

Analysis layer
PHP AST parser
php-parser, runs per plugin version
deterministic
Taint tracker
Sources → sinks, inter-procedural
Code signals
Escaping coverage, raw SQL, nonce checks
Lightweight research LLM
Research-plan generation from patch diffs
Small fingerprint LLM
Risk-assessment prose + deduction explanations
Frontier-LLM fallback
Used when the lightweight tier can't produce usable output
Ephemeral VM
Provisioned from a snapshot per task, destroyed on completion
escalation
Agent cascade
Lightweight LLM → frontier LLM
Frontier judge
Independent verdict on every attempt
plugin_taint_flows
Graph per plugin version
plugin_risk_assessments
Score + deductions
vulnerabilities (enriched)
poc_*, research_*, verification artifacts
Layer 3

Detection

Two ingress points with shared fingerprint infrastructure. The CC pipeline runs in batch (monthly, at scale); the live audit runs on-demand (seconds, single site). Both resolve through the same matcher against the same fingerprint corpus.
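The shared-matcher idea can be sketched in a few lines: both ingress points hand the same resolver a list of asset URLs, and it maps them to plugin slugs, with confidence rising as independent asset hits agree. This is a minimal sketch of the asset-path signal only; the real scorer combines it with DOM signals (JS globals, CSS classes, REST routes):

```python
import re
from collections import Counter

# /wp-content/plugins/<slug>/... is the canonical asset-path signal.
ASSET_RE = re.compile(r"/wp-content/plugins/([a-z0-9-]+)/")

def match_assets(urls: list[str]) -> Counter:
    """Map asset URLs to plugin slugs; the count per slug is a crude
    stand-in for multi-signal confidence."""
    hits = Counter()
    for url in urls:
        m = ASSET_RE.search(url)
        if m:
            hits[m.group(1)] += 1
    return hits

page_assets = [
    "https://example.com/wp-content/plugins/contact-form-7/includes/js/index.js",
    "https://example.com/wp-content/plugins/contact-form-7/includes/css/styles.css",
    "https://example.com/wp-content/plugins/akismet/_inc/akismet.js",
    "https://example.com/wp-content/themes/twentytwentyfour/style.css",  # theme, not a plugin hit
]
print(match_assets(page_assets))
```

Because the matcher is a pure function over URLs, it genuinely doesn't care whether those URLs came from a WAT record in a monthly batch or from a live fetch seconds ago - which is the whole point of the shared infrastructure.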

Detection layer
WAT scan
Dedicated scanner VM, N-way parallel across CC shards
bandwidth-heavy
WARC range fetch
Targeted HTTP-Range downloads of matched URLs
DuckDB observations
Columnar store for analytical queries
Public audit form
Any visitor, any URL, rate-limited
Residential proxy
Real-IP rotation, redirect-safe following
Worker process
Detached audit-worker.mjs, per job
Asset-path signals
/wp-content/plugins/slug/*
Direct DOM signals
JS globals, CSS classes, REST routes
Confidence scorer
Multi-signal agreement
Layer 4

Delivery

The three surfaces through which the data reaches humans. Everything reads from the same main database - the app, the API, and the WP plugin never disagree because there's only one source.
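One delivery detail worth a sketch is the API's rate limiting, which keeps its counters in SQLite rather than Redis - consistent with the single-store philosophy. Below is a hedged fixed-window sketch of that idea; the table, column names, and limit are assumptions, not the real schema:

```python
import sqlite3
import time

# Fixed-window rate counter kept in SQLite, in the spirit of
# "rate-limited, SQLite-backed counters". One row per (token, minute).
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE rate_counters (
    token  TEXT,
    window INTEGER,          -- epoch minute
    hits   INTEGER NOT NULL,
    PRIMARY KEY (token, window)
)""")

LIMIT_PER_MINUTE = 60  # illustrative, not the real quota

def allow(token: str, now=None) -> bool:
    """Record one request for this bearer token; True if under quota."""
    window = int((time.time() if now is None else now) // 60)
    db.execute(
        "INSERT INTO rate_counters (token, window, hits) VALUES (?, ?, 1) "
        "ON CONFLICT(token, window) DO UPDATE SET hits = hits + 1",
        (token, window),
    )
    hits = db.execute(
        "SELECT hits FROM rate_counters WHERE token=? AND window=?",
        (token, window),
    ).fetchone()[0]
    return hits <= LIMIT_PER_MINUTE
```

The upsert makes the increment atomic per statement, so a single-host API process needs no extra locking, and old windows can be pruned lazily.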

Delivery layer
data/wp-safety.db
SQLite · WAL mode · every read path goes here
Nuxt 3 SSR
Plugin / theme / CVE / provider pages
Edge cache
Nitro SWR, per-route TTL
Dashboard
Monitored sites, alerts, saved scans
/api/v1/batch-lookup
Bulk plugin security scores
/api/v1/plugin-score/{slug}
Single plugin lookup
Bearer token auth
Rate-limited, SQLite-backed counters
Installed on WP
PHP + WP-CLI compatible
Reports site inventory
Plugin versions reported to /api/v1/site-plugins
Shows risk badges
Inline in the WP admin plugin list
Data layer

Three stores, zero hand-synced replication

The pipeline's data plane is split across three stores, each owning the queries it's good at. A write to one cannot corrupt the others, and a bug in any single store's computation can be fixed by re-running just that computation - nothing gets out of sync.

SQLite · WAL
main DB
Authoritative product database. Every page on the site reads from here; every ingestor and enrichment stage writes directly into it. Single-region, single-host, fast and simple.
Representative tables
  • plugins · ~55k rows
  • themes · ~8k rows
  • vulnerabilities · ~34k rows
  • plugin_taint_flows
  • plugin_risk_assessments
  • plugin_fingerprints_wat
  • plugin_fingerprints_direct
Not for
Wide analytical scans over tens of millions of observations. Those live next door in DuckDB.
DuckDB
analytical
Columnar analytical store for the CommonCrawl observation dataset. Produced by merging per-worker SQLite shards at the end of each crawl-processing run. Read-only to the application; new crawls rewrite it.
Good at
  • "Sites running plugin X, by provider"
  • "Version distribution of plugin Y across crawl"
  • "Plugins co-installed with Z, ranked"
  • "Adoption curve across historical crawls"
Not for
Row-level transactional writes. No OLTP semantics, no per-request mutation.
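The "merge per-worker shards at the end of each run" step is simple enough to sketch. The real pipeline merges into DuckDB; this sketch uses stdlib sqlite3 with ATTACH so it stays self-contained, and the observation schema is illustrative only:

```python
import os
import sqlite3
import tempfile

tmp = tempfile.mkdtemp()

def make_shard(path: str, rows: list[tuple]) -> None:
    """Each CC worker writes its own shard; no cross-worker locking."""
    s = sqlite3.connect(path)
    s.execute("CREATE TABLE observations (host TEXT, slug TEXT, version TEXT)")
    s.executemany("INSERT INTO observations VALUES (?, ?, ?)", rows)
    s.commit()
    s.close()

make_shard(os.path.join(tmp, "worker-0.db"),
           [("a.com", "akismet", "5.3"), ("b.com", "wpforms-lite", "1.8")])
make_shard(os.path.join(tmp, "worker-1.db"),
           [("c.com", "akismet", "5.2")])

# End-of-run merge: the analytical store is rebuilt, never mutated.
merged = sqlite3.connect(os.path.join(tmp, "merged.db"))
merged.execute("CREATE TABLE observations (host TEXT, slug TEXT, version TEXT)")
for i in range(2):
    merged.execute("ATTACH DATABASE ? AS shard",
                   (os.path.join(tmp, f"worker-{i}.db"),))
    merged.execute("INSERT INTO observations SELECT * FROM shard.observations")
    merged.commit()
    merged.execute("DETACH DATABASE shard")

# The kind of aggregate the merged store exists to answer:
print(merged.execute(
    "SELECT slug, COUNT(*) FROM observations GROUP BY slug ORDER BY 2 DESC"
).fetchall())
```

Rebuilding the whole store per crawl is what makes it safely read-only to the application: a bad merge is fixed by re-running the merge, never by patching rows in place.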
Filesystem
blobs
Plugin source mirrors (on-demand SVN exports today, full SVN mirror in progress), generated artifacts (Playwright traces, video bundles, fingerprint JSON), and anything else too big for a DB column and too static for a cache.
What lives here
  • Plugin source trees per slug/version
  • PoC Playwright trace ZIPs
  • PoC video MP4s + YouTube mirrors
  • LLM research-prompt exports
Not for
Queryable structured data. Files get referenced by path from the main DB; they never store the thing that gets queried.
Three stores, each owning the queries it's good at. Nothing is kept in sync by hand: SQLite is written by ingestors and the analysis stages, DuckDB is recomputed per CC crawl from a merge of per-worker shards, and the filesystem is referenced by path from rows that need it. A single bug in one store cannot corrupt the others.
Operational boundaries

What the architecture doesn't handle

by design
Real-time latency
CVE → scored plugin: minutes to hours, not seconds
intentional
Cross-region replication
Single-region SQLite - fine for current scale, not HA
standard SaaS
Multi-tenant isolation
Shared DB · user data separated by row, not by process

These are honest constraints. Each one is the result of a trade-off I made to keep the system small, auditable, and cheap to run at current scale. If you need microsecond CVE propagation or regional data residency, I'm not the right vendor today.

Ingest cadence. The Wordfence webhook is near-real-time; NVD sync and WP.org re-syncs run hourly to daily. A plugin's score can therefore lag a disclosure by up to a few hours if the ingestor queue is backed up or the risk-assessment pass is waiting on LLM quota. I publish the fetched_at timestamp on every plugin page so you can see exactly how stale the data is.

Single-region. The main SQLite database runs on one host. better-sqlite3 in WAL mode handles the current read load comfortably, and I haven't needed the complexity of a Postgres cluster or multi-region replication. That's the right call at current scale; it's the wrong call if I ever need HA for enterprise SLAs, and I'd migrate if that changes.

Recrawl latency. CommonCrawl publishes new archives monthly. The hosting-mix and version-distribution statistics on plugin pages therefore update on that cadence - between crawls, the ecosystem view is frozen. Individual site audits through the public form are live; only the aggregate statistics ride the CC clock.

Go deeper

Subsystem deep-dives

One page per non-obvious subsystem, with the design decisions and trade-offs spelled out in enough detail to reproduce the thing from scratch.

Mika Sipilä
For nerds only

Hi - Mika here. I built WP-Safety solo, so the methodology below is genuinely how it works, not a marketing sketch. The deep-dives are where I go long on the non-obvious details. Strictly optional - the plugin and CVE pages carry the full story without any of this.

Mika Sipilä · Founder, WP-Safety.org

Open data, auditable pipeline.

The methodology pages go deeper on each layer. Start with security scoring, detection, or PoC verification - whichever is most useful for what you're evaluating.