Status: public evidence note for @bilig/headless
This note keeps the public performance claim auditable from checked-in repo artifacts instead of README copy alone.
The decision artifact is `packages/benchmarks/baselines/workpaper-vs-hyperformula.json`.
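To audit the claim from the artifact itself, the file can be loaded directly. A minimal sketch (the artifact's exact schema is owned by the benchmark tooling and is not restated here):

```ts
import { readFileSync } from "node:fs";

// Load the checked-in decision artifact directly, so every claim in this
// note can be re-derived from the repo instead of trusted from README copy.
const artifactPath =
  "packages/benchmarks/baselines/workpaper-vs-hyperformula.json";
const artifact = JSON.parse(readFileSync(artifactPath, "utf8"));

// Dump the raw artifact; its schema is defined by the benchmark tooling,
// not by this note.
console.log(JSON.stringify(artifact, null, 2));
```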
Current checked-in metadata:

- Captured: 2026-05-06T14:54:57.091Z
- Platform: arm64, Node v24.3.0
- Sampling: 5 measured samples after 2 warmup samples
- Engine: @bilig/headless 3.2.0, local checkout commit 6de904b8876f920f287b63a95934c479acf78307
- License: GPL-v3 license key

The current scorecard claim is a mean-latency claim across directly comparable
headless spreadsheet-engine workloads:
| Lane | Comparable Workloads | WorkPaper Mean Wins | HyperFormula Mean Wins |
|---|---|---|---|
| Overall | 46 | 46 | 0 |
| Public | 38 | 38 | 0 |
| Holdout | 8 | 8 | 0 |
The overall directional mean-ratio geomean is 0.521767150331573. The overall
directional p95-ratio geomean is 0.5359737705859149. Ratios below 1.0 mean
WorkPaper is faster for that metric.
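For reference, the directional geomean is the n-th root of the product of the per-workload ratios. A minimal sketch of that computation, assuming hypothetical row fields `workpaperMeanMs` and `hyperformulaMeanMs` (the artifact's real field names may differ):

```ts
// Hypothetical per-workload row; the actual field names in the artifact
// are an assumption, not confirmed by this note.
interface WorkloadRow {
  name: string;
  workpaperMeanMs: number;
  hyperformulaMeanMs: number;
}

// Directional ratio: WorkPaper time over HyperFormula time, so values
// below 1.0 mean WorkPaper is faster on that workload.
const meanRatio = (row: WorkloadRow): number =>
  row.workpaperMeanMs / row.hyperformulaMeanMs;

// Geometric mean via log-space summation, which avoids floating-point
// overflow or underflow from multiplying many ratios together.
function geomean(ratios: number[]): number {
  const logSum = ratios.reduce((sum, r) => sum + Math.log(r), 0);
  return Math.exp(logSum / ratios.length);
}
```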
The closest overall mean win is lookup-approximate-duplicates at
0.9108460643406784. The closest public-lane mean win is
build-mixed-content at 0.9017762124360226.
This is not a blanket “faster on every p95 row” claim. The current worst p95
ratio is 1.043096403103571 on lookup-approximate-duplicates, so the honest
public claim is 46/46 mean wins with an overall p95 geomean lead and one known
losing p95 row that still needs margin work.
The 46/46 count is about mean latency: for each comparable workload row,
WorkPaper’s average measured time is lower than HyperFormula’s average measured
time. Mean wins are useful for the headline because they summarize the normal
cost of each workload, but they do not prove every slower tail sample has been
eliminated.
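A sketch of the per-row mean-win test under that definition, with sample arrays holding raw measured times in milliseconds (the helper names here are illustrative, not the benchmark tooling's API):

```ts
// Arithmetic mean of one workload's raw measured sample times.
const mean = (samples: number[]): number =>
  samples.reduce((sum, t) => sum + t, 0) / samples.length;

// A mean win for WorkPaper on a row: its average measured time is strictly
// lower than HyperFormula's average on the same workload.
function isMeanWin(
  workpaperSamples: number[],
  hyperformulaSamples: number[],
): boolean {
  return mean(workpaperSamples) < mean(hyperformulaSamples);
}
```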
Each p95 row asks a different question: “near the slow end of this workload’s sample set, which engine was faster?” A single row can lose on p95 even when its mean wins, because a small number of slower samples can move the tail without moving the average enough to flip the mean result.
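For comparison, a nearest-rank p95 sketch; the benchmark tooling's actual percentile method is an assumption here:

```ts
// Nearest-rank 95th percentile of one workload's measured sample times.
// With the current 5-sample runs this selects the slowest sample, which is
// exactly why a few slow outliers can flip a p95 row without flipping
// the mean result.
function p95(samples: number[]): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil(0.95 * sorted.length) - 1;
  return sorted[rank];
}
```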
The p95 geomean is an aggregate across the per-workload p95 ratios. It can stay
below 1.0 while one individual p95 row is above 1.0, because the aggregate
is balanced by the other p95 rows where WorkPaper has enough margin. Read the
current result as: WorkPaper wins every comparable mean row and leads the
overall p95 aggregate, but the repo is not claiming “faster on every p95 row”
until the known losing p95 row is fixed.
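As an illustrative example with made-up ratios (not taken from the artifact): three p95 ratios of 0.50, 0.50, and 1.043 aggregate to (0.50 × 0.50 × 1.043)^(1/3) ≈ 0.64, so a single losing row can coexist with an aggregate well below 1.0.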
Scorecard-eligible families cover the directly comparable workload rows counted
above. The scorecard excludes the config-toggle control family and the
dynamic-array leadership-only family from the directly comparable win count.
Check that the committed artifact still has the expected workload coverage and shape:

```sh
pnpm workpaper:bench:competitive:check
```
Regenerate timing evidence only when intentionally refreshing the benchmark artifact:

```sh
pnpm workpaper:bench:competitive:generate
pnpm workpaper:bench:competitive:check
```
Do not change workload sizes, sampling, scoring, or definitions to preserve a claim. If a rerun moves a row red, update the artifact, update this note, and fix the production engine path rather than hiding the loss.