This walkthrough shows how to run the local WorkPaper benchmark checks for
@bilig/headless and how to read the output without overstating the result.
Use it when you want to verify that the checked-in benchmark artifact is current, smoke-test the benchmark harness on your machine, or evaluate a performance change before publishing a benchmark claim.
git clone https://github.com/proompteng/bilig.git
cd bilig
corepack enable
pnpm install --frozen-lockfile
The benchmark sources live in packages/benchmarks/src/. The checked-in public
artifact lives at
packages/benchmarks/baselines/workpaper-vs-hyperformula.json.
Start with the fast shape and coverage check:
pnpm workpaper:bench:competitive:check
Expected successful output starts with:
{
"mode": "check",
"outputPath": "/path/to/bilig/packages/benchmarks/baselines/workpaper-vs-hyperformula.json",
"workloads": [
"build-from-sheets",
"build-dense-literals"
]
}
The workloads array is longer in the real output. The check is not a fresh
timing run; it verifies that the committed artifact still has the expected
benchmark schema and workload coverage.
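To see the spirit of that check, here is an illustrative shell sketch, not the harness's actual implementation: it asserts workload coverage with jq against a tiny stand-in artifact whose `workloads` field mirrors the output shown above.

```shell
# Illustrative only: the real check is the pnpm script above.
# Build a tiny stand-in artifact, then assert its workload coverage with jq.
artifact=$(mktemp)
cat > "$artifact" <<'EOF'
{"workloads": ["build-from-sheets", "build-dense-literals"]}
EOF
for w in build-from-sheets build-dense-literals; do
  # jq -e exits non-zero when the filter result is false or null.
  jq -e --arg w "$w" '.workloads | index($w) != null' "$artifact" > /dev/null \
    || { echo "missing workload: $w"; exit 1; }
done
echo "coverage ok"
rm -f "$artifact"
```

The real check covers more workloads and also validates the artifact schema, but the failure mode is the same: a missing workload fails the run rather than silently shrinking coverage.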
To inspect the headline scorecard in the committed artifact, run:
jq '{
generatedAt,
benchmark,
scorecard: {
comparableCount: .scorecard.comparableCount,
workpaperWins: .scorecard.workpaperWins,
hyperformulaWins: .scorecard.hyperformulaWins,
directionalMeanRatioGeomean: .scorecard.directionalMeanRatioGeomean,
directionalP95RatioGeomean: .scorecard.directionalP95RatioGeomean,
worstP95RatioWorkload: .scorecard.worstP95RatioWorkload
}
}' packages/benchmarks/baselines/workpaper-vs-hyperformula.json
At the time of this walkthrough, that reports:
{
"generatedAt": "2026-05-06T14:54:57.091Z",
"benchmark": {
"sampleCount": 5,
"warmupCount": 2
},
"scorecard": {
"comparableCount": 46,
"workpaperWins": 46,
"hyperformulaWins": 0,
"directionalMeanRatioGeomean": 0.521767150331573,
"directionalP95RatioGeomean": 0.5359737705859149,
"worstP95RatioWorkload": "lookup-approximate-duplicates"
}
}
Ratios below 1.0 mean WorkPaper is faster for that aggregate metric. This is
still a mean-latency benchmark claim, not a blanket “faster on every p95 row” or
“faster for every possible spreadsheet” claim.
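One way to check the p95 caveat directly is to list any comparable rows where WorkPaper is not faster at p95, using the per-workload `results[].comparison.workpaperToHyperFormulaP95Ratio` field:

```shell
# List rows where WorkPaper's p95 is not below HyperFormula's (ratio >= 1).
jq '[.results[]
    | select(.comparison.workpaperToHyperFormulaP95Ratio >= 1)
    | {workload, p95Ratio: .comparison.workpaperToHyperFormulaP95Ratio}]' \
  packages/benchmarks/baselines/workpaper-vs-hyperformula.json
```

An empty array means every comparable row is currently green at p95; any entries are exactly the rows a "faster on every p95 row" claim would have to exclude.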
For a quick local harness check, run a reduced one-sample benchmark:
pnpm --silent bench:workpaper:competitive -- --sample-count 1 --warmup-count 0 > /tmp/workpaper-local-benchmark.json
Then summarize it:
jq '{
suite,
scorecard: {
comparableCount: .scorecard.comparableCount,
workpaperWins: .scorecard.workpaperWins,
hyperformulaWins: .scorecard.hyperformulaWins,
directionalMeanRatioGeomean: .scorecard.directionalMeanRatioGeomean,
directionalP95RatioGeomean: .scorecard.directionalP95RatioGeomean
},
firstComparable: (
.results[]
| select(.workload == "build-from-sheets")
| {
workload,
fasterEngine: .comparison.fasterEngine,
meanRatio: .comparison.workpaperToHyperFormulaMeanRatio,
p95Ratio: .comparison.workpaperToHyperFormulaP95Ratio
}
)
}' /tmp/workpaper-local-benchmark.json
A one-sample run is intentionally noisy. Use it to confirm that the benchmark harness runs on your machine, not to replace the checked-in 5-sample artifact.
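If you still want a coarse sanity signal from that noisy run, compare its win count against the committed artifact. This `jq -s` sketch only checks scorecard win counts, which is about all a one-sample run can support:

```shell
# Compare win counts between the committed artifact (.[0]) and the
# one-sample local run (.[1]). Direction only; the timings are too noisy.
jq -s '{
  committedWins: .[0].scorecard.workpaperWins,
  localWins: .[1].scorecard.workpaperWins,
  sameWinCount: (.[0].scorecard.workpaperWins == .[1].scorecard.workpaperWins)
}' packages/benchmarks/baselines/workpaper-vs-hyperformula.json \
  /tmp/workpaper-local-benchmark.json
```

A mismatch here is not automatically a regression; single samples flip individual rows. Treat it as a prompt to rerun with the default sampling, not as evidence.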
If you intentionally change runtime performance, regenerate the committed artifact with the default benchmark sampling:
pnpm workpaper:bench:competitive:generate
pnpm workpaper:bench:competitive:check
Then review the resulting diff:
git diff -- packages/benchmarks/baselines/workpaper-vs-hyperformula.json
The highest-signal fields to review are:
benchmark.sampleCount and benchmark.warmupCount: verify sampling did not shrink accidentally.
scorecard.workpaperWins and scorecard.hyperformulaWins: read as direction-of-mean wins across comparable workloads.
scorecard.directionalMeanRatioGeomean: aggregate mean ratio, where lower is better for WorkPaper.
scorecard.directionalP95RatioGeomean: aggregate p95 ratio, also lower is better for WorkPaper.
scorecard.worstP95RatioWorkload: the fastest way to spot the current tail-risk row.
results[].comparison.workpaperToHyperFormulaMeanRatio: per-workload mean ratio.
results[].comparison.workpaperToHyperFormulaP95Ratio: per-workload p95 ratio.
Do not change workload sizes, scoring, sampling, or fixture definitions to make a claim look better. If a real rerun moves a row red, update the artifact and the public docs, then fix the production runtime path.
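Beyond the raw diff, a per-workload before/after view of the mean ratio can make regressions easier to spot. This sketch joins the HEAD copy of the artifact against the regenerated working-tree copy; it assumes you have regenerated but not yet committed the file:

```shell
# Join HEAD's per-workload mean ratios against the regenerated artifact.
old=$(mktemp)
git show HEAD:packages/benchmarks/baselines/workpaper-vs-hyperformula.json > "$old"
jq -s '
  (.[0].results
    | map({key: .workload, value: .comparison.workpaperToHyperFormulaMeanRatio})
    | from_entries) as $old
  | [ .[1].results[]
      | { workload,
          oldMeanRatio: $old[.workload],
          newMeanRatio: .comparison.workpaperToHyperFormulaMeanRatio } ]
' "$old" packages/benchmarks/baselines/workpaper-vs-hyperformula.json
rm -f "$old"
```

A null oldMeanRatio flags a workload that did not exist at HEAD, which is worth calling out in the change description rather than letting it slip in as a silent coverage change.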