BENCHMARK RESULTS

Real evaluation data from 12 multi-turn sessions (600 turns) across 12 domains. All comparisons statistically validated with Bonferroni correction.

Routing Accuracy: 96.7% · Index Precision: 80.2% · Token Reduction: 91.0% · Latency Overhead: <0.5%

Claude Token Savings

claude-opus-4-6 · 12 tools · 10 prompts

Avg Token Reduction: 91% · Tokens Saved: 13,248 · Savings / Run: $0.0662 · Cost w/ NeuroFS: $0.0065
#  | Prompt                                    | Baseline | NeuroFS | Reduction
1  | debug my rust async runtime code          |    1,453 |      13 |     99.1%
2  | send a status update email                |    1,456 |     592 |     59.3%
3  | query database for monthly active users   |    1,454 |      14 |     99.0%
4  | search web for Rust async benchmarks      |    1,459 |      19 |     98.7%
5  | generate illustration of futuristic city  |    1,458 |      18 |     98.8%
6  | read config.toml and show contents        |    1,459 |      19 |     98.7%
7  | write Python script to parse CSV          |    1,455 |      15 |     99.0%
8  | commit changes to feature branch          |    1,454 |      14 |     99.0%
9  | check calendar for meetings next week     |    1,454 |     590 |     59.4%
10 | store preferred coding language as Python |    1,455 |      15 |     99.0%
26/42 comparisons statistically significant
Bonferroni α = 0.007 · Wilcoxon signed-rank tests
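The per-prompt reduction, total tokens saved, and per-run savings above can be reproduced directly from the Baseline/NeuroFS token counts. A minimal sketch follows; the $5-per-million input-token rate is an assumption chosen because it reproduces the reported $0.0662 figure, not a confirmed price.

```python
# Derive the reduction and cost figures from the table's token counts.
PRICE_PER_TOKEN = 5 / 1_000_000  # assumed USD per input token

# (prompt, baseline_tokens, neurofs_tokens) from the table above
rows = [
    ("debug my rust async runtime code", 1453, 13),
    ("send a status update email", 1456, 592),
    ("query database for monthly active users", 1454, 14),
    ("search web for Rust async benchmarks", 1459, 19),
    ("generate illustration of futuristic city", 1458, 18),
    ("read config.toml and show contents", 1459, 19),
    ("write Python script to parse CSV", 1455, 15),
    ("commit changes to feature branch", 1454, 14),
    ("check calendar for meetings next week", 1454, 590),
    ("store preferred coding language as Python", 1455, 15),
]

for prompt, baseline, neurofs in rows:
    reduction = 100 * (baseline - neurofs) / baseline
    print(f"{prompt}: {reduction:.1f}%")

saved = sum(b - n for _, b, n in rows)  # total tokens saved per run
avg_reduction = sum(100 * (b - n) / b for _, b, n in rows) / len(rows)
print(f"tokens saved: {saved}")                           # 13248
print(f"avg reduction: {avg_reduction:.0f}%")             # 91%
print(f"savings / run: ${saved * PRICE_PER_TOKEN:.4f}")   # $0.0662
```

Note that the average reduction (91%) is the mean of the per-prompt percentages, not the ratio of totals.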

Routing Latency

Sub-millisecond overhead · <0.5% of model inference time
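Routing overhead of this kind is typically measured by timing the router in isolation and reporting a percentile. A minimal sketch, where `route` is a stand-in placeholder for the router under test, not NeuroFS's actual API:

```python
# Measure median (p50) routing latency in milliseconds.
import time

def route(prompt: str) -> str:
    # Placeholder router: trivial keyword dispatch stands in for the real one.
    return "calendar" if "meeting" in prompt else "default"

def p50_latency_ms(fn, arg, iters=1_000):
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(arg)
        samples.append((time.perf_counter() - start) * 1_000)
    return sorted(samples)[len(samples) // 2]  # median sample, in ms

latency = p50_latency_ms(route, "check calendar for meetings next week")
print(f"median routing latency: {latency:.4f} ms")
```

Comparing this median against the model's per-turn inference time yields the <0.5% overhead figure.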

Formal Evaluation

12 sessions · 600 turns · 4 routing conditions

Condition                 | Tool P@5 | Tool R@5 | Index Precision | Filter Rate | Expert P@3 | Expert R@3 | Routing Acc.
A — All Tools (Baseline)  |    12.9% |    16.1% |          100.0% |        0.0% |      33.0% |      48.9% |       100.0%
B — Static Domain         |    64.8% |    64.1% |           46.1% |       89.2% |      64.9% |      60.7% |        92.5%
C — Stateless NeuroFS     |    50.7% |    54.3% |           56.8% |       92.6% |      55.8% |      58.7% |        74.4%
D — Stateful NeuroFS      |    61.1% |    70.6% |           80.2% |       81.9% |      67.4% |      79.6% |        96.5%
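The P@k and R@k columns follow the standard retrieval definitions: precision@k is the share of the top-k routed items that are relevant, and recall@k is the share of all relevant items that appear in the top-k. A minimal sketch with illustrative tool names (not from the actual evaluation set):

```python
# Standard precision@k / recall@k over a ranked routing output.
def precision_at_k(ranked, relevant, k):
    top = ranked[:k]
    return sum(1 for t in top if t in relevant) / k

def recall_at_k(ranked, relevant, k):
    top = ranked[:k]
    return sum(1 for t in top if t in relevant) / len(relevant)

ranked = ["git", "shell", "web_search", "calendar", "email"]  # router output
relevant = {"git", "shell"}                                   # ground truth

print(precision_at_k(ranked, relevant, 5))  # 0.4
print(recall_at_k(ranked, relevant, 5))     # 1.0
```

This also explains why Condition A scores 100% index precision but only 12.9% tool P@5: exposing every tool guarantees the relevant ones are present while diluting the top-k with irrelevant ones.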

Ablation Study

Removing cube history (Condition D → Condition C) drops routing accuracy from 96.5% to 74.4%, a 22-point decline

RAG Precision

51 questions · All 6 retrieval targets met

Retrieval Precision: 95.4% (target: 90%)
Filter Rate: 94.9% (target: 60%)
Tool Precision@5: 90.1% (target: 70%)
Tool Recall@5: 84.1% (target: 80%)
Expert Precision@3: 86.6% (target: 70%)
Expert Recall@3: 89.4% (target: 75%)