BENCHMARK RESULTS

Real evaluation data from 12 multi-turn sessions (600 turns) across 12 domains. All comparisons statistically validated with Bonferroni correction.

Routing Accuracy: 96.7% · Index Precision: 80.2% · Token Reduction: 91.0% · Latency Overhead: <0.5%

Claude Token Savings

claude-opus-4-6 · 12 tools · 10 prompts

Avg Token Reduction: 91% · Tokens Saved: 13,248 · Savings / Run: $0.0662 · Cost w/ NeuroFS: $0.0065
#  | Prompt                                    | Baseline | NeuroFS | Reduction
1  | debug my rust async runtime code          |    1,453 |      13 |     99.1%
2  | send a status update email                |    1,456 |     592 |     59.3%
3  | query database for monthly active users   |    1,454 |      14 |     99.0%
4  | search web for Rust async benchmarks      |    1,459 |      19 |     98.7%
5  | generate illustration of futuristic city  |    1,458 |      18 |     98.8%
6  | read config.toml and show contents        |    1,459 |      19 |     98.7%
7  | write Python script to parse CSV          |    1,455 |      15 |     99.0%
8  | commit changes to feature branch          |    1,454 |      14 |     99.0%
9  | check calendar for meetings next week     |    1,454 |     590 |     59.4%
10 | store preferred coding language as Python |    1,455 |      15 |     99.0%
26/42 comparisons statistically significant
Bonferroni α = 0.007 · Wilcoxon signed-rank tests
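The per-prompt reduction, total tokens saved, and per-run savings above can be reproduced directly from the Baseline/NeuroFS token counts. A minimal sketch follows; the $5-per-million input-token rate is an assumption chosen because it reproduces the reported $0.0662 figure, not a confirmed price.

```python
# Derive the reduction and cost figures from the table's token counts.
PRICE_PER_TOKEN = 5 / 1_000_000  # assumed USD per input token

# (prompt, baseline_tokens, neurofs_tokens) from the table above
rows = [
    ("debug my rust async runtime code", 1453, 13),
    ("send a status update email", 1456, 592),
    ("query database for monthly active users", 1454, 14),
    ("search web for Rust async benchmarks", 1459, 19),
    ("generate illustration of futuristic city", 1458, 18),
    ("read config.toml and show contents", 1459, 19),
    ("write Python script to parse CSV", 1455, 15),
    ("commit changes to feature branch", 1454, 14),
    ("check calendar for meetings next week", 1454, 590),
    ("store preferred coding language as Python", 1455, 15),
]

for prompt, baseline, neurofs in rows:
    reduction = 100 * (baseline - neurofs) / baseline
    print(f"{prompt}: {reduction:.1f}%")

saved = sum(b - n for _, b, n in rows)  # total tokens saved per run
avg_reduction = sum(100 * (b - n) / b for _, b, n in rows) / len(rows)
print(f"tokens saved: {saved}")                           # 13248
print(f"avg reduction: {avg_reduction:.0f}%")             # 91%
print(f"savings / run: ${saved * PRICE_PER_TOKEN:.4f}")   # $0.0662
```

Note that the average reduction (91%) is the mean of the per-prompt percentages, not the ratio of totals.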

Routing Latency

Sub-millisecond overhead · <0.5% of model inference time
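Routing overhead of this kind is typically measured by timing the router in isolation and reporting a percentile. A minimal sketch, where `route` is a stand-in placeholder for the router under test, not NeuroFS's actual API:

```python
# Measure median (p50) routing latency in milliseconds.
import time

def route(prompt: str) -> str:
    # Placeholder router: trivial keyword dispatch stands in for the real one.
    return "calendar" if "meeting" in prompt else "default"

def p50_latency_ms(fn, arg, iters=1_000):
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(arg)
        samples.append((time.perf_counter() - start) * 1_000)
    return sorted(samples)[len(samples) // 2]  # median sample, in ms

latency = p50_latency_ms(route, "check calendar for meetings next week")
print(f"median routing latency: {latency:.4f} ms")
```

Comparing this median against the model's per-turn inference time yields the <0.5% overhead figure.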

Formal Evaluation

12 sessions · 600 turns · 4 routing conditions

Condition                 | Tool P@5 | Tool R@5 | Index Precision | Filter Rate | Expert P@3 | Expert R@3 | Routing Acc.
A — All Tools (Baseline)  |    12.9% |    16.1% |          100.0% |        0.0% |      33.0% |      48.9% |       100.0%
B — Static Domain         |    64.8% |    64.1% |           46.1% |       89.2% |      64.9% |      60.7% |        92.5%
C — Stateless NeuroFS     |    50.7% |    54.3% |           56.8% |       92.6% |      55.8% |      58.7% |        74.4%
D — Stateful NeuroFS      |    61.1% |    70.6% |           80.2% |       81.9% |      67.4% |      79.6% |        96.5%
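The P@k and R@k columns follow the standard retrieval definitions: precision@k is the share of the top-k routed items that are relevant, and recall@k is the share of all relevant items that appear in the top-k. A minimal sketch with illustrative tool names (not from the actual evaluation set):

```python
# Standard precision@k / recall@k over a ranked routing output.
def precision_at_k(ranked, relevant, k):
    top = ranked[:k]
    return sum(1 for t in top if t in relevant) / k

def recall_at_k(ranked, relevant, k):
    top = ranked[:k]
    return sum(1 for t in top if t in relevant) / len(relevant)

ranked = ["git", "shell", "web_search", "calendar", "email"]  # router output
relevant = {"git", "shell"}                                   # ground truth

print(precision_at_k(ranked, relevant, 5))  # 0.4
print(recall_at_k(ranked, relevant, 5))     # 1.0
```

This also explains why Condition A scores 100% index precision but only 12.9% tool P@5: exposing every tool guarantees the relevant ones are present while diluting the top-k with irrelevant ones.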

Ablation Study

Removing cube history (Condition D → Condition C) drops routing accuracy from 96.5% to 74.4%, a 22-point decline

RAG Precision

51 questions · All 6 retrieval targets met

Retrieval Precision: 95.4% (target: 90%)
Filter Rate: 94.9% (target: 60%)
Tool Precision@5: 90.1% (target: 70%)
Tool Recall@5: 84.1% (target: 80%)
Expert Precision@3: 86.6% (target: 70%)
Expert Recall@3: 89.4% (target: 75%)