BENCHMARK RESULTS
Real evaluation data from 600 turns across 12 multi-turn sessions spanning 12 domains. Comparisons validated with Wilcoxon signed-rank tests under Bonferroni correction.
- Routing Accuracy: 96.7%
- Index Precision: 80.2%
- Token Reduction: 91.0%
- Latency Overhead: <0.5%
Claude Token Savings
claude-opus-4-6 · 12 tools · 10 prompts
- Avg Token Reduction: 91%
- Tokens Saved: 13,248
- Savings / Run: $0.0662
- Cost w/ NeuroFS: $0.0065
| # | Prompt | Baseline | NeuroFS | Reduction |
|---|---|---|---|---|
| 1 | debug my rust async runtime code | 1,453 | 13 | 99.1% |
| 2 | send a status update email | 1,456 | 592 | 59.3% |
| 3 | query database for monthly active users | 1,454 | 14 | 99.0% |
| 4 | search web for Rust async benchmarks | 1,459 | 19 | 98.7% |
| 5 | generate illustration of futuristic city | 1,458 | 18 | 98.8% |
| 6 | read config.toml and show contents | 1,459 | 19 | 98.7% |
| 7 | write Python script to parse CSV | 1,455 | 15 | 99.0% |
| 8 | commit changes to feature branch | 1,454 | 14 | 99.0% |
| 9 | check calendar for meetings next week | 1,454 | 590 | 59.4% |
| 10 | store preferred coding language as Python | 1,455 | 15 | 99.0% |
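The aggregate figures in the cards can be reproduced from the per-prompt rows in the table above. A minimal check, assuming a flat rate of $5 per million tokens (a rate inferred from the card values for illustration, not a published price):

```python
# Per-prompt token counts copied from the table above.
baseline = [1453, 1456, 1454, 1459, 1458, 1459, 1455, 1454, 1454, 1455]
neurofs  = [13, 592, 14, 19, 18, 19, 15, 14, 590, 15]

tokens_saved = sum(baseline) - sum(neurofs)        # total tokens avoided
avg_reduction = 1 - sum(neurofs) / sum(baseline)   # overall reduction ratio

PRICE_PER_TOKEN = 5e-6  # assumed $5 / 1M tokens, inferred from the cards
savings_per_run = tokens_saved * PRICE_PER_TOKEN
cost_with_neurofs = sum(neurofs) * PRICE_PER_TOKEN

print(f"saved {tokens_saved} tokens ({avg_reduction:.1%}), "
      f"${savings_per_run:.4f}/run saved, ${cost_with_neurofs:.4f}/run spent")
# → saved 13248 tokens (91.0%), $0.0662/run saved, $0.0065/run spent
```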
26/42 comparisons statistically significant
Bonferroni α = 0.007 · Wilcoxon signed-rank tests
Routing Latency
Sub-millisecond overhead · <0.5% of model inference time
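Overhead claims like the one above are typically obtained by timing the routing step in isolation and comparing it to average model-inference latency. A minimal harness sketch, where `route()` is a hypothetical stand-in for the NeuroFS routing call and `inference_ms` is an assumed average inference time:

```python
import time

def measure_overhead(route, queries, inference_ms=800.0):
    """Time the routing step alone and report it as a fraction of an
    assumed average model-inference latency (inference_ms is illustrative)."""
    t0 = time.perf_counter()
    for q in queries:
        route(q)
    routing_ms = (time.perf_counter() - t0) * 1000 / len(queries)
    return routing_ms, routing_ms / inference_ms

# Example with a trivial stand-in router (hash → one of 12 domains):
ms, frac = measure_overhead(lambda q: hash(q) % 12, ["debug rust"] * 1000)
print(f"{ms:.4f} ms/query ({frac:.2%} of inference)")
```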
Formal Evaluation
12 sessions · 600 turns · 4 routing conditions
| Condition | Tool P@5 | Tool R@5 | Index Precision | Filter Rate | Expert P@3 | Expert R@3 | Routing Acc. |
|---|---|---|---|---|---|---|---|
| A — All Tools (Baseline) | 12.9% | 16.1% | 100.0% | 0.0% | 33.0% | 48.9% | 100.0% |
| B — Static Domain | 64.8% | 64.1% | 46.1% | 89.2% | 64.9% | 60.7% | 92.5% |
| C — Stateless NeuroFS | 50.7% | 54.3% | 56.8% | 92.6% | 55.8% | 58.7% | 74.4% |
| D — Stateful NeuroFS | 61.1% | 70.6% | 80.2% | 81.9% | 67.4% | 79.6% | 96.5% |
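The Tool P@5 / R@5 and Expert P@3 / R@3 columns follow the standard precision@k and recall@k definitions. A minimal sketch (function names and the example items are my own, not from the evaluation harness):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(1 for item in retrieved[:k] if item in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant items that appear in the top-k."""
    return sum(1 for item in retrieved[:k] if item in relevant) / len(relevant)

# e.g. the router surfaces 5 tools; 2 of 3 ground-truth tools appear:
retrieved = ["git", "shell", "web", "db", "email"]
relevant = {"git", "shell", "calendar"}
print(precision_at_k(retrieved, relevant, 5))  # 0.4
print(recall_at_k(retrieved, relevant, 5))     # 2/3
```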
Ablation Study
Removing cube history drops routing accuracy by 21 points
RAG Precision
51 questions · All 6 retrieval targets met
| Metric | Result | Target |
|---|---|---|
| Retrieval Precision | 95.4% | 90% |
| Filter Rate | 94.9% | 60% |
| Tool Precision@5 | 90.1% | 70% |
| Tool Recall@5 | 84.1% | 80% |
| Expert Precision@3 | 86.6% | 70% |
| Expert Recall@3 | 89.4% | 75% |
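The "all 6 retrieval targets met" readout is a simple threshold check; a sketch reproducing it from the numbers in this section:

```python
# (metric, result %, target %) triples from the RAG Precision section.
results = [
    ("Retrieval Precision", 95.4, 90.0),
    ("Filter Rate",         94.9, 60.0),
    ("Tool Precision@5",    90.1, 70.0),
    ("Tool Recall@5",       84.1, 80.0),
    ("Expert Precision@3",  86.6, 70.0),
    ("Expert Recall@3",     89.4, 75.0),
]

met = [name for name, result, target in results if result >= target]
print(f"{len(met)}/{len(results)} retrieval targets met")  # → 6/6
```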