An unofficial, end-to-end reproduction of Nexus: An Agentic Framework for Time Series Forecasting by Das et al. (2026) — the first public implementation of the 5-agent, 3-stage pipeline, with full prompts, calibration loop, and the TimesFM-2.5 baseline.
Nexus never asks one model to do everything. Raw numbers and events are first structured, then forecast at two resolutions in parallel, then synthesized — with a calibration loop that learns review guidelines from backtested error.
Every component of the paper — framework, prompts, calibration loop, data loaders, baselines, and the evaluation harness — implemented and unit-tested. A handful of large-scale tables are reproduced as partials due to budget and the single available backbone model.
| Component | Status | Notes |
|---|---|---|
| Framework — 5 agents, 3 stagesnexus.pipeline | Full | Matches paper §3 exactly. |
| Prompts (Appendix B / C / D)nexus/prompts.py | Full | Reproduced verbatim from the paper appendices. |
| Calibration loopn = 6 splits · k = 5% threshold | Full | Leak-safe split; 8 / 18 cells adopt guidelines. |
| Data loadersZillow weekly inventory + yfinance | Full | 15 cities × 7 tickers. |
| CoT baseline (Appendix C)experiments.run_table2 | Full | Direct reasoning baseline. |
| TimesFM baselinev2.5 via transformers (D7 resolved) | Full | 191 samples; matches paper within noise. |
| MAPE / RMSE / LLM-Judgenexus.eval | Full | Cross-family judge via abab6.5-chat (D2 resolved). |
| Unit testspytest | Full | 33 passed, 0 skipped. |
| Smoke testresults/smoke.json | Full | 1 ticker + 1 city end-to-end. |
| Table 2 — Multimodal7 tickers + 15 cities × 3 horizons | Full | 548 jobs · samples-per 10 (exceeds paper on Zillow). |
| Table 3 — NumericalNexus + CoT + TimesFM-2.5 | Full | 22 entities; TimesFM-2.5 ≈ paper. |
| Table 4 — Judge agreementcross-family judge | Partial | n = 26 judgments; Nexus 77% overall. |
| Table 5 — Ablation4 entities × 3 samples | Partial | Single backbone; deltas noisy below n ≥ 30. |
The reproduction reveals behavior that only surfaces when the paper is rebuilt against a different LLM and a different forecasting backbone. Treat these as anecdotes from a single weaker setup — not paper-replacing claims.
On MiniMax-M2.7 — a substantially weaker model than the paper's Gemini-3.1-Pro + Claude-4.5-Sonnet pairing — the agentic decomposition closes more of the end-to-end reasoning gap. On stocks our Nexus matches the paper's number (0.119 vs 0.111) despite the weaker backbone.
At horizon h = 26 the CoT baseline frequently refuses — citing financial-advice policy — or returns malformed output, triggering a last-value fallback. The Nexus pipeline avoids this regime entirely because no single stage frames the task as advice.
An early run on TimesFM-2.0 trailed the paper ~4× on Zillow. Installing TimesFM-2.5 from transformers closed it: the baseline now reproduces the paper within noise, confirming the whole gap was the 2.0-vs-2.5 version, not a methodology error.
MiniMax serves several model families on one endpoint, so we ran a true cross-family judge: abab6.5-chat grading MiniMax-M2.7. Nexus's win rate rose from 56% (self-judge) to 77% — landing squarely inside the paper's 64–97% range.
Nexus runs in your main 3.11+ env. TimesFM is pinned to its own 3.10–3.11 env because its dependency graph collides with everything else. Bring a MiniMax API key (the macOS Keychain entry minimax-coding-plan is read automatically) or set MINIMAX_API_KEY.
Every place the implementation knowingly departs from Nexus is tracked as a numbered deviation. Three were closed in a second pass; only D1 (no Gemini / Claude key) is materially unresolved. The full list lives in REPRODUCIBILITY_NOTES.md.
Cite the original paper for the method. If you're specifically referencing this implementation — the deviations, the MiniMax setup, the cross-family judge — please also note the repository.
This is a single-key reproduction by sunrf-renlab-ai on MiniMax-M2.7. All metrics should be read against the deviations list before being compared to the paper.
If you spot a real bug — a wrong prompt, a leaky split, a metric that doesn't match the paper's definition — please open an issue. PRs welcome on calibration breadth, ablation scale, and a Gemini / Claude backbone to close D1.