Live · v0.1 · arXiv:2605.14389 · Python 3.11 · MIT

Reproducing
Nexus.

An unofficial, end-to-end reproduction of Nexus: An Agentic Framework for Time Series Forecasting by Das et al. (2026) — the first public implementation of the 5-agent, 3-stage pipeline, with full prompts, calibration loop, and the TimesFM-2.5 baseline.

Agents: 5 in 3 stages
Unit tests: 33 / 33 passed
Calibration: n = 6 · k = 5%
Deviations: 11 · 3 resolved
arXiv preprint: 2605.14389 (2026)
Original paper

Nexus: An Agentic Framework for Time Series Forecasting

Das, et al. — Google & Penn State, 2026
§ 00 · Architecture

Five agents,
three stages.

Nexus never asks one model to do everything. Raw numbers and events are first structured, then forecast at two resolutions in parallel, then synthesized — with a calibration loop that learns review guidelines from backtested error.
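The flow described above (structure first, forecast at two resolutions in parallel, then synthesize) can be sketched as a small orchestration function. This is a minimal illustration, not the code in nexus.pipeline: the `Forecast` dataclass, the `run_pipeline` name, and the toy agents are all hypothetical stand-ins for what would be LLM calls.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Forecast:
    values: list[float]  # one value per horizon step
    reasoning: str       # free-text reasoning trace


def run_pipeline(
    x: Sequence[float],
    events: Sequence[str],
    horizon: int,
    ctx_agent: Callable[[Sequence[float], Sequence[str]], str],
    macro_agent: Callable[[str, int], Forecast],
    micro_agent: Callable[[str, int], Forecast],
    synthesizer: Callable[[str, Forecast, Forecast, Sequence[str]], Forecast],
    guidelines: Sequence[str] = (),
) -> Forecast:
    # Stage 1: merge numbers and events into one structured history H.
    history = ctx_agent(x, events)
    # Stage 2: macro and micro reasoning are independent, so run them concurrently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        macro_future = pool.submit(macro_agent, history, horizon)
        micro_future = pool.submit(micro_agent, history, horizon)
        macro, micro = macro_future.result(), micro_future.result()
    # Stage 3: combine both views, conditioned on calibration guidelines.
    return synthesizer(history, macro, micro, guidelines)


# Toy agents (hypothetical): deterministic functions instead of LLM calls.
def toy_ctx(x, events):
    return f"{len(x)} steps, {len(events)} events"


def toy_macro(history, horizon):
    return Forecast([100.0] * horizon, "flat regime")


def toy_micro(history, horizon):
    return Forecast([100.0 + i for i in range(horizon)], "step-wise drift")


def toy_syn(history, macro, micro, guidelines):
    blended = [(a + b) / 2 for a, b in zip(macro.values, micro.values)]
    return Forecast(blended, f"average of macro and micro over {history}")


result = run_pipeline(
    [98.0, 99.0, 100.0], ["earnings call"], 3,
    toy_ctx, toy_macro, toy_micro, toy_syn,
)
print(result.values)  # [100.0, 100.5, 101.0]
```

Passing the agents in as callables keeps the orchestration testable without an API key; threads are a reasonable fit here on the assumption that each agent call is network-bound.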

Multimodal history
X: Numerical series (weekly prices / inventory)
E: Textual events (news, releases, macro)

𝒜ctx · Historical Context Agent
Cleans & aligns X + E into a structured causal timeline H: one row per step, noise filtered.
→ H

𝒜macro · Macro-Reasoning Agent (reads H)
Top-down. One broad trajectory for the whole horizon: regime & seasonality.
→ Xmacro, Rmacro

𝒜micro · Micro-Reasoning Agent (reads H; macro ∥ micro)
Bottom-up. Step-by-step catalysts & local volatility, one value per timestep.
→ Xmicro, Rmicro

𝒜syn · Forecast Synthesizer Agent
Weighs macro vs micro, conditioned on guidelines 𝒢, into the final forecast + reasoning.
in: H · Xmacro · Xmicro · 𝒢 → X, R

𝒜calib · offline backtest loop
n=6 folds → critique rules → 𝒢 = ∩ 𝒢ᵢ, adopted only if ≥5% better on a hidden fold, then fed back into 𝒜syn above.

Robust forecast
X: Predicted values over the horizon
R: Interpretable reasoning trace

Legend
X: numerical series
E: textual events
H: structured history
𝒢: calibration guidelines
R: reasoning trace
∥: runs in parallel
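The calibration rule in the diagram (𝒢 = ∩ 𝒢ᵢ, adopted only if ≥5% better on a hidden fold) can be sketched in a few lines. This is an illustration under stated assumptions, not the repository's loop: `intersect_guidelines` and `should_adopt` are hypothetical names, guidelines are compared by exact match after lowercasing and trimming (free-text guidelines may need fuzzier matching), and "≥5% better" is read as a relative error reduction.

```python
from typing import Iterable


def intersect_guidelines(per_fold: Iterable[Iterable[str]]) -> set[str]:
    """Keep only guidelines proposed in every fold (G = intersection of G_i).

    Assumption: two guidelines are "the same" if they match exactly after
    lowercasing and stripping whitespace.
    """
    normalized = [{g.strip().lower() for g in fold} for fold in per_fold]
    if not normalized:
        return set()
    return set.intersection(*normalized)


def should_adopt(baseline_error: float, guided_error: float, k: float = 0.05) -> bool:
    """Adopt guidelines only if they cut hidden-fold error by at least k (relative)."""
    if baseline_error <= 0:
        return False  # nothing to improve on, or an invalid error value
    return (baseline_error - guided_error) / baseline_error >= k


# Illustrative guideline strings and errors, not taken from any real run.
folds = [
    ["Discount stale news", "Trust macro in calm regimes "],
    ["discount stale news", "Widen ranges after shocks"],
    ["Discount stale news"],
]
shared = intersect_guidelines(folds)
print(shared)                      # {'discount stale news'}
print(should_adopt(0.120, 0.110))  # True  (about 8.3% better)
print(should_adopt(0.120, 0.117))  # False (2.5% better)
```

Gating on a fold that the critique step never saw is what keeps the adopted guidelines from simply memorizing the backtest folds.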
§ 01 · Status

What's in the box.

Every component of the paper (framework, prompts, calibration loop, data loaders, baselines, and the evaluation harness) is implemented and unit-tested. Two of the large-scale tables, Table 4 and Table 5, are reproduced only partially because of budget limits and the single available backbone model.

Component | Status | Notes
Framework: 5 agents, 3 stages (nexus.pipeline) | Full | Matches paper §3 exactly.
Prompts, Appendix B / C / D (nexus/prompts.py) | Full | Reproduced verbatim from the paper appendices.
Calibration loop (n = 6 splits · k = 5% threshold) | Full | Leak-safe split; 8 / 18 cells adopt guidelines.
Data loaders (Zillow weekly inventory + yfinance) | Full | 15 cities + 7 tickers.
CoT baseline, Appendix C (experiments.run_table2) | Full | Direct reasoning baseline.
TimesFM baseline (v2.5 via transformers; D7 resolved) | Full | 191 samples; matches paper within noise.
MAPE / RMSE / LLM-Judge (nexus.eval) | Full | Cross-family judge via abab6.5-chat (D2 resolved).
Unit tests (pytest) | Full | 33 passed, 0 skipped.
Smoke test (results/smoke.json) | Full | 1 ticker + 1 city end-to-end.
Table 2: Multimodal (7 tickers + 15 cities × 3 horizons) | Full | 548 jobs · samples-per 10 (exceeds paper on Zillow).
Table 3: Numerical (Nexus + CoT + TimesFM-2.5) | Full | 22 entities; TimesFM-2.5 ≈ paper.
Table 4: Judge agreement (cross-family judge) | Partial | n = 26 judgments; Nexus 77% overall.
Table 5: Ablation (4 entities × 3 samples) | Partial | Single backbone; deltas noisy for n < 30.
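For readers comparing numbers against the table above, here is a minimal sketch of the two numerical metrics using their textbook definitions. The function names and the choice to report MAPE as a fraction rather than a percentage are assumptions; nexus.eval and the paper may scale or aggregate differently.

```python
import math
from typing import Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, returned as a fraction (0.05 means 5%)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    # Note: undefined when any actual value is zero (division by zero).
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error, in the same units as the series."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Illustrative values only.
actual = [100.0, 200.0]
predicted = [110.0, 180.0]
print(round(mape(actual, predicted), 4))  # 0.1
print(round(rmse(actual, predicted), 4))  # 15.8114
```

MAPE is scale-free, which makes it comparable across tickers and cities; RMSE is not, so it is only meaningful within a single series or after normalization.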
§ 02 · Findings

Four observations
that did not fit the abstract.

The reproduction reveals behavior that only surfaces when the paper is rebuilt against a different LLM and a different forecasting backbone. Treat these as anecdotes from a single weaker setup — not paper-replacing claims.

F.01 · framework advantage

Nexus helps weaker models more.

On MiniMax-M2.7, a substantially weaker model than the paper's Gemini-3.1-Pro + Claude-4.5-Sonnet pairing, the agentic decomposition closes more of the end-to-end reasoning gap. On stocks, our Nexus lands close to the paper's number (0.119 vs 0.111) despite the weaker backbone.

Nexus vs CoT · ours: 3.8×
Nexus vs CoT · paper: 1.01×
F.02 · CoT failure mode

CoT refuses at long horizons.

At horizon h = 26 the CoT baseline frequently refuses — citing financial-advice policy — or returns malformed output, triggering a last-value fallback. The Nexus pipeline avoids this regime entirely because no single stage frames the task as advice.

CoT · stocks h = 26: 1.85
Nexus · same cell: 0.21
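The last-value fallback mentioned above can be sketched as a small parser. This is a hypothetical illustration, not the repository's parser: it assumes the model was asked to reply with a comma-separated list of exactly `horizon` numbers, and treats anything else (a refusal, a short list, non-finite values) as a failure.

```python
import math
from typing import Sequence


def parse_or_fallback(
    raw_output: str, history: Sequence[float], horizon: int
) -> tuple[list[float], bool]:
    """Return (forecast, used_fallback).

    If the reply is not exactly `horizon` finite numbers, repeat the last
    observed value across the horizon instead.
    """
    try:
        values = [float(token) for token in raw_output.strip().split(",")]
    except ValueError:
        values = []  # refusal text or malformed output
    if len(values) == horizon and all(math.isfinite(v) for v in values):
        return values, False
    return [float(history[-1])] * horizon, True


history = [99.0, 100.0]
print(parse_or_fallback("101.5, 102.0, 103.25", history, 3))
# ([101.5, 102.0, 103.25], False)
print(parse_or_fallback("I cannot provide financial advice.", history, 3))
# ([100.0, 100.0, 100.0], True)
```

Returning the `used_fallback` flag alongside the forecast makes it possible to count how many cells in a results table were produced by the fallback rather than by the model.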
F.03 · backbone delta · resolved

Version, not method, drove the TimesFM gap.

An early run on TimesFM-2.0 trailed the paper ~4× on Zillow. Installing TimesFM-2.5 from transformers closed it: the baseline now reproduces the paper within noise, confirming the whole gap was the 2.0-vs-2.5 version, not a methodology error.

TimesFM-2.5 · Zillow h=13: 0.043
Paper · same cell: 0.041
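To put "within noise" in concrete terms, the relative gap to the paper can be computed directly from the reported cells. The helper name `relative_gap` is hypothetical, and whether experiments.compare_vs_paper reports this exact quantity is an assumption.

```python
def relative_gap(ours: float, paper: float) -> float:
    """Signed relative difference of our number against the paper's."""
    return (ours - paper) / paper


# TimesFM-2.5 on Zillow, h = 13 (this section).
print(f"{relative_gap(0.043, 0.041):+.1%}")  # +4.9%
# Nexus on stocks (F.01).
print(f"{relative_gap(0.119, 0.111):+.1%}")  # +7.2%
```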
F.04 · judge bias · resolved

Removing self-bias strengthens Nexus.

MiniMax serves several model families on one endpoint, so we ran a true cross-family judge: abab6.5-chat grading MiniMax-M2.7. Nexus's win rate rose from 56% (self-judge) to 77% — landing squarely inside the paper's 64–97% range.

Cross-judge overall: 77%
Self-judge overall: 56%
§ 03 · Quick start

Two environments,
five commands.

Nexus runs in your main 3.11+ env. TimesFM is pinned to its own 3.10–3.11 env because its dependency graph collides with everything else. Bring a MiniMax API key (the macOS Keychain entry minimax-coding-plan is read automatically) or set MINIMAX_API_KEY.
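The key lookup described above might look roughly like the sketch below. It is not the repository's code: the function names are hypothetical, the precedence (environment variable first, Keychain second) is an assumption, and so is treating `minimax-coding-plan` as the Keychain service name. The `security find-generic-password -s <service> -w` command is the standard macOS CLI for printing a stored generic password.

```python
import os
import subprocess
import sys
from typing import Callable, Mapping, Optional

KEYCHAIN_ENTRY = "minimax-coding-plan"  # entry name from the text; service-name usage is assumed


def keychain_lookup(service: str) -> Optional[str]:
    """Read a generic password from the macOS Keychain; None elsewhere or on failure."""
    if sys.platform != "darwin":
        return None
    try:
        result = subprocess.run(
            ["security", "find-generic-password", "-s", service, "-w"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip() or None


def resolve_api_key(
    env: Mapping[str, str] = os.environ,
    lookup: Callable[[str], Optional[str]] = keychain_lookup,
) -> Optional[str]:
    """Prefer MINIMAX_API_KEY from the environment, then fall back to the Keychain."""
    return env.get("MINIMAX_API_KEY") or lookup(KEYCHAIN_ENTRY)


# The lookup is injectable so the logic can be exercised without a real Keychain.
print(resolve_api_key({"MINIMAX_API_KEY": "sk-test"}, lambda _: "from-keychain"))  # sk-test
print(resolve_api_key({}, lambda _: "from-keychain"))                              # from-keychain
```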

setup.sh — main env
# Main env (Python 3.11+)
$ uv venv && source .venv/bin/activate
$ uv pip install -e .
 
# Provide the API key (or use macOS Keychain)
$ export MINIMAX_API_KEY=sk-cp-...
setup.sh — TimesFM env
# TimesFM env — must be 3.10–3.11
$ uv venv .venv311 --python 3.11
$ source .venv311/bin/activate
$ uv pip install -e . transformers \
  torch huggingface_hub
run.sh — experiments
$ pytest # 33 tests pass
$ python -m experiments.run_smoke # 1 ticker + 1 city smoke
$ python -m experiments.run_table2_parallel --tickers 7 \
  --cities 15 --samples-per 10 --workers 8 --out results/table2_full.csv
$ python -m experiments.run_table4_parallel --judge-model \
  abab6.5-chat --out results/table4_crossjudge.json # cross-judge
 
# TimesFM-2.5 baseline — switch envs first
$ source .venv311/bin/activate
$ HF_HUB_DISABLE_XET=1 python -m experiments.run_timesfm_only \
  --tickers 7 --cities 15 --samples-per 3 --out results/timesfm25_only.csv
 
$ python -m experiments.compare_vs_paper # Δ vs paper
§ 04 · Deviations

Eleven honest diffs
from the paper.

Every place the implementation knowingly departs from the paper is tracked as a numbered deviation. Three were closed in a second pass; only D1 (no Gemini / Claude key) is materially unresolved. The full list lives in REPRODUCIBILITY_NOTES.md.

D1 · open
LLM substitution
Gemini-3.1-Pro · Claude-4.5-Sonnet → MiniMax-M2.7
(only key available)

D2 · resolved
Judge configuration
Self-judge → cross-family judge
abab6.5-chat grades M2.7

D3 · mitigated
Event curation
TFRBench curated set → LLM-generated events
−38% to −55% Zillow MAPE

D7 · resolved
Forecasting backbone
TimesFM 2.0 → TimesFM 2.5
via transformers; ≈ paper

D4 – D6 · minor
Samples · splits · intersection
Sample caps & free-text guideline ∩
small numerical drift

D8 – D9 · minor
Single-run · thinking off
Matches paper's single run; MiniMax
thinking disabled to save budget

D10 · minor
Context-agent fallback
Long history truncates ctx output
raw history appended

D11 · model
Long-horizon refusal
MiniMax refuses stock h=26 CoT
last-value fallback
§ 05 · Cite

If this saves you a weekend.

Cite the original paper for the method. If you're specifically referencing this implementation — the deviations, the MiniMax setup, the cross-family judge — please also note the repository.

@article{das2026nexus,
  title={Nexus: An Agentic Framework for Time Series Forecasting},
  author={Das, et al.},
  journal={arXiv preprint arXiv:2605.14389},
  year={2026}
}

@misc{nexus_repro_2026,
  title={nexus-repro: an unofficial reproduction},
  author={sunrf-renlab-ai},
  year={2026},
  howpublished={\url{github.com/sunrf-renlab-ai/nexus-repro}}
}

Caveats worth surfacing.

This is a single-key reproduction by sunrf-renlab-ai on MiniMax-M2.7. All metrics should be read against the deviations list before being compared to the paper.

If you spot a real bug — a wrong prompt, a leaky split, a metric that doesn't match the paper's definition — please open an issue. PRs welcome on calibration breadth, ablation scale, and a Gemini / Claude backbone to close D1.
