nexus-repro — Unofficial reproduction of the Nexus agentic forecasting framework

§ 00 · Architecture

Five agents,
three stages.

Nexus never asks one model to do everything. Raw numbers and events are first structured, then forecast at two resolutions in parallel, then synthesized — with a calibration loop that learns review guidelines from backtested error.
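A minimal sketch of the three-stage flow described above, assuming each agent is a plain Python callable. The names `Forecast`, `run_pipeline`, and the `stub_*` agents are hypothetical illustrations, not the API of `nexus.pipeline`; a real agent would wrap an LLM call.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Forecast:
    values: List[float]  # one predicted value per horizon step
    reasoning: str       # free-text reasoning trace


def run_pipeline(x, events, horizon, ctx, macro, micro, syn, guidelines=()):
    """Structure the history, forecast at two resolutions in parallel, then synthesize."""
    # Stage 1: turn raw numbers + events into a structured history.
    history = ctx(x, events)
    # Stage 2: macro and micro agents are independent, so run them concurrently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        macro_future = pool.submit(macro, history, horizon)
        micro_future = pool.submit(micro, history, horizon)
        f_macro, f_micro = macro_future.result(), micro_future.result()
    # Stage 3: combine both forecasts, conditioned on calibration guidelines.
    return syn(history, f_macro, f_micro, guidelines)


# Stub agents (hypothetical) so the sketch runs without any API key.
def stub_ctx(x: Sequence[float], events: Sequence[str]) -> str:
    return f"{len(x)} steps, {len(events)} events"


def stub_macro(history: str, horizon: int) -> Forecast:
    return Forecast([1.0] * horizon, "flat regime")


def stub_micro(history: str, horizon: int) -> Forecast:
    return Forecast([float(i) for i in range(horizon)], "step-level catalysts")


def stub_syn(history, f_macro, f_micro, guidelines) -> Forecast:
    values = [(a + b) / 2 for a, b in zip(f_macro.values, f_micro.values)]
    return Forecast(values, f"mean of macro and micro; {len(guidelines)} guidelines")


result = run_pipeline([1.0, 2.0, 3.0], ["event"], 3,
                      stub_ctx, stub_macro, stub_micro, stub_syn)
print(result.values)  # [0.5, 1.0, 1.5]
```

Threads are used here because LLM calls are I/O-bound; the averaging synthesizer is only a placeholder for the weighing step.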

Multimodal history

X · Numerical series (weekly prices / inventory)

E · Textual events (news, releases, macro)

▶

𝒜_ctx

Historical Context Agent

Cleans & aligns X + E into a structured causal timeline H — one row per step, noise filtered.

→ H

▶H

𝒜_macro

Macro-Reasoning Agent

Top-down. One broad trajectory for the whole horizon — regime & seasonality.

→ X^macro, R^macro

𝒜_micro

Micro-Reasoning Agent

Bottom-up. Step-by-step catalysts & local volatility, one value per timestep.

→ X^micro, R^micro

▶H · macro ∥ micro

𝒜_syn

Forecast Synthesizer Agent

Weighs macro vs micro, conditioned on guidelines 𝒢, into the final forecast + reasoning.

in: H · X^macro · X^micro · 𝒢 → X, R

↻ 𝒜_calib — offline backtest loop. n=6 folds → critique rules → 𝒢 = ∩ 𝒢ᵢ, adopted only if ≥5% better on a hidden fold, then fed back into 𝒜_syn above.

▶

Robust forecast

X · Predicted values over the horizon

R · Interpretable reasoning trace

X · numerical series

E · textual events

H · structured history

𝒢 · calibration guidelines

R · reasoning trace

∥ · runs in parallel
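The calibration loop's two rules (𝒢 = ∩ 𝒢ᵢ, adopted only if ≥5% better on a hidden fold) can be sketched as follows. This is an illustration under stated assumptions: guidelines are treated as already-normalized strings so that a set intersection is meaningful, and "5% better" is read as a relative error reduction. The function names are hypothetical.

```python
from typing import List, Set


def intersect_guidelines(per_fold: List[Set[str]]) -> Set[str]:
    """Keep only the guidelines that every fold's critique produced."""
    if not per_fold:
        return set()
    return set.intersection(*per_fold)


def should_adopt(err_without: float, err_with: float, threshold: float = 0.05) -> bool:
    """Adopt guidelines only if hidden-fold error drops by at least `threshold` (relative)."""
    if err_without <= 0:
        return False  # nothing to improve on, or invalid baseline
    improvement = (err_without - err_with) / err_without
    return improvement >= threshold


folds = [
    {"discount stale news", "respect seasonality"},
    {"respect seasonality", "cap weekly moves"},
    {"respect seasonality"},
]
shared = intersect_guidelines(folds)
print(shared)                       # {'respect seasonality'}
print(should_adopt(0.100, 0.094))   # True: 6% relative improvement
print(should_adopt(0.100, 0.097))   # False: only 3%
```

Free-text guidelines rarely match string-for-string, so a real intersection step would need some form of semantic matching; the exact-match set above is the simplest stand-in.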

§ 01 · Status

What's in the box.

Every component of the paper — framework, prompts, calibration loop, data loaders, baselines, and the evaluation harness — implemented and unit-tested. A handful of large-scale tables are reproduced as partials due to budget and the single available backbone model.

Component	Module / scope	Status	Notes
Framework (5 agents, 3 stages)	nexus.pipeline	Full	Matches paper §3 exactly.
Prompts (Appendix B / C / D)	nexus/prompts.py	Full	Reproduced verbatim from the paper appendices.
Calibration loop	n = 6 splits · k = 5% threshold	Full	Leak-safe split; 8 / 18 cells adopt guidelines.
Data loaders	Zillow weekly inventory + yfinance	Full	15 cities + 7 tickers.
CoT baseline (Appendix C)	experiments.run_table2	Full	Direct reasoning baseline.
TimesFM baseline	v2.5 via transformers (D7 resolved)	Full	191 samples; matches paper within noise.
MAPE / RMSE / LLM-Judge	nexus.eval	Full	Cross-family judge via abab6.5-chat (D2 resolved).
Unit tests	pytest	Full	33 passed, 0 skipped.
Smoke test	results/smoke.json	Full	1 ticker + 1 city end-to-end.
Table 2: Multimodal	(7 tickers + 15 cities) × 3 horizons	Full	548 jobs · samples-per 10 (exceeds paper on Zillow).
Table 3: Numerical	Nexus + CoT + TimesFM-2.5	Full	22 entities; TimesFM-2.5 ≈ paper.
Table 4: Judge agreement	cross-family judge	Partial	n = 26 judgments; Nexus 77% overall.
Table 5: Ablation	4 entities × 3 samples	Partial	Single backbone; deltas are noisy below n = 30.
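For readers unfamiliar with the two numerical metrics in the table, here is a minimal sketch of their textbook definitions. It assumes MAPE is reported as a fraction (consistent with values such as 0.119 elsewhere on this page) and skips zero actuals; the paper's exact definition and the code in `nexus.eval` may differ.

```python
import math
from typing import Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error as a fraction; zero actuals are skipped."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when every actual value is zero")
    return sum(abs((a - p) / a) for a, p in pairs) / len(pairs)


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error in the units of the series."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


print(mape([100.0, 200.0], [110.0, 180.0]))  # 0.1
print(rmse([0.0, 0.0], [3.0, 3.0]))          # 3.0
```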

§ 02 · Findings

Four observations
that did not fit the abstract.

The reproduction reveals behavior that only surfaces when the paper is rebuilt against a different LLM and a different forecasting backbone. Treat these as anecdotes from a single weaker setup — not paper-replacing claims.

F.01 · framework advantage

Nexus helps weaker models more.

On MiniMax-M2.7, a substantially weaker model than the paper's Gemini-3.1-Pro + Claude-4.5-Sonnet pairing, the agentic decomposition yields a far larger gain over direct CoT reasoning than the paper reports (3.8× vs 1.01×). On stocks, our Nexus lands close to the paper's number (0.119 vs 0.111) despite the weaker backbone.

Nexus vs CoT · ours

3.8 ×

Nexus vs CoT · paper

1.01 ×

F.02 · CoT failure mode

CoT refuses at long horizons.

At horizon h = 26 the CoT baseline frequently refuses — citing financial-advice policy — or returns malformed output, triggering a last-value fallback. The Nexus pipeline avoids this regime entirely because no single stage frames the task as advice.

CoT · stocks h = 26

1.85

Nexus · same cell

0.21
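The last-value fallback mentioned in F.02 can be sketched as below. This assumes the model is asked to reply with a JSON array of exactly `horizon` numbers; that output format and both function names are hypothetical, not taken from the repository.

```python
import json
from typing import List, Optional, Sequence, Tuple


def parse_forecast(raw: str, horizon: int) -> Optional[List[float]]:
    """Return the parsed forecast, or None if the reply is a refusal or malformed."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None  # free-text refusals are not valid JSON
    if not isinstance(data, list) or len(data) != horizon:
        return None
    try:
        return [float(v) for v in data]
    except (TypeError, ValueError):
        return None


def forecast_or_fallback(raw: str, history: Sequence[float],
                         horizon: int) -> Tuple[List[float], bool]:
    """Use the model's forecast if usable; otherwise repeat the last observed value."""
    parsed = parse_forecast(raw, horizon)
    if parsed is not None:
        return parsed, False
    return [history[-1]] * horizon, True  # True flags that the fallback fired


print(forecast_or_fallback("I cannot provide financial advice.", [10.0, 12.0], 3))
# ([12.0, 12.0, 12.0], True)
print(forecast_or_fallback("[1, 2, 3]", [10.0, 12.0], 3))
# ([1.0, 2.0, 3.0], False)
```

Returning the fallback flag alongside the values makes it possible to report how often a cell's score reflects the fallback rather than the model.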

F.03 · backbone delta · resolved

Version, not method, drove the TimesFM gap.

An early run on TimesFM-2.0 trailed the paper ~4× on Zillow. Installing TimesFM-2.5 from transformers closed it: the baseline now reproduces the paper within noise, confirming the whole gap was the 2.0-vs-2.5 version, not a methodology error.

2.5 · Zillow h=13

0.043

Paper · same cell

0.041

F.04 · judge bias · resolved

Removing self-bias strengthens Nexus.

MiniMax serves several model families on one endpoint, so we ran a true cross-family judge: abab6.5-chat grading MiniMax-M2.7. Nexus's win rate rose from 56% (self-judge) to 77% — landing squarely inside the paper's 64–97% range.

Cross-judge overall

77%

Self-judge overall

56%

§ 03 · Quick start

Two environments,
five commands.

Nexus runs in your main Python 3.11+ env. TimesFM is pinned to its own Python 3.10–3.11 env because its dependency graph collides with everything else. Supply a MiniMax API key in one of two ways: set MINIMAX_API_KEY, or store the key in the macOS Keychain entry minimax-coding-plan, which is read automatically.
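A sketch of how such a key lookup could work, for readers wiring up their own client. Two assumptions are made here and are not confirmed by the repository: the environment variable is checked before the Keychain, and `minimax-coding-plan` is the service name of a generic-password item. `load_api_key` is a hypothetical helper.

```python
import os
import subprocess
import sys
from typing import Optional

KEYCHAIN_ENTRY = "minimax-coding-plan"  # entry name from the text; service-name role is assumed


def load_api_key() -> Optional[str]:
    """Return the MiniMax API key from the environment, else from the macOS Keychain."""
    key = os.environ.get("MINIMAX_API_KEY")
    if key:
        return key
    if sys.platform != "darwin":
        return None  # the Keychain only exists on macOS
    try:
        # `security find-generic-password -s <service> -w` prints only the password.
        result = subprocess.run(
            ["security", "find-generic-password", "-s", KEYCHAIN_ENTRY, "-w"],
            capture_output=True, text=True, check=True,
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        return None  # no such entry, or the `security` tool is unavailable
    return result.stdout.strip() or None
```

Returning `None` instead of raising lets the caller decide whether a missing key is fatal, for example skipping LLM-backed tests while still running the offline ones.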

setup.sh — main env

# Main env (Python 3.11+)
$ uv venv && source .venv/bin/activate
$ uv pip install -e .
 
# Provide the API key (or use macOS Keychain)
$ export MINIMAX_API_KEY=sk-cp-...

setup.sh — TimesFM env

# TimesFM env — must be 3.10–3.11
$ uv venv .venv311 --python 3.11
$ source .venv311/bin/activate
$ uv pip install -e . transformers \
  torch huggingface_hub

run.sh — experiments

$ pytest                                                    # 33 tests pass
$ python -m experiments.run_smoke                           # 1 ticker + 1 city smoke
$ python -m experiments.run_table2_parallel --tickers 7 \
  --cities 15 --samples-per 10 --workers 8 --out results/table2_full.csv
$ python -m experiments.run_table4_parallel --judge-model \
  abab6.5-chat --out results/table4_crossjudge.json  # cross-judge
 
# TimesFM-2.5 baseline — switch envs first
$ source .venv311/bin/activate
$ HF_HUB_DISABLE_XET=1 python -m experiments.run_timesfm_only \
  --tickers 7 --cities 15 --samples-per 3 --out results/timesfm25_only.csv
 
$ python -m experiments.compare_vs_paper                    # Δ vs paper
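As an illustration of what a Δ-vs-paper comparison means, here is a small sketch that computes the relative difference for two cells quoted on this page. The helper `relative_delta` is hypothetical; what `experiments.compare_vs_paper` actually computes is not shown here.

```python
def relative_delta(ours: float, paper: float) -> float:
    """Signed relative difference of our metric against the paper's."""
    return (ours - paper) / paper


# (ours, paper) pairs taken from the Findings section of this page.
cells = {
    "Nexus, stocks": (0.119, 0.111),
    "TimesFM-2.5, Zillow h=13": (0.043, 0.041),
}

for name, (ours, paper) in cells.items():
    delta = relative_delta(ours, paper)
    print(f"{name}: ours={ours:.3f} paper={paper:.3f} delta={delta:+.1%}")
# Nexus, stocks: ours=0.119 paper=0.111 delta=+7.2%
# TimesFM-2.5, Zillow h=13: ours=0.043 paper=0.041 delta=+4.9%
```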

§ 04 · Deviations

Eleven honest diffs
from the paper.

Every place the implementation knowingly departs from Nexus is tracked as a numbered deviation. Three were closed in a second pass; only D1 (no Gemini / Claude key) is materially unresolved. The full list lives in REPRODUCIBILITY_NOTES.md.

D1 · open

LLM substitution

Gemini-3.1-Pro · Claude-4.5-Sonnet
→ MiniMax-M2.7
(only key available)

D2 · resolved

Judge configuration

Self-judge → cross-family judge
abab6.5-chat grades M2.7

D3 · mitigated

Event curation

TFRBench curated set → LLM-generated events
−38% to −55% Zillow MAPE

D7 · resolved

Forecasting backbone

TimesFM 2.0 → TimesFM 2.5
via transformers; ≈ paper

D4 – D6 · minor

Samples · splits · intersection

Sample caps & free-text guideline ∩
→ small numerical drift

D8 – D9 · minor

Single-run · thinking off

Matches paper's single run; MiniMax
thinking disabled to save budget

D10 · minor

Context-agent fallback

Long history truncates ctx output
→ raw history appended

D11 · model

Long-horizon refusal

MiniMax refuses stock h=26 CoT
→ last-value fallback
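The D10 context-agent fallback can be read in more than one way. The sketch below takes one interpretation, stated as an assumption: since the structured timeline should hold one row per step, a shorter timeline is treated as truncated, and raw rows are appended for the steps it did not reach. `with_raw_fallback` and the row format are hypothetical.

```python
from typing import List, Tuple


def with_raw_fallback(timeline_rows: List[str],
                      raw_history: List[Tuple[str, float]]) -> List[str]:
    """Append raw (date, value) rows for steps missing from a truncated timeline."""
    if len(timeline_rows) >= len(raw_history):
        return list(timeline_rows)  # timeline covers every step; nothing to add
    missing = raw_history[len(timeline_rows):]
    raw_rows = [f"{date}: value={value} (raw, unstructured)" for date, value in missing]
    return list(timeline_rows) + raw_rows


raw = [("2024-01-05", 101.0), ("2024-01-12", 103.5), ("2024-01-19", 99.8)]
structured = ["2024-01-05: 101.0, earnings beat lifts price"]

for row in with_raw_fallback(structured, raw):
    print(row)
# 2024-01-05: 101.0, earnings beat lifts price
# 2024-01-12: value=103.5 (raw, unstructured)
# 2024-01-19: value=99.8 (raw, unstructured)
```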

§ 05 · Cite

If this saves you a weekend.

Cite the original paper for the method. If you're specifically referencing this implementation — the deviations, the MiniMax setup, the cross-family judge — please also note the repository.

@article{das2026nexus,
  title={Nexus: An Agentic Framework for Time Series Forecasting},
  author={Das, et al.},
  journal={arXiv preprint arXiv:2605.14389},
  year={2026}
}

@misc{nexus_repro_2026,
  title={nexus-repro: an unofficial reproduction},
  author={sunrf-renlab-ai},
  year={2026},
  howpublished={\url{github.com/sunrf-renlab-ai/nexus-repro}}
}

Caveats worth surfacing.

This is a single-key reproduction by sunrf-renlab-ai on MiniMax-M2.7. All metrics should be read against the deviations list before being compared to the paper.

If you spot a real bug — a wrong prompt, a leaky split, a metric that doesn't match the paper's definition — please open an issue. PRs welcome on calibration breadth, ablation scale, and a Gemini / Claude backbone to close D1.

