Open source · MIT · Forward-testing in public

StockMachina

An agentic trading system
that grades itself.

Most trading bots lie to you: dead components hide behind green P&L, risk gets quietly inverted by a config default, and backtests never survive contact with the market. StockMachina is an open-source framework for intraday US equities on Interactive Brokers built on the opposite premise: every intelligent component is graded against what the market actually did next. A local LLM acts as a second pair of eyes on every entry, a 12-check risk engine sits in front of every order, and it all runs on hardware you own.


12
Risk checks per order

1,150+
Tests, all green

100%
Local AI inference

MIT
License

Philosophy

Green P&L can hide dead components.

We learned this the honest way: our LLM screener silently failed open for two weeks, and a profitable week hid it completely. Only a counterfactual scorecard (grading every GO/SKIP decision against forward price action) exposed that a core component was dead. StockMachina is built around that lesson: every intelligent component must be measured against reality, continuously, or you don't actually know what's driving your returns.

Lie #1 — “it's working”

Dead components hide behind lucky weeks. Answer: counterfactual scorecards. Every screener GO/SKIP graded against forward price action; exit quality measured with MFE/MAE; forecaster disagreement quantified.

Lie #2 — “risk is handled”

A config default silently inverted our R:R — losers 2.4× bigger than winners. Answer: one risk middleware, 12 fail-fast checks, in front of EVERY order. Staleness halts, reconciliation halts, hard notional caps.

Lie #3 — “the backtest says so”

Backtests flatter. Answer: realistic cost models, out-of-sample validation on split windows for every change, negative results documented — and a public forward test as the only pitch that counts.

Architecture

Event-driven core. Agentic where it pays.

Deterministic engines wired over an event bus, with LLMs placed only where judgment adds value — and always fail-open, so a dead model degrades the system instead of stopping it.

↺ nightly reflection: journal → bounded proposals → promotion gate

LLM Screener

judgment

What it does

A local LLM gives every proposed entry a GO / WEAK / SKIP verdict with a strict JSON contract — the second pair of eyes before any order exists.

In the real system

Every decision is recorded and later graded against forward price action (counterfactual scorecard). That scorecard once caught the screener silently dead for two weeks behind green P&L.

When it fails

Circuit breaker + fail-open: if the model is down, signals pass ungated — but flagged 'degraded', counted, and alerted. Absence is measured, never hidden.
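A minimal sketch of what a circuit breaker with fail-open behavior could look like. This is not StockMachina's actual code: the class and field names (`FailOpenScreener`, `ScreenResult`, `ask_model`, `max_failures`) are hypothetical, the JSON contract is reduced to a single `verdict` key, and we assume only SKIP blocks a signal. A real breaker would also probe for recovery, which is left out here.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional

VALID_VERDICTS = {"GO", "WEAK", "SKIP"}


@dataclass(frozen=True)
class ScreenResult:
    allow: bool              # may the signal proceed to the risk engine?
    verdict: Optional[str]   # None when the model gave no usable answer
    degraded: bool           # True when the signal passed ungated


class FailOpenScreener:
    """Hypothetical gate: ask a model for a verdict, fail open when it is down."""

    def __init__(self, ask_model: Callable[[str], str], max_failures: int = 3):
        self.ask_model = ask_model            # returns the model's raw JSON text
        self.max_failures = max_failures      # consecutive failures before the circuit opens
        self.consecutive_failures = 0
        self.degraded_count = 0               # how many signals passed ungated

    def _pass_ungated(self) -> ScreenResult:
        # Fail-open: let the signal through, but flag and count it.
        self.degraded_count += 1
        return ScreenResult(allow=True, verdict=None, degraded=True)

    def screen(self, signal: str) -> ScreenResult:
        if self.consecutive_failures >= self.max_failures:
            # Circuit is open: skip the model call entirely.
            return self._pass_ungated()
        try:
            payload = json.loads(self.ask_model(signal))
            verdict = payload["verdict"]
            if verdict not in VALID_VERDICTS:
                raise ValueError(f"unexpected verdict: {verdict!r}")
        except Exception:
            # Timeouts, bad JSON and contract violations all count as failures.
            self.consecutive_failures += 1
            return self._pass_ungated()
        self.consecutive_failures = 0
        return ScreenResult(allow=verdict != "SKIP", verdict=verdict, degraded=False)
```

The point of `degraded_count` is that an alert or scorecard can read it: a screener that is passing everything ungated shows up as a number, not as silence.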


$ python scripts/launch_trading.py

▸ DataEngine: 30 symbols subscribed · indicators streaming · freshness armed

▸ RiskMiddleware: 12 checks armed (kill_switch → protections)

▸ Screener: qwen3.5 via LiteLLM · circuit-breaker ready · fail-open

▸ Forecasters: kronos + timesfm + chronos (optional, typed votes)

▸ Broker reconciliation: positions + open orders synced with IBKR

▸ Entry window: 10:10–15:00 ET · EOD flatten 15:55 ET

✓ StockMachina online · paper mode · forward test day 1

Runs on a Mac mini

The trading core is lightweight: Python + SQLite + an IBKR gateway. The orchestrator node schedules, supervises, and alerts via Telegram.

GPU box optional

LLM + forecaster inference served via any OpenAI-compatible endpoint (vLLM, LM Studio, Ollama). We run a DGX Spark; a gaming GPU works too.

Degrades, never stops

Screener down? Fail-open with telemetry. Forecasters down? Typed votes go absent, gates relax. Data stale? New entries halt, exits still work.

Features

Built from real trading scars, not feature lists.

Counterfactual scorecards

Every screener GO/SKIP graded against forward MFE. False-rejection rate and verdict separation, measured — not assumed.
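One way such a scorecard could be computed, as a rough sketch. The definitions here are our assumptions for illustration, not the project's published formulas: a "false rejection" is a SKIP whose forward MFE reached a hypothetical `win_threshold`, and "verdict separation" is the mean forward MFE of GO decisions minus that of SKIP decisions.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass(frozen=True)
class GradedDecision:
    verdict: str         # "GO" or "SKIP"
    forward_mfe: float   # max favorable excursion after the decision, in percent


def scorecard(decisions: Sequence[GradedDecision],
              win_threshold: float = 0.5) -> Optional[dict]:
    """Grade verdicts against what the market did next (assumed definitions)."""
    go = [d.forward_mfe for d in decisions if d.verdict == "GO"]
    skip = [d.forward_mfe for d in decisions if d.verdict == "SKIP"]
    if not go or not skip:
        # One bucket is empty, so the screener is not discriminating at all.
        return None
    false_rejections = sum(1 for mfe in skip if mfe >= win_threshold)
    return {
        "false_rejection_rate": false_rejections / len(skip),
        "verdict_separation": mean(go) - mean(skip),
    }
```

Returning `None` when there are no SKIPs is deliberate in this sketch: a screener that has stopped rejecting anything is worth an alert on its own.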

Exit engineering

ATR brackets that preserve the intended R:R after stop-flooring. The asymmetry was root-caused in research, and the fix was validated out-of-sample (OOS) before shipping.
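To illustrate the idea, here is a sketch under our own assumptions about how a floor can distort R:R: if the stop distance is floored to a minimum but the target is still sized from raw ATR, the ratio shrinks. Sizing the target from the final stop distance keeps it intact. The function name, parameters, and default values are hypothetical.

```python
from typing import NamedTuple


class Bracket(NamedTuple):
    stop: float
    target: float


def atr_bracket(entry: float, atr: float, side: str,
                stop_atr_mult: float = 1.0, rr: float = 2.0,
                min_stop_pct: float = 0.003) -> Bracket:
    """Build a stop/target pair whose reward:risk survives the stop floor."""
    if side not in ("long", "short"):
        raise ValueError("side must be 'long' or 'short'")
    # Floor the stop distance so tiny ATR values cannot produce a hair-trigger stop.
    stop_dist = max(atr * stop_atr_mult, entry * min_stop_pct)
    # Size the target from the floored stop distance, not from raw ATR.
    target_dist = stop_dist * rr
    sign = 1.0 if side == "long" else -1.0
    return Bracket(stop=entry - sign * stop_dist, target=entry + sign * target_dist)
```

With `entry=100`, `atr=0.1` and the defaults above, the floor widens the stop to roughly 0.30 and the target follows to roughly 0.60, so reward:risk stays at 2.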

Let winners run — both ways

Trend-aware far targets with ATR trailing, symmetric for longs and shorts, with a chop guard that reverts to tight targets in ranges.

Entry windows, OOS-validated

Edge-of-day entries were 49% of trades and ~78% of losses across both validation halves. The midday filter shipped with evidence.

12-check risk middleware

One pluggable pipeline in front of every order. Data-staleness halts, reconciliation halts, fat-finger caps, daily-loss circuit.
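A fail-fast pipeline of this kind can be sketched in a few lines. Only four illustrative checks are shown, not the twelve; every name and threshold below (`Order`, `RiskState`, `run_checks`, the 30-second staleness limit, the notional cap, the daily-loss limit) is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Order:
    symbol: str
    quantity: int
    price: float
    is_exit: bool = False   # exits must keep working when data goes stale


@dataclass(frozen=True)
class RiskState:
    kill_switch: bool
    data_age_seconds: float
    daily_pnl: float


# A check returns a rejection reason, or None to let the order continue.
Check = Callable[[Order, RiskState], Optional[str]]


def check_kill_switch(order: Order, state: RiskState) -> Optional[str]:
    return "kill switch engaged" if state.kill_switch else None


def check_data_staleness(order: Order, state: RiskState,
                         max_age: float = 30.0) -> Optional[str]:
    if order.is_exit:
        return None  # stale data halts new entries only
    return "market data is stale" if state.data_age_seconds > max_age else None


def check_notional_cap(order: Order, state: RiskState,
                       cap: float = 25_000.0) -> Optional[str]:
    return "notional above hard cap" if order.quantity * order.price > cap else None


def check_daily_loss(order: Order, state: RiskState,
                     limit: float = -500.0) -> Optional[str]:
    if order.is_exit:
        return None
    return "daily loss limit hit" if state.daily_pnl <= limit else None


def run_checks(order: Order, state: RiskState,
               checks: Sequence[Check]) -> Optional[str]:
    """Run checks in order and stop at the first rejection (fail-fast)."""
    for check in checks:
        reason = check(order, state)
        if reason is not None:
            return f"{check.__name__}: {reason}"
    return None  # every check passed


CHECKS = [check_kill_switch, check_data_staleness, check_notional_cap, check_daily_loss]
```

Because each check is a plain function with the same signature, adding or reordering checks is a change to the `CHECKS` list rather than to the order path.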

MFE/MAE telemetry

Max favorable / adverse excursion tracked per position, surfaced to the nightly reflection as exit-quality metrics.
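For readers new to the terms, MFE and MAE can be derived from the bars observed while a position is open. The sketch below uses the standard definitions in price units; the function name and signature are ours, not the project's.

```python
from typing import Sequence, Tuple


def excursions(entry_price: float, side: str,
               highs: Sequence[float], lows: Sequence[float]) -> Tuple[float, float]:
    """Return (MFE, MAE) in price units for one position."""
    if not highs or not lows:
        raise ValueError("need at least one bar")
    if side == "long":
        mfe = max(highs) - entry_price   # best unrealized gain
        mae = entry_price - min(lows)    # worst unrealized loss
    elif side == "short":
        mfe = entry_price - min(lows)
        mae = max(highs) - entry_price
    else:
        raise ValueError("side must be 'long' or 'short'")
    # Clamp at zero: a position that never went against you has MAE 0.
    return max(mfe, 0.0), max(mae, 0.0)
```

Comparing realized P&L with MFE shows how much of the best available move an exit actually captured, which is the kind of exit-quality figure a nightly review can use.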

Realistic backtest costs

State-dependent slippage, spread and commission modeling. Deflated metrics and walk-forward validation against overfitting.

Fail-open agentics

LLMs advise, engines decide. Any AI component can die without stopping the system — and its absence is measured, not hidden.

Model ensemble

Architecture beats backbone.

Live-trading benchmarks keep finding the same thing we did: the agent's architecture matters more than which model you plug in. StockMachina treats models as replaceable parts behind typed contracts — and measures each one's real contribution instead of trusting the marketing.

Qwen 3.5 122B

Alibaba · open weights

LLM

The screener and the nightly reflector. Reasoning disabled for the hot path (a hard-won lesson: reasoning tokens can silently eat your entire output budget).

Apache 2.0

MiniMax M2.7

Open weights

LLM

Ops and reporting agents — daily summaries, premarket briefs, system monitoring. Fast local serving keeps the entire agent fleet at $0 cloud spend.

Apache 2.0

Kronos

Open weights

Time-Series

Candlestick-native forecaster used as a soft veto/boost on entry confidence. Its votes are typed, cached, and its live accuracy is continuously measured.

MIT

TimesFM

Google

Time-Series

Second voice of the forecast ensemble. Our own disagreement metric showed it conflicts with Kronos on direction 54% of the time — we publish that number too.

Apache 2.0

Chronos-Bolt

Amazon

Time-Series

Third ensemble voice. Every forecaster is optional and fail-open: an unreachable service degrades to an absent vote, never to a fabricated one.

Apache 2.0
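Typed votes with explicit absence might look like the sketch below. The `Direction` enum, the `Vote` alias, and `disagreement_rate` are hypothetical names; the assumption is that an unreachable forecaster contributes `None` and that disagreement is only counted where both forecasters actually voted.

```python
from enum import Enum
from typing import Optional, Sequence


class Direction(Enum):
    UP = 1
    DOWN = -1


# None means the forecaster was unreachable: an absent vote, never a made-up one.
Vote = Optional[Direction]


def disagreement_rate(a: Sequence[Vote], b: Sequence[Vote]) -> Optional[float]:
    """Share of jointly-voted observations where two forecasters disagree on direction."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return None  # no overlap, so no measurement
    return sum(1 for x, y in pairs if x != y) / len(pairs)
```

Keeping absence as `None` instead of a default direction means a dead forecaster lowers the sample size without biasing the metric.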

FinBERT

Open weights

Sentiment

Financial sentiment scoring over news, social and influencer feeds — aggregated by a sentiment hub with per-source freshness tracking and deduplication.

Apache 2.0

Bring your own

You

LLM

Everything speaks OpenAI-compatible API through a LiteLLM gateway. Swap the screener model with one config line — then let the scorecard tell you if it's better.

Any

All inference through one OpenAI-compatible endpoint. Same client, different model, measured contribution.

The stack

Boring where it should be, sharp where it matters.

No exotic infrastructure: Python, SQLite, one broker API, one inference gateway. The sophistication lives in the risk engine, the exits, and the measurement loops — the parts that actually move P&L.

Core

Python 3.11
Event bus (in-proc)
ib_insync (IBKR)
Pydantic settings
loguru

Data

SQLite ledger + warehouse
Live 5-min bars + ticks
Freshness heartbeats
Corporate-action guards

AI serving

LiteLLM gateway
vLLM (or LM Studio / Ollama)
OpenAI-compatible API
Typed vote contracts

Intelligence

Qwen 3.5 (screener + reflector)
MiniMax (ops agents)
Kronos · TimesFM · Chronos
FinBERT sentiment

Research

Offline backtester + cost model
Walk-forward / OOS splits
Hyperopt sweeps (deflated)
Per-trade attribution

Ops

1,150+ pytest suite
Telegram alerts
Watchdog + heartbeats
Dashboard (P&L, exec quality)

Who it's for

Built for people who treat their capital like quants.

Independent quants

You run capital through IBKR and want an agentic layer with institutional risk discipline — without trusting a black box.

Engineers entering trading

You reason about systems for a living. Start with an architecture that measures itself instead of a strategy script that lies to you.

LLM-agent researchers

A real, instrumented testbed for agentic trading: screener scorecards, reflection loops, and fail-open patterns on live market data.

Builders of small desks

A private, auditable, on-prem stack where every order passes one risk choke point and every AI decision leaves a graded trace.

Not for: HFT, options multi-leg strategies, or anyone looking for a one-click money printer. StockMachina is a workbench with opinions about risk — not a vending machine.

Status · Built in public

The forward test is the roadmap.

No 18-week promises. The system is built and hardened; what remains is evidence. Paper-to-live transition is governed by numbers agreed in advance.

PHASE 01 · DONE

Build + harden

Event-driven core, 12-check risk middleware, LLM screener with scorecards, exit engineering (R:R preservation, symmetric trailing), reflection loop. 1,150+ tests green.

PHASE 02 · NOW · paper

Forward test — in public

≥100 trades with zero rule changes, profit factor > 1.3, rolling-20 consistency, zero false safety halts. Results published as they happen — including the bad days.
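Gates like these are easy to encode so that nobody has to argue about them later. The sketch below covers the trade count, profit factor, and false-halt criteria; the rolling-20 consistency rule is left out because its definition is not given here. Function names and the per-trade P&L input format are assumptions.

```python
from typing import Sequence


def profit_factor(pnls: Sequence[float]) -> float:
    """Gross profit divided by gross loss over closed trades."""
    gross_profit = sum(p for p in pnls if p > 0)
    gross_loss = -sum(p for p in pnls if p < 0)
    if gross_loss == 0:
        # No losing trades: infinite if anything was won, else zero.
        return float("inf") if gross_profit > 0 else 0.0
    return gross_profit / gross_loss


def gates_pass(pnls: Sequence[float], false_safety_halts: int,
               min_trades: int = 100, min_profit_factor: float = 1.3) -> bool:
    """True only if every pre-agreed forward-test gate is met."""
    return (
        len(pnls) >= min_trades
        and profit_factor(pnls) > min_profit_factor
        and false_safety_halts == 0
    )
```

The thresholds are arguments with defaults taken from the gates above, so the decision to go live reduces to one boolean computed from the trade ledger.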

PHASE 03 · NEXT

Public repository

MIT-licensed release via a clean allowlist export: the full framework, the test suite, and the research journal. No cherry-picked backtests — the forward test IS the pitch.

PHASE 04 · IF gates pass

Live capital, gated

Start at 10–20% of intended capital, scale only on sustained consistency. The same quantitative gates decide — not a good week, not a gut feeling.

After the gates

Reflection lessons injected into the screener context · Bull/bear debate for high-conviction entries · Entry-alpha research continues (regime and time-of-day cohorts) · Community strategies behind the same risk middleware.

Open source

MIT. Inspectable. Forkable. Yours.

StockMachina will be released under the MIT license: no usage limits, no telemetry, no enterprise tier gating the parts that matter. The repository opens together with the forward-test results (the wins, the losses, and the lessons), because a trading framework's pitch should be its measured behavior, not its README.


# Quick start (at release)

git clone https://github.com/draix/stockmachina

cd stockmachina && pip install -r requirements.txt

cp .env.example .env # IBKR paper account

python scripts/launch_trading.py

# Runs degraded without a GPU: LLM + forecasters are optional.

# Paper-trade until the gates pass. No exceptions.

Measure everything. Trust nothing. Ship evidence.

StockMachina is being forward-tested in public right now. Follow along, inspect the architecture, and bring your own hardware when the repo opens.

