Building Quantsentinel: ML for Indian Options Markets in Production

Published: 29 May 2026 | Reading time: 84 min | Audience: engineers building production ML in finance & other regulated domains

1. Opening: a Friday-afternoon HAR-RV failure

Three days into shipping the volatility forecasting module, the HAR-RV model started producing nonsense predictions on every Friday afternoon close. Not random noise — systematically wrong, only on Fridays, only after 14:00 IST. Monday through Thursday it tracked realized volatility cleanly. Then Friday, like clockwork, the prediction would diverge by 30-40% in the same direction.

I spent half a day inside the realized-vol pipeline, convinced it was a feature-construction bug — a window-shift error, or a leakage between the daily/weekly/monthly HAR components. It wasn’t. The model code was clean. The features were clean. The lookahead-shift was correct.

The bug was in the data. The India VIX calculation methodology had changed in the source feed at some point in the training window, and the realized-volatility series I was regressing on spanned the transition. The model wasn’t broken. It was confidently predicting using one definition of the underlying while reality used another, and the discrepancy compounded on Fridays because that’s when expiry-week IV crush dominated the signal.

That’s what production ML in finance looks like once you leave the paper. The equations from Corsi (2009) are the easy part. The hard part is the slow archaeology — why does this system fail under these conditions you didn’t anticipate, and what assumption are you making that’s quietly wrong?

Over the last ten days I designed and built Quantsentinel — a multi-tenant quantitative trading platform for Indian options and futures markets. The numbers, by way of orienting the rest of this post: 11 microservices, 17 ML systems organized into a 4-layer alpha engine and a 7-wall risk castle, 70,342 lines of production Python, 88 test files, 332 commits, 34 frontend pages, 10 closed-beta tenants live. Ten days is a tight number; that’s deliberate — I’ll be honest about the timeline in §2 and what it implies.

This post is the technical case study. What I built, why I built it that way, what tradeoffs I made, and — the section that matters most — what I got wrong.

2. Why I built this

I came to this project from the intersection of two things: a long ML engineering background where the work was mostly NLP and recsys, and a personal interest in Indian options markets that had been growing for two years before I wrote the first line of Quantsentinel.

The interest in options came first — not as a trader, but as someone who’d watched the F&O turnover on NSE grow into the largest options market by contract volume in the world and noticed that the available retail tools were, mostly, glorified order entry plus some indicators. The structural inefficiency was visible: retail participation was huge, individual sophistication was uneven, the brokers were optimizing for transaction count, and the few “quant” tools on the market were either spreadsheet-grade backtest plays or US-market ports that had no awareness of Indian cost structures (STT, STT-on-physical-delivery, the exchange transaction charges, the SEBI turnover fee, GST on brokerage). I’d done the rough math on whether a well-built systematic platform could clear the cost layer and produce something that compounded — and the answer wasn’t obvious one way or the other, which is what made it interesting.

The ML engineering background was the other half. I’d shipped enough production ML to know which problems are hard at scale. Online learning is hard. Concept drift detection is hard. Versioning and rollback are operationally hard in ways the academic literature doesn’t capture. Multi-tenant ML — where the model has to behave differently per customer without leaking signal across them — is a wall most teams hit late and clean up badly. Those are the parts of ML systems I find genuinely interesting, and most consumer-tech roles don’t expose them at depth.

The hypothesis I wanted to validate was narrower than “can I beat the market.” It was: can a small team build production-grade ML infrastructure for a regulated, latency-sensitive, multi-tenant domain, with sufficient discipline that the system is genuinely safe to put in front of real capital? The trading edge was secondary — most of the value, I suspected (and §10 confirms), would come from the risk infrastructure, not the alpha.

So I started. Not with the alpha engine; with the data layer, the model registry, the tenant isolation primitives, and the dry-run lock. The trading logic came after the things that would let me trust the trading logic.

The timeline is short — ten days of intense, focused work, 332 commits, no team. That’s a real constraint that shaped everything. It forced ruthless prioritization: every component had to either be load-bearing for the core system or get cut. The “what I got wrong” section is partly a function of moving fast; some of it would have been caught with a quieter cadence. Most of it is the kind of thing you only catch by shipping.

3. What makes Indian options markets a genuinely interesting technical problem

Most “ML for trading” content treats the trading domain as a generic prediction problem with some asset-class flavoring. That framing misses what makes the problem actually hard. The technical depth lives in five places, and skipping any of them produces systems that demo well and break on contact with real capital.

State-space explosion

NIFTY alone has 700-1000 active option strikes across the four nearest expiries on any given day, plus the same again for BANKNIFTY, plus all the single-stock options. The relevant decision surface isn’t “what’s the price of NIFTY going to do” — it’s the joint distribution over the full implied-volatility surface, the open-interest profile, and the term structure. The Greeks aren’t decoration; they’re the language the problem is naturally expressed in. A “buy CE” or “sell PE” decision implicitly carries delta, gamma, vega, theta, and rho exposures, each of which has its own time-dependent decay, and the trade only makes sense relative to all of them simultaneously.

In a system like this, “predict the direction” is approximately the easiest possible framing of the problem and approximately the least useful. The structure of how you take the position usually matters more than whether you got the direction right.

Non-stationarity, but not the kind from the textbooks

Standard time-series ML assumes that you can train on a window and the next window will look statistically similar. Indian options markets break that assumption in a very particular way: the regime-switching happens at multiple timescales simultaneously, and the relevant regime is partly endogenous to the system you’re trading in.

Regulatory changes (new STT rates, position-limit changes, the changes to weekly expiries on indices other than NIFTY, the periodic adjustments to lot size, such as NIFTY moving from 75 to 65 in early 2026), structural changes (FII flows reversing, the rise of weekly-options retail volumes), and slow shifts in market microstructure (the proliferation of algos at the retail broker level) mean that the model you trained on data from a year ago was, in real risk terms, trained on a different market.

The system has to detect these shifts and respond to them. A quarterly retrain is not enough; it reacts far more slowly than the market changes. The Quantsentinel architecture builds in drift detection at the feature level, model degradation monitoring at the prediction level, and a four-layer escalation hierarchy that downgrades position size before it cuts the strategy entirely.

Multi-objective optimization, and the objectives fight each other

A retail-facing system in Indian markets has to clear, simultaneously:

Returns, after costs. Net of STT, brokerage, exchange charges, SEBI turnover fee, GST, and slippage. The “gross alpha” is a vanity metric — the only one that matters is net of every line item.
Risk, including the tail risks that don’t show up in standard volatility measures. Pin risk on expiry days. Gap risk over weekends. The sub-millisecond gaps in liquidity on news prints.
Capacity. A strategy that works at ₹2L of capital may not work at ₹2Cr. The model needs to be honest about how its edge degrades with size.
Taxes. Indian short-term capital gains rules, the classification of F&O income as non-speculative business income, the recent changes to indexation: these reshape the after-tax payoff materially.
Regulatory constraints. SEBI position limits, exchange-level circuit breakers, the per-broker risk caps that get hit before SEBI’s caps do.

Standard ML optimizes for one objective. A risk-aware loss function is a step up, but it still wants to be one number. Quantsentinel’s pipeline returns a multi-dimensional verdict — approved, blocked-by-which-gate, sizing-lots, structure, expected-cost-net, expected-risk-after-hedge, expected-tax-treatment — and the production logic compromises across them explicitly.
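One way to picture that verdict is as a typed record rather than a scalar score. The sketch below is illustrative only: the `Verdict` class, its field names, and the example values are assumptions derived from the list in the paragraph above, not the actual Quantsentinel schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Verdict:
    """Hypothetical multi-dimensional pipeline verdict (field names are assumed)."""
    approved: bool
    blocked_by: Optional[str]        # name of the gate that blocked, None if approved
    sizing_lots: int
    structure: str                   # e.g. "iron_condor"
    expected_cost_net: float         # expected value net of every cost line item
    expected_risk_after_hedge: float
    expected_tax_treatment: str

    def __post_init__(self):
        # An approved verdict cannot name a blocking gate, and a blocked one must.
        if self.approved and self.blocked_by is not None:
            raise ValueError("approved verdict cannot carry a blocking gate")
        if not self.approved and self.blocked_by is None:
            raise ValueError("blocked verdict must name the gate")


# Example values are made up for illustration.
ok = Verdict(True, None, 2, "iron_condor", 1250.0, 18000.0, "business_income")
```

Keeping the dimensions separate means downstream code has to compromise across them explicitly instead of reading a single blended score.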

Real-time, but a particular flavor of it

The system has to respond inside the trading day, but it’s not HFT. The decision cycle runs at 3-second granularity for option-chain reads, with the alpha layer recomputing every poll, the regime layer every 10 minutes, the strategy selection every minute, and the execution layer firing only when all four layers agree. The latency budget for any single decision is “respond before the next chain update arrives,” which translates to roughly 1-2 seconds.

That’s slow enough that Python is the right language. It’s fast enough that careless code dies on it. The system spends most of its time waiting for I/O — the Upstox depth feed, the Cloud SQL writes, the Redis pub/sub — and the optimization work is in keeping that I/O bounded.

Adversarial environment

The other thing trading shares with very few other ML domains: every other participant in the market is actively trying to predict the same things you are, and some of them have more capital, more compute, and more information. There’s no “ground truth” you can collect more of; the truth is whatever clears the market, and the market includes you.

This shows up in three places architecturally:

The factor pool decays. What worked last quarter does not work this quarter. The factor registry includes a decay monitor; factors that drift below a Sharpe threshold get retired automatically.
The strategy mix has to rotate. A single strategy that prints well for a regime stops printing when the regime changes. Quantsentinel’s strategy selection layer is a contextual bandit conditioned on the regime classifier’s output, not a static policy.
Anything you ship is a signal to the market. Even at retail scale, an algorithmic system that takes correlated positions across tenants will move the order book against itself if it scales naively. The tenant isolation isn’t only a privacy property — it’s a market-impact property.

These five properties — state-space, non-stationarity, multi-objective, real-time, adversarial — are the technical reason quant infrastructure is hard. They’re also the reason the work is genuinely interesting if you like systems-level ML problems.

4. System architecture

The system is eleven services on Docker Compose on a single GCE VM (e2-standard-4), with the datastore on Cloud SQL Postgres 16 + TimescaleDB. Caddy fronts it with auto-HTTPS. The serving topology is straightforward; the interesting design lives in how state moves through it and how tenants are isolated from each other.

                +-------------+
                |    caddy    |  ← auto-HTTPS, basic_auth scoped to operator paths
                +-------------+
                      |
        +-------------+-------------+
        |                           |
+-------------+              +----------------+
|  frontend   |              |    gateway     |  ← per-tenant routing, auth, CSRF, rate limits
|  (Next 14)  |              |     (Go)       |
+-------------+              +----------------+
                                     |
   +------------+------------+-------+-------+------------+-----------------+
   |            |            |               |            |                 |
+--------+ +--------+ +-------------+ +--------------+ +--------+ +-------------------+
| broker | |   ml   | |intelligence | |   marketdata | |  news  | | tenant_adaptation |
|FastAPI | |FastAPI | |   FastAPI   | |    FastAPI   | |FastAPI | |     FastAPI       |
+--------+ +--------+ +-------------+ +--------------+ +--------+ +-------------------+
   |            |            |              |             |
   +------+-----+------------+--------------+-------------+
          |
   +------------------+         +------------------+
   |  Cloud SQL       |         |   Redis (cache,  |
   |  Postgres 16 +   |         |   tenant pub/sub)|
   |  TimescaleDB     |         +------------------+
   +------------------+

Plus four smaller services that aren’t load-bearing in the trade path: backtest, copilot (Gemini-driven narration), notification (Telegram + email + FCM fanout), risk (the Go health probe).

What each service owns

frontend — Next.js 14 App Router, server-rendered React, 34 pages. Per-tenant URL rewrites in middleware.ts translate /dashboard to /t/<slug>/dashboard based on the active session. No direct backend access — everything goes through the gateway’s BFF.

gateway — Custom Go service, no router framework, just net/http with ServeMux. Owns three things: (a) per-tenant path resolution (/t/<slug>/api/v1/... → /api/v1/... with the tenant id stamped on the outbound header), (b) authentication middleware that proxies to the broker’s auth router, and (c) tenant-scoped rate limiting (600 requests/hour per tenant on /pipeline/decide — the expensive one).
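The path-resolution step (a) can be sketched in a few lines. The gateway itself is written in Go; this is a Python rendering, to match the other code in this post, and it is not the gateway's source. The `TENANTS` table, the slug pattern, and the `resolve` helper are hypothetical.

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical slug -> tenant id table; the real gateway uses a cached lookup.
TENANTS: Dict[str, str] = {"acme": "tenant-001"}

# Assumed slug alphabet: lowercase letters, digits, and hyphens.
_TENANT_PATH = re.compile(r"^/t/(?P<slug>[a-z0-9-]+)(?P<rest>/api/v1/.*)$")


def resolve(path: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Rewrite /t/<slug>/api/v1/... to /api/v1/... and stamp the tenant header.

    Returns (upstream_path, headers), or None when the path shape or the
    slug is unknown, in which case the caller should reject the request.
    """
    match = _TENANT_PATH.match(path)
    if match is None:
        return None
    tenant_id = TENANTS.get(match.group("slug"))
    if tenant_id is None:
        return None
    return match.group("rest"), {"X-Tenant-Id": tenant_id}
```

Returning `None` for anything that does not match keeps the failure mode closed: a request without a resolvable tenant never reaches a backend service.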

broker — The mostly-misnamed service. It owns the auth router (login, invite acceptance, MFA, password reset, session refresh), the upstream OAuth integrations with Upstox and Groww, the order placement and reconciliation layer, the position/Greek/margin endpoints, and the playground (paper-trading) state machine. The dry-run lock — the rule that a new tenant can’t go live for the first 14 days even if they ask to — lives here, enforced at the database row level.

ml — The fat service. Hosts the 4-layer alpha engine, the 7-wall risk castle, the 17 ML systems, the live signal dashboard, the playground orchestrator hooks, and a long-running APScheduler running ten distinct cron jobs (regime poll every 10m, alpha-sources every 5m, factor IC daily, the weekly retrain crons, etc.). 70%+ of the production Python LOC lives here.

intelligence — Polling-only service. Hits the Upstox option-chain endpoint on a 3-second cadence per index, computes the IV surface, derives the OI flow, estimates GEX, and writes time-series snapshots to a TimescaleDB hypertable. ml reads from those snapshots; intelligence never serves a user-facing endpoint that returns derived values, only the raw snapshots.

marketdata — The websocket layer. Holds long-lived connections to the Upstox depth-30 feed and the Groww NATS session. Streams ticks to Redis. Subscribe happens at process start, which is why the daily Upstox token rotation has to recreate this container (the env doesn’t reload otherwise).

news — Multi-provider news ingestion (RSS, Upstox news, NewsAPI, Marketaux), classification, dedup, and event extraction. Writes to a TimescaleDB hypertable with the same poll cadence as intelligence.

tenant_adaptation — The per-tenant behavioral ML layer. This service runs the synthetic-persona orchestrator (20 personas, 3 DB schemas) that’s used to train each tenant’s behavioral model in shadow before any real money is touched. The service boots with a stripped-down env that explicitly excludes broker credentials; it refuses to start if it sees a real Upstox token in its environment.
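A startup guard of that kind can be very small. The sketch below is an assumption about its shape, not the service's code: the variable names in `FORBIDDEN_ENV_VARS` are hypothetical, since the post does not name the variables the real check inspects.

```python
import os
import sys
from typing import List, Mapping

# Hypothetical variable names, used only for illustration.
FORBIDDEN_ENV_VARS = ("UPSTOX_ACCESS_TOKEN", "GROWW_API_KEY")


def find_broker_credentials(env: Mapping[str, str]) -> List[str]:
    """Return the forbidden variables that are present and non-empty."""
    return [name for name in FORBIDDEN_ENV_VARS if env.get(name)]


def assert_no_broker_credentials(env: Mapping[str, str] = os.environ) -> None:
    """Abort startup if any broker credential leaked into this process's env."""
    leaked = find_broker_credentials(env)
    if leaked:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit(f"refusing to start: broker credentials present: {leaked}")
```

Calling `assert_no_broker_credentials()` as the first line of the service entrypoint makes a misconfigured deploy fail loudly at boot rather than run with credentials it should never hold.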

How a trade decision actually flows

Take a single /pipeline/decide call — the end-to-end alpha → strategy → risk → live-signal pass. Here’s what happens, in order:

1. gateway receives GET /t/<slug>/api/v1/pipeline/decide
2. gateway resolves the slug → tenant_id (5-min cache hit)
3. gateway stamps X-Tenant-Id: <id>, rate-limits, forwards to ml
4. ml dispatches to PipelineOrchestrator.decide(tenant_id=...)
5. Orchestrator pulls:
     - latest IV surface, OI flow, GEX from intelligence's snapshots
     - latest news-alpha + sentiment from news's snapshots
     - latest regime classification from the cached 10-min regime poll
     - latest vol forecast from the HAR-RV/GARCH ensemble cache
     - latest implied distribution from the Breeden-Litzenberger cache
6. Orchestrator runs the 4-layer alpha:
     Layer 1: signal generation (7 alpha modules)
     Layer 2: signal combination (weighted ensemble)
     Layer 3: opportunity scoring + structure selection
     Layer 4: portfolio context + sizing
7. Risk Castle evaluates 7 walls in sequence:
     wall 1: Regime gate
     wall 2: News gate
     wall 3: Risk policy gate
     wall 4: Margin gate
     wall 5: Portfolio correlation gate
     wall 6: Cost-engine gate (net EV must be > 0)
     wall 7: Tail-hedge requirement (if applicable)
8. If approved, the live signal payload is built:
     - strategy.structure_type, legs, max_profit, max_loss
     - dashboard_card_data with concrete strikes + entry zone + target + stop
     - narrative_text (Gemini-generated, cached)
     - payoff diagram points
9. The payload is published via Redis pub/sub on a per-tenant channel
10. Subscribers (dashboard SSE, Telegram, FCM, email) fan out per tenant prefs
11. Response returns to gateway, which proxies to frontend

End-to-end: 12-15 seconds on average for a cold pass, 1-2 seconds for a warm one. The cold pass cost is dominated by the orchestrator’s gather-all-the-snapshots step (step 5); the warm path hits Redis-cached snapshots. The 12-second cold time was the source of a gateway timeout bug I’ll describe in §10.
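Step 7 is a short-circuiting chain: the first wall that objects ends the evaluation and becomes the blocked-by reason. A minimal sketch of that control flow, with two stand-in walls; the context keys, the wall names, and the `evaluate_walls` helper are assumptions, not the production Risk Castle.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A wall takes the candidate trade context and returns True when the trade may pass.
Wall = Callable[[Dict[str, float]], bool]


def evaluate_walls(walls: List[Tuple[str, Wall]], ctx: Dict[str, float]) -> Optional[str]:
    """Run walls in order; return the name of the first wall that blocks, else None."""
    for name, check in walls:
        if not check(ctx):
            return name
    return None


# Two illustrative walls; the context keys are hypothetical.
WALLS: List[Tuple[str, Wall]] = [
    ("margin_gate", lambda c: c["margin_required"] <= c["margin_available"]),
    ("cost_engine_gate", lambda c: c["net_ev"] > 0),  # net EV must be > 0
]

blocked_by = evaluate_walls(
    WALLS, {"margin_required": 90_000.0, "margin_available": 150_000.0, "net_ev": -40.0}
)
```

Here `blocked_by` is `"cost_engine_gate"`: the margin wall passes, and the cost wall rejects the trade because its net expected value is negative.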

Key design decisions, with rationale

I’ll describe each in the same format: the decision, the alternatives considered, why this choice, what the tradeoff implies.

Python everywhere except the gateway. Alternatives considered: Go for ml as well, Rust for the alpha engine, a hybrid where the hot loop is Cython. Why Python: the 3-second decision cycle is far above the threshold where language matters, and the developer-velocity gap between Python and Go for numerics-heavy code is substantial. NumPy/Pandas + a few well-placed C extensions are within 5-10x of native C++ for the numerical paths, which is plenty of headroom at this latency. Tradeoff: I would absolutely choose Rust or C++ if this were HFT or required sub-millisecond decisions, and the choice would be wrong for a different problem. It’s right for this one.

Gateway in Go, not Node or Python. Alternatives considered: Express in Node, FastAPI as the gateway, nginx with Lua. Why Go: the gateway has near-zero numerical work, near-100% I/O, and needs to be solid for hours under concurrent load without long GC pauses. Go’s net/http, which serves each request on its own goroutine, is the right shape for this. Tradeoff: I lose some of the dev velocity of Python; in exchange I get a gateway that uses ~30MB resident and doesn’t need babysitting.

Multi-tenant from day one, not bolted on. Alternatives considered: single-tenant MVP, ship per-tenant later. Why now: retrofitting tenant isolation onto a single-tenant system is a known disaster. Every database query, every Redis key, every websocket room, every log line has to learn that it’s tenant-scoped, and the inevitable miss is the worst kind of leak (your tenants find out about it). Better to write the system with tenant_id in every signature from the first commit. Tradeoff: more friction in the early dev loop (you can’t just “show me all the positions” without picking a tenant); more correctness up front, fewer painful migrations later.

Cloud SQL Postgres, not self-managed. Alternatives: self-managed Postgres on the VM, RDS-equivalent, CockroachDB for the eventual multi-region story. Why Cloud SQL: GCP-native, the operator team is one person (me), the savings on having backups and HA managed for me are real, and I don’t yet need multi-region. Tradeoff: $80/month I wouldn’t otherwise spend; in exchange, the database is one operational concern I don’t have to think about.

TimescaleDB extension for the time-series tables. Alternatives: InfluxDB, ClickHouse, ad-hoc Postgres with manual partitioning. Why Timescale: the time-series workload is option-chain snapshots, news events, depth ticks — write-heavy but not analytical-OLAP. Timescale’s hypertables give me automatic chunking by time, retention policies as a config, and SQL compatibility with the rest of the system. Tradeoff: I gave up the ability to run cross-region analytical queries against this data without sampling. Acceptable.

Redis for inter-service pub/sub. Alternatives: NATS, Kafka, Postgres LISTEN/NOTIFY. Why Redis: I was already running it for caching; the pub/sub channel-per-tenant pattern fits naturally; the persistence story doesn’t matter because every signal is also written to Postgres for the durable record. Tradeoff: lose the at-least-once delivery semantics of Kafka. For a system where the durable record is the database row and pub/sub is the notification layer, that’s fine.

Astro for the editorial blog, separate from the main frontend. Alternatives: a /blog route on the Next.js app, MDX served from the main app. Why Astro: the blog is fully static, no auth, no per-user state. Astro produces tiny HTML, the build is sub-second, and the deployment is decoupled from the trading app’s deployment pipeline. Tradeoff: two separate deploy paths. Worth it.

5. The ML pipelines

The ML work is organized into twelve pipelines. Building twelve specialized ML pipelines taught me more about production ML than any single project I’d worked on. Each pipeline addresses a specific decision in the trading workflow. Some are well-established techniques (HAR-RV for volatility). Some required more novel framing (counterfactual labeling for adjustment classification, multi-output models for event impact). All require careful attention to validation, deployment, and ongoing monitoring.

I’ll walk through each with the same template: what it does, why it matters, how I implemented it, what worked, what didn’t. The first seven (vol forecasting, implied distribution, multi-timeframe coordination, cross-sectional alpha, statistical robustness, performance attribution, anomaly detection) were the foundation. The five after that (direction-prediction ensemble, optimal entry timing, probabilistic pin risk, wing selection, event-impact prediction) were added as the system matured and specific decision points showed measurable room for ML lift over the rule-based baselines.

The seventeen ML systems sit under these twelve pipelines. Each is in production behind the model registry described in §7; each goes through the same versioning, paired-evaluation, and rollback discipline.

Pipeline 1: Volatility forecasting (HAR-RV + GARCH ensemble)

What it does. Predicts the realized volatility of NIFTY over the next 1-day, 5-day, and 22-day horizons. The output feeds the strategy selection layer (which structures are appropriate for the predicted vol regime) and the position-sizing layer (vol-targeting the portfolio).

Why it matters. Almost every strategy decision is conditioned on a vol forecast. Iron condors print money in low realized vol; long straddles need realized vol to come in higher than implied. If the vol forecast is biased low, the system over-sizes premium-selling and gets carried out on the first vol spike. If it’s biased high, the system under-sizes everything and produces nothing.

How I implemented it. HAR-RV from Corsi (2009) as the workhorse — the model that decomposes daily realized vol into daily, weekly, and monthly components and regresses the next day’s vol on those three. GARCH(1,1) as the ensemble partner; the two have different failure modes. The combiner is a Bayesian model averaging layer that weights the two by their out-of-sample directional accuracy over a rolling 60-day window.

# Simplified version of the HAR-RV core
import pandas as pd
import statsmodels.api as sm

class HARRVModel:
    """
    Heterogeneous Autoregressive of Realized Volatility.
    Three components: daily, weekly (5d avg), monthly (22d avg).
    The shift(1) is the only thing standing between this and lookahead bias —
    a class of bug that's silent in backtests and devastating in production.
    """
    def __init__(self, daily=1, weekly=5, monthly=22):
        self.windows = (('daily', daily), ('weekly', weekly), ('monthly', monthly))
        self.results = None

    def prepare_features(self, rv_series):
        feats = pd.DataFrame(index=rv_series.index)
        for name, window in self.windows:
            feats[f'rv_{name}'] = rv_series.rolling(window).mean().shift(1)
        return feats.dropna()

    def train(self, rv_series):
        X = self.prepare_features(rv_series)
        y = rv_series.reindex(X.index)
        # HAC standard errors — serial correlation in vol residuals is the
        # textbook case for Newey-West; ignoring it produces falsely tight
        # confidence intervals and an over-confident production model.
        self.results = sm.OLS(y, sm.add_constant(X)).fit(
            cov_type='HAC', cov_kwds={'maxlags': 5}
        )
        return self.results

What worked. The HAR component is excellent at the 5-day horizon during stable regimes. R² out-of-sample sits in the 0.5-0.6 range on NIFTY 5-day realized vol over the last twelve months of held-out data. The HAC standard errors corrected a meaningful overconfidence in the original fit.

What didn’t. The model degrades sharply around expiry events — the Tuesday-Wednesday before weekly expiry — because the realized-vol series is dominated by gamma effects that the HAR components don’t capture. I added a jump_detector module that flags expiry-week observations and downweights them in the rolling fit, but it’s a patch, not a real solution. A proper fix would be a jump-diffusion decomposition; I haven’t built it.

The Friday-afternoon failure I opened with: the underlying realized-vol series had a regime change in the source feed that the rolling window didn’t recognize. The fix wasn’t in the model — it was a realized_vol.py rewrite that builds the series from raw tick data instead of inheriting the source feed’s calculation. Now the model owns the entire data pipeline from Upstox depth ticks down, which removes a class of “the source data definition changed quietly” bugs at the cost of having to maintain my own realized-vol implementation.
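The core of a self-owned realized-vol calculation is short: sum the squared log returns of evenly spaced intraday bars. The sketch below assumes the ticks have already been resampled to fixed-interval closes (5-minute bars, say); the function names and the 252-day annualization are assumptions, and this is not the contents of realized_vol.py.

```python
import math
from typing import Sequence


def realized_variance(prices: Sequence[float]) -> float:
    """Sum of squared log returns over one session's evenly spaced bar closes."""
    if len(prices) < 2:
        raise ValueError("need at least two prices")
    return sum(math.log(b / a) ** 2 for a, b in zip(prices, prices[1:]))


def annualized_realized_vol(prices: Sequence[float], trading_days: int = 252) -> float:
    """Annualize one day's realized variance and convert it to a volatility."""
    return math.sqrt(realized_variance(prices) * trading_days)
```

Owning this calculation pins the definition of the regression target in code, so a change of methodology shows up as a diff rather than as a silent shift in the feed.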

Pipeline 2: Implied distribution recovery (Breeden-Litzenberger)

What it does. Recovers the risk-neutral probability distribution of the underlying at expiry from the option chain. The method is the second derivative of the call price with respect to strike — Breeden and Litzenberger (1978) showed this analytically — and the implementation is just careful numerics on the chain.

Why it matters. The implied distribution is the only signal in the system that directly answers “what does the market currently price in for tail outcomes.” Single-strike IV is one number; the implied distribution is the full shape. A right-tail that’s pricing in a 5% chance of a +3% move while the rest of the distribution stays compressed is a different setup than a uniform IV bump.

How I implemented it. The chain comes in at 50-point strike spacing; that’s not dense enough for clean second differences. I fit a smoothing spline through the call-price-vs-strike curve, then take the analytical second derivative of the spline. The fit is constrained (the resulting density has to integrate to one and be non-negative), and I run an arbitrage check (arbitrage.py) that flags any violations of butterfly-spread or vertical-spread no-arb conditions before publishing the distribution.

The four moments come out of the distribution: mean, variance, skewness, kurtosis. The system also computes a “crash probability” — the integrated left-tail mass below a moving threshold — that feeds the news-gate as one of its triggers.
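To make the second-derivative relationship concrete, here is a minimal sketch on a synthetic chain. It deliberately swaps the smoothing spline for plain central finite differences, which only behaves well because the synthetic Black-Scholes prices are noise-free; the spot, volatility, and strike range are made-up inputs, and the helper names are hypothetical.

```python
import math
from statistics import NormalDist
from typing import List, Tuple

_N = NormalDist().cdf


def bs_call(spot: float, strike: float, vol: float, t: float, r: float = 0.0) -> float:
    """Black-Scholes call price, used here only to generate a synthetic chain."""
    d1 = (math.log(spot / strike) + (r + 0.5 * vol**2) * t) / (vol * math.sqrt(t))
    d2 = d1 - vol * math.sqrt(t)
    return spot * _N(d1) - strike * math.exp(-r * t) * _N(d2)


def implied_density(
    strikes: List[float], calls: List[float], t: float, r: float = 0.0
) -> List[Tuple[float, float]]:
    """Risk-neutral density at interior strikes: exp(r*t) * d2C/dK2."""
    h = strikes[1] - strikes[0]  # assumes an evenly spaced strike grid
    growth = math.exp(r * t)
    return [
        (strikes[i], growth * (calls[i + 1] - 2 * calls[i] + calls[i - 1]) / h**2)
        for i in range(1, len(strikes) - 1)
    ]


# Synthetic chain: 50-point strikes, 30 days to expiry, flat 15% vol, zero rate.
strikes = [float(k) for k in range(15000, 35001, 50)]
calls = [bs_call(24000.0, k, vol=0.15, t=30 / 365) for k in strikes]
density = implied_density(strikes, calls, t=30 / 365)
total_mass = sum(f for _, f in density) * 50.0  # Riemann sum over the strike grid
```

On this clean chain `total_mass` comes out very close to one. On a real chain the same differences amplify tick-size noise, which is the reason for fitting a constrained spline first.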

What worked. The distribution shape is informative; the moment estimates are stable enough to use as alpha-engine inputs. The arbitrage check catches genuine bad chains maybe twice a week (data-quality issues on the source feed) and prevents the system from publishing garbage.

What didn’t. At the deep wings, the option prices are tiny and dominated by tick-size, so the implied density has noise that the smoothing spline can’t fully absorb. I report moments truncated at the 1st and 99th percentiles of the distribution rather than over the full tail, which is an honest acknowledgment of where the signal degrades.

Pipeline 3: Multi-timeframe coordination

What it does. Combines four time horizons — tick, intraday, daily, weekly — into a single “consensus” signal with a confidence boost when they agree and a confidence haircut when they disagree.

Why it matters. Almost every “signal-only” approach to options trading produces wins on its preferred timeframe and losses on every other timeframe. The multi-timeframe layer is a discipline mechanism: the system doesn’t take a position unless the relevant horizons agree, and it sizes down when the agreement is partial.

How I implemented it. Each underlying signal generator emits a directional score in [-1, +1] at its native timeframe. The MultiTimeframeAggregator pulls all four, computes a weighted score with horizon-specific weights (intraday gets 65%, daily 20%, tick 10%, weekly 5%), and emits both the aggregated directional signal and a “consensus” object that includes the count of bullish/bearish/abstain signals and a “confidence boost” factor in [0, 1].

The output feeds directly into the alpha-score calculation in Layer 2 of the orchestrator. The confidence boost multiplies the raw alpha-score; mixed signals halve it.
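A stripped-down version of the aggregation looks like this. The horizon weights are the ones quoted above; the abstain band and the exact confidence rule (1.0 on agreement, 0.5 on mixed signals, 0.0 when every horizon abstains) are assumptions made for the sketch, and the real MultiTimeframeAggregator also returns the bullish/bearish/abstain counts.

```python
from typing import Dict, Tuple

# Horizon weights as quoted in the text.
WEIGHTS: Dict[str, float] = {"intraday": 0.65, "daily": 0.20, "tick": 0.10, "weekly": 0.05}


def aggregate(scores: Dict[str, float], abstain_band: float = 0.05) -> Tuple[float, float]:
    """Return (weighted directional score, confidence factor).

    scores maps each horizon to a directional score in [-1, +1]. Scores
    inside +/- abstain_band count as abstentions (an assumed convention).
    """
    weighted = sum(WEIGHTS[h] * scores[h] for h in WEIGHTS)
    bullish = sum(1 for h in WEIGHTS if scores[h] > abstain_band)
    bearish = sum(1 for h in WEIGHTS if scores[h] < -abstain_band)
    if bullish == 0 and bearish == 0:
        confidence = 0.0   # nothing to trade on
    elif bullish > 0 and bearish > 0:
        confidence = 0.5   # mixed signals halve the alpha score
    else:
        confidence = 1.0   # every non-abstaining horizon agrees
    return weighted, confidence


score, confidence = aggregate({"intraday": 0.8, "daily": -0.5, "tick": 0.0, "weekly": 0.0})
alpha = score * confidence
```

In this example the intraday and daily horizons disagree, so the weighted score of about 0.42 is halved before it reaches the combiner.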

What worked. The discipline is real. Position sizing is materially lower on mixed signals; the drawdown profile in shadow backtesting is tighter as a result.

What didn’t. The horizon weights are fixed, not learned. A proper version would condition the weights on the regime classifier’s output — intraday gets more weight in a trending regime, daily gets more weight in a range-bound regime. I haven’t built that; the fixed weights are a deliberate simplification with a real cost.

Pipeline 4: Cross-sectional alpha

What it does. Ranks the F&O-eligible single-stock universe (11 names by current configuration: RELIANCE, TCS, HDFCBANK, ICICIBANK, INFY, HINDUNILVR, SBIN, BHARTIARTL, ITC, LT, AXISBANK) on momentum and mean-reversion, and produces long/short candidates. The cross-sectional positions are dollar-balanced and sector-bucketed.

Why it matters. The index-only strategies (NIFTY straddles, BANKNIFTY iron condors) are the bulk of the alpha-engine attention, but cross-sectional single-stock alpha is the diversifier. When index vol compresses and the index strategies stop producing, the cross-section often has dispersion to trade.

How I implemented it. 15-day momentum (top-decile longs, bottom-decile shorts), 20-day mean-reversion with z-score entry at |z| > 1.3, and a sector neutralizer that prevents the system from going all-in on financials. The signals feed into the same Layer-2 combiner as the index signals, weighted lower (5% weight) because the per-stock liquidity is much thinner than NIFTY.
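The mean-reversion entry rule reduces to a z-score test. The sketch below applies it to closing prices and measures the latest close against the preceding 20 closes; whether the production signal uses prices or returns, and how it builds the baseline window, is not stated here, so treat those choices and the `reversion_signal` name as assumptions.

```python
from statistics import mean, stdev
from typing import Sequence


def reversion_signal(closes: Sequence[float], window: int = 20, entry_z: float = 1.3) -> int:
    """Return -1 (short), +1 (long), or 0 (no trade) from the latest z-score.

    The z-score compares the latest close with the mean and sample standard
    deviation of the preceding `window` closes, so the latest bar is not
    part of its own baseline.
    """
    if len(closes) < window + 1:
        return 0
    history = closes[-(window + 1):-1]
    sigma = stdev(history)
    if sigma == 0:
        return 0
    z = (closes[-1] - mean(history)) / sigma
    if z > entry_z:
        return -1   # stretched above the mean: fade it
    if z < -entry_z:
        return 1    # stretched below the mean: buy the reversion
    return 0
```

Excluding the latest bar from the baseline keeps a large move from inflating the standard deviation it is being measured against.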

What worked. The mean-reversion signal is the more reliable of the two. The 20-day z-score reversal is a well-documented effect in Indian large-cap equities, and the system captures it cleanly.

What didn’t. The momentum signal degrades during high-vol regimes — the trend-following property of momentum reverses when vol spikes — and the regime conditioning isn’t strong enough. I have a regime_conditional flag that downweights momentum when the regime classifier says “HIGH_VOL”, but the threshold isn’t well-tuned. The position sizing on the cross-section is small enough that the cost is contained, but it’s a real cost.

Pipeline 5: Statistical robustness (DSR, PSR, CSCV)

What it does. Computes the Deflated Sharpe Ratio, Probabilistic Sharpe Ratio, and the Combinatorially Symmetric Cross-Validation overfit probability for every strategy that the system runs. The numbers are published on the Performance page and feed the strategy-decay monitor.

Why it matters. Sharpe is a famously gameable metric. A strategy with a high in-sample Sharpe and a high CSCV overfit probability may be entirely noise. The DSR adjusts the Sharpe estimate for the number of trials run against it; the PSR gives the probability that the true Sharpe exceeds a benchmark, given the sample length and the skew and kurtosis of the returns; the CSCV directly estimates the probability that the in-sample winner ranks below the median out of sample. All three together are a much stronger signal than any one alone.

How I implemented it. Bailey-López de Prado (2014) for DSR/PSR. CSCV requires a sufficiently long return series to chunk meaningfully; the implementation uses sliding windows once a strategy has at least 60 trading days of returns. The factor pool’s decay_monitor reads these numbers daily and tags factors whose DSR drops below a configured floor.
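For orientation, here is a sketch of the Probabilistic Sharpe Ratio as it is usually stated in the Bailey and López de Prado papers. It works in per-period (non-annualized) units and uses population moments; the function name and defaults are assumptions, not the platform’s code.

```python
from statistics import NormalDist, fmean

def probabilistic_sharpe_ratio(returns, benchmark_sr=0.0):
    """Probability that the true Sharpe exceeds benchmark_sr."""
    n = len(returns)
    mu = fmean(returns)
    # Central moments of the return series.
    m2 = fmean((r - mu) ** 2 for r in returns)
    m3 = fmean((r - mu) ** 3 for r in returns)
    m4 = fmean((r - mu) ** 4 for r in returns)
    sd = m2 ** 0.5
    sr = mu / sd
    skew = m3 / sd ** 3
    kurt = m4 / sd ** 4  # non-excess kurtosis (3 for a Gaussian)
    # The Sharpe estimator's standard error grows with negative skew and fat tails.
    denom = (1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2) ** 0.5
    z = (sr - benchmark_sr) * (n - 1) ** 0.5 / denom
    return NormalDist().cdf(z)
```

The DSR uses the same expression with the benchmark raised to the Sharpe one would expect from the best of N unskilled trials, which is why it needs the trial count as an input.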

What worked. The decay monitor has retired two factors in the last week that looked good on in-sample backtests but had high CSCV overfit probabilities. That’s the right outcome; a system without this check would have left them in the pool and degraded.

What didn’t. The minimum sample size for CSCV is a real constraint. New factors don’t get a CSCV estimate until day 60, and during the warm-up window I’m using only DSR + qualitative review. That’s a known gap.

Pipeline 6: Real-time performance attribution

What it does. Decomposes each day’s portfolio P&L into per-factor contributions and per-strategy contributions. The output is the Performance page’s attribution chart and the daily postmortem that gets cached in the database.

Why it matters. “I made money today” is not actionable. “I made money today because the IV surface compressed faster than my vol forecast expected, but I lost money on cross-sectional momentum because financials decoupled” is actionable. Attribution turns daily noise into a feedback signal that the strategy-selection layer can act on.

How I implemented it. The factor_contribution.py and pnl_decomposer.py modules under cross_cutting/performance_attribution/. The decomposition is additive across factors: each factor’s contribution is its exposure × its realized return, summed over the day. The attribution is reconciled against the actual P&L; any discrepancy above a 5% threshold flags the day as “unattributed” and sends it to a manual review queue.
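The reconciliation step might look roughly like this. The sketch assumes the 5% threshold is measured relative to the absolute actual P&L, which the post doesn’t specify; the function and key names are hypothetical.

```python
def reconcile_attribution(exposures, factor_returns, actual_pnl, tolerance=0.05):
    """Sum exposure-times-return contributions and compare with the actual P&L."""
    contributions = {f: exposures[f] * factor_returns[f] for f in exposures}
    explained = sum(contributions.values())
    residual = actual_pnl - explained
    # Relative residual, guarded against a near-zero P&L day.
    relative = abs(residual) / max(abs(actual_pnl), 1e-9)
    return {
        'contributions': contributions,
        'explained': explained,
        'residual': residual,
        'unattributed': relative > tolerance,
    }
```

A day flagged as unattributed keeps its per-factor numbers; the flag only says the residual is too large to trust them.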

What worked. The attribution is honest about uncertainty. When the system can’t cleanly attribute a P&L move (typically when multiple factors move in opposite directions), it says so rather than fitting a clean story.

What didn’t. The attribution is fundamentally a linear decomposition over a non-linear payoff. Options P&L has gamma effects that aren’t captured by an exposure-times-return formulation. The attribution is approximately correct on small moves and increasingly approximate on large moves. The 5% reconciliation threshold catches the worst cases but doesn’t fix the structural issue.

Pipeline 7: Anomaly detection

What it does. Watches the live signal stream and the live P&L stream for anomalies — sudden shifts in alpha-score distribution, days with attribution residuals larger than the historical 99th percentile, drift in the factor pool’s IC distribution. Flags get escalated to the trade-postmortem system, which writes a structured root-cause draft.

Why it matters. Most production-ML failure modes are silent. A model degrades gradually; the system keeps making decisions; the decisions slowly stop making money. Without an anomaly detection layer, the first signal that something has changed is a string of losing days, which is far too late.

How I implemented it. A simple but effective approach: rolling z-score thresholds on the four key indicators (alpha-score distribution, P&L residual size, IC IQR, hit-rate). Any indicator breaching the threshold for three consecutive windows triggers an alert. The alert is structured — it contains the indicator name, the breach magnitude, the timestamp, the affected factors, and the system’s automatic hypothesis about cause — and is written to a queue that the operator (me, currently) reviews.
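A sketch of the rolling z-score rule with the three-consecutive-breach requirement. The window length and z threshold are placeholders, and excluding breaching values from the baseline is a design assumption of this sketch rather than something the post states.

```python
from collections import deque
from statistics import fmean, pstdev

class ConsecutiveBreachDetector:
    """Fires when an indicator breaches its rolling z-score threshold N times in a row."""

    def __init__(self, window=20, z_threshold=3.0, consecutive=3):
        self.history = deque(maxlen=window)
        self.z_threshold = z_threshold
        self.consecutive = consecutive
        self.streak = 0

    def update(self, value):
        """Feed one observation; return True when an alert should fire."""
        breached = False
        if len(self.history) == self.history.maxlen:
            mu, sd = fmean(self.history), pstdev(self.history)
            breached = sd > 0 and abs(value - mu) / sd > self.z_threshold
        self.streak = self.streak + 1 if breached else 0
        # Only non-breaching values enter the baseline, so an ongoing
        # anomaly does not inflate its own reference window.
        if not breached:
            self.history.append(value)
        return self.streak >= self.consecutive
```

One detector instance per indicator keeps the four streams independent; the structured alert payload would be assembled by the caller.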

What worked. The alerting cadence is about right: roughly two alerts per week, which is enough that I look at each one but not so many that I tune them out. Two of the four alerts in the last fortnight were real degradation; the other two were false positives on regime transitions.

What didn’t. The “automatic hypothesis about cause” is currently a templated string, not a learned classifier. A more mature system would correlate the breach with the recent code-change history, the recent factor-pool changes, and the recent regime changes to produce a ranked list of likely causes. I haven’t built that.

Pipeline 8: Direction Prediction Ensemble

I noticed a pattern early in shadow backtests: the weighted-average signal combiner was producing decent directional predictions in trending markets but failing in transitional regimes. The math was correct — each signal had its weight calibrated from historical IC analysis. But the weighted average was treating each signal as if its reliability were constant. In reality, signal reliability varies dramatically by regime — the cross-market signal that’s gold during sustained moves is noise during chop, and the implied-distribution signal that catches reversals is silent during clean trends.

That’s the gap the Direction Prediction Ensemble fills: instead of fixed weights, a LightGBM model learns how to combine the seven Layer-1 signals adaptively based on current market context, with a separate confidence head and SHAP-based explainability.

class DirectionPredictionEnsemble:
    """
    Two-stage learned ensemble: direction + confidence, with SHAP.
    Replaces the rule-based weighted combiner where signal reliability
    is conditional on regime rather than constant.
    """

    def __init__(self) -> None:
        self.direction_model = None    # LightGBM regressor, target = next-period log-return
        self.confidence_model = None   # LightGBM classifier, target = was the direction call right?
        self._shap_explainer = None

    def predict(self, signals, market_context) -> DirectionPrediction:
        features = self._build_features(signals, market_context)

        direction_score = float(self.direction_model.predict([features])[0])
        confidence = float(self.confidence_model.predict_proba([features])[0][1])

        # SHAP for explainability — critical for trust in production.
        # A black-box ensemble that says "trust me" isn't useful when a
        # tenant later asks why the system entered a 4-lot straddle.
        contributions = self._shap_explainer.shap_values([features])[0]

        return DirectionPrediction(
            direction_score=direction_score,
            confidence=confidence,
            top_contributing_signals=self._top_drivers(contributions, signals),
        )

What worked. Directional accuracy moved from ~60% (rule-based weighted combiner) to ~64% in out-of-sample evaluation: meaningful but not transformative. The bigger win was confidence calibration: on a reliability diagram the confidence head tracks the diagonal closely, so its stated confidence matches its realized hit rate. Knowing when not to trust the prediction turned out to be more valuable than the marginal accuracy gain.

What didn’t. LightGBM doesn’t extrapolate. In regimes that look unlike anything in the training set, the trees still route the input to an existing leaf, so the model returns a prediction from inside its training distribution rather than admitting it’s out of domain. A conformal-prediction wrapper on top would be the right fix; I haven’t built it yet.

Pipeline 9: Optimal Entry Timing Model

Within any approved entry window — typically a 15-minute window between Layer-3 approval and the strategy expiring as stale — there’s a question of timing. Enter immediately and you may get worse fills than waiting 5 minutes. Wait too long and the regime shifts or the window closes. The decision is microstructure-dependent (current spread, recent order-flow), time-of-day-dependent (the open and the close are systematically different), and signal-strength-dependent (a strong signal can afford to pay up for immediate entry; a weak one shouldn’t).

class OptimalEntryTimingModel:
    """
    Four-class classifier over the entry window.
    Trained on counterfactual analysis of historical entries:
    for each past trade, what would the optimal action have been?
    """

    OUTPUT_CLASSES = ('ENTER_NOW', 'WAIT_5_MIN', 'WAIT_10_MIN', 'SKIP_WINDOW')
    CONFIDENCE_FLOOR = 0.45  # below this, don't try to be clever

    def recommend_timing(self, signal, market_state) -> TimingRecommendation:
        features = self._extract_microstructure_features(market_state, signal)
        probas = self.model.predict_proba([features])[0]
        recommended = self.OUTPUT_CLASSES[probas.argmax()]
        confidence = float(probas.max())

        # Critical: don't over-optimize when uncertain. A weak signal
        # from the timing model is worse than no signal at all because
        # it adds latency without producing better fills.
        if confidence < self.CONFIDENCE_FLOOR:
            return TimingRecommendation(
                action='ENTER_NOW',
                confidence=confidence,
                reasoning='timing-model confidence below floor — default to immediate entry',
            )

        return TimingRecommendation(action=recommended, confidence=confidence)

What worked. Entry-price improvement of 0.5-1.5% across the strategies where timing matters most (straddles, premium-selling structures). The graceful-degradation pattern — below 0.45 confidence, fall back to immediate entry — is a general principle I now use everywhere. ML for clear recommendations, simple defaults when uncertain.

What didn’t. Counterfactual labels are noisier than I expected (see §10 for the longer version). The “optimal action” for a historical trade depends on assumptions about slippage and execution that themselves have uncertainty. I trained under multiple assumption regimes and weighted the loss to reflect the uncertainty, which helped but didn’t eliminate the problem.

Pipeline 10: Pin Risk Probabilistic Predictor

Pin risk on expiry day is the failure mode that quietly eats premium-selling strategies. The rule-based version of pin risk that I shipped first — PinRiskCalculator in services/ml/strategy_selection/expiry_day_strategies/ — looks at distance-to-strike and OI concentration, and produces a low/medium/high label. It catches the obvious cases. It misses the subtle ones.

The illustrative incident: a Thursday expiry where the system was holding three iron condors with shorts at 24700 CE and 24300 PE. At 2:45 PM, NIFTY was at 24,615 — comfortably in the profit zone. By 3:25 PM, it had moved to 24,698 and pinned. The 24700 CE short went from comfortable to threatened in 40 minutes; we exited at ~2.5× the usual stop. The rule-based pin risk score had said we were safe. The reality was that 24700 had ~4× more open interest than any other in-range strike, and the OI concentration was telling a story the rules didn’t capture.

class PinRiskProbabilisticPredictor:
    """
    LightGBM model trained on per-strike pinning history.
    Captures patterns the rule-based PinRiskCalculator misses:
    round-number bias, OI-asymmetry effects, intraday flow patterns.
    """

    def predict_pin_probabilities(self, market_state) -> PinAssessment:
        # Generate candidate strikes within +/- 2% of spot
        candidate_strikes = self._get_candidates(market_state.spot, range_pct=0.02)

        predictions = []
        for strike in candidate_strikes:
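            features = self._extract_strike_features(strike, market_state)  # dict: name -> value
            # Assumes the dict's key order matches the model's training column order.
            pin_prob = float(self.model.predict_proba([list(features.values())])[0][1])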
            predictions.append({
                'strike': strike,
                'pin_probability': pin_prob,
                'distance_pts': abs(market_state.spot - strike),
                'oi_share': features['oi_share'],
            })

        return PinAssessment(
            most_likely_pin_strike=max(predictions, key=lambda x: x['pin_probability'])['strike'],
            top_3_candidates=sorted(predictions, key=lambda x: -x['pin_probability'])[:3],
            distribution=predictions,
        )

What worked / what I learned. The first thing the model surfaced was that round-number strikes — strikes ending in 00 — had materially higher pin rates than adjacent strikes even when OI was similar. That pattern wasn’t in any rule-based heuristic I’d seen documented; it shows up because retail OI clusters on round numbers and dealer hedging concentrates there. Folding the round-number indicator into the position-sizing decision reduced expiry-day blowouts in shadow backtesting. The broader lesson: domain heuristics are valuable but often miss subtle patterns that even a modest tree-based model can pick up.

Pipeline 11: Wing Selection Optimizer

Iron condors and butterflies have short strikes (where you collect premium) and long strikes — wings — where you pay for protection. The conventional retail approach is fixed-distance wings: 100 points OTM from the short strikes. The conventional approach is wrong often enough to be expensive. Optimal wing placement varies with the skew at that point in the curve, the liquidity at candidate strikes, and the specific risk profile being constructed.

class WingSelectionOptimizer:
    """
    XGBoost regressor over candidate wing distances, predicting
    structure return net of frictions. Optimizes the cost-protection
    tradeoff per market context rather than using a fixed distance.
    """

    CANDIDATES = (50, 75, 100, 125, 150, 200)

    def select_optimal_wing(self, short_strike, structure_type, market_state) -> WingRecommendation:
        evaluations = []
        for distance in self.CANDIDATES:
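            features = self._build_features(short_strike, distance, structure_type, market_state)  # dict: name -> value
            # Assumes the dict's key order matches the model's training column order.
            predicted_net_return = float(self.model.predict([list(features.values())])[0])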
            evaluations.append({
                'wing_distance': distance,
                'predicted_net_return': predicted_net_return,
                'estimated_cost': self._wing_cost(short_strike, distance, market_state),
                'liquidity_score': features['wing_liquidity_score'],
            })

        best = max(evaluations, key=lambda x: x['predicted_net_return'])
        return WingRecommendation(**best, alternatives=evaluations)

What worked. Wing optimization improved structure returns by ~5-10% on average via better skew capture and explicit avoidance of illiquid wings. The improvement concentrated in unusual market conditions — skew-steep regimes, post-event environments — where the conventional 100-point heuristic underperformed.

What didn’t. The model is opinionated about wings but doesn’t model the interaction with the underlying directional view. A wider wing on a structure that the directional model is bearish on is a different bet than the same wider wing on a neutral structure; the wing-selection model doesn’t see this. A joint formulation would be cleaner; the current factorization is a pragmatic split.

Pipeline 12: Event Impact Magnitude Predictor

The initial event handling was conservative: blanket blackouts during scheduled events (RBI policy, Fed announcements, US CPI prints, Indian Budget). Then I started tracking what would have happened if the system had traded through. Many events that produced a blackout moved markets less than 0.5%. The system was leaving opportunity on the table to avoid risk that, ex post, wasn’t there.

The problem isn’t event recognition — every news provider tags those — it’s magnitude prediction. Some RBI policy announcements move NIFTY 2%. Some move it 0.3%. Some US CPI prints matter for Indian markets. Some don’t. The right behavior is graduated, not binary.

class EventImpactMagnitudePredictor:
    """
    Three coupled models per event:
      magnitude_model  — regression: expected |move| in pct
      direction_model  — classification: up/down/neutral
      iv_change_model  — regression: expected post-event IV change

    Separate models because the targets have different noise
    structures and benefit from different feature engineering.
    """

    def predict_event_impact(self, event, market_state) -> EventImpactPrediction:
        features = self._build_features(event, market_state)
        magnitude_pct = float(self.magnitude_model.predict([features])[0])

        return EventImpactPrediction(
            expected_magnitude_pct=magnitude_pct,
            magnitude_confidence=self._magnitude_confidence(features),
            direction_probabilities=self._predict_direction(features),
            expected_iv_change=float(self.iv_change_model.predict([features])[0]),
            recommended_action=self._action_from_magnitude(magnitude_pct),
        )

    def _action_from_magnitude(self, magnitude_pct: float) -> str:
        # Graduated response — the design choice that made the
        # difference between "we trade through everything except disasters"
        # and "we blackout for everything that might move 1%".
        if magnitude_pct < 0.5: return 'PROCEED_NORMALLY'
        if magnitude_pct < 1.0: return 'REDUCE_SIZE_25_PCT'
        if magnitude_pct < 1.5: return 'REDUCE_SIZE_50_PCT'
        return 'BLACKOUT_24H'

What worked. The graduated-response mapping was the most useful design decision. Replacing the binary trade/blackout switch with four sizing buckets recovered meaningful capacity on event days — most events don’t actually justify a blackout. The model is conservative on residual uncertainty: when the magnitude prediction is below its confidence floor, the system defaults to the next-stricter bucket.
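The “next-stricter bucket” fallback can be sketched as a thin layer over the same thresholds. The confidence floor of 0.6 is a placeholder (the post doesn’t give the value) and the function name is hypothetical.

```python
from bisect import bisect_right

ACTIONS = ('PROCEED_NORMALLY', 'REDUCE_SIZE_25_PCT', 'REDUCE_SIZE_50_PCT', 'BLACKOUT_24H')
THRESHOLDS = (0.5, 1.0, 1.5)  # same cut points as _action_from_magnitude

def action_with_confidence(magnitude_pct, confidence, floor=0.6):
    """Map predicted magnitude to an action, stepping one bucket stricter when unsure."""
    idx = bisect_right(THRESHOLDS, magnitude_pct)
    if confidence < floor:
        # Low confidence never loosens the response; it tightens it by one step.
        idx = min(idx + 1, len(ACTIONS) - 1)
    return ACTIONS[idx]
```

Because the adjustment only ever moves toward BLACKOUT_24H, a poorly calibrated magnitude model costs capacity rather than adding risk.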

What didn’t. The training set per event type is small: the RBI’s policy committee meets about six times a year, and there is one Union Budget a year (two in an election year, counting the interim budget). The magnitude target is heavily skewed, with “small move” as the dominant outcome, so the regressor mostly learns to predict small moves. A meta-learning approach across event types (borrowing strength across RBI, Fed, CPI, Budget) would help. I haven’t built it.

6. The closed-loop learning systems

Pipeline 5 above (statistical robustness) starts to overlap with the closed-loop story; this section is the explicit version. A closed-loop system is one where the model’s performance in production feeds back into the model itself — the model adapts, retrains, or is replaced based on what it’s currently doing in the wild.

The hard problems in closed-loop ML are well-known and underappreciated:

Catastrophic forgetting. A neural model fine-tuned on the latest data forgets the patterns from older data. In trading, this is fatal — the older data contains the rare-regime patterns that the model needs to remember for the next time those regimes recur.
Distribution shift detection. Knowing when to retrain vs. when to stay put. Retraining on noise is worse than not retraining at all.
Safe deployment. A newly retrained model needs to be evaluated against the production model before it’s promoted; otherwise the promotion is a coin flip on whether you just made the system better or worse.
Rollback. When the newly promoted model underperforms, you need to be able to swap it back without losing the work the system has done in the meantime.

Quantsentinel has six closed loops, each addressing a specific subset of these problems. I’ll walk through each.

Loop 1: Deep hedging fine-tuning with EWC

The deep-hedging model — the neural net that learns the optimal hedge ratio for a portfolio under transaction costs — is the most aggressive online-learner in the system. It fine-tunes daily on the previous day’s realized hedge-error signal. The catastrophic-forgetting problem hits this loop hardest: the model needs to remember the hedge dynamics from prior vol regimes even as it adapts to the current one.

The solution is Elastic Weight Consolidation (Kirkpatrick et al., 2017). For each parameter in the network, EWC stores the parameter’s value at the end of the previous training phase, and a Fisher Information estimate of how important that parameter was to the previous task. The next training phase adds a quadratic penalty on changes to high-importance parameters — the model can move them, but only when the new task’s gradient is strong enough to overwhelm the EWC term.
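Written out, the penalized objective has the following form, where \theta^{*} is the snapshot of the weights after the previous phase and F_i is the diagonal Fisher estimate for parameter i. The notation is mine; Kirkpatrick et al. write the penalty with \lambda/2, and the version here folds the factor of one half into \lambda.

```latex
\mathcal{L}(\theta) = \mathcal{L}_{\text{new}}(\theta)
  + \lambda \sum_i F_i \left(\theta_i - \theta_i^{*}\right)^2,
\qquad
F_i \approx \mathbb{E}\!\left[\left(\frac{\partial \log p(y \mid x; \theta)}{\partial \theta_i}\right)^{2}\right]_{\theta = \theta^{*}}
```

A large F_i makes parameter i expensive to move, so the new task’s gradient has to be correspondingly strong to shift it.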

import torch

class EWCRegularizer:
    """
    Elastic Weight Consolidation. Prevents catastrophic forgetting by
    penalizing changes to weights that were important for previous tasks.
    The Fisher Information diagonal approximation is the practical version;
    the full Fisher is intractable for any reasonable network.
    """
    def __init__(self, lambda_reg: float = 400.0):
        self.lambda_reg = lambda_reg
        self.fisher: dict[str, torch.Tensor] = {}
        self.optimal: dict[str, torch.Tensor] = {}

    def consolidate(self, model: torch.nn.Module, data_loader) -> None:
        """Snapshot current weights + estimate Fisher diagonal."""
        fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
        model.eval()
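        for batch in data_loader:
            model.zero_grad()
            # compute_log_likelihood is a task-specific helper, assumed defined elsewhere.
            logp = compute_log_likelihood(model(batch.x), batch.y)
            logp.backward()
            # Squared batch gradient: a coarse stand-in for the per-sample
            # squared gradients of the exact empirical Fisher diagonal.
            for n, p in model.named_parameters():
                if p.grad is not None:
                    fisher[n] += p.grad.detach() ** 2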
        for n in fisher:
            fisher[n] /= len(data_loader)
        self.fisher = fisher
        self.optimal = {n: p.detach().clone() for n, p in model.named_parameters()}

    def penalty(self, model: torch.nn.Module) -> torch.Tensor:
        """Add to training loss as a regularizer."""
        loss = torch.tensor(0.0, device=next(model.parameters()).device)
        for n, p in model.named_parameters():
            if n in self.fisher:
                loss = loss + (self.fisher[n] * (p - self.optimal[n]) ** 2).sum()
        return self.lambda_reg * loss

What worked. The hedge-error degradation that I saw on the unconstrained fine-tuning loop disappeared once EWC was added. The model retains the gamma-hedging behavior from high-vol regimes even when the current regime is range-bound.

What didn’t. The Fisher diagonal is an approximation, and on the deeper layers of the network it’s a poor one — the off-diagonal terms matter. λ = 400 is the value I’m running; that came from a half-day of grid search and it’s almost certainly not the right value across all regimes. A regime-conditional EWC term would be the right next step.

Loop 2: RL strategy selector

The strategy-selection layer chooses which of the available option structures (iron condor, debit spread, straddle, strangle, single-leg, futures-directional) is appropriate for the current state. This is a contextual bandit problem at heart: the context is the regime + the alpha-engine output + the current portfolio state, the arms are the structures, and the reward is the realized P&L of the trade.

I considered three formulations: contextual bandit (Thompson sampling), full RL with PPO, and a regime-conditional supervised classifier. The supervised classifier loses the exploration property — it’ll only ever choose what it’s seen work, which is fatal in a non-stationary environment. PPO is overkill for a single-step decision and adds a credit-assignment problem (what reward attaches to what decision?) that’s hard to solve cleanly in this domain.

The implementation is a contextual Thompson-sampling bandit with a Bayesian linear regression on the per-structure reward, conditioned on the regime. The regime_conditional.py module partitions the arms by regime, so each regime maintains its own posterior over arm rewards. The exploration rate is implicit in the posterior variance — when an arm has been pulled rarely in a regime, its posterior is wide, and Thompson sampling will explore it more often.
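A compact sketch of regime-partitioned Thompson sampling over Bayesian linear regression posteriors, using NumPy. It assumes a known noise variance and an isotropic Gaussian prior; the class names are hypothetical and the real regime_conditional.py may differ.

```python
import numpy as np

class LinearThompsonArm:
    """Gaussian posterior over one arm's reward weights (Bayesian linear regression)."""

    def __init__(self, dim, prior_precision=1.0, noise_var=1.0):
        self.A = prior_precision * np.eye(dim)  # posterior precision matrix
        self.b = np.zeros(dim)                  # precision-weighted mean
        self.noise_var = noise_var

    def sample_reward(self, x, rng):
        cov = np.linalg.inv(self.A)
        w = rng.multivariate_normal(cov @ self.b, cov)  # one posterior draw
        return float(w @ np.asarray(x, dtype=float))

    def update(self, x, reward):
        x = np.asarray(x, dtype=float)
        self.A += np.outer(x, x) / self.noise_var
        self.b += x * reward / self.noise_var

class RegimeConditionalBandit:
    """Keeps a separate posterior per (regime, arm) pair."""

    def __init__(self, arms, dim, seed=0):
        self.arms, self.dim = tuple(arms), dim
        self.rng = np.random.default_rng(seed)
        self.posteriors = {}

    def _get(self, regime, arm):
        key = (regime, arm)
        if key not in self.posteriors:
            self.posteriors[key] = LinearThompsonArm(self.dim)
        return self.posteriors[key]

    def select(self, regime, x):
        # Rarely pulled arms have wide posteriors, so their draws vary more
        # and they win the argmax more often: exploration comes for free.
        draws = {a: self._get(regime, a).sample_reward(x, self.rng) for a in self.arms}
        return max(draws, key=draws.get)

    def update(self, regime, arm, x, reward):
        self._get(regime, arm).update(x, reward)
```

Partitioning by regime means evidence gathered in a range-bound market never sharpens the posterior used in a high-vol one, at the cost of slower learning per regime.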

There’s also a diverse_reward.py layer that adds a diversification bonus — pulling the same arm five times in a row gets a small penalty regardless of the realized P&L, to prevent the bandit from locking into a strategy that worked in the recent past but is now an over-concentration risk.

What worked. The bandit explores enough to surface new arms (the futures-directional arm was unused for the first three days, then started getting selected when the conviction-band router fed it the high-confidence directional signals). The regime conditioning keeps the iron-condor arm from being pulled in high-vol regimes where it loses money.

What didn’t. The Bayesian linear regression assumes the reward is approximately Gaussian, which it isn’t — options P&L has heavy tails. The posterior is honest about uncertainty in the mean but underestimates the variance, which means the bandit explores less than it should during tail events. A Student-t or empirical-Bayes reformulation would be the right fix; I haven’t built it.

Loop 3: Factor pool evolution with LLM mining

The alpha-discovery factor pool starts with ~30 seed factors (combinations of moving averages, OI deltas, IV surface features, term-structure features, etc.) and evolves over time. New candidate factors come from two sources: a hand-coded seeds.py registry, and an LLM miner that uses Gemini 2.5 Pro to propose new factor expressions based on the recent IC distribution of the existing pool.

The LLM miner is constrained: it can only produce expressions in a restricted DSL over the available features. Each proposal goes through three filters before it joins the pool — a static lint check (does the expression compile? does it reference real features?), a backtest filter (does it have a Sharpe > 0.3 over the last 90 days?), and a CSCV overfit-probability check (does it have less than 0.7 estimated overfit probability?). Roughly one in twenty LLM proposals clears all three; the rest are logged for inspection.

# Simplified version of the alpha-validation gate
from dataclasses import dataclass
from datetime import datetime

@dataclass
class AlphaCandidate:
    expression: str
    proposed_by: str  # 'llm' | 'seed' | 'manual'
    proposed_at: datetime

class AlphaValidator:
    """
    Three-stage filter for any factor entering the pool.
    The order matters: cheap lint check first, expensive CSCV last.
    """
    def validate(self, candidate: AlphaCandidate) -> ValidationResult:
        # Stage 1: lint (microseconds)
        if not self._compiles(candidate.expression):
            return ValidationResult.rejected('does not compile')
        # Stage 2: backtest (seconds)
        bt = self._backtest(candidate.expression, days=90)
        if bt.sharpe < 0.3:
            return ValidationResult.rejected(f'sharpe {bt.sharpe:.2f} below floor')
        # Stage 3: overfit check (slowest stage; CSCV is expensive)
        cscv = self._compute_cscv(bt.returns)
        if cscv.overfit_probability > 0.7:
            return ValidationResult.rejected(f'cscv overfit p={cscv.overfit_probability:.2f}')
        return ValidationResult.accepted(sharpe=bt.sharpe, cscv=cscv)

The factor pool’s decay_monitor runs daily — every factor in production gets its IC re-estimated on the latest 30 days, and any factor whose IC drops below 0.05 (a low but non-zero threshold) gets demoted to “shadow” status. A shadow factor still gets its returns tracked but doesn’t contribute to live decisions.
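A sketch of the demotion rule, assuming IC means the Spearman rank correlation between factor values and forward returns over the evaluation window. The post doesn’t say how IC is computed, so treat that and the helper names as assumptions; ties are not handled.

```python
from statistics import fmean

def _ranks(values):
    """Ordinal ranks (0 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    for rank, idx in enumerate(order):
        ranks[idx] = float(rank)
    return ranks

def rank_ic(factor_values, forward_returns):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = _ranks(factor_values), _ranks(forward_returns)
    mx, my = fmean(rx), fmean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

def factor_status(factor_values, forward_returns, ic_floor=0.05):
    """'live' while IC holds at or above the floor, otherwise 'shadow'."""
    return 'live' if rank_ic(factor_values, forward_returns) >= ic_floor else 'shadow'
```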

What worked. The pool is growing organically. Two of the current production factors came from LLM proposals; one was an idea I’d have thought of eventually, the other (a term-structure-conditioned OI-delta) is something I wouldn’t have constructed myself.

What didn’t. The LLM proposes a lot of subtly-broken factors — expressions that look reasonable but are dimensionally inconsistent (mixing log-returns with raw price changes, etc.). The lint check catches some of these but not all. A proper type system over the DSL would catch them at lint time; I’m running with the leakier filter for now.

Loop 4: Ensemble weight adaptation with Bayesian model averaging

The Layer-2 signal combiner is a weighted ensemble over the seven alpha signals. The weights are not fixed — they adapt via Bayesian model averaging, where each signal’s weight is proportional to its posterior probability of being the “correct” signal generator given the recent observed outcomes.

The implementation uses conjugate priors (Beta-Bernoulli on each signal’s hit-rate) and updates each day. The weights are smoothed (an exponential moving average with a 14-day half-life) to prevent rapid weight changes on noise.
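One way the Beta-Bernoulli update and the half-life smoothing could fit together. Normalizing posterior-mean hit rates into weights is a simplification I’m assuming here (full model averaging would weight by marginal likelihood); the class name and the uniform Beta(1, 1) prior are also assumptions.

```python
class SignalWeightTracker:
    """Beta-Bernoulli hit-rate posteriors, turned into EMA-smoothed weights."""

    def __init__(self, signals, half_life_days=14.0, prior=(1.0, 1.0)):
        self.alpha = {s: prior[0] for s in signals}  # pseudo-count of hits
        self.beta = {s: prior[1] for s in signals}   # pseudo-count of misses
        # Per-day decay so that old weight loses half its influence after half_life_days.
        self.decay = 0.5 ** (1.0 / half_life_days)
        self.weights = {s: 1.0 / len(signals) for s in signals}

    def update(self, hits):
        """hits maps signal name to True/False for one day's directional call."""
        for s, hit in hits.items():
            if hit:
                self.alpha[s] += 1.0
            else:
                self.beta[s] += 1.0
        means = {s: self.alpha[s] / (self.alpha[s] + self.beta[s]) for s in self.alpha}
        total = sum(means.values())
        for s in self.weights:
            target = means[s] / total
            self.weights[s] = self.decay * self.weights[s] + (1.0 - self.decay) * target
        return dict(self.weights)
```

Both the old weights and the targets sum to one, so the smoothed weights stay normalized without a separate renormalization step.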

What worked. The weights drift cleanly with the data. The cross_market_signal weight rose from 0.10 to 0.18 over the first week of live data as it accumulated a string of correct calls; the volatility_ensemble weight stayed close to its prior because it doesn’t have enough samples yet to update.

What didn’t. Bayesian model averaging assumes the candidate models are mutually exclusive — exactly one of them is “the true model.” That’s a tortured assumption when applied to a portfolio of signals that genuinely complement each other. A proper Bayesian model combination (rather than averaging) would acknowledge this. I’m using the simpler formulation as a starting point with the intent to upgrade.

Loop 5: TFT online learning

The Temporal Fusion Transformer is the one neural model in the signal-generation layer. It’s used for the daily direction forecast on NIFTY and BANKNIFTY. Like the deep-hedger, it fine-tunes online — but the online-learning regime is more conservative because the TFT’s parameter count is much higher and EWC scales poorly to that size.

Instead of EWC, the TFT uses a “distilled forecaster” pattern: a much smaller MLP is distilled from the TFT’s predictions, and the smaller model is what gets fine-tuned daily. The full TFT is retrained only weekly, on the full training set, and the distilled student is then re-distilled from it. The result is a model that adapts to recent conditions through the student while preserving the breadth of the teacher across the weekly retrain boundary.

What worked. The directional accuracy improved from 53% (baseline) to 57% on out-of-sample data — a meaningful improvement, but well below the 70%+ that some trading-ML papers claim. The improvement is concentrated in trending regimes (60% accuracy) and degrades in range-bound markets (52% accuracy).

What didn’t. The TFT’s attention weights are not as interpretable as the model’s marketing implies. I spent several hours trying to extract meaningful “which feature mattered for this prediction” stories from the attention heads and got mostly noise. The model works; the interpretability story doesn’t.

Loop 6: Signal weight adaptation under degradation

This is the meta-loop: when the anomaly detector (Pipeline 7) flags that a specific signal has degraded, the system downweights that signal’s contribution to the alpha-score immediately, without waiting for the slower Bayesian update in Loop 4 to catch up.

The implementation is a circuit-breaker pattern: each signal has a “circuit state” (closed / half-open / open). A flagged degradation moves the state to open, which sets the weight to zero. After 24 hours of monitoring (the signal continues to produce its score, which gets compared to ground-truth without affecting decisions), the breaker moves to half-open and tests with a small weight. If the signal performs acceptably at half-open, the breaker closes again and the full weight is restored; if not, it returns to open and re-tests in another 24 hours.
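The state machine is small enough to sketch in full. Time is passed in as plain hours to keep the example self-contained, and the class and method names are hypothetical; the 0.1× half-open weight is the value the post reports using.

```python
from enum import Enum

class CircuitState(Enum):
    CLOSED = 'closed'
    HALF_OPEN = 'half_open'
    OPEN = 'open'

class SignalCircuitBreaker:
    def __init__(self, normal_weight, cooldown_hours=24.0, test_fraction=0.1):
        self.normal_weight = normal_weight
        self.cooldown_hours = cooldown_hours
        self.test_fraction = test_fraction
        self.state = CircuitState.CLOSED
        self.opened_at = None

    def trip(self, now_hours):
        """Anomaly detector flagged the signal: zero its weight."""
        self.state = CircuitState.OPEN
        self.opened_at = now_hours

    def tick(self, now_hours):
        """Move from open to half-open once the monitoring window has elapsed."""
        if self.state is CircuitState.OPEN and now_hours - self.opened_at >= self.cooldown_hours:
            self.state = CircuitState.HALF_OPEN

    def report_test(self, passed, now_hours):
        """Outcome of the half-open trial: close on success, re-open on failure."""
        if self.state is not CircuitState.HALF_OPEN:
            return
        if passed:
            self.state = CircuitState.CLOSED
        else:
            self.trip(now_hours)

    @property
    def weight(self):
        if self.state is CircuitState.OPEN:
            return 0.0
        if self.state is CircuitState.HALF_OPEN:
            return self.normal_weight * self.test_fraction
        return self.normal_weight
```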

What worked. The circuit breaker has fired once during live operation, on a transient degradation of the sentiment_momentum signal during a news event. The signal recovered within the 24-hour monitoring window and was reinstated cleanly.

What didn’t. The half-open testing weight is fixed at 0.1× the normal weight. That number is a guess; a more principled approach would test at a weight proportional to the recovered confidence.
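
The breaker described above can be sketched as a small state machine. This is a minimal illustration, not the production code: the class name `SignalBreaker`, its method names, and the `recovered` flag are assumptions made for the example. Only the three states, the 24-hour monitoring window, and the 0.1× half-open weight come from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class BreakerState(Enum):
    CLOSED = "closed"        # signal contributes at its normal weight
    OPEN = "open"            # signal is still scored, but its weight is zero
    HALF_OPEN = "half_open"  # signal is tested at a reduced weight


@dataclass
class SignalBreaker:
    """Hypothetical per-signal breaker; all names are illustrative."""
    monitor_window: timedelta = timedelta(hours=24)
    half_open_factor: float = 0.1
    state: BreakerState = BreakerState.CLOSED
    opened_at: Optional[datetime] = None

    def flag_degradation(self, now: datetime) -> None:
        # A degradation flag opens the breaker immediately.
        self.state = BreakerState.OPEN
        self.opened_at = now

    def tick(self, now: datetime) -> None:
        # Once the monitoring window has elapsed, move from open to half-open.
        if (self.state is BreakerState.OPEN and self.opened_at is not None
                and now - self.opened_at >= self.monitor_window):
            self.state = BreakerState.HALF_OPEN

    def report_half_open_result(self, recovered: bool, now: datetime) -> None:
        # A good half-open test closes the breaker; a bad one reopens it
        # and restarts the monitoring window.
        if self.state is not BreakerState.HALF_OPEN:
            return
        if recovered:
            self.state = BreakerState.CLOSED
            self.opened_at = None
        else:
            self.flag_degradation(now)

    def effective_weight(self, normal_weight: float) -> float:
        if self.state is BreakerState.OPEN:
            return 0.0
        if self.state is BreakerState.HALF_OPEN:
            return normal_weight * self.half_open_factor
        return normal_weight
```

Making `half_open_factor` a field rather than a constant leaves room for the confidence-proportional test weight suggested above without changing the state transitions.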

Integrating the five later modules into existing loops

The five pipelines I added later (DirectionPredictionEnsemble, OptimalEntryTiming, PinRiskPredictor, WingSelectionOptimizer, EventImpactPredictor) each need a retraining cycle. Rather than build five new closed loops, I plugged them into the existing infrastructure:

DirectionPredictionEnsemble retrains weekly against the existing paired-evaluation framework. The previous week’s signal-vs-outcome record is the new training increment; the prior week’s model is the evaluation peer; the promotion gate is “directional accuracy + calibration on a held-out slice.”
OptimalEntryTimingModel retrains every ~100 entries, matching the TFT’s online-learning cadence. The counterfactual labels are recomputed lazily from the order-flow history, with the assumption-regime weighting described in §10.
PinRiskPredictor retrains monthly — there are only ~4 expiries per month, so the per-month data volume is what bounds the cadence.
WingSelectionOptimizer retrains quarterly. Wing-selection patterns are slow-evolving; weekly retrains are noise.
EventImpactPredictor retrains opportunistically after each major event type fires. Each RBI policy meeting, each Fed announcement, each Budget — the new event becomes one row of training data, and the model rebuilds against the cumulative history.
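
The cadences in the list above could be captured in a small policy table like the one below. This is a sketch under stated assumptions: `RetrainPolicy`, `POLICIES`, and `retrain_due` are hypothetical names, and "monthly" and "quarterly" are approximated as 30 and 90 days purely for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Optional


@dataclass(frozen=True)
class RetrainPolicy:
    """Hypothetical cadence record; exactly one trigger is set per model."""
    every_days: Optional[int] = None     # calendar cadence
    every_entries: Optional[int] = None  # count-based cadence
    on_event: bool = False               # retrain after a qualifying event


POLICIES: Dict[str, RetrainPolicy] = {
    "DirectionPredictionEnsemble": RetrainPolicy(every_days=7),
    "OptimalEntryTimingModel": RetrainPolicy(every_entries=100),
    "PinRiskPredictor": RetrainPolicy(every_days=30),        # ~monthly (assumed)
    "WingSelectionOptimizer": RetrainPolicy(every_days=90),  # ~quarterly (assumed)
    "EventImpactPredictor": RetrainPolicy(on_event=True),
}


def retrain_due(policy: RetrainPolicy, *, last_trained: date, today: date,
                new_entries: int = 0, event_fired: bool = False) -> bool:
    # Event-driven models retrain only when an event has fired.
    if policy.on_event and event_fired:
        return True
    # Count-based models retrain once enough new entries have accumulated.
    if policy.every_entries is not None and new_entries >= policy.every_entries:
        return True
    # Calendar-based models retrain once the interval has elapsed.
    if policy.every_days is not None:
        return (today - last_trained).days >= policy.every_days
    return False
```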

The lesson from this integration was about infrastructure investment. The closed-loop infrastructure I’d built early — the EWC harness, the circuit-breaker pattern, the Bayesian model-averaging combiner, the paired-evaluation gate — made adding new models close to trivial. Each new model needed about 50 lines of integration code to plug into the existing retraining, validation, and rollback workflows. The difference between research-grade ML and production ML, expressed concretely, is whether the supporting infrastructure lets you iterate at the speed the problem demands.

7. The MLOps infrastructure

This is the part most teams underinvest in, and it’s where “a good model” and “a production ML system” diverge. If your model performance is good but your deployment story is “the data scientist emails a .pkl file to the engineer,” you do not have a production ML system. You have a research project that ships occasionally.

After building the first handful of models for the alpha engine, I faced the production-ML question every team eventually faces: how do you manage 17 models with different training cadences, different retraining requirements, and overlapping production lifecycles, without the operational story becoming a full-time job? My answer was MLflow as the foundation with a custom production layer on top. Neither pure-custom nor pure-MLflow was the right answer; the combination was.

Why not pure custom

I started by building a custom registry. Three weeks in, I was reinventing MLflow. The custom system was tracking experiments, versioning models, storing artifacts, and providing comparison views — all things MLflow does well out of the box. The custom-built version had less mature tooling, less community familiarity, and required ongoing maintenance for capabilities that already existed in standard tools. The “we’ll build it ourselves” path is appealing when you’re moving fast; six weeks later it’s a maintenance line item with no community to absorb the work.

Why not pure MLflow

MLflow alone, though, didn’t cover several production needs that this domain genuinely requires:

Concurrent training safety. If two training jobs for the same model type fire simultaneously — which is easy to hit when an operator manually triggers a retrain while the scheduled cron is already running — they can corrupt each other’s state. MLflow doesn’t natively prevent this. I added a TrainingLockManager over PostgreSQL advisory locks: before any training starts, acquire a lock keyed on (type_id, tenant_id); the lock auto-releases on process death.
Atomic promotion transactions. Promoting a model from staging to production involves multiple state changes — the production pointer, the version status, the archive of the previous production model, the audit log entry. MLflow’s stage-transition API isn’t transactional in the way Postgres SERIALIZABLE is. I wrapped the promotion path in a serializable transaction so a partial promotion never leaves the system in a mixed state.
Per-tenant model isolation. Multi-tenant context means models can be tenant-scoped (a per-tenant strike-selection model trained on that tenant’s specific behavior). MLflow doesn’t natively understand tenant scoping — every model is global. I added a tenant-aware lookup layer that filters by tenant id at every read.
Validation contracts. Each model type has specific validation requirements: a minimum accuracy floor, required metrics, custom validators (the RL strategy selector must not exhibit mode collapse; the deep-hedger must not drift its Greeks past a threshold). I added a ValidationContractValidator that gates promotion based on these per-type rules. MLflow’s promotion step is generic — it doesn’t know whether your specific model type is allowed to be promoted with these specific metrics.
Failed-training-attempt logging. When training fails — NaN loss, OOM, data corruption, a feature-pipeline bug — I want the full context: partial metrics, stack trace, the input artifact id. MLflow’s experiment tracking doesn’t naturally store failed attempts; it’s optimized for successful runs. I added a dedicated failed_training_attempts table that captures everything about the failure so the next attempt can learn from it.
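
PostgreSQL advisory locks are keyed on integers rather than strings, so the (type_id, tenant_id) pair has to be mapped to a lock key somehow. One way to do that is sketched below. The function name `advisory_lock_key` and the hashing scheme are assumptions for illustration, not the TrainingLockManager's actual internals; `pg_try_advisory_lock` and `pg_advisory_unlock` are the standard PostgreSQL functions.

```python
import hashlib
import struct
from typing import Optional


def advisory_lock_key(type_id: str, tenant_id: Optional[str]) -> int:
    """Map (type_id, tenant_id) to a signed 64-bit integer lock key.

    The pair is hashed and the first 8 bytes of the digest are read as a
    signed big-endian integer, which fits PostgreSQL's bigint key.
    """
    scope = tenant_id if tenant_id is not None else "global"
    digest = hashlib.sha256(f"{type_id}|{scope}".encode("utf-8")).digest()
    (key,) = struct.unpack(">q", digest[:8])
    return key


# Session-level advisory locks are released when the session ends, which is
# what lets a crashed training process give the lock back automatically.
TRY_LOCK_SQL = "SELECT pg_try_advisory_lock(%s)"
UNLOCK_SQL = "SELECT pg_advisory_unlock(%s)"
```

With a database driver, a training job would run `TRY_LOCK_SQL` with the key on a dedicated connection and abort if it returns false. Two distinct pairs can in principle hash to the same key; with a 64-bit key space that is very unlikely at this model count, but it is a property of the sketch worth knowing about.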

The resulting pattern

The shape of the integration:

import mlflow


class QuantsentinelMLflowRegistry:
    """
    MLflow as foundation, custom layer for production-specific needs:
    concurrent-training safety, atomic promotions, per-tenant scope,
    validation contracts, and failed-attempt logging.
    """

    def __init__(self, tracking_uri: str) -> None:
        mlflow.set_tracking_uri(tracking_uri)
        self.lock_manager = TrainingLockManager()
        self.validator = ValidationContractValidator()
        self.production_tracker = CustomProductionTracker()

    def train_and_save(self, model, metadata, validation_metrics):
        # Custom: acquire training lock so concurrent retrains don't collide.
        with self.lock_manager.acquire(metadata.type_id, metadata.tenant_id):
            with mlflow.start_run(run_name=f"{metadata.type_id}_v{metadata.semantic_version}"):
                mlflow.log_params(metadata.hyperparameters)
                mlflow.log_metrics(validation_metrics)
                self._log_model(model, metadata.framework)

                # Custom: stamp tenant + git commit for full lineage.
                mlflow.set_tags({
                    'tenant_id':         str(metadata.tenant_id) if metadata.tenant_id else 'global',
                    'code_git_commit':   metadata.code_git_commit,
                    'production_status': 'staged',
                })

                run_id = mlflow.active_run().info.run_id

                # Custom: register the version in the production-tracking layer.
                self.production_tracker.register_version(run_id, metadata)
                return run_id

    def promote_to_production(self, run_id: str) -> None:
        # Custom: validate against the per-type contract before promotion.
        if not self.validator.validate_promotion(run_id):
            raise PromotionRejected(run_id)

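        # Custom: atomic SERIALIZABLE transaction across the registry's own state.
        with self._serializable_transaction():
            self.production_tracker.set_as_production(run_id)
            self.production_tracker.log_promotion_event(run_id)

        # The MLflow tag is written through MLflow's own client, which the
        # database transaction cannot roll back, so mirror it only after commit.
        mlflow.tracking.MlflowClient().set_tag(run_id, 'production_status', 'production')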

The custom production layer added about 600 lines of code beyond what MLflow provides — meaningful, but a small fraction of what a pure-custom registry would have required. The leverage of building on MLflow’s foundation was substantial: the experiment tracking, the artifact storage, the comparison views, the model-flavor abstraction, the UI for browsing runs — all of that comes for free.

The custom-side schema lives in a model_registry namespace on Cloud SQL and supplements MLflow’s tables with the production-specific things MLflow doesn’t track:

-- The most important one
CREATE TABLE model_registry.model_versions (
    version_id            UUID PRIMARY KEY,
    type_id               VARCHAR(100) NOT NULL,
    tenant_id             UUID,                       -- NULL = global model

    -- Versioning
    semantic_version      VARCHAR(20) NOT NULL,       -- e.g. "1.4.0"
    parent_version_id     UUID REFERENCES model_registry.model_versions(version_id),

    -- Code + data lineage — critical for reproducibility
    code_git_commit       VARCHAR(40) NOT NULL,
    code_git_branch       VARCHAR(100) NOT NULL,
    training_data_version VARCHAR(100),
    training_data_end     DATE,

    -- Hyperparameters + metrics (JSONB for evolution without migrations)
    hyperparameters       JSONB NOT NULL,
    feature_set           JSONB,
    train_metrics         JSONB,
    validation_metrics    JSONB,

    -- Operational state
    status                VARCHAR(20) NOT NULL,       -- 'staging' | 'production' | 'archived'
    promoted_at           TIMESTAMPTZ,
    archived_at           TIMESTAMPTZ,

    -- Integrity verification
    artifact_path         VARCHAR(255) NOT NULL,      -- filesystem or s3://
    artifact_sha256       VARCHAR(64)  NOT NULL,
    artifact_size_bytes   BIGINT       NOT NULL,

    created_at            TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
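    -- Note: NULL tenant_ids compare as distinct by default, so this does not
    -- dedupe global models unless NULLS NOT DISTINCT (PostgreSQL 15+) is used.
    UNIQUE (type_id, tenant_id, semantic_version)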
);

Plus seven supporting tables: model_types (what kinds of models exist + their framework adapter), model_promotions (the audit log of every promotion event), model_evaluations (paired evaluations of new-vs-current production models), model_locks (the PostgreSQL-advisory-lock state used by TrainingLockManager), model_lineage (parent-child relationships across retrains), training_data_versions (the snapshot identifiers MLflow doesn’t manage natively), and failed_training_attempts (the failure-context table described above).

What this pattern is and isn’t right for

The pattern — “industry tool as foundation, custom layer where the domain genuinely needs it” — turned out to be the right balance for this scope. Some honest reconsiderations of when it would not be the right answer:

At 1-3 models, no per-tenant scope: pure MLflow is enough. The custom layer would be premature.
At 100+ models with heavy orchestration needs: Kubeflow or Vertex AI or SageMaker — the orchestration story matters more than the registry story at that scale.
For HFT or sub-100ms serving: MLflow’s model-serving overhead is too high. Something purpose-built (or just a frozen artifact mounted into the trading-process) is required.
For a regulated context with audit-of-everything requirements: I would invest more in the validation-contract and audit-log infrastructure, and might add a model-card requirement at promotion time.

For this system — 17 models, weekly retrain cadence on the heaviest ones, multi-tenant scope, one operator — MLflow plus the custom production layer hit the right balance.

Adapters as a swap-in for frameworks

The adapter parameter on save_model is the contract that lets the registry stay framework-agnostic. Each adapter implements three operations: serialize-to-bytes, deserialize-from-bytes, and validate-fingerprint. The PyTorch adapter uses torch.save over a state_dict; the sklearn adapter uses joblib over a pipeline; the Stable Baselines3 adapter uses the framework’s own model.save() because the internal structure is opaque; the PyMC adapter pickles the InferenceData and the model spec; the JSON adapter handles config-style models (hyperparameter dictionaries, decision-tree thresholds extracted to JSON).

import hashlib
import io

import joblib


class SklearnAdapter(ModelAdapter):
    framework = "sklearn"

    def serialize(self, model) -> bytes:
        buf = io.BytesIO()
        joblib.dump(model, buf, compress=3)
        return buf.getvalue()

    def deserialize(self, blob: bytes):
        return joblib.load(io.BytesIO(blob))

    def fingerprint(self, model) -> str:
        # A stable hash of the model's structure + parameters.
        # Used to confirm that what we loaded is what we saved.
        return hashlib.sha256(self.serialize(model)).hexdigest()

The fingerprint check is what catches “the bytes on disk got corrupted in transit” and “the framework version on the loader is incompatible with what the saver used.” When the fingerprint mismatches, the registry refuses to promote and pages the operator.
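
On the load side, the corruption half of that check can be done against the `artifact_sha256` and `artifact_size_bytes` columns from the schema above before any deserialization happens. The sketch below assumes that flow; `verify_artifact` and `ArtifactIntegrityError` are hypothetical names, not part of the registry as described.

```python
import hashlib


class ArtifactIntegrityError(Exception):
    """Raised when stored bytes do not match the recorded size or digest."""


def verify_artifact(blob: bytes, expected_sha256: str, expected_size: int) -> None:
    # Cheap check first: a truncated or padded artifact fails on size alone.
    if len(blob) != expected_size:
        raise ArtifactIntegrityError(
            f"size mismatch: got {len(blob)} bytes, expected {expected_size}")
    # Then compare the digest of the bytes actually read with the recorded one.
    actual = hashlib.sha256(blob).hexdigest()
    if actual != expected_sha256.lower():
        raise ArtifactIntegrityError(
            f"sha256 mismatch: got {actual}, expected {expected_sha256}")
```

A byte-level digest only proves the bytes are the ones that were saved. A framework-version incompatibility would surface later, either as a deserialization error or as a differing fingerprint when the loaded model is re-serialized.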

Promotion workflow with safety gates

Promotion from staging to production is not a single update. It’s a workflow:

Paired evaluation. The new model and the current production model both score the same evaluation slice. The evaluation is held out from both training sets.
Sanity gates. The new model must score above an absolute floor (per-type-id), beat the current production model on the primary metric, and not regress on the secondary metrics by more than a configured tolerance.
Shadow rollout. Once the gates pass, the new model is moved to shadow — it produces predictions on live data but those predictions don’t enter the trade decision. The shadow predictions are logged for a configured window (typically 48 hours).
Final promotion. If the shadow window completes without anomalies (no large divergence from the production model’s predictions on the same inputs), the new model is promoted to production and the previous production model is archived.
Rollback path. Every promotion record contains the previous production version’s id. Rollback is a single API call that restores the previous version and re-archives the new one.

With the typical shadow window, the whole workflow takes at least 48 hours for any promotion, by design. The friction is the point: production-ML failures often come from “we deployed it on a Friday because we were excited,” and the workflow makes that structurally hard to do.
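
The sanity gates in step two reduce to three comparisons. A minimal sketch follows, assuming every metric is higher-is-better and that the tolerance is an absolute difference. The function name, the argument names, and the metric values in the usage example are all hypothetical.

```python
from typing import Mapping, Tuple


def passes_sanity_gates(
    candidate: Mapping[str, float],
    production: Mapping[str, float],
    *,
    primary: str,
    floor: float,
    secondary_tolerance: float,
) -> Tuple[bool, str]:
    """Return (passed, reason). Assumes higher is better for every metric."""
    # Gate 1: absolute floor on the primary metric.
    if candidate[primary] <= floor:
        return False, f"{primary}={candidate[primary]:.3f} not above floor {floor:.3f}"
    # Gate 2: must beat the current production model on the primary metric.
    if candidate[primary] <= production[primary]:
        return False, f"{primary} does not beat production"
    # Gate 3: no secondary metric may regress by more than the tolerance.
    for name, prod_value in production.items():
        if name == primary:
            continue
        if candidate.get(name, float("-inf")) < prod_value - secondary_tolerance:
            return False, f"{name} regressed beyond tolerance"
    return True, "ok"


# Hypothetical metric values, for illustration only.
prod = {"directional_accuracy": 0.55, "calibration": 0.80}
cand = {"directional_accuracy": 0.57, "calibration": 0.79}
print(passes_sanity_gates(cand, prod, primary="directional_accuracy",
                          floor=0.52, secondary_tolerance=0.02))
```

A secondary metric missing from the candidate is treated as a regression, so a model cannot pass by simply not reporting a number.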

What this section demonstrates

If you read this section and your reaction was “of course, every production ML system has this” — you’re working in an unusually mature shop and you should keep doing what you’re doing. In my experience, perhaps 20% of teams claiming “production ML” actually have anything like this discipline. The rest are running on hope and a .pkl file.

8. Risk management as a first-class concern

Risk in this system is not “we noticed risk was important and added some checks.” It is the dominant architectural concern. Every other decision — what the alpha engine looks like, how the model registry works, how the multi-tenant layer is structured — sits under the constraint that the risk surface must be enforceable.

The seven concentric walls. A trade leaves the castle only when every ring approves; any one ring blocking returns a structured rejection with the reason logged to the audit trail.

The 7-wall risk castle

A single decision passes seven walls in sequence before it becomes a trade:

Wall 1: Regime gate. The current market regime — derived from the multi-timeframe vol forecast and the realized-vol/implied-vol gap — must match the strategy’s “appropriate regime” set. Iron condors don’t get to pass through this gate when the regime is “HIGH_VOL”. Straddles don’t pass when the regime is “LOW_VOL_RANGE”. The gate’s decide method returns BLOCK with a reason string the audit log retains.

Wall 2: News gate. A live news event with an impact-score above a threshold blocks new positions. The threshold is regime-dependent and event-type-dependent — an RBI policy print blocks everything for 30 minutes; an earnings-related news item for a single stock blocks only positions in that stock and its sector peers.

Wall 3: Risk policy gate. The static per-tenant limits — max position size as a % of capital, max number of concurrent positions, max sector concentration, max single-leg exposure. These are configured per tenant and enforced unconditionally.

Wall 4: Margin gate. The intended trade’s margin requirement (computed from the live option chain + the broker’s stated SPAN ELM table) must fit in available margin with a configured buffer (default 30%). Margin gate failures are the most common cause of “blocked” decisions, and they are the right kind of failure: they stop the system from opening a position it would lack the margin to manage once it is on.

Wall 5: Portfolio correlation gate. The intended trade’s correlation with the existing portfolio must be below a threshold. The implementation is a covariance estimate over the last 60 days of returns; if the new position is highly correlated with the existing book, it’s blocked even if every other gate passes. This is the gate that prevents the system from going “all in on financials” by stacking correlated trades.
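
As a rough illustration of the Wall 5 check, the sketch below computes a Pearson correlation between the candidate position's daily returns and the existing book's daily returns over the lookback window. The function names are hypothetical, the blocking threshold is left as a required parameter because the text does not state it, and a single-number correlation is a simplification of the covariance estimate the gate actually uses.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0.0 or vy == 0.0:
        return 0.0  # a flat series carries no co-movement information
    return cov / math.sqrt(vx * vy)


def correlation_gate_blocks(candidate_returns: Sequence[float],
                            book_returns: Sequence[float],
                            *, threshold: float, lookback: int = 60) -> bool:
    """True if the candidate is too correlated with the existing book."""
    # Only the most recent `lookback` observations are used.
    c = list(candidate_returns)[-lookback:]
    b = list(book_returns)[-lookback:]
    # Signed comparison: a strongly negative correlation is a hedge, not a stack.
    return pearson(c, b) > threshold
```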

Wall 6: Cost-engine gate. The intended trade’s expected value, net of every cost line item — STT, brokerage, exchange charges, SEBI fees, GST, expected slippage — must be positive. This single gate is, in my honest assessment, the highest-EV piece of risk infrastructure in the system. A meaningful fraction of retail F&O losses come from trades whose gross EV was positive and whose net EV was negative once the costs were properly accounted for. The cost engine refuses those trades; the trader (the system, here) never has to make the discretionary call.

The implementation lives in the OpportunityScorer class and runs three composable checks: a net-EV check (the hard rule), a cost-ratio check (frictions must not exceed 40% of gross edge, even if net is positive), and a win-probability floor (below 45% the trade is too thin to take regardless of payoff asymmetry). This is the conviction-scoring layer the spec asks for — the gate that bridges Layer-3 opportunity scoring to Wall-6 risk approval.

import math


class OpportunityScorer:
    """The cost-engine / conviction gate. Three composable checks;
    first failure short-circuits the rest and produces a structured
    rejection reason that the audit log retains verbatim."""

    def __init__(self, *,
                 max_cost_ratio: float = 0.40,       # frictions ≤ 40% of edge
                 min_win_probability: float = 0.45,  # below this, too thin
                 ) -> None:
        self._max_cost_ratio = max_cost_ratio
        self._min_win_prob = min_win_probability

    def score(self, opp, *, edge_per_lot, cost_per_lot,
              slippage_per_lot, win_probability) -> ScoredOpportunity:
        gross = float(edge_per_lot)
        # Coerce once so Decimal or int inputs cannot mix with floats below.
        frictions = float(cost_per_lot) + float(slippage_per_lot)
        net = gross - frictions
        cost_ratio = (frictions / max(1e-9, abs(gross))) if gross else math.inf
        wp = max(0.0, min(1.0, float(win_probability)))

        # Three gates, evaluated in order of how often they fire.
        if net <= 0:
            return ScoredOpportunity.rejected(opp,
                f"net_ev_per_lot={net:.2f} ≤ 0 (hard rule §6.2)")
        if cost_ratio > self._max_cost_ratio:
            return ScoredOpportunity.rejected(opp,
                f"cost_ratio={cost_ratio:.2f} > {self._max_cost_ratio:.2f}",
                notes=["bad liquidity — frictions eat the edge"])
        if wp < self._min_win_prob:
            return ScoredOpportunity.rejected(opp,
                f"win_probability={wp:.2f} below floor {self._min_win_prob:.2f}")

        return ScoredOpportunity.approved(opp, net_ev=net,
                                          cost_ratio=cost_ratio,
                                          win_probability=wp)

When a tenant later asks “why didn’t the system take this trade,” the answer is cost_ratio=0.52 > 0.40 — not a black box, not a vibe.
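
A worked example shows why the cost-ratio check exists alongside the net-EV rule. The per-lot rupee figures below are hypothetical, chosen only so that the first trade reproduces the cost_ratio=0.52 rejection quoted above, and `cost_ratio_check` is an illustrative stand-in that mirrors only the first two gates of the scorer.

```python
def cost_ratio_check(gross_edge: float, cost: float, slippage: float,
                     max_cost_ratio: float = 0.40) -> str:
    """Return 'approved' or a rejection reason (net-EV rule, then cost ratio)."""
    frictions = cost + slippage
    net = gross_edge - frictions
    if net <= 0:
        return f"net_ev_per_lot={net:.2f} <= 0"
    ratio = frictions / max(1e-9, abs(gross_edge))
    if ratio > max_cost_ratio:
        return f"cost_ratio={ratio:.2f} > {max_cost_ratio:.2f}"
    return "approved"


# Net EV is +240 per lot, yet frictions are 260 / 500 = 52% of the edge.
print(cost_ratio_check(gross_edge=500.0, cost=180.0, slippage=80.0))
# → cost_ratio=0.52 > 0.40

# Same edge with frictions of 160 / 500 = 32% passes both checks.
print(cost_ratio_check(gross_edge=500.0, cost=120.0, slippage=40.0))
# → approved
```

The first trade would pass a net-EV-only rule; it is rejected because a modest error in the slippage estimate would be enough to turn it negative.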

Wall 7: Tail-hedge requirement. Certain structures (naked premium-selling, long-vol exposures with limited downside) require an attached tail hedge before they pass. The hedge selection is its own subsystem (tail_hedge.py) that picks the cheapest hedge that meets the required protection level. The composite trade — original structure + tail hedge — is what gets evaluated; the structure alone is never approved.

The walls are evaluated in sequence, not in parallel, by design. Each wall is allowed to short-circuit all the walls after it; the audit log records the first wall that blocked, which is the canonical “why was this trade rejected” reason. The sequential model also gives me a clean place to add new walls: Wall 8 is currently a placeholder for an event-conditional hedge requirement that I haven’t built.

class RiskCastle:
    """
    Sequential 7-wall risk evaluator.
    Order matters: the gates are arranged roughly cheap-to-expensive, so an
    early block skips the costlier checks. This minimizes the average decision cost.
    """
    def __init__(self, gates: list[RiskGate]):
        self.gates = gates  # Order: regime, news, policy, margin, corr, cost, tail

    def evaluate(self, opportunity: Opportunity, context: Context) -> RiskVerdict:
        for gate in self.gates:
            verdict = gate.evaluate(opportunity, context)
            if verdict.blocked:
                return RiskVerdict.blocked(
                    by=gate.name,
                    reason=verdict.reason,
                    timestamp=now(),
                )
        return RiskVerdict.approved()

Kill switches: 12 of them

Beyond the gates, the system has twelve kill switches — circuit breakers that, when tripped, halt new decisions at progressively higher scopes:

Per-strategy kill. A specific strategy whose live drawdown exceeds its configured limit is halted.
Per-tenant kill. A tenant whose daily loss exceeds its risk limit has all new decisions halted; existing positions can still be managed.
Per-symbol kill. A symbol with anomalous behavior (a circuit-breaker event, an extreme IV move) is halted globally.
Per-broker kill. A broker (Upstox, Groww) experiencing an outage is removed from the order-routing pool.
Per-region kill. An exchange-level event (NSE outage, market-wide circuit breaker) halts everything routing to that exchange.
Capital-floor kill. A tenant whose available capital drops below an absolute floor (₹50K by default) is locked out of new positions to prevent capital exhaustion.
Stale-data kill. If the live data feeds are more than 60 seconds stale, no new decisions can be made.
Model-staleness kill. If a model’s last retrain is more than 2× its expected retrain cadence, the system flags it and refuses to use it for new decisions.
Margin-utilization kill. If portfolio margin utilization exceeds 80%, no new positions are taken (existing ones can still be managed).
Correlation-cluster kill. If the portfolio’s effective correlation rises above 0.7, all new positions are blocked until the existing positions are reduced.
Dry-run-lock kill. New tenants are locked into paper-trading for 14 days. The lock is enforced at the database row level; there’s no flag to flip in code.
Master kill. A single operator-controlled switch that halts everything everywhere. Exists for emergencies.

The twelve are not redundant. Each addresses a specific failure mode that isn’t covered by the others. The master kill is the one I’ve used exactly once, during a deployment-related hiccup; the other eleven trip automatically when their trigger conditions are met.
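
Four of the threshold-based switches are simple enough to sketch as one pure function. The thresholds (60 seconds, 2× cadence, 80% utilization, ₹50K floor) are the ones stated in the list; `SystemSnapshot`, its field names, and `tripped_switches` are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SystemSnapshot:
    """Hypothetical inputs; field names are illustrative."""
    feed_age_seconds: float
    model_age_hours: float
    model_cadence_hours: float
    margin_utilization: float      # fraction between 0.0 and 1.0
    available_capital_inr: float


def tripped_switches(s: SystemSnapshot, *,
                     max_feed_age: float = 60.0,
                     staleness_multiple: float = 2.0,
                     max_margin_util: float = 0.80,
                     capital_floor_inr: float = 50_000.0) -> List[str]:
    """Return the names of every switch whose trigger condition is met."""
    tripped: List[str] = []
    if s.feed_age_seconds > max_feed_age:
        tripped.append("stale_data")
    if s.model_age_hours > staleness_multiple * s.model_cadence_hours:
        tripped.append("model_staleness")
    if s.margin_utilization > max_margin_util:
        tripped.append("margin_utilization")
    if s.available_capital_inr < capital_floor_inr:
        tripped.append("capital_floor")
    return tripped
```

Returning every tripped switch, rather than stopping at the first, keeps the audit record complete when several conditions fail at once.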

Multi-tenant isolation: enforced at multiple layers

Tenants on top, shared infrastructure below, three isolation barriers between them. The dashed lines are not policies — they are tested architectural invariants.

The isolation between tenants is not a property of the application code alone — it’s enforced at three layers, and any one of the three breaking is supposed to be caught by the other two.

Application layer. Every database query, every Redis key, every websocket room carries a tenant_id. The application code uses a tenant-scoped session object that refuses to operate without an explicit tenant. The audit trail records the tenant on every action.
Database layer. Row-level security policies on the tenant-scoped tables. The qs_app Postgres role can only see rows for the tenant whose id has been set via SELECT set_tenant_id(...) at the start of the transaction. A bug that forgets to set the tenant id produces an empty result set, not a cross-tenant leak.
Architectural layer. The live-signal pub/sub layer has a BEFORE INSERT trigger that hard-errors if a signal_id is delivered to multiple tenants. The Telegram channel verifies the chat is a private DM before sending. The FCM channel pins the device token at registration time to the tenant. These are architectural impossibilities, not policies.

The honest version of “we have multi-tenant isolation” is “we have three layers that all have to break simultaneously for there to be a leak, and one of them is a database trigger that hard-errors at INSERT time, so there’s no way for the application to fail-quietly into a leak.” That’s the bar I held the architecture to.
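
The application-layer rule ("refuses to operate without an explicit tenant") can be illustrated with a small scoped handle. This is a sketch, not the production session object: `TenantScope`, `MissingTenantError`, the key layout, and the `%s` parameter style are all assumptions. Only the `set_tenant_id(...)` call name comes from the text.

```python
import uuid
from typing import Optional, Tuple


class MissingTenantError(RuntimeError):
    """Raised when code tries to build a scope without a tenant."""


class TenantScope:
    """Hypothetical tenant-scoped handle; every derived name embeds the tenant."""

    def __init__(self, tenant_id: Optional[uuid.UUID]) -> None:
        if tenant_id is None:
            raise MissingTenantError("refusing to operate without an explicit tenant")
        self._tenant_id = tenant_id

    def redis_key(self, *parts: str) -> str:
        # Keys are namespaced so two tenants can never share a key by accident.
        return ":".join(["tenant", str(self._tenant_id), *parts])

    def websocket_room(self, name: str) -> str:
        return f"tenant:{self._tenant_id}:room:{name}"

    def set_tenant_statement(self) -> Tuple[str, Tuple[str]]:
        # Statement to run at the start of each transaction so the
        # row-level security policies see the right tenant.
        return ("SELECT set_tenant_id(%s)", (str(self._tenant_id),))
```

Because the constructor is the only way to obtain a scope, forgetting the tenant fails loudly at construction time instead of producing an unscoped key or query later.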

What this section demonstrates

In a regulated domain, this is where a system earns trust. The seven walls, the twelve kill switches, the three isolation layers — none of these are individually impressive. The fact that they all exist, that they were designed in from the start rather than retrofitted, and that the audit trail is structured to support a regulator’s first request — that’s the signal.

9. Engineering tradeoffs I made

Most “what I built” posts are silent on the alternatives that were considered and rejected. That silence is the gap between a marketing post and a technical case study. Below are ten major decisions with full tradeoff analysis. The format for each: the decision, why this decision, what I gave up, how I’d reconsider in a different context.

Decision 1: Python over C++ for the core

Why this decision. Latency budget is 1-2 seconds per decision; Python’s per-operation overhead is on the order of microseconds; numerical hotpaths are NumPy/Pandas/scikit-learn which are C-implemented anyway. Development velocity matters; the dev-test-deploy cycle on Python is far tighter than on C++ for the same code change.

What I gave up. A 10-50x performance ceiling that C++ would provide. If this were HFT, sub-millisecond decisions, the choice would be wrong.

How I’d reconsider. For HFT, C++. For sub-millisecond, also C++. For a system where the bottleneck is I/O (which is most of what Quantsentinel does), Python is the right choice and stays the right choice as the system scales — until the operating-cost equation changes.

Decision 2: Gemini 2.5 Pro for factor mining and narration

Why this decision. Gemini’s reasoning on financial concepts is strong; the structured-output mode is reliable for the constrained DSL I needed for factor expressions; the pricing is favorable for the volumes I’m running (factor mining: a few hundred proposals per week; narration: one per signal, with aggressive caching).

What I gave up. Customization that a fine-tuned smaller model could provide. Local deployment as an option (Gemini is API-only). Vendor independence: a workload that’s now on the critical path for one of the closed loops depends on a single vendor.

How I’d reconsider. If the API spend exceeded $5K/month I’d train a smaller specialized model on a curated dataset of “good” factor proposals. If sub-second latency on the narration were required (it isn’t; I cache aggressively and the narration is a nice-to-have on the live signal card), I’d use a faster but less capable model and accept the quality tradeoff.

Decision 3: Multi-tenant from the first commit

Why this decision. Retrofitting tenant isolation onto a single-tenant codebase is a known failure mode. Every query, every Redis key, every websocket room, every log line has to learn it’s tenant-scoped, and the inevitable miss is a cross-tenant leak that your tenants will find before you do.

What I gave up. Dev velocity in the first week. I couldn’t just “select all positions” without first picking a tenant. The error messages when I forgot to scope a query were ugly.

How I’d reconsider. If I were building a single-purpose internal tool for a known small team, I’d skip multi-tenant and accept the latent cost. For anything customer-facing, multi-tenant from day one is non-negotiable in retrospect.

Decision 4: MLflow as foundation with a custom production layer (not pure custom, not pure MLflow)

Why this decision. See §7 in full. Short version: MLflow covers ~80% of MLOps needs with industry-standard tooling. Building everything custom is reinventing the wheel; I tried it first and was three weeks in before I admitted the truth. But MLflow alone doesn’t address several specific production needs — concurrent training safety, atomic promotions, per-tenant scope, validation contracts, failed-training-attempt logging. The combination (~600 lines of custom production layer on top of MLflow) hits the right balance for this scope.

What I gave up. Pure customization control over the foundation layer. Some MLflow conventions I’d structure differently if building from scratch — but that’s a 1% loss against the leverage of standing on MLflow’s mature core.

How I’d reconsider in different contexts:

Smaller scale (1-3 models, no tenant scope): pure MLflow is sufficient — the custom additions are premature.
Larger scale (100+ models, heavy orchestration): Kubeflow / Vertex AI / SageMaker — registry isn’t the bottleneck anymore, orchestration is.
HFT or sub-100ms serving: MLflow’s serving stack is too heavy; need something purpose-built.
This system (17 models, weekly retrain, multi-tenant, one operator): MLflow + custom is the right balance.

The principle that emerged: leverage industry tools where they cover your needs, build custom only where they genuinely don’t. Don’t build custom for prestige — build it where it adds actual value beyond what’s available off the shelf.

Decision 5: Astro for the editorial blog

Why this decision. The blog is fully static, no auth, no per-user state. Astro produces tiny HTML; the build is sub-second; the deployment is decoupled from the trading app’s deploy pipeline.

What I gave up. A single deployment path. There are now two: one for the trading app, one for the blog.

How I’d reconsider. If the blog grew dynamic features (user accounts, comments, gated content), I’d merge it back into the main Next.js app. As a static editorial surface, Astro is correct.

Decision 6: TimescaleDB extension over a separate time-series store

Why this decision. The time-series data — option-chain snapshots, news events, depth ticks — is heavy on writes but light on analytical queries. Timescale’s hypertables give me automatic chunking by time, retention policies as config, and SQL compatibility with the rest of the system.

What I gave up. Best-in-class analytical performance for very large time-series. If I needed to run aggressive aggregations over years of tick data, ClickHouse would be faster.

How I’d reconsider. At ~100GB of time-series data per year and the query patterns I have (mostly “give me the latest snapshot per instrument”), Timescale is more than enough. At 1TB+, the math changes.

Decision 7: Cookie-based session auth, not header tokens

Why this decision. Cookies are HTTP-only and Secure by default; the CSRF token rotation on every /auth/me is straightforward to implement; the frontend doesn’t have to think about token storage. The same-origin model makes cross-site cookie attacks structurally limited.

What I gave up. Easy testing from external tools (curl can do it, but with care). Easier API consumption by external clients (an Authorization header is more portable).

How I’d reconsider. If I needed an external API for partners or institutional clients to consume, I’d add a parallel JWT-based auth path for them while keeping the cookie auth for the browser frontend.

Decision 8: Server-rendered Next.js, not a SPA

Why this decision. Server rendering for trading dashboards is the right shape — first paint is fast, the data fetching happens server-side where it’s closer to the gateway, the SEO story for the marketing surface is straightforward.

What I gave up. Some of the “feels native” responsiveness of a pure SPA. The route transitions are a touch slower than a SPA would be.

How I’d reconsider. For a dashboard where every interaction is a server roundtrip anyway (trading data is real-time, you can’t sensibly cache it client-side), SSR is correct. For an app with heavy client-side interactivity (a drawing tool, a video editor), it would be wrong.

Decision 9: Go for the gateway, Python for everything else

Why this decision. The gateway is near-100% I/O, near-zero numerical work. Go’s net/http with a goroutine-per-request model is the right shape for it. Python everywhere else because the numerical workload is the bulk of the work and Python’s ecosystem is unmatched.

What I gave up. A single-language codebase. There are now two languages to maintain, and the contracts between them have to be carefully designed.

How I’d reconsider. I’d reconsider if the gateway needed substantial numerical work (it doesn’t). I’d also reconsider in the opposite direction — moving the gateway to Rust if the latency budget on the gateway specifically dropped substantially (it hasn’t).

Decision 10: Single VM, not Kubernetes

Why this decision. Eleven services on Docker Compose on one e2-standard-4 is a perfectly reasonable deployment for this scale. Kubernetes adds substantial operational overhead — control plane, RBAC, network policies, ingress controllers — that I would not benefit from at one VM. I’m one operator; the system needs to be operable by one operator.

What I gave up. Horizontal scaling. If a single service became the bottleneck, I’d have to either scale the VM or restructure. The current shape doesn’t auto-scale.

How I’d reconsider. At ~5 VMs worth of load or ~10 services that need to scale independently, Kubernetes starts to repay its complexity. At 1 VM and 11 services, Compose is correct. Promoting prematurely is a common mistake — it adds operational cost long before it adds operational benefit.

Decision 11 (bonus): Editorial-blog separation from the trading-app deployment pipeline

Why this decision. The blog can ship at any time; the trading app’s deployment requires a specific window (no positions open, no live signals being delivered). Coupling them would mean the blog can’t ship during market hours, which is operationally annoying for no benefit.

What I gave up. A single deployment graph. I now have two, with different cadences.

How I’d reconsider. I wouldn’t — this is one of the decisions that aged best.

Decision 12 (bonus): Daily Upstox token rotation, manual

Why this decision. The Upstox token expires every 24 hours. Automating the rotation requires storing the refresh token in a secure store, automating the OAuth refresh, handling the edge cases when the refresh fails. The manual rotation is a 2-minute operator task on a known cadence; the automation would have been a week of work for a system that has one operator.

What I gave up. Operational autonomy. The system requires daily human attention.

How I’d reconsider. At a second operator, I’d automate. As a one-person operation, the manual cadence is acceptable. (And: this is the decision I’m most likely to reverse soon. It’s the wrong kind of dependency to have.)

10. What I got wrong

This is the section most “look at my project” posts skip. Here are the things I got wrong, in roughly the order I noticed them.

Wrong assumption 1: Backtest performance would translate cleanly to forward performance

What I assumed. A 20-30% degradation from backtest to live was a reasonable expectation, based on the standard discount for slippage.

What I found. The degradation is closer to 40-60%, and it’s not uniformly distributed across strategies. Cross-sectional momentum degrades least (perhaps 25%); index-options premium-selling degrades most (50-60%). The systematic causes:

Real slippage in Indian options is more variable than typical models account for. The bid-ask on deep ITM and deep OTM strikes can be 5-10x wider during low-volume periods than during the open and close.
Pin risk on expiry days is hard to capture in a backtest. The actual P&L distribution near pin on expiry day has long tails the historical IV doesn’t anticipate.
News-event gaps are not in the backtest data at all in many cases. A 2% gap-down on a budget announcement was not in the training window, but it definitely happens in live.

What I changed. I built a “conservative-backtest” mode that injects synthetic slippage at 2x the historical average and synthetic news gaps at the 95th percentile of historical news-day moves. The backtest results are uglier; the gap to live is smaller.
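
A minimal sketch of what such a conservative mode might look like, assuming per-trade slippage is charged as a flat rupee amount and that a news gap always moves against the position. The names SimTrade, stressed_slippage, stressed_gap and conservative_pnl are hypothetical, as is the spans_event flag; none of them come from the Quantsentinel codebase.

```python
from dataclasses import dataclass
from statistics import mean, quantiles


@dataclass
class SimTrade:
    gross_pnl: float    # backtest P&L before costs, in rupees
    notional: float     # position notional, in rupees
    spans_event: bool   # True if the holding period crosses a news event


def stressed_slippage(historical_slippage, multiplier=2.0):
    """Per-trade slippage charge: a multiple of the historical average."""
    return multiplier * mean(historical_slippage)


def stressed_gap(news_day_moves, percentile=95):
    """Gap size (as a fraction) at the given percentile of |news-day moves|."""
    cuts = quantiles([abs(m) for m in news_day_moves], n=100, method="inclusive")
    return cuts[percentile - 1]


def conservative_pnl(trade, slippage, gap):
    """Apply stressed slippage, plus an adverse gap on event-spanning trades."""
    pnl = trade.gross_pnl - slippage
    if trade.spans_event:
        pnl -= gap * trade.notional  # assumption: the gap always hurts
    return pnl


slip = stressed_slippage([40.0, 60.0, 50.0])
gap = stressed_gap([0.005, 0.01, 0.012, 0.02, 0.008])
trades = [SimTrade(500.0, 100_000.0, False), SimTrade(800.0, 100_000.0, True)]
results = [conservative_pnl(t, slip, gap) for t in trades]
```

Treating every gap as adverse is deliberately pessimistic; a gentler variant would sample the gap direction, at the cost of making the backtest less of a worst-case bound.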

Wrong assumption 2: ML alpha would be the primary source of value

What I assumed. Sophisticated ML models would generate alpha proportional to their complexity, and the more elaborate the ML, the higher the expected return per unit of risk.

What I found. The ML alpha is real but modest. The TFT improved directional accuracy from 53% to 57%, which is meaningful but not transformative. The cost engine — a deterministic gate that refuses any trade with negative net EV — produced more value than every ML component combined.

Why. Retail trading losses come overwhelmingly from cost-drag (positions that were close to net-zero pre-cost and definitively negative post-cost) and from risk-management failures (positions held too long, sized too large, correlated too tightly), not from missing alpha. A system that prevents the cost-drag and enforces the risk discipline produces more value than a system with sophisticated ML and no discipline.

What I changed. I’d describe the system differently. It’s a risk management system with an alpha layer attached, not an ML system that happens to have risk gates. The phrasing isn’t cosmetic; it’s a different value proposition to a customer and a different system to evaluate.
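
The cost gate described above reduces to a small deterministic check. The sketch below is illustrative only: TradeEstimate and its fields are hypothetical names, and total_cost stands in for whatever the real cost engine sums (brokerage, taxes, fees, expected slippage).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TradeEstimate:
    win_prob: float    # estimated probability the trade wins
    avg_win: float     # expected gain when it wins, in rupees
    avg_loss: float    # expected loss when it loses, as a positive number
    total_cost: float  # round-trip costs, in rupees (assumed pre-computed)


def net_expected_value(t: TradeEstimate) -> float:
    """Gross expected value minus all round-trip costs."""
    gross = t.win_prob * t.avg_win - (1.0 - t.win_prob) * t.avg_loss
    return gross - t.total_cost


def cost_gate(t: TradeEstimate) -> bool:
    """Return True only if the trade is not negative after costs."""
    return net_expected_value(t) >= 0.0


# Same edge, different cost: only the cheaper trade passes the gate.
expensive = TradeEstimate(win_prob=0.57, avg_win=1000.0, avg_loss=1000.0, total_cost=180.0)
cheap = TradeEstimate(win_prob=0.57, avg_win=1000.0, avg_loss=1000.0, total_cost=100.0)
decisions = (cost_gate(expensive), cost_gate(cheap))
```

With a 57% hit rate and symmetric payoffs, the gross edge is roughly 140 rupees per trade, so a 180-rupee cost turns a directionally correct model into a losing one. That is the cost-drag effect in miniature.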

Wrong assumption 3: Multi-timeframe coordination would be straightforward

What I assumed. Combine the per-horizon signals with weights proportional to historical accuracy, done.

What I found. The horizons disagree most when the trade is most interesting, and the disagreement is informative. A system that averages over disagreement throws away the signal in the disagreement itself.

What I changed. The MultiTimeframeAggregator now emits a “consensus” object with the disagreement structure (which horizons agree, which disagree, by how much), and the alpha-score is conditioned on the consensus structure, not just the average. This is materially better than the original design.
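
One way a consensus object of this kind could be shaped is sketched below. The Consensus fields, the horizon labels, and the dispersion-based shrinkage rule are all assumptions made for illustration; the post does not specify how the real MultiTimeframeAggregator conditions the alpha-score.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Consensus:
    mean_signal: float  # plain average across horizons
    dispersion: float   # population std-dev of the per-horizon signals
    agree: tuple        # horizons whose sign matches the average
    disagree: tuple     # horizons whose sign does not


def build_consensus(signals: dict) -> Consensus:
    """Summarise per-horizon signals without discarding the disagreement."""
    values = list(signals.values())
    avg = mean(values)
    majority_sign = 1 if avg >= 0 else -1
    agree = tuple(h for h, s in signals.items() if s * majority_sign > 0)
    disagree = tuple(h for h in signals if h not in agree)
    return Consensus(avg, pstdev(values), agree, disagree)


def alpha_score(c: Consensus, max_dispersion: float = 0.5) -> float:
    """Shrink the averaged signal as disagreement grows (assumed rule)."""
    shrink = max(0.0, 1.0 - c.dispersion / max_dispersion)
    return c.mean_signal * shrink


split = build_consensus({"5m": 0.6, "1h": 0.5, "1d": -0.4})
aligned = build_consensus({"5m": 0.5, "1h": 0.5, "1d": 0.5})
```

The point of the structure is that split and aligned can have similar averages while producing very different scores, because the downstream consumer sees which horizons dissent and by how much.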

Wrong assumption 4: The model registry would be the boring part

What I assumed. Model registry is a known design; I’d implement it quickly and move on.

What I found. The registry is the operationally most-active piece of the system. Promotion windows, paired evaluations, shadow rollouts, the rollback path — these are the things I touch most often when something looks off. Building it well repaid the time many times over.

What I changed. Nothing in the registry; it works. But my prioritization for the next system would be: build the registry first, the alpha second. The registry is what makes the alpha safe to deploy.

Wrong assumption 5: Indian retail behavioral patterns would mirror US patterns

What I assumed. Documented retail behavioral biases (anchoring, loss aversion, hot-hand fallacy, etc.) would apply equally.

What I found. Indian retail F&O behavior is more extreme on certain dimensions. The “gambling vs. investing” cultural framing produces different behavior at the tails — more concentrated bets, more leveraged bets, more “trading the news” without a system. The behavioral data the platform collects for tenant-adaptation training does not look like the US literature.

What I changed. The persona-generation in tenant_adaptation is built from Indian-market data, not US-market priors. The synthetic personas (the “Priya” who reads Telegram tips, the “Rohit” who trades the budget announcement, the “Sanjay” who held HDFC through the housing crisis) are calibrated to behavior patterns I observed in actual NSE data, not to academic priors.

Wrong assumption 6: Anomaly detection would be precision-limited, not recall-limited

What I assumed. False positives would be the problem — the system would flag too many anomalies and I’d ignore them.

What I found. The opposite. The system was initially under-flagging. Real degradation events were not being caught because the rolling-window thresholds were too strict. The thresholds have been loosened twice; the alert cadence is now appropriately noisy (about one per week; of the four in the last fortnight, two were real).

What I changed. A looser threshold (rolling z-score > 2.5, not > 3.0) and a shorter confirmation run (three consecutive windows, not five). The system catches more, including more false positives, but the operator-review cost is acceptable.
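
The thresholding above can be sketched as a small stateful monitor. This is an illustration under assumptions: the class name DegradationMonitor, the 20-window baseline, the 5-window warm-up, and the choice to keep breaching windows out of the baseline are all hypothetical. Only the 2.5 z-score threshold and the three-consecutive-windows rule come from the text.

```python
from collections import deque
from statistics import mean, pstdev


class DegradationMonitor:
    """Alert when a metric breaches a rolling z-score for N windows in a row."""

    def __init__(self, history=20, z_threshold=2.5, consecutive=3, warmup=5):
        self.baseline = deque(maxlen=history)  # recent non-anomalous windows
        self.z_threshold = z_threshold
        self.consecutive = consecutive
        self.warmup = warmup
        self.streak = 0

    def update(self, value: float) -> bool:
        """Feed one window's metric; return True if an alert should fire."""
        breached = False
        if len(self.baseline) >= self.warmup:
            mu = mean(self.baseline)
            sigma = pstdev(self.baseline)
            if sigma > 0:
                breached = abs(value - mu) / sigma > self.z_threshold
        if breached:
            self.streak += 1
        else:
            self.streak = 0
            self.baseline.append(value)  # only normal windows update the baseline
        return self.streak >= self.consecutive


monitor = DegradationMonitor()
normal = [1.0 if i % 2 == 0 else 1.1 for i in range(20)]
alerts = [monitor.update(v) for v in normal + [5.0, 5.0, 5.0]]
```

Lowering consecutive from five to three shortens the time to first alert by two windows, which is where most of the extra recall (and the extra false positives) would come from.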

Wrong assumption 7: The tenant-isolation tests would catch every cross-tenant bug

What I assumed. A thorough test suite for the tenant-isolation invariants would catch any code change that broke them.

What I found. The first cross-tenant bug I shipped was in the test code itself. A test fixture was setting up a “tenant A” and “tenant B” in the same session and forgetting to clear the session between them, so the live code path was actually running with the tenant id from the previous test. The test passed because the live code path saw “the correct” tenant id; the bug only surfaced when I manually checked.

What I changed. Test fixtures for tenant isolation now use a context manager that explicitly clears the session and asserts no leakage between tests. The DB trigger that hard-errors on cross-tenant deliveries was my real safety net; the test suite is a secondary check.
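
A sketch of the fixture pattern, assuming the tenant id lives in some ambient per-request state. Here a ContextVar stands in for the real session object, and tenant_session and current_tenant are hypothetical names.

```python
from contextlib import contextmanager
from contextvars import ContextVar

# Stand-in for the real session store; an assumption for this sketch.
_current_tenant: ContextVar = ContextVar("current_tenant", default=None)


def current_tenant():
    return _current_tenant.get()


@contextmanager
def tenant_session(tenant_id: str):
    """Set the tenant for one test, refusing to start if a previous one leaked."""
    leaked = _current_tenant.get()
    if leaked is not None:
        raise AssertionError(f"tenant leak: {leaked!r} still set")
    token = _current_tenant.set(tenant_id)
    try:
        yield tenant_id
    finally:
        _current_tenant.reset(token)  # always clear, even if the test fails
        assert _current_tenant.get() is None, "tenant id survived teardown"


with tenant_session("tenant-a"):
    seen_a = current_tenant()
with tenant_session("tenant-b"):
    seen_b = current_tenant()
```

The entry check is what would have caught the original bug: a fixture that forgot to clear state would make the next test fail loudly at setup instead of passing with the wrong tenant id.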

Wrong assumption 8: The gateway timeout would never matter

What I assumed. The gateway sits between the frontend and the Python services; the timeout on it is a function of how long my Python services take to respond, and I know my Python services are fast.

What I found. Not always. The /pipeline/decide endpoint takes ~15 seconds on a cold pass — running the entire alpha → strategy → risk → live-signal flow against fresh inputs. The gateway’s getJSON helper had a 10-second timeout. The gateway returned 502 every time the user hit the page cold; the frontend treated the 502 as no data and rendered an empty card. I described this in detail in §3 of an internal debug log; the fix was a one-line swap to the long-timeout client.

What I changed. The gateway now has explicit per-endpoint timeout configuration. Endpoints known to be expensive (/pipeline/decide, /research/refresh) use a long-timeout client; the rest use the 10-second default. A second wrong-default-timeout would now produce a sensible error message instead of a silent 502.
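
The per-endpoint lookup is simple enough to show in a few lines. The real gateway is written in Go; this is a Python illustration of the configuration shape only. The 60-second value is an assumption (the text says only "long-timeout"), and timeout_for and describe_timeout are hypothetical helpers.

```python
DEFAULT_TIMEOUT_S = 10.0
LONG_TIMEOUT_S = 60.0  # assumed value; the post does not state it

# Endpoints known to be expensive get the long budget explicitly.
ENDPOINT_TIMEOUTS = {
    "/pipeline/decide": LONG_TIMEOUT_S,
    "/research/refresh": LONG_TIMEOUT_S,
}


def timeout_for(path: str) -> float:
    """Return the timeout budget for an endpoint, falling back to the default."""
    return ENDPOINT_TIMEOUTS.get(path, DEFAULT_TIMEOUT_S)


def describe_timeout(path: str, elapsed_s: float) -> str:
    """Build an explicit error message instead of a bare upstream failure."""
    budget = timeout_for(path)
    return f"upstream {path} exceeded its {budget:.0f}s budget after {elapsed_s:.1f}s"


message = describe_timeout("/auth/me", 10.2)
```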

Wrong assumption 9: The deploy procedure would always succeed if I followed the steps

What I found. Two deploy-related gotchas burned me twice each before I wrote them down. First: docker compose restart does not reload the .env file; it preserves the environment from the original up. Rotating the daily Upstox token and running restart produced a service that was still holding the old token while I assumed it had the new one. Fix: up -d --force-recreate is the right invocation. Second: git archive run from a subdirectory (my shell had drifted away from the repo root) archives only that subtree, not the whole repo. I deployed a partial archive twice before catching it. Fix: always git archive from the repo root.

What I changed. Both gotchas are now in the operator-memory file with their root cause, the symptom, and the fix. They will not bite me a third time. They might bite a future operator if I haven’t communicated them well — the writeup matters as much as the fix.

Wrong assumption 10: Frontend bundle URLs would not be a deploy-time concern

What I found. Adding a parallel front-door domain — a second hostname pointing at the same backend — silently broke login on the new domain because the frontend bundle had the original domain hardcoded as a fallback API origin. The browser made a cross-origin fetch, the preflight 405’d, the POST never went out, and the user saw “login failed” with no useful error message. The fix was to make the bundle host-agnostic — relative /api/v1 paths — so the same bundle works under any front-door domain.

What I changed. All four files in the frontend that referenced the hardcoded domain now default to relative paths. The bundle audit step in the deploy procedure checks for any absolute API URLs and fails the deploy if any are found.
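
A bundle audit of this kind could be as small as the sketch below. The regex, the restriction to .js files, and the function name find_absolute_api_urls are assumptions; the only detail taken from the text is that absolute URLs pointing at /api/v1 should fail the deploy.

```python
import re
import sys
from pathlib import Path

# Any http(s) origin followed by the API prefix counts as a hardcoded origin.
ABSOLUTE_API_URL = re.compile(r"https?://[^\s\"'`]+/api/v1")


def find_absolute_api_urls(bundle_dir):
    """Return (file, url) pairs for every absolute API URL in the bundle."""
    hits = []
    for path in sorted(Path(bundle_dir).rglob("*.js")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for match in ABSOLUTE_API_URL.finditer(text):
            hits.append((str(path), match.group(0)))
    return hits


def main(bundle_dir):
    """Exit non-zero so the deploy script stops when a hardcoded origin is found."""
    hits = find_absolute_api_urls(bundle_dir)
    for file, url in hits:
        print(f"absolute API URL in {file}: {url}")
    sys.exit(1 if hits else 0)
```

Relative paths such as /api/v1/auth/me never match because the pattern requires a scheme, so a host-agnostic bundle passes cleanly.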

Wrong assumption 11: All ML decisions could be handled by single multi-class classifiers

What I assumed. For the adjustment-decision model — which has to decide between HOLD, ROLL_UP, ROLL_OUT, ADD_PROTECTION, REDUCE, EXIT for an existing position — I could train one classifier across all six classes and get a useful production system.

What I found. The class imbalance was severe. HOLD was ~70% of historical cases. The other actions were rare individually. Multi-class accuracy was misleading because the model could score well by predicting HOLD almost every time. The system was producing a 74% accuracy figure that was useless for actually deciding when to act.

What I changed. Hierarchical decomposition. First, a binary classifier: does this position need any adjustment at all? Then, conditional on yes, a multi-class classifier picks which adjustment. The binary classifier is calibratable in a way the unconditional multi-class isn’t, and the adjustment-type model gets to learn against a less degenerate prior. Per-class accuracy improved materially after the split. The broader lesson: when class imbalance is severe, the multi-class framing is hiding the real question.
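
The two-stage structure, and the reason raw accuracy misleads, can be sketched without any ML library. The stub callables, the feature names, and the 0.5 threshold below are hypothetical; in the real system each stage would be a trained model.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a 'model' that always predicts the most common class."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


def decide(features, needs_adjustment_prob, pick_adjustment, threshold=0.5):
    """Stage 1: binary gate. Stage 2: adjustment type, only if the gate fires."""
    if needs_adjustment_prob(features) < threshold:
        return "HOLD"
    return pick_adjustment(features)


# With ~70% HOLD, always answering HOLD already scores 0.70.
labels = ["HOLD"] * 70 + ["EXIT"] * 10 + ["REDUCE"] * 10 + ["ROLL_OUT"] * 10
baseline = majority_baseline_accuracy(labels)


# Stub stages standing in for the two trained classifiers (assumptions).
def gate(features):
    return features["breach_score"]


def picker(features):
    return "EXIT" if features["days_to_expiry"] <= 1 else "ROLL_OUT"


quiet = decide({"breach_score": 0.1, "days_to_expiry": 5}, gate, picker)
urgent = decide({"breach_score": 0.9, "days_to_expiry": 1}, gate, picker)
```

Against a 0.70 do-nothing baseline, a 74% multi-class score is only four points of real information, which is why the binary gate is evaluated and calibrated on its own.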

Wrong assumption 12: Counterfactual labeling would be straightforward

What I assumed. For the entry-timing model, generating training labels would be straightforward: for each historical entry, simulate what would have happened if I’d entered now, in 5 minutes, in 10 minutes, or skipped the window. The simulation gives me the “optimal” action; that’s the label.

What I found. The counterfactual simulations are highly sensitive to the assumed slippage model and execution-timing model. Different assumption sets produce different “optimal” labels for the same historical entry. A trade that the optimistic-slippage simulator labels as WAIT_5_MIN is labeled ENTER_NOW under conservative slippage. The labels were noisier than I had anticipated, by a wide margin.

What I changed. Rather than commit to one assumption regime, I generate labels under three (optimistic / median / conservative), train one model per regime, and ensemble the three at inference. The ensemble disagreement is itself an uncertainty signal: when the three models disagree on the timing recommendation, the system defaults to ENTER_NOW. The counterfactual problem isn’t solved; it’s contained.
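
The inference-time rule is small enough to write out. In this sketch, ensemble_timing and the disagreement score are illustrative choices; the ENTER_NOW default on any disagreement is the behavior described above.

```python
from collections import Counter

DEFAULT_ACTION = "ENTER_NOW"


def ensemble_timing(predictions: dict):
    """Combine per-regime recommendations.

    predictions maps a regime name (optimistic / median / conservative)
    to that model's recommended action. Returns (action, disagreement),
    where disagreement is the share of models outside the largest group.
    """
    votes = Counter(predictions.values())
    unanimous = len(votes) == 1
    action = next(iter(votes)) if unanimous else DEFAULT_ACTION
    disagreement = 1.0 - votes.most_common(1)[0][1] / len(predictions)
    return action, disagreement


agreed = ensemble_timing(
    {"optimistic": "WAIT_5_MIN", "median": "WAIT_5_MIN", "conservative": "WAIT_5_MIN"}
)
contested = ensemble_timing(
    {"optimistic": "WAIT_5_MIN", "median": "WAIT_5_MIN", "conservative": "ENTER_NOW"}
)
```

Falling back to ENTER_NOW means the timing model can only delay an entry when all three regimes agree that waiting helps, so label noise can cost a missed improvement but not an extra delay.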

Wrong assumption 13: Adding more ML always improved the decision

What I assumed. For each rule-based decision point in the system, ML would provide measurable improvement. The conventional wisdom that “ML beats rules” was something I treated as default-true and only questioned when forced to.

What I found. For some decisions, the rule-based approach was already capturing ~95% of the available signal. The marginal ML lift wasn’t worth the maintenance burden — the data pipelines, the monitoring, the retraining cadence, the model-registry overhead. I built a few ML modules that produced ~2% lift over the rule-based baseline and quietly retired them because the operational cost exceeded the benefit.

What I changed. Started evaluating proposed ML additions on expected lift, not on technical interest. Some planned modules I never built because the rule-based approach was already good enough; in those cases the rule got documented as the durable answer, not as a placeholder for the next ML version. This sounds obvious. It isn’t — there’s a real psychological pull toward building the “more sophisticated” solution, and the discipline of choosing not to is harder than it should be.

11. What this work demonstrates technically

The tables below map the work to the capabilities it actually exercises.

ML capabilities demonstrated

Capability	Evidence in Quantsentinel
Production ML systems at non-trivial scope	17 ML systems organized into a 4-layer alpha engine + 7-wall risk castle; six closed loops actively running; five later modules added by integrating with existing infrastructure rather than rebuilding it
Multi-paradigm ML	Supervised (XGBoost / LightGBM on direction, entry-timing, pin-risk, wing-selection, event-impact), reinforcement learning (contextual Thompson bandit on strategy selection), Bayesian (model averaging, conjugate updates on signal weights), generative (Gemini-driven factor mining), time-series (TFT, distilled student, HAR-RV, GARCH)
Ensemble methods	Bayesian model averaging on signal weights; HAR-RV + GARCH ensemble on vol forecast; learned LightGBM ensemble on direction prediction; conformal-alpha ensemble for the alpha pool
Time-series modeling	TFT (production-deployed), HAR-RV, GARCH, jump-diffusion components, anomaly detection on rolling indicators
ML systems thinking	MLflow-as-foundation + custom production layer; semantic versioning; paired-evaluation promotion workflow; shadow-rollout state; rollback path; validation contracts per model type; failed-attempt logging
Counterfactual learning	Used for the adjustment classifier and the entry-timing model; trained under multiple assumption regimes to bound the simulator-sensitivity problem
Multi-output models	Event-impact predictor returns three coupled outputs (magnitude regression + direction classification + IV-change regression) trained as separate heads against shared features
Hierarchical decision structure	Binary “needs adjustment?” gate before multi-class adjustment-type classification — the response to severe class imbalance, not a stylistic choice
Causal vs. predictive distinction	The system is honest about which signals are predictive (most) vs. structural (the risk-castle gates). Attribution distinguishes; the docs distinguish.
Online learning / continual learning	EWC on the deep-hedger; distilled-student on the TFT; Bayesian model averaging on the ensemble weights; contextual bandit on the strategy selector
Drift detection	Per-factor IC monitoring; feature-drift monitor; per-model anomaly detection; circuit-breaker pattern for fast-degradation response
MLOps integration judgment	MLflow + custom additions pattern, with explicit reasoning about when each layer is the right answer (and when neither is)
Explainability infrastructure	SHAP values computed alongside production predictions on the direction ensemble; surfaced in the audit trail for “why did the system enter this trade” questions

Engineering capabilities demonstrated

Capability	Evidence
System architecture	11-service event-driven distributed system supporting 17 ML systems; multi-tenant from day one; single-VM Compose deployment with explicit horizontal-scale path
Database design	TimescaleDB hypertables for time-series, relational tables for transactional data, custom `model_registry` schema supplementing MLflow, row-level-security policies for tenant isolation, BEFORE-INSERT trigger as the cross-tenant safety net
API design	Internal service boundaries (each service owns a clearly-defined domain), Go gateway as the BFF, per-tenant path resolution, structured rate limiting
Production operations	Auto-heal sidecar on every container; structured logging across all services; per-endpoint rate limits; manual operator runbooks; daily token rotation procedure documented
Security	Multi-layer tenant isolation (app + DB + arch); auth via signed cookies + CSRF rotation; secrets isolated to specific services (tenant_adaptation explicitly excludes broker credentials); audit logging for regulatory readiness
Concurrent-systems judgment	PostgreSQL advisory locks for training-job concurrency; SERIALIZABLE-isolation transactions for atomic promotions; designed-in protection against the race conditions that ML pipelines silently produce
Tool integration	MLflow + custom production layer — demonstrates the discipline of using mature tools as foundation and customizing only where they don’t fit, rather than the more common “build everything ourselves” or “trust the tool blindly” failure modes
Multi-language judgment	Go for the gateway (right tool), Python for the numerical work (right tool), Astro for the editorial blog (right tool). Not a “pick one language” answer.

Domain expertise demonstrated

Domain	Evidence
Quantitative finance	Volatility forecasting (HAR-RV + GARCH), implied distribution recovery (Breeden-Litzenberger), Greeks-aware position management, multi-leg structure construction, deep hedging
Risk management	7-wall risk castle as a first-class architectural concern; 12 kill switches at progressively higher scopes; cost-engine as a kill condition; tail-hedge requirement on naked positions; graduated event-handling rather than binary blackouts
Indian markets specifically	Cost engine accounts for STT, exchange fees, SEBI fees, GST; lot-size awareness (NIFTY moved from 75 to 65 in early 2026); FII/DII flow tracking; regime-conditional behavior for Indian market hours; round-number pinning patterns specific to NSE retail flow
Market microstructure	OI-flow signal; GEX estimation; depth-30 websocket consumer; order-book imbalance signal from the microstructure module; pin-risk concentration patterns surfaced via the probabilistic predictor
Options strategies	Iron condors, butterflies, vertical/calendar spreads, naked premium-selling with required tail hedges, expiry-day pin avoidance — each with structural decisions baked into the wing-selection optimizer
Futures strategies	Conviction-band-routed directional futures with intelligent hedge selection from the dedicated hedge model
ML for finance	17 specialized models, each targeting a specific decision in the trading workflow rather than one giant end-to-end model — the right factorization for a domain where interpretability per-decision matters
Statistical robustness	Deflated Sharpe, Probabilistic Sharpe, CSCV overfit probability — all three computed for every strategy; the decay monitor reads them daily; factor pool prunes accordingly
Behavioral / persona modeling	20-persona synthetic-trader system in the tenant-adaptation service; calibrated to Indian-market behavioral patterns, not US-market priors

What this does not claim

Equally important. The system has not been forward-tested for long enough to make confident claims about live alpha capture. The closed-beta tenants are running in paper-trading mode under a 14-day dry-run lock. The ML components have promising in-sample and out-of-sample results, but production-grade alpha claims require months of real capital, and the system doesn’t have that yet.

What it demonstrates is engineering capability and judgment, not the eventual financial outcome. Whether production ML can be shipped in a regulated domain — the architecture, the execution, the audit trail — is settled here. Whether this particular system makes money over ten years is a separate question, on a separate timeline.

12. Closing reflections

What I’d do differently

Start with simpler infrastructure. Prove value before adding complexity. The 4-layer alpha engine and the 7-wall risk castle are correct for what the system needs to do, but they were built before the simpler version had been validated end-to-end. Some of the early debugging would have been faster against a stripped-down baseline; the layered version is the right destination, but the path could have been less circuitous.

Spend more time on real-world data validation. The “wrong assumption 1” item — that backtest-to-live degradation would be 20-30%, not 40-60% — would have been caught earlier if I’d spent a day pulling actual historical fill data and comparing it to my slippage model before relying on the model. I trusted the model longer than the model deserved.

Build the boring infrastructure (logging, monitoring, alerting) earlier. The anomaly detection layer is real, but I built it after the first wave of “why is this dashboard empty” debugging sessions had already happened. Earlier investment in observability would have saved time downstream.

Write the operator runbook in parallel with the code, not after. The deploy gotchas (in the §10 “wrong assumption 9” item) bit me twice each because I hadn’t written them down the first time. Documentation as a memory aid for future me is a higher-leverage activity than I gave it credit for.

What this work positioned me for

Deep production ML in a regulated domain. Most “ML engineer” roles do not expose you to the regulated-domain constraint set — the audit trail, the multi-layer isolation, the deliberate friction in promotion workflows. Quantsentinel’s seven walls and twelve kill switches and three isolation layers are the kind of thing that’s hard to learn except by building one.

Quantitative finance expertise. The Greeks-aware position management, the implied-distribution recovery, the regime classification, the multi-timeframe coordination — these aren’t generic ML capabilities, and they’re not easy to acquire without working in this domain. The 70,342 lines of Python in this codebase are mostly evidence of this kind of domain depth.

System design at meaningful scope. Eleven services, seventeen ML systems, four-layer alpha engine, seven-wall risk castle, custom model registry, multi-tenant architecture from day one — this is system design at the level senior engineers are evaluated on, applied to a problem domain where most teams have not built anything at this depth.

Domain-specific judgment. Knowing when to use a Bayesian linear regression vs. a TFT, when to ship a contextual bandit vs. when to wait for proper RL, when to build vs. when to integrate, when manual operator discipline is acceptable vs. when automation is non-negotiable — this judgment is the most-valued thing on a senior team, and it’s the thing that’s hardest to demonstrate without the body of work to point at.

What I’m interested in next

ML / AI in financial services at scale. This is the intersection I find most worth working in — the codebase above is the long-form version of what that looks like in practice.

Quantitative system design in regulated domains. The hard, portable problem is “we need production ML in a regulated domain and don’t yet have the architecture story” — and it shows up the same way across finance, healthcare, insurance, and other regulated verticals.

Products that combine ML rigor with domain depth. The most interesting problems sit where “the team understands the domain” meets “the team understands ML production.” That intersection is where I want to keep building.

Closing invitation

The point of writing 14,000 words about a system isn’t the words. It’s the demonstration that the writer understood the system well enough to explain it honestly — including the parts that didn’t work, the alternatives that were rejected, and the assumptions that turned out to be wrong. That kind of writing is harder to fake than the system itself.

Thanks for reading.
