Turning a five-script POC into a pipeline that survives Monday

The proof of concept worked. Five scripts, run in order, produced a trained model and a respectable score. It also depended on a folder of hardcoded paths, a run order only I knew, and my full attention every time it ran. That last part is the tell: a POC needs a human in the loop, and production can’t have one.

This is the story of the refactor that took it from “works when I run it” to “runs without me.”

From scripts to a package

The first move was structural. Five top-to-bottom scripts became a proper src/ package with clear boundaries between extraction, features, training, and serving. The CLI became the only entry point:

rcm extract --since 2026-06-01
rcm train --config configs/prod.yaml
rcm predict --window today

The point isn’t the tidiness. It’s that every step is now independently runnable, testable, and schedulable, instead of being a paragraph in a script that only makes sense in sequence.
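
A minimal sketch of how such an entry point might look, using argparse from the standard library. The subcommand and flag names come from the commands above; the handler bodies, the date parsing for --since, and the default for --window are placeholder assumptions, not the real pipeline code.

```python
import argparse
from datetime import date


def build_parser() -> argparse.ArgumentParser:
    """Build the `rcm` parser with one subcommand per pipeline step."""
    parser = argparse.ArgumentParser(prog="rcm")
    sub = parser.add_subparsers(dest="command", required=True)

    extract = sub.add_parser("extract", help="pull new rows from the source")
    # Parsing --since as an ISO date rejects malformed input before any work starts.
    extract.add_argument("--since", type=date.fromisoformat, required=True)

    train = sub.add_parser("train", help="fit a model from a config file")
    train.add_argument("--config", required=True, help="path to a YAML config")

    predict = sub.add_parser("predict", help="score a window of data")
    predict.add_argument("--window", default="today")  # assumed default

    return parser


def main(argv=None) -> int:
    args = build_parser().parse_args(argv)
    # Placeholder dispatch: real handlers would call into the package's modules.
    if args.command == "extract":
        print(f"extract rows since {args.since.isoformat()}")
    elif args.command == "train":
        print(f"train with config {args.config}")
    else:
        print(f"predict for window {args.window}")
    return 0
```

Because main takes an argument list and returns an exit code, each step can be called from a test or a scheduler the same way it is called from a shell, for example main(["train", "--config", "configs/prod.yaml"]).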

Storage that doesn’t re-read the world

The POC re-pulled and re-read everything on every run. Fine for a demo over a sample. Ruinous over a growing production table.

Two changes fixed it. First, incremental extraction: each run pulls only rows newer than the watermark recorded by the last successful run, instead of the full history. Second, the landed data is written as Hive-partitioned Parquet:

data/
  rcm/
    dt=2026-06-07/part-0.parquet
    dt=2026-06-08/part-0.parquet
    dt=2026-06-09/part-0.parquet
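
Given a layout like the one above, choosing what to read becomes a question of directory names. The sketch below shows only that path logic with the standard library; the function names are hypothetical, and in practice a Parquet reader would usually do this pruning itself.

```python
from datetime import date, timedelta
from pathlib import Path


def partition_dir(root: Path, day: date) -> Path:
    """Return the Hive-style directory for one day, e.g. root/dt=2026-06-09."""
    return root / f"dt={day.isoformat()}"


def recent_partitions(root: Path, end: date, days: int) -> list[Path]:
    """List existing partition directories for the `days` days ending at `end`."""
    wanted = [partition_dir(root, end - timedelta(days=offset)) for offset in range(days)]
    # Only directories that exist are returned; older history is never listed or opened.
    return sorted(p for p in wanted if p.is_dir())
```

A call such as recent_partitions(Path("data/rcm"), date(2026, 6, 9), 3) would select the three directories in the listing and ignore everything older.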

Partitioning by date means a job that only cares about recent data reads three small files, not one enormous one. Parquet’s columnar layout means a model that uses forty columns doesn’t pay to read the other two hundred. The combined effect on read time was not subtle.
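
The incremental extraction described above can be sketched in a few lines. This is an illustration under stated assumptions, not the pipeline's actual code: sqlite3 stands in for the production database, and the events table, the updated_at column, and the JSON watermark file are all hypothetical.

```python
import json
import sqlite3
from pathlib import Path


def read_watermark(path: Path, default: str = "1970-01-01T00:00:00") -> str:
    """Return the last stored watermark, or a default that matches every row."""
    if path.exists():
        return json.loads(path.read_text())["last_seen"]
    return default


def write_watermark(path: Path, value: str) -> None:
    # Write to a temp file, then rename, so a crash never leaves a half-written watermark.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"last_seen": value}))
    tmp.replace(path)


def extract_incremental(conn: sqlite3.Connection, watermark_path: Path) -> list[tuple]:
    """Fetch only rows newer than the stored watermark, then advance it."""
    since = read_watermark(watermark_path)
    rows = conn.execute(
        "SELECT id, updated_at FROM events WHERE updated_at > ? ORDER BY updated_at",
        (since,),
    ).fetchall()
    if rows:
        # A real run would land the rows first and advance the watermark only after that succeeds.
        write_watermark(watermark_path, rows[-1][1])
    return rows
```

The strict greater-than comparison assumes timestamps are unique and that late-arriving rows never carry an older timestamp. If either assumption fails, rows can be skipped, and a small overlap window with deduplication is the usual remedy.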

The unglamorous parts that actually matter

Production failure is mostly boring failure: a transient SQL timeout, a node that blinked, a job that half-finished. So the refactor spent most of its budget there.

SQL retry with backoff, because the database will hiccup and a single blip shouldn’t kill an overnight run.
Structured logging (JSON lines with a run id), so when something does break at 3 a.m., the answer is in a query, not a scrollback.
Model versioning, so every prediction can be traced to the exact model and data window that produced it.
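
The first two items fit together naturally: a retry wrapper that reports each attempt as a JSON line tagged with the run id. The sketch below uses only the standard library. The names with_retry and JsonLineFormatter, the delay values, and the choice of TimeoutError as the transient error are assumptions; which exceptions count as retryable depends on the database driver.

```python
import json
import logging
import sys
import time
import uuid

RUN_ID = uuid.uuid4().hex  # one id per pipeline run, attached to every log line


class JsonLineFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "run_id": RUN_ID,
            "msg": record.getMessage(),
        }
        # Extra structured fields arrive via logger.info(..., extra={"fields": {...}}).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def get_logger(name: str = "rcm") -> logging.Logger:
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(JsonLineFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


def with_retry(fn, *, attempts=4, base_delay=1.0, retry_on=(TimeoutError,), sleep=time.sleep):
    """Call fn(), retrying on transient errors with exponentially growing delays."""
    log = get_logger()
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on as exc:
            if attempt == attempts:
                log.error("giving up", extra={"fields": {"attempt": attempt, "error": repr(exc)}})
                raise
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            log.warning("retrying", extra={"fields": {"attempt": attempt, "delay_s": delay}})
            sleep(delay)
```

Passing sleep as a parameter keeps the wrapper testable without real waiting. Adding random jitter to the delay is a common refinement when many jobs might retry against the same database at once.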

None of this improves the F1 score. All of it is the difference between a model that scored well once and a system a team can depend on.
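
Model versioning can be as light as a stable identifier derived from the model artifact and its training window, stamped onto every prediction. The sketch below is one hypothetical way to do that; the field names and the 12-character id are illustrative choices, not the pipeline's actual scheme.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ModelVersion:
    model_sha256: str  # hash of the serialized model artifact
    train_start: str   # first date in the training window
    train_end: str     # last date in the training window

    @property
    def version_id(self) -> str:
        """Short id that changes whenever the artifact or the data window changes."""
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()[:12]


def fingerprint(artifact: bytes) -> str:
    """Hash the bytes of a saved model file."""
    return hashlib.sha256(artifact).hexdigest()


def stamp(prediction: dict, version: ModelVersion) -> dict:
    """Return a copy of the prediction carrying its full provenance."""
    return {**prediction, "model_version": version.version_id, **asdict(version)}
```

Because the id is derived from content rather than assigned by hand, retraining on a different window or producing a different artifact yields a different version without anyone having to remember to bump a number.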

The constraint that shaped everything

One rule sat above all of this: inference uses only the columns available at decision time, never the ones that only exist after the fact. It’s an easy rule to break by accident — a tempting feature that quietly encodes the future — and a hard one to debug once it’s in. Enforcing it at the schema boundary, rather than trusting myself to remember, was the single most valuable guardrail in the whole pipeline.
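
A sketch of what enforcement at the schema boundary can look like: an explicit allowlist of decision-time columns that is checked before anything reaches the model. The class names and the column names in the usage note are hypothetical.

```python
from dataclasses import dataclass


class SchemaViolation(ValueError):
    """Raised when inference input does not match the decision-time schema."""


@dataclass(frozen=True)
class InferenceSchema:
    allowed: frozenset  # columns known to exist at decision time

    def validate(self, columns) -> list:
        """Return the columns unchanged, or raise if any are unexpected or missing."""
        cols = list(columns)
        unexpected = sorted(set(cols) - self.allowed)
        if unexpected:
            # Anything outside the allowlist is treated as a potential leak from the future.
            raise SchemaViolation(f"columns not available at decision time: {unexpected}")
        missing = sorted(self.allowed - set(cols))
        if missing:
            raise SchemaViolation(f"required columns missing: {missing}")
        return cols
```

With a schema such as InferenceSchema(frozenset({"requested_amount", "submitted_dow"})), passing an extra column like "final_outcome" fails loudly at the boundary instead of quietly inflating the score. The allowlist direction matters: a new column is rejected until someone deliberately confirms it is known at decision time.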

A POC answers “can this be modeled?” A pipeline answers “can this run on Monday without me?” They’re different questions, and the second one is most of the job.