Production RCM claim classifier
A binary classifier predicting which healthcare claims need a specific revenue-cycle action, refactored from a notebook POC into a scheduled, versioned production pipeline.
- 0.67
- F1 (positive)
- 0.90
- ROC AUC
- 5 → 1 pkg
- from POC
The problem
In revenue cycle management, a large share of claims need a particular follow-up action, and finding them by hand doesn’t scale. The goal was a classifier that surfaces exactly those claims, accurately enough that an operations team would trust it.
What I built
I took a working-but-fragile proof of concept — five scripts run in sequence —
and rebuilt it as a structured src/ package with a single CLI entry point,
incremental extraction against a watermark, Hive-partitioned Parquet storage,
SQL retry logic, structured logging, and model versioning.
The model itself is a gradient-boosted tree (XGBoost) with engineered features: target encoding, interaction frequencies, and quartile bins. Hyperparameters were searched with Optuna, and the decision threshold was optimized for F1 on the positive class rather than left at the library default.
A constraint worth naming
Inference uses only the feature columns available at decision time — never columns that only become populated after the action has happened. That rule was enforced at the schema boundary so a tempting but leaky feature couldn’t sneak in. It’s the kind of guardrail that doesn’t move the offline metric but keeps the online system honest.
Outcome
The model reached an F1 of roughly 0.67 on the positive class and an AUC near 0.90. Just as importantly, it moved from “works when I run it” to a pipeline that runs unattended and can trace every prediction back to a specific model version and data window.