Spark DataFrame extraction API

The problem

Allocation data had to be pulled, filtered, and deduplicated consistently across many workspaces, with behavior driven by configuration rather than per-workspace code changes.

What I built

A PySpark service running on Kubernetes, with queries and filters defined in config, so that adding a workspace means adding a config entry rather than shipping a deployment. The work included tracking down a subtle row-count discrepancy between two extraction paths. It traced to three causes: the paths read from different config sources, an allocation filter was missing, and deduplication ran globally in one place and per workspace in the other. It is the kind of bug that only shows up at the boundary between two systems that each look correct on their own.
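
To show how a missing filter and a different deduplication scope can each change the row count, here is a minimal sketch in plain Python rather than Spark, so it runs without a cluster. Everything in it is an assumption made for illustration: the field names (`workspace`, `allocation_id`, `status`), the `CONFIG` layout, and the helper functions are hypothetical and are not the service's real schema or code. It also takes "global vs. per-workspace" to mean deduplicating on the allocation key alone versus on the workspace plus the allocation key, which is only one possible reading.

```python
# Hypothetical allocation rows; field names are illustrative only.
ROWS = [
    {"workspace": "ws_a", "allocation_id": 1, "status": "active"},
    {"workspace": "ws_b", "allocation_id": 1, "status": "active"},
    {"workspace": "ws_a", "allocation_id": 2, "status": "active"},
    {"workspace": "ws_a", "allocation_id": 2, "status": "active"},  # exact duplicate
    {"workspace": "ws_b", "allocation_id": 3, "status": "cancelled"},
]

# Hypothetical config: one entry per workspace, so adding a workspace
# means adding an entry here rather than changing code.
CONFIG = {
    "ws_a": {"allowed_status": {"active"}},
    "ws_b": {"allowed_status": {"active"}},
}


def apply_filters(rows, config):
    """Keep rows whose workspace is configured and whose status is allowed."""
    kept = []
    for row in rows:
        ws_cfg = config.get(row["workspace"])
        if ws_cfg is not None and row["status"] in ws_cfg["allowed_status"]:
            kept.append(row)
    return kept


def dedupe(rows, key_fields):
    """Keep the first row seen for each distinct combination of key_fields."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out


filtered = apply_filters(ROWS, CONFIG)

# Global scope: one row per allocation_id across all workspaces.
global_rows = dedupe(filtered, ["allocation_id"])

# Per-workspace scope: one row per (workspace, allocation_id).
per_workspace_rows = dedupe(filtered, ["workspace", "allocation_id"])

# Same dedup scope, but with the filter skipped entirely.
unfiltered_rows = dedupe(ROWS, ["workspace", "allocation_id"])

print(len(global_rows), len(per_workspace_rows), len(unfiltered_rows))
```

On this toy data the three paths keep 2, 3, and 4 rows respectively, even though each step looks reasonable in isolation. In PySpark the same distinction would roughly correspond to `df.dropDuplicates(["allocation_id"])` versus `df.dropDuplicates(["workspace", "allocation_id"])`, again with hypothetical column names.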