Spark DataFrame extraction API
A configurable PySpark service for pulling and deduplicating allocation data across workspaces, with the config-driven query layer that keeps it flexible.
- Spark/K8s runtime
- Config-driven
The problem
Allocation data had to be pulled, filtered, and deduplicated the same way across many workspaces, with per-workspace behavior set by configuration rather than by code changes.
What I built
A PySpark service running on Kubernetes, with queries and filters defined in config, so adding a workspace means adding a config entry, not shipping a deployment. The work included tracking down a subtle row-count discrepancy between two extraction paths. It traced back to three causes: the paths read from different config sources, one path was missing an allocation filter, and the paths deduplicated in a different order (globally vs. per workspace). It is the kind of bug that only shows up at the boundary between two systems that each look correct on their own.
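To make the deduplication-order part of that discrepancy concrete, here is a minimal sketch in plain Python rather than PySpark, so it runs without a cluster. Everything in it is an assumption made for illustration: the column names (`workspace`, `allocation_id`, `status`), the shape of the `WORKSPACES` config, and the function names are hypothetical and are not taken from the actual service. In PySpark the `dedup` step would roughly correspond to `DataFrame.dropDuplicates` on a subset of columns.

```python
# Hypothetical rows: allocation_id 1 appears twice in ws_a and once in ws_b.
ROWS = [
    {"workspace": "ws_a", "allocation_id": 1, "status": "active"},
    {"workspace": "ws_a", "allocation_id": 1, "status": "active"},
    {"workspace": "ws_b", "allocation_id": 1, "status": "active"},
    {"workspace": "ws_b", "allocation_id": 2, "status": "released"},
]

# Hypothetical config: one entry per workspace, no code change to add one.
WORKSPACES = {
    "ws_a": {"filters": {"status": "active"}, "dedup_keys": ["allocation_id"]},
    "ws_b": {"filters": {"status": "active"}, "dedup_keys": ["allocation_id"]},
}


def apply_filters(rows, filters):
    """Keep rows whose columns equal every value in the filter mapping."""
    return [r for r in rows if all(r.get(k) == v for k, v in filters.items())]


def dedup(rows, keys):
    """Keep the first row seen for each combination of key columns."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[k] for k in keys)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def extract_per_workspace(rows, workspaces):
    """Path A: scope to a workspace, filter, then deduplicate within it."""
    result = []
    for name, cfg in workspaces.items():
        scoped = [r for r in rows if r["workspace"] == name]
        result.extend(dedup(apply_filters(scoped, cfg["filters"]), cfg["dedup_keys"]))
    return result


def extract_global_dedup(rows, workspaces, keys):
    """Path B: deduplicate across all workspaces first, then scope and filter."""
    deduped = dedup(rows, keys)
    result = []
    for name, cfg in workspaces.items():
        scoped = [r for r in deduped if r["workspace"] == name]
        result.extend(apply_filters(scoped, cfg["filters"]))
    return result


per_workspace_count = len(extract_per_workspace(ROWS, WORKSPACES))
global_count = len(extract_global_dedup(ROWS, WORKSPACES, ["allocation_id"]))
print(per_workspace_count, global_count)  # prints "2 1"
```

Both paths use the same filters and the same dedup keys, and each looks reasonable in isolation. The global path drops the `ws_b` copy of allocation 1 before the workspace scoping happens, so `ws_b` ends up with no rows and the totals disagree.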