
Spark DataFrame extraction API

A configurable PySpark service for pulling and deduplicating allocation data across workspaces, with the config-driven query layer that keeps it flexible.

Runtime: Spark/K8s
Config: driven
PySpark, Spark on Kubernetes, Python, SQL

The problem

Allocation data needed to be pulled, filtered, and deduplicated consistently across many workspaces, with behavior driven by configuration rather than by per-workspace code changes.

What I built

A PySpark service running on Kubernetes, with queries and filters defined in config so that adding a workspace means adding a config entry, not shipping a deployment. The work included tracking down a subtle row-count discrepancy between two extraction paths. It traced to three causes: the two paths read different config sources, one was missing an allocation filter, and they differed in deduplication order (global vs. per-workspace). It is the kind of bug that only shows up at the boundary between two systems that each look correct on their own.
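To make the discrepancy concrete, here is a minimal plain-Python sketch of how filter presence and deduplication order can change row counts. It is not the service's code: the rows, the `CONFIG` layout, the column names (`workspace`, `allocation_id`, `status`), and every function name are hypothetical, chosen only for illustration.

```python
# Hypothetical sample rows; not real allocation data.
ROWS = [
    {"workspace": "ws_a", "allocation_id": 1, "status": "active"},
    {"workspace": "ws_a", "allocation_id": 1, "status": "active"},    # duplicate inside ws_a
    {"workspace": "ws_b", "allocation_id": 1, "status": "active"},    # same id, other workspace
    {"workspace": "ws_b", "allocation_id": 2, "status": "released"},  # dropped by the filter
]

# Hypothetical config: one entry per workspace, no code change needed to add one.
CONFIG = {
    "ws_a": {"filters": {"status": "active"}, "dedup_keys": ["allocation_id"]},
    "ws_b": {"filters": {"status": "active"}, "dedup_keys": ["allocation_id"]},
}


def apply_filters(rows, filters):
    """Keep rows whose columns equal every configured filter value."""
    return [r for r in rows if all(r.get(k) == v for k, v in filters.items())]


def dedup(rows, keys):
    """Keep the first row seen for each distinct combination of key columns."""
    seen, out = set(), []
    for r in rows:
        key = tuple(r[k] for k in keys)
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out


def extract_per_workspace(rows, config):
    """Filter and deduplicate inside each workspace, then combine the results."""
    out = []
    for ws, cfg in config.items():
        ws_rows = [r for r in rows if r["workspace"] == ws]
        out.extend(dedup(apply_filters(ws_rows, cfg["filters"]), cfg["dedup_keys"]))
    return out


def extract_global(rows, config, keys):
    """Filter per workspace, combine, then deduplicate once across all workspaces."""
    combined = []
    for ws, cfg in config.items():
        ws_rows = [r for r in rows if r["workspace"] == ws]
        combined.extend(apply_filters(ws_rows, cfg["filters"]))
    return dedup(combined, keys)


# Same input and same filters; only the deduplication scope differs.
per_ws_rows = extract_per_workspace(ROWS, CONFIG)
global_rows = extract_global(ROWS, CONFIG, ["allocation_id"])

# A second config source that omits the filter changes the count again.
CONFIG_NO_FILTER = {ws: {**cfg, "filters": {}} for ws, cfg in CONFIG.items()}
unfiltered_rows = extract_per_workspace(ROWS, CONFIG_NO_FILTER)

print(len(per_ws_rows), len(global_rows), len(unfiltered_rows))  # 2 1 3
```

Each path is internally consistent, yet the three produce 2, 1, and 3 rows from the same input. In PySpark the deduplication step would more likely be expressed with something like `DataFrame.dropDuplicates` on the key columns, with the workspace either included in the key or not; that mapping is an assumption about how such a service might be written, not a description of this one.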