Staff Infrastructure Engineer
As a Staff Cloud Infrastructure Engineer, you will design and own the multi-cluster orchestration layer that allows our computer vision platform to scale reliably and cost-effectively. The workload processes hundreds of matches per month and is extremely bursty: most demand concentrates into narrow weekend windows, creating GPU capacity challenges that cannot be solved with standard, off-the-shelf approaches.
The foundational infrastructure is already in place (multiple multi-zonal Kubernetes clusters), but critical challenges remain:
- replacing manual workload placement with automated, priority-aware scheduling
- making spot/preemptible GPUs resilient without human intervention
- guaranteeing capacity for high-importance workloads
- evolving toward a GPU strategy capable of supporting several times our current scale
This is a high-autonomy role with full ownership of the infrastructure stack. Your mission is to turn operational bottlenecks into robust, automated systems.
Who You Are
- You’ve operated Kubernetes in production at real scale, including multi-cluster or multi-region systems
- You deeply understand GPU infrastructure: scheduling constraints, quotas, spot dynamics, and why GPUs behave differently than CPUs
- You’ve built automation that removed manual operational work
- You’re fluent in infrastructure as code (Pulumi preferred, Terraform acceptable)
- You treat cost as an engineering constraint, not a post-hoc finance problem
- You’ve handled bursty, unpredictable workloads and the capacity planning challenges they create
- You can analyse complex workflow DAGs and reason about failure modes and bottlenecks
- You’ve experienced enough operational pain to know what must be automated now versus later
Core Technical Requirements
- Deep Kubernetes expertise (managed Kubernetes experience strongly preferred), including multi-cluster operations
- Hands-on experience with cloud GPU infrastructure: spot instances, capacity planning, instance selection
- Strong infrastructure-as-code skills
- Experience with workflow orchestration systems (Argo Workflows ideal; alternatives acceptable)
- Proven ability to design for reliability: SLAs, graceful degradation, automated recovery
- Strong understanding of cloud cost optimisation at the architectural level
- Effective use of LLM-based developer tools to accelerate infrastructure and platform work
Nice to Have
- Experience designing multi-cloud systems or cloud abstraction layers
- Background in video processing, media pipelines, or broadcast infrastructure
- Familiarity with ML infrastructure patterns and GPU scheduling for inference vs. batch workloads
- Contributions to open-source infrastructure projects (Kubernetes, Argo, or related tooling)