Senior Staff+ Software Engineer, Kubernetes Platform
Ant*** · London, ENG, United Kingdom
London, ENG, United Kingdom · Exp: 12+ yrs · 325,000-485,000 GBP/yearly · Hybrid
Remuneration
325,000-485,000 GBP/yearly
Location
London, ENG, United Kingdom
Visa sponsorship
Sponsors visa
Job summary
Ant*** is seeking a Senior Staff+ Software Engineer for its Kubernetes Platform team to manage and scale large Kubernetes clusters across multiple cloud providers. The role involves owning and extending the Kubernetes scheduler, scaling the control plane, and building core cluster services to ensure reliable training of frontier AI models. The ideal candidate will have significant distributed systems experience and deep Kubernetes expertise to address the challenges of operating at extreme scale.
Benefits
- Competitive compensation
- Benefits
- Optional equity donation matching
- Generous vacation
- Parental leave
- Flexible working hours
Qualifications
- Significant software engineering experience building and operating production distributed systems
- Proficiency in at least one systems-appropriate language (e.g., Go, Python, Rust, or C++)
- Deep, hands-on Kubernetes experience (beyond user level) in scheduler, controllers, apiserver, or operating large multi-tenant clusters
- Demonstrated ability to debug complex issues across the stack, from API behavior to node and network-level root causes
- Track record of designing for reliability, correctness, and clear failure semantics in systems other engineers depend on
- Strong written and verbal communication; comfort building consensus with internal stakeholders
- Experience with Kubernetes internals or contributions: kube-scheduler / scheduling framework, apiserver, etcd, client-go, controller-runtime, or similar
- Experience building or operating cluster schedulers or batch systems (e.g., Kueue, Volcano, Slurm, or in-house equivalents)
- Background scaling control planes or coordination systems (etcd, ZooKeeper, Consul, or large DNS/service-mesh deployments)
- Familiarity with ML infrastructure: GPUs, TPUs, or Trainium; gang scheduling; topology-aware placement; collective networking such as NCCL
- Experience with GCP and/or AWS, including GKE/EKS internals and Infrastructure as Code
- Low-level systems experience such as Linux kernel tuning, cgroups, or eBPF
- 12+ years of relevant industry experience, including time leading large, ambiguous infrastructure projects
- Bachelor’s degree or an equivalent combination of education, training, and/or experience
- Field of study relevant to the role as demonstrated through coursework, training, or professional experience
Responsibilities
- Own, operate, and extend the Kubernetes scheduler for Ant***'s accelerator fleets, including custom scheduling plugins and policies for gang scheduling, topology awareness, and preemption
- Scale the Kubernetes control plane (apiserver, etcd, controller-manager) to support clusters beyond typical limits, and identify bottlenecks
- Design, build, and operate core cluster services such as service discovery that every workload depends on
- Build and maintain custom controllers, operators, and CRDs
- Partner with research, training, and inference teams to understand workload shapes and translate requirements into platform capabilities
- Collaborate with cloud providers on required features and escalations
- Participate in on-call, lead incident response, and design processes (postmortems, runbooks, SLOs) to prevent repeat failures
Skills
AWS, Consul, C++, EKS, etcd, GCP, GKE, Go, Kubernetes, Linux, Python, Rust, ZooKeeper
Degrees
Bachelor’s degree
Relocation
No