HPC Infrastructure Site Reliability Engineer
Radiant · London, ENG, United Kingdom
London, ENG, United Kingdom · Exp: 8+ yrs · Remote
Remuneration
Not specified
Location
London, ENG, United Kingdom
Visa sponsorship
Not specified
Job summary
We are seeking a senior Infrastructure Site Reliability Engineer with deep experience operating large-scale distributed systems and recent hands-on expertise in high-performance computing (HPC) and AI infrastructure. This operations-first SRE role involves working in a 24/7/365 on-call environment, responsible for ensuring reliability, performance, and continuous improvement of mission-critical infrastructure. The ideal candidate will have a strong background across bare metal, networking, storage, virtualization, and orchestration, alongside deep HPC experience.
Benefits
- Exposure to industry-leading GPU and AI infrastructure
- Opportunities to grow alongside a rapidly scaling global business
- Collaborative, inclusive, and supportive engineering culture
- Real ownership and the ability to influence operational excellence
- Work that sits at the intersection of people, performance, and technology
- Modern, flexible, globally connected workplace with ambitious goals
Qualifications
- deep experience operating large-scale distributed systems
- recent hands-on expertise in high-performance computing (HPC) and AI infrastructure
- experience in large-scale, globally distributed or multi-site infrastructure environments
- specialized in GPU-accelerated HPC systems
- strong breadth across bare metal, networking, storage, virtualization, and orchestration
- deep HPC experience including NVIDIA GPU ecosystems, RDMA networking (RoCE and InfiniBand), and performance validation and benchmarking
- strong Linux and distributed systems expertise
- 8+ years of experience in Site Reliability Engineering, Infrastructure Engineering, or similar roles in large-scale distributed production environments operating a 24/7 support model
- 2–3+ years of recent experience in HPC and/or AI infrastructure, including GPU-based compute environments at scale
- strong Linux expertise (preferably Ubuntu), including deep systems administration and production troubleshooting
- proven experience in performance tuning across compute systems, including kernel, BIOS/firmware, and storage subsystem optimization
- strong hands-on experience with bare-metal infrastructure and out-of-band management tooling (IPMI, iLO, iDRAC, Redfish or equivalent)
- solid networking fundamentals including TCP/IP, DNS, DHCP, VLANs, routing, and switching, with exposure to high-performance networking environments
- exposure to NVIDIA GPU ecosystems, including CUDA-based workloads, GPU-accelerated compute environments, and the NVIDIA AI reference architecture
- familiarity with high-performance networking technologies such as InfiniBand and RoCE
- strong experience with infrastructure automation and scripting (e.g. Bash, Python, Ansible or similar IaC/tooling approaches)
- understanding of observability principles and practical use of monitoring and telemetry systems (e.g. Prometheus, Grafana or equivalents)
- understanding of workload schedulers and running workloads across multiple systems in parallel
- practical experience with at least one parallel storage platform
- experience working in ITIL-aligned environments, including Incident, Major Incident, Problem, and Change Management
Responsibilities
- ensure reliability, performance, and continuous improvement of mission-critical infrastructure
- investigate complex, cross-layer issues spanning GPU compute, networking, storage, and orchestration
- perform performance evaluation, testing, and operational acceptance of new HPC environments
- work across hardware, network, and software layers to validate readiness of high-density GPU infrastructure and support safe, predictable deployment at scale
- play a central role in continuous service improvement (CSI): reducing operational toil, increasing automation, and improving reliability, consistency, and operational efficiency across the platform
- strengthen observability, refine operational workflows, and eliminate repetitive or failure-prone processes
- help shape future infrastructure design and deployment approaches, feeding operational insight back into infrastructure engineering decisions and ensuring production learnings directly influence next-generation HPC platform evolution
- operate and improve high-density AI/HPC infrastructure in a 24/7 production environment
- participate in a 24x7x365 on-call rotation, supporting mission-critical systems and incident response
- troubleshoot complex issues across compute, networking, storage, and orchestration layers in GPU-accelerated environments
- lead performance evaluation, testing, and operational acceptance of new HPC infrastructure before production release
- drive continuous service improvement (CSI), reducing toil through automation, tooling, and process refinement
- build and maintain infrastructure automation and tooling (IaC and scripting) to improve reliability and operational efficiency
- optimize Linux systems for performance, including kernel, BIOS/firmware, and storage tuning for HPC workloads
- configure and operate bare-metal infrastructure using IPMI, iLO, iDRAC, Redfish, and related tooling
- partner with infrastructure tooling and observability teams to improve telemetry, alerting, and system visibility at scale
- own ITIL-aligned processes across Incident, Major Incident, Problem, and Change Management, ensuring strong execution and continuous improvement
- lead root cause analysis and ensure corrective actions are implemented and automated where possible
- play a key role in designing and delivering future HPC cluster and site builds, shaping global consistency and operational standards
- collaborate closely with Platform Engineering, Network Engineering, Infrastructure Tooling, and Data Centre Operations to improve reliability and deployment quality
Skills
- Ansible
- Bash
- Grafana
- Kubernetes
- Linux
- Prometheus
- Python
- Ubuntu
Certifications
- LPIC Certifications
- ITIL Foundation level qualification
Degrees
Bachelor's or Master's degree in Computer Science, Engineering, or a related field
Work schedule
- 24/7/365 on-call environment
- 24x7x365 on-call rotation
Industry
- GPU-as-a-Service provider
- AI
- HPC
Relocation
No