Infrastructure Tooling & Observability Engineer (UK)
Radiant · London, ENG, United Kingdom
London, ENG, United Kingdom · Exp: 6-8 yrs · Remote
Remuneration
Not specified
Location
London, ENG, United Kingdom
Visa sponsorship
Not specified
Job summary
Seeking an Infrastructure Tooling & Observability Engineer to act as a key engineering force within the global Infrastructure Operations organization. This role involves translating high-level reliability objectives into scalable, production-ready systems that directly improve the resilience, efficiency, and performance of global infrastructure. The engineer will design and build internal control planes, intelligent observability, and automation systems, and contribute to Continual Service Improvement initiatives.
Qualifications
- Degree in Computer Science/Software Engineering, or equivalent experience.
- 6–8 years of experience in infrastructure engineering, DevOps, SRE, and/or software engineering roles, with a strong focus on operational systems.
- Proven experience in building or maintaining production infrastructure tooling or platform systems in a recent DevOps or software engineering role.
- Experience working in large-scale or distributed infrastructure environments (hyperscale, enterprise, or similarly complex systems).
- Strong programming ability in at least one of: Ruby (Rails), Go, or similar systems languages, with willingness and ability to work across multiple languages and codebases.
- Hands-on experience with infrastructure automation tools such as Ansible and orchestration platforms such as AWX.
- Strong experience with observability systems, including the Grafana stack (Prometheus, Loki, Mimir, and Grafana Alloy).
- Familiarity with low-level telemetry and infrastructure protocols such as SNMP and syslog.
- Experience working with Kubernetes or similar orchestration platforms in production environments.
- Understanding of API design and integration patterns, particularly REST-based services and service-to-service communication.
- Experience building and maintaining CI/CD pipelines, including GitHub Actions and self-hosted runners.
- Strong understanding of operational reliability concepts, including monitoring, alerting, capacity planning, and incident response.
- Comfortable working closely with SRE, Platform Engineering, and infrastructure teams to translate operational needs into maintainable software systems.
Responsibilities
- Design, build, and evolve internal tooling and observability platforms for large-scale infrastructure operations across distributed environments.
- Develop systems that convert high-volume telemetry (logs, metrics, events) into actionable insights, improving visibility, alerting quality, and operational decision-making.
- Translate SRE reliability requirements into scalable, production-ready software solutions, including automation for incident detection, prevention, and remediation.
- Drive automation across infrastructure operations, reducing manual effort in areas such as environment provisioning, cluster onboarding, inventory management, and lifecycle workflows.
- Build tooling for capacity management, performance testing, benchmarking, and automated collection and analysis of results.
- Contribute to Continual Service Improvement (CSI) initiatives by identifying operational inefficiencies and delivering durable engineering solutions.
- Work closely with SRE and infrastructure engineering teams to embed observability and reliability into core platform workflows.
- Interface with Platform Engineering teams to ensure tooling aligns with broader orchestration and infrastructure strategy.
- Integrate and extend existing systems written in Ruby/Rails and Go, contributing to a consistent and maintainable engineering ecosystem.
- Develop and maintain automation workflows using Ansible and AWX.
- Support CI/CD-driven operational tooling, including GitHub Actions and self-hosted runners.
Skills
Ansible, GitHub, GitHub Actions, Go, Grafana, Kubernetes, Loki, Mimir, Prometheus, REST, Ruby
Certifications
Kubernetes Certified Administrator, CompTIA+ Security Qualifications, LPI/LPIC certification
Degrees
Computer Science, Software Engineering
Languages
Ruby, Go
Relocation
No