Staff Site Reliability Engineer - Site Experience

reddit · United Kingdom · Remote
United Kingdom · Exp: 8+ yrs · Remote
Remuneration
Not specified
Location
United Kingdom · Remote
Visa sponsorship
Not specified

Job summary

Reddit is seeking a Staff Site Reliability Engineer to lead reliability engineering initiatives for critical user-facing systems at internet scale. This role involves partnering with product and infrastructure teams to improve availability, latency, scalability, and operational excellence across Reddit’s most business-critical experiences. It is a highly technical leadership position focused on large-scale distributed systems and complex reliability challenges.

Benefits

  • Global Benefit programs
  • Family Planning Support
  • Gender-Affirming Care
  • Mental Health & Coaching Benefits
  • Group Personal Pension Scheme with Employer match
  • Private Medical and Dental Scheme
  • Income Replacement Programs
  • Bike to Work scheme
  • Flexible Vacation
  • Paid Volunteer Time Off
  • Generous Paid Parental Leave

Qualifications

  • 8+ years of experience in Site Reliability Engineering, Infrastructure Engineering, or related roles operating large-scale distributed systems
  • Strong collaboration and communication skills with the ability to influence technical direction across teams
  • Strong experience supporting high-traffic, user-facing production environments
  • Deep understanding of distributed systems, networking, Linux systems, or cloud-native architectures
  • Experience designing highly available systems with strong operational and reliability practices
  • Strong programming skills in languages such as Go, Python, or similar
  • Strong understanding of observability systems including metrics, logging, tracing, and alerting
  • Experience improving reliability through SLOs, automation, incident management, and performance optimization
  • Demonstrated ability to troubleshoot complex issues across applications, infrastructure, networking, and services
  • Experience operating systems at internet-scale traffic volumes
  • Experience with Kubernetes, containers, cloud infrastructure, and modern deployment platforms
  • Familiarity with technologies such as Prometheus, Grafana, OpenTelemetry, Envoy, Kafka, ClickHouse, Cassandra, Redis, or similar distributed infrastructure technologies
  • Experience with CDN optimization, edge reliability, traffic engineering, or global infrastructure
  • Contributions to open source software or participation in technical communities
  • Experience leading large-scale incident response and operational transformation initiatives

Responsibilities

  • Lead Reliability Engineering for User Experience
      • Drive reliability, scalability, and operational excellence for critical user-facing systems and services
      • Improve performance and resiliency across APIs, content delivery, feed generation, search, messaging, and real-time experiences
  • Architect for Scale
      • Partner with product and infrastructure engineering teams to design highly available and performant systems under massive global load
      • Guide architectural decisions around failover, redundancy, graceful degradation, traffic management, and capacity planning
  • Reduce Operational Risk
      • Identify systemic risks and reliability bottlenecks across services, dependencies, deployments, and infrastructure
      • Build proactive mitigation strategies and drive engineering improvements to reduce incidents and improve service health
  • Drive Automation
      • Eliminate repetitive operational work through automation and tooling
      • Build systems to improve deployment safety, incident response, remediation workflows, and reliability guardrails
  • Incident Management
      • Lead complex incident response efforts across engineering teams
      • Drive blameless postmortems, identify root causes, and ensure sustainable long-term fixes are implemented
  • Influence Engineering Standards
      • Define and champion best practices around reliability engineering, SLIs/SLOs, capacity management, release engineering, and operational maturity
  • Mentor and Multiply Impact
      • Provide technical leadership and mentorship to engineers across SRE and software engineering teams
      • Shape reliability culture and raise operational excellence across the organization

Skills

Cassandra, ClickHouse, Envoy, Go, Grafana, Kafka, Kubernetes, Linux, OpenTelemetry, Prometheus, Python, Redis

Languages

Go, Python

Relocation

No