Staff Site Reliability Engineer - Site Experience

Reddit · United Kingdom · Remote

United Kingdom · Exp: 8+ yrs · Remote

Remuneration

Not specified

Location

United Kingdom · Remote

Visa sponsorship

Not specified

Job summary

Reddit is seeking a Staff Site Reliability Engineer to lead reliability engineering initiatives for critical user-facing systems at internet scale. This role involves partnering with product and infrastructure teams to improve availability, latency, scalability, and operational excellence across Reddit’s most business-critical experiences. It is a highly technical leadership position focused on large-scale distributed systems and complex reliability challenges.

Benefits

Global Benefit programs
Family Planning Support
Gender-Affirming Care
Mental Health & Coaching Benefits
Group Personal Pension Scheme with Employer match
Private Medical and Dental Scheme
Income Replacement Programs
Bike to Work scheme
Flexible Vacation
Paid Volunteer Time Off
Generous Paid Parental Leave

Qualifications

8+ years of experience in Site Reliability Engineering, Infrastructure Engineering, or related roles operating large-scale distributed systems
Strong collaboration and communication skills with the ability to influence technical direction across teams
Strong experience supporting high-traffic, user-facing production environments
Deep understanding of distributed systems, networking, Linux systems, or cloud-native architectures
Experience designing highly available systems with strong operational and reliability practices
Strong programming skills in languages such as Go, Python, or similar
Strong understanding of observability systems including metrics, logging, tracing, and alerting
Experience improving reliability through SLOs, automation, incident management, and performance optimization
Demonstrated ability to troubleshoot complex issues across applications, infrastructure, networking, and services
Experience operating systems at internet-scale traffic volumes
Experience with Kubernetes, containers, cloud infrastructure, and modern deployment platforms
Familiarity with technologies such as Prometheus, Grafana, OpenTelemetry, Envoy, Kafka, ClickHouse, Cassandra, Redis, or similar distributed infrastructure technologies
Experience with CDN optimization, edge reliability, traffic engineering, or global infrastructure
Contributions to open source software or participation in technical communities
Experience leading large-scale incident response and operational transformation initiatives

Responsibilities

Lead Reliability Engineering for User Experience
Drive reliability, scalability, and operational excellence for critical user-facing systems and services
Improve performance and resiliency across APIs, content delivery, feed generation, search, messaging, and real-time experiences

Architect for Scale
Partner with product and infrastructure engineering teams to design highly available and performant systems under massive global load
Guide architectural decisions around failover, redundancy, graceful degradation, traffic management, and capacity planning

Reduce Operational Risk
Identify systemic risks and reliability bottlenecks across services, dependencies, deployments, and infrastructure
Build proactive mitigation strategies and drive engineering improvements to reduce incidents and improve service health

Drive Automation
Eliminate repetitive operational work through automation and tooling
Build systems to improve deployment safety, incident response, remediation workflows, and reliability guardrails

Incident Management
Lead complex incident response efforts across engineering teams
Drive blameless postmortems, identify root causes, and ensure sustainable long-term fixes are implemented

Influence Engineering Standards
Define and champion best practices around reliability engineering, SLIs/SLOs, capacity management, release engineering, and operational maturity

Mentor and Multiply Impact
Provide technical leadership and mentorship to engineers across SRE and software engineering teams
Shape reliability culture and raise operational excellence across the organization

Skills

Cassandra · ClickHouse · Envoy · Go · Grafana · Kafka · Kubernetes · Linux · OpenTelemetry · Prometheus · Python · Redis

Languages

Go · Python

Relocation
