Staff Site Reliability Engineer - Site Experience
reddit · United Kingdom · Remote
United Kingdom · Exp: 8+ yrs · Remote
Remuneration
Not specified
Location
United Kingdom · Remote
Visa sponsorship
Not specified
Job summary
Reddit is seeking a Staff Site Reliability Engineer to lead reliability engineering initiatives for critical user-facing systems at internet scale. This role involves partnering with product and infrastructure teams to improve availability, latency, scalability, and operational excellence across Reddit’s most business-critical experiences. It is a highly technical leadership position focused on large-scale distributed systems and complex reliability challenges.
Benefits
- Global Benefit programs
- Family Planning Support
- Gender-Affirming Care
- Mental Health & Coaching Benefits
- Group Personal Pension Scheme with Employer match
- Private Medical and Dental Scheme
- Income Replacement Programs
- Bike to Work scheme
- Flexible Vacation
- Paid Volunteer Time Off
- Generous Paid Parental Leave
Qualifications
- 8+ years of experience in Site Reliability Engineering, Infrastructure Engineering, or related roles operating large-scale distributed systems
- Strong collaboration and communication skills with the ability to influence technical direction across teams
- Strong experience supporting high-traffic, user-facing production environments
- Deep understanding of distributed systems, networking, Linux systems, or cloud-native architectures
- Experience designing highly available systems with strong operational and reliability practices
- Strong programming skills in languages such as Go, Python, or similar
- Strong understanding of observability systems including metrics, logging, tracing, and alerting
- Experience improving reliability through SLOs, automation, incident management, and performance optimization
- Demonstrated ability to troubleshoot complex issues across applications, infrastructure, networking, and services
- Experience operating systems at internet-scale traffic volumes
- Experience with Kubernetes, containers, cloud infrastructure, and modern deployment platforms
- Familiarity with technologies such as Prometheus, Grafana, OpenTelemetry, Envoy, Kafka, ClickHouse, Cassandra, Redis, or similar distributed infrastructure technologies
- Experience with CDN optimization, edge reliability, traffic engineering, or global infrastructure
- Contributions to open source software or participation in technical communities
- Experience leading large-scale incident response and operational transformation initiatives
Responsibilities
- Lead Reliability Engineering for User Experience
  - Drive reliability, scalability, and operational excellence for critical user-facing systems and services
  - Improve performance and resiliency across APIs, content delivery, feed generation, search, messaging, and real-time experiences
- Architect for Scale
  - Partner with product and infrastructure engineering teams to design highly available and performant systems under massive global load
  - Guide architectural decisions around failover, redundancy, graceful degradation, traffic management, and capacity planning
- Reduce Operational Risk
  - Identify systemic risks and reliability bottlenecks across services, dependencies, deployments, and infrastructure
  - Build proactive mitigation strategies and drive engineering improvements to reduce incidents and improve service health
- Drive Automation
  - Eliminate repetitive operational work through automation and tooling
  - Build systems to improve deployment safety, incident response, remediation workflows, and reliability guardrails
- Incident Management
  - Lead complex incident response efforts across engineering teams
  - Drive blameless postmortems, identify root causes, and ensure sustainable long-term fixes are implemented
- Influence Engineering Standards
  - Define and champion best practices around reliability engineering, SLIs/SLOs, capacity management, release engineering, and operational maturity
- Mentor and Multiply Impact
  - Provide technical leadership and mentorship to engineers across SRE and software engineering teams
  - Shape reliability culture and raise operational excellence across the organization
Skills
Cassandra · ClickHouse · Envoy · Go · Grafana · Kafka · Kubernetes · Linux · OpenTelemetry · Prometheus · Python · Redis
Languages
Go · Python
Relocation
No