Senior Site Reliability Engineer (m/f/d)

Flip App · Home Office, Germany

Home Office, Germany · Experience: 5+ years · Remote

Remuneration

Not specified

Location

Home Office, Germany

Visa sponsorship

Not specified

Job summary

As a Senior Site Reliability Engineer in the Platform Squad, you will have end-to-end responsibility for critical reliability areas and drive technical direction. You will lead architectural decisions, mentor team members, and continuously raise the bar for reliability. This role is for engineers with a proven track record in building and operating highly available, high-throughput systems, seeking senior-level technical ownership and impact through deep engineering work.

Benefits

E-Gym-Wellpass membership
Job-Rad leasing
Team events
Culture days
Workation in European countries

Qualifications

5+ years of hands-on experience as a Site Reliability Engineer, Platform Engineer, DevOps Engineer, Infrastructure Engineer, Cloud Engineer, or Backend Engineer with a strong infrastructure focus
Proven track record in building and operating highly available, high-throughput systems in production
Deep production-level experience with Kubernetes on a major hyperscaler
Extensive experience with modern observability stacks (e.g., Prometheus, Mimir, VictoriaMetrics, Dash0, Loki, ELK) and a clear understanding of SLIs, SLOs, and error budgets
Solid software development skills in Go (strongly preferred) or Python
Hands-on experience with Infrastructure as Code (Pulumi, OpenTofu, Terraform), GitOps (e.g., Argo CD), and CI/CD pipeline design
Proven ability to lead complex infrastructure initiatives from design to production, including writing RFCs and driving architectural decisions
Experience mentoring engineers and raising the technical level within a team
Confidence in taking end-to-end responsibility for critical incidents, and the ability to translate insights into sustainable technical improvements
Strong communication skills and fluent English
Willingness to participate in on-call duties to ensure platform reliability
Experience with rolling out production-ready API gateways using Gateway API (e.g., Envoy Gateway)
Experience operating multi-cluster service meshes (e.g., Cilium, Linkerd, Istio)
Experience deploying and maintaining Kubernetes operators (e.g., Strimzi, CNPG)
Experience operating highly available PostgreSQL in production

Responsibilities

Take end-to-end responsibility for critical reliability areas
Drive technical direction within the squad
Lead architectural decisions on the platform
Mentor team members
Continuously raise the bar for reliability within the team
Co-own architecture and development of cloud infrastructure on Azure and Kubernetes clusters
Drive resilience strategy, including global scaling, zero-downtime deployments, rollback mechanisms, and disaster recovery
Ensure 24/7 platform availability
Further develop the observability stack (Loki, Grafana, Tempo, Mimir)
Improve the IaC platform to enable self-service for engineering teams
Lead major platform incidents
Conduct blameless post-mortems
Translate insights into lasting improvements
Coach team members
Lead RFCs and design reviews
Help engineers develop into stronger SREs
Collaborate with the squad to define platform direction

Skills

Argo CD, Azure, Cilium, Envoy, Go, Grafana, Istio, Kubernetes, Linkerd, Loki, Mimir, OpenTofu, PostgreSQL, Prometheus, Pulumi, Python, Tempo, Terraform, VictoriaMetrics

Languages

English

Work schedule

On-call

Industry

Retail, Manufacturing, Logistics

Relocation
