Senior Site Reliability Engineer (m/f/d)
Flip GmbH · Home Office, Germany
Experience: 5+ years · Remote
Remuneration
Not specified
Location
Home Office, Germany
Visa sponsorship
Not specified
Job summary
As a Senior Site Reliability Engineer in the Platform Squad, you will own critical reliability domains end-to-end, drive technical direction, lead architectural decisions, and mentor teammates. This role is for an engineer with a proven track record of building and operating high-throughput, highly available systems, seeking senior-level technical ownership and impact through deep engineering work.
Benefits
- E-Gym-Wellpass membership
- Job bike leasing
- Regular team events
- Culture days
- Working abroad in the European Union
Qualifications
- 5+ years of hands-on experience as a Site Reliability Engineer (SRE), Platform Engineer, DevOps Engineer, Infrastructure Engineer, Cloud Engineer, or Backend Engineer with a strong infrastructure focus
- proven track record building and operating high-throughput, highly available systems in production
- deep, production-level experience with Kubernetes on any hyperscaler
- strong experience with modern observability stacks (e.g., Prometheus, Mimir, VictoriaMetrics, Dash0, Loki, ELK) and a clear point of view on SLIs, SLOs, and error budgets
- solid software development skills in Go (strongly preferred) or Python
- hands-on experience with Infrastructure as Code (Pulumi, OpenTofu, Terraform), GitOps (e.g., Argo CD), and CI/CD pipeline design
- demonstrated ability to lead complex infrastructure initiatives from design to production, including writing RFCs and driving architecture decisions within the team
- experience mentoring engineers and raising the technical bar within a team
- comfortable owning major incidents end-to-end and turning learnings into systemic change
- strong communication skills and business-fluent English
- willingness to participate in on-call rotations to ensure platform reliability
- experience rolling out production-ready API gateways with Gateway API (e.g., Envoy Gateway)
- experience operating multi-cluster service meshes (e.g., Cilium, Linkerd, Istio)
- experience deploying and maintaining Kubernetes Operators (e.g., Strimzi, CNPG)
- experience operating highly available PostgreSQL in production
Responsibilities
- own critical reliability domains end-to-end
- drive technical direction within the squad
- lead architectural decisions on the platform
- mentor teammates
- continuously raise the reliability bar within the team
- drive the architecture and evolution of cloud infrastructure on Azure and Kubernetes clusters
- define resilience strategy for global scaling, zero-downtime deployments, rollback mechanisms, and disaster recovery
- ensure platform availability around the clock
- improve the LGTM stack (Loki, Grafana, Tempo, Mimir)
- improve the IaC Platform to eliminate toil and enable self-service for engineering teams
- lead platform-related major incidents
- drive blameless post-mortems for the squad
- translate findings into systemic improvements
- coach teammates
- run RFCs and design reviews within the team
- help engineers grow into stronger SREs
- partner with the squad to define the platform's direction
Skills
Argo CD, Azure, Cilium, Envoy, Go, Grafana, Istio, Kubernetes, Linkerd, Loki, Mimir, OpenTofu, PostgreSQL, Prometheus, Pulumi, Python, Tempo, Terraform, VictoriaMetrics
Languages
Go, Python, English
Work schedule
On-call rotations
Relocation
No