
Senior Site Reliability Engineer (m/f/d)


Flip GmbH · Home Office, Germany

Home Office, Germany · Experience: 5+ years · Remote


Remuneration

Not specified

Location

Home Office, Germany

Visa sponsorship

Not specified

Job summary

As a Senior Site Reliability Engineer in the Platform Squad, you will own critical reliability domains end-to-end, drive technical direction, lead architectural decisions, and mentor teammates. This role is for an engineer with a proven track record of building and operating high-throughput, highly available systems, seeking senior-level technical ownership and impact through deep engineering work.

Benefits

E-Gym-Wellpass membership
Job bike leasing
Regular team events
Culture days
Working abroad in the European Union

Qualifications

5+ years of hands-on experience as a Site Reliability Engineer (SRE), Platform Engineer, DevOps Engineer, Infrastructure Engineer, Cloud Engineer, or Backend Engineer with a strong infrastructure focus
proven track record building and operating high-throughput, highly available systems in production
deep, production-level experience with Kubernetes on any hyperscaler
strong experience with modern observability stacks (e.g., Prometheus, Mimir, VictoriaMetrics, Dash0, Loki, ELK) and a clear point of view on SLIs, SLOs, and error budgets
solid software development skills in Go (strongly preferred) or Python
hands-on experience with Infrastructure as Code (Pulumi, OpenTofu, Terraform), GitOps (e.g., Argo CD), and CI/CD pipeline design
demonstrated ability to lead complex infrastructure initiatives from design to production, including writing RFCs and driving architecture decisions within the team
experience mentoring engineers and raising the technical bar within a team
comfortable owning major incidents end-to-end and turning learnings into systemic change
strong communication skills and business-fluent English
willingness to participate in on-call rotations to ensure platform reliability
experience rolling out production-ready API gateways with Gateway API (e.g., Envoy Gateway)
experience operating multi-cluster service meshes (e.g., Cilium, Linkerd, Istio)
experience deploying and maintaining Kubernetes Operators (e.g., Strimzi, CNPG)
experience operating highly available PostgreSQL in production

Responsibilities

own critical reliability domains end-to-end
drive technical direction within the squad
lead architectural decisions on the platform
mentor teammates
continuously raise the reliability bar within the team
drive the architecture and evolution of cloud infrastructure on Azure and Kubernetes clusters
define resilience strategy for global scaling, zero-downtime deployments, rollback mechanisms, and disaster recovery
ensure platform availability around the clock
improve the LGTM stack (Loki, Grafana, Tempo, Mimir)
improve the IaC Platform to eliminate toil and enable self-service for engineering teams
lead platform-related major incidents
drive blameless post-mortems for the squad
translate findings into systemic improvements
coach teammates
run RFCs and design reviews within the team
help engineers grow into stronger SREs
partner with the squad to define the platform's direction

Skills

Argo CD, Azure, Cilium, Envoy, Go, Grafana, Istio, Kubernetes, Linkerd, Loki, Mimir, OpenTofu, PostgreSQL, Prometheus, Pulumi, Python, Tempo, Terraform, VictoriaMetrics

Languages

Go, Python, English

Work schedule

On-call rotations
