
Senior Site Reliability Engineer (m/f/d)

Flip GmbH · Home Office, Germany
Experience: 5+ years · Remote
Remuneration: Not specified
Location: Home Office, Germany
Visa sponsorship: Not specified

Job summary

As a Senior Site Reliability Engineer in the Platform Squad, you will own critical reliability domains end-to-end, drive technical direction, lead architectural decisions, and mentor teammates. This role is for an engineer with a proven track record of building and operating high-throughput, highly available systems who is looking for senior-level technical ownership and impact through deep engineering work.

Benefits

  • E-Gym-Wellpass membership
  • Job bike leasing
  • Regular team events
  • Culture days
  • Working abroad in the European Union

Qualifications

  • 5+ years of hands-on experience as a Site Reliability Engineer (SRE), Platform Engineer, DevOps Engineer, Infrastructure Engineer, Cloud Engineer, or Backend Engineer with a strong infrastructure focus
  • proven track record building and operating high-throughput, highly available systems in production
  • deep, production-level experience with Kubernetes on any hyperscaler
  • strong experience with modern observability stacks (e.g., Prometheus, Mimir, VictoriaMetrics, Dash0, Loki, ELK) and a clear point of view on SLIs, SLOs, and error budgets
  • solid software development skills in Go (strongly preferred) or Python
  • hands-on experience with Infrastructure as Code (Pulumi, OpenTofu, Terraform), GitOps (e.g., Argo CD), and CI/CD pipeline design
  • demonstrated ability to lead complex infrastructure initiatives from design to production, including writing RFCs and driving architecture decisions within the team
  • experience mentoring engineers and raising the technical bar within a team
  • comfortable owning major incidents end-to-end and turning learnings into systemic change
  • strong communication skills and business-fluent English
  • willingness to participate in on-call rotations to ensure platform reliability
  • experience rolling out production-ready API gateways with Gateway API (e.g., Envoy Gateway)
  • experience operating multi-cluster service meshes (e.g., Cilium, Linkerd, Istio)
  • experience deploying and maintaining Kubernetes Operators (e.g., Strimzi, CNPG)
  • experience operating highly available PostgreSQL in production

Responsibilities

  • own critical reliability domains end-to-end
  • drive technical direction within the squad
  • lead architectural decisions on the platform
  • mentor teammates
  • continuously raise the reliability bar within the team
  • drive the architecture and evolution of cloud infrastructure on Azure and Kubernetes clusters
  • define resilience strategy for global scaling, zero-downtime deployments, rollback mechanisms, and disaster recovery
  • ensure platform availability around the clock
  • improve the LGTM stack (Loki, Grafana, Tempo, Mimir)
  • improve the IaC platform to eliminate toil and enable self-service for engineering teams
  • lead platform-related major incidents
  • drive blameless post-mortems for the squad
  • translate findings into systemic improvements
  • coach teammates
  • run RFCs and design reviews within the team
  • help engineers grow into stronger SREs
  • partner with the squad to define the platform's direction

Skills

Argo CD, Azure, Cilium, Envoy, Go, Grafana, Istio, Kubernetes, Linkerd, Loki, Mimir, OpenTofu, PostgreSQL, Prometheus, Pulumi, Python, Tempo, Terraform, VictoriaMetrics

Languages

Go, Python, English

Work schedule

On-call rotations

Relocation

No