Senior Site Reliability Engineer (m/f/d)

Flip App · Home office, Germany · Remote
Experience: 5+ years
Remuneration: Not specified
Location: Home office, Germany
Visa sponsorship: Not specified

Job summary

As a Senior Site Reliability Engineer in the Platform Squad, you will have end-to-end responsibility for critical reliability areas and drive technical direction. You will lead architectural decisions, mentor team members, and continuously raise the bar for reliability. This role is for engineers with a proven track record of building and operating highly available, high-throughput systems who seek senior-level technical ownership and impact through deep engineering work.

Benefits

  • E-Gym-Wellpass membership
  • Job-Rad leasing
  • Team events
  • Culture days
  • Workation in European countries

Qualifications

  • 5+ years of hands-on experience as a Site Reliability Engineer, Platform Engineer, DevOps Engineer, Infrastructure Engineer, Cloud Engineer, or Backend Engineer with a strong infrastructure focus
  • Proven track record in building and operating highly available, high-throughput systems in production
  • Deep production-level experience with Kubernetes on a major hyperscaler
  • Extensive experience with modern observability stacks (e.g., Prometheus, Mimir, VictoriaMetrics, Dash0, Loki, ELK) and a clear perspective on SLIs, SLOs, and error budgets
  • Solid software development skills in Go (strongly preferred) or Python
  • Hands-on experience with Infrastructure as Code (Pulumi, OpenTofu, Terraform), GitOps (e.g., Argo CD), and CI/CD pipeline design
  • Proven ability to lead complex infrastructure initiatives from design to production, including writing RFCs and driving architectural decisions
  • Experience mentoring engineers and raising the technical level within a team
  • Confidence in taking end-to-end responsibility for critical incidents, and the ability to translate insights into sustainable technical improvements
  • Strong communication skills and fluent English
  • Willingness to participate in on-call duties to ensure platform reliability
  • Experience rolling out production-ready API gateways using the Kubernetes Gateway API (e.g., Envoy Gateway)
  • Experience operating multi-cluster service meshes (e.g., Cilium, Linkerd, Istio)
  • Experience deploying and maintaining Kubernetes operators (e.g., Strimzi, CNPG)
  • Experience operating highly available PostgreSQL in production

Responsibilities

  • Take end-to-end responsibility for critical reliability areas
  • Drive technical direction within the squad
  • Lead architectural decisions on the platform
  • Mentor team members
  • Continuously raise the bar for reliability within the team
  • Co-own architecture and development of cloud infrastructure on Azure and Kubernetes clusters
  • Drive resilience strategy, including global scaling, zero-downtime deployments, rollback mechanisms, and disaster recovery
  • Ensure 24/7 platform availability
  • Further develop the observability stack (Loki, Grafana, Tempo, Mimir)
  • Improve the IaC platform to enable self-service for engineering teams
  • Lead major platform incidents
  • Conduct blameless post-mortems
  • Translate incident and post-mortem insights into lasting improvements
  • Coach team members
  • Lead RFCs and design reviews
  • Help engineers develop into stronger SREs
  • Collaborate with the squad to define platform direction

Skills

Argo CD, Azure, Cilium, Envoy, Go, Grafana, Istio, Kubernetes, Linkerd, Loki, Mimir, OpenTofu, PostgreSQL, Prometheus, Pulumi, Python, Tempo, Terraform, VictoriaMetrics

Languages

English

Work schedule

On-call

Industry

Retail, Manufacturing, Logistics

Relocation

No