Senior Site Reliability Engineer / Cloud Operations Engineer (m/f/d)

Thales · Berlin, BE, Germany · Hybrid
Remuneration
Not specified
Location
Berlin, BE, Germany
Visa sponsorship
Not specified

Job summary

Thales, in partnership with Google Cloud, is establishing a new, 100% German business unit to provide a locally and autonomously operated “Trusted Cloud” for German companies and public administrations. This role involves operating and maintaining mission-critical sovereign cloud services with high availability targets, monitoring service health, and resolving complex production incidents. The position requires participation in a 24/7 on-call rotation and collaboration with international teams to drive long-term solutions and improve platform reliability.

Qualifications

  • Several years of experience in Site Reliability Engineering, Cloud Operations, DevOps, Platform Engineering, Infrastructure Engineering, Production Support, Network Operations (NOC), Technical Operations, or a comparable role.
  • Experience operating and supporting business-critical production systems with demanding uptime and availability requirements.
  • Strong troubleshooting and incident management skills in complex technical environments.
  • Experience monitoring, operating, and maintaining distributed systems, cloud platforms, infrastructure services, or large-scale applications.
  • Familiarity with reliability engineering concepts, observability, monitoring, alerting, incident response, and root cause analysis.
  • Experience working with automation, scripting, operational tooling, or Infrastructure-as-Code approaches.
  • Strong analytical and problem-solving skills with a structured and methodical approach.
  • Professional proficiency in both German and English.
  • Willingness to participate in a regular on-call rotation.
  • Curiosity, adaptability, and a strong desire to learn and work with hyperscale cloud technologies.

Responsibilities

  • Operate and maintain mission-critical sovereign cloud services with availability targets of 99.99% and above.
  • Monitor service health, reliability, scalability, latency, and performance using Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
  • Investigate, troubleshoot, and resolve complex production incidents across large-scale distributed cloud environments.
  • Participate in a structured 24/7 on-call rotation (approximately one week every six weeks) to ensure continuous service availability.
  • Collaborate with Site Reliability Engineers, Cloud Infrastructure Specialists, and Product Experts across international teams to mitigate incidents and drive long-term solutions.
  • Build a deep understanding of Google's cloud technologies and distributed systems through an intensive training program covering technologies such as Borg, Colossus, Spanner, and other core GCP components.
  • Drive operational excellence by creating and maintaining technical documentation, standardizing incident response procedures, and continuously improving operational playbooks.
  • Lead and contribute to post-incident reviews, root cause analyses, and the implementation of preventive measures to improve platform reliability.
  • Identify opportunities for automation and contribute to improving operational efficiency, scalability, compliance, and service reliability.
  • Support the operation of highly secure cloud environments designed to meet stringent regulatory and sovereignty requirements.
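
For a sense of scale, the 99.99% availability target named above can be turned into an allowed-downtime figure. The following is a minimal sketch, not part of the original posting: the function name `downtime_budget_minutes` and the 30-day and 365-day windows are assumptions chosen for illustration, and the actual measurement windows and SLO definitions used by the team are not stated here.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Return the downtime (in minutes) permitted by an availability target.

    availability_pct: target availability as a percentage, e.g. 99.99
    period_days: length of the measurement window in days (assumed value)
    """
    period_minutes = period_days * 24 * 60
    # The permitted unavailability is the complement of the target.
    return period_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Hypothetical windows: a 30-day month and a 365-day year.
    for days in (30, 365):
        budget = downtime_budget_minutes(99.99, days)
        print(f"{days} days at 99.99%: {budget:.1f} min of downtime allowed")
```

Under these assumptions, a 99.99% target leaves roughly 4.3 minutes of downtime per 30 days, or about 52.6 minutes per year.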

Skills

Cloud Spanner, GCP

Languages

German, English

Work schedule

24/7 on-call rotation (approximately one week every six weeks)

Relocation

No