Senior Site Reliability Engineer / Cloud Operations Engineer (m/f/d)

Thales · Berlin, BE, Deutschland

Berlin, BE, Deutschland · Hybrid

Remuneration

Not specified

Location

Berlin, BE, Deutschland

Visa sponsorship

Not specified

Job summary

Thales, in partnership with Google Cloud, is establishing a new, 100% German business unit to provide a locally and autonomously operated “Trusted Cloud” for German companies and public administrations. This role involves operating and maintaining mission-critical sovereign cloud services with high availability targets, monitoring service health, and resolving complex production incidents. The position requires participation in a 24/7 on-call rotation and collaboration with international teams to drive long-term solutions and improve platform reliability.

Qualifications

Several years of experience in Site Reliability Engineering, Cloud Operations, DevOps, Platform Engineering, Infrastructure Engineering, Production Support, Network Operations (NOC), Technical Operations, or a comparable role.
Experience operating and supporting business-critical production systems with demanding uptime and availability requirements.
Strong troubleshooting and incident management skills in complex technical environments.
Experience monitoring, operating, and maintaining distributed systems, cloud platforms, infrastructure services, or large-scale applications.
Familiarity with reliability engineering concepts, observability, monitoring, alerting, incident response, and root cause analysis.
Experience working with automation, scripting, operational tooling, or Infrastructure-as-Code approaches.
Strong analytical and problem-solving skills with a structured and methodical approach.
Professional proficiency in both German and English.
Willingness to participate in a regular on-call rotation.
Curiosity, adaptability, and a strong desire to learn and work with hyperscale cloud technologies.

Responsibilities

Operate and maintain mission-critical sovereign cloud services with availability targets of 99.99% and above.
Monitor service health, reliability, scalability, latency, and performance using Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
Investigate, troubleshoot, and resolve complex production incidents across large-scale distributed cloud environments.
Participate in a structured 24/7 on-call rotation (approximately one week every six weeks) to ensure continuous service availability.
Collaborate with Site Reliability Engineers, Cloud Infrastructure Specialists, and Product Experts across international teams to mitigate incidents and drive long-term solutions.
Build a deep understanding of Google's cloud technologies and distributed systems through an intensive training program covering technologies such as Borg, Colossus, Spanner, and other core GCP components.
Drive operational excellence by creating and maintaining technical documentation, standardizing incident response procedures, and continuously improving operational playbooks.
Lead and contribute to post-incident reviews, root cause analyses, and the implementation of preventive measures to improve platform reliability.
Identify opportunities for automation and contribute to improving operational efficiency, scalability, compliance, and service reliability.
Support the operation of highly secure cloud environments designed to meet stringent regulatory and sovereignty requirements.
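
To put the 99.99% availability target mentioned above into perspective, the following is a minimal sketch that converts an availability percentage into an allowed downtime budget. The function name and the 30-day and 365-day windows are illustrative assumptions and are not taken from the posting; actual SLO windows for the platform may be defined differently.

```python
# Illustrative sketch only: converts an availability target into allowed downtime.
# The measurement windows (30 and 365 days) are assumptions, not from the posting.
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_percent: float, days: int) -> float:
    """Return the downtime budget in minutes for a window of `days` days."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * days * MINUTES_PER_DAY


monthly = allowed_downtime_minutes(99.99, 30)
yearly = allowed_downtime_minutes(99.99, 365)
print(f"30-day budget: {monthly:.2f} min")
print(f"365-day budget: {yearly:.2f} min")
```

Under these assumptions, a 99.99% target leaves roughly 4.3 minutes of downtime per 30 days, or about 52.6 minutes per year.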

Skills

Cloud Spanner · GCP

Languages

German · English

Work schedule

24/7 on-call rotation (approximately one week every six weeks) · Regular on-call rotation
