Senior Site Reliability Engineer / Cloud Operations Engineer (m/f/d)
Thales · Berlin, BE, Deutschland
Berlin, BE, Deutschland · Hybrid
Remuneration
Not specified
Location
Berlin, BE, Deutschland
Visa sponsorship
Not specified
Job summary
Thales, in partnership with Google Cloud, is establishing a new, 100% German business unit to provide a locally and autonomously operated “Trusted Cloud” for German companies and public administrations. This role involves operating and maintaining mission-critical sovereign cloud services with high availability targets, monitoring service health, and resolving complex production incidents. The position requires participation in a 24/7 on-call rotation and collaboration with international teams to drive long-term solutions and improve platform reliability.
Qualifications
- Several years of experience in Site Reliability Engineering, Cloud Operations, DevOps, Platform Engineering, Infrastructure Engineering, Production Support, Network Operations (NOC), Technical Operations, or a comparable role.
- Experience operating and supporting business-critical production systems with demanding uptime and availability requirements.
- Strong troubleshooting and incident management skills in complex technical environments.
- Experience monitoring, operating, and maintaining distributed systems, cloud platforms, infrastructure services, or large-scale applications.
- Familiarity with reliability engineering concepts, observability, monitoring, alerting, incident response, and root cause analysis.
- Experience working with automation, scripting, operational tooling, or Infrastructure-as-Code approaches.
- Strong analytical and problem-solving skills with a structured and methodical approach.
- Professional proficiency in both German and English.
- Willingness to participate in a regular on-call rotation.
- Curiosity, adaptability, and a strong desire to learn and work with hyperscale cloud technologies.
Responsibilities
- Operate and maintain mission-critical sovereign cloud services with availability targets of 99.99% and above.
- Monitor service health, reliability, scalability, latency, and performance using Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
- Investigate, troubleshoot, and resolve complex production incidents across large-scale distributed cloud environments.
- Participate in a structured 24/7 on-call rotation (approximately one week every six weeks) to ensure continuous service availability.
- Collaborate with Site Reliability Engineers, Cloud Infrastructure Specialists, and Product Experts across international teams to mitigate incidents and drive long-term solutions.
- Build a deep understanding of Google's cloud technologies and distributed systems through an intensive training program covering technologies such as Borg, Colossus, Spanner, and other core GCP components.
- Drive operational excellence by creating and maintaining technical documentation, standardizing incident response procedures, and continuously improving operational playbooks.
- Lead and contribute to post-incident reviews, root cause analyses, and the implementation of preventive measures to improve platform reliability.
- Identify opportunities for automation and contribute to improving operational efficiency, scalability, compliance, and service reliability.
- Support the operation of highly secure cloud environments designed to meet stringent regulatory and sovereignty requirements.
Skills
Cloud Spanner, GCP
Languages
German, English
Work schedule
24/7 on-call rotation (approximately one week every six weeks)
Relocation
No