(Senior) Site Reliability Engineer (m/f/d) - Platform & Agentic Operations

1KOMMA5° · Hamburg, HH, Germany

Hamburg, HH, Germany · Exp: 6+ yrs · Remote

Remuneration

Not specified

Location

Hamburg, HH, Germany

Visa sponsorship

Not specified

Job summary

1KOMMA5° is seeking a Senior Site Reliability Engineer to join its Platform team. The role focuses on using AI agents to eliminate developer friction, optimize CI/CD pipelines, and automate the resolution of code review and deployment bottlenecks. The Senior SRE will implement and improve monitoring, alerting, and incident response systems, design and maintain resilient infrastructure, and ensure high reliability for Heartbeat AI, the company's virtual power plant.

Benefits

EGYM Wellpass
Futurebens discounts
Job bike leasing

Qualifications

6+ years in SRE, DevOps, or Platform Engineering
Strong understanding and practical application of Site Reliability Engineering principles, methodologies, and best practices
Proficiency in programming/scripting languages such as Python, Go, or TypeScript
Practical understanding of integrating LLMs into automated workflows
Prior experience in incident management, post-incident reviews, and implementing improvements to prevent future incidents
Ability to troubleshoot complex technical issues systematically and effectively
Experience working with a public cloud provider, ideally Google Cloud Platform (GCP), and understanding of its observability services
A proactive approach to spotting problems, areas for improvement, and performance bottlenecks
Excellent communication skills to convey technical concepts and collaborate effectively with diverse teams
Very good knowledge of spoken and written English
Residency in Germany

Responsibilities

Implement and improve monitoring, alerting, and incident response systems and processes to ensure high reliability and meet defined SLOs
Design, build, and maintain resilient, scalable infrastructure utilizing SRE principles and best practices
Attend post-incident reviews, detect patterns, and contribute to continuous improvement efforts
Execute performance testing, analyze system bottlenecks, and formulate strategies for capacity planning
Build systems where CI/CD test failures serve as immediate context for agents, so that the agents can analyze logs, trace dependencies, and suggest or apply code fixes right away

Skills

Backstage
Datadog
GCP
GitHub
GitHub Actions
GKE
Go
Python
Terraform
TypeScript

Languages

English
German

Industry

Climate tech
