(Senior) Site Reliability Engineer (m/f/d) - Platform & Agentic Operations
1KOMMA5° · Hamburg, HH, Deutschland
Hamburg, HH, Deutschland · Exp: 6+ yrs · Remote
Remuneration
Not specified
Location
Hamburg, HH, Deutschland
Visa sponsorship
Not specified
Job summary
1KOMMA5° is seeking a Senior Site Reliability Engineer to join its Platform team. The role focuses on using AI agents to eliminate developer friction, optimize CI/CD pipelines, and automate the resolution of code review and deployment bottlenecks. The Senior SRE will implement and improve monitoring, alerting, and incident response systems, design and maintain resilient infrastructure, and ensure high reliability for the virtual power plant, Heartbeat AI.
Benefits
- EGYM Wellpass
- Futurebens discounts
- Job bike leasing
Qualifications
- 6+ years in SRE, DevOps, or Platform Engineering
- Strong understanding and practical application of Site Reliability Engineering principles, methodologies, and best practices
- Proficiency in programming/scripting languages such as Python, Go, or TypeScript
- Practical understanding of integrating LLMs into automated workflows
- Prior experience in incident management, post-incident reviews, and implementing improvements to prevent future incidents
- Ability to troubleshoot complex technical issues systematically and effectively
- Experience working with a public cloud provider, ideally Google Cloud Platform (GCP), and understanding of its observability services
- A proactive approach to spotting problems, areas for improvement, and performance bottlenecks
- Excellent communication skills to convey technical concepts and collaborate effectively with diverse teams
- Very good knowledge of spoken and written English
- Residency in Germany
Responsibilities
- Implement and improve monitoring, alerting, and incident response systems and processes to ensure high reliability and meet defined SLOs
- Design, build, and maintain resilient, scalable infrastructure utilizing SRE principles and best practices
- Attend post-incident reviews, detect patterns, and contribute to continuous improvement efforts
- Execute performance testing, analyze system bottlenecks, and formulate strategies for capacity planning
- Build systems where CI/CD test failures serve as immediate context for agents, enabling analysis of logs, tracing dependencies, and suggesting or applying instant code fixes
Skills
Backstage, Datadog, GCP, GitHub, GitHub Actions, GKE, Go, Python, Terraform, TypeScript
Languages
English, German
Industry
Climate tech
Relocation
No