Platform Operations Manager (DevOps & Site Reliability Engineering)
CareADHD · Canary Wharf, ENG, United Kingdom
Canary Wharf, ENG, United Kingdom · Exp: 8+ yrs · 70,000-75,000 GBP/yearly · Hybrid
Remuneration
70,000-75,000 GBP/yearly
Location
Canary Wharf, ENG, United Kingdom
Visa sponsorship
Not specified
Job summary
Care ADHD is seeking an experienced Platform Operations Lead to manage the reliability, availability, performance, and operational stability of its technology platforms. The role leads DevOps, Site Reliability Engineering (SRE), cloud infrastructure, and platform operations across teams in the UK and India. The successful candidate will define operational strategy, improve engineering practices, and stay hands-on with cloud infrastructure, automation, monitoring, incident response, and reliability engineering.
Benefits
- Competitive salary
- Hybrid working
- 25 days annual leave (plus UK public holidays)
- Team get-togethers
- A paid day off on your birthday
- Office equipment provided
- 500 stipend to set up your home office
- Pension contribution
Qualifications
- 8+ years of experience in DevOps, Platform Engineering, SRE, or Infrastructure Engineering.
- Proven experience leading operational or platform engineering teams.
- Strong experience managing distributed or offshore technical teams.
- Experience supporting business-critical production systems with high availability requirements.
- Experience operating cloud-native platforms in an AWS environment.
- Strong hands-on experience with AWS cloud infrastructure and services.
- Strong hands-on experience with CI/CD pipeline design and automation.
- Strong hands-on experience with Infrastructure as Code (Terraform or AWS CDK).
- Strong hands-on experience with Kubernetes and container orchestration.
- Strong hands-on experience with monitoring, logging, and observability platforms.
- Strong hands-on experience with incident management and operational support.
- Strong hands-on experience with Linux systems administration and networking fundamentals.
- Strong understanding of Site Reliability Engineering principles.
- Strong understanding of high availability and disaster recovery design.
- Strong understanding of platform scalability and resilience.
- Strong understanding of security and operational governance.
- Strong understanding of performance optimisation and capacity planning.
- Strong ownership mindset with the ability to lead operational stability and platform reliability across the organisation.
- Excellent communication and stakeholder management skills, particularly across distributed engineering teams.
- Calm and effective under pressure with strong incident management and troubleshooting capabilities.
Responsibilities
- Own the operational health, availability, and reliability of all production and non-production environments.
- Ensure platforms are monitored, maintained, and operational 24/7/365.
- Lead platform incident management, root cause analysis, and service recovery processes.
- Establish and improve operational readiness, resilience, and disaster recovery capabilities.
- Define and manage SLAs, SLOs, and operational performance metrics.
- Ensure high levels of platform uptime, stability, scalability, and security.
- Design, build, and maintain cloud infrastructure primarily within AWS.
- Lead infrastructure automation and Infrastructure as Code initiatives using Terraform or AWS CDK.
- Design and optimise CI/CD pipelines to support efficient, secure, and reliable software delivery.
- Improve deployment automation, release management, and environment consistency.
- Support engineering teams with platform tooling, deployment strategies, and operational best practices.
- Drive improvements in deployment reliability, infrastructure scalability, platform security, cost optimisation, and operational efficiency.
- Implement and maintain observability solutions including monitoring, logging, alerting, and tracing.
- Develop proactive approaches to incident prevention and operational resilience.
- Lead reliability engineering practices including capacity planning, performance monitoring, fault tolerance, and high availability design.
- Reduce operational toil through automation and self-service tooling.
- Establish strong incident response and post-incident review processes.
- Lead and mentor platform operations and DevOps engineers across the UK and India.
- Build a collaborative, accountable, and high-performing operational culture.
- Allocate and coordinate operational resources across projects and platform priorities.
Skills
AWS, AWS CDK, CloudWatch, Datadog, Docker, GitHub, GitHub Actions, GitLab, GitLab CI, Grafana, Jenkins, Kubernetes, AWS Lambda, Linux, Node.js, PagerDuty, PostgreSQL, Prometheus, Terraform, TypeScript
Industry
HealthTech
Relocation
No