Senior Site Reliability Engineer / Technical Architect
ERP Tech Solutions Ltd · Winnersh, ENG, United Kingdom
Winnersh, ENG, United Kingdom · Full time · Exp: 15+ yrs · 45,000-? £ · Onsite
Remuneration
45,000-? £
Location
Winnersh, ENG, United Kingdom
Visa sponsorship
Not specified
Job summary
We are seeking a highly experienced Senior Site Reliability Engineer / Technical Architect with strong hands-on expertise in cloud infrastructure, Kubernetes, platform engineering, automation, observability, and AI-assisted engineering. The ideal candidate will have deep experience designing, building, and operating reliable, scalable, and secure infrastructure on AWS and Azure, using Kubernetes, Terraform, CI/CD, GitOps, and monitoring platforms. The role requires strong ownership of production systems, incident management, automation, and infrastructure standards, along with close collaboration with engineering, security, and platform teams.
Benefits
Flexitime
Qualifications
- Strong experience in Site Reliability Engineering, DevOps, Cloud Infrastructure, or Platform Engineering.
- Hands-on experience with AWS services such as EC2, EKS, ECS, Lambda, RDS, S3, VPC, CloudFront, Route 53, IAM, KMS, WAF, and Secrets Manager.
- Experience with Azure services including AKS, Virtual Machines, Virtual Networks, Storage Accounts, Load Balancer, Azure Monitor, and Entra ID.
- Strong Kubernetes, Docker, Helm, Terraform, Ansible, and GitOps experience.
- Good scripting and automation skills using Python, Bash, or similar languages.
- Strong monitoring and observability experience with Datadog, Grafana, Prometheus, Loki, Tempo, OpenTelemetry, Splunk, or Nagios.
- Experience with incident response, production support, root cause analysis, capacity planning, cost optimisation, and reliability improvement.
- Good understanding of networking, DNS, DHCP, LDAP, load balancers, firewalls, CDN, VPN, and security controls.
- Experience working in regulated, high-availability, or large-scale production environments.
- Certified Kubernetes Administrator (required).
Responsibilities
- Design, build, and maintain scalable cloud infrastructure across AWS and Azure.
- Manage Kubernetes platforms (EKS and AKS), along with Helm, Argo CD, and GitOps workflows.
- Create reusable Terraform, Ansible, and automation patterns for infrastructure provisioning.
- Define and improve SLOs, SLIs, monitoring, alerting, dashboards, and incident response processes.
- Implement observability using tools such as Datadog, Grafana, Prometheus, Loki, Tempo, OpenTelemetry, Splunk, and related platforms.
- Improve platform reliability, reduce operational toil, and support root cause analysis during incidents.
- Support secure infrastructure access using IAM, Okta, Teleport, RBAC, MFA, TLS/PKI, Secrets Manager, and cloud security controls.
- Work with CI/CD tools such as Jenkins, GitLab CI, GitHub Actions, and Argo CD to improve deployment reliability.
- Support Linux, Windows Server, Active Directory, DNS, DHCP, LDAP, and Group Policy environments.
- Manage large-scale GPU/HPC workloads using SLURM and PySpark, operate anomaly detection pipelines, and provision bare-metal hosts with IPMI and PXE boot.
- Apply AI-assisted engineering tools such as Cursor, Claude Code, GitHub Copilot, AWS Bedrock, Ollama, Datadog Watchdog, and Grafana AI Agents to improve automation, troubleshooting, and delivery.
- Partner with engineering, security, and business teams to turn operational and regulatory requirements into practical platform standards.
Skills
AKS, Ansible, Argo CD, AWS, AWS KMS, Azure, Azure Monitor, Bash, CloudFront, Datadog, Docker, ECS, EKS, GitHub, GitHub Actions, GitLab, GitLab CI, Grafana, Helm, IAM, Jenkins, Kubernetes, AWS Lambda, Linux, Loki, Okta, OpenTelemetry, Prometheus, Python, RHEL, Route 53, S3, Secrets Manager, Splunk, Tempo, Terraform, Windows, Windows Server
Certifications
Certified Kubernetes Administrator, AWS Certified Solutions Architect, Red Hat Certified Engineer, Microsoft Certified Solutions Expert, CCNA Routing and Switching / Security
Languages
Python, Bash
Relocation
No