Platform / Site Reliability Engineer (UK)
TWG Global AI · London, ENG, United Kingdom
London, ENG, United Kingdom · Exp: 3-6 yrs · 90,000-100,000 GBP/yearly · Remote
Remuneration
90,000-100,000 GBP/yearly
Location
London, ENG, United Kingdom
Visa sponsorship
Not specified
Job summary
TWG Global is seeking a Platform / Site Reliability Engineer (SRE) to ensure the scalability, stability, and performance of data platforms and ML infrastructure. This role involves collaborating with data scientists, ML engineers, and platform vendors to deploy and monitor production systems, automate workflows, and reduce operational overhead.
Benefits
Bonus, Medical benefits, Financial benefits, Performance-based incentives
Qualifications
- 3–6 years of experience in DevOps, SRE, or backend engineering roles
- Proficient with Docker, Kubernetes, Terraform, GitLab/GitHub Actions, and Airflow
- Strong scripting in Python or Bash
- Familiarity with Linux environments
- Knowledge of observability stacks such as Prometheus, Grafana, ELK, or Datadog
- Familiarity with cloud platforms such as AWS, GCP, or Azure
- Strong documentation, problem-solving, and incident response skills
- Experience supporting ML/AI workflows using Palantir Foundry (preferred)
- Exposure to compliance frameworks like SOC 2, ISO 27001, or financial regulations (preferred)
- Knowledge of MLOps frameworks such as MLflow, Kubeflow, or SageMaker Pipelines (preferred)
- Ability to automate deployments, testing, and monitoring at scale (preferred)
Responsibilities
- Build and maintain infrastructure for real-time and batch ML workloads
- Implement observability tools for model performance and system uptime
- Design and manage CI/CD pipelines for applications
- Ensure high availability, disaster recovery, and rollback for production environments
- Manage access controls, secrets, and security policies with compliance and IT
- Troubleshoot incidents, lead postmortems, and drive root-cause resolution
- Provide 24/7 coverage across time zones with U.S. and international teams
Skills
Airflow, AWS, Azure, Bash, Datadog, Docker, GCP, GitHub, GitHub Actions, GitLab, Grafana, Kubernetes, Linux, Prometheus, Python, Terraform
Languages
Python, Bash
Work schedule
24/7 coverage
Industry
Financial services, Insurance, Technology, Media, Sports
Relocation
No