Platform / Site Reliability Engineer (UK)

TWG Global AI · London, ENG, United Kingdom
London, ENG, United Kingdom · Exp: 3-6 yrs · 90,000-100,000 GBP/yearly · Remote
Remuneration
90,000-100,000 GBP/yearly
Location
London, ENG, United Kingdom
Visa sponsorship
Not specified

Job summary

TWG Global is seeking a Platform / Site Reliability Engineer (SRE) to ensure the scalability, stability, and performance of data platforms and ML infrastructure. This role involves collaborating with data scientists, ML engineers, and platform vendors to deploy and monitor production systems, automate workflows, and reduce operational overhead.

Benefits

  • Bonus
  • Medical benefits
  • Financial benefits
  • Performance-based incentives

Qualifications

  • 3–6 years of experience in DevOps, SRE, or backend engineering roles
  • Proficiency with Docker, Kubernetes, Terraform, GitLab/GitHub Actions, and Airflow
  • Strong scripting in Python or Bash
  • Familiarity with Linux environments
  • Knowledge of observability stacks such as Prometheus, Grafana, ELK, or Datadog
  • Familiarity with cloud platforms such as AWS, GCP, or Azure
  • Strong documentation, problem-solving, and incident response skills
  • Experience supporting ML/AI workflows using Palantir Foundry (preferred)
  • Exposure to compliance frameworks like SOC 2, ISO 27001, or financial regulations (preferred)
  • Knowledge of MLOps frameworks such as MLflow, Kubeflow, or SageMaker Pipelines (preferred)
  • Ability to automate deployments, testing, and monitoring at scale (preferred)

Responsibilities

  • Build and maintain infrastructure for real-time and batch ML workloads
  • Implement observability tools for model performance and system uptime
  • Design and manage CI/CD pipelines for applications
  • Ensure high availability, disaster recovery, and rollback for production environments
  • Manage access controls, secrets, and security policies with compliance and IT
  • Troubleshoot incidents, lead postmortems, and drive root-cause resolution
  • Provide 24/7 coverage across time zones in coordination with U.S. and international teams

Skills

Airflow, AWS, Azure, Bash, Datadog, Docker, GCP, GitHub, GitHub Actions, GitLab, Grafana, Kubernetes, Linux, Prometheus, Python, Terraform

Languages

Python, Bash

Work schedule

24/7 coverage

Industry

Financial services, Insurance, Technology, Media, Sports

Relocation

No