Technology - ML Ops Engineer

Pharmacy2U Ltd · Leeds, ENG, United Kingdom · Hybrid
Remuneration
Not specified
Location
Leeds, ENG, United Kingdom
Visa sponsorship
No visa sponsorship
Applicants must prove they have the right to live in the UK.

Job summary

The ML Ops Engineer will drive the operation of production-grade Machine Learning and LLM services on Azure, ensuring models run as reliable, scalable, and high-performing systems. This role involves owning the end-to-end MLOps/LLMOps lifecycle, leading on CI/CD, deployment automation, monitoring, and incident response. Working closely with Data Science, the engineer will turn models into robust production services with strong governance, observability, and continuous optimisation.

Benefits

  • Competitive contributory pension
  • Occupational sick pay
  • Long-service awards
  • Refer-a-friend bonuses
  • Professional registration fees covered
  • Cycle to Work scheme
  • Green Car scheme
  • Enhanced maternity and paternity pay
  • Flexible hybrid working
  • Private healthcare insurance (Aviva)
  • Employee Assistance Programme
  • In-house mental health support
  • Discounted gym memberships via Blue Light Card
  • Regular health and wellbeing initiatives
  • CPD, training and professional development
  • 25 days annual leave, increasing with service
  • Buy and sell holiday scheme
  • Blue Light Card
  • Employee discount platform
  • Exclusive discounts at The Springs, Leeds

Qualifications

  • Strong Python engineering skills
  • Experience in ML frameworks such as scikit-learn, PyTorch, or TensorFlow
  • Familiarity with experiment tracking
  • Comfortable working in regulated environments
  • Understanding of privacy, auditability, change control, and handling sensitive data
  • Strong DevOps/SRE background, including CI/CD, Infrastructure as Code, monitoring and alerting, incident management, and reliability engineering
  • Hands-on experience with containerisation using tools such as Docker and Kubernetes (e.g., AKS)
  • Experience with debugging, performance tuning, and working with container registries
  • Experience working with Azure, including Azure Machine Learning (pipelines, registries, online and batch endpoints) and Azure Monitor or Log Analytics
  • Experience operationalising ML pipelines, including training, batch scoring, feature engineering workflows, and preventing training-serving skew
  • Experience implementing safe deployment practices such as blue/green or canary releases, supported by automated validation
  • Understanding of data contracts, schema evolution, and data quality practices
  • Ability to troubleshoot data drift and missing features

Responsibilities

  • Drive the operation of production-grade Machine Learning and LLM services on Azure
  • Ensure models run as reliable, scalable, and high-performing systems
  • Own the end-to-end MLOps/LLMOps lifecycle
  • Lead on CI/CD, deployment automation, monitoring, and incident response
  • Turn models into robust production services
  • Bring strong governance, observability, and continuous optimisation for fast, safe, and efficient delivery at scale
  • Design and operate CI/CD pipelines for ML models and LLM prompt-flows, covering build, test, validation, deployment, and rollback
  • Own model registration and promotion across environments, ensuring traceability, governance, and auditability
  • Implement safe deployment strategies (e.g., blue/green, canary, champion/challenger)
  • Package and deploy containerised inference services and batch pipelines, ensuring repeatability and rapid rollback
  • Run ML and LLM services as production-grade systems, defining SLOs/SLIs, dashboards, and alerting
  • Lead incident response for runtime issues, including triage, mitigation, recovery, and post-incident reviews
  • Develop and maintain operational runbooks covering restart, rollback, secret rotation, and safe-mode scenarios
  • Improve service resilience and reduce MTTR through automation (e.g., self-healing, retries, fallbacks, circuit breakers)
  • Implement monitoring for availability, latency, errors, resource usage, and job performance
  • Monitor data quality including freshness, volume, completeness, schema drift, and distribution changes
  • Monitor model performance, including drift and prediction distribution shifts, and track accuracy where labels exist
  • Instrument LLM services for token usage, latency, and safety signals, with clear visibility into cost, quotas, and risks
  • Manage prompts and workflows as code, including versioning, code reviews, and automated regression testing
  • Own production configuration for LLM deployments, including model updates, limits, and safeguards

Skills

  • AKS
  • Azure
  • Azure Monitor
  • Docker
  • Kubernetes
  • Python

Work schedule

  • Core hours principle: 09:30 - 16:00
  • Out-of-hours rota

Industry

  • Pharmacy
  • Digital healthcare

Relocation

No