Technology - ML Ops Engineer

Pharmacy2U Ltd · Leeds, ENG, United Kingdom

Leeds, ENG, United Kingdom · Hybrid

Remuneration

Not specified

Location

Leeds, ENG, United Kingdom

Visa sponsorship

No visa sponsorship

Applicants must prove they have the right to live in the UK.

Job summary

The ML Ops Engineer will drive the operation of production-grade Machine Learning and LLM services on Azure, ensuring models run as reliable, scalable, and high-performing systems. This role involves owning the end-to-end MLOps/LLMOps lifecycle, leading on CI/CD, deployment automation, monitoring, and incident response. Working closely with Data Science, the engineer will turn models into robust production services with strong governance, observability, and continuous optimisation.

Benefits

Competitive contributory pension
Occupational sick pay
Long-service awards
Refer-a-friend bonuses
Professional registration fees covered
Cycle to Work scheme
Green Car scheme
Enhanced maternity and paternity pay
Flexible hybrid working
Private healthcare insurance (Aviva)
Employee Assistance Programme
In-house mental health support
Discounted gym memberships via Blue Light Card
Regular health and wellbeing initiatives
CPD, training and professional development
25 days annual leave, increasing with service
Buy and sell holiday scheme
Blue Light Card
Employee discount platform
Exclusive discounts at The Springs, Leeds

Qualifications

Strong Python engineering skills
Experience in ML frameworks such as scikit-learn, PyTorch, or TensorFlow
Familiarity with experiment tracking
Comfortable working in regulated environments
Understanding of privacy, auditability, change control, and handling sensitive data
Strong DevOps/SRE background, including CI/CD, Infrastructure as Code, monitoring and alerting, incident management, and reliability engineering
Hands-on experience with containerisation using tools such as Docker and Kubernetes (e.g., AKS)
Experience with debugging, performance tuning, and working with container registries
Experience working with Azure, including Azure Machine Learning (pipelines, registries, online and batch endpoints) and Azure Monitor or Log Analytics
Experience operationalising ML pipelines, including training, batch scoring, feature engineering workflows, and preventing training-serving skew
Experience implementing safe deployment practices such as blue/green or canary releases, supported by automated validation
Understanding of data contracts, schema evolution, and data quality practices
Ability to troubleshoot data drift and missing features

Responsibilities

Drive the operation of production-grade Machine Learning and LLM services on Azure
Ensure models run as reliable, scalable, and high-performing systems
Own the end-to-end MLOps/LLMOps lifecycle
Lead on CI/CD, deployment automation, monitoring, and incident response
Turn models into robust production services
Bring strong governance, observability, and continuous optimisation for fast, safe, and efficient delivery at scale
Design and operate CI/CD pipelines for ML models and LLM prompt-flows, covering build, test, validation, deployment, and rollback
Own model registration and promotion across environments, ensuring traceability, governance, and auditability
Implement safe deployment strategies (e.g., blue/green, canary, champion/challenger)
Package and deploy containerised inference services and batch pipelines, ensuring repeatability and rapid rollback
Run ML and LLM services as production-grade systems, defining SLOs/SLIs, dashboards, and alerting
Lead incident response for runtime issues, including triage, mitigation, recovery, and post-incident reviews
Develop and maintain operational runbooks covering restart, rollback, secret rotation, and safe-mode scenarios
Improve service resilience and reduce MTTR through automation (e.g., self-healing, retries, fallbacks, circuit breakers)
Implement monitoring for availability, latency, errors, resource usage, and job performance
Monitor data quality including freshness, volume, completeness, schema drift, and distribution changes
Monitor model performance, including drift and prediction distribution shifts, and track accuracy where labels exist
Instrument LLM services for token usage, latency, and safety signals, with clear visibility into cost, quotas, and risks
Manage prompts and workflows as code, including versioning, code reviews, and automated regression testing
Own production configuration for LLM deployments, including model updates, limits, and safeguards

Skills

AKS
Azure
Azure Monitor
Docker
Kubernetes
Python

Work schedule

Core hours principle
09:30 - 16:00
Out-of-hours rota

Industry

Pharmacy
Digital healthcare

Relocation