Senior ML Infrastructure Engineer

Chemify Ltd · Glasgow, SCT, United Kingdom

Glasgow, SCT, United Kingdom · Remote

Remuneration

Not specified

Location

Glasgow, SCT, United Kingdom

Visa sponsorship

Not specified

Job summary

Chemify is seeking a Senior ML Infrastructure Engineer to build and operate the core platform that powers Chemify's machine learning and scientific AI computing workloads. The role spans distributed systems engineering, machine learning infrastructure, scientific computing, and platform engineering. The engineer will own the operational backbone of the ML platform, ensuring reliable pipeline execution across Kubernetes clusters, on-premises GPU infrastructure, and serverless compute environments.

Qualifications

Degree in Science, Engineering, or related field, or equivalent practical experience
Strong Python engineering skills
Experience operating workflow orchestration platforms
Strong Kubernetes platform experience
Experience with containerisation and CI/CD pipelines
Experience with cloud infrastructure such as AWS and GCP
Experience operating distributed systems in production
Strong Linux systems engineering skills

Responsibilities

Implement routing logic for dispatching workloads to compute backends
Maintain workflow reliability, including retries, dependency management, and failure recovery
Administer and support Linux servers, including security and scaling
Operate Kubernetes clusters for ML training, inference, and batch workloads
Maintain container build pipelines and GitOps deployment workflows
Optimise cluster scheduling, autoscaling, and GPU utilisation
Integrate orchestration systems with HPC job schedulers
Maintain execution paths for workloads on GPU clusters
Ensure artifacts and results from HPC jobs are captured and versioned
Operate model registry and experiment tracking platforms
Ensure training runs are reproducible and linked to code and datasets
Support promotion of models from staging to production
Implement dataset versioning and lineage tracking across ML pipelines
Ensure predictions are traceable to model versions and datasets
Maintain reproducible ML training pipelines
Develop platform CLI tools and pipeline templates
Maintain base container images for ML workloads
Improve developer workflows for ML engineers and scientists
Implement monitoring, logging, and alerting across orchestration systems
Maintain infrastructure as code for platform resources

Skills

Argo Workflows, AWS, GCP, Kubernetes, Linux, Python, Git, GitHub Actions
