Senior ML Infrastructure Engineer
Chemify Ltd · Glasgow, SCT, United Kingdom
Glasgow, SCT, United Kingdom · Remote
Remuneration
Not specified
Location
Glasgow, SCT, United Kingdom
Visa sponsorship
Not specified
Job summary
Chemify is seeking a Senior ML Infrastructure Engineer to build, enable, and operate the core platform that powers Chemify's machine learning and scientific AI computing workloads. The role spans distributed systems engineering, machine learning infrastructure, scientific computing, and platform engineering. The engineer will build and run the operational backbone of the ML platform, ensuring reliable pipeline execution across Kubernetes clusters, on-premise GPU infrastructure, and serverless compute environments.
Qualifications
- Degree in Science, Engineering, or related field, or equivalent practical experience
- Strong Python engineering skills
- Experience operating workflow orchestration platforms
- Strong Kubernetes platform experience
- Experience with containerisation and CI/CD pipelines
- Experience with cloud infrastructure such as AWS and GCP
- Experience operating distributed systems in production
- Strong Linux systems engineering skills
Responsibilities
- Implement routing logic for dispatching workloads to compute backends
- Maintain workflow reliability, including retries, dependency management, and failure recovery
- Administer and support Linux servers, including security and scaling
- Operate Kubernetes clusters for ML training, inference, and batch workloads
- Maintain container build pipelines and GitOps deployment workflows
- Optimise cluster scheduling, autoscaling, and GPU utilisation
- Integrate orchestration systems with HPC job schedulers
- Maintain execution paths for workloads on GPU clusters
- Ensure artifacts and results from HPC jobs are captured and versioned
- Operate model registry and experiment tracking platforms
- Ensure training runs are reproducible and linked to code and datasets
- Support promotion of models from staging to production
- Implement dataset versioning and lineage tracking across ML pipelines
- Ensure predictions are traceable to model versions and datasets
- Maintain reproducible ML training pipelines
- Develop platform CLI tools and pipeline templates
- Maintain base container images for ML workloads
- Improve developer workflows for ML engineers and scientists
- Implement monitoring, logging, and alerting across orchestration systems
- Maintain infrastructure as code for platform resources
Skills
Argo Workflows, AWS, GCP, Kubernetes, Linux, Python, Git, GitHub Actions
Relocation
No