MLOps Engineer
LUMAI · Oxford, ENG, United Kingdom
Oxford, ENG, United Kingdom · Exp: 5+ yrs · Remote
Remuneration
Not specified
Location
Oxford, ENG, United Kingdom
Visa sponsorship
Not specified
Job summary
Lumai is seeking an MLOps Engineer to design, build, and operate the infrastructure for taking AI models from research to silicon-validated production. This high-impact role involves working at the intersection of ML research, compiler stacks, and novel hardware, contributing to a breakthrough AI accelerator for data centers. The successful candidate will be crucial in enabling AI and hardware teams to move quickly and efficiently.
Benefits
- Highly Competitive Salary
- Share Option Scheme
- Pension Scheme
- Private Health Insurance
- Cycle to Work
- L&D Allowance
- Subsidised On-site Lunches
- 25 days paid holiday (plus bank holidays)
- Socials
Qualifications
- 5+ years of software or infrastructure engineering experience, with at least 2 years in an ML or AI-adjacent role
- Strong Python skills and familiarity with major ML frameworks (PyTorch or JAX); comfortable reading and modifying model code
- Hands-on experience building and operating ML pipelines in production: data pipelines, training orchestration, evaluation, and serving
- Experience with experiment tracking and model lifecycle management tools (MLflow, W&B, DVC, or similar)
- Solid understanding of containerisation (Docker) and orchestration (Kubernetes or Slurm) for distributed compute workloads
- Infrastructure-as-code mindset: Terraform, Ansible, or equivalent; CI/CD pipelines (GitHub Actions, Jenkins, or similar)
- Experience with hardware-accelerated compute (CUDA/GPU workflows, profiling, performance tuning) — even if not on custom silicon
- Strong debugging and observability skills: distributed tracing, logging, metrics dashboards
- Ability to work effectively in a fast-moving, ambiguous environment where the hardware and software are both being built simultaneously
- Experience with custom or novel accelerator hardware (FPGAs, ASICs, NPUs, or research chips)
- Familiarity with ML compiler stacks: MLIR, LLVM, TVM, XLA, or vendor-specific compilers (NVCC, TensorRT, etc.)
- Experience with model optimisation techniques: quantisation (INT8/INT4/FP8), pruning, distillation, or mixed-precision training
- Background in on-chip performance profiling and roofline analysis
- Exposure to chip bring-up workflows: running early software stacks on pre-silicon simulation or first-silicon hardware
- Contributions to open-source ML infrastructure or compiler tooling
- Experience in a deeptech, semiconductor, or hardware startup environment
Responsibilities
- Design and operate end-to-end ML pipelines: data ingest, training, evaluation, quantisation, and deployment onto custom AI accelerator hardware
- Build and maintain experiment tracking, model registry, and versioning infrastructure (e.g. MLflow, W&B, or equivalent) tuned to hardware-in-the-loop workflows
- Own CI/CD for ML: automated testing of model correctness, numerical accuracy, and on-chip performance after every change to models, compilers, or firmware
- Develop and maintain tooling for benchmarking model inference on custom silicon, including latency, throughput, power, and utilisation metrics
- Collaborate closely with ML researchers, compiler engineers, and hardware architects to identify and remove bottlenecks across the model-to-chip workflow
- Instrument and monitor production inference deployments; design alerting and rollback strategies appropriate to hardware-accelerated serving
- Manage compute resource scheduling across on-premises accelerator clusters and cloud (GPU/CPU) for training and simulation workloads
- Drive infrastructure-as-code practices: containerisation, orchestration (Kubernetes/Slurm), and reproducible environment management
- Contribute to the internal developer platform: self-service tooling, documentation, and runbooks that raise engineering productivity across the company
Skills
Ansible, Docker, GitHub, GitHub Actions, Jenkins, Kubernetes, Python, Terraform
Languages
Python
Industry
Deeptech, Semiconductor, Hardware startup
Relocation
No