Senior Site Reliability Engineer
Magnet Forensics · Canada
Canada · Full time · Exp: 5+ yrs · 110,000-160,000 CAD/yearly · Remote
Remuneration
110,000-160,000 CAD/yearly
Location
Canada
Visa sponsorship
Not specified
Job summary
Magnet Forensics is seeking a Senior Site Reliability Engineer to join its SaaS-Ops team. This role involves owning reliability and operational excellence for the company's highly available SaaS platform, which runs in a production Kubernetes environment. The engineer will work with AWS, infrastructure-as-code, and CI/CD best practices to drive secure architectures and improve automation and reliability.
Benefits
- Generous time off policies
- Competitive compensation
- Volunteer opportunities
- Reward and recognition programs
- Employee committees & resource groups
- Healthcare benefits
- Retirement benefits
Qualifications
- 5+ years of industry experience demonstrating growing depth in cloud infrastructure and SRE practices
- Experience managing production Kubernetes environments at scale, including cluster health, upgrades, and failure modes
- Experience responding to production incidents in high-stakes environments where downtime has real consequences
- Experience writing and maintaining Terraform at the module level, understanding state, dependencies, and operational burden of drift
- Experience operating in an environment that uses GitOps, with a good understanding of Helm chart organization, Argo CD app-of-apps patterns, or equivalent
- Ability to balance reactive operational work with proactive roadmap delivery, protecting time for improvements while maintaining production stability
- Experience with observability as a first-class discipline, including building meaningful dashboards, eliminating alert fatigue, and using metrics for operational decisions
- Experience contributing to security hardening in a regulated or compliance-adjacent environment (e.g., FedRAMP, SOC 2)
Responsibilities
- Own and operate production Kubernetes clusters (Amazon EKS), including upgrades, scaling, security hardening, and cluster lifecycle management
- Design, implement, and maintain infrastructure-as-code using Terraform
- Contribute to shared module libraries and enforce IaC standards across the team
- Manage and evolve Helm chart definitions and Argo CD GitOps workflows for multi-region SaaS deployments
- Operate and maintain observability infrastructure, including Grafana, alerts, dashboards, and log pipelines
- Eliminate noise and surface signal in observability data
- Contribute to pipeline reliability by identifying flaky stages, reducing build times, and improving developer experience across CI/CD pipelines
- Remediate security vulnerabilities (CVEs) in container images and infrastructure components
- Participate in compliance work, including FedRAMP support activities
- Develop and maintain runbooks, change management procedures, and operational documentation
- Ensure alignment with internal policies and frameworks such as ISO 27001, SOC 2, and NIST
- Contribute to AI-assisted tooling and automation (e.g., Claude-based Terraform agents, automated triage tools) as part of the team's operational efficiency roadmap
- Participate in the on-call incident response rotation
- Lead or support incident command during active production incidents, including root cause analysis and post-incident review
Skills
- Argo CD
- AWS
- EKS
- Grafana
- Helm
- Kubernetes
- Terraform
Work schedule
On-call rotation
Relocation
No