Senior Site Reliability Engineer

Magnet Forensics · Canada

Canada · Full time · Exp: 5+ yrs · 110,000-160,000 CAD/yearly · Remote

Remuneration

110,000-160,000 CAD/yearly

Location

Canada

Visa sponsorship

Not specified

Job summary

Magnet Forensics is seeking a Senior Site Reliability Engineer to join its SaaS-Ops team. The role owns reliability and operational excellence for the company's highly available SaaS platform, which runs in a production Kubernetes environment. The engineer will work with AWS, infrastructure-as-code, and CI/CD best practices to drive secure architectures and improve automation and reliability.

Benefits

Generous time off policies
Competitive compensation
Volunteer opportunities
Reward and recognition programs
Employee committees & resource groups
Healthcare benefits
Retirement benefits

Qualifications

5+ years of industry experience demonstrating growing depth in cloud infrastructure and SRE practices
Experience managing production Kubernetes environments at scale, including cluster health, upgrades, and failure modes
Experience responding to production incidents in high-stakes environments where downtime has real consequences
Experience writing and maintaining Terraform at the module level, with an understanding of state, dependencies, and the operational burden of drift
Experience operating in an environment that uses GitOps, with a good understanding of Helm chart organization, ArgoCD app-of-apps patterns, or equivalent
Ability to balance reactive operational work with proactive roadmap delivery, protecting time for improvements while maintaining production stability
Experience with observability as a first-class discipline, including building meaningful dashboards, eliminating alert fatigue, and using metrics for operational decisions
Experience contributing to security hardening in a regulated or compliance-adjacent environment (e.g., FedRAMP, SOC 2)

Responsibilities

Own and operate production Kubernetes clusters (Amazon EKS), including upgrades, scaling, security hardening, and cluster lifecycle management
Design, implement, and maintain infrastructure-as-code using Terraform
Contribute to shared module libraries and enforce IaC standards across the team
Manage and evolve Helm chart definitions and ArgoCD GitOps workflows for multi-region SaaS deployments
Operate and maintain observability infrastructure, including Grafana, alerts, dashboards, and log pipelines
Eliminate noise and surface signal in observability data
Contribute to pipeline reliability by identifying flaky stages, reducing build times, and improving developer experience across CI/CD pipelines
Remediate security vulnerabilities (CVEs) in container images and infrastructure components
Participate in compliance work, including FedRAMP support activities
Develop and maintain runbooks, change management procedures, and operational documentation
Ensure alignment with internal policies and frameworks such as ISO 27001, SOC 2, and NIST
Contribute to AI-assisted tooling and automation (e.g., Claude-based Terraform agents, automated triage tools) as part of the team's operational efficiency roadmap
Participate in on-call incident response rotation
Lead or support incident command during active production incidents, including root cause analysis and post-incident review

Skills

Argo CD
AWS
EKS
Grafana
Helm
Kubernetes
Terraform

Work schedule

On-call rotation

Relocation