Senior Site Reliability Engineer

Magnet Forensics · Canada
Canada · Full time · Exp: 5+ yrs · 110,000-160,000 CAD/yearly · Remote
Remuneration
110,000-160,000 CAD/yearly
Location
Canada
Visa sponsorship
Not specified

Job summary

Magnet Forensics is seeking a Senior Site Reliability Engineer to join its SaaS-Ops team. The role owns reliability and operational excellence for the company's highly available SaaS platform, which runs in a production Kubernetes environment. The engineer will work with AWS, infrastructure-as-code, and CI/CD best practices to drive secure architectures and improve automation and reliability.

Benefits

  • Generous time off policies
  • Competitive compensation
  • Volunteer opportunities
  • Reward and recognition programs
  • Employee committees & resource groups
  • Healthcare benefits
  • Retirement benefits

Qualifications

  • 5+ years of industry experience demonstrating growing depth in cloud infrastructure and SRE practices
  • Experience managing production Kubernetes environments at scale, including cluster health, upgrades, and failure modes
  • Experience responding to production incidents in high-stakes environments where downtime has real consequences
  • Experience writing and maintaining Terraform at the module level, understanding state, dependencies, and operational burden of drift
  • Experience operating in an environment that uses GitOps, with a good understanding of Helm chart organization, ArgoCD app-of-apps patterns, or equivalent
  • Ability to balance reactive operational work with proactive roadmap delivery, protecting time for improvements while maintaining production stability
  • Experience with observability as a first-class discipline, including building meaningful dashboards, eliminating alert fatigue, and using metrics for operational decisions
  • Experience contributing to security hardening in a regulated or compliance-adjacent environment (e.g., FedRAMP, SOC 2)

Responsibilities

  • Own and operate production Kubernetes clusters (Amazon EKS), including upgrades, scaling, security hardening, and cluster lifecycle management
  • Design, implement, and maintain infrastructure-as-code using Terraform
  • Contribute to shared module libraries and enforce IaC standards across the team
  • Manage and evolve Helm chart definitions and ArgoCD GitOps workflows for multi-region SaaS deployments
  • Operate and maintain observability infrastructure, including Grafana, alerts, dashboards, and log pipelines
  • Eliminate noise and surface signal in observability data
  • Contribute to pipeline reliability by identifying flaky stages, reducing build times, and improving developer experience across CI/CD pipelines
  • Remediate security vulnerabilities (CVEs) in container images and infrastructure components
  • Participate in compliance work, including FedRAMP support activities
  • Develop and maintain runbooks, change management procedures, and operational documentation
  • Ensure alignment with internal policies and frameworks such as ISO 27001, SOC 2, and NIST
  • Contribute to AI-assisted tooling and automation (e.g., Claude-based Terraform agents, automated triage tools) as part of the team's operational efficiency roadmap
  • Participate in the on-call incident response rotation
  • Lead or support incident command during active production incidents, including root cause analysis and post-incident review

Skills

  • Argo CD
  • AWS
  • EKS
  • Grafana
  • Helm
  • Kubernetes
  • Terraform

Work schedule

On-call rotation

Relocation

No