Sr. Site Reliability Engineer, IE&O
McCain Foods · Toronto, ON, Canada
Toronto, ON, Canada · Full time · Exp: 9+ yrs · 102,700-137,000 CAD/yearly · Hybrid
Remuneration
102,700-137,000 CAD/yearly
Location
Toronto, ON, Canada
Visa sponsorship
Not specified
Job summary
The Sr. Site Reliability Engineer will enhance the reliability, availability, performance, and operability of McCain’s critical technology platforms. This role involves designing resilient cloud-native systems, embedding observability, and automating operational workflows. The engineer will also contribute to building and driving McCain’s AIOps capabilities.
Benefits
- Health coverage (medical, dental, vision, prescription drug)
- Retirement savings benefits
- Leave support including medical, family and bereavement
- Vacation
- Holidays
- Company-supported volunteering time
- Mental health resources
Qualifications
- 9+ years of experience in software engineering, platform engineering, cloud engineering, DevOps, production engineering, or site reliability engineering.
- Strong hands-on experience with Azure, Kubernetes, containers, APIs, distributed systems, and modern deployment patterns.
- Strong scripting or software engineering experience using Python, Go, PowerShell, Bash, Java, or similar languages.
- Experience with observability, including metrics, logs, traces, dashboards, alerts, OpenTelemetry, and telemetry-driven reliability practices.
- Experience with Infrastructure as Code, CI/CD, automation, and deployment tooling such as Terraform, Bicep, GitHub Actions, Azure DevOps, or similar technologies.
- Good understanding of SLOs, SLIs, Error Budgets, resiliency patterns, incident management, production readiness, and capacity planning.
- Strong troubleshooting, communication, and stakeholder influencing skills.
- Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related technical field.
- Azure certifications are preferred.
- Experience with AIOps, event correlation, alert enrichment, noise reduction, automated triage, or incident automation.
- Experience using AI-assisted capabilities for incident triage, root cause analysis, knowledge management, operational automation, or engineering productivity.
- Experience building self-service platforms, reusable automation frameworks, golden paths, or internal developer platforms.
Responsibilities
- Design, build, and improve reliable, scalable, and secure systems across Azure cloud and hybrid environments.
- Embed observability into applications and platforms using metrics, logs, traces, dashboards, alerts, and OpenTelemetry standards.
- Build and drive AIOps capabilities by improving alert quality, event correlation, incident enrichment, noise reduction, automated triage, and operational automation.
- Partner with engineering teams to define SLOs, SLIs, Error Budgets, production readiness standards, and reliability scorecards.
- Build automation to reduce toil across infrastructure, deployments, incident response, monitoring, and operational workflows.
- Use Infrastructure as Code, CI/CD pipelines, scripting, and self-healing patterns to improve reliability and delivery speed.
- Support incident response, root cause analysis, postmortems, escalation workflows, and continuous reliability improvements.
- Troubleshoot complex issues across application, infrastructure, cloud, network, database, and integration layers.
- Build reusable SRE playbooks, standards, templates, and automation patterns for broader enterprise adoption.
- Collaborate with developers, platform teams, operations teams, vendors, and stakeholders to improve system reliability and operational maturity.
Skills
Azure, Azure DevOps, Bash, Bicep, GitHub, GitHub Actions, Go, Java, Kubernetes, OpenTelemetry, PowerShell, Python, Terraform
Certifications
Azure certifications
Degrees
Bachelor’s degree in Computer Science, Bachelor’s degree in Engineering, Bachelor’s degree in Information Technology
Languages
Python, Go, PowerShell, Bash, Java
Relocation
No