Sr. Site Reliability Engineer, IE&O

McCain Foods · Toronto, ON, Canada

Toronto, ON, Canada · Full time · Exp: 9+ yrs · 102,700-137,000 CAD/yearly · Hybrid

Remuneration

102,700-137,000 CAD/yearly

Location

Toronto, ON, Canada

Visa sponsorship

Not specified

Job summary

The Sr. Site Reliability Engineer will enhance the reliability, availability, performance, and operability of McCain’s critical technology platforms. This role involves designing resilient cloud-native systems, embedding observability, and automating operational workflows. The engineer will also contribute to building and driving McCain’s AIOps capabilities.

Benefits

Health coverage (medical, dental, vision, prescription drug)
Retirement savings benefits
Leave support including medical, family and bereavement
Vacation
Holidays
Company-supported volunteering time
Mental health resources

Qualifications

9+ years of experience in software engineering, platform engineering, cloud engineering, DevOps, production engineering, or site reliability engineering.
Strong hands-on experience with Azure, Kubernetes, containers, APIs, distributed systems, and modern deployment patterns.
Strong scripting or software engineering experience using Python, Go, PowerShell, Bash, Java, or similar languages.
Experience with observability, including metrics, logs, traces, dashboards, alerts, OpenTelemetry, and telemetry-driven reliability practices.
Experience with Infrastructure as Code, CI/CD, automation, and deployment tooling such as Terraform, Bicep, GitHub Actions, Azure DevOps, or similar technologies.
Good understanding of SLOs, SLIs, Error Budgets, resiliency patterns, incident management, production readiness, and capacity planning.
Strong troubleshooting, communication, and stakeholder influencing skills.
Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related technical field.
Azure certifications are preferred.
Experience with AIOps, event correlation, alert enrichment, noise reduction, automated triage, or incident automation.
Experience using AI-assisted capabilities for incident triage, root cause analysis, knowledge management, operational automation, or engineering productivity.
Experience building self-service platforms, reusable automation frameworks, golden paths, or internal developer platforms.

Responsibilities

Design, build, and improve reliable, scalable, and secure systems across Azure cloud and hybrid environments.
Embed observability into applications and platforms using metrics, logs, traces, dashboards, alerts, and OpenTelemetry standards.
Build and drive AIOps capabilities by improving alert quality, event correlation, incident enrichment, noise reduction, automated triage, and operational automation.
Partner with engineering teams to define SLOs, SLIs, Error Budgets, production readiness standards, and reliability scorecards.
Build automation to reduce toil across infrastructure, deployments, incident response, monitoring, and operational workflows.
Use Infrastructure as Code, CI/CD pipelines, scripting, and self-healing patterns to improve reliability and delivery speed.
Support incident response, root cause analysis, postmortems, escalation workflows, and continuous reliability improvements.
Troubleshoot complex issues across application, infrastructure, cloud, network, database, and integration layers.
Build reusable SRE playbooks, standards, templates, and automation patterns for broader enterprise adoption.
Collaborate with developers, platform teams, operations teams, vendors, and stakeholders to improve system reliability and operational maturity.

Skills

Azure, Azure DevOps, Bash, Bicep, GitHub, GitHub Actions, Go, Java, Kubernetes, OpenTelemetry, PowerShell, Python, Terraform

Certifications

Azure certifications

Degrees

Bachelor’s degree in Computer Science
Bachelor’s degree in Engineering
Bachelor’s degree in Information Technology

Languages

Python, Go, PowerShell, Bash, Java
