Senior Observability & SRE Leader

Marsh McLennan · Toronto, ON, Canada

Toronto, ON, Canada · Exp: 15+ yrs · Remote

Remuneration

Not specified

Location

Toronto, ON, Canada

Visa sponsorship

Not specified

Job summary

Marsh is seeking a visionary leader to rebuild and transform its Observability and Site Reliability Engineering (SRE) function. The role shifts the function to a predictive, data-driven engineering discipline that prevents outages, embeds reliability into systems, and treats observability data as a strategic asset. The leader will build a world-class observability and SRE organization at Fortune 500 scale.

Qualifications

15+ years in technology with 8+ years in progressively senior observability, SRE, or platform reliability leadership roles.
Demonstrated track record of transforming reactive monitoring organizations into proactive, engineering-driven SRE functions at enterprise scale (10,000+ employees, 1,000+ applications).
Deep expertise across the full observability stack: metrics (Prometheus, Datadog, CloudWatch), distributed tracing (Jaeger, OpenTelemetry, Datadog APM), log aggregation (Splunk, ELK, Datadog Logs), synthetic monitoring, and RUM.
Hands-on experience defining and operationalizing SLO/SLI/Error Budget frameworks.
Proven experience building AIOps / ML-driven anomaly detection and automated remediation capabilities in production systems.
Strong background in chaos engineering, resilience testing, and reliability-by-design practices (circuit breakers, bulkheads, graceful degradation, retry/backoff patterns).
Experience operating across hybrid infrastructure: on-premises data centers, AWS, Azure, containerized workloads (Kubernetes), and SaaS platforms.
Demonstrated ability to drive cultural and organizational transformation across large, complex enterprises.
Experience managing $5M+ observability platform budgets and optimizing total cost of ownership.
Executive communication skills to present reliability strategy, risk posture, and investment cases to C-suite and board-level audiences.
Visionary thinker capable of articulating a compelling future state, building the roadmap, and executing relentlessly.

Responsibilities

Define and execute an observability and SRE strategy to shift from reactive operations to predictive reliability engineering.
Architect and deliver a unified, full-stack observability platform covering metrics, traces, logs, real-user monitoring (RUM), synthetic monitoring, and business-level KPIs across on-prem, multi-cloud (AWS/Azure), containers, and SaaS integrations.
Rationalize and consolidate fragmented tooling into a cohesive, cost-optimized platform, eliminating redundant tools and reducing alert noise.
Establish a single pane of glass for system health.
Drive adoption of OpenTelemetry as the standard instrumentation framework for vendor-agnostic telemetry collection.
Build and operationalize AIOps and ML-driven capabilities to detect anomalies, predict failures, and surface emerging risks.
Establish automated correlation engines to link infrastructure signals, application traces, deployment events, and change records to reduce diagnostic time and identify root cause.
Design and implement self-healing automation to detect, diagnose, and remediate common failure patterns without human intervention.
Introduce chaos engineering and reliability testing programs (GameDays, fault injection, load testing) to proactively discover weaknesses.
Transform the operations-centric team into a modern SRE organization with embedded reliability engineers operating under a "you build it, you own it" model.
Define and implement SLO/SLI/Error Budget frameworks across critical services.
Drive adoption of DevOps practices, CI/CD pipelines, and infrastructure as code using tools like Terraform or CloudFormation.
Champion reliability-first design principles, ensuring observability, graceful degradation, circuit breaking, and failure isolation are architected into systems from day one.
Partner with Major Incident Management and Problem Management to build closed-loop feedback systems, ensuring every incident produces a reliability improvement.
Drive Mean Time To Resolution (MTTR) toward minutes through automated diagnostics, pre-built remediation playbooks, and intelligent correlation.
Establish "Incidents Prevented" as a primary success metric.
Elevate observability from infrastructure metrics to business outcomes, building real-time dashboards connecting system health to revenue impact, customer experience, and SLA compliance.
Integrate observability insights into ITSM (ServiceNow), data platforms, and executive reporting.
Own the total cost of ownership of the observability platform, optimizing spend through data tiering, intelligent sampling, retention policies, and vendor negotiations.
Manage strategic vendor relationships (Datadog, Splunk, LogicMonitor, cloud-native tooling) to maximize value extraction.

Skills

AWS, Azure, CloudFormation, CloudWatch, Datadog, Jaeger, Kubernetes, OpenTelemetry, Prometheus, ServiceNow, Splunk, Terraform

Industry

Risk, Reinsurance, Capital, People and Investments, Management Consulting

Company size

Fortune 500 · 10,000+ employees · 95,000 colleagues

Relocation
