Senior Observability & SRE Leader

Marsh McLennan · Toronto, ON, Canada
Toronto, ON, Canada · Exp: 15+ yrs · Remote
Remuneration
Not specified
Location
Toronto, ON, Canada
Visa sponsorship
Not specified

Job summary

Marsh is seeking a visionary leader to rebuild and transform its Observability and Site Reliability Engineering (SRE) function. This role involves shifting to a predictive, data-driven engineering discipline to prevent outages, embed reliability into systems, and treat observability data as a strategic asset. The leader will build a world-class observability and SRE organization at Fortune 500 scale.

Qualifications

  • 15+ years in technology with 8+ years in progressively senior observability, SRE, or platform reliability leadership roles.
  • Demonstrated track record of transforming reactive monitoring organizations into proactive, engineering-driven SRE functions at enterprise scale (10,000+ employees, 1,000+ applications).
  • Deep expertise across the full observability stack: metrics (Prometheus, Datadog, CloudWatch), distributed tracing (Jaeger, OpenTelemetry, Datadog APM), log aggregation (Splunk, ELK, Datadog Logs), synthetic monitoring, and RUM.
  • Hands-on experience defining and operationalizing SLO/SLI/Error Budget frameworks.
  • Proven experience building AIOps / ML-driven anomaly detection and automated remediation capabilities in production systems.
  • Strong background in chaos engineering, resilience testing, and reliability-by-design practices (circuit breakers, bulkheads, graceful degradation, retry/backoff patterns).
  • Experience operating across hybrid infrastructure: on-premises data centers, AWS, Azure, containerized workloads (Kubernetes), and SaaS platforms.
  • Demonstrated ability to drive cultural and organizational transformation across large, complex enterprises.
  • Experience managing $5M+ observability platform budgets and optimizing total cost of ownership.
  • Executive communication skills to present reliability strategy, risk posture, and investment cases to C-suite and board-level audiences.
  • Visionary thinker capable of articulating a compelling future state, building the roadmap, and executing relentlessly.

Responsibilities

  • Define and execute an observability and SRE strategy to shift from reactive operations to predictive reliability engineering.
  • Architect and deliver a unified, full-stack observability platform covering metrics, traces, logs, real-user monitoring (RUM), synthetic monitoring, and business-level KPIs across on-prem, multi-cloud (AWS/Azure), containers, and SaaS integrations.
  • Rationalize and consolidate fragmented tooling into a cohesive, cost-optimized platform, eliminating redundant tools and reducing alert noise.
  • Establish a single pane of glass for system health.
  • Drive adoption of OpenTelemetry as the standard instrumentation framework for vendor-agnostic telemetry collection.
  • Build and operationalize AIOps and ML-driven capabilities to detect anomalies, predict failures, and surface emerging risks.
  • Establish automated correlation engines to link infrastructure signals, application traces, deployment events, and change records to reduce diagnostic time and identify root causes.
  • Design and implement self-healing automation to detect, diagnose, and remediate common failure patterns without human intervention.
  • Introduce chaos engineering and reliability testing programs (GameDays, fault injection, load testing) to proactively discover weaknesses.
  • Transform the operations-centric team into a modern SRE organization with embedded reliability engineers operating under a "you build it, you own it" model.
  • Define and implement SLO/SLI/Error Budget frameworks across critical services.
  • Drive adoption of DevOps practices, CI/CD pipelines, and infrastructure as code using tools like Terraform or CloudFormation.
  • Champion reliability-first design principles, ensuring observability, graceful degradation, circuit breaking, and failure isolation are architected into systems from day one.
  • Partner with Major Incident Management and Problem Management to build closed-loop feedback systems, ensuring every incident produces a reliability improvement.
  • Drive Mean Time To Resolution (MTTR) toward minutes through automated diagnostics, pre-built remediation playbooks, and intelligent correlation.
  • Establish "Incidents Prevented" as a primary success metric.
  • Elevate observability from infrastructure metrics to business outcomes, building real-time dashboards connecting system health to revenue impact, customer experience, and SLA compliance.
  • Integrate observability insights into ITSM (ServiceNow), data platforms, and executive reporting.
  • Own the total cost of ownership of the observability platform, optimizing spend through data tiering, intelligent sampling, retention policies, and vendor negotiations.
  • Manage strategic vendor relationships (Datadog, Splunk, LogicMonitor, cloud-native tooling) to maximize value extraction.

Skills

AWS · Azure · CloudFormation · CloudWatch · Datadog · Jaeger · Kubernetes · OpenTelemetry · Prometheus · ServiceNow · Splunk · Terraform

Industry

Risk · Reinsurance · Capital · People and Investments · Management Consulting

Company size

Fortune 500 · 10,000+ employees · 95,000 colleagues

Relocation

No