Senior Observability & SRE Leader
Marsh McLennan · Toronto, ON, Canada
Toronto, ON, Canada · Exp: 15+ yrs · Remote
Remuneration
Not specified
Location
Toronto, ON, Canada
Visa sponsorship
Not specified
Job summary
Marsh is seeking a visionary leader to rebuild and transform its Observability and Site Reliability Engineering (SRE) function. This role involves shifting to a predictive, data-driven engineering discipline to prevent outages, embed reliability into systems, and treat observability data as a strategic asset. The leader will build a world-class observability and SRE organization at Fortune 500 scale.
Qualifications
- 15+ years in technology with 8+ years in progressively senior observability, SRE, or platform reliability leadership roles.
- Demonstrated track record of transforming reactive monitoring organizations into proactive, engineering-driven SRE functions at enterprise scale (10,000+ employees, 1,000+ applications).
- Deep expertise across the full observability stack: metrics (Prometheus, Datadog, CloudWatch), distributed tracing (Jaeger, OpenTelemetry, Datadog APM), log aggregation (Splunk, ELK, Datadog Logs), synthetic monitoring, and RUM.
- Hands-on experience defining and operationalizing SLO/SLI/Error Budget frameworks.
- Proven experience building AIOps / ML-driven anomaly detection and automated remediation capabilities in production systems.
- Strong background in chaos engineering, resilience testing, and reliability-by-design practices (circuit breakers, bulkheads, graceful degradation, retry/backoff patterns).
- Experience operating across hybrid infrastructure: on-premises data centers, AWS, Azure, containerized workloads (Kubernetes), and SaaS platforms.
- Demonstrated ability to drive cultural and organizational transformation across large, complex enterprises.
- Experience managing $5M+ observability platform budgets and optimizing total cost of ownership.
- Executive communication skills to present reliability strategy, risk posture, and investment cases to C-suite and board-level audiences.
- Visionary thinker capable of articulating a compelling future state, building the roadmap, and executing relentlessly.
Responsibilities
- Define and execute an observability and SRE strategy to shift from reactive operations to predictive reliability engineering.
- Architect and deliver a unified, full-stack observability platform covering metrics, traces, logs, real-user monitoring (RUM), synthetic monitoring, and business-level KPIs across on-prem, multi-cloud (AWS/Azure), containers, and SaaS integrations.
- Rationalize and consolidate fragmented tooling into a cohesive, cost-optimized platform, eliminating redundant tools and reducing alert noise.
- Establish a single pane of glass for system health.
- Drive adoption of OpenTelemetry as the standard instrumentation framework for vendor-agnostic telemetry collection.
- Build and operationalize AIOps and ML-driven capabilities to detect anomalies, predict failures, and surface emerging risks.
- Establish automated correlation engines to link infrastructure signals, application traces, deployment events, and change records to reduce diagnostic time and identify root cause.
- Design and implement self-healing automation to detect, diagnose, and remediate common failure patterns without human intervention.
- Introduce chaos engineering and reliability testing programs (GameDays, fault injection, load testing) to proactively discover weaknesses.
- Transform the operations-centric team into a modern SRE organization with embedded reliability engineers operating under a "you build it, you own it" model.
- Define and implement SLO/SLI/Error Budget frameworks across critical services.
- Drive adoption of DevOps practices, CI/CD pipelines, and infrastructure as code using tools like Terraform or CloudFormation.
- Champion reliability-first design principles, ensuring observability, graceful degradation, circuit breaking, and failure isolation are architected into systems from day one.
- Partner with Major Incident Management and Problem Management to build closed-loop feedback systems, ensuring every incident produces a reliability improvement.
- Drive Mean Time To Resolution (MTTR) toward minutes through automated diagnostics, pre-built remediation playbooks, and intelligent correlation.
- Establish "Incidents Prevented" as a primary success metric.
- Elevate observability from infrastructure metrics to business outcomes, building real-time dashboards connecting system health to revenue impact, customer experience, and SLA compliance.
- Integrate observability insights into ITSM (ServiceNow), data platforms, and executive reporting.
- Own the total cost of ownership of the observability platform, optimizing spend through data tiering, intelligent sampling, retention policies, and vendor negotiations.
- Manage strategic vendor relationships (Datadog, Splunk, LogicMonitor, cloud-native tooling) to maximize value extraction.
Skills
AWS, Azure, CloudFormation, CloudWatch, Datadog, Jaeger, Kubernetes, OpenTelemetry, Prometheus, ServiceNow, Splunk, Terraform
Industry
Risk, Reinsurance, Capital, People and Investments, Management Consulting
Company size
Fortune 500, 10,000+ employees, 95,000 colleagues
Relocation
No