Director of SRE

LSEG (London Stock Exchange Group) · London, ENG, United Kingdom

London, ENG, United Kingdom · Remote

Remuneration

Not specified

Location

London, ENG, United Kingdom

Visa sponsorship

Not specified

Job summary

LSEG is seeking a highly technical and strategic Director of Site Reliability Engineering (SRE) to lead the design, operation, and continuous improvement of highly available, scalable, and resilient platforms across FTSE Russell Engineering. This role will drive operational and engineering excellence in observability, incident management, automation, and resilience, ensuring mission-critical financial systems meet stringent reliability, performance, and regulatory requirements. The Director of SRE will report to the COO, FTSE Russell Engineering.

Benefits

Healthcare
Retirement planning
Paid volunteering days
Wellbeing initiatives

Qualifications

Proven experience leading SRE, DevOps, and/or Platform Engineering teams in large-scale, regulated environments.
Deep technical expertise in both on-premises and AWS cloud-native architecture and systems design.
Strong experience with observability tooling and frameworks for metrics, logs, and traces (Prometheus, Grafana, OpenTelemetry, ELK/EFK stacks, CloudWatch, Datadog, or similar).
Demonstrated ownership of incident management and production operations at scale.
Hands-on experience with CI/CD pipelines, Git, automation, scripting, and IaC tools (Python, Go, Ansible, Terraform, etc.).
Demonstrable application of agentic AI engineering in observability, incident management, and automated recovery.
Strong understanding of networking, security, and reliability engineering principles.
Experience defining and implementing SLOs, SLIs, and error budgets.
Experience in financial services or other data-intensive, regulated industries.
Familiarity with multi-region AWS architecture and high-availability, mission-critical applications.
Knowledge of chaos engineering practices and resilience testing.
Exposure to AIOps and intelligent automation frameworks.
Calm, decisive, and effective under pressure.
Strong communicator with the ability to engage both technical and non-technical stakeholders.
Strategic thinker with a hands-on approach to problem-solving.
Passionate about service reliability, resilient architecture, and continuous improvement.

Responsibilities

Lead, mentor, and scale a high-performing global SRE organization.
Partner with product, platform, operations, and security teams to embed reliability into the software development lifecycle (SDLC).
Define and track KPIs for reliability, performance, and operational efficiency.
Foster a culture of continuous improvement, accountability, and engineering excellence.
Promote automation-first principles and self-service observability tooling for engineering teams.
Own and evolve production incident management frameworks, including detection, triage, escalation, and resolution.
Lead major incident response (MIR) for critical outages, ensuring rapid mitigation and clear stakeholder communication.
Implement and enforce blameless postmortems, ensuring actionable follow-ups and systemic improvements.
Establish runbooks, playbooks, and operational readiness standards across all services.
Drive continuous improvement in incident response processes, tooling, and team readiness.
Drive automation of operational processes including incident response, failover, scaling, and recovery.
Implement self-healing mechanisms using auto-remediation, event-driven workflows, and AI/ML-assisted operations.
Promote Infrastructure as Code (IaC) using tools such as Terraform or CloudFormation.
Reduce manual toil through CI/CD pipelines, automated testing, and deployment strategies (blue/green, canary releases).
Define and implement best-in-class observability frameworks across metrics, logs, and traces.
Standardize tooling (e.g., Prometheus, Grafana, OpenTelemetry, ELK/EFK stacks, CloudWatch, Datadog, or similar) across engineering teams.
Champion distributed tracing and real-time telemetry to enable deep system visibility and rapid root cause analysis.
Drive a data-driven reliability culture, using observability insights to proactively identify and eliminate system risks.
Establish and enforce SRE principles including SLIs, SLOs, SLAs, and error budgets across all critical services.
Drive adoption of resilient design patterns (multi-region failover, active-active architectures, circuit breakers, bulkheads).

Skills

Ansible
AWS
CloudFormation
CloudWatch
C#
Datadog
DynamoDB
ECS
EKS
Git
Go
Grafana
Java
AWS Lambda
Linux
OpenTelemetry
PostgreSQL
Prometheus
Python
S3
Terraform

Languages

Python
Go

Industry

Financial services

Relocation
