Site Reliability Engineer, iCloud

Apple · London, ENG, United Kingdom
London, ENG, United Kingdom · Exp: 5+ yrs · Hybrid
Remuneration
Not specified
Location
London, ENG, United Kingdom
Visa sponsorship
Not specified

Job summary

Apple Services Engineering (ASE) is seeking a Site Reliability Engineer (SRE) to join their team, responsible for the reliability and performance of server software stacks powering products like iCloud Photos, Mail, Drive, and Backup. The role involves solving unique challenges at Apple's large scale, across multiple geographies, and servicing hundreds of millions of users. The SRE will engage with product teams, operate and monitor environments, and contribute to improving system reliability, security, and performance.

Qualifications

  • Strong sense of ownership, customer service, and integrity proven through clear communication.
  • BS in Computer Science or related field, or equivalent work experience.
  • 5+ years of experience managing and scaling distributed systems in a public, private, or hybrid cloud environment.
  • Strong experience with deploying, supporting and supervising new and existing services, platforms, and application stacks.
  • Experience with scale testing, disaster recovery, and capacity planning.
  • Experience with observability platforms including Splunk, Grafana, and Prometheus.
  • Demonstrable fluency in at least one of the following languages: Java, Python, or Go.
  • Experience with Kubernetes, Nginx, Envoy, Prometheus, and/or Docker.
  • Understanding of standard networking protocols and components such as HTTP, DNS, ECMP, TCP/IP, ICMP, the OSI model, subnetting, and load-balancing strategies.
  • Understanding of the Linux Operating System, including Kernel, Memory, Process, Threads, Static / Shared Libraries, IPC, and Signals.
  • Experience in developing iOS apps using Xcode and Swift.
  • Experience with OpenTelemetry standards and distributed tracing tools such as Jaeger.

Responsibilities

  • Engage with product teams to understand requirements, then design and implement resilient and scalable infrastructure solutions.
  • Operate, monitor, and triage all aspects of production and non-production environments.
  • Collaborate on code, infrastructure, design reviews, and process enhancements.
  • Evaluate and integrate new technologies to improve system reliability, security, and performance.
  • Develop and implement automation to provision, configure, deploy, and monitor Apple services.
  • Participate in an on-call rotation, providing hands-on technical expertise during service-impacting events.
  • Contribute to capacity planning, scale testing, and disaster recovery exercises.
  • Approach operational problems with a software engineering mindset.

Skills

Docker, Envoy, Go, Grafana, Jaeger, Java, Kubernetes, Linux, NGINX, OpenTelemetry, Prometheus, Python, Splunk

Degrees

BS in Computer Science or related field

Languages

Java, Python, Go

Work schedule

On-call rotation

Relocation

No