Senior Site Reliability Engineer, Compute - USDS

TikTok · Seattle, WA, United States
Exp: 3+ yrs · 177,688-341,734 USD/year · Hybrid
Remuneration
177,688-341,734 USD/year
Location
Seattle, WA, United States
Visa sponsorship
Not specified

Job summary

Site Reliability Engineering (SRE) at TikTok combines software and systems engineering to build and run large-scale, massively distributed, and fault-tolerant systems. This role involves managing complex challenges of scale, using expertise in coding, algorithms, complexity analysis, and large-scale system design. The team embraces a culture of diversity, intellectual curiosity, openness, and problem-solving, encouraging close collaboration while promoting self-direction.

Qualifications

  • Bachelor's degree in Computer Science, Information Technology, or a related field with 3+ years of experience.
  • Proven work experience as a Site Reliability Engineer, Systems Engineer, or similar software engineering role.
  • Passion for operational excellence through methodical automation and engineering processes, using programming languages such as Go or Python.
  • Experience in network architecture, database modeling, cloud systems, and large-scale distributed systems.
  • Strong understanding of Linux operating systems and open-source technologies.
  • Excellent problem-solving skills, strategic thinking, and ability to debug complex systems.
  • Exceptional communication skills and ability to effectively collaborate with cross-functional teams.
  • Knowledge of monitoring tools and methodologies, such as Prometheus and Grafana.
  • Experience with containers and container orchestration platforms such as Docker, Kubernetes, or equivalent.

Responsibilities

  • Develop and maintain automation procedures to maximize system efficiency and minimize human intervention.
  • Work closely with software engineering teams to design, deploy, and operate system components so that systems are functionally robust.
  • Ensure system scalability to handle growth in web traffic and data.
  • Implement monitoring tools and set up metrics to track system health and performance.
  • Participate in on-call rotations, assist with incident management, and diagnose, resolve, and prevent production issues.
  • Conduct performance tests to find and address system bottlenecks.
  • Collaborate with teams across the organization to define Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs).
  • Practice sustainable user support, incident response, and blameless postmortems.

Skills

Docker, Go, Grafana, Kubernetes, Linux, Prometheus, Python

Degrees

  • Bachelor's degree in Computer Science
  • Bachelor's degree in Information Technology
  • Bachelor's degree in a related field

Relocation

No