Senior Site Reliability Engineer, Compute - USDS

TikTok · Seattle, WA, United States

Seattle, WA, United States · Exp: 3+ yrs · 177,688-341,734 USD/yearly · Hybrid

Remuneration

177,688-341,734 USD/yearly

Location

Seattle, WA, United States

Visa sponsorship

Not specified

Job summary

Site Reliability Engineering (SRE) at TikTok combines software and systems engineering to build and run large-scale, massively distributed, and fault-tolerant systems. This role involves managing complex challenges of scale, using expertise in coding, algorithms, complexity analysis, and large-scale system design. The team embraces a culture of diversity, intellectual curiosity, openness, and problem-solving, encouraging close collaboration while promoting self-direction.

Qualifications

Bachelor's degree in Computer Science, Information Technology, or a related field with 3+ years of experience.
Proven work experience as a Site Reliability Engineer, Systems Engineer, or similar software engineering role.
Passion for operational excellence through methodical automation and engineering processes, using programming languages such as Go or Python.
Experience in network architecture, database modeling, cloud systems, and large-scale distributed systems.
Strong understanding of Linux operating systems and open-source technologies.
Excellent problem-solving skills, strategic thinking, and ability to debug complex systems.
Exceptional communication skills and ability to effectively collaborate with cross-functional teams.
Knowledge of monitoring tools and methodologies such as Prometheus and Grafana.
Experience with containers and container orchestration platforms such as Docker, Kubernetes, or equivalent.

Responsibilities

Develop and maintain automation procedures to maximize system efficiency and minimize human intervention.
Work closely with software engineering teams to design, deploy, and operate system components so that systems are functionally robust.
Ensure system scalability to handle growth in web traffic and data.
Implement monitoring tools and set up metrics to track system health and performance.
Participate in on-call rotations, assist with incident management, and diagnose, resolve, and prevent production issues.
Conduct performance tests to find and address system bottlenecks.
Collaborate with teams across the organization to define Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs).
Practice sustainable user support, incident response, and blameless postmortems.

Skills

Docker, Go, Grafana, Kubernetes, Linux, Prometheus, Python

Degrees

Bachelor's degree in Computer Science
Bachelor's degree in Information Technology
Bachelor's degree in a related field
