Senior Site Reliability Engineer, Compute - USDS
TikTok · Seattle, WA, United States
Seattle, WA, United States · Exp: 3+ yrs · 177,688-341,734 USD/yearly · Hybrid
Remuneration
177,688-341,734 USD/yearly
Location
Seattle, WA, United States
Visa sponsorship
Not specified
Job summary
Site Reliability Engineering (SRE) at TikTok combines software and systems engineering to build and run large-scale, massively distributed, and fault-tolerant systems. This role involves managing complex challenges of scale, using expertise in coding, algorithms, complexity analysis, and large-scale system design. The team embraces a culture of diversity, intellectual curiosity, openness, and problem-solving, encouraging close collaboration while promoting self-direction.
Qualifications
- Bachelor's degree in Computer Science, Information Technology, or a related field with 3+ years of experience.
- Proven work experience as a Site Reliability Engineer, Systems Engineer, or similar software engineering role.
- Passion for operational excellence through methodical automation and engineering processes, using programming languages such as Go or Python.
- Experience in network architecture, database modeling, cloud systems, and large-scale distributed systems.
- Strong understanding of Linux operating systems and open-source technologies.
- Excellent problem-solving skills, strategic thinking, and ability to debug complex systems.
- Exceptional communication skills and ability to effectively collaborate with cross-functional teams.
- Knowledge of monitoring tools and methodologies, such as Prometheus and Grafana.
- Experience with containers and container orchestration platforms such as Docker, Kubernetes, or equivalent.
Responsibilities
- Develop and maintain automation procedures to maximize system efficiency and minimize human intervention.
- Work closely with software engineering teams to design, deploy, and operate system components so that systems are functionally robust.
- Ensure system scalability to handle growth in web traffic and data.
- Implement monitoring tools and set up metrics to track system health and performance.
- Participate in on-call rotations, assist with incident management, and diagnose, resolve, and prevent production issues.
- Conduct performance tests to find and address system bottlenecks.
- Collaborate with teams across the organization to define Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs).
- Practice sustainable user support, incident response, and blameless postmortems.
Skills
Docker, Go, Grafana, Kubernetes, Linux, Prometheus, Python
Degrees
Bachelor's degree in Computer Science, Bachelor's degree in Information Technology, Bachelor's degree in a related field
Relocation
No