Site Reliability Engineer, Tech Infrastructure - USDS
TikTok · New York, NY, United States
New York, NY, United States · Exp: 3+ yrs · 136,800-259,200 USD/yearly · Hybrid
Remuneration
136,800-259,200 USD/yearly
Location
New York, NY, United States
Visa sponsorship
Not specified
Job summary
The Site Reliability Engineer at TikTok combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. The role involves tackling complex challenges of scale, drawing on expertise in coding, algorithms, complexity analysis, and large-scale system design.
Qualifications
- Bachelor's degree in Computer Science, Information Technology, or a related field with 3+ years of experience.
- Proven work experience as a Site Reliability Engineer, Systems Engineer, or similar software engineering role.
- Proficiency in high-level programming and scripting languages (e.g., Python, Go, Java, shell scripting).
- Experience in network architecture, database modeling, cloud systems, and large-scale distributed systems.
- Strong understanding of Linux operating systems and open-source technologies.
- Experience with containers and container orchestration platforms such as Docker, Kubernetes, or equivalent.
- Knowledge of monitoring tools and methodologies (such as Prometheus, Grafana).
- Excellent problem-solving skills, strategic thinking, and ability to debug complex systems.
- Exceptional communication skills and ability to effectively collaborate with cross-functional teams.
Responsibilities
- Develop and maintain automation procedures to maximize system efficiency and minimize human intervention.
- Work closely with software engineering teams to design, deploy, and operate system components so that systems are functionally robust.
- Ensure system scalability to handle growth in web traffic and data.
- Implement monitoring tools and set up metrics to track system health and performance.
- Participate in on-call rotations, assist with incident management, and diagnose, resolve, and prevent production issues.
- Conduct performance tests to find and address system bottlenecks.
- Collaborate with teams across the organization to define Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs).
- Practice sustainable user support, incident response, and blameless postmortems.
Skills
Bash, Docker, Go, Grafana, Java, Kubernetes, Linux, Prometheus, Python
Degrees
Bachelor's degree in Computer Science, Bachelor's degree in Information Technology, Bachelor's degree in a related field
Relocation
No