Site Reliability Engineer, Tech Infrastructure - USDS

TikTok · New York, NY, United States
New York, NY, United States · Exp: 3+ yrs · 136,800-259,200 USD/yearly · Hybrid
Remuneration
136,800-259,200 USD/yearly
Location
New York, NY, United States
Visa sponsorship
Not specified

Job summary

The Site Reliability Engineer role at TikTok combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. The role involves managing complex challenges of scale, drawing on expertise in coding, algorithms, complexity analysis, and large-scale system design.

Qualifications

  • Bachelor's degree in Computer Science, Information Technology, or a related field with 3+ years of experience.
  • Proven work experience as a Site Reliability Engineer, Systems Engineer, or similar software engineering role.
  • Proficiency in high-level programming and scripting languages (e.g., Python, Go, Java, shell scripting).
  • Experience in network architecture, database modeling, cloud systems, and large-scale distributed systems.
  • Strong understanding of Linux operating systems and open-source technologies.
  • Experience with containers and container orchestration platforms such as Docker, Kubernetes, or equivalent.
  • Knowledge of monitoring tools and methodologies (such as Prometheus and Grafana).
  • Excellent problem-solving skills, strategic thinking, and ability to debug complex systems.
  • Exceptional communication skills and ability to effectively collaborate with cross-functional teams.

Responsibilities

  • Develop and maintain automation procedures to maximize system efficiency and minimize human intervention.
  • Work closely with software engineering teams to design, deploy, and operate system components so that systems remain functionally robust.
  • Ensure system scalability to handle growth in web traffic and data.
  • Implement monitoring tools and set up metrics to track system health and performance.
  • Participate in on-call rotations, assist with incident management, and diagnose, resolve, and prevent production issues.
  • Conduct performance tests to find and address system bottlenecks.
  • Collaborate with teams across the organization to define Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs).
  • Practice sustainable user support, incident response, and blameless postmortems.

Skills

Bash, Docker, Go, Grafana, Java, Kubernetes, Linux, Prometheus, Python

Degrees

  • Bachelor's degree in Computer Science
  • Bachelor's degree in Information Technology
  • Bachelor's degree in a related field

Relocation

No