Site Reliability Engineer, Platform Responsibility - USDS

TikTok · Seattle, WA, United States
Seattle, WA, United States · Exp: 1+ yrs · 129,960-246,240 USD/yearly · Hybrid
Remuneration
$129,960 - $246,240 annually, plus additional discretionary bonuses/incentives and restricted stock units; medical, dental, and vision insurance; a 401(k) savings plan with company match; paid parental leave; short-term and long-term disability coverage; life insurance; wellbeing benefits; 10 paid holidays per year; 10 paid sick days per year; and 17 days of Paid Personal Time
Location
Seattle, WA, United States
Visa sponsorship
Not specified

Job summary

The Platform Responsibility engineering team at TikTok USDS is seeking a Site Reliability Engineer to build machine learning models and systems for identifying and defending against internet abuse and fraud. This role involves managing data services and pipelines, designing AI/LLM-powered automation for incident response, and improving operational efficiency through tool creation. The position is part of a team that provides 24/7 support and requires working scheduled shifts, which may include holidays.

Benefits

  • Additional discretionary bonuses/incentives
  • Restricted stock units
  • Medical insurance
  • Dental insurance
  • Vision insurance
  • 401(k) savings plan with company match
  • Paid parental leave
  • Short-term disability coverage
  • Long-term disability coverage
  • Life insurance
  • Wellbeing benefits
  • 10 paid holidays per year
  • 10 paid sick days per year
  • 17 days of Paid Personal Time

Qualifications

  • Bachelor's degree or above in computer science or a related technical discipline
  • At least 1 year of industry experience
  • Experience integrating AI/LLM APIs into internal workflows or infrastructure tooling
  • Demonstrated independent thinking capabilities and troubleshooting skills
  • Familiarity with Unix/Linux system internals, networking, and distributed systems
  • Expertise in monitoring tools (e.g., Prometheus, Grafana, Datadog) and fundamental observability approaches
  • In-depth knowledge of Unix/Linux systems, networking fundamentals and system performance tuning
  • Familiarity with backend systems such as MySQL/Redis/Nginx/Kafka/Kubernetes/Docker and big data technologies such as Hadoop/Spark/Flink/Hive/OLAP/ClickHouse

Responsibilities

  • Manage day-to-day operations of data services and real-time/batch data pipelines, including SLA/SLO/SLI management, system deployment, performance tuning, and troubleshooting
  • Design and deploy AI Agents and LLM-powered automation to streamline incident response, root cause analysis, and proactive system monitoring
  • Create tools and automation to improve system administration and operational efficiency, leveraging AI-assisted development tools to accelerate delivery and improve code quality
  • Engage in and improve the whole lifecycle of services from inception and design, development, capacity planning, and launch reviews, to deployment, operation, and refinement
  • Practice sustainable user support, incident response, and postmortems
  • This position is part of a team that provides 24/7 support and requires working scheduled shifts, which may include holidays

Skills

ClickHouse, Datadog, Docker, Grafana, Hadoop, Hive, Kafka, Kubernetes, Linux, MySQL, NGINX, Prometheus, Redis, Spark

Degrees

Bachelor

Work schedule

24/7 support, scheduled shifts, holidays

Relocation

No