Site Reliability Engineer, Platform Responsibility - USDS

TikTok · Seattle, WA, United States

Seattle, WA, United States · Exp: 1+ yrs · 129,960-246,240 USD/yearly · Hybrid

Remuneration

$129,960 - $246,240 annually, plus additional discretionary bonuses/incentives and restricted stock units. Benefits include medical, dental, and vision insurance, a 401(k) savings plan with company match, paid parental leave, short-term and long-term disability coverage, life insurance, wellbeing benefits, 10 paid holidays per year, 10 paid sick days per year, and 17 days of Paid Personal Time.

Location

Seattle, WA, United States

Visa sponsorship

Not specified

Job summary

The Platform Responsibility engineering team at TikTok USDS is seeking a Site Reliability Engineer to build machine learning models and systems for identifying and defending against internet abuse and fraud. This role involves managing data services and pipelines, designing AI/LLM-powered automation for incident response, and improving operational efficiency through tool creation. The position is part of a team that provides 24/7 support and requires working scheduled shifts, which may include holidays.

Benefits

Additional discretionary bonuses/incentives
Restricted stock units
Medical insurance
Dental insurance
Vision insurance
401(k) savings plan with company match
Paid parental leave
Short-term disability coverage
Long-term disability coverage
Life insurance
Wellbeing benefits
10 paid holidays per year
10 paid sick days per year
17 days of Paid Personal Time

Qualifications

Bachelor's degree or higher in computer science or a related technical discipline
At least 1 year of industry experience
Experience integrating AI/LLM APIs into internal workflows or infrastructure tooling
Demonstrated independent thinking capabilities and troubleshooting skills
Familiarity with Unix/Linux system internals, networking, and distributed systems
Expertise in monitoring tools (e.g., Prometheus, Grafana, DataDog) and fundamental observability approaches
In-depth knowledge of Unix/Linux systems, networking fundamentals and system performance tuning
Familiarity with backend systems such as MySQL/Redis/Nginx/Kafka/Kubernetes/Docker and big data technologies such as Hadoop/Spark/Flink/Hive/OLAP/ClickHouse

Responsibilities

Manage day-to-day operations of data services and real-time/batch data pipelines, including SLA/SLO/SLI management, system deployment, performance tuning, and troubleshooting
Design and deploy AI Agents and LLM-powered automation to streamline incident response, root cause analysis, and proactive system monitoring
Create tools and automation to improve system administration and operational efficiency, leveraging AI-assisted development tools to accelerate delivery and improve code quality
Engage in and improve the whole lifecycle of services, from inception and design through development, capacity planning, and launch reviews, to deployment, operation, and refinement
Practice sustainable user support, incident response, and postmortems
This position is part of a team that provides 24/7 support and requires working scheduled shifts, which may include holidays

Skills

ClickHouse
Datadog
Docker
Grafana
Hadoop
Hive
Kafka
Kubernetes
Linux
MySQL
NGINX
Prometheus
Redis
Spark

Degrees

Bachelor

Work schedule

24/7 support
Scheduled shifts
Holidays