Site Reliability Engineer

Fabric Health · New York, NY, United States

New York, NY, United States · Full time · Exp: 5+ yrs · 135,000-160,000 USD/yearly · Remote

Remuneration

135,000-160,000 USD/yearly

Location

New York, NY, United States

Visa sponsorship

No visa sponsorship

Applicants must be currently authorized to work in the United States without the need for current or future visa sponsorship. Fabric does not currently offer, and will not provide, H-1B or any other employment-based visa sponsorship, now or in the future. This includes candidates on F-1 OPT or STEM OPT. Your legal right to work must be PERMANENT AND UNRESTRICTED.

Job summary

Fabric Health is seeking a Site Reliability Engineer to own and evolve the infrastructure powering healthcare experiences for millions of patients. This role involves bridging traditional infrastructure excellence with AI-driven operations, acting as a primary architect for AWS and Kubernetes (EKS) environments, and exploring agentic workflows to modernize SRE practices. The engineer will be a steward of Fabric’s production integrity, leading strategy for infrastructure automation, observability, and system resilience.

Benefits

Medical insurance
Dental insurance
Vision insurance
Unlimited PTO
401(k) plan
Stock options
Bonuses

Qualifications

5+ years of experience in SRE, DevOps, or Platform roles managing production environments at scale
Expert technical depth in AWS (EKS, EC2, RDS, S3) and production-grade Kubernetes management
Proficiency with modern tooling including Terraform (IaC), Datadog (Observability), and CI/CD systems
Strong coding and scripting skills in Python, Bash, Ruby, or Go
Preferred experience building agentic workflows or AI-assisted tooling to drive operational efficiency
A rigor-first mindset with a dedication to HIPAA-compliant, high-availability architecture

Responsibilities

Own and evolve the infrastructure powering healthcare experiences for millions of patients
Act as a primary architect for AWS and Kubernetes (EKS) environments
Ensure the platform is resilient, scalable, and compliant
Explore how agentic workflows can modernize SRE practices
Be a steward of Fabric’s production integrity
Lead the strategy for infrastructure automation, observability, and system resilience
Design, deploy, and maintain production Kubernetes (EKS) clusters to ensure enterprise-grade availability
Eliminate manual configuration by building and managing scalable infrastructure state through Terraform
Optimize the AWS footprint (EC2, RDS, S3) for performance, cost-efficiency, and reliability
Explore and deploy agentic workflows for AI-assisted runbooks to automate complex operational decisions and repetitive tasks
Build and evolve deployment pipelines using GitHub Actions or Semaphore for rapid and safe delivery
Focus on toil reduction by developing internal tools that replace manual operational work with intelligent, autonomous systems
Drive the evolution of the observability stack in Datadog by implementing sophisticated metrics, traces, and logs to meet SLOs
Lead incident response efforts and facilitate blameless postmortems to systematically reduce mean time to recovery (MTTR)
Define and monitor SLIs and SLOs to ensure the platform consistently meets rigorous healthcare performance standards
Ensure all infrastructure remains fully compliant with HIPAA and other critical healthcare regulatory requirements
Mentor engineers across the company on reliability best practices
Contribute a clinical-safety perspective to cross-functional design reviews

Skills

AWS
Bash
Datadog
EKS
GitHub
GitHub Actions
Go
Kubernetes
Python
REST
Ruby
S3
Terraform

Languages

Python
Bash
Ruby
Go

Industry

Healthcare
