Site Reliability Engineer

Fabric Health · New York, NY, United States
New York, NY, United States · Full time · Exp: 5+ yrs · 135,000-160,000 USD/yearly · Remote
Remuneration
135,000-160,000 USD/yearly
Location
New York, NY, United States
Visa sponsorship
No visa sponsorship
Applicants must be currently authorized to work in the United States without the need for current or future visa sponsorship. Fabric does not currently offer, and will not provide, H-1B or any other employment-based visa sponsorship, now or in the future. This includes candidates on F-1 OPT or STEM OPT. Your legal right to work must be PERMANENT AND UNRESTRICTED.

Job summary

Fabric Health is seeking a Site Reliability Engineer to own and evolve the infrastructure powering healthcare experiences for millions of patients. This role involves bridging traditional infrastructure excellence with AI-driven operations, acting as a primary architect for AWS and Kubernetes (EKS) environments, and exploring agentic workflows to modernize SRE practices. The engineer will be a steward of Fabric’s production integrity, leading strategy for infrastructure automation, observability, and system resilience.

Benefits

Medical insurance, Dental insurance, Vision insurance, Unlimited PTO, 401(k) plan, Stock options, Bonuses

Qualifications

  • 5+ years of experience in SRE, DevOps, or Platform roles managing production environments at scale
  • Expert technical depth in AWS (EKS, EC2, RDS, S3) and production-grade Kubernetes management
  • Proficiency with modern tooling including Terraform (IaC), Datadog (Observability), and CI/CD systems
  • Strong coding and scripting skills in Python, Bash, Ruby, or Go
  • Preferred: experience building agentic workflows or AI-assisted tooling to drive operational efficiency
  • A rigor-first mindset with a dedication to HIPAA-compliant, high-availability architecture

Responsibilities

  • Own and evolve the infrastructure powering healthcare experiences for millions of patients
  • Act as a primary architect for AWS and Kubernetes (EKS) environments
  • Ensure the platform is resilient, scalable, and compliant
  • Explore how agentic workflows can modernize SRE practices
  • Be a steward of Fabric’s production integrity
  • Lead the strategy for infrastructure automation, observability, and system resilience
  • Design, deploy, and maintain production Kubernetes (EKS) clusters to ensure enterprise-grade availability
  • Eliminate manual configuration by building and managing scalable infrastructure state through Terraform
  • Optimize the AWS footprint (EC2, RDS, S3) for performance, cost-efficiency, and reliability
  • Explore and deploy agentic workflows for AI-assisted runbooks to automate complex operational decisions and repetitive tasks
  • Build and evolve deployment pipelines using GitHub Actions or Semaphore for rapid and safe delivery
  • Focus on toil reduction by developing internal tools that replace manual operational work with intelligent, autonomous systems
  • Drive the evolution of the observability stack in Datadog by implementing sophisticated metrics, traces, and logs to meet SLOs
  • Lead incident response efforts and facilitate blameless postmortems to systematically reduce mean time to recovery (MTTR)
  • Define and monitor SLIs and SLOs to ensure the platform consistently meets rigorous healthcare performance standards
  • Ensure all infrastructure remains fully compliant with HIPAA and other critical healthcare regulatory requirements
  • Mentor engineers across the company on reliability best practices
  • Contribute a clinical-safety perspective to cross-functional design reviews

Skills

AWS, Bash, Datadog, EKS, GitHub, GitHub Actions, Go, Kubernetes, Python, REST, Ruby, S3, Terraform

Languages

Python, Bash, Ruby, Go

Industry

Healthcare

Relocation

No