Senior Site Reliability Engineer

Ellucian · VA, United States · Exp: 5+ yrs · Remote
Remuneration: Not specified
Location: VA, United States
Visa sponsorship: Not specified

Job summary

Ellucian is seeking a Senior Site Reliability Engineer (SRE) to ensure the reliability, performance, and cost-efficiency of its production systems. The role requires deep expertise in Datadog for observability and focuses on DevOps practices, incident management, root cause analysis, and cost optimization across cloud infrastructure and services.

Qualifications

  • 5+ years of experience in Site Reliability Engineering, DevOps, or similar roles
  • Strong, hands-on expertise with Datadog (APM, logs, metrics, dashboards, alerting)
  • Experience with cloud platforms (AWS, Azure, or GCP)
  • Proficiency in DevOps practices and tools (CI/CD, and Infrastructure as Code tools such as Terraform)
  • Strong troubleshooting skills and experience conducting root cause analysis in distributed systems
  • Experience with containers and orchestration (Docker, Kubernetes)
  • Scripting or programming experience (Python, Bash, or similar)
  • Proven ability to analyze and optimize cloud costs
  • Experience with cost management tools (e.g., AWS Cost Explorer, Azure Cost Management)
  • Familiarity with cloud security and compliance best practices
  • Experience supporting high-availability, customer-facing systems
  • Strong collaboration and communication skills

Responsibilities

  • Own and improve system reliability, availability, and performance for production environments
  • Design, implement, and manage monitoring, alerting, and observability using Datadog
  • Lead incident response efforts, including troubleshooting, mitigation, and post-incident reviews
  • Perform detailed root cause analysis (RCA) and drive permanent resolutions
  • Partner with engineering and DevOps teams to build scalable, resilient infrastructure
  • Automate operational processes to improve efficiency and reduce risk
  • Analyze and optimize infrastructure and application costs
  • Define and manage SLIs/SLOs to meet reliability targets
  • Continuously improve deployment, monitoring, and operational practices

Skills

AWS, Azure, Bash, Datadog, Docker, GCP, Kubernetes, Make, Python, Terraform

Relocation

No