Senior Site Reliability Engineer

Ellucian · VA, United States

VA, United States · Exp: 5+ yrs · Remote

Remuneration

Not specified

Location

VA, United States

Visa sponsorship

Not specified

Job summary

Ellucian is seeking a Senior Site Reliability Engineer (SRE) to ensure the reliability, performance, and cost-efficiency of its production systems. The role requires deep expertise in Datadog for observability and focuses on DevOps practices, incident management, root cause analysis, and cost optimization across cloud infrastructure and services.

Qualifications

5+ years of experience in Site Reliability Engineering, DevOps, or similar roles
Strong, hands-on expertise with Datadog (APM, logs, metrics, dashboards, alerting)
Experience with cloud platforms (AWS, Azure, or GCP)
Proficiency in DevOps practices and tools (CI/CD, Infrastructure as Code such as Terraform)
Strong troubleshooting skills and experience conducting root cause analysis in distributed systems
Experience with containers and orchestration (Docker, Kubernetes)
Scripting or programming experience (Python, Bash, or similar)
Proven ability to analyze and optimize cloud costs
Experience with cost management tools (e.g., AWS Cost Explorer, Azure Cost Management)
Familiarity with cloud security and compliance best practices
Experience supporting high-availability, customer-facing systems
Strong collaboration and communication skills

Responsibilities

Own and improve system reliability, availability, and performance for production environments
Design, implement, and manage monitoring, alerting, and observability using Datadog
Lead incident response efforts, including troubleshooting, mitigation, and post-incident reviews
Perform detailed root cause analysis (RCA) and drive permanent resolutions
Partner with engineering and DevOps teams to build scalable, resilient infrastructure
Automate operational processes to improve efficiency and reduce risk
Analyze and optimize infrastructure and application costs
Define and manage SLIs/SLOs to meet reliability targets
Continuously improve deployment, monitoring, and operational practices

Skills

AWS, Azure, Bash, Datadog, Docker, GCP, Kubernetes, Make, Python, Terraform

Relocation
