Principal AI Site Reliability Engineer

Oracle · United States

United States · Exp: 7+ yrs · Remote

Remuneration

Not specified

Location

United States

Visa sponsorship

Not specified

U.S. citizenship is required for this position, as the successful candidate will be required to obtain (and maintain) a U.S. government security clearance after hire.

Job summary

As a Principal Site Reliability Engineer, you will play a pivotal role in building and operating the Oracle Health Patient Portal, which involves designing, building, and operating highly reliable, scalable infrastructure for Commercial and Federal customers. Working within a globally distributed team, you will contribute to the evolution of cloud operations by advancing automation, observability, and AI-assisted reliability practices, delivering robust solutions and continuously improving system reliability and operational excellence.

Qualifications

U.S. citizenship is required
Ability to obtain and maintain a U.S. government security clearance
Experience building and operating high-availability, fault-tolerant systems
Strong understanding of distributed systems, performance monitoring, and resiliency patterns
Experience with incident response, root-cause analysis, and production troubleshooting
Experience with one or more cloud environments (OCI, AWS, Azure)
Advanced competency in CI/CD pipelines
Experience with Infrastructure as Code (Terraform)
Experience with observability tools (Prometheus, Grafana)
Strong focus on automation-first operations
Proficiency in Data Warehousing platforms (e.g., Vertica, Snowflake)
Experience with ETL frameworks and large-scale data processing
Understanding of columnar storage systems
Proficiency in Python, Java, or Go
Experience with Docker, Kubernetes, and shell scripting
Strong troubleshooting skills, with the ability to perform root-cause analysis
Experience resolving complex production issues in distributed systems
7+ years of software engineering, cloud infrastructure, SRE, or DevOps experience
Proven ownership of production system reliability in cloud environments
Experience in healthcare or regulated environments (HIPAA, compliance frameworks) is preferred

Responsibilities

Build and operate the Oracle Health Patient Portal
Design, build, and operate highly reliable, scalable infrastructure for Commercial and Federal customers
Contribute to cloud operations by advancing automation, observability, and AI-assisted reliability practices
Deliver robust solutions that handle massive load with precision and performance
Continuously improve system reliability and operational excellence
Participate in on-call rotations
Implement preventative and automated remediation solutions
Work closely with engineers to execute technical roadmaps
Contribute to code reviews and infrastructure improvements
Take shared ownership of services and platform components with the SRE team
Develop a strong understanding of end-to-end system architecture, dependencies, and production behavior
Improve system reliability through automation, monitoring, and performance optimization
Contribute to AI-assisted approaches for operations, including enhancing observability and alerting, supporting automated incident detection and remediation, and exploring intelligent automation for infrastructure lifecycle management
Partner with development teams to enhance service architecture, scalability, and operability
Act as an escalation point for complex production issues
Perform root-cause analysis and implement long-term fixes
Apply knowledge of distributed systems to troubleshoot issues and optimize system performance
Drive continuous improvement in DevOps/SRE practices, including CI/CD, Infrastructure as Code, and automation at scale

Skills

Azure, Go, Snowflake, Java, AWS, Terraform, Prometheus, Grafana, Cloud Monitoring, Bash, Docker, Jenkins, Kubernetes, Oracle Cloud, Python

Languages

Python, Java, Go

Industry

Healthcare

Security clearance

U.S. government security clearance

Relocation
