Principal AI Site Reliability Engineer
Oracle · United States
United States · Exp: 7+ yrs · Remote
Remuneration
Not specified
Location
United States
Visa sponsorship
Not specified
U.S. citizenship is required for this position, as the successful candidate will be required to obtain (and maintain) a U.S. government security clearance after hire.
Job summary
As a Principal Site Reliability Engineer, you will play a pivotal role in building and operating the Oracle Health Patient Portal. This involves designing, building, and operating highly reliable, scalable infrastructure for Commercial and Federal customers. You will contribute to the evolution of cloud operations by advancing automation, observability, and AI-assisted reliability practices within a globally distributed team, ensuring robust solutions and continuous improvement in system reliability and operational excellence.
Qualifications
- U.S. citizenship is required
- Ability to obtain and maintain a U.S. government security clearance
- Experience building and operating high-availability, fault-tolerant systems
- Strong understanding of distributed systems, performance monitoring, and resiliency patterns
- Experience with incident response, root-cause analysis, and production troubleshooting
- Experience with one or more cloud environments (OCI, AWS, Azure)
- Advanced competency in CI/CD pipelines
- Experience with Infrastructure as Code (Terraform)
- Experience with observability tools (Prometheus, Grafana)
- Strong focus on automation-first operations
- Proficiency in data warehousing platforms (e.g., Vertica, Snowflake)
- Experience with ETL frameworks and large-scale data processing
- Understanding of columnar storage systems
- Proficiency in Python, Java, or Go
- Experience with Docker, Kubernetes, and shell scripting
- Strong troubleshooting skills with the ability to perform root-cause analysis
- Experience resolving complex production issues in distributed systems
- 7+ years of software engineering, cloud infrastructure, SRE, or DevOps experience
- Proven ownership of production system reliability in cloud environments
- Experience in healthcare or regulated environments (HIPAA, compliance frameworks) is preferred
Responsibilities
- Build and operate the Oracle Health Patient Portal
- Design, build, and operate highly reliable, scalable infrastructure for Commercial and Federal customers
- Contribute to cloud operations by advancing automation, observability, and AI-assisted reliability practices
- Deliver robust solutions that handle massive load with precision and performance
- Continuously improve system reliability and operational excellence
- Participate in on-call rotations
- Implement preventative and automated remediation solutions
- Work closely with engineers to execute technical roadmaps
- Contribute to code reviews and infrastructure improvements
- Take shared ownership of services and platform components with the SRE team
- Develop a strong understanding of end-to-end system architecture, dependencies, and production behavior
- Improve system reliability through automation, monitoring, and performance optimization
- Contribute to AI-assisted approaches for operations, including enhancing observability and alerting, supporting automated incident detection and remediation, and exploring intelligent automation for infrastructure lifecycle management
- Partner with development teams to enhance service architecture, scalability, and operability
- Act as an escalation point for complex production issues
- Perform root cause analysis and implement long-term fixes
- Apply knowledge of distributed systems to troubleshoot issues and optimize system performance
- Drive continuous improvement in DevOps/SRE practices, including CI/CD, Infrastructure as Code, and automation at scale
Skills
Azure, Go, Snowflake, Java, AWS, Terraform, Prometheus, Grafana, Cloud Monitoring, Bash, Docker, Jenkins, Kubernetes, Oracle Cloud, Python
Languages
Python, Java, Go
Industry
Healthcare
Security clearance
U.S. government security clearance
Relocation
No