Senior AI Site Reliability Engineer
Oracle · United States
United States · Exp: 4+ yrs · Remote
Remuneration
Not specified
Location
United States
Visa sponsorship
No visa sponsorship
U.S. citizenship is required for this position, as the successful candidate will be required to obtain (and maintain) a U.S. government security clearance after hire.
Job summary
As a Senior Site Reliability Engineer, you will play a pivotal role in building and operating the Oracle Health Patient Portal. You will design, build, and operate highly reliable, scalable infrastructure for Commercial and Federal customers. This role involves advancing automation, observability, and AI-assisted reliability practices within a globally distributed team, ensuring continuous improvement in system reliability and operational excellence.
Qualifications
- U.S. citizenship is required.
- Ability to obtain and maintain a U.S. government security clearance after hire.
- Experience building and operating high-availability, fault-tolerant systems.
- Strong understanding of distributed systems, performance monitoring, and resiliency patterns.
- Experience with incident response, root-cause analysis, and production troubleshooting.
- Experience with one or more cloud environments: OCI, AWS, or Azure.
- Advanced competency in CI/CD pipelines (Jenkins, Kubernetes).
- Experience with Infrastructure as Code (Terraform).
- Experience with Observability tools (Prometheus, Grafana).
- Strong focus on automation-first operations.
- Proficiency in Data Warehousing platforms (e.g., Vertica, Snowflake).
- Experience with ETL frameworks and large-scale data processing.
- Understanding of columnar storage systems.
- Proficiency in Python, Java, or Go.
- Experience with Docker, Kubernetes, and shell scripting.
- Strong troubleshooting skills with ability to perform root-cause analysis.
- Experience resolving complex production issues in distributed systems.
- Ability to apply DevOps/SRE practices to automate deployments and operations.
- Ability to enhance observability using Prometheus/Grafana and AI-driven insights.
- 4+ years of software engineering, cloud infrastructure, SRE, or DevOps experience.
Responsibilities
- Play a pivotal role in building and operating the Oracle Health Patient Portal.
- Design, build, and operate highly reliable, scalable infrastructure that supports Commercial and Federal customers.
- Contribute to the next evolution of cloud operations by advancing automation, observability, and AI-assisted reliability practices.
- Work within a globally distributed team to deliver robust solutions that handle massive end-user load with precision and performance.
- Continuously improve system reliability and operational excellence.
- Participate in on-call rotations.
- Implement preventative and automated remediation solutions.
- Work closely with engineers to execute technical roadmaps.
- Contribute to code reviews and infrastructure improvements.
- Take shared ownership of services and platform components with the Site Reliability Engineering (SRE) team.
- Develop a strong understanding of end-to-end system architecture, dependencies, and production behavior.
- Design, build, and operate reliable, scalable, and secure infrastructure supporting large-scale distributed systems.
- Improve system reliability through automation, monitoring, and performance optimization.
- Contribute to the adoption of AI-assisted approaches for operations.
- Enhance observability and alerting.
- Support automated incident detection and remediation.
- Explore intelligent automation for infrastructure lifecycle management.
- Partner with development teams to enhance service architecture, scalability, and operability.
- Act as an escalation point for complex production issues.
- Perform root cause analysis and implement long-term fixes to prevent recurrence.
Skills
AWS, Azure, Bash, Docker, Go, Grafana, Java, Jenkins, Kubernetes, Oracle Cloud, Prometheus, Python, Snowflake, Terraform
Languages
Python, Java, Go, Shell scripting
Work schedule
On-call rotations
Industry
Healthcare
Security clearance
U.S. government security clearance
Relocation
No