Senior Site Reliability Engineer
NVIDIA · Santa Clara, CA, United States
Santa Clara, CA, United States · Exp: 5+ yrs · 148,000-276,000 USD/yearly · Remote
Remuneration
148,000-276,000 USD/yearly
Location
Santa Clara, CA, United States
Visa sponsorship
Not specified
Job summary
NVIDIA is seeking a seasoned Senior Site Reliability Engineer to join its Infrastructure, Planning and Processes organization. This role involves developing and maintaining NVIDIA's internal Jenkins-based CI/CD product for GPUs and Tegra systems, working with various business units. The SRE will manage on-prem infrastructure, ensure service level agreements, and deploy applications on Kubernetes clusters.
Qualifications
- Experience maintaining cloud infrastructure and highly-available production environments
- Experience handling and maintaining systems installed in on-premises data centers
- Proficiency using BMC interfaces (Redfish), KVM, and IPMI tools for hardware provisioning, remote access, and troubleshooting
- Knowledge and understanding of OpenStack architecture and services
- Proven background working with relational databases such as SQL/MySQL
- Experience with time-series databases like Prometheus, including data querying and performance tuning
- Solid understanding of networking principles and protocols, including TCP/IP, DNS, DHCP, and VLANs
- Ability to diagnose connectivity issues and support complex, distributed systems
- Practical experience with data analytics and visualization tools such as Kibana, Grafana, and Splunk for monitoring and troubleshooting
- Strong experience with automation tools like Jenkins and/or Temporal
- Experience with configuration tools like Ansible
- Proficiency with Kubernetes, Docker, and virtualization technologies
- Experience deploying, managing, and operating containerized workloads and virtualized infrastructure in production environments
- Advanced knowledge of standard security methodologies and protocols, including system hardening, access control, vulnerability management, and secure operations
- 5+ years of demonstrable experience
- Bachelor's degree in Computer Science, Information Technology, or related field, or equivalent experience
- Previous experience with SRE teams managing on-prem infrastructure
- Experience managing NVIDIA hardware like GPUs and Tegras
- Ability to thrive in a multi-tasking environment with constantly evolving priorities
- Outstanding interpersonal skills and communication with all levels of management
Responsibilities
- Manage NVIDIA's on-prem infrastructure
- Maintain uptime, reliability, and readiness of on-prem engineering cloud across multiple data centers
- Guard service level agreements (SLAs) for critical engineering services
- Implement monitoring, alerting, and incident response procedures to ensure alignment with defined performance targets
- Perform root cause analysis and post-mortems for incidents that breach thresholds
- Deploy, configure, and manage applications and services on Kubernetes clusters
- Implement logging, monitoring, and alerting solutions (e.g., Prometheus, Grafana, ELK/EFK)
- Ensure high availability, fault tolerance, and disaster recovery for Kubernetes workloads
- Assist in capacity planning, optimization, and utilization efforts
- Support user-reported issues
- Monitor alerts and take necessary action
- Actively participate in war rooms for critical issues
- Drive automation of monitoring to gain insight into applications and system health
- Use AI techniques to extract useful signals about machines and jobs from generated data
Skills
Ansible, Docker, Grafana, Jenkins, Kibana, Kubernetes, MySQL, OpenStack, Prometheus, Splunk, Elasticsearch
Degrees
Bachelor's degree in Computer Science, Bachelor's degree in Information Technology
Relocation
No