Senior Site Reliability Engineer

NVIDIA · Santa Clara, CA, United States

Santa Clara, CA, United States · Exp: 5+ yrs · 148,000-276,000 USD/yearly · Remote

Remuneration

148,000-276,000 USD/yearly

Location

Santa Clara, CA, United States

Visa sponsorship

Not specified

Job summary

NVIDIA is seeking a seasoned Senior Site Reliability Engineer to join its Infrastructure, Planning and Processes organization. This role involves developing and maintaining NVIDIA's internal Jenkins-based CI/CD product for GPUs and Tegra systems, working with various business units. The SRE will manage on-prem infrastructure, ensure service level agreements, and deploy applications on Kubernetes clusters.

Qualifications

Experience maintaining cloud infrastructure and highly-available production environments
Experience handling and maintaining systems installed in on-premises data centers
Proficiency using BMC interfaces (Redfish), KVM, and IPMI tools for hardware provisioning, remote access, and troubleshooting
Knowledge and understanding of OpenStack architecture and services
Proven background working with SQL-based relational databases such as MySQL
Experience with time-series databases like Prometheus, including data querying and performance tuning
Solid understanding of networking principles and protocols, including TCP/IP, DNS, DHCP, and VLANs
Ability to diagnose connectivity issues and support complex, distributed systems
Practical experience with data analytics and visualization tools such as Kibana, Grafana, and Splunk for monitoring and troubleshooting
Strong experience with automation tools like Jenkins and/or Temporal
Experience with configuration tools like Ansible
Proficiency with Kubernetes, Docker, and virtualization technologies
Experience deploying, managing, and operating containerized workloads and virtualized infrastructure in production environments
Advanced knowledge of standard security methodologies and protocols, including system hardening, access control, vulnerability management, and secure operations
5+ years of demonstrable experience
Bachelor's degree in Computer Science, Information Technology, or related field, or equivalent experience
Previous experience with SRE teams managing on-prem infrastructure
Experience managing NVIDIA hardware like GPUs and Tegras
Ability to thrive in a multi-tasking environment with constantly evolving priorities
Outstanding interpersonal skills and communication with all levels of management

Responsibilities

Manage NVIDIA's on-prem infrastructure
Maintain uptime, reliability, and readiness of the on-prem engineering cloud across multiple data centers
Guard service level agreements (SLAs) for critical engineering services
Implement monitoring, alerting, and incident response procedures to ensure alignment with defined performance targets
Perform root cause analysis and post-mortems for incidents involving threshold breaches
Deploy, configure, and manage applications and services on Kubernetes clusters
Implement logging, monitoring, and alerting solutions (e.g., Prometheus, Grafana, ELK/EFK)
Ensure high availability, fault tolerance, and disaster recovery for Kubernetes workloads
Assist in capacity planning, optimization, and utilization efforts
Support user-reported issues
Monitor alerts and take necessary action
Actively participate in war rooms for critical issues
Drive automation of monitoring to gain insight into applications and system health
Apply AI techniques to extract useful signals about machines and jobs from generated data

Skills

Ansible, Docker, Grafana, Jenkins, Kibana, Kubernetes, MySQL, OpenStack, Prometheus, Splunk, Elasticsearch

Degrees

Bachelor's degree in Computer Science
Bachelor's degree in Information Technology
