Senior Site Reliability Engineer

NVIDIA · Santa Clara, CA, United States
Santa Clara, CA, United States · Exp: 5+ yrs · 148,000-276,000 USD/yearly · Remote
Remuneration
148,000-276,000 USD/yearly
Location
Santa Clara, CA, United States
Visa sponsorship
Not specified

Job summary

NVIDIA is seeking a Senior Site Reliability Engineer to join its Infrastructure, Planning and Processes organization. The role involves developing and maintaining NVIDIA's internal Jenkins-based CI/CD product for GPU and Tegra systems and working with various business units. The SRE will manage on-prem infrastructure, uphold service level agreements (SLAs), and deploy applications on Kubernetes clusters.

Qualifications

  • Experience maintaining cloud infrastructure and highly-available production environments
  • Experience handling and maintaining systems installed in on-premises data centers
  • Proficiency using BMC interfaces (Redfish), KVM, and IPMI tools for hardware provisioning, remote access, and troubleshooting
  • Knowledge and understanding of OpenStack architecture and services
  • Proven background working with SQL-based relational databases such as MySQL
  • Experience with time-series databases like Prometheus, including data querying and performance tuning
  • Solid understanding of networking principles and protocols, including TCP/IP, DNS, DHCP, and VLANs
  • Ability to diagnose connectivity issues and support complex, distributed systems
  • Practical experience with data analytics and visualization tools such as Kibana, Grafana, and Splunk for monitoring and troubleshooting
  • Strong experience with automation tools like Jenkins and/or Temporal
  • Experience with configuration management tools like Ansible
  • Proficiency with Kubernetes, Docker, and virtualization technologies
  • Experience deploying, managing, and operating containerized workloads and virtualized infrastructure in production environments
  • Advanced knowledge of standard security methodologies and protocols, including system hardening, access control, vulnerability management, and secure operations
  • 5+ years of demonstrable experience
  • Bachelor's degree in Computer Science, Information Technology, or related field, or equivalent experience
  • Previous experience with SRE teams managing on-prem infrastructure
  • Experience managing NVIDIA hardware such as GPUs and Tegra systems
  • Ability to thrive in a multi-tasking environment with constantly evolving priorities
  • Outstanding interpersonal skills and communication with all levels of management

Responsibilities

  • Manage NVIDIA's on-prem infrastructure
  • Maintain uptime, reliability, and readiness of on-prem engineering cloud across multiple data centers
  • Uphold service level agreements (SLAs) for critical engineering services
  • Implement monitoring, alerting, and incident response procedures to ensure alignment with defined performance targets
  • Perform root cause analysis and post-mortems for incidents involving threshold breaches
  • Deploy, configure, and manage applications and services on Kubernetes clusters
  • Implement logging, monitoring, and alerting solutions (e.g., Prometheus, Grafana, ELK/EFK)
  • Ensure high availability, fault tolerance, and disaster recovery for Kubernetes workloads
  • Assist in capacity planning, optimization, and utilization efforts
  • Support user-reported issues
  • Monitor alerts and take necessary action
  • Actively participate in war rooms for critical issues
  • Drive automation of monitoring to gain insight into applications and system health
  • Apply AI techniques to extract useful signals about machines and jobs from generated data

Skills

  • Ansible
  • Docker
  • Grafana
  • Jenkins
  • Kibana
  • Kubernetes
  • MySQL
  • OpenStack
  • Prometheus
  • Splunk
  • Elasticsearch

Degrees

  • Bachelor's degree in Computer Science
  • Bachelor's degree in Information Technology

Relocation

No