
Senior AI Infrastructure Engineer - DGX Cloud

NVIDIA · Santa Clara, CA, United States
Santa Clara, CA, United States · Exp: 5+ yrs · 152,000-287,500 USD/yearly · Remote
Remuneration
152,000-287,500 USD/yearly
Location
Santa Clara, CA, United States
Visa sponsorship
Not specified

Job summary

NVIDIA is seeking a Senior AI Infrastructure Engineer to design, build, and maintain large-scale production systems with high efficiency and availability. This role requires expertise in software and systems engineering, networking, coding, databases, capacity management, and open-source cloud technologies.

Benefits

Equity, Benefits

Qualifications

  • Bachelor's degree in Computer Science or a related technical field involving coding, or equivalent experience.
  • Five or more years of experience.
  • Demonstrated ability to initiate projects, collaborate with others, and contribute to team projects.
  • Background in infrastructure automation and distributed systems architecture for managing large-scale private or public cloud platforms.
  • Experience with Python, Go, C/C++, or Java.
  • Comprehensive understanding of Linux, networking, storage, or container technologies.
  • Experience with public cloud, Infrastructure as Code (IaC), and Terraform.
  • Experience with distributed systems.

Responsibilities

  • Design, build, deploy, and run internal tooling for large-scale AI training and inferencing platforms on cloud infrastructure.
  • Conduct in-depth performance characterization and analysis on large multi-GPU and multi-node clusters.
  • Improve the lifecycle of services from inception to refinement.
  • Support services through system design consulting, software tool development, capacity management, and launch reviews.
  • Maintain live services by measuring and monitoring availability, latency, and system health.
  • Scale systems sustainably through automation and evolve systems for improved reliability and velocity.
  • Practice sustainable incident response and blameless postmortems.
  • Participate in an on-call rotation to support production systems.

Skills

Cortex, C++, Go, Java, Kubernetes, Linux, OpenStack, Python, Terraform

Degrees

BS degree in Computer Science, BS degree in a related technical field involving coding

Languages

Python, Go, C/C++, Java

Work schedule

On-call rotation

Relocation

No