Senior AI Infrastructure Engineer - DGX Cloud
NVIDIA · Santa Clara, CA, United States
Santa Clara, CA, United States · Exp: 5+ yrs · 152,000-287,500 USD/yearly · Remote
Remuneration
152,000-287,500 USD/yearly
Location
Santa Clara, CA, United States
Visa sponsorship
Not specified
Job summary
NVIDIA is seeking a Senior AI Infrastructure Engineer to design, build, and maintain large-scale production systems with high efficiency and availability. This role requires expertise in software and systems engineering, networking, coding, databases, capacity management, and open-source cloud technologies.
Benefits
Equity, Benefits
Qualifications
- Bachelor's degree in Computer Science or a related technical field involving coding, or equivalent experience.
- Five or more years of experience.
- Demonstrated ability to initiate projects, collaborate with others, and contribute to team projects.
- Background in infrastructure automation and distributed systems architecture for managing large-scale private or public cloud platforms.
- Experience with Python, Go, C/C++, or Java.
- Comprehensive understanding of Linux, Networking, Storage, or Container Technologies.
- Experience with Public Cloud, Infrastructure as Code (IaC), and Terraform.
- Experience with distributed systems.
Responsibilities
- Design, build, deploy, and run internal tooling for large-scale AI training and inferencing platforms on cloud infrastructure.
- Conduct in-depth performance characterization and analysis on large multi-GPU and multi-node clusters.
- Improve the lifecycle of services from inception to refinement.
- Support services through system design consulting, software tool development, capacity management, and launch reviews.
- Maintain live services by measuring and monitoring availability, latency, and system health.
- Scale systems sustainably through automation and evolve systems for improved reliability and velocity.
- Practice sustainable incident response and blameless postmortems.
- Participate in an on-call rotation to support production systems.
Skills
Cortex, C++, Go, Java, Kubernetes, Linux, OpenStack, Python, Terraform
Degrees
BS degree in Computer Science, BS degree in a related technical field involving coding
Languages
Python, Go, C/C++, Java
Work schedule
On-call rotation
Relocation
No