Senior System Software Engineer - DevOps and Infrastructure Automation

NVIDIA · Seattle, WA, United States
Seattle, WA, United States · Exp: 7+ yrs · 184,000-287,500 USD/year · Remote
Remuneration
184,000-287,500 USD/year
Location
Seattle, WA, United States
Visa sponsorship
Not specified

Job summary

NVIDIA is seeking a Senior System Software Engineer for its AI Inference Operations Team, focusing on DevOps and Infrastructure Automation. This role involves designing, building, and operating the infrastructure backbone for AI inference products, with a strong emphasis on Kubernetes deployments and CI/CD pipelines. The ideal candidate will thrive at the intersection of systems programming, cloud-native infrastructure, and developer productivity.

Benefits

Equity, Benefits

Qualifications

  • BS/MS in Computer Science/Computer Engineering or equivalent experience.
  • 7+ years operating production distributed systems in SRE, DevOps, or Platform Operations roles.
  • Deep Kubernetes expertise, including components, subsystems, on-prem setup, and hands-on debugging of telemetry-heavy microservices.
  • Strong CI/CD experience with GitLab CI and GitHub Actions.
  • Proficiency in Git-based workflows, Linux systems programming, and scripting in Python and Bash.
  • Fluency in Infrastructure-as-Code (IaC) tools such as Terraform, Ansible, Helm, and Crossplane.
  • In-depth knowledge of containerization technologies like Docker, containerd, and OCI.
  • Proven reliability ownership, including experience with SLOs/SLIs, on-call duties, incident response, and post-incident reviews.
  • Hands-on experience with observability stacks like Prometheus, Grafana, and Loki.
  • Clear communication skills, particularly in writing effective runbooks.
  • Experience with MLOps, including crafting, deploying, and operating machine learning pipelines (preferred).
  • Experience in open-source development workflows and community engagement (preferred).
  • Familiarity with GPU software stacks, including CUDA, cuDNN, TensorRT, and inference serving frameworks (preferred).
  • Experience building custom test automation frameworks and using data-driven metrics (preferred).
  • Demonstrated ability to debug complex issues spanning kernel modules, container runtimes, and distributed networking (preferred).

Responsibilities

  • Design, build, and operate the infrastructure backbone powering AI inference products, ensuring reliability, performance, and scalability.
  • Own end-to-end Kubernetes deployments across cloud and on-prem environments, including runbooks, canary checks, post-deploy validation, and rollbacks.
  • Architect CI/CD pipelines for automated build, test, packaging, and release of inference libraries and container-based software stacks.
  • Build observability solutions, including dashboards, logs, metrics, and automated checks, to monitor platform health, and lead first-level incident triage.
  • Manage cloud and on-prem environments using infrastructure-as-code tools like Terraform, Ansible, Helm, and Crossplane, and automate tasks with GitHub Actions, GitLab CI, and custom tooling.
  • Own the security posture for infrastructure components, including vulnerability scans, CVE remediation, and compliance with internal policies.
  • Collaborate with deep learning framework engineers, compiler teams, and platform architects to streamline end-to-end deployment.

Skills

Ansible, AWS, Azure, Bash, containerd, Docker, GCP, Git, GitHub, GitHub Actions, GitLab, GitLab CI, Grafana, Helm, Kubernetes, Linux, Loki, Oracle Cloud, Prometheus, Python, Terraform

Degrees

BS in CS/CE, MS in CS/CE

Languages

Python, Bash

Relocation

No