Principal Site Reliability Engineer

Palo Alto Networks · Santa Clara, CA, United States

Santa Clara, CA, United States · Exp: 10+ yrs · Remote

Remuneration

Not specified

Location

Santa Clara, CA, United States

Visa sponsorship

No visa sponsorship

Is role eligible for Immigration Sponsorship? No. Please note that we will not sponsor applicants for work visas for this position.

Job summary

As a Principal Site Reliability Engineer within the Cortex DevOps team, you will be a technical leader driving reliability, scalability, observability, and operational excellence across the Cortex platform. This role involves partnering with engineering, product, and infrastructure teams to influence architecture, establish reliability standards, and build solutions for service availability and performance. You will also help shape the future direction of observability and reliability platforms while mentoring engineers.

Benefits

Restricted stock units
Bonus

Qualifications

10+ years of experience in Site Reliability Engineering, DevOps, Cloud Engineering, or related disciplines.
Deep expertise with Prometheus, Thanos, Grafana, OpenTelemetry, and modern observability platforms.
Strong understanding of SRE principles including SLIs, SLOs, error budgets, incident management, and operational excellence.
Expert knowledge of Google Cloud Platform (GCP), Amazon Web Services (AWS), or similar cloud platforms.
Expert-level experience with Kubernetes, Docker, and cloud-native architectures.
Strong software engineering and automation skills using Python, Linux, Terraform, Ansible, and GitOps practices.
Proven ability to influence technical direction and drive cross-functional initiatives across multiple engineering teams.

Responsibilities

Define and drive reliability, observability, and operational excellence standards across Cortex services and infrastructure.
Design and evolve large-scale observability platforms using technologies such as Prometheus, Thanos, Grafana, OpenTelemetry, and cloud-native monitoring solutions.
Partner with engineering teams to ensure services are designed, instrumented, and operated for reliability and scalability.
Drive improvements in monitoring, alerting, incident management, and service health to proactively identify and prevent customer-impacting issues.
Lead initiatives focused on automation, self-healing systems, operational efficiency, and reduction of operational toil.
Influence architectural decisions and technology adoption to improve platform reliability, performance, and cost efficiency.
Mentor engineers and provide technical leadership across multiple teams and organizations.
Stay current with emerging technologies and industry trends, evaluating and implementing solutions that advance Cortex's operational capabilities.
Provide leadership during major incidents and drive post-incident reviews focused on systemic improvements.

Skills

Ansible
AWS
Cortex
Docker
GCP
Grafana
Kubernetes
Linux
OpenTelemetry
Prometheus
Python
Terraform
Thanos
