Principal Site Reliability Engineer
Palo Alto Networks · Santa Clara, CA, United States
Santa Clara, CA, United States · Exp: 10+ yrs · Remote
Remuneration
Not specified
Location
Santa Clara, CA, United States
Visa sponsorship
No visa sponsorship
This role is not eligible for immigration sponsorship. We will not sponsor applicants for work visas for this position.
Job summary
As a Principal Site Reliability Engineer within the Cortex DevOps team, you will be a technical leader driving reliability, scalability, observability, and operational excellence across the Cortex platform. This role involves partnering with engineering, product, and infrastructure teams to influence architecture, establish reliability standards, and build solutions for service availability and performance. You will also help shape the future direction of observability and reliability platforms while mentoring engineers.
Benefits
- Restricted stock units
- Bonus
Qualifications
- 10+ years of experience in Site Reliability Engineering, DevOps, Cloud Engineering, or related disciplines.
- Deep expertise with Prometheus, Thanos, Grafana, OpenTelemetry, and modern observability platforms.
- Strong understanding of SRE principles including SLIs, SLOs, error budgets, incident management, and operational excellence.
- Expert knowledge of Google Cloud Platform (GCP), Amazon Web Services (AWS), or similar cloud platforms.
- Expert-level experience with Kubernetes, Docker, and cloud-native architectures.
- Strong software engineering and automation skills using Python, Linux, Terraform, Ansible, and GitOps practices.
- Proven ability to influence technical direction and drive cross-functional initiatives across multiple engineering teams.
Responsibilities
- Define and drive reliability, observability, and operational excellence standards across Cortex services and infrastructure.
- Design and evolve large-scale observability platforms using technologies such as Prometheus, Thanos, Grafana, OpenTelemetry, and cloud-native monitoring solutions.
- Partner with engineering teams to ensure services are designed, instrumented, and operated for reliability and scalability.
- Drive improvements in monitoring, alerting, incident management, and service health to proactively identify and prevent customer-impacting issues.
- Lead initiatives focused on automation, self-healing systems, operational efficiency, and reduction of operational toil.
- Influence architectural decisions and technology adoption to improve platform reliability, performance, and cost efficiency.
- Mentor engineers and provide technical leadership across multiple teams and organizations.
- Stay current with emerging technologies and industry trends, evaluating and implementing solutions that advance Cortex's operational capabilities.
- Provide leadership during major incidents and drive post-incident reviews focused on systemic improvements.
Skills
Ansible, AWS, Cortex, Docker, GCP, Grafana, Kubernetes, Linux, OpenTelemetry, Prometheus, Python, Terraform, Thanos
Relocation
No