Lead Site Reliability Engineering - Network

JPMorganChase · Palo Alto, CA, United States

Palo Alto, CA, United States · Exp: 5-10 yrs · 152,000-215,000 USD/yearly · Onsite

Remuneration

152,000-215,000 USD/yearly

Location

Palo Alto, CA, United States

Visa sponsorship

Not specified

Job summary

As a Lead Site Reliability Engineer at JPMorgan Chase within the Network Product, you will hold a leadership role, demonstrating strong knowledge across multiple technical domains and advising others on technical and business issues. You will lead resiliency design reviews, break down complex problems, act as a technical lead for medium to large-sized products, and provide advice and mentoring to other engineers.

Qualifications

Formal training or certification in network engineering concepts and 5+ years of applied experience.
10+ years of experience leading technologists to manage and solve complex technical items within your domain of expertise.
Advanced proficiency in network reliability engineering, including Permit to Operate, FMEA, and operational readiness processes.
Experience leading technologists to manage and solve complex network issues at a firmwide level.
Ability to influence team culture by championing innovation and change for success.
Proficiency in SD-WAN, cloud platforms (AWS, Azure), and major network technologies (Palo Alto, Juniper, F5, Broadcom, Arista, Cisco).
Proficiency in observability and monitoring tools such as Grafana, SevOne, Prometheus, Kibana, ThousandEyes, and Splunk.
Demonstrated proficiency in troubleshooting and supporting complex networking environments, including Tier-3 operational support for major incidents.
Experience with continuous integration and delivery tools (e.g., Jenkins, GitLab, Terraform).
Experience in scalable networking design, including high availability, redundancy, failover, and load balancing.
Experience troubleshooting networking protocols such as TCP/IP, HTTPS, and BGP.
Experience in customer-facing migration, including service discovery, assessment, planning, execution, and operations.

Responsibilities

Apply network reliability principles (Permit to Operate, FMEA, operational readiness), balancing feature delivery, efficiency, and stability.
Partner with network engineering domains (Datacenter, Firewall, Proxies, DMZ, Load Balancing) and Lines of Business to align goals and outcomes.
Drive adoption of reliability best practices and observability, demonstrating impact through stability/reliability metrics.
Bridge Engineering, Operations, DevOps, and customers to build resilient, scalable, and secure network services.
Provide Tier-3 network support, leading major incident response, rapid restoration, RCA, and follow-through on corrective actions.
Lead reliability and stability initiatives using data-driven analysis to improve service levels and reduce recurring failure modes.
Define SLIs, SLOs, and error budgets with stakeholders and customers, ensuring measurable performance targets and clear trade-offs.
Identify and remove technical bottlenecks within core domains of expertise, proactively preventing reliability and capacity risks.
Run blameless, data-driven post-mortems and debriefs, converting learnings into actionable improvements.
Foster continuous improvement and strong knowledge sharing, soliciting real-time feedback, avoiding duplicated work, and promoting innovation via internal communities.
Produce and package thought leadership with specialist, product, and engineering teams, documenting best practices and lessons learned for internal assets and for industry forums and conferences.

Skills

AWS, Azure, GitLab, Grafana, Jenkins, Kibana, Prometheus, Splunk, Terraform

Certifications

CCIE

Relocation
