Lead Site Reliability Engineering - Network
JPMorganChase · Palo Alto, CA, United States
Palo Alto, CA, United States · Exp: 5-10 yrs · 152,000-215,000 USD/yearly · Onsite
Remuneration
152,000-215,000 USD/yearly
Location
Palo Alto, CA, United States
Visa sponsorship
Not specified
Job summary
As a Lead Site Reliability Engineer at JPMorganChase within the Network Product, you will hold a leadership role, demonstrating strong knowledge across multiple technical domains and advising others on technical and business issues. You will lead resiliency design reviews, break down complex problems, act as a technical lead for medium- to large-sized products, and provide advice and mentoring to other engineers.
Qualifications
- Formal training or certification in network engineering concepts and 5+ years of applied experience.
- 10+ years of experience leading technologists to manage and solve complex technical items within your domain of expertise.
- Advanced proficiency in network reliability engineering, including Permit to Operate, FMEA, and operational readiness processes.
- Experience leading technologists to manage and solve complex network issues at a firmwide level.
- Ability to influence team culture by championing innovation and change for success.
- Proficiency in SD-WAN, cloud platforms (AWS, Azure), and major network technologies (Palo Alto, Juniper, F5, Broadcom, Arista, Cisco).
- Proficiency in observability and monitoring tools such as Grafana, SevOne, Prometheus, Kibana, ThousandEyes, and Splunk.
- Demonstrated proficiency in troubleshooting and supporting complex networking environments, including Tier-3 operational support for major incidents.
- Experience with continuous integration and delivery tools (e.g., Jenkins, GitLab, Terraform).
- Experience in scalable networking design, including high availability, redundancy, failover, and load balancing.
- Experience troubleshooting networking protocols such as TCP/IP, HTTPS, and BGP.
- Experience in customer-facing migration, including service discovery, assessment, planning, execution, and operations.
Responsibilities
- Apply network reliability principles (Permit to Operate, FMEA, operational readiness), balancing feature delivery, efficiency, and stability.
- Partner with network engineering domains (Datacenter, Firewall, Proxies, DMZ, Load Balancing) and Lines of Business to align goals and outcomes.
- Drive adoption of reliability best practices and observability, demonstrating impact through stability/reliability metrics.
- Bridge Engineering, Operations, DevOps, and customers to build resilient, scalable, and secure network services.
- Provide Tier-3 network support, leading major incident response, rapid restoration, RCA, and follow-through on corrective actions.
- Lead reliability and stability initiatives using data-driven analysis to improve service levels and reduce recurring failure modes.
- Define SLIs, SLOs, and error budgets with stakeholders and customers, ensuring measurable performance targets and clear trade-offs.
- Identify and remove technical bottlenecks within core domains of expertise, proactively preventing reliability and capacity risks.
- Run blameless, data-driven post-mortems and debriefs, converting learnings into actionable improvements.
- Foster continuous improvement and strong knowledge sharing, soliciting real-time feedback, avoiding duplicated work, and promoting innovation via internal communities.
- Produce and package thought leadership with specialists/product/engineering teams, documenting best practices and lessons learned for internal assets and industry forums/conferences.
Skills
AWS, Azure, GitLab, Grafana, Jenkins, Kibana, Prometheus, Splunk, Terraform
Certifications
CCIE
Relocation
No