Lead Site Reliability Engineering - Network

JPMorganChase · Palo Alto, CA, United States
Exp: 5-10 yrs · 152,000-215,000 USD per year · Onsite
Remuneration
152,000-215,000 USD per year
Location
Palo Alto, CA, United States
Visa sponsorship
Not specified

Job summary

As a Lead Site Reliability Engineer at JPMorganChase within the Network Product, you will hold a leadership role, demonstrating strong knowledge across multiple technical domains and advising others on technical and business issues. You will lead resiliency design reviews, break down complex problems, act as technical lead for medium-to-large products, and advise and mentor other engineers.

Qualifications

  • Formal training or certification in network engineering concepts and 5+ years of applied experience.
  • 10+ years of experience leading technologists to manage and solve complex technical problems within your domain of expertise.
  • Advanced proficiency in network reliability engineering, including Permit to Operate, FMEA, and operational readiness processes.
  • Experience leading technologists to manage and solve complex network issues at a firmwide level.
  • Ability to influence team culture by championing innovation and change.
  • Proficiency in SD-WAN, cloud platforms (AWS, Azure), and major network technologies (Palo Alto, Juniper, F5, Broadcom, Arista, Cisco).
  • Proficiency in observability and monitoring tools such as Grafana, SevOne, Prometheus, Kibana, ThousandEyes, and Splunk.
  • Demonstrated proficiency in troubleshooting and supporting complex networking environments, including Tier-3 operational support for major incidents.
  • Experience with continuous integration and delivery tools (e.g., Jenkins, GitLab, Terraform).
  • Experience in scalable networking design, including high availability, redundancy, failover, and load balancing.
  • Experience troubleshooting networking protocols such as TCP/IP, HTTPS, and BGP.
  • Experience in customer-facing migration, including service discovery, assessment, planning, execution, and operations.

Responsibilities

  • Apply network reliability principles (Permit to Operate, FMEA, operational readiness), balancing feature delivery, efficiency, and stability.
  • Partner with network engineering domains (Datacenter, Firewall, Proxies, DMZ, Load Balancing) and Lines of Business to align goals and outcomes.
  • Drive adoption of reliability best practices and observability, demonstrating impact through stability/reliability metrics.
  • Bridge Engineering, Operations, DevOps, and customers to build resilient, scalable, and secure network services.
  • Provide Tier-3 network support, leading major incident response, rapid restoration, RCA, and follow-through on corrective actions.
  • Lead reliability and stability initiatives using data-driven analysis to improve service levels and reduce recurring failure modes.
  • Define SLIs, SLOs, and error budgets with stakeholders and customers, setting measurable performance targets and making trade-offs explicit.
  • Identify and remove technical bottlenecks within core domains of expertise, proactively preventing reliability and capacity risks.
  • Run blameless, data-driven post-mortems and debriefs, converting learnings into actionable improvements.
  • Foster continuous improvement and strong knowledge sharing, soliciting real-time feedback, avoiding duplicated work, and promoting innovation via internal communities.
  • Produce and package thought leadership with specialists/product/engineering teams, documenting best practices and lessons learned for internal assets and industry forums/conferences.

Skills

AWS, Azure, GitLab, Grafana, Jenkins, Kibana, Prometheus, Splunk, Terraform

Certifications

CCIE

Relocation

No