
Senior Lead Site Reliability Engineer

JPMorganChase · Palo Alto, CA, United States
Palo Alto, CA, United States · Exp: 10+ yrs · 171,000-260,000 USD/yearly · Remote
Remuneration
171,000-260,000 USD/yearly
Location
Palo Alto, CA, United States
Visa sponsorship
Not specified

Job summary

As a Senior Lead Site Reliability Engineer at JPMorgan Chase, you will define non-functional requirements and availability targets for services, ensuring they are accounted for in product design and test phases. This role combines SRE excellence with AI/ML capabilities to build autonomous systems for infrastructure operations at enterprise scale.

Qualifications

  • Formal training or certification in software engineering concepts with 10+ years of applied experience in Site Reliability Engineering, DevOps, or Software Engineering.
  • Advanced knowledge of site reliability culture and principles with demonstrated ability to implement site reliability within an application or platform.
  • Advanced knowledge and experience in observability, including white and black box monitoring, service level objectives, alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, and Splunk.
  • Expert-level proficiency in Java, Go (Golang), Python, and Terraform for building enterprise-grade applications, high-performance systems, automation, and infrastructure as code.
  • Advanced knowledge of software applications and technical processes with considerable depth in multiple technical disciplines, including distributed systems, microservices architecture, and cloud-native technologies.
  • Hands-on experience building AI Agents and autonomous systems with proficiency in AI frameworks (LangChain, LangGraph, AutoGen, CrewAI) and leveraging AI development tools (GitHub Copilot, Claude).
  • Expertise in designing and implementing logging pipelines (Fluentd, Logstash, Vector) and systems for metrics collection, analysis, and distributed tracing.
  • Strong experience building production-grade RESTful APIs and designing message queue architectures (Kafka, RabbitMQ, SQS) for event-driven systems.
  • Experience with graph databases (Neo4j, TigerGraph), vector databases (Pinecone, Weaviate, Chroma), and integrating multiple data stores for AI-powered systems.
  • Proficiency with containerization (Docker, Kubernetes), CI/CD pipelines, and GitOps workflows.
  • Ability to communicate data-based solutions with complex reporting and visualization methods.
  • Recognized as an active contributor to the engineering community.
  • Ability to expand network and lead evaluation sessions with vendors to align offerings with firm strategy.
  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
  • Experience with MCP (Model Context Protocol) Servers or similar agent frameworks for building autonomous systems.
  • Understanding of LLM integration, prompt engineering, and RAG (Retrieval-Augmented Generation).
  • Familiarity with AI/ML model building, deployment, and lifecycle management using frameworks like TensorFlow, PyTorch, or scikit-learn.
  • Experience with big data technologies (Hadoop, Spark, Flink), analytical databases, NoSQL databases (MongoDB, Cassandra, DynamoDB), and time-series databases (InfluxDB, TimescaleDB).
  • Knowledge of security best practices and compliance requirements in highly regulated industries.
  • Experience with chaos engineering tools (Chaos Monkey, Gremlin, LitmusChaos) and GameDay exercises.

Responsibilities

  • Create high-quality designs, roadmaps, and program charters for AI-powered automation systems, intelligent monitoring solutions, and next-generation reliability platforms.
  • Provide advice and mentoring to other engineers, acting as a key resource for technical and business-related issues, especially in SRE and AI/ML technologies.
  • Demonstrate and champion site reliability principles and practices within the team.
  • Collaborate to create and implement robust, stable observability and reliability designs for complex systems, including logging pipelines and systems for metrics and traces across distributed systems.
  • Design and build AI Agents and MCP Servers for autonomous operations, including incident detection, root cause analysis, and auto-remediation.
  • Architect solutions that integrate multiple data stores, including graph, vector, transactional, analytical, and big data platforms.
  • Develop automation scripts and infrastructure-as-code using Java, Go, Python, and Terraform to improve operational efficiency.
  • Build and maintain RESTful services, APIs, and message queue architectures for event-driven systems and platform automation.
  • Contribute significantly to JPMorgan Chase's site reliability community through internal forums, communities of practice, guilds, and conferences.

Skills

AWS, Azure, Cassandra, Datadog, Docker, DynamoDB, Dynatrace, Fluentd, GCP, GitHub, Go, Grafana, Hadoop, InfluxDB, Java, Kafka, Kubernetes, Logstash, MongoDB, Prometheus, Python, RabbitMQ, REST, Spark, Splunk, SQS, Terraform, TimescaleDB

Relocation

No